
\begin{proof}

    Convergence in probability of $\alpha_k := \alpha(\phi_k)$ to $\alpha^*$ has been shown in Proposition 3.4; this can be upgraded to almost-sure convergence using standard stochastic approximation theory. The intuition for the remainder of the proof is that when $\alpha_k$ is close to $\alpha^*$, the C-trace updates are close to those of standard Retrace targeting the policy $\alpha^* \pi + (1-\alpha^*)\mu$, which are known to converge under the conditions of the theorem. This is made rigorous by decomposing the update to the Q-function from the $(k+1)$\textsuperscript{th} trajectory as

    \begin{align*}
        Q_{k+1}  =  \overbrace{(\mathbf{1} - \widetilde{\varepsilon}_k) \odot Q_k  +  \widetilde{\varepsilon}_k \odot \mathcal{R}^{\alpha^*} Q_k}^{\text{Desired update}} + \overbrace{(Q_{k+1} - (\mathbf{1}-\widetilde{\varepsilon}_k) \odot Q_k - \widetilde{\varepsilon}_k \odot \mathcal{R}^{\alpha_k} Q_k)}^{\text{Martingale noise}} + \widetilde{\varepsilon}_k\odot  \overbrace{(\mathcal{R}^{\alpha_k} Q_k - \mathcal{R}^{\alpha^*} Q_k)}^{\text{Perturbation}}\, ,
    \end{align*}

    where $\mathcal{R}^\alpha$ denotes the Retrace operator targeting $\alpha\pi + (1-\alpha)\mu$, $\widetilde{\varepsilon}_k(x, a) = \varepsilon_k \mathbb{E}[\sum_{t} \mathbbm{1}_{(x_t,a_t)=(x, a)} \mid (x_0,a_0)=(x,a)]$, $\odot$ denotes the Hadamard product, and $\mathbf{1}$ is the vector of ones. One can then appeal to Proposition 4.5 of \citet{bertsekas1996neuro}, using the assumptions of the theorem, to conclude that $Q_k \rightarrow Q^{\alpha^*\pi + (1-\alpha^*)\mu}$ almost surely.
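    For completeness, one can check that the decomposition of $Q_{k+1}$ is an exact algebraic identity rather than an approximation. The labels $D_k$, $M_k$, and $P_k$ are introduced here only for this check and are not part of the original notation: $D_k := (\mathbf{1} - \widetilde{\varepsilon}_k) \odot Q_k + \widetilde{\varepsilon}_k \odot \mathcal{R}^{\alpha^*} Q_k$ is the desired update, $M_k := Q_{k+1} - (\mathbf{1}-\widetilde{\varepsilon}_k) \odot Q_k - \widetilde{\varepsilon}_k \odot \mathcal{R}^{\alpha_k} Q_k$ is the martingale noise, and $P_k := \widetilde{\varepsilon}_k \odot (\mathcal{R}^{\alpha_k} Q_k - \mathcal{R}^{\alpha^*} Q_k)$ is the perturbation including its step-size factor. Then
    \begin{align*}
        D_k + M_k &= Q_{k+1} + \widetilde{\varepsilon}_k \odot \mathcal{R}^{\alpha^*} Q_k - \widetilde{\varepsilon}_k \odot \mathcal{R}^{\alpha_k} Q_k = Q_{k+1} - P_k \, , \\
        D_k + M_k + P_k &= Q_{k+1} \, ,
    \end{align*}
    since the $(\mathbf{1}-\widetilde{\varepsilon}_k) \odot Q_k$ terms cancel between $D_k$ and $M_k$. In particular, all dependence on $\alpha_k$ outside the noise term is isolated in $P_k$, which is the term the intuition above describes as small when $\alpha_k$ is close to $\alpha^*$.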

\end{proof}
