\begin{abstract}
We present, for the first time, an asymptotic convergence analysis of two time-scale stochastic approximation driven by
`controlled' Markov noise. In particular, both the faster and the slower recursions have non-additive controlled Markov noise components in
addition to martingale difference noise. We analyze the asymptotic behavior of our framework
by relating it to limiting differential inclusions in both time-scales that are defined in terms
of the ergodic occupation measures associated with the controlled Markov processes.
%We also point out that some additional assumptions are needed to complete the analysis of single time-scale controlled Markov noise framework of Borkar which
%motivates us to take the range of the controlled Markov processes as compact.
Finally, using our results, we present a solution to the off-policy convergence problem for temporal difference
learning with linear function approximation.
\end{abstract}