1: \begin{abstract}
2:
3: The stochastic approximation (SA) algorithm is a widely used probabilistic
4: method for finding a zero or a fixed point of a vector-valued function,
5: when only noisy measurements of the function are available.
6: In the literature to date, one makes a distinction between ``synchronous''
7: updating, whereby every component of the current guess
8: is updated at each time,
9: and ``asynchronous'' updating, whereby only one component is updated.
10: In this paper, we study an intermediate situation that we call
11: ``batch asynchronous stochastic approximation'' (BASA), in which,
12: at each time instant,
13: \textit{some but not all} components of the current estimated solution
14: are updated.
15: BASA allows the user to trade off memory requirements against time
16: complexity.
17: We develop a general methodology for proving that such algorithms converge
18: to the fixed point of the map under study.
19: These convergence proofs make use of weaker hypotheses than existing results.
20: Specifically, existing convergence proofs
21: require that the measurement noise is a zero-mean i.i.d.\ sequence
22: or a martingale difference sequence.
23: In the present paper, we permit biased measurements, that is,
24: measurement noises that have nonzero conditional mean.
25: Also, all convergence results to date assume that the stochastic step sizes
26: satisfy a probabilistic analog of the well-known Robbins--Monro conditions.
27: We replace this assumption by a purely deterministic
28: condition on the irreducibility of the underlying Markov processes.
29:
30: As specific applications to Reinforcement Learning,
31: we analyze the temporal difference algorithm $\TDl$ for value iteration,
32: and the $Q$-learning algorithm for finding the optimal action-value function.
33: In both cases, we establish the convergence of these algorithms
34: under milder conditions than in the existing literature.
35:
36: \end{abstract}
37: