\begin{abstract}

The stochastic approximation (SA) algorithm is a widely used probabilistic
method for finding a zero or a fixed point of a vector-valued function,
when only noisy measurements of the function are available.
The literature to date distinguishes between ``synchronous''
updating, whereby every component of the current guess
is updated at each time instant,
and ``asynchronous'' updating, whereby only one component is updated.
In this paper, we study an intermediate situation that we call
``batch asynchronous stochastic approximation'' (BASA), in which,
at each time instant,
\textit{some but not all} components of the current estimated solution
are updated.
BASA allows the user to trade off memory requirements against time
complexity.
We develop a general methodology for proving that such algorithms converge
to the fixed point of the map under study.
These convergence proofs use weaker hypotheses than existing results.
Specifically, existing convergence proofs
require that the measurement noise be a zero-mean i.i.d.\ sequence
or a martingale difference sequence.
In the present paper, we permit biased measurements, that is,
measurement noises that have nonzero conditional mean.
Also, all convergence results to date assume that the stochastic step sizes
satisfy a probabilistic analog of the well-known Robbins--Monro conditions.
We replace this assumption by a purely deterministic
condition on the irreducibility of the underlying Markov processes.

As specific applications to Reinforcement Learning,
we analyze the temporal difference algorithm $\TDl$ for value iteration,
and the $Q$-learning algorithm for finding the optimal action-value function.
In both cases, we establish the convergence of these algorithms
under milder conditions than in the existing literature.

\end{abstract}