abstract:f1db792bd7992297.tex

\begin{abstract}
Actor-critic based approaches were among the first to address reinforcement
learning in a general setting. Recently, these algorithms have gained
renewed interest due to their generality, good convergence properties,
and possible biological relevance. In this paper, we introduce an
online temporal difference based actor-critic algorithm which is proved
to converge to a neighborhood of a local maximum of the average reward.
Linear function approximation is used by the critic in order to estimate
the value function and the temporal difference signal, which is passed
from the critic to the actor. The main distinguishing feature of the
present convergence proof is that both the actor and the critic operate
on a similar time scale, while in most current convergence proofs
they are required to have very different time scales in order to converge.
Moreover, the same temporal difference signal is used to update the
parameters of both the actor and the critic. A limitation of the proposed
approach, compared to results available for two time scale convergence,
is that convergence is guaranteed only to a neighborhood of an optimal
value, rather than to an optimal value itself. The single time scale and
identical temporal difference signal used by the actor and the critic
may provide a step towards constructing more biologically realistic
models of reinforcement learning in the brain.
\end{abstract}