1: \begin{abstract}
2: Actor-Critic based approaches were among the first to address reinforcement
3: learning in a general setting. Recently, these algorithms have gained
4: renewed interest due to their generality, good convergence properties,
5: and possible biological relevance. In this paper, we introduce an
6: online temporal difference based actor-critic algorithm which is proved
7: to converge to a neighborhood of a local maximum of the average reward.
8: Linear function approximation is used by the critic in order to estimate
9: the value function, and the temporal difference signal, which is passed
10: from the critic to the actor. The main distinguishing feature of the
11: present convergence proof is that both the actor and the critic operate
12: on a similar time scale, while in most current convergence proofs
13: they are required to have very different time scales in order to converge.
14: Moreover, the same temporal difference signal is used to update the
15: parameters of both the actor and the critic. A limitation of the proposed
16: approach, compared to results available for two time scale convergence,
17: is that convergence is guaranteed only to a neighborhood of an optimal
18: value, rather than to an optimal value itself. The single time scale and
19: identical temporal difference signal used by the actor and the critic
20: may provide a step towards constructing more biologically realistic
21: models of reinforcement learning in the brain.
22: \end{abstract}
23: