\begin{abstract}
Tabular average reward Temporal Difference (TD) learning
is perhaps the simplest and most fundamental policy evaluation algorithm in average reward reinforcement learning.
At least 25 years after its discovery,
we are finally able to provide a long-awaited almost sure convergence analysis.
Namely,
we are the first to prove that, under very mild conditions,
tabular average reward TD converges almost surely to a sample-path dependent fixed point.
Key to this success is a new general stochastic approximation result concerning nonexpansive mappings with Markovian and additive noise,
built on recent advances in stochastic Krasnoselskii-Mann iterations.
\end{abstract}