\begin{abstract}
Tabular average reward Temporal Difference (TD) learning
is perhaps the simplest and most fundamental policy evaluation algorithm in average reward reinforcement learning.
At least 25 years after its discovery,
we are finally able to provide a long-awaited almost sure convergence analysis.
Namely,
we are the first to prove that, under very mild conditions,
tabular average reward TD converges almost surely to a sample-path dependent fixed point.
Key to this success is a new general stochastic approximation result concerning nonexpansive mappings with Markovian and additive noise,
built on recent advances in stochastic Krasnoselskii-Mann iterations.
\end{abstract}
