e85f266d6b27134e.tex
1: \begin{abstract}
2:   We discuss the approximation of the value function for
3:   infinite-horizon discounted Markov Reward Processes (\abbr{MRP}) with
4:   nonlinear functions trained with the Temporal-Difference (\abbr{TD}) learning
5:   algorithm. We first consider this problem under a certain scaling of the
6:   approximating function, leading to a regime called lazy training. In
7:   this regime, the parameters of the model vary only slightly during
8:   the learning process, a feature that has recently been observed in
9:   the training of neural networks, where the scaling we study arises
10:   naturally, implicit in the initialization of their parameters. In
11:   both the under- and over-parametrized frameworks, we prove exponential
12:   convergence of the above algorithm to local and global minimizers,
13:   respectively, in the lazy training regime. We then compare this scaling
14:   of the parameters to the \emph{mean-field} regime, where the approximately
15:   linear behavior of the model is lost. Under this alternative scaling we prove
16:   that all fixed points of the dynamics in parameter space are global minimizers.
17:   Finally, we illustrate our convergence results on models that diverge when trained
18:   with non-lazy \abbr{TD} learning, and on neural networks.
19: % lazy training
20: %prove convergence
21: % give examples
22: 
23:  % Under a given scaling of the initial condition which is similar to the widely used LeCun initialization, large enough single layer neural networks used for value function approximation and trained through temporal difference perform at least as well as a randomly initialized network. In particular, they converge exponentially fast to a local minimizer.
24: \end{abstract}
25: