\begin{abstract}
  We discuss the approximation of the value function for
  infinite-horizon discounted Markov Reward Processes (\abbr{MRP}) with
  nonlinear functions trained with the Temporal-Difference (\abbr{TD}) learning
  algorithm. We first consider this problem under a certain scaling of the
  approximating function, leading to a regime called lazy training. In
  this regime, the parameters of the model vary only slightly during
  the learning process, a feature that has recently been observed in
  the training of neural networks, where the scaling we study arises
  naturally, implicit in the initialization of their parameters. In both
  the under- and over-parametrized frameworks, we prove that the above
  algorithm converges exponentially fast to local, respectively global,
  minimizers in the lazy training regime. We then compare this scaling
  of the parameters to the \emph{mean-field} regime, where the approximately
  linear behavior of the model is lost. Under this alternative scaling, we prove
  that all fixed points of the dynamics in parameter space are global minimizers.
  Finally, we give examples of our convergence results in the case of models
  that diverge if trained with non-lazy \abbr{TD} learning, and in the case of
  neural networks.
% lazy training
%prove convergence
% give examples

 % Under a given scaling of the initial condition which is similar to the widely used LeCun initialization, large enough single layer neural networks used for value function approximation and trained through temporal difference perform at least as well as a randomly initialized network. In particular, they converge exponentially fast to a local minimizer.
\end{abstract}
