1afd55961fd1f014.tex
1: \begin{abstract}
2: We analyze recurrent neural networks with {diagonal hidden-to-hidden weight matrices}, trained with gradient descent in the supervised learning setting, and prove that gradient descent can achieve optimality \emph{without} massive overparameterization. Our in-depth nonasymptotic analysis (i) provides {improved} bounds on the network size $m$ in terms of the sequence length $T$, sample size $n$ and ambient dimension $d$, and (ii) identifies the significant impact of long-term dependencies in the dynamical system on the convergence and network width bounds characterized by a cutoff point that depends on the Lipschitz continuity of the activation function. Remarkably, this analysis reveals that an appropriately-initialized recurrent neural network trained with $n$ samples can achieve optimality with a network size $m$ that scales only logarithmically with $n$. This sharply contrasts with the prior works that require high-order polynomial dependency of $m$ on $n$ to establish strong regularity conditions. Our results are based on an explicit characterization of the class of dynamical systems that can be approximated and learned by recurrent neural networks via norm-constrained transportation mappings, and establishing local smoothness properties of the hidden state with respect to the learnable parameters.
3: \end{abstract}
4: