\begin{abstract}
Training neural networks with first-order optimisation methods is at the core of the empirical success of deep learning.
The scale of initialisation is a crucial factor, as small initialisations are generally associated with a feature learning regime, in which gradient descent is implicitly biased towards \textit{simple} solutions. This work provides a general and quantitative description of the early alignment phase, originally introduced by \citet{maennel2018gradient}. For small initialisation and one-hidden-layer ReLU networks, the early stage of the training dynamics leads to an alignment of the neurons towards key directions. This alignment induces a sparse representation of the network, which is directly related to the implicit bias of gradient flow at convergence.
This sparsity-inducing alignment, however, comes at the expense of difficulties in minimising the training objective: we also provide a simple data example for which overparameterised networks fail to converge towards global minima and instead converge only to a spurious stationary point.
\end{abstract}