\begin{abstract}
  We consider alternating gradient descent (AGD) with fixed step size
  $\eta > 0$, applied to the asymmetric matrix factorization objective.
  We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$,
  $T = \mathcal{O}\left( \left(\frac{\sigma_1(\mathbf{A})}{\sigma_r(\mathbf{A})}\right)^2 \log(1/\epsilon)\right)$
  iterations of AGD suffice to reach an $\epsilon$-optimal factorization
  $\| \mathbf{A} - \mathbf{X}_T^{\vphantom{\intercal}} \mathbf{Y}_T^{\intercal} \|_{\rm F}^2 \leq \epsilon \| \mathbf{A} \|_{\rm F}^2$
  with high probability
  starting from an atypical random initialization. The
  factors have rank $d>r$ so that $\mathbf{X}_T\in\mathbb{R}^{m \times d}$
  and $\mathbf{Y}_T \in\mathbb{R}^{n \times d}$.
  Experiments suggest that our proposed initialization is not merely
  of theoretical benefit, but also significantly improves the
  convergence of gradient descent in practice. Our proof is
  conceptually simple: a uniform Polyak-\L{}ojasiewicz (PL) inequality
  and a uniform Lipschitz smoothness constant are guaranteed for a
  sufficient number of iterations, starting from our random
  initialization. Our proof method should be useful for extending and
  simplifying convergence analyses for a broader class of nonconvex
  low-rank factorization problems.
\end{abstract}
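
For concreteness, the AGD iteration referred to in the abstract can be
sketched as follows. This is an illustrative sketch under assumptions
not stated above: the objective is taken to be the squared Frobenius
loss with a factor $\frac{1}{2}$, $\mathbf{X}$ is updated before
$\mathbf{Y}$, and the symbols $f$ and $t$ are introduced here only for
illustration.
\[
  f(\mathbf{X}, \mathbf{Y})
  = \frac{1}{2} \| \mathbf{A} - \mathbf{X} \mathbf{Y}^{\intercal} \|_{\rm F}^2 ,
\]
\[
  \mathbf{X}_{t+1}
  = \mathbf{X}_t - \eta \nabla_{\mathbf{X}} f(\mathbf{X}_t, \mathbf{Y}_t)
  = \mathbf{X}_t + \eta \left( \mathbf{A} - \mathbf{X}_t^{\vphantom{\intercal}} \mathbf{Y}_t^{\intercal} \right) \mathbf{Y}_t ,
\]
\[
  \mathbf{Y}_{t+1}
  = \mathbf{Y}_t - \eta \nabla_{\mathbf{Y}} f(\mathbf{X}_{t+1}, \mathbf{Y}_t)
  = \mathbf{Y}_t + \eta \left( \mathbf{A} - \mathbf{X}_{t+1}^{\vphantom{\intercal}} \mathbf{Y}_t^{\intercal} \right)^{\intercal} \mathbf{X}_{t+1} .
\]
Under this reading, the $\mathbf{Y}$-update uses the already-updated
factor $\mathbf{X}_{t+1}$, whereas simultaneous gradient descent would
use $\mathbf{X}_t$ in both updates.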