\begin{abstract}
Embedding learning has found widespread applications in recommendation systems and natural language modeling, among other domains.
To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD,
largely attributed to their token-dependent learning rates.
However, the mechanism underlying the efficiency of token-dependent learning rates remains underexplored.
We show that incorporating token frequency information into embedding learning problems leads to provably efficient algorithms, and demonstrate that common adaptive algorithms implicitly exploit this frequency information to a large extent.
Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate to each token and exhibits a provable speed-up over SGD when the token distribution is imbalanced.
Empirically, we show that the proposed algorithms match or improve upon adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system, closing the performance gap between SGD and adaptive algorithms.
Our results are the first to show that token-dependent learning rates provably improve convergence for non-convex embedding learning problems.
\end{abstract}