\begin{abstract}
Embedding learning has found widespread applications in recommendation systems and natural language modeling, among other domains.
To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD,
which is largely attributed to their token-dependent learning rates.
However, the mechanism underlying the efficiency of token-dependent learning rates remains underexplored.
We show that incorporating token frequency information into embedding learning problems leads to provably efficient algorithms, and we demonstrate that common adaptive algorithms implicitly exploit this frequency information to a large extent.
Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate to each token and achieves a provable speed-up over SGD when the token distribution is imbalanced.
Empirically, we show that the proposed algorithms match or improve upon adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system, closing the performance gap between SGD and adaptive algorithms.
Our results are the first to show that token-dependent learning rates provably improve convergence for non-convex embedding learning problems.
\end{abstract}