abstract:1cc660897625e4fe.tex

1: \begin{abstract}

2: Embedding learning has found widespread applications in recommendation systems and natural language modeling, among other domains.

3: To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD,

4: which is largely attributed to their token-dependent learning rates.

5: However, the mechanism that makes token-dependent learning rates efficient remains underexplored.

6: We show that incorporating token frequency information into embedding learning problems leads to provably efficient algorithms, and demonstrate that common adaptive algorithms implicitly exploit this frequency information to a large extent.

7: Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate to each token and exhibits a provable speed-up over SGD when the token distribution is imbalanced.

8: Empirically, we show that the proposed algorithms match or improve upon adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system, closing the performance gap between SGD and adaptive algorithms.

9: Our results are the first to show that a token-dependent learning rate provably improves convergence for non-convex embedding learning problems.

10: \end{abstract}

11: