1: \begin{abstract}
2: %%\vspace{-0.1in}
3: %%Adaptive learning rate algorithms have been widely adopted for embedding learning problems in various domains,
4: %% due to their token-dependent learning rates and superior performance over stochastic gradient descent (SGD).
5: %%However, the theoretical reasons for the superiority of adaptive algorithms over SGD remain largely unexplored.
6: %%Adaptive methods are useful; unlike existing algorithms, which rely on gradient information and lack theoretical guarantees, we use frequency information explicitly (with provable benefits over SGD).
7: %
8: %Embedding learning has found widespread applications in recommendation systems and natural language modeling, among other domains.
9: %To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD,
10: %largely attributed to their token-dependent learning rates.
11: %However, the mechanism underlying the efficiency of token-dependent learning rates remains underexplored.
12: %We show that incorporating token frequency information into embedding learning problems leads to provably efficient algorithms, and demonstrate that common adaptive algorithms implicitly exploit this frequency information to a large extent.
13: %Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate to each token and exhibits a provable speed-up over SGD when the token distribution is imbalanced.
14: %Empirically, we show that the proposed algorithms improve upon or match adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system, closing the performance gap between SGD and adaptive algorithms.
15: %Our results are the first to show that a token-dependent learning rate provably improves convergence for non-convex embedding learning problems.
16: %%\vspace{-0.2in}
17: %\end{abstract}
18: