1: \begin{abstract}
2: %%\vspace{-0.1in}
3: %%Adaptive learning rate algorithms have been widely adopted for embedding learning problems in various domains, 
4: %% due to their token-dependent learning rates and superior performance over stochastic gradient descent (SGD).
5: %%However, the theoretical understanding of the superiority of adaptive algorithms over SGD remains largely underexplored.
6: %%Adaptive algorithms are useful; unlike existing algorithms, which use gradient information and have no theoretical guarantees, we use frequency information explicitly (provable benefits over SGD).
7: %
8: %Embedding learning has found widespread applications in recommendation systems and natural language modeling, among other domains. 
9: %To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD, 
10: %largely attributed to their token-dependent learning rates. 
11: %However, the mechanism underlying the efficiency of token-dependent learning rates remains underexplored.
12: %We show that incorporating token frequency information into embedding learning problems leads to provably efficient algorithms, and demonstrate that common adaptive algorithms implicitly exploit this frequency information to a large extent. 
13: %Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate to each token and exhibits a provable speed-up over SGD when the token distribution is imbalanced. 
14: %Empirically, we show that the proposed algorithms improve upon or match adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system, closing the performance gap between SGD and adaptive algorithms. 
15: %Our results are the first to show that a token-dependent learning rate provably improves convergence for non-convex embedding learning problems.
16: %%\vspace{-0.2in}
17: %\end{abstract}
18: