abstract:8401a2c5defc7ce9.tex

1: \begin{abstract}

2: %%\vspace{-0.1in}

3: %%Adaptive learning rate algorithms have been widely adopted for embedding learning problems in various domains,

4: %% due to their token-dependent learning rates and superior performance over stochastic gradient descent (SGD).

5: %%However, the theoretical understanding of the superiority of adaptive algorithms over SGD is largely underexplored.

6: %%adaptive useful, different from existing algorithm which uses gradient and no theoretical guarantees, we use frequency information to explicitly (provably benefits over sgd)

7: %

8: %Embedding learning has found widespread applications in recommendation systems and natural language modeling, among other domains.

9: %To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD,

10: %largely attributed to their token-dependent learning rates.

11: %However, the mechanism underlying the efficiency of token-dependent learning rates remains underexplored.

12: %We show that incorporating token frequency information into embedding learning problems leads to provably efficient algorithms, and demonstrate that common adaptive algorithms implicitly exploit this frequency information to a large extent.

13: %Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate to each token and exhibits a provable speed-up over SGD when the token distribution is imbalanced.

14: %Empirically, we show that the proposed algorithms match or improve upon adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system, closing the performance gap between SGD and adaptive algorithms.

15: %Our results are the first to show that a token-dependent learning rate provably improves convergence for non-convex embedding learning problems.

16: %%\vspace{-0.2in}

17: %\end{abstract}

18: