94c281d06e7fd422.tex
1: \begin{abstract}
2: Data abundance across domains follows a long-tailed distribution: a few domains have abundant data, while most face data scarcity. We focus on the multilingual setting, where available data is heavily skewed toward high-resource languages, creating large imbalances in training data size across languages. This disparity makes it difficult to train language models that perform uniformly well across all languages. Two common strategies to address this issue are upsampling low-resource languages (\textit{Temperature Sampling}) and upweighting their losses (\textit{Scalarization}). The two methods are often assumed to be equivalent, but this equivalence has not been rigorously established, which motivates our investigation.
3: 
4: Through theoretical and empirical analysis, we identify when the two methods are equivalent and when they diverge. We prove that they are equivalent under \emph{full} gradient descent but differ under \emph{stochastic} gradient descent because their gradient estimates have different variance. Specifically, Temperature Sampling yields lower-variance gradient estimates, leading to faster convergence but a higher risk of overfitting. Based on these insights, we propose \name{}, a strategy that starts by heavily upsampling low-resource languages to accelerate convergence and then gradually reduces the upsampling to prevent overfitting, achieving the best of both worlds. Our method is competitive with existing data re-weighting techniques while remaining computationally efficient.
5: \end{abstract}
6: