732f796ebac7192e.tex
1: \begin{abstract}
2: Distributed machine learning has recently become a critical paradigm for training large models on vast datasets. 
3: We examine the stochastic optimization problem for deep learning within synchronous parallel computing environments under communication constraints.
4: While averaging distributed gradients is the most widely used method for gradient estimation, whether this is the optimal strategy remains an open question. 
5: In this work, we analyze the distributed gradient aggregation process through the lens of subspace optimization. 
6: By formulating the aggregation problem as an objective-aware subspace optimization problem, we derive an efficient weighting scheme for gradients, guided by subspace coefficients. 
7: We further introduce subspace momentum to accelerate convergence while maintaining statistical unbiasedness in the aggregation. 
8: Our method outperforms the ubiquitous gradient averaging on multiple MLPerf tasks while remaining highly efficient in both communication and computational complexity.
9: A sample implementation of the method is available at \url{https://github.com/yoniLc/AdaCons}.
10: \end{abstract}
11: