\begin{abstract}
Distributed machine learning has recently become a critical paradigm for training large models on vast datasets.
We examine the stochastic optimization problem for deep learning within synchronous parallel computing environments under communication constraints.
While averaging distributed gradients is the most widely used method for gradient estimation, whether it is the optimal strategy remains an open question.
In this work, we analyze the distributed gradient aggregation process through the lens of subspace optimization.
By formulating aggregation as an objective-aware subspace optimization problem, we derive an efficient weighting scheme for gradients, guided by subspace coefficients.
We further introduce subspace momentum to accelerate convergence while keeping the aggregation statistically unbiased.
Our method improves on the ubiquitous gradient averaging across multiple MLPerf tasks while remaining extremely efficient in both communication and computational complexity.
A sample implementation of the method is available at \url{https://github.com/yoniLc/AdaCons}.
\end{abstract}