\begin{abstract}
Distributed machine learning has recently become a critical paradigm for training large models on vast datasets.
We examine the stochastic optimization problem for deep learning within synchronous parallel computing environments under communication constraints.
While averaging distributed gradients is the most widely used method for gradient estimation, whether this is the optimal strategy remains an open question.
In this work, we analyze the distributed gradient aggregation process through the lens of subspace optimization.
By formulating the aggregation problem as an objective-aware subspace optimization problem, we derive an efficient weighting scheme for gradients, guided by subspace coefficients.
We further introduce subspace momentum to accelerate convergence while maintaining statistical unbiasedness in the aggregation.
Our method improves on the ubiquitous gradient averaging across multiple MLPerf tasks while remaining highly efficient in both communication and computational complexity.
A sample implementation of the method is available at \url{https://github.com/yoniLc/AdaCons}.
\end{abstract}
