\begin{abstract}
On-device memory concerns in distributed deep learning have become severe due to (i) the growth of model size in multi-GPU training, and (ii) the wide adoption of deep neural networks for federated learning on IoT devices, which have limited storage. In such settings, communication-efficient optimization methods are attractive alternatives; however, they still struggle with memory issues. To tackle these challenges, we propose a communication-efficient method called contractive error feedback (\name). Unlike SGD with error feedback (EFSGD), which manages memory inefficiently, \name achieves a sweet spot between convergence and memory usage, and attains communication efficiency by leveraging biased and all-reducible gradient compression. We empirically validate \name on various learning tasks, including image classification, language modeling, and machine translation, and observe that \name saves 80\%--90\% of the extra memory required by EFSGD with almost no loss in test performance, while also achieving a $1.3\times$--$5\times$ speedup over SGD. We also establish the feasibility and convergence of \name, clearing the theoretical barrier to integrating \name into popular memory-efficient frameworks such as ZeRO-3.
\end{abstract}