
\begin{abstract}


	State-of-the-art training algorithms for deep learning models are based on stochastic gradient descent (SGD).
	Recently, many variations have been explored:\ perturbing parameters for better accuracy (as in Extragradient), limiting SGD updates to a subset of parameters for increased efficiency (as in meProp), or a combination of both (as in Dropout). However, the convergence of these methods is often not studied theoretically. \\


	We propose a unified theoretical framework to study such SGD variants. The framework encompasses the aforementioned algorithms as well as a broad variety of methods used for communication-efficient training or model compression.
	Our insights can be used as a guide to improve the efficiency of such methods and to facilitate their generalization to new applications.

	As an example, we tackle the task of jointly training networks, a version of which (limited to sub-networks) is used to create Slimmable Networks. By training a low-rank Transformer jointly with a standard one, we obtain better performance than when the low-rank model is trained separately.


\end{abstract}
