\begin{abstract}

Micro-batch clipping, a gradient clipping method, has recently shown potential for improving the performance of automatic speech recognition (ASR) models.
%
However, the mechanism behind this improvement remains unclear, in particular why only certain micro-batch sizes are beneficial.
%
In this paper, we make the first attempt to explain this phenomenon.
%
Inspired by recent data pruning research, we assume that specific training samples may impede model convergence during certain training phases.
%
Under this assumption, our convergence analysis shows that micro-batch clipping improves the asymptotic convergence rate at the cost of an additional constant bias term that does not diminish with more training iterations.
%
This bias depends on several factors and can be minimized at a specific micro-batch size, which explains the sweet-spot micro-batch size observed previously.
%
We also verify that micro-batch clipping is effective beyond speech models, showing promising performance gains on vision and language models.
%
An exploration of potential limitations shows that micro-batch clipping is less effective when the training data originates from multiple distinct domains.

\end{abstract}