1: \begin{abstract}
2:
3: Micro-batch clipping, a gradient clipping method, has recently shown potential in enhancing automatic speech recognition (ASR) model performance.
4: %
5: However, the mechanism behind this improvement remains unclear, particularly the observation that only certain micro-batch sizes are beneficial.
6: %
7: In this paper, we make the first attempt to explain this phenomenon.
8: %
9: Inspired by recent data pruning research, we assume that specific training samples may impede model convergence during certain training phases.
10: %
11: Under this assumption, our convergence analysis shows that micro-batch clipping can improve the convergence rate asymptotically at the cost of an additional constant bias that does not diminish with more training iterations.
12: %
13: The bias depends on a few factors and can be minimized at a specific micro-batch size, thereby explaining the existence of the sweet-spot micro-batch size observed previously.
14: %
15: We also verify the effectiveness of micro-batch clipping beyond speech models on vision and language models, and show promising performance gains in these domains.
16: %
17: An exploration of potential limitations shows that micro-batch clipping is less effective when training data originates from multiple distinct domains.
18:
19: \end{abstract}
20: