Revisit Micro-batch Clipping: Adaptive Data Pruning via Gradient Manipulation
This work addresses an unexplained behavior of micro-batch gradient clipping that matters to deep learning researchers and practitioners, though the contribution is incremental because it builds on existing micro-batch clipping and data pruning concepts.
The paper tackles the unexplained observation that micro-batch clipping improves model performance only at specific micro-batch sizes. Its convergence analysis shows that clipping improves the asymptotic convergence rate at the cost of a constant bias, which is minimized at a specific micro-batch size, and its experiments verify gains across speech, vision, and language models.
Micro-batch clipping, a gradient clipping method, has recently shown potential in enhancing automatic speech recognition (ASR) model performance. However, the underlying mechanism behind this improvement remains mysterious, particularly the observation that only certain micro-batch sizes are beneficial. In this paper, we make the first attempt to explain this phenomenon. Inspired by recent data pruning research, we assume that specific training samples may impede model convergence during certain training phases. Under this assumption, the convergence analysis shows that micro-batch clipping can improve the convergence rate asymptotically at the cost of an additional constant bias that does not diminish with more training iterations. The bias depends on a few factors and can be minimized at a specific micro-batch size, thereby explaining the sweet-spot micro-batch size observed previously. We also verify the effectiveness of micro-batch clipping beyond speech models, on vision and language models, and show promising performance gains in these domains. An exploration of potential limitations shows that micro-batch clipping is less effective when training data originates from multiple distinct domains.
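
The abstract does not spell out the clipping procedure, so the following is only a minimal sketch under an assumed, commonly used formulation: the mini-batch is split into micro-batches, the average gradient of each micro-batch is clipped to a fixed L2 norm, and the clipped gradients are averaged. The function names, the `clip_norm` parameter, and the use of precomputed per-example gradients are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def clip_by_norm(grad, clip_norm):
    """Rescale grad so that its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(grad)
    # The small constant avoids division by zero for an all-zero gradient.
    scale = min(1.0, clip_norm / (norm + 1e-12))
    return grad * scale


def micro_batch_clipped_gradient(per_example_grads, micro_batch_size, clip_norm):
    """Assumed formulation: clip each micro-batch's mean gradient, then average.

    per_example_grads: array of shape (batch_size, num_params), hypothetical
        per-example gradients that have already been flattened.
    micro_batch_size: number of examples per micro-batch.
    clip_norm: L2 norm bound applied to each micro-batch gradient.
    """
    batch_size = per_example_grads.shape[0]
    if batch_size % micro_batch_size != 0:
        raise ValueError("batch size must be divisible by micro_batch_size")

    num_micro_batches = batch_size // micro_batch_size
    # Group examples into micro-batches: (num_micro_batches, micro_batch_size, num_params).
    grouped = per_example_grads.reshape(num_micro_batches, micro_batch_size, -1)
    # Mean gradient of each micro-batch.
    micro_grads = grouped.mean(axis=1)
    # Clip every micro-batch gradient independently.
    clipped = np.stack([clip_by_norm(g, clip_norm) for g in micro_grads])
    # Average the clipped micro-batch gradients into one update direction.
    return clipped.mean(axis=0)
```

In this sketch, a micro-batch whose gradient has an unusually large norm is scaled down before averaging, so its influence on the update is bounded. That is one way to read the abstract's link to data pruning: samples that would otherwise dominate the step are suppressed adaptively. The micro-batch size controls how many examples are averaged before clipping, which is the quantity the paper's bias analysis is concerned with.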