Self-discipline on multiple channels
This work addresses the challenge of making self-distillation more practical by reducing computational costs and memory usage for machine learning practitioners, though it appears incremental as it builds on existing self-distillation and regularization techniques.
The paper tackles the problem of improving model generalization and robustness to noisy labels by proposing Self-discipline on multiple channels (SMC), a method combining consistency regularization and self-distillation without requiring extra models or modifications. Results show that SMC-2 outperforms existing methods like Label Smoothing Regularization and Self-distillation From The Last Mini-batch on all models, and beats Sharpness-Aware Minimization on 83% of models, with improvements of 0.28% to 1.80% when combined with data augmentation.
Self-distillation uses a model's own outputs as a training signal to improve its generalization ability, and is a promising direction. Existing self-distillation methods require additional models, model modification, or batch size expansion for training, which increases the difficulty of use, memory consumption, and computational cost. This paper develops Self-discipline on multiple channels (SMC), which combines consistency regularization with self-distillation using the concept of multiple channels. Conceptually, SMC consists of two steps: 1) the data of each channel is passed through the model simultaneously to obtain its corresponding soft label, and 2) the soft label saved in the previous step is read together with the soft label obtained by passing the current channel's data through the model, and the two are used to calculate the loss function. SMC uses consistency regularization and self-distillation to improve both the generalization ability of the model and its robustness to noisy labels. We name the SMC variant containing only two channels SMC-2. Comparative experimental results on two datasets show that SMC-2 outperforms Label Smoothing Regularization and Self-distillation From The Last Mini-batch on all models, and outperforms the state-of-the-art Sharpness-Aware Minimization method on 83% of the models. Experiments on the compatibility of SMC-2 with data augmentation show that using both together improves the generalization ability of the model by between 0.28% and 1.80% compared to using data augmentation alone. Finally, the label noise interference experiments show that SMC-2 curbs the tendency of the model's generalization ability to decrease late in training due to label noise. The code is available at https://github.com/JiuTiannn/SMC-Self-discipline-on-multiple-channels.
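The two conceptual steps can be illustrated with a minimal sketch for the two-channel case. This is not the authors' implementation: the abstract does not give the exact loss, so the symmetric KL consistency term, the weight `alpha`, the `temperature` parameter, and all function names below are assumptions chosen for illustration. The two "channels" are taken here to be the model's logits for two views of the same sample.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into a soft label (probability vector)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Supervised loss against the (possibly noisy) hard label."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) between two soft labels."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def smc2_loss(logits_a, logits_b, label, alpha=1.0, temperature=1.0):
    """Hypothetical two-channel loss: supervised term plus consistency term.

    logits_a, logits_b: model outputs for the two channels of one sample.
    alpha: assumed weight of the consistency term (not from the paper).
    """
    # Step 1: each channel yields its own soft label.
    soft_a = softmax(logits_a, temperature)
    soft_b = softmax(logits_b, temperature)

    # Step 2: combine the soft labels in the loss. In a real training
    # loop the other channel's soft label would typically be treated
    # as a fixed target (no gradient flows through it).
    supervised = 0.5 * (cross_entropy(softmax(logits_a), label)
                        + cross_entropy(softmax(logits_b), label))
    consistency = 0.5 * (kl_divergence(soft_a, soft_b)
                         + kl_divergence(soft_b, soft_a))
    return supervised + alpha * consistency

if __name__ == "__main__":
    # Two channels that disagree are penalized more than two that agree.
    print(smc2_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0], label=0))
    print(smc2_loss([2.0, 0.5, -1.0], [0.5, 2.0, -1.0], label=0))
```

When both channels produce identical logits, the consistency term vanishes and the loss reduces to ordinary cross-entropy. Under this sketch, a single hard label that both channels contradict adds no consistency penalty, which gives one intuition for how agreement between channels could temper the influence of noisy labels.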