LG ML | Sep 11, 2020

Extending Label Smoothing Regularization with Self-Knowledge Distillation

Ji-Yue Wang, Pei Zhang, Wen-feng Pang, Jie Li

arXiv:2009.05226v11.2

Originality: Incremental advance

AI Analysis

This work addresses a specific issue in training deep neural networks by refining regularization techniques, but it is incremental as it builds on existing LSR and knowledge distillation methods.

The authors improve Label Smoothing Regularization (LSR) by integrating it with knowledge distillation. The resulting LsrKD method consistently improves on LSR, especially on deep neural networks where LSR was previously ineffective, and the self-distillation method MrKD significantly improves single-model training. Across the experiments, both methods and their Teacher Correction variants are comparable to or better than LSR.

Inspired by the strong correlation between Label Smoothing Regularization (LSR) and Knowledge Distillation (KD), we propose LsrKD, an algorithm that boosts training by extending the LSR method to the KD regime and applying a softer temperature. We then improve LsrKD with a Teacher Correction (TC) method, which manually assigns a constant, larger proportion to the correct class in the uniform-distribution teacher. To further improve the performance of LsrKD, we develop a self-distillation method named Memory-replay Knowledge Distillation (MrKD), which replaces the uniform-distribution teacher in LsrKD with a knowledgeable one. MrKD penalizes the KD loss between the current model's output distributions and those of its copies along the training trajectory. By preventing the model from straying too far from its historical output distribution space, MrKD can stabilize learning and find a more robust minimum. Our experiments show that LsrKD improves LSR performance consistently at no cost, especially on several deep neural networks where LSR is ineffectual. MrKD also significantly improves single-model training. The experimental results confirm that TC helps LsrKD and MrKD boost training, especially on the networks where they fail. Overall, LsrKD, MrKD, and their TC variants are comparable to or outperform the LSR method, suggesting the broad applicability of these KD methods.
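The abstract gives no formulas, so the following is only a minimal NumPy sketch of how the three ideas could fit together, not the authors' implementation. It assumes the usual KD convention of a KL term at temperature T scaled by T^2 and a weighted sum with cross-entropy. The function names, the `alpha` weighting, the `correct_prob` parameter used for Teacher Correction, and the use of stored past logits to stand in for the model copies in MrKD are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits / temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean cross-entropy against hard integer labels."""
    p = softmax(logits)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def uniform_teacher(labels, num_classes, correct_prob=None):
    """Uniform teacher distribution (assumed LsrKD teacher).

    With correct_prob set (assumed form of Teacher Correction), the true
    class gets that fixed share and the rest is spread evenly.
    """
    n = len(labels)
    if correct_prob is None:
        return np.full((n, num_classes), 1.0 / num_classes)
    teacher = np.full((n, num_classes), (1.0 - correct_prob) / (num_classes - 1))
    teacher[np.arange(n), labels] = correct_prob
    return teacher

def kd_loss(student_logits, teacher_probs, temperature):
    """Mean KL(teacher || student at temperature T), scaled by T^2."""
    eps = 1e-12
    p = softmax(student_logits, temperature)
    kl = np.sum(teacher_probs * (np.log(teacher_probs + eps) - np.log(p + eps)), axis=1)
    return temperature ** 2 * kl.mean()

def lsrkd_loss(logits, labels, alpha=0.1, temperature=4.0, correct_prob=None):
    """Assumed LsrKD objective: cross-entropy plus KD toward a uniform teacher."""
    teacher = uniform_teacher(labels, logits.shape[1], correct_prob)
    return (1 - alpha) * cross_entropy(logits, labels) + alpha * kd_loss(logits, teacher, temperature)

def mrkd_loss(logits, labels, past_logits_list, alpha=0.1, temperature=4.0):
    """Assumed MrKD objective: cross-entropy plus the average KD loss against
    outputs of earlier copies of the same model on the same batch."""
    kd = np.mean([
        kd_loss(logits, softmax(past, temperature), temperature)
        for past in past_logits_list
    ])
    return (1 - alpha) * cross_entropy(logits, labels) + alpha * kd

# Usage on a toy batch of two samples and four classes.
logits = np.array([[2.0, 0.5, -1.0, 0.0], [0.1, 0.2, 3.0, -0.5]])
labels = np.array([0, 2])
print(lsrkd_loss(logits, labels))                    # uniform teacher
print(lsrkd_loss(logits, labels, correct_prob=0.4))  # with Teacher Correction
print(mrkd_loss(logits, labels, [logits * 0.5]))     # one hypothetical past copy
```

In this sketch the only difference between LsrKD and MrKD is the teacher: a fixed (optionally corrected) uniform distribution in the first case, and softened outputs recorded earlier in training in the second.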
