Nakyil Kim

LG | May 19, 2021

Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation

Taehyeon Kim, Jaehoon Oh, NakYil Kim et al.

Knowledge distillation (KD), which transfers knowledge from a cumbersome teacher model to a lightweight student model, has been investigated as a way to design efficient neural architectures. Generally, the objective function of KD is the Kullback-Leibler (KL) divergence loss between the softened probability distributions of the teacher model and the student model, with a temperature scaling hyperparameter tau. Despite its widespread use, few studies have discussed the influence of such softening on generalization. Here, we theoretically show that the KL divergence loss focuses on logit matching as tau increases and on label matching as tau goes to 0, and we empirically show that logit matching is positively correlated with performance improvement in general. From this observation, we consider an intuitive KD loss function, the mean squared error (MSE) between the logit vectors, so that the student model can directly learn the logits of the teacher model. The MSE loss outperforms the KL divergence loss, which is explained by the difference in the penultimate layer representations between the two losses. Furthermore, we show that sequential distillation can improve performance and that KD, particularly when using the KL divergence loss with small tau, mitigates label noise. The code to reproduce the experiments is publicly available online at https://github.com/jhoon-oh/kd_data/.
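
The two losses compared in this abstract can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' code (their implementation is at the repository linked above). The function names `softmax`, `log_softmax`, `kd_kl_loss`, and `kd_mse_loss` and the example logits are assumptions made for this sketch, as is the tau^2 scaling of the KL term, which is a common convention that the abstract itself does not specify. Under these assumptions, a second-order expansion suggests that for mean-centered logits the scaled KL loss approaches half the MSE between logits as tau grows, while for small tau the softened teacher distribution collapses toward a one-hot label.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-softened softmax, shifted by the max for numerical stability."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, tau=1.0):
    """Log of the temperature-softened softmax, via a stable log-sum-exp."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def kd_kl_loss(student_logits, teacher_logits, tau):
    """tau^2 * KL(teacher || student) on softened distributions.

    The tau^2 factor is an assumed (commonly used) scaling, not taken
    from the abstract.
    """
    log_p_t = log_softmax(teacher_logits, tau)
    log_p_s = log_softmax(student_logits, tau)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return tau ** 2 * kl


def kd_mse_loss(student_logits, teacher_logits):
    """Mean squared error taken directly between the two logit vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_logits, teacher_logits)]
    return sum(diffs) / len(diffs)


# Hypothetical zero-mean logits for a 3-class problem.
teacher = [2.0, 1.0, -3.0]
student = [1.0, 0.5, -1.5]

for tau in (0.01, 1.0, 4.0, 1e4):
    print(f"tau={tau:g}  scaled KL={kd_kl_loss(student, teacher, tau):.6f}")
print(f"0.5 * MSE on logits = {0.5 * kd_mse_loss(student, teacher):.6f}")
print("teacher distribution at tau=0.01:", softmax(teacher, 0.01))
```

Comparing the printed values for large tau against the half-MSE line gives a quick numerical feel for the logit-matching regime; it is an illustration only and says nothing about the generalization results reported in the paper.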

LG | Dec 7, 2020

Adaptive Local Bayesian Optimization Over Multiple Discrete Variables

Taehyeon Kim, Jaeyeon Ahn, Nakyil Kim et al.

In machine learning, the choice of hyperparameters is often more an art than a science, requiring labor-intensive search guided by expert experience. Automating hyperparameter optimization to remove human intervention is therefore highly appealing, especially for black-box functions. Recently, there has been increasing demand for solving such concealed tasks for better generalization, though the task-dependent issue is not easy to solve. The Black-Box Optimization challenge (NeurIPS 2020) required competitors to build a robust black-box optimizer across different domains of standard machine learning problems. This paper describes the approach of team KAIST OSI in a step-wise manner, which outperforms the baseline algorithms by up to +20.39%. We first strengthen the local Bayesian search under the concept of region reliability. Then, we design a combinatorial kernel for the Gaussian process. In a similar vein, we combine Bayesian and multi-armed bandit (MAB) approaches to select values according to variable type: real and integer variables are handled with the Bayesian approach, while boolean and categorical variables are handled with MAB. Empirical evaluations demonstrate that our method outperforms the existing methods across different tasks.
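
The type-dependent routing described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the team's method: the abstract does not say which bandit algorithm was used, so UCB1 is swapped in here as a well-known stand-in, and the Gaussian-process side is omitted entirely. The names `route_variable` and `CategoricalBandit`, the type labels, and the reward values are all hypothetical. Rewards are assumed to lie in [0, 1], as UCB1 expects.

```python
import math


def route_variable(var_type):
    """Pick a search strategy by variable type (type labels are hypothetical)."""
    if var_type in ("real", "int"):
        return "bayesian"
    if var_type in ("bool", "cat"):
        return "bandit"
    raise ValueError(f"unknown variable type: {var_type}")


class CategoricalBandit:
    """UCB1 over the values of one categorical variable (illustrative stand-in)."""

    def __init__(self, choices):
        self.choices = list(choices)
        self.counts = [0] * len(self.choices)
        self.sums = [0.0] * len(self.choices)

    def suggest(self):
        # Try every value once before trusting the statistics.
        for i, count in enumerate(self.counts):
            if count == 0:
                return self.choices[i]
        total = sum(self.counts)
        # Mean reward plus an exploration bonus that shrinks with more pulls.
        scores = [
            self.sums[i] / self.counts[i]
            + math.sqrt(2.0 * math.log(total) / self.counts[i])
            for i in range(len(self.choices))
        ]
        return self.choices[scores.index(max(scores))]

    def observe(self, choice, reward):
        # Record the reward (assumed in [0, 1]) for the value that was tried.
        i = self.choices.index(choice)
        self.counts[i] += 1
        self.sums[i] += reward


# Usage with made-up, deterministic rewards for a hypothetical "optimizer" variable.
fake_rewards = {"adam": 0.9, "sgd": 0.2, "rmsprop": 0.4}
bandit = CategoricalBandit(fake_rewards)
for _ in range(200):
    value = bandit.suggest()
    bandit.observe(value, fake_rewards[value])
print(dict(zip(bandit.choices, bandit.counts)))
print(route_variable("real"), route_variable("cat"))
```

In a full optimizer the `"bayesian"` branch would hand real and integer variables to a Gaussian-process surrogate, which is left out of this sketch.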
