LGAug 27, 2025
The Role of Teacher Calibration in Knowledge DistillationSuyoung Kim, Seonguk Park, Junhoo Lee et al.
Knowledge Distillation (KD) has emerged as an effective model compression technique in deep learning, enabling the transfer of knowledge from a large teacher model to a compact student model. While KD has demonstrated significant success, it is not yet fully understood which factors contribute to improving the student's performance. In this paper, we reveal a strong correlation between the teacher's calibration error and the student's accuracy. Therefore, we claim that the calibration of the teacher model is an important factor for effective KD. Furthermore, we demonstrate that the performance of KD can be improved by simply employing a calibration method that reduces the teacher's calibration error. Our algorithm is versatile, demonstrating effectiveness across various tasks from classification to detection. Moreover, it can be easily integrated with existing state-of-the-art methods, consistently achieving superior performance.
CVSep 29, 2022
Semantics-Guided Object Removal for Facial Images: with Broad Applicability and Robust Style PreservationJookyung Song, Yeonjin Chang, Seonguk Park et al.
Object removal and image inpainting in facial images is a task in which objects that occlude a facial image are specifically targeted, removed, and replaced by a properly reconstructed facial image. Two different approaches utilizing U-net and modulated generator respectively have been widely endorsed for this task for their unique advantages but notwithstanding each method's innate disadvantages. U-net, a conventional approach for conditional GANs, retains fine details of unmasked regions but the style of the reconstructed image is inconsistent with the rest of the original image and only works robustly when the size of the occluding object is small enough. In contrast, the modulated generative approach can deal with a larger occluded area in an image and provides {a} more consistent style, yet it usually misses out on most of the detailed features. This trade-off between these two models necessitates an invention of a model that can be applied to any size of mask while maintaining a consistent style and preserving minute details of facial features. Here, we propose Semantics-Guided Inpainting Network (SGIN) which itself is a modification of the modulated generator, aiming to take advantage of its advanced generative capability and preserve the high-fidelity details of the original image. By using the guidance of a semantic map, our model is capable of manipulating facial features which grants direction to the one-to-many problem for further practicability.
IVNov 22, 2021
Local-Selective Feature Distillation for Single Image Super-ResolutionSeongUk Park, Nojun Kwak
Recent improvements in convolutional neural network (CNN)-based single image super-resolution (SISR) methods rely heavily on fabricating network architectures, rather than finding a suitable training algorithm other than simply minimizing the regression loss. Adapting knowledge distillation (KD) can open a way for bringing further improvement for SISR, and it is also beneficial in terms of model efficiency. KD is a model compression method that improves the performance of Deep Neural Networks (DNNs) without using additional parameters for testing. It is getting the limelight recently for its competence at providing a better capacity-performance tradeoff. In this paper, we propose a novel feature distillation (FD) method which is suitable for SISR. We show the limitations of the existing FitNet-based FD method that it suffers in the SISR task, and propose to modify the existing FD algorithm to focus on local feature information. In addition, we propose a teacher-student-difference-based soft feature attention method that selectively focuses on specific pixel locations to extract feature information. We call our method local-selective feature distillation (LSFD) and verify that our method outperforms conventional FD methods in SISR problems.
LGSep 9, 2020
On the Orthogonality of Knowledge Distillation with Other Techniques: From an Ensemble PerspectiveSeongUk Park, KiYoon Yoo, Nojun Kwak
To put a state-of-the-art neural network to practical use, it is necessary to design a model that has a good trade-off between the resource consumption and performance on the test set. Many researchers and engineers are developing methods that enable training or designing a model more efficiently. Developing an efficient model includes several strategies such as network architecture search, pruning, quantization, knowledge distillation, utilizing cheap convolution, regularization, and also includes any craft that leads to a better performance-resource trade-off. When combining these technologies together, it would be ideal if one source of performance improvement does not conflict with others. We call this property as the orthogonality in model efficiency. In this paper, we focus on knowledge distillation and demonstrate that knowledge distillation methods are orthogonal to other efficiency-enhancing methods both analytically and empirically. Analytically, we claim that knowledge distillation functions analogous to a ensemble method, bootstrap aggregating. This analytical explanation is provided from the perspective of implicit data augmentation property of knowledge distillation. Empirically, we verify knowledge distillation as a powerful apparatus for practical deployment of efficient neural network, and also introduce ways to integrate it with other methods effectively.
LGFeb 5, 2020
Feature-map-level Online Adversarial Knowledge DistillationInseop Chung, SeongUk Park, Jangho Kim et al.
Feature maps contain rich information about image intensity and spatial correlation. However, previous online knowledge distillation methods only utilize the class probabilities. Thus in this paper, we propose an online knowledge distillation method that transfers not only the knowledge of the class probabilities but also that of the feature map using the adversarial training framework. We train multiple networks simultaneously by employing discriminators to distinguish the feature map distributions of different networks. Each network has its corresponding discriminator which discriminates the feature map from its own as fake while classifying that of the other network as real. By training a network to fool the corresponding discriminator, it can learn the other network's feature map distribution. We show that our method performs better than the conventional direct alignment method such as L1 and is more suitable for online distillation. Also, we propose a novel cyclic learning scheme for training more than two networks together. We have applied our method to various network architectures on the classification task and discovered a significant improvement of performance especially in the case of training a pair of a small network and a large one.
CVSep 24, 2019
FEED: Feature-level Ensemble for Knowledge DistillationSeongUk Park, Nojun Kwak
Knowledge Distillation (KD) aims to transfer knowledge in a teacher-student framework, by providing the predictions of the teacher network to the student network in the training stage to help the student network generalize better. It can use either a teacher with high capacity or {an} ensemble of multiple teachers. However, the latter is not convenient when one wants to use feature-map-based distillation methods. For a solution, this paper proposes a versatile and powerful training algorithm named FEature-level Ensemble for knowledge Distillation (FEED), which aims to transfer the ensemble knowledge using multiple teacher networks. We introduce a couple of training algorithms that transfer ensemble knowledge to the student at the feature map level. Among the feature-map-based distillation methods, using several non-linear transformations in parallel for transferring the knowledge of the multiple teacher{s} helps the student find more generalized solutions. We name this method as parallel FEED, andexperimental results on CIFAR-100 and ImageNet show that our method has clear performance enhancements, without introducing any additional parameters or computations at test time. We also show the experimental results of sequentially feeding teacher's information to the student, hence the name sequential FEED, and discuss the lessons obtained. Additionally, the empirical results on measuring the reconstruction errors at the feature map give hints for the enhancements.
CVNov 29, 2018
Sym-parameterized Dynamic Inference for Mixed-Domain Image TranslationSimyung Chang, SeongUk Park, John Yang et al.
Recent advances in image-to-image translation have led to some ways to generate multiple domain images through a single network. However, there is still a limit in creating an image of a target domain without a dataset on it. We propose a method that expands the concept of `multi-domain' from data to the loss area and learns the combined characteristics of each domain to dynamically infer translations of images in mixed domains. First, we introduce Sym-parameter and its learning method for variously mixed losses while synchronizing them with input conditions. Then, we propose Sym-parameterized Generative Network (SGN) which is empirically confirmed of learning mixed characteristics of various data and losses, and translating images to any mixed-domain without ground truths, such as 30% Van Gogh and 20% Monet and 40% snowy.
CVFeb 14, 2018
Paraphrasing Complex Network: Network Compression via Factor TransferJangho Kim, SeongUk Park, Nojun Kwak
Many researchers have sought ways of model compression to reduce the size of a deep neural network (DNN) with minimal performance degradation in order to use DNNs in embedded systems. Among the model compression methods, a method called knowledge transfer is to train a student network with a stronger teacher network. In this paper, we propose a novel knowledge transfer method which uses convolutional operations to paraphrase teacher's knowledge and to translate it for the student. This is done by two convolutional modules, which are called a paraphraser and a translator. The paraphraser is trained in an unsupervised manner to extract the teacher factors which are defined as paraphrased information of the teacher network. The translator located at the student network extracts the student factors and helps to translate the teacher factors by mimicking them. We observed that our student network trained with the proposed factor transfer method outperforms the ones trained with conventional knowledge transfer methods.
CVDec 7, 2017
Broadcasting Convolutional Network for Visual Relational ReasoningSimyung Chang, John Yang, Seonguk Park et al.
In this paper, we propose the Broadcasting Convolutional Network (BCN) that extracts key object features from the global field of an entire input image and recognizes their relationship with local features. BCN is a simple network module that collects effective spatial features, embeds location information and broadcasts them to the entire feature maps. We further introduce the Multi-Relational Network (multiRN) that improves the existing Relation Network (RN) by utilizing the BCN module. In pixel-based relation reasoning problems, with the help of BCN, multiRN extends the concept of `pairwise relations' in conventional RNs to `multiwise relations' by relating each object with multiple objects at once. This yields in O(n) complexity for n objects, which is a vast computational gain from RNs that take O(n^2). Through experiments, multiRN has achieved a state-of-the-art performance on CLEVR dataset, which proves the usability of BCN on relation reasoning problems.
LGNov 27, 2017
Butterfly Effect: Bidirectional Control of Classification Performance by Small Additive PerturbationYoungJoon Yoo, Seonguk Park, Junyoung Choi et al.
This paper proposes a new algorithm for controlling classification results by generating a small additive perturbation without changing the classifier network. Our work is inspired by existing works generating adversarial perturbation that worsens classification performance. In contrast to the existing methods, our work aims to generate perturbations that can enhance overall classification performance. To solve this performance enhancement problem, we newly propose a perturbation generation network (PGN) influenced by the adversarial learning strategy. In our problem, the information in a large external dataset is summarized by a small additive perturbation, which helps to improve the performance of the classifier trained with the target dataset. In addition to this performance enhancement problem, we show that the proposed PGN can be adopted to solve the classical adversarial problem without utilizing the information on the target classifier. The mentioned characteristics of our method are verified through extensive experiments on publicly available visual datasets.