CV | Oct 28, 2023 | Code
Switching Temporary Teachers for Semi-Supervised Semantic Segmentation
Jaemin Na, Jung-Woo Ha, Hyung Jin Chang et al.
The teacher-student framework, prevalent in semi-supervised semantic segmentation, mainly employs the exponential moving average (EMA) to update a single teacher's weights based on the student's. However, EMA updates cause the weights of the teacher and student to become coupled, creating a potential performance bottleneck. Furthermore, this problem may become more severe when training with more complicated labels such as segmentation masks but with little annotated data. This paper introduces Dual Teacher, a simple yet effective approach that employs dual temporary teachers to alleviate the coupling problem for the student. The temporary teachers work in shifts and are progressively improved, so they consistently prevent the teacher and student from becoming excessively close. Specifically, the temporary teachers periodically take turns generating pseudo-labels to train a student model and maintain the distinct characteristics of the student model for each epoch. Consequently, Dual Teacher achieves competitive performance on the PASCAL VOC, Cityscapes, and ADE20K benchmarks with remarkably shorter training times than state-of-the-art methods. Moreover, we demonstrate that our approach is model-agnostic and compatible with both CNN- and Transformer-based models. Code is available at https://github.com/naver-ai/dual-teacher.
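The EMA update and the idea of teachers working in shifts can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `ema_update` and `teacher_on_shift`, the decay value, the round-robin switch per epoch, and the choice to update only the teacher on shift are all hypothetical, and weights are plain lists of floats.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float], decay: float = 0.99) -> List[float]:
    """Move each teacher weight toward the student weight by a factor (1 - decay)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def teacher_on_shift(epoch: int, num_teachers: int = 2) -> int:
    """Hypothetical schedule: temporary teachers alternate every epoch (round-robin)."""
    return epoch % num_teachers


# Toy run: two temporary teachers, one student, scalar "weights".
teachers = [[0.0], [0.0]]
student = [1.0]
for epoch in range(4):
    k = teacher_on_shift(epoch)
    # Only the teacher on shift is updated from the student this epoch (assumption).
    teachers[k] = ema_update(teachers[k], student, decay=0.5)
```

With a single teacher, every EMA step pulls the same set of weights toward the student; alternating teachers means each one sees the student only on its own shifts.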
CV | May 31, 2022
itKD: Interchange Transfer-based Knowledge Distillation for 3D Object Detection
Hyeon Cho, Junyong Choi, Geonwoo Baek et al.
Point-cloud-based 3D object detectors have recently achieved remarkable progress. However, most studies are limited to developing network architectures that improve only accuracy, without considering computational efficiency. In this paper, we first propose an autoencoder-style framework comprising channel-wise compression and decompression via interchange transfer-based knowledge distillation. To learn the map-view feature of a teacher network, the features from the teacher and student networks are independently passed through the shared autoencoder; here, we use a compressed representation loss that binds the channel-wise compression knowledge from both student and teacher networks as a kind of regularization. The decompressed features are transferred in opposite directions to reduce the gap in the interchange reconstructions. Lastly, we present a head attention loss to match the 3D object detection information drawn by the multi-head self-attention mechanism. Through extensive experiments, we verify that our method can train a lightweight model that is well aligned with the 3D point cloud detection task, and we demonstrate its superiority on well-known public datasets, e.g., Waymo and nuScenes.
LG | Jun 1, 2022
ORC: Network Group-based Knowledge Distillation using Online Role Change
Junyong Choi, Hyeon Cho, Seokhwa Cheung et al.
In knowledge distillation, since a single, omnipotent teacher network cannot solve all problems, knowledge distillation methods based on multiple teachers have been studied recently. However, their improvements are sometimes not as good as expected because some immature teachers may transfer false knowledge to the student. In this paper, to overcome this limitation and exploit the efficacy of multiple networks, we divide the networks into a teacher group and a student group. That is, the student group is a set of immature networks that need to learn the teacher's knowledge, while the teacher group consists of the selected networks that are capable of teaching successfully. We propose an online role change strategy in which the top-ranked networks in the student group can be promoted to the teacher group at every iteration. After training the teacher group using the error samples of the student group to refine the teacher group's knowledge, we transfer the collaborative knowledge from the teacher group to the student group. We verify the superiority of the proposed method on CIFAR-10, CIFAR-100, and ImageNet, where it achieves high performance. We further show the generality of our method with various backbone architectures such as ResNet, WRN, VGG, Mobilenet, and Shufflenet.
CV | May 22, 2025 | Code
Ranked Entropy Minimization for Continual Test-Time Adaptation
Jisu Han, Jaemin Na, Wonjun Hwang
Test-time adaptation aims to adapt to realistic environments in an online manner by learning during test time. Entropy minimization has emerged as a principal strategy for test-time adaptation due to its efficiency and adaptability. Nevertheless, it remains underexplored in continual test-time adaptation, where stability is more important. We observe that the entropy minimization method often suffers from model collapse, where the model converges to predicting a single class for all images due to a trivial solution. We propose ranked entropy minimization to mitigate the stability problem of the entropy minimization method and extend its applicability to continuous scenarios. Our approach explicitly structures the prediction difficulty through a progressive masking strategy. Specifically, it gradually aligns the model's probability distributions across different levels of prediction difficulty while preserving the rank order of entropy. The proposed method is extensively evaluated across various benchmarks, demonstrating its effectiveness through empirical results. Our code is available at https://github.com/pilsHan/rem
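To make the entropy objective concrete, the sketch below computes the Shannon entropy of a softmax prediction, the quantity that plain entropy minimization drives down. It illustrates only that baseline objective, not the ranked method or its masking strategy; the helper names `softmax` and `entropy` and the toy logits are assumptions made for the example.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = softmax([0.0, 0.0, 0.0, 0.0])  # maximally uncertain prediction
peaked = softmax([10.0, 0.0, 0.0, 0.0])  # confident prediction
```

The entropy is lowest for a one-hot prediction, which is why predicting the same class for every input is a trivial minimizer of this objective.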
CV | Mar 14, 2024 | Code
D3T: Distinctive Dual-Domain Teacher Zigzagging Across RGB-Thermal Gap for Domain-Adaptive Object Detection
Dinh Phat Do, Taehoon Kim, Jaemin Na et al.
Domain adaptation for object detection typically entails transferring knowledge from one visible domain to another visible domain. However, there are limited studies on adapting from the visible to the thermal domain, because the domain gap between the visible and thermal domains is much larger than expected, and traditional domain adaptation cannot successfully facilitate learning in this situation. To overcome this challenge, we propose a Distinctive Dual-Domain Teacher (D3T) framework that employs distinct training paradigms for each domain. Specifically, we segregate the source and target training sets to build dual teachers and successively apply an exponential moving average of the student model to the individual teacher of each domain. The framework further incorporates a zigzag learning method between the dual teachers, facilitating a gradual transition from the visible to the thermal domain during training. We validate the superiority of our method through newly designed experimental protocols with well-known thermal datasets, i.e., FLIR and KAIST. Source code is available at https://github.com/EdwardDo69/D3T.
CV | Nov 26, 2021 | Code
Contrastive Vicinal Space for Unsupervised Domain Adaptation
Jaemin Na, Dongyoon Han, Hyung Jin Chang et al.
Recent unsupervised domain adaptation methods have utilized vicinal space between the source and target domains. However, the equilibrium collapse of labels, a problem where the source labels are dominant over the target labels in the predictions of vicinal instances, has never been addressed. In this paper, we propose an instance-wise minimax strategy that minimizes the entropy of high uncertainty instances in the vicinal space to tackle the stated problem. We divide the vicinal space into two subspaces through the solution of the minimax problem: contrastive space and consensus space. In the contrastive space, inter-domain discrepancy is mitigated by constraining instances to have contrastive views and labels, and the consensus space reduces the confusion between intra-domain categories. The effectiveness of our method is demonstrated on public benchmarks, including Office-31, Office-Home, and VisDA-C, achieving state-of-the-art performances. We further show that our method outperforms the current state-of-the-art methods on PACS, which indicates that our instance-wise approach works well for multi-source domain adaptation as well. Code is available at https://github.com/NaJaeMin92/CoVi.
CV | Mar 28, 2025
SCHNet: SAM Marries CLIP for Human Parsing
Kunliang Liu, Jianming Wang, Rize Jin et al.
Vision Foundation Models (VFMs) such as the Segment Anything Model (SAM) and the Contrastive Language-Image Pre-training Model (CLIP) have shown promising performance on segmentation and detection tasks. However, although SAM excels in fine-grained segmentation, it faces major challenges when applied to semantic-aware segmentation. While CLIP exhibits a strong semantic understanding capability by aligning the global features of language and vision, it has deficiencies in fine-grained segmentation tasks. Human parsing requires segmenting human bodies into constituent parts and involves both accurate fine-grained segmentation and high semantic understanding of each part. Based on the traits of SAM and CLIP, we formulate highly efficient modules that effectively integrate their features to benefit human parsing. We propose a Semantic-Refinement Module to integrate semantic features of CLIP with SAM features to benefit parsing. Moreover, we formulate a highly efficient Fine-tuning Module to adjust the pretrained SAM for human parsing, which needs high semantic information and simultaneously demands spatial details; this significantly reduces the training time compared with full-time training and achieves notable performance. Extensive experiments demonstrate the effectiveness of our method on the LIP, PPP, and CIHP databases.
CV | Mar 18, 2024
Semantic Prompting with Image-Token for Continual Learning
Jisu Han, Jaemin Na, Wonjun Hwang
Continual learning aims to refine model parameters for new tasks while retaining knowledge from previous tasks. Recently, prompt-based learning has emerged as a way to prompt pre-trained models to learn subsequent tasks without relying on a rehearsal buffer. Although this approach has demonstrated outstanding results, existing methods depend on a preceding task-selection process to choose appropriate prompts. However, imperfect task selection may hurt performance, particularly in scenarios where the number of tasks is large or task distributions are imbalanced. To address this issue, we introduce I-Prompt, a task-agnostic approach that focuses on the visual semantic information of image tokens to eliminate task prediction. Our method consists of semantic prompt matching, which determines prompts based on similarities between tokens, and image token-level prompting, which applies prompts directly to image tokens in the intermediate layers. Consequently, our method achieves competitive performance on four benchmarks while significantly reducing training time compared to state-of-the-art methods. Moreover, we demonstrate the superiority of our method across various scenarios through extensive experiments.
CV | Nov 24, 2025
Collaborative Learning with Multiple Foundation Models for Source-Free Domain Adaptation
Huisoo Lee, Jisu Han, Hyunsouk Cho et al.
Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to an unlabeled target domain without access to source data. Recent advances in Foundation Models (FMs) have introduced new opportunities for leveraging external semantic knowledge to guide SFDA. However, relying on a single FM is often insufficient, as it tends to bias adaptation toward a restricted semantic coverage, failing to capture diverse contextual cues under domain shift. To overcome this limitation, we propose a Collaborative Multi-foundation Adaptation (CoMA) framework that jointly leverages two different FMs (e.g., CLIP and BLIP) with complementary properties to capture both global semantics and local contextual cues. Specifically, we employ a bidirectional adaptation mechanism that (1) aligns different FMs with the target model for task adaptation while maintaining their semantic distinctiveness, and (2) transfers complementary knowledge from the FMs to the target model. To ensure stable adaptation under mini-batch training, we introduce Decomposed Mutual Information (DMI) that selectively enhances true dependencies while suppressing false dependencies arising from incomplete class coverage. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art SFDA methods across four benchmarks, including Office-31, Office-Home, DomainNet-126, and VisDA, under the closed-set setting, while also achieving best results on partial-set and open-set variants.
CV | Oct 10, 2025
D-TPT: Dimensional Entropy Maximization for Calibrating Test-Time Prompt Tuning in Vision-Language Models
Jisu Han, Wonjun Hwang
The test-time adaptation paradigm provides flexibility under domain shifts by adapting the source model immediately on unlabeled target data. Vision-Language Models (VLMs) leverage their generalization capabilities for diverse downstream tasks, and test-time prompt tuning has emerged as a prominent solution for adapting VLMs. In this work, we explore contrastive VLMs and identify the modality gap caused by a single dominant feature dimension across modalities. We observe that the dominant dimensions in both text and image modalities exhibit high predictive sensitivity, and that constraining their influence can reduce calibration error. Building on this insight, we propose dimensional entropy maximization, which regularizes the distribution of textual features toward uniformity to mitigate the dependency on dominant dimensions. Our method alleviates the degradation of calibration performance in test-time prompt tuning, offering a simple yet effective solution to enhance the reliability of VLMs in real-world deployment scenarios.
CV | Jun 30, 2025
When Test-Time Adaptation Meets Self-Supervised Models
Jisu Han, Jihee Park, Dongyoon Han et al.
Training on test-time data enables deep learning models to adapt to dynamic environmental changes, enhancing their practical applicability. Online adaptation from source to target domains is promising but it remains highly reliant on the performance of source pretrained model. In this paper, we investigate whether test-time adaptation (TTA) methods can continuously improve models trained via self-supervised learning (SSL) without relying on source pretraining. We introduce a self-supervised TTA protocol after observing that existing TTA approaches struggle when directly applied to self-supervised models with low accuracy on the source domain. Furthermore, we propose a collaborative learning framework that integrates SSL and TTA models, leveraging contrastive learning and knowledge distillation for stepwise representation refinement. We validate our method on diverse self-supervised models, including DINO, MoCo, and iBOT, across TTA benchmarks. Extensive experiments validate the effectiveness of our approach in SSL, showing that it achieves competitive performance even without source pretraining.
CV | Mar 18, 2024
OurDB: Ouroboric Domain Bridging for Multi-Target Domain Adaptive Semantic Segmentation
Seungbeom Woo, Geonwoo Baek, Taehoon Kim et al.
Multi-target domain adaptation (MTDA) for semantic segmentation poses a significant challenge, as it involves multiple target domains with varying distributions. The goal of MTDA is to minimize the domain discrepancies among a single source and multi-target domains, aiming to train a single model that excels across all target domains. Previous MTDA approaches typically employ multiple teacher architectures, where each teacher specializes in one target domain to simplify the task. However, these architectures hinder the student model from fully assimilating comprehensive knowledge from all target-specific teachers and escalate training costs with increasing target domains. In this paper, we propose an ouroboric domain bridging (OurDB) framework, offering an efficient solution to the MTDA problem using a single teacher architecture. This framework dynamically cycles through multiple target domains, aligning each domain individually to restrain the biased alignment problem, and utilizes Fisher information to minimize the forgetting of knowledge from previous target domains. We also propose a context-guided class-wise mixup (CGMix) that leverages contextual information tailored to diverse target contexts in MTDA. Experimental evaluations conducted on four urban driving datasets (i.e., GTA5, Cityscapes, IDD, and Mapillary) demonstrate the superiority of our method over existing state-of-the-art approaches.
CV | May 9, 2023
SRIL: Selective Regularization for Class-Incremental Learning
Jisu Han, Jaemin Na, Wonjun Hwang
Human intelligence gradually accepts new information and accumulates knowledge throughout the lifespan. However, deep learning models suffer from a catastrophic forgetting phenomenon, where they forget previous knowledge when acquiring new information. Class-Incremental Learning aims to create an integrated model that balances plasticity and stability to overcome this challenge. In this paper, we propose a selective regularization method that accepts new knowledge while maintaining previous knowledge. We first introduce an asymmetric feature distillation method for old and new classes inspired by cognitive science, using the gradient of classification and knowledge distillation losses to determine whether to perform pattern completion or pattern separation. We also propose a method to selectively interpolate the weight of the previous model for a balance between stability and plasticity, and we adjust whether to transfer through model confidence to ensure the performance of the previous class and enable exploratory learning. We validate the effectiveness of the proposed method, which surpasses the performance of existing methods through extensive experimental protocols using CIFAR-100, ImageNet-Subset, and ImageNet-Full.
CV | Nov 28, 2021
CDGNet: Class Distribution Guided Network for Human Parsing
Kunliang Liu, Ouk Choi, Jianming Wang et al.
The objective of human parsing is to partition a human in an image into constituent parts. This task involves labeling each pixel of the human image according to the classes. Since the human body comprises hierarchically structured parts, each body part in an image has its own characteristic position distribution. For example, a human head is unlikely to be below the feet, and arms are more likely to be near the torso. Inspired by this observation, we construct instance class distributions by accumulating the original human parsing label in the horizontal and vertical directions, which can be utilized as supervision signals. Using these horizontal and vertical class distribution labels, the network is guided to exploit the intrinsic position distribution of each class. We combine the two guided features to form a spatial guidance map, which is then superimposed onto the baseline network by multiplication and concatenation to distinguish the human parts precisely. We conducted extensive experiments to demonstrate the effectiveness and superiority of our method on three well-known benchmarks: the LIP, ATR, and CIHP databases.
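The accumulation of parsing labels along the two image axes can be sketched as follows. This is a minimal reading of the description above, not the authors' code: the function name `class_distributions`, the use of raw pixel counts (rather than a binary or normalized signal), and the toy label map are assumptions.

```python
from typing import List, Tuple


def class_distributions(
    label_map: List[List[int]], num_classes: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Count, for every class, how many pixels fall in each row and each column.

    Returns (row_counts, col_counts), where row_counts[c][y] accumulates class c
    along the horizontal direction and col_counts[c][x] along the vertical one.
    """
    height, width = len(label_map), len(label_map[0])
    row_counts = [[0] * height for _ in range(num_classes)]
    col_counts = [[0] * width for _ in range(num_classes)]
    for y, row in enumerate(label_map):
        for x, cls in enumerate(row):
            row_counts[cls][y] += 1
            col_counts[cls][x] += 1
    return row_counts, col_counts


# Toy 2x2 label map with two classes (0 = background, 1 = a body part).
rows, cols = class_distributions([[0, 1], [1, 1]], num_classes=2)
```

Each per-class profile is a 1D signal over rows or columns, which is what makes it usable as a positional supervision target.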
CV | Nov 18, 2020
FixBi: Bridging Domain Spaces for Unsupervised Domain Adaptation
Jaemin Na, Heechul Jung, Hyung Jin Chang et al.
Unsupervised domain adaptation (UDA) methods for learning domain-invariant representations have achieved remarkable progress. However, most of the studies were based on direct adaptation from the source domain to the target domain and have suffered from large domain discrepancies. In this paper, we propose a UDA method that effectively handles such large domain discrepancies. We introduce a fixed ratio-based mixup to augment multiple intermediate domains between the source and target domains. From the augmented domains, we train a source-dominant model and a target-dominant model that have complementary characteristics. Using our confidence-based learning methodologies, e.g., bidirectional matching with high-confidence predictions and self-penalization using low-confidence predictions, the models can learn from each other or from their own results. Through our proposed methods, the models gradually transfer domain knowledge from the source to the target domain. Extensive experiments demonstrate the superiority of our proposed method on three public benchmarks: Office-31, Office-Home, and VisDA-2017.
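A fixed ratio-based mixup can be illustrated as a convex combination of a source sample and a target sample with a constant mixing weight. The sketch below is an assumption-laden illustration, not the published implementation: the function name `fixed_ratio_mixup`, the ratios 0.75 and 0.25, and the two-element "pixels" are hypothetical.

```python
from typing import List


def fixed_ratio_mixup(source: List[float], target: List[float], ratio: float) -> List[float]:
    """Blend a source and a target sample; `ratio` is the fixed weight on the source."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    return [ratio * s + (1.0 - ratio) * t for s, t in zip(source, target)]


source_pixel = [1.0, 1.0]
target_pixel = [0.0, 0.0]
# Hypothetical ratios: one source-dominant and one target-dominant mix.
source_dominant = fixed_ratio_mixup(source_pixel, target_pixel, ratio=0.75)
target_dominant = fixed_ratio_mixup(source_pixel, target_pixel, ratio=0.25)
```

Keeping the ratio fixed, instead of sampling it per batch, gives each model a stable intermediate domain that sits closer to either the source or the target.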
CV | Sep 18, 2020
Densely Guided Knowledge Distillation using Multiple Teacher Assistants
Wonchul Son, Jaemin Na, Junyong Choi et al.
With the success of deep neural networks, knowledge distillation, which guides the learning of a small student network from a large teacher network, is being actively studied for model compression and transfer learning. However, few studies have addressed the poor learning of the student network when the student and teacher model sizes differ significantly. In this paper, we propose densely guided knowledge distillation using multiple teacher assistants that gradually decrease in model size to efficiently bridge the large gap between the teacher and student networks. To stimulate more efficient learning of the student network, we let each teacher assistant iteratively guide every smaller teacher assistant. Specifically, when teaching a smaller teacher assistant at the next step, the existing larger teacher assistants from the previous step are used as well as the teacher network. Moreover, we design stochastic teaching where, for each mini-batch, the teacher or some teacher assistants are randomly dropped. This acts as a regularizer to improve the efficiency of teaching the student network. Thus, the student can always learn salient distilled knowledge from the multiple sources. We verified the effectiveness of the proposed method for a classification task using CIFAR-10, CIFAR-100, and ImageNet. We also achieved significant performance improvements with various backbone architectures such as ResNet, WideResNet, and VGG.
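The stochastic teaching step, where guides are randomly dropped for each mini-batch, might look like the sketch below. It is only an illustration under assumptions: the function name `sample_active_guides`, the independent keep probability, and the fallback that always retains at least one guide are not taken from the paper.

```python
import random
from typing import List, Optional


def sample_active_guides(
    guides: List[str], keep_prob: float = 0.5, rng: Optional[random.Random] = None
) -> List[str]:
    """Pick the teacher / teacher assistants that guide the current mini-batch.

    Each guide is kept independently with probability `keep_prob`; if all are
    dropped, one is chosen at random so the student always has a source
    (this fallback is an assumption of the sketch).
    """
    rng = rng or random.Random()
    kept = [g for g in guides if rng.random() < keep_prob]
    if not kept:
        kept = [rng.choice(guides)]
    return kept


guides = ["teacher", "assistant_large", "assistant_small"]  # hypothetical names
rng = random.Random(0)
per_batch = [sample_active_guides(guides, keep_prob=0.5, rng=rng) for _ in range(5)]
```

The distillation loss for a mini-batch would then be accumulated only over the guides returned for that batch.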
CV | Mar 9, 2020
Pacemaker: Intermediate Teacher Knowledge Distillation For On-The-Fly Convolutional Neural Network
Wonchul Son, Youngbin Kim, Wonseok Song et al.
Very low-performance systems such as system-on-chip (SoC) and embedded devices need on-the-fly computation. This paper presents pacemaker knowledge distillation, which uses an intermediate ensemble teacher so that convolutional neural networks can run on such systems. For an on-the-fly system, we consider a student model that uses 1xN on-the-fly filters and a teacher model that uses standard NxN filters. Applying on-the-fly filters raises three issues when training the student. First, the student has the same depth as the teacher but is unavoidably compressed into a thinner model. Second, there are large capacity and parameter-size gaps because only the horizontal receptive field can be selected, not the vertical one. Third, direct distillation leads to unstable and degraded performance. To solve these problems, we propose an intermediate teacher, named pacemaker, for the on-the-fly student, so that the student can be trained step by step from the pacemaker and the original teacher. Experiments show that the proposed method significantly improves accuracy: on CIFAR100 with WRN-40-4, it yields a 5.39% increase over conventional knowledge distillation, which performs even worse than the baseline. The proposed method also resolves the training instability observed with conventional knowledge distillation by reducing the range of deviation.
CV | Mar 5, 2020
Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video
Hyeon Cho, Taehoon Kim, Hyung Jin Chang et al.
We propose a self-supervised visual learning method by predicting the variable playback speeds of a video. Without semantic labels, we learn the spatio-temporal visual representation of the video by leveraging the variations in the visual appearance according to different playback speeds under the assumption of temporal coherence. To learn the spatio-temporal visual variations in the entire video, we have not only predicted a single playback speed but also generated clips of various playback speeds and directions with randomized starting points. Hence the visual representation can be successfully learned from the meta information (playback speeds and directions) of the video. We also propose a new layer dependable temporal group normalization method that can be applied to 3D convolutional networks to improve the representation learning performance where we divide the temporal features into several groups and normalize each one using the different corresponding parameters. We validate the effectiveness of our method by fine-tuning it to the action recognition and video retrieval tasks on UCF-101 and HMDB-51.
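Generating clips at different playback speeds and directions from randomized starting points amounts to choosing frame indices with a stride. The sketch below shows one way to do this; the function name `sample_clip_indices`, its parameters, and the uniform choice of the start frame are assumptions for illustration, and the (speed, direction) pair would serve as the self-supervised label.

```python
import random
from typing import List, Optional


def sample_clip_indices(
    num_frames: int,
    clip_len: int,
    speed: int,
    reverse: bool = False,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return frame indices for a clip sampled every `speed` frames from a random start."""
    rng = rng or random.Random()
    span = (clip_len - 1) * speed + 1  # number of source frames the clip covers
    if span > num_frames:
        raise ValueError("video too short for this clip length and speed")
    start = rng.randrange(num_frames - span + 1)
    indices = [start + i * speed for i in range(clip_len)]
    # Reversing the index order plays the clip backwards.
    return indices[::-1] if reverse else indices


# A 10-frame video at 3x speed leaves exactly one valid start, so the result is fixed.
forward = sample_clip_indices(num_frames=10, clip_len=4, speed=3)
backward = sample_clip_indices(num_frames=10, clip_len=4, speed=3, reverse=True)
```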
CV | Nov 17, 2018
Sequential Image-based Attention Network for Inferring Force Estimation without Haptic Sensor
Hochul Shin, Hyeon Cho, Dongyi Kim et al.
Humans can infer the approximate interaction force between objects from visual information alone because we have already learned it through experience. Based on this idea, we propose a recurrent convolutional neural network-based method that uses sequential images to infer interaction force without a haptic sensor. For training and validating deep learning methods, we collected a large number of images and corresponding interaction forces through an electronic motor-based device. To concentrate on how the shape of a target object changes under external force in images, we propose a sequential image-based attention module, which learns a salient model from temporal dynamics. The proposed sequential image-based attention module consists of a sequential spatial attention module and a sequential channel attention module, which are extended to exploit multiple sequential images. To gain better accuracy, we also created a weighted average pooling layer for both the spatial and channel attention modules. Extensive experimental results verified that the proposed method successfully infers interaction forces under various conditions, such as different target materials, illumination changes, and external force directions.