57.2CVMay 22
Multimodal Distribution Matching for Vision-Language Dataset DistillationJongoh Jeong, Hoyong Kwon, Minseok Kim et al.
Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computes and overlook their correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal agreement and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.
33.7CVApr 14
Bootstrapping Video Semantic Segmentation Model via Distillation-assisted Test-Time AdaptationJihun Kim, Hoyong Kwon, Hyeokjun Kweon et al.
Fully supervised Video Semantic Segmentation (VSS) relies heavily on densely annotated video data, limiting practical applicability. Alternatively, applying pre-trained Image Semantic Segmentation (ISS) models frame-by-frame avoids annotation costs but ignores crucial temporal coherence. Recent foundation models such as SAM2 enable high-quality mask propagation yet remain impractical for direct VSS due to limited semantic understanding and computational overhead. In this paper, we propose DiTTA (Distillation-assisted Test-Time Adaptation), a novel framework that converts an ISS model into a temporally-aware VSS model through efficient test-time adaptation (TTA), without annotated videos. DiTTA distills SAM2's temporal segmentation knowledge into the ISS model during a brief, single-pass initialization phase, complemented by a lightweight temporal fusion module to aggregate cross-frame context. Crucially, DiTTA achieves robust generalization even when adapting with highly limited partial video snippets (e.g., initial 10%), significantly outperforming zero-shot refinement approaches that repeatedly invoke SAM2 during inference. Extensive experiments on VSPW and Cityscapes demonstrate DiTTA's effectiveness, achieving competitive or superior performance relative to fully-supervised VSS methods, thus providing a practical and annotation-free solution for real-world VSS tasks.
CVJun 29, 2025
DC-TTA: Divide-and-Conquer Framework for Test-Time Adaptation of Interactive SegmentationJihun Kim, Hoyong Kwon, Hyeokjun Kweon et al.
Interactive segmentation (IS) allows users to iteratively refine object boundaries with minimal cues, such as positive and negative clicks. While the Segment Anything Model (SAM) has garnered attention in the IS community for its promptable segmentation capabilities, it often struggles in specialized domains or when handling complex scenarios (e.g., camouflaged or multi-part objects). To overcome these challenges, we propose DC-TTA, a novel test-time adaptation (TTA) framework that adapts SAM on a per-sample basis by leveraging user interactions as supervision. Instead of forcing a single model to incorporate all user clicks at once, DC-TTA partitions the clicks into more coherent subsets, each processed independently via TTA with a separated model. This Divide-and-Conquer strategy reduces conflicts among diverse cues and enables more localized updates. Finally, we merge the adapted models to form a unified predictor that integrates the specialized knowledge from each subset. Experimental results across various benchmarks demonstrate that DC-TTA significantly outperforms SAM's zero-shot results and conventional TTA methods, effectively handling complex tasks such as camouflaged object segmentation with fewer interactions and improved accuracy.