54.3 | CV | Jun 2
GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations
Jonggwon Park, Seongeun Lee, Junhyun Park et al.
Vision-language models (VLMs) for radiology have emerged as a scalable paradigm by leveraging image-report pairs naturally produced in clinical workflows. However, this pairing reveals a mismatch in scale: each finding occupies only a small region of the image, yet supervision is provided only at the global image-report level. This poses a central challenge: prior approaches spread weight densely across all patches rather than concentrating on the sparse subset relevant to a given query. To address this, we present GLINT (Gated Language-Image alignmeNT), a framework that explicitly models this sparse correspondence. On the alignment side, we introduce Sparsely Gated Alignment, a novel architecture in which a sigmoid gate over a separate gate embedding space activates only the patches relevant to each textual query, enforcing explicit sparsity. On the representation side, we add Dense Feature Regularization, which anchors the trainable encoder's intermediate features to a frozen self-supervised learning (SSL) teacher, preserving the fine-grained patch features that the gate relies on. The same recipe applies to both 2D chest X-ray (CXR) and 3D chest computed tomography (CT), built with DINOv3 and V-JEPA 2.1, respectively. GLINT enables zero-shot classification, grounding, and segmentation from free-text queries, and to our knowledge is the first to demonstrate zero-shot segmentation on 3D CT volumes without mask supervision. Notably, the most pronounced gains arise on zero-shot grounding and segmentation, where sparse, query-specific localization is required, consistent with our design intent. In downstream evaluation, GLINT outperforms both SSL encoders and medical VLMs on classification, report generation, and segmentation.
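The gating idea in the GLINT abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the bias term, and the gate-weighted average used for pooling are all assumptions made for illustration, since the abstract only states that a sigmoid gate over a separate gate embedding space activates the patches relevant to a query. The point of the sketch is that sigmoid gates are computed independently per patch, so irrelevant patches can all be close to zero, whereas softmax weights must sum to one across patches.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def sparse_gate(patch_gate_emb, text_gate_emb, bias=0.0):
    """Per-patch gate in (0, 1) for one text query (hypothetical helper).

    patch_gate_emb: (N, G) patch embeddings in an assumed separate gate space
    text_gate_emb:  (G,)   query embedding in the same gate space
    """
    logits = patch_gate_emb @ text_gate_emb + bias  # shape (N,)
    return sigmoid(logits)  # independent per patch, not normalized


def gated_pool(patch_feats, gates, eps=1e-8):
    """Gate-weighted average of patch features (an assumed pooling choice).

    patch_feats: (N, D) patch features; gates: (N,) values in (0, 1)
    """
    weighted = gates[:, None] * patch_feats  # broadcast gates over D
    return weighted.sum(axis=0) / (gates.sum() + eps)


def softmax(x):
    """Dense baseline: weights over patches always sum to 1."""
    z = np.exp(x - x.max())
    return z / z.sum()


# Toy example: 4 patches, only the first matches the query.
patch_gate_emb = np.array([[5.0, 0.0], [-5.0, 0.0], [-5.0, 0.0], [-5.0, 0.0]])
text_gate_emb = np.array([1.0, 0.0])
patch_feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])

gates = sparse_gate(patch_gate_emb, text_gate_emb)
pooled = gated_pool(patch_feats, gates)
```

In this toy setup the first gate is close to one and the other three are close to zero, so the pooled feature is dominated by the first patch. A trained model would learn both embedding spaces; here they are hand-picked only to make the behavior visible.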
64.8 | CV | Apr 17
AHS: Adaptive Head Synthesis via Synthetic Data Augmentations
Taewoong Kang, Hyojin Jang, Sohyun Jeong et al.
Recent advances in digital media have increased demand for sophisticated portrait manipulation techniques, particularly head swapping, where one person's head is seamlessly integrated with another's body. However, current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. They struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head-reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness in maintaining facial identity under drastic expression changes and faithfully preserving accessories under significant head pose variations.
CV | Feb 12, 2024
SelfSwapper: Self-Supervised Face Swapping via Shape Agnostic Masked AutoEncoder
Jaeseong Lee, Junha Hyung, Sohyun Jeong et al.
Face swapping has gained significant attention for its varied applications. Most previous face swapping approaches have relied on the seesaw game training scheme, also known as the target-oriented approach. However, this often leads to instability in model training and results in undesired samples with blended identities due to the target identity leakage problem. Source-oriented methods achieve more stable training with a self-reconstruction objective but often fail to accurately reflect the target image's skin color and illumination. This paper introduces the Shape Agnostic Masked AutoEncoder (SAMAE) training scheme, a novel self-supervised approach that combines the strengths of both target-oriented and source-oriented approaches. Our training scheme addresses the limitations of traditional training methods by circumventing the conventional seesaw game and introducing clear ground truth through its self-reconstruction training regime. Our model effectively mitigates identity leakage and reflects target albedo and illumination through learned disentangled identity and non-identity features. Additionally, we tackle the shape misalignment and volume discrepancy problems with new techniques, including perforation confusion and random mesh scaling. SAMAE establishes a new state of the art, surpassing baseline methods and preserving both identity and non-identity attributes without sacrificing either.
CV | Jun 14, 2025
Good Noise Makes Good Edits: A Training-Free Diffusion-Based Video Editing with Image and Text Prompts
Saemee Choi, Sohyun Jeong, Jaegul Choo et al.
We propose ImEdit, the first zero-shot, training-free video editing method conditioned on both images and text. The proposed method introduces $ρ$-start sampling and dilated dual masking to construct well-structured noise maps for coherent and accurate edits. We further present zero image guidance, a controllable negative prompt strategy, for visual fidelity. Both quantitative and qualitative evaluations show that our method outperforms state-of-the-art methods across all metrics.
CV | Mar 2, 2025
Zero-Shot Head Swapping in Real-World Scenarios
Taewoong Kang, Sohyun Jeong, Hyojin Jang et al.
With growing demand in media and social networks for personalized images, the need for advanced head-swapping techniques, which integrate an entire head from a head image with the body from a body image, has increased. However, traditional head swapping methods rely heavily on face-centered cropped data with primarily frontal-facing views, which limits their effectiveness in real-world applications. Additionally, their masking methods, designed to indicate regions requiring editing, are optimized for this type of dataset but struggle to achieve seamless blending in complex situations, such as when the original data includes features like long hair extending beyond the masked area. To overcome these limitations and enhance adaptability in diverse and complex scenarios, we propose a novel head swapping method, HID, that is robust to images including the full head and the upper body, handles views ranging from frontal to side, and automatically generates context-aware masks. For automatic mask generation, we introduce the IOMask, which enables seamless blending of the head and body, effectively addressing integration challenges. We further introduce the hair injection module to capture hair details with greater precision. Our experiments demonstrate that the proposed approach achieves state-of-the-art performance in head swapping, providing visually consistent and realistic results across a wide range of challenging conditions.