Junhyug Noh

CV
h-index11
16papers
90citations
Novelty46%
AI Score55

16 Papers

CVAug 16, 2022
Object Discovery via Contrastive Learning for Weakly Supervised Object Detection

Jinhwan Seo, Wonho Bae, Danica J. Sutherland et al.

Weakly Supervised Object Detection (WSOD) is a task that detects objects in an image using a model trained only on image-level annotations. Current state-of-the-art models benefit from self-supervised instance-level supervision, but since weak supervision does not include count or location information, the most common ``argmax'' labeling method often ignores many instances of objects. To alleviate this issue, we propose a novel multiple instance labeling method called object discovery. We further introduce a new contrastive loss under weak supervision where no instance-level information is available for sampling, called weakly supervised contrastive loss (WSCL). WSCL aims to construct a credible similarity threshold for object discovery by leveraging consistent features for embedding vectors in the same class. As a result, we achieve new state-of-the-art results on MS-COCO 2014 and 2017 as well as PASCAL VOC 2012, and competitive results on PASCAL VOC 2007.

82.8LGApr 3Code
Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs

Giyeong Oh, Junghyun Lee, Jaehyun Park et al.

Modern LLMs inherit strong priors from web-scale pretraining, which can limit the headroom of post-training data-selection strategies. While Active Preference Learning (APL) seeks to optimize query efficiency in online Direct Preference Optimization (DPO), the inherent richness of on-policy candidate pools often renders simple Random sampling a surprisingly formidable baseline. We evaluate uncertainty-based APL against Random across harmlessness, helpfulness, and instruction-following settings, utilizing both reward models and LLM-as-a-judge proxies. We find that APL yields negligible improvements in proxy win-rates compared to Random. Crucially, we observe a dissociation where win-rate improves even as general capability -- measured by standard benchmarks -- degrades. APL fails to mitigate this capability collapse or reduce variance significantly better than random sampling. Our findings suggest that in the regime of strong pre-trained priors, the computational overhead of active selection is difficult to justify against the ``cheap diversity'' provided by simple random samples. Our code is available at https://github.com/BootsofLagrangian/random-vs-apl.

CVNov 5, 2025Code
Beyond Softmax: Dual-Branch Sigmoid Architecture for Accurate Class Activation Maps

Yoojin Oh, Junhyug Noh

Class Activation Mapping (CAM) and its extensions have become indispensable tools for visualizing the evidence behind deep network predictions. However, by relying on a final softmax classifier, these methods suffer from two fundamental distortions: additive logit shifts that arbitrarily bias importance scores, and sign collapse that conflates excitatory and inhibitory features. We propose a simple, architecture-agnostic dual-branch sigmoid head that decouples localization from classification. Given any pretrained model, we clone its classification head into a parallel branch ending in per-class sigmoid outputs, freeze the original softmax head, and fine-tune only the sigmoid branch with class-balanced binary supervision. At inference, softmax retains recognition accuracy, while class evidence maps are generated from the sigmoid branch -- preserving both magnitude and sign of feature contributions. Our method integrates seamlessly with most CAM variants and incurs negligible overhead. Extensive evaluations on fine-grained tasks (CUB-200-2011, Stanford Cars) and WSOL benchmarks (ImageNet-1K, OpenImages30K) show improved explanation fidelity and consistent Top-1 Localization gains -- without any drop in classification accuracy. Code is available at https://github.com/finallyupper/beyond-softmax.

CVMay 2, 2022
One Weird Trick to Improve Your Semi-Weakly Supervised Semantic Segmentation Model

Wonho Bae, Junhyug Noh, Milad Jalali Asadabadi et al.

Semi-weakly supervised semantic segmentation (SWSSS) aims to train a model to identify objects in images based on a small number of images with pixel-level labels, and many more images with only image-level labels. Most existing SWSSS algorithms extract pixel-level pseudo-labels from an image classifier - a very difficult task to do well, hence requiring complicated architectures and extensive hyperparameter tuning on fully-supervised validation sets. We propose a method called prediction filtering, which instead of extracting pseudo-labels, just uses the classifier as a classifier: it ignores any segmentation predictions from classes which the classifier is confident are not present. Adding this simple post-processing method to baselines gives results competitive with or better than prior SWSSS algorithms. Moreover, it is compatible with pseudo-label methods: adding prediction filtering to existing SWSSS algorithms further improves segmentation performance.

LGJul 16, 2024
Generalized Coverage for More Robust Low-Budget Active Learning

Wonho Bae, Junhyug Noh, Danica J. Sutherland

The ProbCover method of Yehuda et al. is a well-motivated algorithm for active learning in low-budget regimes, which attempts to "cover" the data distribution with balls of a given radius at selected data points. We demonstrate, however, that the performance of this algorithm is extremely sensitive to the choice of this radius hyper-parameter, and that tuning it is quite difficult, with the original heuristic frequently failing. We thus introduce (and theoretically motivate) a generalized notion of "coverage," including ProbCover's objective as a special case, but also allowing smoother notions that are far more robust to hyper-parameter choice. We propose an efficient greedy method to optimize this coverage, generalizing ProbCover's algorithm; due to its close connection to kernel herding, we call it "MaxHerding." The objective can also be optimized non-greedily through a variant of $k$-medoids, clarifying the relationship to other low-budget active learning methods. In comprehensive experiments, MaxHerding surpasses existing active learning methods across multiple low-budget image classification benchmarks, and does so with less computational cost than most competitive methods.

30.6CVMay 20
Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition

Junghyun Lee, Hyunseo Kim, Hanna Jang et al.

Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rather than a single dominant signal. We propose a rank-aware multi-encoder framework that selectively combines complementary representations from diverse pre-extracted video and audio encoders. Our method projects heterogeneous encoder features into a shared latent space, estimates sample-wise encoder importance through an attention-based gating module, and fuses only the top-n most informative encoders. To better model blended emotions, we decouple prediction into presence and salience heads and align them through probability-level fusion. We further incorporate feature-level unsupervised domain adaptation without pseudo-labeling to improve robustness under distribution shift. Experiments on the BlEmoRE challenge show that the proposed framework outperforms strong individual encoders and naïve multi-encoder fusion baselines. Our final system ranked 2nd in the competition, supporting the effectiveness of rank-aware selective fusion for fine-grained blended emotion recognition.

6.9CVApr 3
Explicit Time-Frequency Dynamics for Skeleton-Based Gait Recognition

Seoyeon Ko, Yeojin Song, Egene Chung et al.

Skeleton-based gait recognizers excel at modeling spatial configurations but often underuse explicit motion dynamics that are crucial under appearance changes. We introduce a plug-and-play Wavelet Feature Stream that augments any skeleton backbone with time-frequency dynamics of joint velocities. Concretely, per-joint velocity sequences are transformed by the continuous wavelet transform (CWT) into multi-scale scalograms, from which a lightweight multi-scale CNN learns discriminative dynamic cues. The resulting descriptor is fused with the backbone representation for classification, requiring no changes to the backbone architecture or additional supervision. Across CASIA-B, the proposed stream delivers consistent gains on strong skeleton backbones (e.g., GaitMixer, GaitFormer, GaitGraph) and establishes a new skeleton-based state of the art when attached to GaitMixer. The improvements are especially pronounced under covariate shifts such as carrying bags (BG) and wearing coats (CL), highlighting the complementarity of explicit time-frequency modeling and standard spatio-temporal encoders.

CVNov 4, 2025
Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models

Jinhwan Seo, Yoonki Cho, Junhyug Noh et al.

In this technical report, we introduce a framework to address Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning \& QA, (2) Spatio-temporal Grounding and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. To this end, we achieve the HOTA score of 0.4968, which marks a significant improvement over the previous year's winning score of 0.2704 on GVQA task.

CVOct 25, 2025Code
Diffusion-Driven Two-Stage Active Learning for Low-Budget Semantic Segmentation

Jeongin Kim, Wonho Bae, YouLee Han et al.

Semantic segmentation demands dense pixel-level annotations, which can be prohibitively expensive - especially under extremely constrained labeling budgets. In this paper, we address the problem of low-budget active learning for semantic segmentation by proposing a novel two-stage selection pipeline. Our approach leverages a pre-trained diffusion model to extract rich multi-scale features that capture both global structure and fine details. In the first stage, we perform a hierarchical, representation-based candidate selection by first choosing a small subset of representative pixels per image using MaxHerding, and then refining these into a diverse global pool. In the second stage, we compute an entropy-augmented disagreement score (eDALD) over noisy multi-scale diffusion features to capture both epistemic uncertainty and prediction confidence, selecting the most informative pixels for annotation. This decoupling of diversity and uncertainty lets us achieve high segmentation accuracy with only a tiny fraction of labeled pixels. Extensive experiments on four benchmarks (CamVid, ADE-Bed, Cityscapes, and Pascal-Context) demonstrate that our method significantly outperforms existing baselines under extreme pixel-budget regimes. Our code is available at https://github.com/jn-kim/two-stage-edald.

CVJun 25, 2024Code
Scalp Diagnostic System With Label-Free Segmentation and Training-Free Image Translation

Youngmin Kim, Saejin Kim, Hoyeon Moon et al.

Scalp disorders are highly prevalent worldwide, yet remain underdiagnosed due to limited access to expert evaluation and the high cost of annotation. Although AI-based approaches hold great promise, their practical deployment is hindered by challenges such as severe data imbalance and the absence of pixel-level segmentation labels. To address these issues, we propose ScalpVision, an AI-driven system for the holistic diagnosis of scalp diseases. In ScalpVision, effective hair segmentation is achieved using pseudo image-label pairs and an innovative prompting method in the absence of traditional hair masking labels. Additionally, ScalpVision introduces DiffuseIT-M, a generative model adopted for dataset augmentation while maintaining hair information, facilitating improved predictions of scalp disease severity. Our experimental results affirm ScalpVision's efficiency in diagnosing a variety of scalp conditions, showcasing its potential as a valuable tool in dermatological care. Our code is available at https://github.com/winston1214/ScalpVision.

CVJul 15, 2021Code
What and When to Look?: Temporal Span Proposal Network for Video Relation Detection

Sangmin Woo, Junhyug Noh, Kangil Kim

Identifying relations between objects is central to understanding the scene. While several works have been proposed for relation modeling in the image domain, there have been many constraints in the video domain due to challenging dynamics of spatio-temporal interactions (e.g., between which objects are there an interaction? when do relations start and end?). To date, two representative methods have been proposed to tackle Video Visual Relation Detection (VidVRD): segment-based and window-based. We first point out limitations of these methods and propose a novel approach named Temporal Span Proposal Network (TSPN). TSPN tells what to look: it sparsifies relation search space by scoring relationness of object pair, i.e., measuring how probable a relation exist. TSPN tells when to look: it simultaneously predicts start-end timestamps (i.e., temporal spans) and categories of the all possible relations by utilizing full video context. These two designs enable a win-win scenario: it accelerates training by 2X or more than existing methods and achieves competitive performance on two VidVRD benchmarks (ImageNet-VidVDR and VidOR). Moreover, comprehensive ablative experiments demonstrate the effectiveness of our approach. Codes are available at https://github.com/sangminwoo/Temporal-Span-Proposal-Network-VidVRD.

27.7SDMar 31
SIREN: Spatially-Informed Reconstruction of Binaural Audio with Vision

Mingyeong Song, Seoyeon Ko, Junhyug Noh

Binaural audio delivers spatial cues essential for immersion, yet most consumer videos are monaural due to capture constraints. We introduce SIREN, a visually guided mono to binaural framework that explicitly predicts left and right channels. A ViT-based encoder learns dual-head self-attention to produce a shared scene map and end-to-end L/R attention, replacing hand-crafted masks. A soft, annealed spatial prior gently biases early L/R grounding, and a two-stage, confidence-weighted waveform-domain fusion (guided by mono reconstruction and interaural phase consistency) suppresses crosstalk when aggregating multi-crop and overlapping windows. Evaluated on FAIR-Play and MUSIC-Stereo, SIREN yields consistent gains on time-frequency and phase-sensitive metrics with competitive SNR. The design is modular and generic, requires no task-specific annotations, and integrates with standard audio-visual pipelines.

CVOct 16, 2025
SteeringTTA: Guiding Diffusion Trajectories for Robust Test-Time-Adaptation

Jihyun Yu, Yoojin Oh, Wonho Bae et al.

Test-time adaptation (TTA) aims to correct performance degradation of deep models under distribution shifts by updating models or inputs using unlabeled test data. Input-only diffusion-based TTA methods improve robustness for classification to corruptions but rely on gradient guidance, limiting exploration and generalization across distortion types. We propose SteeringTTA, an inference-only framework that adapts Feynman-Kac steering to guide diffusion-based input adaptation for classification with rewards driven by pseudo-label. SteeringTTA maintains multiple particle trajectories, steered by a combination of cumulative top-K probabilities and an entropy schedule, to balance exploration and confidence. On ImageNet-C, SteeringTTA consistently outperforms the baseline without any model updates or source data.

LGOct 1, 2025
Vicinity-Guided Discriminative Latent Diffusion for Privacy-Preserving Domain Adaptation

Jing Wang, Wonho Bae, Jiahong Chen et al.

Recent work on latent diffusion models (LDMs) has focused almost exclusively on generative tasks, leaving their potential for discriminative transfer largely unexplored. We introduce Discriminative Vicinity Diffusion (DVD), a novel LDM-based framework for a more practical variant of source-free domain adaptation (SFDA): the source provider may share not only a pre-trained classifier but also an auxiliary latent diffusion module, trained once on the source data and never exposing raw source samples. DVD encodes each source feature's label information into its latent vicinity by fitting a Gaussian prior over its k-nearest neighbors and training the diffusion network to drift noisy samples back to label-consistent representations. During adaptation, we sample from each target feature's latent vicinity, apply the frozen diffusion module to generate source-like cues, and use a simple InfoNCE loss to align the target encoder to these cues, explicitly transferring decision boundaries without source access. Across standard SFDA benchmarks, DVD outperforms state-of-the-art methods. We further show that the same latent diffusion module enhances the source classifier's accuracy on in-domain data and boosts performance in supervised classification and domain generalization experiments. DVD thus reinterprets LDMs as practical, privacy-preserving bridges for explicit knowledge transfer, addressing a core challenge in source-free domain adaptation that prior methods have yet to solve.

CVJun 16, 2021
Tackling the Challenges in Scene Graph Generation with Local-to-Global Interactions

Sangmin Woo, Junhyug Noh, Kangil Kim

In this work, we seek new insights into the underlying challenges of the Scene Graph Generation (SGG) task. Quantitative and qualitative analysis of the Visual Genome dataset implies -- 1) Ambiguity: even if inter-object relationship contains the same object (or predicate), they may not be visually or semantically similar, 2) Asymmetry: despite the nature of the relationship that embodied the direction, it was not well addressed in previous studies, and 3) Higher-order contexts: leveraging the identities of certain graph elements can help to generate accurate scene graphs. Motivated by the analysis, we design a novel SGG framework, Local-to-Global Interaction Networks (LOGIN). Locally, interactions extract the essence between three instances of subject, object, and background, while baking direction awareness into the network by explicitly constraining the input order of subject and object. Globally, interactions encode the contexts between every graph component (i.e., nodes and edges). Finally, Attract & Repel loss is utilized to fine-tune the distribution of predicate embeddings. By design, our framework enables predicting the scene graph in a bottom-up manner, leveraging the possible complementariness. To quantify how much LOGIN is aware of relational direction, a new diagnostic task called Bidirectional Relationship Classification (BRC) is also proposed. Experimental results demonstrate that LOGIN can successfully distinguish relational direction than existing methods (in BRC task), while showing state-of-the-art results on the Visual Genome benchmark (in SGG task).

CVOct 17, 2020
LID 2020: The Learning from Imperfect Data Challenge Results

Yunchao Wei, Shuai Zheng, Ming-Ming Cheng et al.

Learning from imperfect data becomes an issue in many industrial applications after the research community has made profound progress in supervised learning from perfectly annotated datasets. The purpose of the Learning from Imperfect Data (LID) workshop is to inspire and facilitate the research in developing novel approaches that would harness the imperfect data and improve the data-efficiency during training. A massive amount of user-generated data nowadays available on multiple internet services. How to leverage those and improve the machine learning models is a high impact problem. We organize the challenges in conjunction with the workshop. The goal of these challenges is to find the state-of-the-art approaches in the weakly supervised learning setting for object detection, semantic segmentation, and scene parsing. There are three tracks in the challenge, i.e., weakly supervised semantic segmentation (Track 1), weakly supervised scene parsing (Track 2), and weakly supervised object localization (Track 3). In Track 1, based on ILSVRC DET, we provide pixel-level annotations of 15K images from 200 categories for evaluation. In Track 2, we provide point-based annotations for the training set of ADE20K. In Track 3, based on ILSVRC CLS-LOC, we provide pixel-level annotations of 44,271 images for evaluation. Besides, we further introduce a new evaluation metric proposed by \cite{zhang2020rethinking}, i.e., IoU curve, to measure the quality of the generated object localization maps. This technical report summarizes the highlights from the challenge. The challenge submission server and the leaderboard will continue to open for the researchers who are interested in it. More details regarding the challenge and the benchmarks are available at https://lidchallenge.github.io