Jaewoo Park

CV
h-index 9
33 papers
364 citations
Novelty 49%
AI Score 57

33 Papers

CV Sep 26, 2023 Code
Nearest Neighbor Guidance for Out-of-Distribution Detection

Jaewoo Park, Yoon Gyo Jung, Andrew Beng Jin Teoh

Detecting out-of-distribution (OOD) samples is crucial for machine learning models deployed in open-world environments. Classifier-based scores are a standard approach for OOD detection due to their fine-grained detection capability. However, these scores often suffer from overconfidence issues, misclassifying OOD samples distant from the in-distribution region. To address this challenge, we propose a method called Nearest Neighbor Guidance (NNGuide) that guides the classifier-based score to respect the boundary geometry of the data manifold. NNGuide reduces the overconfidence of OOD samples while preserving the fine-grained capability of the classifier-based score. We conduct extensive experiments on ImageNet OOD detection benchmarks under diverse settings, including a scenario where the ID data undergoes natural distribution shift. Our results demonstrate that NNGuide provides a significant performance improvement on the base detection scores, achieving state-of-the-art results on the AUROC, FPR95, and AUPR metrics. The code is available at https://github.com/roomo7time/nnguide

CV Jan 5, 2023 Code
Open-Set Face Identification on Few-Shot Gallery by Fine-Tuning

Hojin Park, Jaewoo Park, Andrew Beng Jin Teoh

In this paper, we focus on addressing the open-set face identification problem on a few-shot gallery by fine-tuning. The problem assumes a realistic scenario for face identification, where only a small number of face images is given for enrollment and any unknown identity must be rejected during identification. We observe that face recognition models pretrained on a large dataset and naively fine-tuned models perform poorly for this task. Motivated by this issue, we propose an effective fine-tuning scheme with classifier weight imprinting and exclusive BatchNorm layer tuning. For further improvement of rejection accuracy on unknown identities, we propose a novel matcher called Neighborhood Aware Cosine (NAC) that computes similarity based on neighborhood information. We validate the effectiveness of the proposed schemes thoroughly on large-scale face benchmarks across different convolutional neural network architectures. The source code for this project is available at: https://github.com/1ho0jin1/OSFI-by-FineTuning

CV Apr 20, 2023
Recognizability Embedding Enhancement for Very Low-Resolution Face Recognition and Quality Estimation

Jacky Chen Long Chai, Tiong-Sik Ng, Cheng-Yaw Low et al.

Very low-resolution face recognition (VLRFR) poses unique challenges, such as tiny regions of interest and poor resolution due to extreme standoff distance or wide viewing angle of the acquisition devices. In this paper, we study principled approaches to elevate the recognizability of a face in the embedding space instead of the visual quality. We first formulate a robust learning-based face recognizability measure, namely recognizability index (RI), based on two criteria: (i) proximity of each face embedding to the cluster center of unrecognizable faces and (ii) closeness of each face embedding to its positive and negative class prototypes. We then devise an index diversion loss to push hard-to-recognize face embeddings with low RI away from the cluster of unrecognizable faces to boost the RI, which reflects better recognizability. Additionally, a perceptibility attention mechanism is introduced to attend to the most recognizable face regions, which offers better explanatory and discriminative traits for embedding learning. Our proposed model is trained end-to-end and simultaneously serves recognizability-aware embedding learning and face quality estimation. Our extensive evaluations on three challenging low-resolution datasets and on face quality assessment demonstrate the superiority of the proposed model over the state-of-the-art methods for VLRFR.

CV Sep 23, 2022
Understanding Open-Set Recognition by Jacobian Norm and Inter-Class Separation

Jaewoo Park, Hojin Park, Eunju Jeong et al.

The findings on open-set recognition (OSR) show that models trained on classification datasets are capable of detecting unknown classes not encountered during the training process. Specifically, after training, the learned representations of known classes dissociate from the representations of the unknown class, facilitating OSR. In this paper, we investigate this emergent phenomenon by examining the relationship between the Jacobian norm of representations and the inter/intra-class learning dynamics. We provide a theoretical analysis, demonstrating that intra-class learning reduces the Jacobian norm for known class samples, while inter-class learning increases the Jacobian norm for unknown samples, even in the absence of direct exposure to any unknown sample. Overall, the discrepancy in the Jacobian norm between the known and unknown classes enables OSR. Based on this insight, which highlights the pivotal role of inter-class learning, we devise a marginal one-vs-rest (m-OvR) loss function that promotes strong inter-class separation. To further improve OSR performance, we integrate the m-OvR loss with additional strategies that maximize the Jacobian norm disparity. We present comprehensive experimental results that support our theoretical observations and demonstrate the efficacy of our proposed OSR approach.

26.8 CV May 26
Memory-Distilled Selection for Noise-Robust Anomaly Detection

Sirojbek Safarov, Jaewoo Park, Yoon Gyo Jung et al.

Anomaly detection (AD) under data contamination is critical for deploying unsupervised defect detection in industrial environments, where curating perfectly clean training sets is impractical. However, existing methods are sensitive to contamination, suffering significant performance degradation as the noise ratio increases. In this paper, we propose Memory-Distilled Selection (MeDS), a training algorithm based on data selection. MeDS constructs an ensemble of partial memories via random subsampling, where the resulting sparsity acts as a low-pass filter that captures nominal patterns across a wide range of noise ratios, enabling coarse-level identification of contaminated samples. The aggregated distances to the bootstrapped memories are then distilled into a reconstruction score network, which is subsequently fine-tuned on clean data filtered using scores from the distilled model, enabling fine-grained localization of anomalies. MeDS is robust across a wide range of noise ratios without requiring noise-ratio-specific hyperparameter tuning, achieving 99.16% image-level AUROC on MVTecAD at a 40% noise ratio, and attaining state-of-the-art performance on both VisA and Real-IAD under noisy settings. We thoroughly verify the efficacy of MeDS on industrial AD benchmarks under noisy data scenarios, accompanied by in-depth empirical analyses.

CV Jul 2, 2024
Face Reconstruction Transfer Attack as Out-of-Distribution Generalization

Yoon Gyo Jung, Jaewoo Park, Xingbo Dong et al.

Understanding the vulnerability of face recognition systems to malicious attacks is of critical importance. Previous works have focused on reconstructing face images that can penetrate a targeted verification system. Even in the white-box scenario, however, naively reconstructed images misrepresent the identity information, hence the attacks are easily neutralized once the face system is updated or changed. In this paper, we aim to reconstruct face images that are capable of transferring face attacks to unseen encoders. We term this problem Face Reconstruction Transfer Attack (FRTA) and show that it can be formulated as an out-of-distribution (OOD) generalization problem. Inspired by its OOD nature, we propose to solve FRTA by Averaged Latent Search and Unsupervised Validation with pseudo target (ALSUV). To strengthen the reconstruction attack on OOD unseen encoders, ALSUV reconstructs the face by searching the latent space of the amortized generator StyleGAN2 through multiple latent optimization, latent optimization trajectory averaging, and unsupervised validation with a pseudo target. We demonstrate the efficacy and generalization of our method on widely used face datasets, accompanying it with extensive ablation studies and visual, qualitative, and quantitative analyses. The source code will be released.

CV Oct 9, 2023
Understanding the Feature Norm for Out-of-Distribution Detection

Jaewoo Park, Jacky Chen Long Chai, Jaeho Yoon et al.

A neural network trained on a classification dataset often exhibits a higher vector norm of hidden layer features for in-distribution (ID) samples, while producing relatively lower norm values on unseen instances from out-of-distribution (OOD). Despite this intriguing phenomenon being utilized in many applications, the underlying cause has not been thoroughly investigated. In this study, we demystify this very phenomenon by scrutinizing the discriminative structures concealed in the intermediate layers of a neural network. Our analysis leads to the following discoveries: (1) The feature norm is a confidence value of a classifier hidden in the network layer, specifically its maximum logit. Hence, the feature norm distinguishes OOD from ID in the same manner that a classifier confidence does. (2) The feature norm is class-agnostic, thus it can detect OOD samples across diverse discriminative models. (3) The conventional feature norm fails to capture the deactivation tendency of hidden layer neurons, which may lead to misidentification of ID samples as OOD instances. To resolve this drawback, we propose a novel negative-aware norm (NAN) that can capture both the activation and deactivation tendencies of hidden layer neurons. We conduct extensive experiments on NAN, demonstrating its efficacy and compatibility with existing OOD detectors, as well as its capability in label-free environments.
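The observation that ID samples tend to have a larger hidden-feature norm than OOD samples can be illustrated with a minimal sketch. This is only the conventional L2 feature norm used as a score, not the negative-aware norm (NAN) proposed in the paper; the function names and the `threshold` value are hypothetical and would have to be chosen on held-out data.

```python
import math
from typing import Sequence


def feature_norm(features: Sequence[float]) -> float:
    """L2 norm of a hidden-layer feature vector (higher suggests ID)."""
    return math.sqrt(sum(v * v for v in features))


def is_ood(features: Sequence[float], threshold: float) -> bool:
    """Flag a sample as OOD when its feature norm falls below a threshold.

    The threshold is an assumption of this sketch, not a value from the paper.
    """
    return feature_norm(features) < threshold


if __name__ == "__main__":
    high_norm = [3.0, 4.0]   # stands in for an ID-like feature vector
    low_norm = [0.3, 0.4]    # stands in for an OOD-like feature vector
    print(feature_norm(high_norm), is_ood(high_norm, threshold=1.0))
    print(feature_norm(low_norm), is_ood(low_norm, threshold=1.0))
```

In practice the feature vector would come from a trained network's penultimate layer; the toy vectors here only show the direction of the comparison.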

CR Aug 17, 2023
Hyperdimensional Computing as a Rescue for Efficient Privacy-Preserving Machine Learning-as-a-Service

Jaewoo Park, Chenghao Quan, Hyungon Moon et al.

Machine learning models are often provisioned as a cloud-based service where the clients send their data to the service provider to obtain the result. This setting is commonplace due to the high value of the models, but it requires the clients to forfeit the privacy of their query data. Homomorphic encryption (HE) is a promising technique to address this problem. With HE, the service provider can take encrypted data as a query and run the model without decrypting it. The result remains encrypted, and only the client can decrypt it. All these benefits come at a high computational cost because HE turns simple floating-point arithmetic into computation between long (degree over 1024) polynomials. Previous work has proposed to tailor deep neural networks for efficient computation over encrypted data, but their already high computational cost is further amplified by HE, hindering performance improvement. In this paper, we show that hyperdimensional computing can be a rescue for privacy-preserving machine learning over encrypted data. We find that the performance advantage of hyperdimensional computing is amplified when working with HE. This observation led us to design HE-HDC, a machine-learning inference system that uses hyperdimensional computing with HE. We carefully structure the machine learning service so that the server performs only HE-friendly computation. Moreover, we adapt the computation and HE parameters to expedite computation while preserving accuracy and security. Our experimental results based on real measurements show that HE-HDC outperforms existing systems by 26 to 3000 times with comparable classification accuracy.

LG Aug 4, 2022
Customs Import Declaration Datasets

Chaeyoon Jeong, Sundong Kim, Jaewoo Park et al.

Given the huge volume of cross-border flows, effective and efficient control of trade becomes more crucial in protecting people and society from illicit trade. However, limited accessibility of the transaction-level trade datasets hinders the progress of open research, and lots of customs administrations have not benefited from the recent progress in data-based risk management. In this paper, we introduce an import declaration dataset to facilitate the collaboration between domain experts in customs administrations and researchers from diverse domains, such as data science and machine learning. The dataset contains 54,000 artificially generated trades with 22 key attributes, and it is synthesized with conditional tabular GAN while maintaining correlated features. Synthetic data has several advantages. First, releasing the dataset is free from restrictions that do not allow disclosing the original import data. The fabrication step minimizes the possible identity risk which may exist in trade statistics. Second, the published data follow a similar distribution to the source data so that it can be used in various downstream tasks. Hence, our dataset can be used as a benchmark for testing the performance of any classification algorithm. With the provision of data and its generation process, we open baseline codes for fraud detection tasks, as we empirically show that more advanced algorithms can better detect fraud.

33.1 CV Mar 26
EgoXtreme: A Dataset for Robust Object Pose Estimation in Egocentric Views under Extreme Conditions

Taegyoon Yoon, Yegyu Han, Seojin Ji et al.

Smart glasses are emerging as a useful device since they provide plenty of insights under hands-busy, eyes-on-task situations. To understand the context of the wearer, 6D object pose estimation in egocentric view is becoming essential. However, existing 6D object pose estimation benchmarks fail to capture the challenges of real-world egocentric applications, which are often dominated by severe motion blur, dynamic illumination, and visual obstructions. This discrepancy creates a significant gap between controlled lab data and chaotic real-world application. To bridge this gap, we introduce EgoXtreme, a new large-scale 6D pose estimation dataset captured entirely from an egocentric perspective. EgoXtreme features three challenging scenarios (industrial maintenance, sports, and emergency rescue) designed to introduce severe perceptual ambiguities through extreme lighting, heavy motion blur, and smoke. Evaluations of state-of-the-art generalizable pose estimators on EgoXtreme indicate that their generalization fails to hold in extreme conditions, especially under low light. We further demonstrate that simply applying image restoration (e.g., deblurring) offers no improvement under extreme conditions. In contrast, tracking-based approaches show performance gains, implying that using temporal information in fast-motion scenarios is meaningful. We conclude that EgoXtreme is an essential resource for developing and evaluating the next generation of pose estimation models robust enough for real-world egocentric vision. The dataset and code are available at https://taegyoun88.github.io/EgoXtreme/

NE Aug 19, 2024
A More Accurate Approximation of Activation Function with Few Spikes Neurons

Dayena Jeong, Jaewoo Park, Jeonghee Jo et al.

Recent deep neural networks (DNNs), such as diffusion models [1], have faced high computational demands. Thus, spiking neural networks (SNNs) have attracted considerable attention as energy-efficient neural networks. However, conventional spiking neurons, such as leaky integrate-and-fire neurons, cannot accurately represent complex non-linear activation functions, such as Swish [2]. To approximate activation functions with spiking neurons, few spikes (FS) neurons were proposed [3], but the approximation performance was limited due to the lack of training methods tailored to these neurons. Thus, we propose tendency-based parameter initialization (TBPI) to enhance the approximation of activation functions with FS neurons, exploiting temporal dependencies when initializing the training parameters.

CL Jan 5
K-EXAONE Technical Report

Eunbi Choi, Kibong Choi, Seokhee Hong et al.

This technical report presents K-EXAONE, a large-scale multilingual language model developed by LG AI Research. K-EXAONE is built on a Mixture-of-Experts architecture with 236B total parameters, activating 23B parameters during inference. It supports a 256K-token context window and covers six languages: Korean, English, Spanish, German, Japanese, and Vietnamese. We evaluate K-EXAONE on a comprehensive benchmark suite spanning reasoning, agentic, general, Korean, and multilingual abilities. Across these evaluations, K-EXAONE demonstrates performance comparable to open-weight models of similar size. K-EXAONE, designed to advance AI for a better life, is positioned as a powerful proprietary AI foundation model for a wide range of industrial and research applications.

36.6 CR May 13
HE-PIM: Demystifying Homomorphic Operations on a Real-world Processing-in-Memory System

Harshita Gupta, Mayank Kabra, Jaewoo Park et al.

Homomorphic encryption (HE) enables computation over encrypted data, offering strong privacy guarantees for untrusted computing environments. Practical adoption remains limited by high computational complexity, large ciphertext sizes, and substantial data movement. Processor-centric architectures (CPUs, GPUs, ASICs) hit fundamental bottlenecks on HE workloads because ciphertexts are large, data locality is low, and primitives such as relinearization and bootstrapping repeatedly access large auxiliary metadata. Processing-In-Memory (PIM) is a promising mitigation by computing near or inside memory. Prior PIM proposals for HE either do not target real-world PIM systems or cover only a narrow set of operations. We comprehensively characterize HE operations on a real-world, general-purpose PIM system. We implement a complete set of HE kernels used by emerging applications (databases, machine learning) on the UPMEM PIM system, evaluate performance and scalability, compare against CPU and GPU baselines, and discuss implications for future PIM hardware. Our results demonstrate four major findings. (1) HE-based applications expose distinct bottlenecks across execution stages: some kernels are compute-bound due to modular arithmetic, while others are memory-bound due to large ciphertexts and intermediate data. These bottlenecks are exacerbated by limited per-core compute and per-bank capacity, which force frequent data movement. (2) The dominant compute bottleneck is the lack of native 64-bit modular integer multiplication, a key HE primitive. (3) Limited per-bank memory capacity is the second major bottleneck, since HE ciphertexts and auxiliary metadata do not fit and require inter-bank movement. (4) Despite these limits, PIM can be a viable alternative to state-of-the-art CPU and GPU systems for HE when equipped with native modular multiplication and efficient inter-PIM data movement.

93.8 CL Apr 9
EXAONE 4.5 Technical Report

Eunbi Choi, Kibong Choi, Sehyun Chun et al.

This technical report introduces EXAONE 4.5, the first open-weight vision language model released by LG AI Research. EXAONE 4.5 is architected by integrating a dedicated visual encoder into the existing EXAONE 4.0 framework, enabling native multimodal pretraining over both visual and textual modalities. The model is trained on large-scale data with careful curation, particularly emphasizing document-centric corpora that align with LG's strategic application domains. This targeted data design enables substantial performance gains in document understanding and related tasks, while also delivering broad improvements across general language capabilities. EXAONE 4.5 extends context length up to 256K tokens, facilitating long-context reasoning and enterprise-scale use cases. Comparative evaluations demonstrate that EXAONE 4.5 achieves competitive performance in general benchmarks while outperforming state-of-the-art models of similar scale in document understanding and Korean contextual reasoning. As part of LG's ongoing effort toward practical industrial deployment, EXAONE 4.5 is designed to be continuously extended with additional domains and application scenarios to advance AI for a better life.

CL Apr 4, 2025 Code
Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation

Jaewoo Park, Jungyang Park, Dongju Jang et al.

With the rapid advancement of mathematical reasoning capabilities in Large Language Models (LLMs), AI systems are increasingly being adopted in educational settings to support students' comprehension of problem-solving processes. However, a critical component remains underexplored in current LLM-generated explanations: multimodal explanation. In real-world instructional contexts, human tutors routinely employ visual aids, such as diagrams, markings, and highlights, to enhance conceptual clarity. To bridge this gap, we introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints, such as auxiliary lines, points, and angles, and generate explanations that incorporate these key elements essential for understanding. To evaluate model performance on this task, we propose ME2, a multimodal benchmark consisting of 1,000 math problems annotated with visual keypoints and corresponding explanatory text that references those elements. Our empirical results show that current models struggle to identify visual keypoints. In the task of generating keypoint-based explanations, open-source models also face notable difficulties. This highlights a significant gap in current LLMs' ability to perform mathematical visual grounding, engage in visually grounded reasoning, and provide explanations in educational contexts. We expect that the multimodal solution explanation task and the ME2 dataset will catalyze further research on LLMs in education and promote their use as effective, explanation-oriented AI tutors.

CV Dec 16, 2021 Code
DProST: Dynamic Projective Spatial Transformer Network for 6D Pose Estimation

Jaewoo Park, Nam Ik Cho

Predicting an object's 6D pose from a single RGB image is a fundamental computer vision task. Generally, the distance between transformed object vertices is employed as an objective function for pose estimation methods. However, these methods do not consider projective geometry in the camera space, which causes performance degradation. In this regard, we propose a new pose estimation system based on a projective grid instead of object vertices. Our pose estimation method, dynamic projective spatial transformer network (DProST), localizes the region of interest grid on the rays in camera space and transforms the grid to object space by the estimated pose. The transformed grid is used as both a sampling grid and a new criterion of the estimated pose. Additionally, because DProST does not require object vertices, our method can be used in a mesh-less setting by replacing the mesh with a reconstructed feature. Experimental results show that mesh-less DProST outperforms the state-of-the-art mesh-based methods on the LINEMOD and LINEMOD-OCCLUSION datasets, and shows competitive performance on the YCBV dataset with mesh data. The source code is available at https://github.com/parkjaewoo0611/DProST

87.9 DS May 8
Deterministic Monotone Min-Plus Product and Convolution

Ce Jin, Jaewoo Park, Barna Saha et al.

The Monotone Min-Plus Product problem is a useful primitive that has seen many algorithmic applications over the past decade. In this problem, we are given two $n\times n$ integer matrices $A$ and $B$, where each row of $B$ is a monotone non-decreasing sequence of integers from $\{1,\dots,n\}$, and the goal is to compute their Min-Plus product, defined as the $n\times n$ matrix $C$ with $C_{i,j} = \min_{k}\{A_{i,k} + B_{k,j}\}$. The fastest known algorithm for this task [Chi, Duan, Xie, and Zhang, STOC'22] runs in $n^{(\omega+3)/2+o(1)} = O(n^{2.686})$ time, significantly improving over the brute-force cubic algorithm. However, its main disadvantage is that it requires randomization, which is then inherited by all downstream applications. Our main result is a deterministic algorithm for Monotone Min-Plus product with the same time complexity $n^{(\omega+3)/2+o(1)} = O(n^{2.686})$ as its randomized counterpart, improving upon the previous deterministic bound $O(n^{2.875})$ [Gu, Polak, Vassilevska Williams, and Xu, ICALP'21]. Our derandomization also applies to previously studied extensions and variants (e.g., [Dürr, IPL'23]), including rectangular matrices, bounded range $[n^{\mu}]$, and column-monotone matrices. As an immediate consequence, we derandomize state-of-the-art algorithms for multiple problems, including Language Edit Distance, RNA Folding, Optimum Stack Generation, unweighted Tree Edit Distance, Batched Range Mode, and Approximate All-Pairs Shortest Paths. Our techniques also yield a deterministic algorithm for the Monotone Min-Plus Convolution problem that runs in $n^{1.5+o(1)}$ time, nearly matching the best known randomized time complexity $\widetilde{O}(n^{1.5})$ [Chi, Duan, Xie, and Zhang, STOC'22]. This algorithm can be used to derandomize state-of-the-art algorithms for Jumbled Indexing for binary strings and several variants of Knapsack.
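To make the definition $C_{i,j} = \min_{k}\{A_{i,k} + B_{k,j}\}$ concrete, the following is a minimal sketch of the brute-force cubic baseline that the abstract mentions. It is not the paper's deterministic algorithm; the function names are hypothetical and chosen only for illustration.

```python
from typing import List

Matrix = List[List[int]]


def rows_are_monotone(B: Matrix) -> bool:
    """Check that every row of B is non-decreasing (the monotone condition)."""
    return all(row[j] <= row[j + 1] for row in B for j in range(len(row) - 1))


def min_plus_product(A: Matrix, B: Matrix) -> Matrix:
    """Brute-force Min-Plus product: C[i][j] = min over k of A[i][k] + B[k][j].

    Runs in cubic time for n x n inputs; it does not exploit monotonicity.
    """
    inner = len(B)
    cols = len(B[0])
    return [
        [min(A[i][k] + B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(A))
    ]


if __name__ == "__main__":
    A = [[0, 2], [1, 0]]
    B = [[1, 1], [1, 2]]  # each row is non-decreasing with entries in {1, 2}
    assert rows_are_monotone(B)
    print(min_plus_product(A, B))  # [[1, 1], [1, 2]]
```

A baseline like this is mainly useful as a correctness reference when testing a faster implementation on small inputs.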

CV Dec 22, 2025
Towards AI-Guided Open-World Ecological Taxonomic Classification

Cheng Yaw Low, Heejoon Koo, Jaewoo Park et al.

AI-guided classification of ecological families, genera, and species underpins global sustainability efforts such as biodiversity monitoring, conservation planning, and policy-making. Progress toward this goal is hindered by long-tailed taxonomic distributions from class imbalance, along with fine-grained taxonomic variations, test-time spatiotemporal domain shifts, and closed-set assumptions that can only recognize previously seen taxa. We introduce the Open-World Ecological Taxonomy Classification, a unified framework that captures the co-occurrence of these challenges in realistic ecological settings. To address them, we propose TaxoNet, an embedding-based encoder with a dual-margin penalization loss that strengthens learning signals from rare underrepresented taxa while mitigating the dominance of overrepresented ones, directly confronting interrelated challenges. We evaluate our method on diverse ecological domains: Google Auto-Arborist (urban trees), iNat-Plantae (Plantae observations from various ecosystems in iNaturalist-2019), and NAFlora-Mini (a curated herbarium collection). Our model consistently outperforms baselines, particularly for rare taxa, establishing a strong foundation for open-world plant taxonomic monitoring. Our findings further show that general-purpose multimodal foundation models remain constrained in plant-domain applications.

CR Nov 1, 2025
EP-HDC: Hyperdimensional Computing with Encrypted Parameters for High-Throughput Privacy-Preserving Inference

Jaewoo Park, Chenghao Quan, Jongeun Lee

While homomorphic encryption (HE) provides strong privacy protection, its high computational cost has restricted its application to simple tasks. Recently, hyperdimensional computing (HDC) applied to HE has shown promising performance for privacy-preserving machine learning (PPML). However, when applied to more realistic scenarios such as batch inference, HDC-based HE still has very high compute time as well as high encryption and data transmission overheads. To address this problem, we propose HDC with encrypted parameters (EP-HDC), which is a novel PPML approach featuring client-side HE, i.e., inference is performed on a client using a homomorphically encrypted model. Our EP-HDC can effectively mitigate the encryption and data transmission overhead and scales well to many clients, while providing strong protection for user data and model parameters. In addition to application examples for our client-side PPML, we also present design space exploration involving quantization, architecture, and HE-related parameters. Our experimental results using the BFV scheme and the Face/Emotion datasets demonstrate that our method can improve throughput and latency of batch inference by orders of magnitude over previous PPML methods (36.52x to 1068x and 6.45x to 733x, respectively) with less than 1% accuracy degradation.

CV Feb 19, 2024
Adversarial Feature Alignment: Balancing Robustness and Accuracy in Deep Learning via Adversarial Training

Leo Hyun Park, Jaeuk Kim, Myung Gyo Oh et al.

Deep learning models continue to advance in accuracy, yet they remain vulnerable to adversarial attacks, which often lead to the misclassification of adversarial examples. Adversarial training is used to mitigate this problem by increasing robustness against these attacks. However, this approach typically reduces a model's standard accuracy on clean, non-adversarial samples. The necessity for deep learning models to balance both robustness and accuracy for security is obvious, but achieving this balance remains challenging, and the underlying reasons are yet to be clarified. This paper proposes a novel adversarial training method called Adversarial Feature Alignment (AFA), to address these problems. Our research unveils an intriguing insight: misalignment within the feature space often leads to misclassification, regardless of whether the samples are benign or adversarial. AFA mitigates this risk by employing a novel optimization algorithm based on contrastive learning to alleviate potential feature misalignment. Through our evaluations, we demonstrate the superior performance of AFA. The baseline AFA delivers higher robust accuracy than previous adversarial contrastive learning methods while minimizing the drop in clean accuracy to 1.86% and 8.91% on CIFAR10 and CIFAR100, respectively, in comparison to cross-entropy. We also show that joint optimization of AFA and TRADES, accompanied by data augmentation using a recent diffusion model, achieves state-of-the-art accuracy and robustness.

CV Jan 29, 2024
Leveraging Positional Encoding for Robust Multi-Reference-Based Object 6D Pose Estimation

Jaewoo Park, Jaeguk Kim, Nam Ik Cho

Accurately estimating the pose of an object is a crucial task in computer vision and robotics. There are two main deep learning approaches for this: geometric representation regression and iterative refinement. However, these methods have some limitations that reduce their effectiveness. In this paper, we analyze these limitations and propose new strategies to overcome them. To tackle the issue of blurry geometric representation, we use positional encoding with high-frequency components for the object's 3D coordinates. To address the local minimum problem in refinement methods, we introduce a normalized image plane-based multi-reference refinement strategy that's independent of intrinsic matrix constraints. Lastly, we utilize adaptive instance normalization and a simple occlusion augmentation method to help our model concentrate on the target object. Our experiments on Linemod, Linemod-Occlusion, and YCB-Video datasets demonstrate that our approach outperforms existing methods. We will soon release the code.

CV Mar 17, 2025
GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance

Junhyeok Kim, Jaewoo Park, Junhee Park et al.

Mobility remains a significant challenge for the 2.2 billion people worldwide affected by blindness and low vision (BLV), with 7% of visually impaired individuals experiencing falls at least once a month. While recent advances in Multimodal Large Language Models (MLLMs) offer promising opportunities for BLV assistance, their development has been hindered by limited datasets. This limitation stems from the fact that BLV-aware annotation requires specialized domain knowledge and intensive labor. To address this gap, we introduce GuideDog, a novel accessibility-aware guide dataset containing 22K image-description pairs (including 2K human-annotated pairs) that capture diverse real-world scenes from a pedestrian's viewpoint. Our approach shifts the annotation burden from generation to verification through a collaborative human-AI framework grounded in established accessibility standards, significantly improving efficiency while maintaining high-quality annotations. We also develop GuideDogQA, a subset of 818 samples featuring multiple-choice questions designed to evaluate fine-grained visual perception capabilities, specifically object recognition and relative depth perception. Our experimental results highlight the importance of accurate spatial understanding for effective BLV guidance. GuideDog and GuideDogQA will advance research in MLLM-based assistive technologies for BLV individuals while contributing to broader applications in understanding egocentric scenes for robotics and augmented reality. The code and dataset will be publicly available.

CVApr 3, 2025
TailedCore: Few-Shot Sampling for Unsupervised Long-Tail Noisy Anomaly Detection

Yoon Gyo Jung, Jaewoo Park, Jaeho Yoon et al.

We aim to solve unsupervised anomaly detection in a practical, challenging environment where the normal dataset is contaminated with defective regions and its product class distribution is tailed but unknown. We observe that existing models suffer from a tail-versus-noise trade-off: if a model is robust against pixel noise, its performance deteriorates on tail class samples, and vice versa. To mitigate the issue, we handle the tail class and noise samples independently. To this end, we propose TailSampler, a novel class size predictor that estimates the class cardinality of samples based on a symmetric assumption on the class-wise distribution of embedding similarities. TailSampler can be utilized to sample the tail class samples exclusively, allowing them to be handled separately. Based on these facets, we build a memory-based anomaly detection model, TailedCore, whose memory both captures tail class information well and is noise-robust. We extensively validate the effectiveness of TailedCore on the unsupervised long-tail noisy anomaly detection setting, and show that TailedCore outperforms the state-of-the-art in most settings.

IRAug 23, 2025
Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation

Yejin Choi, Jaewoo Park, Janghan Yoon et al.

Rapid advances in Multimodal Large Language Models (MLLMs) have expanded information retrieval beyond purely textual inputs, enabling retrieval from complex real-world documents that combine text and visuals. However, most documents are private, either owned by individuals or confined within corporate silos, and current retrievers struggle when faced with unseen domains or languages. To address this gap, we introduce PREMIR, a simple yet effective framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Unlike earlier multimodal retrievers that compare embeddings in a single vector space, PREMIR leverages preQs from multiple complementary modalities to expand the scope of matching to the token level. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain and multilingual settings, outperforming strong baselines across all retrieval metrics. We confirm the contribution of each component through in-depth ablation studies, and qualitative analyses of the generated preQs further highlight the model's robustness in real-world settings.

IVSep 1, 2025
Challenges and Lessons from MIDOG 2025: A Two-Stage Approach to Domain-Robust Mitotic Figure Detection

Euiseop Song, Jaeyoung Park, Jaewoo Park

Mitotic figure detection remains a challenging task in computational pathology due to domain variability and morphological complexity. This paper describes our participation in the MIDOG 2025 challenge, focusing on robust mitotic figure detection across diverse tissue domains. We developed a two-stage pipeline combining Faster R-CNN for candidate detection with an ensemble of three classifiers (DenseNet-121, EfficientNet-v2, InceptionResNet-v2) for false positive reduction. Our best submission achieved an F1-score of 0.2237 (Recall: 0.9528, Precision: 0.1267) using a Faster R-CNN trained solely on the MIDOG++ dataset. While our high recall demonstrates effective mitotic figure detection, the critically low precision (12.67%) reveals fundamental challenges in distinguishing true mitoses from morphologically similar imposters across diverse domains. Analysis of six submission variants showed that subsequent optimization attempts were counterproductive, highlighting the complexity of domain generalization in histopathology. This work provides valuable insights into the practical challenges of developing robust mitotic figure detection algorithms and emphasizes the importance of effective false positive suppression strategies.
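The reported F1-score can be checked against the stated precision and recall with the standard harmonic-mean definition. The helper name `f1_score` below is only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values as stated in the abstract above.
reported_f1 = f1_score(precision=0.1267, recall=0.9528)
```

Because F1 is a harmonic mean, it stays close to the smaller of the two inputs, which is why a recall above 0.95 still yields an F1 near 0.22 here.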

MLJun 23, 2025
Posterior Contraction for Sparse Neural Networks in Besov Spaces with Intrinsic Dimensionality

Kyeongwon Lee, Lizhen Lin, Jaewoo Park et al.

This work establishes that sparse Bayesian neural networks achieve optimal posterior contraction rates over anisotropic Besov spaces and their hierarchical compositions. These structures reflect the intrinsic dimensionality of the underlying function, thereby mitigating the curse of dimensionality. Our analysis shows that Bayesian neural networks equipped with either sparse or continuous shrinkage priors attain the optimal rates which are dependent on the intrinsic dimension of the true structures. Moreover, we show that these priors enable rate adaptation, allowing the posterior to contract at the optimal rate even when the smoothness level of the true function is unknown. The proposed framework accommodates a broad class of functions, including additive and multiplicative Besov functions as special cases. These results advance the theoretical foundations of Bayesian neural networks and provide rigorous justification for their practical effectiveness in high-dimensional, structured estimation problems.

LGMar 4, 2025
Teaching Metric Distance to Discrete Autoregressive Language Models

Jiwan Chung, Saejin Kim, Yongrae Jo et al.

As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets compatible with the models' architectures. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are most notable in low-data regimes, demonstrating DIST2Loss's strength under resource constraints.
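One way to picture turning a distance metric over tokens into a categorical training target is the sketch below. It is not the authors' exact loss. The squared-distance kernel, the `temperature` parameter, and the function names are assumptions chosen for illustration.

```python
import math


def distance_soft_targets(target_index, token_values, temperature=1.0):
    """Build a soft target distribution that decays with distance to the label.

    Assumption: tokens carry scalar values and the target is a softmax over
    negative squared distances (a discretized Gaussian-like shape).
    """
    target_value = token_values[target_index]
    logits = [-((v - target_value) ** 2) / temperature for v in token_values]
    max_logit = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - max_logit) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def cross_entropy(pred_probs, target_probs):
    """Cross-entropy between predicted probabilities and soft targets."""
    return -sum(t * math.log(p) for p, t in zip(pred_probs, target_probs) if t > 0)


# Five tokens representing the values 0..4; the ground-truth token is index 2.
targets = distance_soft_targets(target_index=2, token_values=[0, 1, 2, 3, 4])
```

Compared with a one-hot target, this penalizes a prediction of a neighboring token less than a prediction of a distant one.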

CVNov 25, 2024
Background-Aware Defect Generation for Robust Industrial Anomaly Detection

Youngjae Cho, Gwangyeol Kim, Sirojbek Safarov et al.

Detecting anomalies in industrial settings is challenging due to the scarcity of labeled anomalous data. Generative models can mitigate this issue by synthesizing realistic defect samples, but existing approaches often fail to model the crucial interplay between defects and their background. This oversight leads to unrealistic anomalies, especially in scenarios where contextual consistency is essential (i.e., logical anomalies). To address this, we propose a novel background-aware defect generation framework, where the background influences defect denoising without affecting the background itself, ensuring realistic synthesis while preserving structural integrity. Our method leverages a disentanglement loss to separate the background's denoising process from the defect, enabling controlled defect synthesis through DDIM Inversion. We theoretically demonstrate that our approach maintains background fidelity while generating contextually accurate defects. Extensive experiments on MVTec AD and MVTec Loco benchmarks validate our method's superiority over existing techniques in both defect generation quality and anomaly detection performance.

CVDec 12, 2020
Periocular Embedding Learning with Consistent Knowledge Distillation from Face

Yoon Gyo Jung, Jaewoo Park, Cheng Yaw Low et al.

Periocular biometric, the peripheral area of the ocular, is a collaborative alternative to the face, especially when the face is occluded or masked. However, in practice, the periocular biometric alone captures the least salient facial features, thereby lacking discriminative information, particularly in wild environments. To address these problems, we transfer discriminatory information from the face to support the training of a periocular network by using knowledge distillation. Specifically, we leverage face images for periocular embedding learning, but periocular alone is utilized for identity identification or verification. To enhance periocular embeddings by face effectively, we propose Consistent Knowledge Distillation (CKD) that imposes consistency between face and periocular networks across prediction and feature layers. We find that imposing consistency at the prediction layer enables (1) extraction of global discriminative relationship information from face images and (2) effective transfer of the information from the face network to the periocular network. Particularly, consistency regularizes the prediction units to extract and store profound inter-class relationship information of face images. (3) The feature layer consistency, on the other hand, makes the periocular features robust against identity-irrelevant attributes. Overall, CKD empowers the sole periocular network to produce robust discriminative embeddings for periocular recognition in the wild. We theoretically and empirically validate the core principles of the distillation mechanism in CKD, discovering that CKD is equivalent to label smoothing with a novel sparsity-oriented regularizer that helps the network prediction to capture the global discriminative relationship. Extensive experiments reveal that CKD achieves state-of-the-art results on standard periocular recognition benchmark datasets.

LGMar 3, 2020
Discriminative Multi-level Reconstruction under Compact Latent Space for One-Class Novelty Detection

Jaewoo Park, Yoon Gyo Jung, Andrew Beng Jin Teoh

In one-class novelty detection, a model learns solely on the in-class data to single out out-class instances. Autoencoder (AE) variants aim to compactly model the in-class data to reconstruct it exclusively, thus differentiating the in-class from out-class by the reconstruction error. However, compact modeling in an improper way might collapse the latent representations of the in-class data and thus their reconstruction, which would lead to performance deterioration. Moreover, to properly measure the reconstruction error of high-dimensional data, a metric is required that captures high-level semantics of the data. To this end, we propose Discriminative Compact AE (DCAE) that learns both compact and collapse-free latent representations of the in-class data, thereby reconstructing them both finely and exclusively. In DCAE, (a) we force a compact latent space to bijectively represent the in-class data by reconstructing them through internal discriminative layers of generative adversarial nets. (b) Based on the deep encoder's vulnerability to open set risk, out-class instances are encoded into the same compact latent space and reconstructed poorly without sacrificing the quality of in-class data reconstruction. (c) In inference, the reconstruction error is measured by a novel metric that computes the dissimilarity between a query and its reconstruction based on the class semantics captured by the internal discriminator. Extensive experiments on public image datasets validate the effectiveness of our proposed model on both novelty and adversarial example detection, delivering state-of-the-art performance.

CVOct 17, 2019
On the Risk of Cancelable Biometrics

Xingbo Dong, Jaewoo Park, Zhe Jin et al.

Cancelable biometrics (CB) employs an irreversible transformation to convert the biometric features into transformed templates while preserving the relative distance between two templates for security and privacy protection. However, distance preservation invites unexpected security issues such as pre-image attacks, which are often neglected. This paper presents a generalized pre-image attack method and its extended version that operates on practical CB systems. We theoretically reveal that the distance preservation property is a vulnerability source in the CB schemes. We then propose an empirical information leakage estimation algorithm to assess the pre-image attack risk of the CB schemes. The experiments, conducted with six CB schemes designed for the face, iris and fingerprint, demonstrate that the risks originating from the distance computed from two transformed templates significantly compromise the security of CB schemes. Our work reveals the potential risk of existing CB systems theoretically and experimentally.

LGMar 2, 2019
PuVAE: A Variational Autoencoder to Purify Adversarial Examples

Uiwon Hwang, Jaewoo Park, Hyemi Jang et al.

Deep neural networks are widely used and exhibit excellent performance in many areas. However, they are vulnerable to adversarial attacks that compromise the network at inference time by applying elaborately designed perturbations to input data. Although several defense methods have been proposed to address specific attacks, other attack methods can circumvent these defense mechanisms. Therefore, we propose Purifying Variational Autoencoder (PuVAE), a method to purify adversarial examples. The proposed method eliminates an adversarial perturbation by projecting an adversarial example on the manifold of each class, and determines the closest projection as a purified sample. We experimentally illustrate the robustness of PuVAE against various attack methods without any prior knowledge. In our experiments, the proposed method exhibits performance competitive with state-of-the-art defense methods, and the inference time is approximately 130 times faster than that of Defense-GAN, which is the state-of-the-art purifier model.
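The selection step described above (project onto each class manifold, keep the closest projection) can be sketched with toy stand-ins. Here each "manifold" is a line through the origin and closeness is Euclidean distance in input space. Both are assumptions made for illustration, as are the names `project_onto_line` and `purify`. In PuVAE the projections come from a trained variational autoencoder.

```python
import math


def project_onto_line(x, direction):
    """Orthogonally project x onto the line spanned by `direction` (toy manifold)."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    coeff = sum(a * b for a, b in zip(x, unit))
    return [coeff * u for u in unit]


def purify(x, class_directions):
    """Return the label and projection of the class whose projection is closest to x."""
    best_label, best_proj, best_dist = None, None, float("inf")
    for label, direction in class_directions.items():
        proj = project_onto_line(x, direction)
        dist = math.dist(x, proj)  # Euclidean distance (an assumed choice)
        if dist < best_dist:
            best_label, best_proj, best_dist = label, proj, dist
    return best_label, best_proj


# A slightly perturbed point near the hypothetical class "a" manifold.
label, purified = purify([2.0, 0.1], {"a": [1.0, 0.0], "b": [0.0, 1.0]})
```

The small off-manifold component plays the role of the perturbation and is removed by the projection.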

CVNov 17, 2018
Stacking-Based Deep Neural Network: Deep Analytic Network for Pattern Classification

Cheng-Yaw Low, Jaewoo Park, Andrew Beng-Jin Teoh

Stacking-based deep neural network (S-DNN) is aggregated with pluralities of basic learning modules, one after another, to synthesize a deep neural network (DNN) alternative for pattern classification. Contrary to the DNNs trained end to end by backpropagation (BP), each S-DNN layer, i.e., a self-learnable module, is to be trained decisively and independently without BP intervention. In this paper, a ridge regression-based S-DNN, dubbed deep analytic network (DAN), along with its kernelization (K-DAN), are devised for multilayer feature re-learning from the pre-extracted baseline features and the structured features. Our theoretical formulation demonstrates that DAN/K-DAN re-learn by perturbing the intra/inter-class variations, apart from diminishing the prediction errors. We scrutinize the DAN/K-DAN performance for pattern classification on datasets of varying domains - faces, handwritten digits, generic objects, to name a few. Unlike the typical BP-optimized DNNs to be trained from gigantic datasets by GPU, we disclose that DAN/K-DAN are trainable using only CPU even for small-scale training sets. Our experimental results disclose that DAN/K-DAN outperform the present S-DNNs and also the BP-trained DNNs, including multilayer perceptron, deep belief network, etc., without data augmentation applied.