Jong Taek Lee

CV
h-index: 11
Papers: 6
Citations: 32
Novelty: 48%
AI Score: 44

6 Papers

CV · Nov 2, 2023
Semantic Scene Graph Generation Based on an Edge Dual Scene Graph and Message Passing Neural Network

Hyeongjin Kim, Sangwon Kim, Jong Taek Lee et al.

Along with generative AI, interest in scene graph generation (SGG), which comprehensively captures the relationships and interactions between objects in an image and creates a structured graph-based representation, has significantly increased in recent years. However, relying on object-centric and dichotomous relationships, existing SGG methods have a limited ability to accurately predict detailed relationships. To solve these problems, a new approach to modeling multiobject relationships, called edge dual scene graph generation (EdgeSGG), is proposed herein. EdgeSGG is based on an edge dual scene graph and a Dual Message Passing Neural Network (DualMPNN), which can capture rich contextual interactions between unconstrained objects. To facilitate the learning of edge dual scene graphs with a symmetric graph structure, the proposed DualMPNN learns both object- and relation-centric features for more accurately predicting relation-aware contexts and allows fine-grained relational updates between objects. A comparative experiment with state-of-the-art (SoTA) methods was conducted using two public datasets for SGG operations and six metrics for three subtasks. Compared with SoTA approaches, the proposed model exhibited substantial performance improvements across all SGG subtasks. Furthermore, experiments on long-tail distributions revealed that incorporating the relationships between objects effectively mitigates existing long-tail problems.
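The idea of an edge dual graph can be illustrated with a small sketch. It assumes the "edge dual" corresponds to the standard line graph construction, in which every relation (edge) of the scene graph becomes a node and two such nodes are linked when the original relations share an object. The function name `edge_dual_graph` and the toy scene are hypothetical, and the construction used in EdgeSGG itself may differ.

```python
from itertools import combinations


def edge_dual_graph(edges):
    """Build a line graph (assumed stand-in for the edge dual scene graph).

    Each original edge (subject, object) becomes a dual node. Two dual nodes
    are linked when their original edges share at least one endpoint.
    """
    dual_nodes = list(edges)
    dual_links = []
    for (i, e1), (j, e2) in combinations(enumerate(dual_nodes), 2):
        if set(e1) & set(e2):  # the two relations touch a common object
            dual_links.append((i, j))
    return dual_nodes, dual_links


# Toy scene graph: three candidate relations between four objects.
scene_edges = [("man", "horse"), ("man", "hat"), ("horse", "grass")]
nodes, links = edge_dual_graph(scene_edges)
print(nodes)
print(links)
```

In this representation, message passing over `links` exchanges information between relations rather than between objects, which is one way to read the relation-centric features mentioned in the abstract.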

CV · Nov 22, 2023
High-Quality Face Caricature via Style Translation

Lamyanba Laishram, Muhammad Shaheryar, Jong Taek Lee et al.

Caricature is an exaggerated form of artistic portraiture that accentuates unique yet subtle characteristics of human faces. Recently, advancements in deep end-to-end techniques have yielded encouraging outcomes in capturing both style and elevated exaggerations in creating face caricatures. However, most of these approaches tend to produce cartoon-like results that are not practical enough for real-world applications. In this study, we propose a high-quality, unpaired face caricature method that is appropriate for use in the real world and uses computer vision techniques and GAN models. We attain the exaggeration of facial features and the stylization of appearance through a two-step process: face caricature generation and face caricature projection. The face caricature generation step creates new caricature face datasets from real images and trains a generative model using the real and newly created caricature datasets. The face caricature projection step employs an encoder, trained with real and caricature faces, together with the pretrained generator to project real and caricature faces. We perform an incremental facial exaggeration from the real image to the caricature faces using the latent space of the encoder and generator. Our projection preserves the facial identity, attributes, and expressions from the input image. It also accounts for facial occlusions, such as reading glasses or sunglasses, to enhance the robustness of our model. Furthermore, we conducted a comprehensive comparison of our approach with various state-of-the-art face caricature methods, highlighting the distinctiveness and exceptional realism of our process.

25.8 · CV · May 12
PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition

Sanghyeon Lee, Jinwoo Kim, Jong Taek Lee

Zero-shot skeleton-based action recognition (ZSSAR) is typically treated as a skeleton-text alignment problem: encode joint-coordinate sequences, align them with language, and classify unseen actions. We argue that this alignment is often too late. Skeletons are not complete action observations, but compressed outputs of human pose estimation (HPE); by the time alignment begins, human-object interactions and pose-relative visual cues may no longer be explicit. We call this upstream semantic loss. To address it, we propose PoseBridge, an HPE-aware ZSSAR framework that bridges intermediate HPE representations to skeleton-text alignment. Rather than adding an RGB action branch or object detector, PoseBridge extracts pose-anchored semantic cues from the same HPE process that produces skeletons, then transfers them through skeleton-conditioned bridging and semantic prototype adaptation. Across NTU-RGB+D 60/120, PKU-MMD, and Kinetics-200/400, PoseBridge improves ZSSAR performance under the evaluated protocols. On the Kinetics-200/400 PURLS benchmark, which contains in-the-wild videos with diverse scenes and action contexts, PoseBridge shows the clearest separation, improving the strongest compared baseline by 13.3-17.4 points across all eight splits. Our code will be publicly released.

CV · May 21, 2024
Scene Graph Generation Strategy with Co-occurrence Knowledge and Learnable Term Frequency

Hyeongjin Kim, Sangwon Kim, Dasom Ahn et al.

Scene graph generation (SGG) is an important task in image understanding because it represents the relationships between objects in an image as a graph structure, making it possible to understand the semantic relationships between objects intuitively. Previous SGG studies used message-passing neural networks (MPNNs) to update features, which can effectively reflect information about surrounding objects. However, these studies have failed to reflect the co-occurrence of objects during scene graph generation. In addition, they only addressed the long-tail problem of the training dataset from the perspectives of sampling and learning methods. To address these two problems, we propose CooK, which reflects the Co-occurrence Knowledge between objects, and the learnable term frequency-inverse document frequency (TF-l-IDF) to solve the long-tail problem. We applied the proposed model to the SGG benchmark dataset, and the results showed a performance improvement of up to 3.8% compared with existing state-of-the-art models in the SGGen subtask. The results also indicate that the proposed method generalizes, showing a uniform performance improvement across all MPNN models.
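For readers unfamiliar with the weighting that TF-l-IDF builds on, the following is a minimal sketch of classical, non-learnable TF-IDF applied to relation labels. It is not the learnable variant proposed in the paper. Treating each image's list of predicates as a "document" is an assumption made only for illustration, and `tf_idf` and the toy data are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classical TF-IDF: tf = count / doc length, idf = log(N / doc frequency)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy data: predicates observed in three images. "on" is a frequent head class.
predicates_per_image = [
    ["on", "on", "riding"],
    ["on", "wearing"],
    ["on", "near"],
]
scores = tf_idf(predicates_per_image)
print(scores[0])
```

Because "on" appears in every toy image, its inverse document frequency is zero, while the rarer "riding" receives a positive weight. This down-weighting of ubiquitous labels is the property that makes TF-IDF-style terms relevant to long-tailed predicate distributions.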

CV · Nov 21, 2025
FLUID: Training-Free Face De-identification via Latent Identity Substitution

Jinhyeong Park, Shaheryar Muhammad, Seangmin Lee et al.

Current face de-identification methods that replace identifiable cues in the face region with others sacrifice utilities contributing to realism, such as age and gender. To recover the lost realism, we present FLUID (Face de-identification in the Latent space via Utility-preserving Identity Displacement), a single-input face de-identification framework that directly replaces identity features in the latent space of a pretrained diffusion model without affecting the model's weights. We reinterpret face de-identification as an image editing task in the latent h-space of a pretrained unconditional diffusion model. Our framework estimates identity-editing directions through optimization guided by loss functions that encourage attribute preservation while suppressing identity signals. We further introduce both linear and geodesic (tangent-based) editing schemes to effectively navigate the latent manifold. Experiments on CelebA-HQ and FFHQ show that FLUID achieves a superior balance between identity suppression and attribute preservation, outperforming existing de-identification approaches in both qualitative and quantitative evaluations.
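The difference between a linear and a geodesic (tangent-based) edit can be sketched on plain vectors. The sketch assumes the latent code is treated as a point on a hypersphere, so a geodesic edit rotates the code along a great circle toward the tangent component of the editing direction and keeps its norm fixed. The function names and the 2-D toy vectors are hypothetical, and the schemes used in FLUID may be defined differently.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def linear_edit(h, d, alpha):
    """Move h along direction d in a straight line: h + alpha * d."""
    return [x + alpha * y for x, y in zip(h, d)]


def geodesic_edit(h, d, theta):
    """Rotate h by angle theta toward the tangent part of d, preserving ||h||."""
    r = norm(h)
    u = [x / r for x in h]                       # unit vector along h
    proj = dot(d, u)
    v = [y - proj * x for x, y in zip(u, d)]     # component of d orthogonal to h
    v_norm = norm(v)
    if v_norm == 0.0:                            # d is parallel to h: no tangent direction
        return list(h)
    v = [x / v_norm for x in v]
    return [r * (math.cos(theta) * a + math.sin(theta) * b) for a, b in zip(u, v)]


h = [3.0, 4.0]   # toy latent code
d = [1.0, 0.0]   # toy editing direction
print(linear_edit(h, d, 1.0), norm(linear_edit(h, d, 1.0)))
print(geodesic_edit(h, d, 0.5), norm(geodesic_edit(h, d, 0.5)))
```

The linear edit changes the norm of the latent code, whereas the geodesic edit leaves it unchanged. That is one plausible motivation for navigating a curved latent manifold with tangent-based steps.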

CR · Apr 2, 2019
SurFi: Detecting Surveillance Camera Looping Attacks with Wi-Fi Channel State Information (Extended Version)

Nitya Lakshmanan, Inkyu Bang, Min Suk Kang et al.

The proliferation of surveillance cameras has greatly improved the physical security of many security-critical properties, including buildings, stores, and homes. However, recent surveillance camera looping attacks demonstrate new security threats: adversaries can replay a seemingly benign video feed of a place of interest while trespassing or stealing valuables without getting caught. Unfortunately, such attacks are extremely difficult to detect in real time due to cost and implementation constraints. In this paper, we propose SurFi to detect these attacks in real time by utilizing commonly available Wi-Fi signals. In particular, we leverage the fact that channel state information (CSI) from Wi-Fi signals, like surveillance cameras, also perceives human activities in the place of interest. SurFi processes and correlates the live video feeds and the Wi-Fi CSI signals to detect any mismatches that would identify the presence of surveillance camera looping attacks. SurFi does not require the deployment of additional infrastructure because Wi-Fi transceivers are easily found in urban indoor environments. We design and implement the SurFi system and evaluate its effectiveness in detecting surveillance camera looping attacks. Our evaluation demonstrates that SurFi effectively identifies attacks with an attack detection accuracy of up to 98.8% and a false positive rate of 0.1%.
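The mismatch check described in the abstract can be sketched as a correlation test between two activity signals. This is only an illustration and not the SurFi pipeline: it assumes both the video feed and the Wi-Fi CSI have already been reduced to per-window activity scores, and the function names, the threshold of 0.5, and the toy numbers are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def looks_like_looping(video_activity, csi_activity, threshold=0.5):
    """Flag a possible looping attack when the two signals disagree.

    threshold is an assumed, untuned value chosen for this toy example.
    """
    return pearson(video_activity, csi_activity) < threshold


# Toy activity scores per time window (made-up values).
csi = [0.2, 1.0, 0.7, 0.3, 0.1]           # Wi-Fi CSI sees someone moving
live_video = [0.1, 0.9, 0.8, 0.2, 0.1]    # live feed shows the same movement
looped_video = [0.2, 0.1, 0.1, 0.2, 0.1]  # replayed feed shows an empty room

print(looks_like_looping(live_video, csi))
print(looks_like_looping(looped_video, csi))
```

A real system would need synchronized, calibrated features and a threshold chosen from data. The sketch only shows why a replayed feed tends to decorrelate from what the Wi-Fi channel observes.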