Hoang M. Truong

CV
h-index24
4papers
14citations
Novelty44%
AI Score37

4 Papers

45.5CVMay 9
Semantic Alignment in Hyperbolic Space for Open-Vocabulary Semantic Segmentation

Hoang M. Truong, Hai Nguyen-Truong, Dang Huynh

Open-vocabulary semantic segmentation requires adapting image-level vision-language models such as CLIP to dense pixel-level prediction, which is challenging due to the mismatch between hierarchical structure and semantic alignment in the embedding space. While recent works leverage hyperbolic geometry to model hierarchical relationships, they align embeddings across hierarchical levels but overlook semantic misalignment among embeddings within the same level. In this work, we propose HyRo, a hyperbolic fine-tuning framework that decouples hierarchical and semantic alignment in the Poincaré ball model. HyRo aligns hierarchical levels by adjusting the hyperbolic radius and refines semantic relationships through angular alignment using an orthogonal transformation that theoretically preserves the hyperbolic radius. Experiments on standard open-vocabulary semantic segmentation benchmarks demonstrate that HyRo achieves state-of-the-art performance over prior methods.

CVApr 25, 2025
Event-Based Eye Tracking. 2025 Event-based Vision Workshop

Qinyu Chen, Chang Gao, Min Liu et al.

This survey serves as a review for the 2025 Event-Based Eye Tracking Challenge organized as part of the 2025 CVPR event-based vision workshop. This challenge focuses on the task of predicting the pupil center by processing event camera recorded eye movement. We review and summarize the innovative methods from teams rank the top in the challenge to advance future event-based eye tracking research. In each method, accuracy, model size, and number of operations are reported. In this survey, we also discuss event-based eye tracking from the perspective of hardware design.

CVApr 14, 2025
Dual-Path Enhancements in Event-Based Eye Tracking: Augmented Robustness and Adaptive Temporal Modeling

Hoang M. Truong, Vinh-Thuan Ly, Huy G. Tran et al.

Event-based eye tracking has become a pivotal technology for augmented reality and human-computer interaction. Yet, existing methods struggle with real-world challenges such as abrupt eye movements and environmental noise. Building on the efficiency of the Lightweight Spatiotemporal Network-a causal architecture optimized for edge devices-we introduce two key advancements. First, a robust data augmentation pipeline incorporating temporal shift, spatial flip, and event deletion improves model resilience, reducing Euclidean distance error by 12% (1.61 vs. 1.70 baseline) on challenging samples. Second, we propose KnightPupil, a hybrid architecture combining an EfficientNet-B3 backbone for spatial feature extraction, a bidirectional GRU for contextual temporal modeling, and a Linear Time-Varying State-Space Module to adapt to sparse inputs and noise dynamically. Evaluated on the 3ET+ benchmark, our framework achieved 1.61 Euclidean distance on the private test set of the Event-based Eye Tracking Challenge at CVPR 2025, demonstrating its effectiveness for practical deployment in AR/VR systems while providing a foundation for future innovations in neuromorphic vision.

CVAug 25, 2025
TinyGiantVLM: A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints

Vinh-Thuan Ly, Hoang M. Truong, Xuan-Huong Nguyen

Reasoning about fine-grained spatial relationships in warehouse-scale environments poses a significant challenge for existing vision-language models (VLMs), which often struggle to comprehend 3D layouts, object arrangements, and multimodal cues in real-world industrial settings. In this paper, we present TinyGiantVLM, a lightweight and modular two-stage framework designed for physical spatial reasoning, distinguishing itself from traditional geographic reasoning in complex logistics scenes. Our approach encodes both global and region-level features from RGB and depth modalities using pretrained visual backbones. To effectively handle the complexity of high-modality inputs and diverse question types, we incorporate a Mixture-of-Experts (MoE) fusion module, which dynamically combines spatial representations to support downstream reasoning tasks and improve convergence. Training is conducted in a two-phase strategy: the first phase focuses on generating free-form answers to enhance spatial reasoning ability, while the second phase uses normalized answers for evaluation. Evaluated on Track 3 of the AI City Challenge 2025, our 64M-parameter base model achieved 5th place on the leaderboard with a score of 66.8861, demonstrating strong performance in bridging visual perception and spatial understanding in industrial environments. We further present an 80M-parameter variant with expanded MoE capacity, which demonstrates improved performance on spatial reasoning tasks.