Van Thong Huynh

h-index: 8

11 papers

169 citations

Novelty: 31%

AI Score: 28

Ranked #147,544 of 194,257 authors (top 76%); #48,269 in CV (top 82%)

11 Papers

4.6 CV Jul 11

A Shared Latent for Partially-Labeled Multi-Task Facial Affect Recognition

Hong Hai Nguyen, Sy Phan Van, Soo-Hyung Kim et al.

Facial affect in the wild is naturally multi-task: valence-arousal, discrete expressions, and facial action units describe the same face. Yet real corpora annotate these tasks only partially and unevenly, so most systems mask the missing labels or impute pseudo-labels and forgo the cross-task signal. We instead cast partially-labeled multi-task learning as marginalization over a shared affect latent: one variational bottleneck mediates all three task decoders, so a frame annotated for one task shapes the representation the others use, and the masked objective reappears as the reconstruction term of an evidence lower bound. On s-Aff-Wild2, where only 37% of frames carry all three labels, the classes are severely imbalanced, and pretraining on the source data is disallowed, we isolate where this coupling acts. On a single backbone it lifts expression macro-F1 from 0.403 for a dedicated specialist to 0.446, which the masked-loss model does not reach; a second, near-peer backbone with decorrelated errors then breaks an action-unit ceiling that external action-unit data could not, while valence-arousal stays within noise. Every gain is disciplined by a matched-control negative; together these controls indicate that the rare-class failure is representational, not a matter of loss shaping. As each task's source is chosen on the evaluation split, we report the assembled result, a combined multi-task score of 1.679 on validation, as an in-sample endpoint and rest our conclusions on the controlled comparisons; a small, regime-dependent transfer of the expression advantage to AffectNet and RAF-DB is presented as exploratory rather than conclusive.
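The masked-loss baseline that this abstract contrasts with its shared-latent approach can be illustrated with a minimal Python sketch. It is not the paper's method: the function names, task keys, and loss values are hypothetical, and the per-sample losses are assumed to be precomputed. Each task's loss is averaged only over the frames that carry a label for that task.

```python
from typing import Dict, Optional, Sequence


def masked_mean_loss(per_sample_losses: Sequence[float],
                     labels_present: Sequence[bool]) -> Optional[float]:
    """Average the loss over labeled samples only; None if nothing is labeled."""
    kept = [loss for loss, present in zip(per_sample_losses, labels_present) if present]
    if not kept:
        return None
    return sum(kept) / len(kept)


def masked_multitask_loss(task_losses: Dict[str, Sequence[float]],
                          task_masks: Dict[str, Sequence[bool]]) -> float:
    """Sum the masked mean losses of all tasks (hypothetical equal task weights)."""
    total = 0.0
    for task, losses in task_losses.items():
        task_loss = masked_mean_loss(losses, task_masks[task])
        if task_loss is not None:  # a task with no labels in the batch contributes nothing
            total += task_loss
    return total


# Hypothetical batch of three frames with uneven annotation coverage.
example_losses = {"va": [0.2, 0.4, 0.6], "expr": [1.0, 2.0, 3.0], "au": [0.5, 0.5, 0.5]}
example_masks = {
    "va": [True, True, False],
    "expr": [False, False, True],
    "au": [False, False, False],
}
example_total = masked_multitask_loss(example_losses, example_masks)
```

In this sketch an unlabeled frame has no effect on a task's gradient, which is the cross-task signal the abstract says masking forgoes.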

4.3 MM Jul 31, 2023

DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement Estimation in Conversation

Vu Ngoc Tu, Van Thong Huynh, Hyung-Jeong Yang et al.

Conversational engagement estimation is posed as a regression problem, entailing the identification of the favorable attention and involvement of the participants in the conversation. This task is crucial for gaining insight into human interaction dynamics and behavior patterns within a conversation. In this research, we introduce a dilated convolutional Transformer for modeling and estimating human engagement in the MULTIMEDIATE 2023 competition. Our proposed system surpasses the baseline models, with a noteworthy 7% improvement on the test set and 4% on the validation set. Moreover, we employ different modality fusion mechanisms and show that, for this type of data, a simple concatenation method with self-attention fusion achieves the best performance.

3.9 CV Jan 11, 2023 Code

Generic Event Boundary Detection in Video with Pyramid Features

Van Thong Huynh, Hyung-Jeong Yang, Guee-Sang Lee et al.

Generic event boundary detection (GEBD) aims to split video into chunks at a broad and diverse set of actions, as humans naturally perceive event boundaries. In this study, we present an approach that considers the correlation between neighboring frames with pyramid feature maps in both spatial and temporal dimensions to construct a framework for localizing generic events in video. The features at multiple spatial dimensions of a pre-trained ResNet-50 are exploited with different views in the temporal dimension to form a temporal pyramid feature map. Based on that, the similarity between neighboring frames is calculated and projected to build a temporal pyramid similarity feature vector. A decoder with 1D convolution operations is used to decode these similarities to a new representation that incorporates their temporal relationship for later boundary score estimation. Extensive experiments conducted on the GEBD benchmark dataset show the effectiveness of our system and its variations, outperforming state-of-the-art approaches. Additional experiments on the TAPOS dataset, which contains long-form videos with Olympic sport actions, demonstrate the effectiveness of our study compared to others.
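The neighbor-frame similarity step can be sketched in a few lines of Python. The abstract does not say which similarity measure is used, so cosine similarity is an assumption here, as are the function names and the `offset` parameter, which stands in loosely for the different temporal views.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def neighbor_similarities(frame_features: Sequence[Sequence[float]],
                          offset: int = 1) -> List[float]:
    """Similarity between each frame and the frame `offset` steps later."""
    if offset < 1:
        raise ValueError("offset must be at least 1")
    return [
        cosine_similarity(frame_features[t], frame_features[t + offset])
        for t in range(len(frame_features) - offset)
    ]


# Hypothetical 2-D features for three frames; the content changes at the last frame.
toy_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_similarities = neighbor_similarities(toy_features)
```

A drop in similarity between consecutive frames is the kind of signal a downstream decoder could turn into a boundary score.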

10.6 CV Mar 24, 2022

An Ensemble Approach for Facial Expression Analysis in Video

Hong-Hai Nguyen, Van-Thong Huynh, Soo-Hyung Kim

Human emotion recognition contributes to the development of human-computer interaction. Machines that understand human emotions in the real world will contribute significantly to life in the future. This paper addresses the Affective Behavior Analysis in-the-wild (ABAW3) 2022 challenge, focusing on valence-arousal estimation and action unit detection. For valence-arousal estimation, we work in two stages: creating new features from multiple models, and temporal learning to predict valence-arousal. In the first stage, a Gated Recurrent Unit (GRU) and a Transformer are combined on top of Regular Network (RegNet) features extracted from the image to produce the new features. In the second stage, a GRU combined with Local Attention predicts valence-arousal. The Concordance Correlation Coefficient (CCC) is used to evaluate the model.
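For reference, the Concordance Correlation Coefficient can be computed as in the following minimal Python sketch. It uses the standard definition with population (biased) variances; the function name and the handling of the degenerate zero-denominator case are assumptions, not taken from the paper.

```python
from typing import Sequence


def concordance_correlation(pred: Sequence[float], target: Sequence[float]) -> float:
    """CCC = 2*cov / (var_pred + var_target + (mean_pred - mean_target)^2)."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0.0:
        # Both sequences are constant and equal; CCC is undefined, so this
        # sketch returns 0.0 by assumption.
        return 0.0
    return 2.0 * cov / denom
```

Unlike the Pearson correlation, CCC penalizes a constant offset: predictions `[2, 3, 4]` against targets `[1, 2, 3]` are perfectly correlated but score 4/7.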

7.6 CV Mar 16, 2023

Vision Transformer for Action Units Detection

Tu Vu, Van Thong Huynh, Soo Hyung Kim

Facial Action Unit (FAU) detection is a fine-grained classification problem that involves identifying different units on the human face, as defined by the Facial Action Coding System. In this paper, we present a simple yet efficient Vision Transformer-based approach for addressing the task of Action Unit (AU) detection in the context of the Affective Behavior Analysis in-the-wild (ABAW) competition. We employ the Video Vision Transformer (ViViT) network to capture temporal facial changes in the video. In addition, to reduce the massive size of the Vision Transformer model, we replace the ViViT feature extraction layers with a CNN backbone (RegNet). Our model outperforms the baseline model of the ABAW 2023 challenge, with a notable 14% difference in results. Furthermore, the achieved results are comparable to those of the top three teams in the previous ABAW 2022 challenge.

8.1 CV Mar 24, 2022

Facial Expression Classification using Fusion of Deep Neural Network in Video for the 3rd ABAW3 Competition

Kim Ngan Phan, Hong-Hai Nguyen, Van-Thong Huynh et al.

For computers to recognize human emotions, expression classification is an equally important problem in the human-computer interaction area. In the 3rd Affective Behavior Analysis In-The-Wild competition, the task of expression classification includes eight classes with six basic expressions of human faces from videos. In this paper, we employ a transformer mechanism to encode the robust representation from the backbone. Fusion of the robust representations plays an important role in the expression classification task. Our approach achieves F1 scores of 30.35% and 28.60% on the validation set and the test set, respectively. This result shows the effectiveness of the proposed architecture on the Aff-Wild2 dataset.

4.7 CV Nov 4, 2019 Code

Eye Semantic Segmentation with a Lightweight Model

Van Thong Huynh, Soo-Hyung Kim, Guee-Sang Lee et al.

In this paper, we present a multi-class eye segmentation method that can run within hardware limitations for real-time inference. Our approach includes three major stages: get a grayscale image from the input, segment three distinct eye regions with a deep network, and remove incorrect areas with heuristic filters. Our model is based on an encoder-decoder structure whose key component is the depthwise convolution operation, which reduces the computation cost. We experiment on OpenEDS, a large-scale dataset of eye images captured by a head-mounted display with two synchronized eye-facing cameras. We achieved a mean intersection over union (mIoU) of 94.85% with a model of size 0.4 megabytes. The source code is available at https://github.com/th2l/Eye_VR_Segmentation
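Why depthwise convolution reduces cost can be seen from a small parameter count, sketched below in Python. The sketch assumes the common depthwise-separable form (a depthwise convolution followed by a 1x1 pointwise convolution) and ignores biases; the abstract does not give the layer configuration, so the function names and channel sizes are hypothetical.

```python
def standard_conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a standard 2D convolution (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel_size * kernel_size * in_channels  # one filter per input channel
    pointwise = in_channels * out_channels               # 1x1 mixing across channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 32 input channels, 64 output channels.
standard = standard_conv_params(3, 32, 64)           # 18432 weights
separable = depthwise_separable_params(3, 32, 64)    # 2336 weights
```

For this hypothetical layer the separable form needs roughly an eighth of the weights, which is consistent with the small model size the abstract reports, though the actual architecture may differ.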

5.2 CV May 13, 2024 Code

Adaptation of Distinct Semantics for Uncertain Areas in Polyp Segmentation

Quang Vinh Nguyen, Van Thong Huynh, Soo-Hyung Kim

Colonoscopy is a common and practical method for detecting and treating polyps. Segmenting polyps from colonoscopy images is useful for diagnosis and surgical progress. Nevertheless, achieving excellent segmentation performance is still difficult because of polyp characteristics such as shape, color, and condition, and because polyps are often hard to distinguish from the surrounding context. This work presents a novel architecture, Adaptation of Distinct Semantics for Uncertain Areas in Polyp Segmentation (ADSNet), which corrects misclassified details and recovers weak features that tend to vanish and go undetected at the final stage. The architecture consists of a complementary trilateral decoder to produce an early global map. A continuous attention module modifies the semantics of high-level features to analyze two separate semantics of the early global map. The proposed method is evaluated on polyp benchmarks for both learning ability and generalization ability. Experimental results demonstrate strong correction and recovery ability, leading to better segmentation performance than other state-of-the-art methods in the polyp image segmentation task. In addition, the proposed architecture can be used flexibly with other CNN-based encoders, Transformer-based encoders, and decoder backbones.

2.3 SP May 1, 2023 Code

Multi-scale Transformer-based Network for Emotion Recognition from Multi Physiological Signals

Tu Vu, Van Thong Huynh, Soo-Hyung Kim

This paper presents an efficient Multi-scale Transformer-based approach for the task of Emotion recognition from Physiological data, which has gained widespread attention in the research community due to the vast amount of information that can be extracted from these signals using modern sensors and machine learning techniques. Our approach involves applying a Multi-modal technique combined with scaling data to establish the relationship between internal body signals and human emotions. Additionally, we utilize Transformer and Gaussian Transformation techniques to improve signal encoding effectiveness and overall performance. Our model achieves decent results on the CASE dataset of the EPiC competition, with an RMSE score of 1.45.

1.4 CV Jun 16, 2021 Code

Temporal Convolution Networks with Positional Encoding for Evoked Expression Estimation

VanThong Huynh, Guee-Sang Lee, Hyung-Jeong Yang et al.

This paper presents an approach for the Evoked Expressions from Videos (EEV) challenge, which aims to predict evoked facial expressions from video. We take advantage of models pre-trained on large-scale datasets in computer vision and audio signals to extract deep representations of timestamps in the video. A temporal convolution network, rather than an RNN-like architecture, is used to explore temporal relationships due to its advantages in memory consumption and parallelism. Furthermore, to address the missing annotations of some timestamps, positional encoding is employed to ensure continuity of input data when discarding these timestamps during training. We achieved state-of-the-art results on the EEV challenge with a Pearson correlation coefficient of 0.05477, the first-ranked performance in the EEV 2021 challenge.
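The idea of keeping temporal continuity after discarding unannotated timestamps can be sketched as follows. This is an illustration only: the abstract does not specify the form of positional encoding, so the sinusoidal encoding, the function names, and the `dim` parameter are assumptions. Each kept timestamp is encoded by its original position rather than its index in the shortened sequence.

```python
import math
from typing import List, Sequence, Tuple


def sinusoidal_encoding(position: int, dim: int) -> List[float]:
    """Sinusoidal positional encoding of one position (assumed form)."""
    encoding = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def encode_kept_timestamps(has_label: Sequence[bool],
                           dim: int = 4) -> List[Tuple[int, List[float]]]:
    """Drop unlabeled timestamps but encode the rest by their original position."""
    return [
        (t, sinusoidal_encoding(t, dim))
        for t, labeled in enumerate(has_label)
        if labeled
    ]


# Hypothetical clip of four timestamps where the middle two lack annotations.
kept = encode_kept_timestamps([True, False, False, True], dim=4)
```

Here the second kept item is encoded as position 3, not position 1, so a downstream model can still tell that two timestamps were skipped between the kept ones.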

9.1 CV Apr 21, 2020 Code

The 1st Agriculture-Vision Challenge: Methods and Results

Mang Tik Chiu, Xingqian Xu, Kai Wang et al.

The first Agriculture-Vision Challenge aims to encourage research in developing novel and effective algorithms for agricultural pattern recognition from aerial images, especially for the semantic segmentation task associated with our challenge dataset. Around 57 participating teams from various countries competed to achieve state-of-the-art results in aerial agriculture semantic segmentation. The Agriculture-Vision Challenge Dataset was employed, which comprises 21,061 aerial and multi-spectral farmland images. This paper provides a summary of notable methods and results in the challenge. Our submission server and leaderboard will remain open to researchers who are interested in this challenge dataset and task; the link can be found here.