29.2 | CV | Jun 3
Coarse-to-fine Hierarchical Architecture with Sequential Mamba for Brain Reconstruction
Hoang-Son Vo, Van-Hung Bui, Minh-Huy Mai-Duc et al.
Understanding the relationship between deep visual representations and the human visual system is a fundamental challenge in computational neuroscience. While modern vision models achieve strong performance in image recognition, their correspondence with the hierarchical organization of the human visual cortex remains an open question. In this study, we propose CHASMBrain, a novel hierarchical two-stage framework for image-to-fMRI encoding. Our architecture leverages a dual-stream Mamba design to explicitly separate and process global semantic tokens and local spatial patches, motivated by the functional organization of the visual cortex. A coarse-to-fine strategy is employed: Stage 1 predicts denoised ROI-level activations, while Stage 2 refines these coarse responses into full voxel-level predictions using a Mamba-VAE. Experiments on the Natural Scenes Dataset (NSD) demonstrate that our method achieves a Pearson correlation of 0.429 and an MSE of 0.261, outperforming all evaluated baselines including ridge regression and DINOv2 linear probes. Beyond predictive performance, causal branch-ablation experiments reveal an asymmetric specialization: the patch stream is specifically locked to early visual cortex (retinotopic regions), while the CLS stream contributes broader semantic context to higher-order areas -- a correspondence that holds causally, not merely correlationally. Cross-subject transfer experiments further show that the learned backbone generalizes across individuals with minimal per-subject adaptation, suggesting the model captures a shared, subject-agnostic visual representation.
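The two reported metrics (Pearson correlation and MSE between predicted and measured responses) can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and the abstract does not say how scores are aggregated across voxels or subjects, so the sketch scores a single flat list of values.

```python
import math

def pearson(pred, target):
    """Pearson correlation between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)

def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Hypothetical toy values standing in for one voxel's responses.
predicted = [1.0, 2.0, 3.0]
measured = [2.0, 4.0, 6.0]
r = pearson(predicted, measured)
err = mse(predicted, measured)
```

Pearson correlation ignores scale and offset, while MSE penalizes both, which is presumably why the abstract reports the two together.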
CV | Mar 11, 2023
CoNIC Challenge: Pushing the Frontiers of Nuclear Detection, Segmentation, Classification and Counting
Simon Graham, Quoc Dang Vu, Mostafa Jahanifar et al.
Nuclear detection, segmentation and morphometric profiling are essential in helping us further understand the relationship between histology and patient outcome. To drive innovation in this area, we set up a community-wide challenge using the largest available dataset of its kind to assess nuclear segmentation and cellular composition. Our challenge, named CoNIC, stimulated the development of reproducible algorithms for cellular recognition with real-time result inspection on public leaderboards. We conducted an extensive post-challenge analysis based on the top-performing models using 1,658 whole-slide images of colon tissue. With around 700 million detected nuclei per model, associated features were used for dysplasia grading and survival analysis, where we demonstrated that the challenge's improvement over the previous state-of-the-art led to significant boosts in downstream performance. Our findings also suggest that eosinophils and neutrophils play an important role in the tumour microenvironment. We release challenge models and WSI-level results to foster the development of further methods for biomarker discovery.
MM | Jul 31, 2023
DCTM: Dilated Convolutional Transformer Model for Multimodal Engagement Estimation in Conversation
Vu Ngoc Tu, Van Thong Huynh, Hyung-Jeong Yang et al.
Conversational engagement estimation is posed as a regression problem, entailing the identification of the favorable attention and involvement of the participants in the conversation. This task is crucial for gaining insight into human interaction dynamics and behavior patterns within a conversation. In this research, we introduce a dilated convolutional Transformer for modeling and estimating human engagement in the MULTIMEDIATE 2023 competition. Our proposed system surpasses the baseline models, with a 7% improvement on the test set and 4% on the validation set. Moreover, we evaluate different modality fusion mechanisms and show that, for this type of data, a simple concatenation method with self-attention fusion achieves the best performance.
SP | Jul 23, 2023
Mental Workload Estimation with Electroencephalogram Signals by Combining Multi-Space Deep Models
Hong-Hai Nguyen, Ngumimi Karen Iyortsuun, Seungwon Kim et al.
The human brain remains continuously active, whether an individual is working or at rest. Mental activity is a daily process, and if the brain becomes excessively active, a state known as overload, it can adversely affect human health. Recently, advancements in early prediction of mental health conditions have emerged, aiming to prevent serious consequences and enhance the overall quality of life. Consequently, the estimation of mental status has garnered significant attention from researchers due to its potential benefits. While various signals are employed to assess mental state, the electroencephalogram, which contains extensive information about the brain, is widely used. In this paper, we categorize mental workload into three states (low, middle, and high) and also estimate a continuum of mental workload levels. Our method leverages information from multiple spatial dimensions to achieve optimal results in mental workload estimation. In the time domain, we employ Temporal Convolutional Networks. In the frequency domain, we introduce a novel architecture based on combining residual blocks, termed the Multi-Dimensional Residual Block. Integrating these two domains yields significantly better results than the individual estimates in each domain. Our approach achieved 74.98% accuracy in the three-class classification, surpassing the 69.00% baseline result provided with the data. In particular, our method is effective at estimating continuous levels, as evidenced by a Concordance Correlation Coefficient (CCC) of 0.629. The combination of time and frequency domain analysis in our approach highlights the potential to improve healthcare applications in the future.
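For readers unfamiliar with the Concordance Correlation Coefficient used to score the continuous estimates, the standard definition (Lin's CCC) is 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). A minimal sketch follows; the function name and toy values are illustrative, and the paper may aggregate per recording or per subject in a way the abstract does not state.

```python
def ccc(pred, target):
    """Lin's Concordance Correlation Coefficient (population variances)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)

# Perfectly correlated but shifted predictions are penalized by the
# (mean_p - mean_t)^2 term, unlike with Pearson correlation.
shifted = ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

Unlike Pearson correlation, CCC drops below 1 when predictions are biased or mis-scaled, which makes it a stricter measure of agreement for regression outputs.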
CV | Jul 22, 2022
An Ensemble Approach for Multiple Emotion Descriptors Estimation Using Multi-task Learning
Irfan Haider, Minh-Trieu Tran, Soo-Hyung Kim et al.
This paper describes our submission to the Multi-Task Learning Challenge of the fourth Affective Behavior Analysis in-the-Wild (ABAW) Competition. Instead of using only face information, we use the full information from the provided dataset, which contains the face and the context around it. We use the InceptionNet V3 model to extract deep features, then apply an attention mechanism to refine them. The refined features are passed to a transformer block and multi-layer perceptron networks to produce the multiple emotion outputs. Our model predicts arousal and valence, classifies the emotional expression, and estimates the action units simultaneously. The proposed system achieves a performance of 0.917 on the MTL Challenge validation dataset.
CV | Jan 11, 2023
Generic Event Boundary Detection in Video with Pyramid Features
Van Thong Huynh, Hyung-Jeong Yang, Guee-Sang Lee et al.
Generic event boundary detection (GEBD) aims to split video into chunks at a broad and diverse set of actions as humans naturally perceive event boundaries. In this study, we present an approach that considers the correlation between neighbor frames with pyramid feature maps in both spatial and temporal dimensions to construct a framework for localizing generic events in video. The features at multiple spatial dimensions of a pre-trained ResNet-50 are exploited with different views in the temporal dimension to form a temporal pyramid feature map. Based on that, the similarity between neighbor frames is calculated and projected to build a temporal pyramid similarity feature vector. A decoder with 1D convolution operations is used to decode these similarities to a new representation that incorporates their temporal relationship for later boundary score estimation. Extensive experiments conducted on the GEBD benchmark dataset show the effectiveness of our system and its variations, in which we outperformed the state-of-the-art approaches. Additional experiments on TAPOS dataset, which contains long-form videos with Olympic sport actions, demonstrated the effectiveness of our study compared to others.
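The neighbor-frame similarity step can be illustrated with a small sketch. The abstract does not name the similarity measure, so cosine similarity is an assumption here, and the function names and toy feature vectors are hypothetical; the pyramid construction and the 1D-convolution decoder are not shown.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def neighbor_similarity(frame_features):
    """Similarity between each frame and the next one in time."""
    return [
        cosine(frame_features[i], frame_features[i + 1])
        for i in range(len(frame_features) - 1)
    ]

# Toy per-frame features: the content changes between frames 2 and 3,
# so the similarity drops there, hinting at a candidate boundary.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sims = neighbor_similarity(features)
```

In a pipeline of this kind, a dip in the similarity sequence marks a candidate event boundary that a learned decoder would then score.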
CV | Mar 24, 2022
An Ensemble Approach for Facial Expression Analysis in Video
Hong-Hai Nguyen, Van-Thong Huynh, Soo-Hyung Kim
Human emotion recognition contributes to the development of human-computer interaction. Machines that understand human emotions in the real world will contribute significantly to life in the future. This paper introduces the Affective Behavior Analysis in-the-wild (ABAW3) 2022 challenge and focuses on the problems of valence-arousal estimation and action unit detection. For valence-arousal estimation, we proceed in two stages: creating new features from multiple models, and temporal learning to predict valence-arousal. First, we create new features by combining a Gated Recurrent Unit (GRU) and a Transformer on top of Regular Networks (RegNet) features extracted from the image. Next, a GRU combined with Local Attention predicts valence-arousal. The Concordance Correlation Coefficient (CCC) is used to evaluate the model.
CV | Mar 24, 2022
Facial Expression Classification using Fusion of Deep Neural Network in Video for the 3rd ABAW3 Competition
Kim Ngan Phan, Hong-Hai Nguyen, Van-Thong Huynh et al.
For computers to recognize human emotions, expression classification is an important problem in the human-computer interaction area. In the 3rd Affective Behavior Analysis In-The-Wild competition, the expression classification task covers eight classes, including the six basic expressions of human faces, from videos. In this paper, we employ a transformer mechanism to encode a robust representation from the backbone. Fusion of the robust representations plays an important role in the expression classification task. Our approach achieves F1 scores of 30.35% and 28.60% on the validation set and the test set, respectively. This result shows the effectiveness of the proposed architecture on the Aff-Wild2 dataset.
CV | Sep 8, 2024
Transformer with Leveraged Masked Autoencoder for video-based Pain Assessment
Minh-Duc Nguyen, Hyung-Jeong Yang, Soo-Hyung Kim et al.
Accurate pain assessment is crucial in healthcare for effective diagnosis and treatment; however, traditional methods relying on self-reporting are inadequate for populations unable to communicate their pain. Cutting-edge AI is promising for supporting clinicians in pain recognition using facial video data. In this paper, we enhance pain recognition by employing facial video analysis within a Transformer-based deep learning model. By combining a powerful Masked Autoencoder with a Transformer-based classifier, our model effectively captures pain level indicators through both expressions and micro-expressions. We conducted our experiments on the AI4Pain dataset and obtained promising results that pave the way for innovative healthcare solutions that are both comprehensive and objective.
IV | Mar 4, 2022
MF-Hovernet: An Extension of Hovernet for Colon Nuclei Identification and Counting (CoNiC) Challenge
Vi Thi-Tuong Vo, Soo-Hyung Kim, Taebum Lee
Nuclei identification and counting is the most important morphological feature of cancers, especially in the colon. Many deep learning-based methods have been proposed to deal with this problem. In this work, we address it with MF-Hovernet, an extension of Hovernet for nuclei identification and counting. Our proposed model combines a multiple filter block with the Hovernet architecture. Current results show that the multiple filter block improves the performance of the original Hovernet model.
CV | Sep 8, 2024
Leveraging WaveNet for Dynamic Listening Head Modeling from Speech
Minh-Duc Nguyen, Hyung-Jeong Yang, Seung-Won Kim et al.
The creation of listener facial responses aims to simulate interactive communication feedback from a listener during a face-to-face conversation. Our goal is to generate believable videos of listeners' heads that respond authentically to a single speaker, using a sequence-to-sequence model that combines WaveNet with a long short-term memory network. Our approach focuses on capturing the subtle nuances of listener feedback, ensuring the preservation of individual listener identity while expressing appropriate attitudes and viewpoints. Experimental results show that our method surpasses the baseline models on the ViCo benchmark dataset.
CV | Jul 17, 2025 | Code
ATL-Diff: Audio-Driven Talking Head Generation with Early Landmarks-Guide Noise Diffusion
Hoang-Son Vo, Quang-Vinh Nguyen, Seungwon Kim et al.
Audio-driven talking head generation requires precise synchronization between facial animations and audio signals. This paper introduces ATL-Diff, a novel approach addressing synchronization limitations while reducing noise and computational costs. Our framework features three key components: a Landmark Generation Module converting audio to facial landmarks, a Landmarks-Guide Noise approach that decouples audio by distributing noise according to landmarks, and a 3D Identity Diffusion network preserving identity characteristics. Experiments on MEAD and CREMA-D datasets demonstrate that ATL-Diff outperforms state-of-the-art methods across all metrics. Our approach achieves near real-time processing with high-quality animations, computational efficiency, and exceptional preservation of facial nuances. This advancement offers promising applications for virtual assistants, education, medical communication, and digital platforms. The source code is available at: https://github.com/sonvth/ATL-Diff
CV | May 12, 2025 | Code
Anatomical Attention Alignment representation for Radiology Report Generation
Quang Vinh Nguyen, Minh Duc Nguyen, Thanh Hoang Son Vo et al.
Automated radiology report generation (RRG) aims at producing detailed descriptions of medical images, reducing radiologists' workload and improving access to high-quality diagnostic services. Existing encoder-decoder models rely only on visual features extracted from raw input images, which can limit the understanding of spatial structures and semantic relationships, often resulting in suboptimal text generation. To address this, we propose the Anatomical Attention Alignment Network (A3Net), a framework that enhances visual-textual understanding by constructing hyper-visual representations. Our approach integrates a knowledge dictionary of anatomical structures with patch-level visual features, enabling the model to effectively associate image regions with their corresponding anatomical entities. This structured representation improves semantic reasoning, interpretability, and cross-modal alignment, ultimately enhancing the accuracy and clinical relevance of generated reports. Experimental results on IU X-Ray and MIMIC-CXR datasets demonstrate that A3Net significantly improves both visual perception and text generation quality. Our code is available at https://github.com/Vinh-AI/A3Net.
CV | Nov 4, 2019 | Code
Eye Semantic Segmentation with a Lightweight Model
Van Thong Huynh, Soo-Hyung Kim, Guee-Sang Lee et al.
In this paper, we present a multi-class eye segmentation method that can run within hardware limitations for real-time inference. Our approach includes three major stages: obtain a grayscale image from the input, segment three distinct eye regions with a deep network, and remove incorrect areas with heuristic filters. Our model is based on an encoder-decoder structure whose key component is the depthwise convolution operation, which reduces the computation cost. We experiment on OpenEDS, a large-scale dataset of eye images captured by a head-mounted display with two synchronized eye-facing cameras. We achieved a mean intersection over union (mIoU) of 94.85% with a model of size 0.4 megabytes. The source code is available at https://github.com/th2l/Eye_VR_Segmentation
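The mIoU metric reported above can be sketched as follows. This is an illustrative implementation over flattened label lists, not the OpenEDS evaluation script: the function name is hypothetical, and skipping classes that are absent from both prediction and ground truth is an assumption about how empty classes are handled.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union across classes, on flat label lists."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: skip classes absent from both inputs
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy 4-pixel example with two classes (hypothetical labels).
predicted_labels = [0, 0, 1, 1]
true_labels = [0, 1, 1, 1]
score = mean_iou(predicted_labels, true_labels, num_classes=2)
```

Averaging per-class IoU rather than pooling all pixels keeps small regions from being swamped by large ones such as the background.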
CV | Sep 9, 2024
KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation
Hoang-Son Vo-Thanh, Quang-Vinh Nguyen, Soo-Hyung Kim
Audio-driven talking face generation is a widely researched topic due to its high applicability. Reconstructing a talking face from audio contributes significantly to fields such as education, healthcare, online conversations, virtual assistants, and virtual reality. Early studies often focused solely on changing the mouth movements, which resulted in outcomes with limited practical applications. Recently, researchers have proposed a new approach that constructs the entire face, including face pose, neck, and shoulders. To achieve this, generation has to go through facial landmarks. However, creating stable landmarks that align well with the audio is a challenge. In this paper, we propose the KFusion of Dual-Domain model, a robust model that generates landmarks from audio. We separate the audio into two distinct domains to learn emotional information and facial context, then use a fusion mechanism based on the KAN model. Our model demonstrates high efficiency compared to recent models. This lays the groundwork for future development of audio-driven talking face generation.
CV | May 13, 2024
Adaptation of Distinct Semantics for Uncertain Areas in Polyp Segmentation
Quang Vinh Nguyen, Van Thong Huynh, Soo-Hyung Kim
Colonoscopy is a common and practical method for detecting and treating polyps. Segmenting polyps from colonoscopy images is useful for diagnosis and surgery progress. Nevertheless, achieving excellent segmentation performance is still difficult because of polyp characteristics such as shape, color, and condition, and because polyps are hard to distinguish from the surrounding context. This work presents a novel architecture, Adaptation of Distinct Semantics for Uncertain Areas in Polyp Segmentation (ADSNet), which corrects misclassified details and recovers weak features that tend to vanish and go undetected at the final stage. The architecture consists of a complementary trilateral decoder that produces an early global map, and a continuous attention module that modifies the semantics of high-level features to analyze two separate semantics of the early global map. The proposed method is evaluated on polyp benchmarks for both learning ability and generalization ability. Experimental results demonstrate strong correction and recovery ability, leading to better segmentation performance than other state-of-the-art methods in polyp image segmentation. In addition, the proposed architecture can be flexibly combined with other CNN-based encoders, Transformer-based encoders, and decoder backbones.
LG | May 12, 2025
Latent Behavior Diffusion for Sequential Reaction Generation in Dyadic Setting
Minh-Duc Nguyen, Hyung-Jeong Yang, Soo-Hyung Kim et al.
The dyadic reaction generation task involves synthesizing responsive facial reactions that align closely with the behaviors of a conversational partner, enhancing the naturalness and effectiveness of human-like interaction simulations. This paper introduces a novel approach, the Latent Behavior Diffusion Model, comprising a context-aware autoencoder and a diffusion-based conditional generator that addresses the challenge of generating diverse and contextually relevant facial reactions from input speaker behaviors. The autoencoder compresses high-dimensional input features, capturing dynamic patterns in listener reactions while condensing complex input data into a concise latent representation, facilitating more expressive and contextually appropriate reaction synthesis. The diffusion-based conditional generator operates on the latent space generated by the autoencoder to predict realistic facial reactions in a non-autoregressive manner. This approach allows for generating diverse facial reactions that reflect subtle variations in conversational cues and emotional states. Experimental results demonstrate the effectiveness of our approach in achieving superior performance in dyadic reaction synthesis tasks compared to existing methods.
CV | Nov 19, 2024
Rethinking Top Probability from Multi-view for Distracted Driver Behaviour Localization
Quang Vinh Nguyen, Vo Hoang Thanh Son, Chau Truong Vinh Hoang et al.
The naturalistic driving action localization task aims to recognize and comprehend human behaviors and actions from video data captured during real-world driving scenarios. Previous studies have shown strong action localization performance by applying a recognition model followed by probability-based post-processing. Nevertheless, the probabilities provided by the recognition model frequently contain confusing information, which poses a challenge for post-processing. In this work, we adopt an action recognition model based on self-supervised learning to detect distracted activities and produce candidate action probabilities. Subsequently, a constraint ensemble strategy takes advantage of multi-camera views to provide robust predictions. Finally, we introduce a conditional post-processing operation to locate distracted behaviours and temporal action boundaries precisely. On test set A2, our method obtains sixth position on the public leaderboard of track 3 of the 2024 AI City Challenge.
SP | May 1, 2023
Multi-scale Transformer-based Network for Emotion Recognition from Multi Physiological Signals
Tu Vu, Van Thong Huynh, Soo-Hyung Kim
This paper presents an efficient Multi-scale Transformer-based approach for the task of Emotion recognition from Physiological data, which has gained widespread attention in the research community due to the vast amount of information that can be extracted from these signals using modern sensors and machine learning techniques. Our approach involves applying a Multi-modal technique combined with scaling data to establish the relationship between internal body signals and human emotions. Additionally, we utilize Transformer and Gaussian Transformation techniques to improve signal encoding effectiveness and overall performance. Our model achieves decent results on the CASE dataset of the EPiC competition, with an RMSE score of 1.45.
CV | Jul 9, 2021
Emotion Recognition with Incomplete Labels Using Modified Multi-task Learning Technique
Phan Tran Dac Thinh, Hoang Manh Hung, Hyung-Jeong Yang et al.
The task of predicting affective information in the wild, such as the seven basic emotions or action units from human faces, has gradually attracted more interest due to the accessibility and availability of massive annotated datasets. In this study, we propose a method that exploits the association between the seven basic emotions and twelve action units from the AffWild2 dataset. The method, based on the ResNet50 architecture, applies multi-task learning to the incomplete labels of the two tasks. By combining the knowledge from the two correlated tasks, the performance on both improves by a large margin compared with a model that uses only one kind of label.
CV | Jul 28, 2020
Variants of BERT, Random Forests and SVM approach for Multimodal Emotion-Target Sub-challenge
Hoang Manh Hung, Hyung-Jeong Yang, Soo-Hyung Kim et al.
Emotion recognition has become a major problem in computer vision in recent years, and researchers have made considerable effort to overcome the difficulties of this task. In the field of affective computing, emotion recognition has a wide range of applications, such as healthcare, robotics, and human-computer interaction. Due to its practical importance for other tasks, many techniques and approaches have been investigated for different problems and various data sources. Nevertheless, comprehensive fusion of the audio-visual and language modalities to benefit from all of them is still an open problem. In this paper, we present and discuss our classification methodology for the MuSe-Topic Sub-challenge, as well as the data and results. For topic classification, we ensemble two language models, ALBERT and RoBERTa, to predict 10 classes of topics. For the classification of valence and arousal, SVM and Random Forests are employed in conjunction with feature selection to enhance performance.
CV | Apr 21, 2020
The 1st Agriculture-Vision Challenge: Methods and Results
Mang Tik Chiu, Xingqian Xu, Kai Wang et al.
The first Agriculture-Vision Challenge aims to encourage research in developing novel and effective algorithms for agricultural pattern recognition from aerial images, especially for the semantic segmentation task associated with our challenge dataset. Around 57 participating teams from various countries competed to achieve state-of-the-art results in aerial agriculture semantic segmentation. The Agriculture-Vision Challenge Dataset was employed, which comprises 21,061 aerial and multi-spectral farmland images. This paper provides a summary of notable methods and results in the challenge. Our submission server and leaderboard will remain open for researchers who are interested in this challenge dataset and task; the link can be found here.
CV | Feb 21, 2020
Affective Expression Analysis in-the-wild using Multi-Task Temporal Statistical Deep Learning Model
Nhu-Tai Do, Tram-Tran Nguyen-Quynh, Soo-Hyung Kim
Affective behavior analysis plays an important role in human-computer interaction, customer marketing, and health monitoring. The ABAW Challenge and the Aff-Wild2 dataset raise new challenges for classifying basic emotions and regressing valence-arousal values in in-the-wild environments. In this paper, we present an affective expression analysis model that deals with these challenges. Our approach includes STAT and Temporal modules for further fine-tuning the face feature model. We experimented on the Aff-Wild2 dataset, a large-scale dataset for the ABAW Challenge with annotations for both categorical and valence-arousal emotion. We achieved an expression score of 0.543 and a valence-arousal score of 0.534 on the validation set.
CV | Jan 29, 2019
Automatic Whole-body Bone Age Assessment Using Deep Hierarchical Features
Hai-Duong Nguyen, Soo-Hyung Kim
Bone age assessment gives us evidence to analyze children's growth status and the rejuvenation involving chronological and biological ages. All previous works consider a left-hand X-ray image of a child. In this paper, we carry out a study on estimating human age using whole-body bone CT images and a novel convolutional neural network. Our model with additional connections shows an effective way to generate a massive number of vital features while reducing the influence of overfitting on small training data in medical image analysis. A dataset and a comparison with common deep architectures will be provided for future research in this field.
CV | Jan 11, 2019
Hand Segmentation and Fingertip Tracking from Depth Camera Images Using Deep Convolutional Neural Network and Multi-task SegNet
Duong Hai Nguyen, Tai Nhu Do, In-Seop Na et al.
Hand segmentation and fingertip detection play an indispensable role in hand gesture-based human-machine interaction systems. In this study, we propose a method to discriminate hand components and to locate fingertips in RGB-D images. The system consists of three main steps: hand detection using RGB images providing regions which are considered as promising areas for further processing, hand segmentation, and fingertip detection using depth image and our modified SegNet, a single lightweight architecture that can process two independent tasks at the same time. The experimental results show that our system is a promising method for hand segmentation and fingertip detection which achieves a comparable performance while model complexity is suitable for real-time applications.