Han-Seok Seo

CV
5 papers
224 citations
Novelty 52%
AI Score 32

5 Papers

CV · Apr 6, 2023 · Code
Micron-BERT: BERT-based Facial Micro-Expression Recognition

Xuan-Bac Nguyen, Chi Nhan Duong, Xin Li et al.

Micro-expression recognition is one of the most challenging topics in affective computing. It aims to recognize tiny facial movements that occur within a brief period, i.e., 0.25 to 0.5 seconds, and are difficult for humans to perceive. Recent advances in pre-training deep Bidirectional Transformers (BERT) have significantly improved self-supervised learning tasks in computer vision. However, the standard BERT in vision problems is designed to learn only from full images or videos, and the architecture cannot accurately detect details of facial micro-expressions. This paper presents Micron-BERT (μ-BERT), a novel approach to facial micro-expression recognition. The proposed method can automatically capture these movements in an unsupervised manner based on two key ideas. First, we employ Diagonal Micro-Attention (DMA) to detect tiny differences between two frames. Second, we introduce a new Patch of Interest (PoI) module to localize and highlight micro-expression interest regions while reducing noisy backgrounds and distractions. By incorporating these components into an end-to-end deep network, the proposed μ-BERT significantly outperforms all previous work in various micro-expression tasks. μ-BERT can be trained on a large-scale unlabeled dataset, i.e., up to 8 million images, and achieves high accuracy on new unseen facial micro-expression datasets. Empirical experiments show that μ-BERT consistently outperforms state-of-the-art methods on four micro-expression benchmarks, including SAMM, CASME II, SMIC, and CASME3, by significant margins. Code will be available at https://github.com/uark-cviu/Micron-BERT

CV · Mar 19, 2022
DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition

Thanh-Dat Truong, Quoc-Huy Bui, Chi Nhan Duong et al.

Human action recognition has recently become one of the popular research topics in the computer vision community. Various 3D-CNN based methods have been presented to tackle both the spatial and temporal dimensions in the task of video action recognition with competitive results. However, these methods suffer from some fundamental limitations such as a lack of robustness and generalization, e.g., how does the temporal ordering of video frames affect the recognition results? This work presents a novel end-to-end Transformer-based Directed Attention (DirecFormer) framework for robust action recognition. The method takes a simple but novel Transformer-based perspective on understanding the right order of action sequences. The contributions of this work are three-fold. Firstly, we introduce the problem of ordered temporal learning to action recognition. Secondly, a new Directed Attention mechanism is introduced to understand and attend to human actions in the right order. Thirdly, we introduce the conditional dependency in action sequence modeling that includes orders and classes. The proposed approach consistently achieves state-of-the-art (SOTA) results compared with recent action recognition methods on three standard large-scale benchmarks, i.e., Jester, Kinetics-400, and Something-Something-V2.

CV · Mar 6, 2023
SPARTAN: Self-supervised Spatiotemporal Transformers Approach to Group Activity Recognition

Naga VS Raviteja Chappa, Pha Nguyen, Alexander H Nelson et al.

In this paper, we propose a new, simple, and effective Self-supervised Spatio-temporal Transformers (SPARTAN) approach to Group Activity Recognition (GAR) using unlabeled video data. Given a video, we create local and global spatio-temporal views with varying spatial patch sizes and frame rates. The proposed self-supervised objective aims to match the features of these contrasting views of the same video so that they are consistent across variations in the spatio-temporal domain. To the best of our knowledge, the proposed mechanism is one of the first to alleviate the weakly supervised setting of GAR using the encoders of video transformers. Furthermore, by taking advantage of transformer models, our proposed approach supports long-term relationship modeling along spatio-temporal dimensions. The proposed SPARTAN approach performs well on two group activity recognition benchmarks, the NBA and Volleyball datasets, surpassing state-of-the-art results by a significant margin in terms of MCA and MPCA metrics.
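The idea of local and global views with different frame rates and spatial extents can be illustrated with a small sampling routine. This is only a minimal sketch, not the authors' implementation: the function name `sample_view`, its parameters, and the concrete clip lengths, strides, and crop sizes are all hypothetical, and a random crop stands in for the "varying spatial patch sizes" mentioned in the abstract.

```python
import random


def sample_view(num_frames, clip_len, stride, frame_hw, crop_hw, rng):
    """Pick frame indices and a crop box for one spatio-temporal view.

    Hypothetical helper for illustration only; the abstract does not
    specify how views are actually sampled.
    """
    # Number of source frames covered by clip_len frames at this stride.
    span = (clip_len - 1) * stride + 1
    if span > num_frames:
        raise ValueError("video too short for this clip_len/stride")
    frame_h, frame_w = frame_hw
    crop_h, crop_w = crop_hw
    if crop_h > frame_h or crop_w > frame_w:
        raise ValueError("crop larger than frame")

    # Temporal sampling: a larger stride corresponds to a lower frame rate.
    start = rng.randint(0, num_frames - span)
    indices = [start + i * stride for i in range(clip_len)]

    # Spatial sampling: a random crop of the requested size.
    top = rng.randint(0, frame_h - crop_h)
    left = rng.randint(0, frame_w - crop_w)
    return indices, (top, left, crop_h, crop_w)


rng = random.Random(0)
# Assumed settings: the global view covers most of the clip at full
# resolution; the local view is a shorter, denser, smaller crop.
global_view = sample_view(64, clip_len=16, stride=4,
                          frame_hw=(224, 224), crop_hw=(224, 224), rng=rng)
local_view = sample_view(64, clip_len=8, stride=2,
                         frame_hw=(224, 224), crop_hw=(96, 96), rng=rng)
```

In a self-supervised setup of this kind, both views would be encoded and their features encouraged to agree; the encoder and the matching loss are outside the scope of this sketch.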

CV · Apr 27, 2023
SoGAR: Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition

Naga VS Raviteja Chappa, Pha Nguyen, Alexander H Nelson et al.

This paper introduces a novel approach to Social Group Activity Recognition (SoGAR) using a self-supervised Transformer network that can effectively utilize unlabeled video data. To extract spatio-temporal information, we create local and global views with varying frame rates. Our self-supervised objective ensures that features extracted from contrasting views of the same video are consistent across spatio-temporal domains. Our proposed approach is efficient in using transformer-based encoders to alleviate the weakly supervised setting of group activity recognition. By leveraging the benefits of transformer models, our approach can model long-term relationships along spatio-temporal dimensions. Our proposed SoGAR method achieves state-of-the-art results on three group activity recognition benchmarks, namely the JRDB-PAR, NBA, and Volleyball datasets, surpassing the current numbers in terms of F1-score, MCA, and MPCA metrics.
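The abstract does not expand MCA and MPCA. Assuming the reading common in the group activity recognition literature (MCA as overall multi-class classification accuracy, MPCA as the mean of per-class accuracies), the two metrics could be computed as in the sketch below. The function name `mca_mpca` is hypothetical and this is not the benchmarks' official evaluation code.

```python
from collections import defaultdict


def mca_mpca(y_true, y_pred):
    """Return (MCA, MPCA) under the assumed definitions.

    MCA: fraction of all samples classified correctly.
    MPCA: accuracy computed per ground-truth class, then averaged,
    so that rare classes count as much as frequent ones.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    correct = defaultdict(int)  # correct predictions per true class
    total = defaultdict(int)    # samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)

    mca = sum(correct.values()) / len(y_true)
    mpca = sum(correct[c] / total[c] for c in total) / len(total)
    return mca, mpca


# Imbalanced toy example: predicting the majority class everywhere
# gives a high MCA but a much lower MPCA.
mca, mpca = mca_mpca([0, 0, 0, 1], [0, 0, 0, 0])
print(mca, mpca)  # 0.75 0.5
```

The gap between the two numbers in the toy example shows why papers on class-imbalanced datasets such as NBA tend to report both.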

CV · Sep 6, 2024
A Novel Dataset for Video-Based Neurodivergent Classification Leveraging Extra-Stimulatory Behavior

Manuel Serna-Aguilera, Xuan Bac Nguyen, Han-Seok Seo et al.

Facial expressions and actions in response to external stimuli differ among individuals in their degree of intensity, particularly among those who are neurodivergent. Such behaviors affect people in terms of overall health, communication, and sensory processing. Deep learning can be responsibly leveraged to improve productivity in addressing this task and to help medical professionals accurately understand such behaviors. In this work, we introduce the Video ASD dataset, a dataset that contains video frame convolutional and attention map feature data, to foster further progress in the task of ASD classification. Unlike many recent studies in ASD classification with MRI data, which require expensive specialized equipment, our method utilizes a powerful but relatively affordable GPU, a standard computer setup, and a video camera for inference. Results show that our model effectively generalizes and understands key differences in the distinct movements of the children. Additionally, we test foundation models on this data to showcase how movement noise affects performance and the need for more data and more complex labels.