CVOct 23, 2022Code
Holistic Interaction Transformer Network for Action DetectionGueter Josmy Faure, Min-Hung Chen, Shang-Hong Lai · microsoft-research, nvidia
Actions are about how we interact with the environment, including other people, objects, and ourselves. In this paper, we propose a novel multi-modal Holistic Interaction Transformer Network (HIT) that leverages the largely ignored, but critical hand and pose information essential to most human actions. The proposed "HIT" network is a comprehensive bi-modal framework that comprises an RGB stream and a pose stream. Each of them separately models person, object, and hand interactions. Within each sub-network, an Intra-Modality Aggregation module (IMA) is introduced that selectively merges individual interaction units. The resulting features from each modality are then glued using an Attentive Fusion Mechanism (AFM). Finally, we extract cues from the temporal context to better classify the occurring actions using cached memory. Our method significantly outperforms previous approaches on the J-HMDB, UCF101-24, and MultiSports datasets. We also achieve competitive results on AVA. The code will be available at https://github.com/joslefaure/HIT.
CVApr 10, 2023Code
Interaction-Aware Prompting for Zero-Shot Spatio-Temporal Action DetectionWei-Jhe Huang, Jheng-Hsien Yeh, Min-Hung Chen et al. · microsoft-research, nvidia
The goal of spatial-temporal action detection is to determine the time and place where each person's action occurs in a video and classify the corresponding action category. Most of the existing methods adopt fully-supervised learning, which requires a large amount of training data, making it very difficult to achieve zero-shot learning. In this paper, we propose to utilize a pre-trained visual-language model to extract the representative image and text features, and model the relationship between these features through different interaction modules to obtain the interaction feature. In addition, we use this feature to prompt each label to obtain more appropriate text features. Finally, we calculate the similarity between the interaction feature and the text feature for each label to determine the action category. Our experiments on J-HMDB and UCF101-24 datasets demonstrate that the proposed interaction module and prompting make the visual-language features better aligned, thus achieving excellent accuracy for zero-shot spatio-temporal action detection. The code will be available at https://github.com/webber2933/iCLIP.
CVMar 27, 2022
PatchNet: A Simple Face Anti-Spoofing Framework via Fine-Grained Patch RecognitionChien-Yi Wang, Yu-Ding Lu, Shang-Ta Yang et al. · microsoft-research
Face anti-spoofing (FAS) plays a critical role in securing face recognition systems from different presentation attacks. Previous works leverage auxiliary pixel-level supervision and domain generalization approaches to address unseen spoof types. However, the local characteristics of image captures, i.e., capturing devices and presenting materials, are ignored in existing works and we argue that such information is required for networks to discriminate between live and spoof images. In this work, we propose PatchNet which reformulates face anti-spoofing as a fine-grained patch-type recognition problem. To be specific, our framework recognizes the combination of capturing devices and presenting materials based on the patches cropped from non-distorted face images. This reformulation can largely improve the data variation and enforce the network to learn discriminative feature from local capture patterns. In addition, to further improve the generalization ability of the spoof feature, we propose the novel Asymmetric Margin-based Classification Loss and Self-supervised Similarity Loss to regularize the patch embedding space. Our experimental results verify our assumption and show that the model is capable of recognizing unseen spoof types robustly by only looking at local regions. Moreover, the fine-grained and patch-level reformulation of FAS outperforms the existing approaches on intra-dataset, cross-dataset, and domain generalization benchmarks. Furthermore, our PatchNet framework can enable practical applications like Few-Shot Reference-based FAS and facilitate future exploration of spoof-related intrinsic cues.
CVAug 25, 2023
ReST: A Reconfigurable Spatial-Temporal Graph Model for Multi-Camera Multi-Object TrackingCheng-Che Cheng, Min-Xuan Qiu, Chen-Kuo Chiang et al. · microsoft-research
Multi-Camera Multi-Object Tracking (MC-MOT) utilizes information from multiple views to better handle problems with occlusion and crowded scenes. Recently, the use of graph-based approaches to solve tracking problems has become very popular. However, many current graph-based methods do not effectively utilize information regarding spatial and temporal consistency. Instead, they rely on single-camera trackers as input, which are prone to fragmentation and ID switch errors. In this paper, we propose a novel reconfigurable graph model that first associates all detected objects across cameras spatially before reconfiguring it into a temporal graph for Temporal Association. This two-stage association approach enables us to extract robust spatial and temporal-aware features and address the problem with fragmented tracklets. Furthermore, our model is designed for online tracking, making it suitable for real-world applications. Experimental results show that the proposed graph model is able to extract more discriminating features for object tracking, and our model achieves state-of-the-art performance on several public datasets.
IVApr 1, 2022
Extremely Low-light Image Enhancement with Scene Text RestorationPohao Hsu, Che-Tsung Lin, Chun Chet Ng et al. · microsoft-research
Deep learning-based methods have made impressive progress in enhancing extremely low-light images - the image quality of the reconstructed images has generally improved. However, we found out that most of these methods could not sufficiently recover the image details, for instance, the texts in the scene. In this paper, a novel image enhancement framework is proposed to precisely restore the scene texts, as well as the overall quality of the image simultaneously under extremely low-light images conditions. Mainly, we employed a self-regularised attention map, an edge map, and a novel text detection loss. In addition, leveraging synthetic low-light images is beneficial for image enhancement on the genuine ones in terms of text detection. The quantitative and qualitative experimental results have shown that the proposed model outperforms state-of-the-art methods in image restoration, text detection, and text spotting on See In the Dark and ICDAR15 datasets.
CVMar 27, 2022
Local-Adaptive Face Recognition via Graph-based Meta-Clustering and Regularized AdaptationWenbin Zhu, Chien-Yi Wang, Kuan-Lun Tseng et al. · microsoft-research
Due to the rising concern of data privacy, it's reasonable to assume the local client data can't be transferred to a centralized server, nor their associated identity label is provided. To support continuous learning and fill the last-mile quality gap, we introduce a new problem setup called Local-Adaptive Face Recognition (LaFR). Leveraging the environment-specific local data after the deployment of the initial global model, LaFR aims at getting optimal performance by training local-adapted models automatically and un-supervisely, as opposed to fixing their initial global model. We achieve this by a newly proposed embedding cluster model based on Graph Convolution Network (GCN), which is trained via meta-optimization procedure. Compared with previous works, our meta-clustering model can generalize well in unseen local environments. With the pseudo identity labels from the clustering results, we further introduce novel regularization techniques to improve the model adaptation performance. Extensive experiments on racial and internal sensor adaptation demonstrate that our proposed solution is more effective for adapting face recognition models in each specific environment. Meanwhile, we show that LaFR can further improve the global model by a simple federated aggregation over the updated local models.
CVNov 28, 2022
MixFairFace: Towards Ultimate Fairness via MixFair Adapter in Face RecognitionFu-En Wang, Chien-Yi Wang, Min Sun et al. · microsoft-research
Although significant progress has been made in face recognition, demographic bias still exists in face recognition systems. For instance, it usually happens that the face recognition performance for a certain demographic group is lower than the others. In this paper, we propose MixFairFace framework to improve the fairness in face recognition models. First of all, we argue that the commonly used attribute-based fairness metric is not appropriate for face recognition. A face recognition system can only be considered fair while every person has a close performance. Hence, we propose a new evaluation protocol to fairly evaluate the fairness performance of different approaches. Different from previous approaches that require sensitive attribute labels such as race and gender for reducing the demographic bias, we aim at addressing the identity bias in face representation, i.e., the performance inconsistency between different identities, without the need for sensitive attribute labels. To this end, we propose MixFair Adapter to determine and reduce the identity bias of training samples. Our extensive experiments demonstrate that our MixFairFace approach achieves state-of-the-art fairness performance on all benchmark datasets.
CVApr 10, 2023
Kinship Representation Learning with Face Componential RelationWeng-Tai Su, Min-Hung Chen, Chien-Yi Wang et al. · microsoft-research, nvidia
Kinship recognition aims to determine whether the subjects in two facial images are kin or non-kin, which is an emerging and challenging problem. However, most previous methods focus on heuristic designs without considering the spatial correlation between face images. In this paper, we aim to learn discriminative kinship representations embedded with the relation information between face components (e.g., eyes, nose, etc.). To achieve this goal, we propose the Face Componential Relation Network, which learns the relationship between face components among images with a cross-attention mechanism, which automatically learns the important facial regions for kinship recognition. Moreover, we propose Face Componential Relation Network (FaCoRNet), which adapts the loss function by the guidance from cross-attention to learn more discriminative feature representations. The proposed FaCoRNet outperforms previous state-of-the-art methods by large margins for the largest public kinship recognition FIW benchmark.
CVNov 29, 2022
Generalized Face Anti-Spoofing via Multi-Task Learning and One-Side Meta Triplet LossChu-Chun Chuang, Chien-Yi Wang, Shang-Hong Lai · microsoft-research
With the increasing variations of face presentation attacks, model generalization becomes an essential challenge for a practical face anti-spoofing system. This paper presents a generalized face anti-spoofing framework that consists of three tasks: depth estimation, face parsing, and live/spoof classification. With the pixel-wise supervision from the face parsing and depth estimation tasks, the regularized features can better distinguish spoof faces. While simulating domain shift with meta-learning techniques, the proposed one-side triplet loss can further improve the generalization capability by a large margin. Extensive experiments on four public datasets demonstrate that the proposed framework and training strategies are more effective than previous works for model generalization to unseen domains.
CVJun 25, 2023
A Closer Look at Geometric Temporal Dynamics for Face Anti-SpoofingChih-Jung Chang, Yaw-Chern Lee, Shih-Hsuan Yao et al. · microsoft-research, nvidia
Face anti-spoofing (FAS) is indispensable for a face recognition system. Many texture-driven countermeasures were developed against presentation attacks (PAs), but the performance against unseen domains or unseen spoofing types is still unsatisfactory. Instead of exhaustively collecting all the spoofing variations and making binary decisions of live/spoof, we offer a new perspective on the FAS task to distinguish between normal and abnormal movements of live and spoof presentations. We propose Geometry-Aware Interaction Network (GAIN), which exploits dense facial landmarks with spatio-temporal graph convolutional network (ST-GCN) to establish a more interpretable and modularized FAS model. Additionally, with our cross-attention feature interaction mechanism, GAIN can be easily integrated with other existing methods to significantly boost performance. Our approach achieves state-of-the-art performance in the standard intra- and cross-dataset evaluations. Moreover, our model outperforms state-of-the-art methods by a large margin in the cross-dataset cross-type protocol on CASIA-SURF 3DMask (+10.26% higher AUC score), exhibiting strong robustness against domain shifts and unseen spoofing types.
CVAug 28, 2024
CSAD: Unsupervised Component Segmentation for Logical Anomaly DetectionYu-Hsuan Hsieh, Shang-Hong Lai
To improve logical anomaly detection, some previous works have integrated segmentation techniques with conventional anomaly detection methods. Although these methods are effective, they frequently lead to unsatisfactory segmentation results and require manual annotations. To address these drawbacks, we develop an unsupervised component segmentation technique that leverages foundation models to autonomously generate training labels for a lightweight segmentation network without human labeling. Integrating this new segmentation technique with our proposed Patch Histogram module and the Local-Global Student-Teacher (LGST) module, we achieve a detection AUROC of 95.3% in the MVTec LOCO AD dataset, which surpasses previous SOTA methods. Furthermore, our proposed method provides lower latency and higher throughput than most existing approaches.
CVAug 30, 2024
HERMES: temporal-coHERent long-forM understanding with Episodes and SemanticsGueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen et al.
Long-form video understanding presents unique challenges that extend beyond traditional short-video analysis approaches, particularly in capturing long-range dependencies, processing redundant information efficiently, and extracting high-level semantic concepts. To address these challenges, we propose a novel approach that more accurately reflects human cognition. This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, featuring two versatile modules that can enhance existing video-language models or operate as a standalone system. Our Episodic COmpressor (ECO) efficiently aggregates representations from micro to semi-macro levels, reducing computational overhead while preserving temporal dependencies. Our Semantics ReTRiever (SeTR) enriches these representations with semantic information by focusing on broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. We demonstrate that these modules can be seamlessly integrated into existing SOTA models, consistently improving their performance while reducing inference latency by up to 43% and memory usage by 46%. As a standalone system, HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings.
CVDec 23, 2021Code
FedFR: Joint Optimization Federated Framework for Generic and Personalized Face RecognitionChih-Ting Liu, Chien-Yi Wang, Shao-Yi Chien et al.
Current state-of-the-art deep learning based face recognition (FR) models require a large number of face identities for central training. However, due to the growing privacy awareness, it is prohibited to access the face images on user devices to continually improve face recognition models. Federated Learning (FL) is a technique to address the privacy issue, which can collaboratively optimize the model without sharing the data between clients. In this work, we propose a FL based framework called FedFR to improve the generic face representation in a privacy-aware manner. Besides, the framework jointly optimizes personalized models for the corresponding clients via the proposed Decoupled Feature Customization module. The client-specific personalized model can serve the need of optimized face recognition experience for registered identities at the local device. To the best of our knowledge, we are the first to explore the personalized face recognition in FL setup. The proposed framework is validated to be superior to previous approaches on several generic and personalized face recognition benchmarks with diverse FL scenarios. The source codes and our proposed personalized FR benchmark under FL setup are available at https://github.com/jackie840129/FedFR.
CVAug 28, 2024
Spatio-Temporal Context Prompting for Zero-Shot Action DetectionWei-Jhe Huang, Min-Hung Chen, Shang-Hong Lai
Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works aim to enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully-supervised learning, and the current limitation lies in the lack of generalization capability to recognize unseen action categories. In this paper, we aim to adapt the pretrained image-language models to detect unseen actions. To this end, we propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. Meanwhile, our Context Prompting module will utilize contextual information to prompt labels, thereby enhancing the generation of more representative text features. Moreover, to address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism which employs pretrained visual knowledge to find each person's interest context tokens, and then these tokens will be used for prompting to generate text features tailored to each individual. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on J-HMDB, UCF101-24, and AVA datasets. The experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos, bringing it closer to real-world applications. The code and data can be found in https://webber2933.github.io/ST-CLIP-project-page.
CVSep 19, 2023
KFC: Kinship Verification with Fair Contrastive Loss and Multi-Task LearningJia Luo Peng, Keng Wei Chang, Shang-Hong Lai
Kinship verification is an emerging task in computer vision with multiple potential applications. However, there's no large enough kinship dataset to train a representative and robust model, which is a limitation for achieving better performance. Moreover, face verification is known to exhibit bias, which has not been dealt with by previous kinship verification works and sometimes even results in serious issues. So we first combine existing kinship datasets and label each identity with the correct race in order to take race information into consideration and provide a larger and complete dataset, called KinRace dataset. Secondly, we propose a multi-task learning model structure with attention module to enhance accuracy, which surpasses state-of-the-art performance. Lastly, our fairness-aware contrastive loss function with adversarial learning greatly mitigates racial bias. We introduce a debias term into traditional contrastive loss and implement gradient reverse in race classification task, which is an innovative idea to mix two fairness methods to alleviate bias. Exhaustive experimental evaluation demonstrates the effectiveness and superior performance of the proposed KFC in both standard deviation and accuracy at the same time.
CVNov 10, 2025
VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language ModelsYing Cheng, Yu-Ho Lin, Min-Hung Chen et al.
Video anomaly understanding (VAU) aims to provide detailed interpretation and semantic comprehension of anomalous events within videos, addressing limitations of traditional methods that focus solely on detecting and localizing anomalies. However, existing approaches often neglect the deeper causal relationships and interactions between objects, which are critical for understanding anomalous behaviors. In this paper, we propose VADER, an LLM-driven framework for Video Anomaly unDErstanding, which integrates keyframe object Relation features with visual cues to enhance anomaly comprehension from video. Specifically, VADER first applies an Anomaly Scorer to assign per-frame anomaly scores, followed by a Context-AwarE Sampling (CAES) strategy to capture the causal context of each anomalous event. A Relation Feature Extractor and a COntrastive Relation Encoder (CORE) jointly model dynamic object interactions, producing compact relational representations for downstream reasoning. These visual and relational cues are integrated with LLMs to generate detailed, causally grounded descriptions and support robust anomaly-related question answering. Experiments on multiple real-world VAU benchmarks demonstrate that VADER achieves strong results across anomaly description, explanation, and causal reasoning tasks, advancing the frontier of explainable video anomaly analysis.
CLAug 26, 2025
MovieCORE: COgnitive REasoning in MoviesGueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh et al.
This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.
CVDec 30, 2024
KeyGS: A Keyframe-Centric Gaussian Splatting Method for Monocular Image SequencesKeng-Wei Chang, Zi-Ming Wang, Shang-Hong Lai
Reconstructing high-quality 3D models from sparse 2D images has garnered significant attention in computer vision. Recently, 3D Gaussian Splatting (3DGS) has gained prominence due to its explicit representation with efficient training speed and real-time rendering capabilities. However, existing methods still heavily depend on accurate camera poses for reconstruction. Although some recent approaches attempt to train 3DGS models without the Structure-from-Motion (SfM) preprocessing from monocular video datasets, these methods suffer from prolonged training times, making them impractical for many applications. In this paper, we present an efficient framework that operates without any depth or matching model. Our approach initially uses SfM to quickly obtain rough camera poses within seconds, and then refines these poses by leveraging the dense representation in 3DGS. This framework effectively addresses the issue of long training times. Additionally, we integrate the densification process with joint refinement and propose a coarse-to-fine frequency-aware densification to reconstruct different levels of details. This approach prevents camera pose estimation from being trapped in local minima or drifting due to high-frequency signals. Our method significantly reduces training time from hours to minutes while achieving more accurate novel view synthesis and camera pose estimation compared to previous methods.
CVDec 15, 2023
TAB: Text-Align Anomaly Backbone Model for Industrial Inspection TasksHo-Weng Lee, Shang-Hong Lai
In recent years, the focus on anomaly detection and localization in industrial inspection tasks has intensified. While existing studies have demonstrated impressive outcomes, they often rely heavily on extensive training datasets or robust features extracted from pre-trained models trained on diverse datasets like ImageNet. In this work, we propose a novel framework leveraging the visual-linguistic CLIP model to adeptly train a backbone model tailored to the manufacturing domain. Our approach concurrently considers visual and text-aligned embedding spaces for normal and abnormal conditions. The resulting pre-trained backbone markedly enhances performance in industrial downstream tasks, particularly in anomaly detection and localization. Notably, this improvement is substantiated through experiments conducted on multiple datasets such as MVTecAD, BTAD, and KSDD2. Furthermore, using our pre-trained backbone weights allows previous works to achieve superior performance in few-shot scenarios with less training data. The proposed anomaly backbone provides a foundation model for more precise anomaly detection and localization.
CVDec 22, 2021
High-Accuracy RGB-D Face Recognition via Segmentation-Aware Face Depth Estimation and Mask-Guided Attention NetworkMeng-Tzu Chiu, Hsun-Ying Cheng, Chien-Yi Wang et al.
Deep learning approaches have achieved highly accurate face recognition by training the models with very large face image datasets. Unlike the availability of large 2D face image datasets, there is a lack of large 3D face datasets available to the public. Existing public 3D face datasets were usually collected with few subjects, leading to the over-fitting problem. This paper proposes two CNN models to improve the RGB-D face recognition task. The first is a segmentation-aware depth estimation network, called DepthNet, which estimates depth maps from RGB face images by including semantic segmentation information for more accurate face region localization. The other is a novel mask-guided RGB-D face recognition model that contains an RGB recognition branch, a depth map recognition branch, and an auxiliary segmentation mask branch with a spatial attention module. Our DepthNet is used to augment a large 2D face image dataset to a large RGB-D face dataset, which is used for training an accurate RGB-D face recognition model. Furthermore, the proposed mask-guided RGB-D face recognition model can fully exploit the depth map and segmentation mask information and is more robust against pose variation than previous methods. Our experimental results show that DepthNet can produce more reliable depth maps from face images with the segmentation mask. Our mask-guided face recognition model outperforms state-of-the-art methods on several public 3D face datasets.
CVOct 18, 2021
Disentangled Representation with Dual-stage Feature Learning for Face Anti-spoofingYu-Chun Wang, Chien-Yi Wang, Shang-Hong Lai
As face recognition is widely used in diverse security-critical applications, the study of face anti-spoofing (FAS) has attracted more and more attention. Several FAS methods have achieved promising performances if the attack types in the testing data are the same as training data, while the performance significantly degrades for unseen attack types. It is essential to learn more generalized and discriminative features to prevent overfitting to pre-defined spoof attack types. This paper proposes a novel dual-stage disentangled representation learning method that can efficiently untangle spoof-related features from irrelevant ones. Unlike previous FAS disentanglement works with one-stage architecture, we found that the dual-stage training design can improve the training stability and effectively encode the features to detect unseen attack types. Our experiments show that the proposed method provides superior accuracy than the state-of-the-art methods on several cross-type FAS benchmarks.
MMAug 25, 2020
ByeGlassesGAN: Identity Preserving Eyeglasses Removal for Face ImagesYu-Hui Lee, Shang-Hong Lai
In this paper, we propose a novel image-to-image GAN framework for eyeglasses removal, called ByeGlassesGAN, which is used to automatically detect the position of eyeglasses and then remove them from face images. Our ByeGlassesGAN consists of an encoder, a face decoder, and a segmentation decoder. The encoder is responsible for extracting information from the source face image, and the face decoder utilizes this information to generate glasses-removed images. The segmentation decoder is included to predict the segmentation mask of eyeglasses and completed face region. The feature vectors generated by the segmentation decoder are shared with the face decoder, which facilitates better reconstruction results. Our experiments show that ByeGlassesGAN can provide visually appealing results in the eyeglasses-removed face images even for semi-transparent color eyeglasses or glasses with glare. Furthermore, we demonstrate significant improvement in face recognition accuracy for face images with glasses by applying our method as a pre-processing step in our face recognition experiment.
CVAug 11, 2020
Unified Representation Learning for Cross Model CompatibilityChien-Yi Wang, Ya-Liang Chang, Shang-Ta Yang et al.
We propose a unified representation learning framework to address the Cross Model Compatibility (CMC) problem in the context of visual search applications. Cross compatibility between different embedding models enables the visual search systems to correctly recognize and retrieve identities without re-encoding user images, which are usually not available due to privacy concerns. While there are existing approaches to address CMC in face identification, they fail to work in a more challenging setting where the distributions of embedding models shift drastically. The proposed solution improves CMC performance by introducing a light-weight Residual Bottleneck Transformation (RBT) module and a new training scheme to optimize the embedding spaces. Extensive experiments demonstrate that our proposed solution outperforms previous approaches by a large margin for various challenging visual search scenarios of face recognition and person re-identification.