Xiaoyi Feng

CV
h-index10
20papers
672citations
Novelty47%
AI Score40

20 Papers

CVSep 13, 2023
Hardening RGB-D Object Recognition Systems against Adversarial Patch Attacks

Yang Zheng, Luca Demetrio, Antonio Emanuele Cinà et al.

RGB-D object recognition systems improve their predictive performances by fusing color and depth information, outperforming neural network architectures that rely solely on colors. While RGB-D systems are expected to be more robust to adversarial examples than RGB-only systems, they have also been proven to be highly vulnerable. Their robustness is similar even when the adversarial examples are generated by altering only the original images' colors. Different works highlighted the vulnerability of RGB-D systems; however, there is a lacking of technical explanations for this weakness. Hence, in our work, we bridge this gap by investigating the learned deep representation of RGB-D systems, discovering that color features make the function learned by the network more complex and, thus, more sensitive to small perturbations. To mitigate this problem, we propose a defense based on a detection mechanism that makes RGB-D systems more robust against adversarial examples. We empirically show that this defense improves the performances of RGB-D systems against adversarial examples even when they are computed ad-hoc to circumvent this detection mechanism, and that is also more effective than adversarial training.

CVMar 17, 2025Code
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding

Xinyu Ma, Ziyang Ding, Zhicong Luo et al.

Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features, a capability that remains underdeveloped in current Multimodal Large Language Models (MLLMs). Despite possessing vast expert-level knowledge, MLLMs struggle to integrate reasoning into visual perception, often generating direct responses without deeper analysis. To bridge this gap, we introduce knowledge-intensive visual grounding (KVG), a novel visual grounding task that requires both fine-grained perception and domain-specific knowledge integration. To address the challenges of KVG, we propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities. Our approach consists of (1) an automated data synthesis pipeline that generates high-quality, knowledge-aligned training samples, and (2) a two-stage training framework combining supervised fine-tuning for cognitive reasoning scaffolding and reinforcement learning to optimize perception-cognition synergy. To benchmark performance, we introduce KVG-Bench a comprehensive dataset spanning 10 domains with 1.3K manually curated test cases. Experimental results demonstrate that DeepPerception significantly outperforms direct fine-tuning, achieving +8.08\% accuracy improvements on KVG-Bench and exhibiting +4.60\% superior cross-domain generalization over baseline approaches. Our findings highlight the importance of integrating cognitive processes into MLLMs for human-like visual perception and open new directions for multimodal reasoning research. The data, codes, and models are released at https://github.com/thunlp/DeepPerception.

CVSep 14, 2023
Facial Kinship Verification from remote photoplethysmography

Xiaoting Wu, Xiaoyi Feng, Constantino Álvarez Casado et al.

Facial Kinship Verification (FKV) aims at automatically determining whether two subjects have a kinship relation based on human faces. It has potential applications in finding missing children and social media analysis. Traditional FKV faces challenges as it is vulnerable to spoof attacks and raises privacy issues. In this paper, we explore for the first time the FKV with vital bio-signals, focusing on remote Photoplethysmography (rPPG). rPPG signals are extracted from facial videos, resulting in a one-dimensional signal that measures the changes in visible light reflection emitted to and detected from the skin caused by the heartbeat. Specifically, in this paper, we employed a straightforward one-dimensional Convolutional Neural Network (1DCNN) with a 1DCNN-Attention module and kinship contrastive loss to learn the kin similarity from rPPGs. The network takes multiple rPPG signals extracted from various facial Regions of Interest (ROIs) as inputs. Additionally, the 1DCNN attention module is designed to learn and capture the discriminative kin features from feature embeddings. Finally, we demonstrate the feasibility of rPPG to detect kinship with the experiment evaluation on the UvANEMO Smile Database from different kin relations.

CVMar 12, 2025
Zero-Shot Subject-Centric Generation for Creative Application Using Entropy Fusion

Kaifeng Zou, Xiaoyi Feng, Peng Wang et al.

Generative models are widely used in visual content creation. However, current text-to-image models often face challenges in practical applications-such as textile pattern design and meme generation-due to the presence of unwanted elements that are difficult to separate with existing methods. Meanwhile, subject-reference generation has emerged as a key research trend, highlighting the need for techniques that can produce clean, high-quality subject images while effectively removing extraneous components. To address this challenge, we introduce a framework for reliable subject-centric image generation. In this work, we propose an entropy-based feature-weighted fusion method to merge the informative cross-attention features obtained from each sampling step of the pretrained text-to-image model FLUX, enabling a precise mask prediction and subject-centric generation. Additionally, we have developed an agent framework based on Large Language Models (LLMs) that translates users' casual inputs into more descriptive prompts, leading to highly detailed image generation. Simultaneously, the agents extract primary elements of prompts to guide the entropy-based feature fusion, ensuring focused primary element generation without extraneous components. Experimental results and user studies demonstrate our methods generates high-quality subject-centric images, outperform existing methods or other possible pipelines, highlighting the effectiveness of our approach.

CVJul 20, 2025
Hybrid-supervised Hypergraph-enhanced Transformer for Micro-gesture Based Emotion Recognition

Zhaoqiang Xia, Hexiang Huang, Haoyu Chen et al.

Micro-gestures are unconsciously performed body gestures that can convey the emotion states of humans and start to attract more research attention in the fields of human behavior understanding and affective computing as an emerging topic. However, the modeling of human emotion based on micro-gestures has not been explored sufficiently. In this work, we propose to recognize the emotion states based on the micro-gestures by reconstructing the behavior patterns with a hypergraph-enhanced Transformer in a hybrid-supervised framework. In the framework, hypergraph Transformer based encoder and decoder are separately designed by stacking the hypergraph-enhanced self-attention and multiscale temporal convolution modules. Especially, to better capture the subtle motion of micro-gestures, we construct a decoder with additional upsampling operations for a reconstruction task in a self-supervised learning manner. We further propose a hypergraph-enhanced self-attention module where the hyperedges between skeleton joints are gradually updated to present the relationships of body joints for modeling the subtle local motion. Lastly, for exploiting the relationship between the emotion states and local motion of micro-gestures, an emotion recognition head from the output of encoder is designed with a shallow architecture and learned in a supervised way. The end-to-end framework is jointly trained in a one-stage way by comprehensively utilizing self-reconstruction and supervision information. The proposed method is evaluated on two publicly available datasets, namely iMiGUE and SMG, and achieves the best performance under multiple metrics, which is superior to the existing methods.

CVJun 3, 2025
LinkTo-Anime: A 2D Animation Optical Flow Dataset from 3D Model Rendering

Xiaoyi Feng, Kaifeng Zou, Caichun Cen et al.

Existing optical flow datasets focus primarily on real-world simulation or synthetic human motion, but few are tailored to Celluloid(cel) anime character motion: a domain with unique visual and motion characteristics. To bridge this gap and facilitate research in optical flow estimation and downstream tasks such as anime video generation and line drawing colorization, we introduce LinkTo-Anime, the first high-quality dataset specifically designed for cel anime character motion generated with 3D model rendering. LinkTo-Anime provides rich annotations including forward and backward optical flow, occlusion masks, and Mixamo Skeleton. The dataset comprises 395 video sequences, totally 24,230 training frames, 720 validation frames, and 4,320 test frames. Furthermore, a comprehensive benchmark is constructed with various optical flow estimation methods to analyze the shortcomings and limitations across multiple datasets.

LGJun 14, 2024
Over-parameterization and Adversarial Robustness in Neural Networks: An Overview and Empirical Analysis

Srishti Gupta, Zhang Chen, Luca Demetrio et al.

Thanks to their extensive capacity, over-parameterized neural networks exhibit superior predictive capabilities and generalization. However, having a large parameter space is considered one of the main suspects of the neural networks' vulnerability to adversarial example -- input samples crafted ad-hoc to induce a desired misclassification. Relevant literature has claimed contradictory remarks in support of and against the robustness of over-parameterized networks. These contradictory findings might be due to the failure of the attack employed to evaluate the networks' robustness. Previous research has demonstrated that depending on the considered model, the algorithm employed to generate adversarial examples may not function properly, leading to overestimating the model's robustness. In this work, we empirically study the robustness of over-parameterized networks against adversarial examples. However, unlike the previous works, we also evaluate the considered attack's reliability to support the results' veracity. Our results show that over-parameterized networks are robust against adversarial attacks as opposed to their under-parameterized counterparts.

LGAug 26, 2021
Why Adversarial Reprogramming Works, When It Fails, and How to Tell the Difference

Yang Zheng, Xiaoyi Feng, Zhaoqiang Xia et al.

Adversarial reprogramming allows repurposing a machine-learning model to perform a different task. For example, a model trained to recognize animals can be reprogrammed to recognize digits by embedding an adversarial program in the digit images provided as input. Recent work has shown that adversarial reprogramming may not only be used to abuse machine-learning models provided as a service, but also beneficially, to improve transfer learning when training data is scarce. However, the factors affecting its success are still largely unexplained. In this work, we develop a first-order linear model of adversarial reprogramming to show that its success inherently depends on the size of the average input gradient, which grows when input gradients are more aligned, and when inputs have higher dimensionality. The results of our experimental analysis, involving fourteen distinct reprogramming tasks, show that the above factors are correlated with the success and the failure of adversarial reprogramming.

CVMay 18, 2021
Non-contact Pain Recognition from Video Sequences with Remote Physiological Measurements Prediction

Ruijing Yang, Ziyu Guan, Zitong Yu et al.

Automatic pain recognition is paramount for medical diagnosis and treatment. The existing works fall into three categories: assessing facial appearance changes, exploiting physiological cues, or fusing them in a multi-modal manner. However, (1) appearance changes are easily affected by subjective factors which impedes objective pain recognition. Besides, the appearance-based approaches ignore long-range spatial-temporal dependencies that are important for modeling expressions over time; (2) the physiological cues are obtained by attaching sensors on human body, which is inconvenient and uncomfortable. In this paper, we present a novel multi-task learning framework which encodes both appearance changes and physiological cues in a non-contact manner for pain recognition. The framework is able to capture both local and long-range dependencies via the proposed attention mechanism for the learned appearance representations, which are further enriched by temporally attended physiological cues (remote photoplethysmography, rPPG) that are recovered from videos in the auxiliary task. This framework is dubbed rPPG-enriched Spatio-Temporal Attention Network (rSTAN) and allows us to establish the state-of-the-art performance of non-contact pain recognition on publicly available pain databases. It demonstrates that rPPG predictions can be used as an auxiliary task to facilitate non-contact automatic pain recognition.

CVJul 7, 2020
Are spoofs from latent fingerprints a real threat for the best state-of-art liveness detectors?

Roberto Casula, Giulia Orrù, Daniele Angioni et al.

We investigated the threat level of realistic attacks using latent fingerprints against sensors equipped with state-of-art liveness detectors and fingerprint verification systems which integrate such liveness algorithms. To the best of our knowledge, only a previous investigation was done with spoofs from latent prints. In this paper, we focus on using snapshot pictures of latent fingerprints. These pictures provide molds, that allows, after some digital processing, to fabricate high-quality spoofs. Taking a snapshot picture is much simpler than developing fingerprints left on a surface by magnetic powders and lifting the trace by a tape. What we are interested here is to evaluate preliminary at which extent attacks of the kind can be considered a real threat for state-of-art fingerprint liveness detectors and verification systems. To this aim, we collected a novel data set of live and spoof images fabricated with snapshot pictures of latent fingerprints. This data set provide a set of attacks at the most favourable conditions. We refer to this method and the related data set as "ScreenSpoof". Then, we tested with it the performances of the best liveness detection algorithms, namely, the three winners of the LivDet competition. Reported results point out that the ScreenSpoof method is a threat of the same level, in terms of detection and verification errors, than that of attacks using spoofs fabricated with the full consensus of the victim. We think that this is a notable result, never reported in previous work.

CVJun 17, 2020
Revealing the Invisible with Model and Data Shrinking for Composite-database Micro-expression Recognition

Zhaoqiang Xia, Wei Peng, Huai-Qian Khor et al.

Composite-database micro-expression recognition is attracting increasing attention as it is more practical to real-world applications. Though the composite database provides more sample diversity for learning good representation models, the important subtle dynamics are prone to disappearing in the domain shift such that the models greatly degrade their performance, especially for deep models. In this paper, we analyze the influence of learning complexity, including the input complexity and model complexity, and discover that the lower-resolution input data and shallower-architecture model are helpful to ease the degradation of deep models in composite-database task. Based on this, we propose a recurrent convolutional network (RCN) to explore the shallower-architecture and lower-resolution input data, shrinking model and input complexities simultaneously. Furthermore, we develop three parameter-free modules (i.e., wide expansion, shortcut connection and attention unit) to integrate with RCN without increasing any learnable parameters. These three modules can enhance the representation ability in various perspectives while preserving not-very-deep architecture for lower-resolution data. Besides, three modules can further be combined by an automatic strategy (a neural architecture search strategy) and the searched architecture becomes more robust. Extensive experiments on MEGC2019 dataset (composited of existing SMIC, CASME II and SAMM datasets) have verified the influence of learning complexity and shown that RCNs with three modules and the searched combination outperform the state-of-the-art approaches.

CVOct 1, 2019
Deep Neural Rejection against Adversarial Examples

Angelo Sotgiu, Ambra Demontis, Marco Melis et al.

Despite the impressive performances reported by deep neural networks in different application domains, they remain largely vulnerable to adversarial examples, i.e., input samples that are carefully perturbed to cause misclassification at test time. In this work, we propose a deep neural rejection mechanism to detect adversarial examples, based on the idea of rejecting samples that exhibit anomalous feature representations at different network layers. With respect to competing approaches, our method does not require generating adversarial examples at training time, and it is less computationally demanding. To properly evaluate our method, we define an adaptive white-box attack that is aware of the defense mechanism and aims to bypass it. Under this worst-case setting, we empirically show that our approach outperforms previously-proposed methods that detect adversarial examples by only analyzing the feature representation provided by the output network layer.

CVJun 24, 2019
Audio-Visual Kinship Verification

Xiaoting Wu, Eric Granger, Xiaoyi Feng

Visual kinship verification entails confirming whether or not two individuals in a given pair of images or videos share a hypothesized kin relation. As a generalized face verification task, visual kinship verification is particularly difficult with low-quality found Internet data. Due to uncontrolled variations in background, pose, facial expression, blur, illumination and occlusion, state-of-the-art methods fail to provide high level of recognition accuracy. As with many other visual recognition tasks, kinship verification may benefit from combining visual and audio signals. However, voice-based kinship verification has received very little prior attention. We hypothesize that the human voice contains kin-related cues that are complementary to visual cues. In this paper, we address, for the first time, the use of audio-visual information from face and voice modalities to perform kinship verification. We first propose a new multi-modal kinship dataset, called TALking KINship (TALKIN), that contains several pairs of Internet-quality video sequences. Using TALKIN, we then study the utility of various kinship verification methods including traditional local feature based methods, statistical methods and more recent deep learning approaches. Then, early and late fusion methods are evaluated on the TALKIN dataset for the study of kinship verification with both face and voice modalities. Finally, we propose a deep Siamese fusion network with contrastive loss for multi-modal fusion of kinship relations. Extensive experiments on the TALKIN dataset indicate that by combining face and voice modalities, the proposed Siamese network can provide a significantly higher level of accuracy compared to baseline uni-modal and multi-modal fusion techniques. Experimental results also indicate that audio (vocal) information is complementary (to facial information) and useful for kinship verification.

CVMar 27, 2019
3D Face Mask Presentation Attack Detection Based on Intrinsic Image Analysis

Lei Li, Zhaoqiang Xia, Xiaoyue Jiang et al.

Face presentation attacks have become a major threat to face recognition systems and many countermeasures have been proposed in the past decade. However, most of them are devoted to 2D face presentation attacks, rather than 3D face masks. Unlike the real face, the 3D face mask is usually made of resin materials and has a smooth surface, resulting in reflectance differences. So, we propose a novel detection method for 3D face mask presentation attack by modeling reflectance differences based on intrinsic image analysis. In the proposed method, the face image is first processed with intrinsic image decomposition to compute its reflectance image. Then, the intensity distribution histograms are extracted from three orthogonal planes to represent the intensity differences of reflectance images between the real face and 3D face mask. After that, the 1D convolutional network is further used to capture the information for describing different materials or surfaces react differently to changes in illumination. Extensive experiments on the 3DMAD database demonstrate the effectiveness of our proposed method in distinguishing a face mask from the real one and show that the detection performance outperforms other state-of-the-art methods.

CVJan 15, 2019
Spatiotemporal Recurrent Convolutional Networks for Recognizing Spontaneous Micro-expressions

Zhaoqiang Xia, Xiaopeng Hong, Xingyu Gao et al.

Recently, the recognition task of spontaneous facial micro-expressions has attracted much attention with its various real-world applications. Plenty of handcrafted or learned features have been employed for a variety of classifiers and achieved promising performances for recognizing micro-expressions. However, the micro-expression recognition is still challenging due to the subtle spatiotemporal changes of micro-expressions. To exploit the merits of deep learning, we propose a novel deep recurrent convolutional networks based micro-expression recognition approach, capturing the spatial-temporal deformations of micro-expression sequence. Specifically, the proposed deep model is constituted of several recurrent convolutional layers for extracting visual features and a classificatory layer for recognition. It is optimized by an end-to-end manner and obviates manual feature design. To handle sequential data, we exploit two types of extending the connectivity of convolutional networks across temporal domain, in which the spatiotemporal deformations are modeled in views of facial appearance and geometry separately. Besides, to overcome the shortcomings of limited and imbalanced training samples, temporal data augmentation strategies as well as a balanced loss are jointly used for our deep network. By performing the experiments on three spontaneous micro-expression datasets, we verify the effectiveness of our proposed micro-expression recognition approach compared to the state-of-the-art methods.

CVOct 31, 2018
Face Presentation Attack Detection in Learned Color-liked Space

Lei Li, Zhaoqiang Xia, Xiaoyue Jiang et al.

Face presentation attack detection (PAD) has become a thorny problem for biometric systems and numerous countermeasures have been proposed to address it. However, majority of them directly extract feature descriptors and distinguish fake faces from the real ones in existing color spaces (e.g. RGB, HSV and YCbCr). Unfortunately, it is unknown for us which color space is the best or how to combine different spaces together. To make matters worse, the real and fake faces are overlapped in existing color spaces. So, in this paper, a learned distinguishable color-liked space is generated to deal with the problem of face PAD. More specifically, we present an end-to-end deep learning network that can map existing color spaces to a new learned color-liked space. Inspired by the generator of generative adversarial network (GAN), the proposed network consists of a space generator and a feature extractor. When training the color-liked space, a new triplet combination mechanism of points-to-center is explored to maximize interclass distance and minimize intraclass distance, and also keep a safe margin between the real and presented fake faces. Extensive experiments on two standard face PAD databases, i.e., Relay-Attack and OULU-NPU, indicate that our proposed color-liked space analysis based countermeasure significantly outperforms the state-of-the-art methods and show excellent generalization capability.

CVAug 14, 2017
Kinship Verification from Videos using Spatio-Temporal Texture Features and Deep Learning

Elhocine Boutellaa, Miguel Bordallo López, Samy Ait-Aoudia et al.

Automatic kinship verification using facial images is a relatively new and challenging research problem in computer vision. It consists in automatically predicting whether two persons have a biological kin relation by examining their facial attributes. While most of the existing works extract shallow handcrafted features from still face images, we approach this problem from spatio-temporal point of view and explore the use of both shallow texture features and deep features for characterizing faces. Promising results, especially those of deep features, are obtained on the benchmark UvA-NEMO Smile database. Our extensive experiments also show the superiority of using videos over still images, hence pointing out the important role of facial dynamics in kinship verification. Furthermore, the fusion of the two types of features (i.e. shallow spatio-temporal texture features and deep features) shows significant performance improvements compared to state-of-the-art methods.

CVAug 7, 2016
Spontaneous Facial Micro-Expression Recognition using Discriminative Spatiotemporal Local Binary Pattern with an Improved Integral Projection

Xiaohua Huang, Sujing Wang, Xin Liu et al.

Recently, there are increasing interests in inferring mirco-expression from facial image sequences. Due to subtle facial movement of micro-expressions, feature extraction has become an important and critical issue for spontaneous facial micro-expression recognition. Recent works usually used spatiotemporal local binary pattern for micro-expression analysis. However, the commonly used spatiotemporal local binary pattern considers dynamic texture information to represent face images while misses the shape attribute of face images. On the other hand, their works extracted the spatiotemporal features from the global face regions, which ignore the discriminative information between two micro-expression classes. The above-mentioned problems seriously limit the application of spatiotemporal local binary pattern on micro-expression recognition. In this paper, we propose a discriminative spatiotemporal local binary pattern based on an improved integral projection to resolve the problems of spatiotemporal local binary pattern for micro-expression recognition. Firstly, we develop an improved integral projection for preserving the shape attribute of micro-expressions. Furthermore, an improved integral projection is incorporated with local binary pattern operators across spatial and temporal domains. Specifically, we extract the novel spatiotemporal features incorporating shape attributes into spatiotemporal texture features. For increasing the discrimination of micro-expressions, we propose a new feature selection based on Laplacian method to extract the discriminative information for facial micro-expression recognition. Intensive experiments are conducted on three availably published micro-expression databases. We compare our method with the state-of-the-art algorithms. Experimental results demonstrate that our proposed method achieves promising performance for micro-expression recognition.

CVJul 12, 2016
Local feature hierarchy for face recognition across pose and illumination

Xiaoyue Jiang, Dong Zhang, Xiaoyi Feng

Even though face recognition in frontal view and normal lighting condition works very well, the performance degenerates sharply in extreme conditions. Recently there are many work dealing with pose and illumination problems, respectively. However both the lighting and pose variation will always be encountered at the same time. Accordingly we propose an end-to-end face recognition method to deal with pose and illumination simultaneously based on convolutional networks where the discriminative nonlinear features that are invariant to pose and illumination are extracted. Normally the global structure for images taken in different views is quite diverse. Therefore we propose to use the 1*1 convolutional kernel to extract the local features. Furthermore the parallel multi-stream multi-layer 1*1 convolution network is developed to extract multi-hierarchy features. In the experiments we obtained the average face recognition rate of 96.9% on multiPIE dataset,which improves the state-of-the-art of face recognition across poses and illumination by 7.5%. Especially for profile-wise positions, the average recognition rate of our proposed network is 97.8%, which increases the state-of-the-art recognition rate by 19%.

CVJan 31, 2016
Unsupervised Deep Hashing for Large-scale Visual Search

Zhaoqiang Xia, Xiaoyi Feng, Jinye Peng et al.

Learning based hashing plays a pivotal role in large-scale visual search. However, most existing hashing algorithms tend to learn shallow models that do not seek representative binary codes. In this paper, we propose a novel hashing approach based on unsupervised deep learning to hierarchically transform features into hash codes. Within the heterogeneous deep hashing framework, the autoencoder layers with specific constraints are considered to model the nonlinear mapping between features and binary codes. Then, a Restricted Boltzmann Machine (RBM) layer with constraints is utilized to reduce the dimension in the hamming space. Extensive experiments on the problem of visual search demonstrate the competitiveness of our proposed approach compared to state-of-the-art.