Xuemei Xie

CV
h-index23
16papers
476citations
Novelty48%
AI Score44

16 Papers

CVAug 20, 2022
Learning Primitive-aware Discriminative Representations for Few-shot Learning

Jianpeng Yang, Yuhang Niu, Xuemei Xie et al.

Few-shot learning (FSL) aims to learn a classifier that can be easily adapted to recognize novel classes with only a few labeled examples. Some recent work about FSL has yielded promising classification performance, where the image-level feature is used to calculate the similarity among samples for classification. However, the image-level feature ignores abundant fine-grained and structural in-formation of objects that may be transferable and consistent between seen and unseen classes. How can humans easily identify novel classes with several sam-ples? Some study from cognitive science argues that humans can recognize novel categories through primitives. Although base and novel categories are non-overlapping, they can share some primitives in common. Inspired by above re-search, we propose a Primitive Mining and Reasoning Network (PMRN) to learn primitive-aware representations based on metric-based FSL model. Concretely, we first add Self-supervision Jigsaw task (SSJ) for feature extractor parallelly, guiding the model to encode visual pattern corresponding to object parts into fea-ture channels. To further mine discriminative representations, an Adaptive Chan-nel Grouping (ACG) method is applied to cluster and weight spatially and se-mantically related visual patterns to generate a group of visual primitives. To fur-ther enhance the discriminability and transferability of primitives, we propose a visual primitive Correlation Reasoning Network (CRN) based on graph convolu-tional network to learn abundant structural information and internal correlation among primitives. Finally, a primitive-level metric is conducted for classification in a meta-task based on episodic training strategy. Extensive experiments show that our method achieves state-of-the-art results on six standard benchmarks.

CVJun 3, 2025Code
Learning Pyramid-structured Long-range Dependencies for 3D Human Pose Estimation

Mingjie Wei, Xuemei Xie, Yutong Zhong et al.

Action coordination in human structure is indispensable for the spatial constraints of 2D joints to recover 3D pose. Usually, action coordination is represented as a long-range dependence among body parts. However, there are two main challenges in modeling long-range dependencies. First, joints should not only be constrained by other individual joints but also be modulated by the body parts. Second, existing methods make networks deeper to learn dependencies between non-linked parts. They introduce uncorrelated noise and increase the model size. In this paper, we utilize a pyramid structure to better learn potential long-range dependencies. It can capture the correlation across joints and groups, which complements the context of the human sub-structure. In an effective cross-scale way, it captures the pyramid-structured long-range dependence. Specifically, we propose a novel Pyramid Graph Attention (PGA) module to capture long-range cross-scale dependencies. It concatenates information from various scales into a compact sequence, and then computes the correlation between scales in parallel. Combining PGA with graph convolution modules, we develop a Pyramid Graph Transformer (PGFormer) for 3D human pose estimation, which is a lightweight multi-scale transformer architecture. It encapsulates human sub-structures into self-attention by pooling. Extensive experiments show that our approach achieves lower error and smaller model size than state-of-the-art methods on Human3.6M and MPI-INF-3DHP datasets. The code is available at https://github.com/MingjieWe/PGFormer.

CVJan 19, 2025Code
Refinement Module based on Parse Graph for Human Pose Estimation

Shibang Liu, Xuemei Xie

Parse graphs have been widely used in Human Pose Estimation (HPE) to model the hierarchical structure and context relations of the human body. However, such methods often suffer from parameter redundancy. More importantly, they rely on predefined network structures, which limits their use in other methods. To address these issues, we propose a new context relation and hierarchical structure modeling module, RMPG (Refinement Module based on Parse Graph). RMPG adaptively refines feature maps through recursive top-down decomposition of feature maps and bottom-up composition of sub-node feature maps with context information. Through recursive hierarchical composition, RMPG fuses local details and global semantics into more structured feature representations, accompanied by context information, thereby improving the accuracy of joint inference. RMPG can be flexibly embedded as a plug-in into various mainstream HPE networks. Moreover, by supervising sub-node features map, RMPG learns the context relations and hierarchical structure between different body parts with fewer parameters. Extensive experiments show that RMPG improves performance across different architectures while effectively modeling hierarchical and context relations of the human body with fewer parameters. The RMPG code can be found at https://github.com/lushbng/RMPG.

CVSep 15, 2017Code
Feature-Fused SSD: Fast Detection for Small Objects

Guimei Cao, Xuemei Xie, Wenzhe Yang et al.

Small objects detection is a challenging task in computer vision due to its limited resolution and information. In order to solve this problem, the majority of existing methods sacrifice speed for improvement in accuracy. In this paper, we aim to detect small objects at a fast speed, using the best object detector Single Shot Multibox Detector (SSD) with respect to accuracy-vs-speed trade-off as base architecture. We propose a multi-level feature fusion method for introducing contextual information in SSD, in order to improve the accuracy for small objects. In detailed fusion operation, we design two feature fusion modules, concatenation module and element-sum module, different in the way of adding contextual information. Experimental results show that these two fusion modules obtain higher mAP on PASCALVOC2007 than baseline SSD by 1.6 and 1.7 points respectively, especially with 2-3 points improvement on some smallobjects categories. The testing speed of them is 43 and 40 FPS respectively, superior to the state of the art Deconvolutional single shot detector (DSSD) by 29.4 and 26.4 FPS. Code is available at https://github.com/wnzhyee/Feature-Fused-SSD. Keywords: small object detection, feature fusion, real-time, single shot multi-box detector

ROApr 1, 2025
Visual Environment-Interactive Planning for Embodied Complex-Question Answering

Ning Lan, Baoshan Ou, Xuemei Xie et al.

This study focuses on Embodied Complex-Question Answering task, which means the embodied robot need to understand human questions with intricate structures and abstract semantics. The core of this task lies in making appropriate plans based on the perception of the visual environment. Existing methods often generate plans in a once-for-all manner, i.e., one-step planning. Such approach rely on large models, without sufficient understanding of the environment. Considering multi-step planning, the framework for formulating plans in a sequential manner is proposed in this paper. To ensure the ability of our framework to tackle complex questions, we create a structured semantic space, where hierarchical visual perception and chain expression of the question essence can achieve iterative interaction. This space makes sequential task planning possible. Within the framework, we first parse human natural language based on a visual hierarchical scene graph, which can clarify the intention of the question. Then, we incorporate external rules to make a plan for current step, weakening the reliance on large models. Every plan is generated based on feedback from visual perception, with multiple rounds of interaction until an answer is obtained. This approach enables continuous feedback and adjustment, allowing the robot to optimize its action strategy. To test our framework, we contribute a new dataset with more complex questions. Experimental results demonstrate that our approach performs excellently and stably on complex tasks. And also, the feasibility of our approach in real-world scenarios has been established, indicating its practical applicability.

CVSep 9, 2025
Parse Graph-Based Visual-Language Interaction for Human Pose Estimation

Shibang Liu, Xuemei Xie, Guangming Shi

Parse graphs boost human pose estimation (HPE) by integrating context and hierarchies, yet prior work mostly focuses on single modality modeling, ignoring the potential of multimodal fusion. Notably, language offers rich HPE priors like spatial relations for occluded scenes, but existing visual-language fusion via global feature integration weakens occluded region responses and causes alignment and location failures. To address this issue, we propose Parse Graph-based Visual-Language interaction (PGVL) with a core novel Guided Module (GM). In PGVL, low-level nodes focus on local features, maximizing the maintenance of responses in occluded areas and high-level nodes integrate global features to infer occluded or invisible parts. GM enables high semantic nodes to guide the feature update of low semantic nodes that have undergone cross attention. It ensuring effective fusion of diverse information. PGVL includes top-down decomposition and bottom-up composition. In the first stage, modality specific parse graphs are constructed. Next stage. recursive bidirectional cross-attention is used, purified by GM. We also design network based on PGVL. The PGVL and our network is validated on major pose estimation datasets. We will release the code soon.

CVJul 26, 2025
KB-DMGen: Knowledge-Based Global Guidance and Dynamic Pose Masking for Human Image Generation

Shibang Liu, Xuemei Xie, Guangming Shi

Recent methods using diffusion models have made significant progress in Human Image Generation (HIG) with various control signals such as pose priors. In HIG, both accurate human poses and coherent visual quality are crucial for image generation. However, most existing methods mainly focus on pose accuracy while neglecting overall image quality, often improving pose alignment at the cost of image quality. To address this, we propose Knowledge-Based Global Guidance and Dynamic pose Masking for human image Generation (KB-DMGen). The Knowledge Base (KB), implemented as a visual codebook, provides coarse, global guidance based on input text-related visual features, improving pose accuracy while maintaining image quality, while the Dynamic pose Mask (DM) offers fine-grained local control to enhance precise pose accuracy. By injecting KB and DM at different stages of the diffusion process, our framework enhances pose accuracy through both global and local control without compromising image quality. Experiments demonstrate the effectiveness of KB-DMGen, achieving new state-of-the-art results in terms of AP and CAP on the HumanArt dataset. The project page and code are available at https://lushbng.github.io/KBDMGen.

CVMar 14, 2025
ACMo: Attribute Controllable Motion Generation

Mingjie Wei, Xuemei Xie, Guangming Shi

Attributes such as style, fine-grained text, and trajectory are specific conditions for describing motion. However, existing methods often lack precise user control over motion attributes and suffer from limited generalizability to unseen motions. This work introduces an Attribute Controllable Motion generation architecture, to address these challenges via decouple any conditions and control them separately. Firstly, we explored the Attribute Diffusion Model to imporve text-to-motion performance via decouple text and motion learning, as the controllable model relies heavily on the pre-trained model. Then, we introduce Motion Adpater to quickly finetune previously unseen motion patterns. Its motion prompts inputs achieve multimodal text-to-motion generation that captures user-specified styles. Finally, we propose a LLM Planner to bridge the gap between unseen attributes and dataset-specific texts via local knowledage for user-friendly interaction. Our approach introduces the capability for motion prompts for stylize generation, enabling fine-grained and user-friendly attribute control while providing performance comparable to state-of-the-art methods. Project page: https://mjwei3d.github.io/ACMo/

CVDec 16, 2020
Temporal Graph Modeling for Skeleton-based Action Recognition

Jianan Li, Xuemei Xie, Zhifu Zhao et al.

Graph Convolutional Networks (GCNs), which model skeleton data as graphs, have obtained remarkable performance for skeleton-based action recognition. Particularly, the temporal dynamic of skeleton sequence conveys significant information in the recognition task. For temporal dynamic modeling, GCN-based methods only stack multi-layer 1D local convolutions to extract temporal relations between adjacent time steps. With the repeat of a lot of local convolutions, the key temporal information with non-adjacent temporal distance may be ignored due to the information dilution. Therefore, these methods still remain unclear how to fully explore temporal dynamic of skeleton sequence. In this paper, we propose a Temporal Enhanced Graph Convolutional Network (TE-GCN) to tackle this limitation. The proposed TE-GCN constructs temporal relation graph to capture complex temporal dynamic. Specifically, the constructed temporal relation graph explicitly builds connections between semantically related temporal features to model temporal relations between both adjacent and non-adjacent time steps. Meanwhile, to further explore the sufficient temporal dynamic, multi-head mechanism is designed to investigate multi-kinds of temporal relations. Extensive experiments are performed on two widely used large-scale datasets, NTU-60 RGB+D and NTU-120 RGB+D. And experimental results show that the proposed model achieves the state-of-the-art performance by making contribution to temporal modeling for action recognition.

LGSep 29, 2018
Knowledge-guided Semantic Computing Network

Guangming Shi, Zhongqiang Zhang, Dahua Gao et al.

It is very useful to integrate human knowledge and experience into traditional neural networks for faster learning speed, fewer training samples and better interpretability. However, due to the obscured and indescribable black box model of neural networks, it is very difficult to design its architecture, interpret its features and predict its performance. Inspired by human visual cognition process, we propose a knowledge-guided semantic computing network which includes two modules: a knowledge-guided semantic tree and a data-driven neural network. The semantic tree is pre-defined to describe the spatial structural relations of different semantics, which just corresponds to the tree-like description of objects based on human knowledge. The object recognition process through the semantic tree only needs simple forward computing without training. Besides, to enhance the recognition ability of the semantic tree in aspects of the diversity, randomicity and variability, we use the traditional neural network to aid the semantic tree to learn some indescribable features. Only in this case, the training process is needed. The experimental results on MNIST and GTSRB datasets show that compared with the traditional data-driven network, our proposed semantic computing network can achieve better performance with fewer training samples and lower computational complexity. Especially, Our model also has better adversarial robustness than traditional neural network with the help of human knowledge.

CVFeb 1, 2018
Full Image Recover for Block-Based Compressive Sensing

Xuemei Xie, Chenye Wang, Jiang Du et al.

Recent years, compressive sensing (CS) has improved greatly for the application of deep learning technology. For convenience, the input image is usually measured and reconstructed block by block. This usually causes block effect in reconstructed images. In this paper, we present a novel CNN-based network to solve this problem. In measurement part, the input image is adaptively measured block by block to acquire a group of measurements. While in reconstruction part, all the measurements from one image are used to reconstruct the full image at the same time. Different from previous method recovering block by block, the structure information destroyed in measurement part is recovered in our framework. Block effect is removed accordingly. We train the proposed framework by mean square error (MSE) loss function. Experiments show that there is no block effect at all in the proposed method. And our results outperform 1.8 dB compared with existing methods.

CVFeb 1, 2018
Perceptual Compressive Sensing

Jiang Du, Xuemei Xie, Chenye Wang et al.

Compressive sensing (CS) works to acquire measurements at sub-Nyquist rate and recover the scene images. Existing CS methods always recover the scene images in pixel level. This causes the smoothness of recovered images and lack of structure information, especially at a low measurement rate. To overcome this drawback, in this paper, we propose perceptual CS to obtain high-level structured recovery. Our task no longer focuses on pixel level. Instead, we work to make a better visual effect. In detail, we employ perceptual loss, defined on feature level, to enhance the structure information of the recovered images. Experiments show that our method achieves better visual results with stronger structure information than existing CS methods at the same measurement rate.

CVJan 31, 2018
ConvCSNet: A Convolutional Compressive Sensing Framework Based on Deep Learning

Xiaotong Lu, Weisheng Dong, Peiyao Wang et al.

Compressive sensing (CS), aiming to reconstruct an image/signal from a small set of random measurements has attracted considerable attentions in recent years. Due to the high dimensionality of images, previous CS methods mainly work on image blocks to avoid the huge requirements of memory and computation, i.e., image blocks are measured with Gaussian random matrices, and the whole images are recovered from the reconstructed image blocks. Though efficient, such methods suffer from serious blocking artifacts. In this paper, we propose a convolutional CS framework that senses the whole image using a set of convolutional filters. Instead of reconstructing individual blocks, the whole image is reconstructed from the linear convolutional measurements. Specifically, the convolutional CS is implemented based on a convolutional neural network (CNN), which performs both the convolutional CS and nonlinear reconstruction. Through end-to-end training, the sensing filters and the reconstruction network can be jointly optimized. To facilitate the design of the CS reconstruction network, a novel two-branch CNN inspired from a sparsity-based CS reconstruction model is developed. Experimental results show that the proposed method substantially outperforms previous state-of-the-art CS methods in term of both PSNR and visual quality.

CVNov 21, 2017
Fully Convolutional Measurement Network for Compressive Sensing Image Reconstruction

Jiang Du, Xuemei Xie, Chenye Wang et al.

Recently, deep learning methods have made a significant improvement in compressive sensing image reconstruction task. In the existing methods, the scene is measured block by block due to the high computational complexity. This results in block-effect of the recovered images. In this paper, we propose a fully convolutional measurement network, where the scene is measured as a whole. The proposed method powerfully removes the block-effect since the structure information of scene images is preserved. To make the measure more flexible, the measurement and the recovery parts are jointly trained. From the experiments, it is shown that the results by the proposed method outperforms those by the existing methods in PSNR, SSIM, and visual effect.

CVOct 5, 2017
Real-Time Illegal Parking Detection System Based on Deep Learning

Xuemei Xie, Chenye Wang, Shu Chen et al.

The increasing illegal parking has become more and more serious. Nowadays the methods of detecting illegally parked vehicles are based on background segmentation. However, this method is weakly robust and sensitive to environment. Benefitting from deep learning, this paper proposes a novel illegal vehicle parking detection system. Illegal vehicles captured by camera are firstly located and classified by the famous Single Shot MultiBox Detector (SSD) algorithm. To improve the performance, we propose to optimize SSD by adjusting the aspect ratio of default box to accommodate with our dataset better. After that, a tracking and analysis of movement is adopted to judge the illegal vehicles in the region of interest (ROI). Experiments show that the system can achieve a 99% accuracy and real-time (25FPS) detection with strong robustness in complex environments.

CVSep 23, 2017
Adaptive Measurement Network for CS Image Reconstruction

Xuemei Xie, Yuxiang Wang, Guangming Shi et al.

Conventional compressive sensing (CS) reconstruction is very slow for its characteristic of solving an optimization problem. Convolu- tional neural network can realize fast processing while achieving compa- rable results. While CS image recovery with high quality not only de- pends on good reconstruction algorithms, but also good measurements. In this paper, we propose an adaptive measurement network in which measurement is obtained by learning. The new network consists of a fully-connected layer and ReconNet. The fully-connected layer which has low-dimension output acts as measurement. We train the fully-connected layer and ReconNet simultaneously and obtain adaptive measurement. Because the adaptive measurement fits dataset better, in contrast with random Gaussian measurement matrix, under the same measuremen- t rate, it can extract the information of scene more efficiently and get better reconstruction results. Experiments show that the new network outperforms the original one.