Mingtao Pei

CV
h-index37
8papers
65citations
Novelty49%
AI Score36

8 Papers

CVMar 5, 2025Code
Mocap-2-to-3: Multi-view Lifting for Monocular Motion Recovery with 2D Pretraining

Zhumei Wang, Zechen Hu, Ruoxi Guo et al.

Recovering absolute human motion from monocular inputs is challenging due to two main issues. First, existing methods depend on 3D training data collected from limited environments, constraining out-of-distribution generalization. The second issue is the difficulty of estimating metric-scale poses from monocular input. To address these challenges, we introduce Mocap-2-to-3, a novel framework that performs multi-view lifting from monocular input by leveraging 2D data pre-training, enabling the reconstruction of metrically accurate 3D motions with absolute positions. To leverage abundant 2D data, we decompose complex 3D motion into multi-view syntheses. We first pretrain a single-view diffusion model on extensive 2D datasets, then fine-tune a multi-view model using public 3D data to enable view-consistent motion generation from monocular input, allowing the model to acquire action priors and diversity through 2D data. Furthermore, to recover absolute poses, we propose a novel human motion representation that decouples the learning of local pose and global movements, while encoding geometric priors of the ground to accelerate convergence. This enables progressive recovery of motion in absolute space during inference. Experimental results on in-the-wild benchmarks demonstrate that our method surpasses state-of-the-art approaches in both camera-space motion realism and world-grounded human positioning, while exhibiting superior generalization capability. Our code will be made publicly available.

CVMar 12, 2024
Let Storytelling Tell Vivid Stories: An Expressive and Fluent Multimodal Storyteller

Chuanqi Zang, Jiji Tang, Rongsheng Zhang et al.

Storytelling aims to generate reasonable and vivid narratives based on an ordered image stream. The fidelity to the image story theme and the divergence of story plots attract readers to keep reading. Previous works iteratively improved the alignment of multiple modalities but ultimately resulted in the generation of simplistic storylines for image streams. In this work, we propose a new pipeline, termed LLaMS, to generate multimodal human-level stories that are embodied in expressiveness and consistency. Specifically, by fully exploiting the commonsense knowledge within the LLM, we first employ a sequence data auto-enhancement strategy to enhance factual content expression and leverage a textual reasoning architecture for expressive story generation and prediction. Secondly, we propose SQ-Adatpter module for story illustration generation which can maintain sequence consistency. Numerical results are conducted through human evaluation to verify the superiority of proposed LLaMS. Evaluations show that LLaMS achieves state-of-the-art storytelling performance and 86% correlation and 100% consistency win rate as compared with previous SOTA methods. Furthermore, ablation experiments are conducted to verify the effectiveness of proposed sequence data enhancement and SQ-Adapter.

CVAug 5, 2025
Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing

Hongyu Shen, Junfeng Ni, Yixin Chen et al.

We address the challenge of lifting 2D visual segmentation to 3D in Gaussian Splatting. Existing methods often suffer from inconsistent 2D masks across viewpoints and produce noisy segmentation boundaries as they neglect these semantic cues to refine the learned Gaussians. To overcome this, we introduce Gaussian Instance Tracing (GIT), which augments the standard Gaussian representation with an instance weight matrix across input views. Leveraging the inherent consistency of Gaussians in 3D, we use this matrix to identify and correct 2D segmentation inconsistencies. Furthermore, since each Gaussian ideally corresponds to a single object, we propose a GIT-guided adaptive density control mechanism to split and prune ambiguous Gaussians during training, resulting in sharper and more coherent 2D and 3D segmentation boundaries. Experimental results show that our method extracts clean 3D assets and consistently improves 3D segmentation in both online (e.g., self-prompting) and offline (e.g., contrastive lifting) settings, enabling applications such as hierarchical segmentation, object extraction, and scene editing.

CVNov 24, 2020
Do You Live a Healthy Life? Analyzing Lifestyle by Visual Life Logging

Qing Gao, Mingtao Pei, Hongyu Shen

A healthy lifestyle is the key to better health and happiness and has a considerable effect on quality of life and disease prevention. Current lifelogging/egocentric datasets are not suitable for lifestyle analysis; consequently, there is no research on lifestyle analysis in the field of computer vision. In this work, we investigate the problem of lifestyle analysis and build a visual lifelogging dataset for lifestyle analysis (VLDLA). The VLDLA contains images captured by a wearable camera every 3 seconds from 8:00 am to 6:00 pm for seven days. In contrast to current lifelogging/egocentric datasets, our dataset is suitable for lifestyle analysis as images are taken with short intervals to capture activities of short duration; moreover, images are taken continuously from morning to evening to record all the activities performed by a user. Based on our dataset, we classify the user activities in each frame and use three latent fluents of the user, which change over time and are associated with activities, to measure the healthy degree of the user's lifestyle. The scores for the three latent fluents are computed based on recognized activities, and the healthy degree of the lifestyle for the day is determined based on the scores for the latent fluents. Experimental results show that our method can be used to analyze the healthiness of users' lifestyles.

ROMar 15, 2019
Stitching Videos from a Fisheye Lens Camera and a Wide-Angle Lens Camera for Telepresence Robots

Yanmei Dong, Mingtao Pei, Lijia Zhang et al.

Many telepresence robots are equipped with a forward-facing camera for video communication and a downward-facing camera for navigation. In this paper, we propose to stitch videos from the FF-camera with a wide-angle lens and the DF-camera with a fisheye lens for telepresence robots. We aim at providing more compact and efficient visual feedback for the user interface of telepresence robots with user-friendly interactive experiences. To this end, we present a multi-homography-based video stitching method which stitches videos from a wide-angle camera and a fisheye camera. The method consists of video image alignment, seam cutting, and image blending. We directly align the wide-angle video image and the fisheye video image based on the multi-homography alignment without calibration, distortion correction, and unwarping procedures. Thus, we can obtain a stitched video with shape preservation in the non-overlapping regions and alignment in the overlapping area for telepresence. To alleviate ghosting effects caused by moving objects and/or moving cameras during telepresence robot driving, an optimal seam is found for aligned video composition, and the optimal seam will be updated in subsequent frames, considering spatial and temporal coherence. The final stitched video is created by image blending based on the optimal seam. We conducted a user study to demonstrate the effectiveness of our method and the superiority of telepresence robots with a stitched video as visual feedback.

CVMar 13, 2019
Face Liveness Detection Based on Client Identity Using Siamese Network

Huiling Hao, Mingtao Pei

Face liveness detection is an essential prerequisite for face recognition applications. Previous face liveness detection methods usually train a binary classifier to differentiate between a fake face and a real face before face recognition. The client identity information is not utilized in previous face liveness detection methods. However, in practical face recognition applications, face spoofing attacks are always aimed at a specific client, and the client identity information can provide useful clues for face liveness detection. In this paper, we propose a face liveness detection method based on the client identity using Siamese network. We detect face liveness after face recognition instead of before face recognition, that is, we detect face liveness with the client identity information. We train a Siamese network with image pairs. Each image pair consists of two real face images or one real and one fake face images. The face images in each pair come from a same client. Given a test face image, the face image is firstly recognized by face recognition system, then the real face image of the identified client is retrieved to help the face liveness detection. Experiment results demonstrate the effectiveness of our method.

CVMar 22, 2016
Input Aggregated Network for Face Video Representation

Zhen Dong, Su Jia, Chi Zhang et al.

Recently, deep neural network has shown promising performance in face image recognition. The inputs of most networks are face images, and there is hardly any work reported in literature on network with face videos as input. To sufficiently discover the useful information contained in face videos, we present a novel network architecture called input aggregated network which is able to learn fixed-length representations for variable-length face videos. To accomplish this goal, an aggregation unit is designed to model a face video with various frames as a point on a Riemannian manifold, and the mapping unit aims at mapping the point into high-dimensional space where face videos belonging to the same subject are close-by and others are distant. These two units together with the frame representation unit build an end-to-end learning system which can learn representations of face videos for the specific tasks. Experiments on two public face video datasets demonstrate the effectiveness of the proposed network.

ROJan 22, 2016
A Low-Cost Tele-Presence Wheelchair System

Jiajun Shen, Bin Xu, Mingtao Pei et al.

This paper presents the architecture and implementation of a tele-presence wheelchair system based on tele-presence robot, intelligent wheelchair, and touch screen technologies. The tele-presence wheelchair system consists of a commercial electric wheelchair, an add-on tele-presence interaction module, and a touchable live video image based user interface (called TIUI). The tele-presence interaction module is used to provide video-chatting for an elderly or disabled person with the family members or caregivers, and also captures the live video of an environment for tele-operation and semi-autonomous navigation. The user interface developed in our lab allows an operator to access the system anywhere and directly touch the live video image of the wheelchair to push it as if he/she did it in the presence. This paper also discusses the evaluation of the user experience.