CVMar 24, 2023
Zolly: Zoom Focal Length Correctly for Perspective-Distorted Human Mesh ReconstructionWenjia Wang, Yongtao Ge, Haiyi Mei et al.
As it is hard to calibrate single-view RGB images in the wild, existing 3D human mesh reconstruction (3DHMR) methods either use a constant large focal length or estimate one based on the background environment context, which can not tackle the problem of the torso, limb, hand or face distortion caused by perspective camera projection when the camera is close to the human body. The naive focal length assumptions can harm this task with the incorrectly formulated projection matrices. To solve this, we propose Zolly, the first 3DHMR method focusing on perspective-distorted images. Our approach begins with analysing the reason for perspective distortion, which we find is mainly caused by the relative location of the human body to the camera center. We propose a new camera model and a novel 2D representation, termed distortion image, which describes the 2D dense distortion scale of the human body. We then estimate the distance from distortion scale features rather than environment context features. Afterwards, we integrate the distortion feature with image features to reconstruct the body mesh. To formulate the correct projection matrix and locate the human body position, we simultaneously use perspective and weak-perspective projection loss. Since existing datasets could not handle this task, we propose the first synthetic dataset PDHuman and extend two real-world datasets tailored for this task, all containing perspective-distorted human images. Extensive experiments show that Zolly outperforms existing state-of-the-art methods on both perspective-distorted datasets and the standard benchmark (3DPW).
CVJun 1, 2022
Point-Teaching: Weakly Semi-Supervised Object Detection with Point AnnotationsYongtao Ge, Qiang Zhou, Xinlong Wang et al.
Point annotations are considerably more time-efficient than bounding box annotations. However, how to use cheap point annotations to boost the performance of semi-supervised object detection remains largely unsolved. In this work, we present Point-Teaching, a weakly semi-supervised object detection framework to fully exploit the point annotations. Specifically, we propose a Hungarian-based point matching method to generate pseudo labels for point annotated images. We further propose multiple instance learning (MIL) approaches at the level of images and points to supervise the object detector with point annotations. Finally, we propose a simple-yet-effective data augmentation, termed point-guided copy-paste, to reduce the impact of the unmatched points. Experiments demonstrate the effectiveness of our method on a few datasets and various data regimes.
IVApr 8, 2025Code
POMATO: Marrying Pointmap Matching with Temporal Motion for Dynamic 3D ReconstructionSongyan Zhang, Yongtao Ge, Jinyuan Tian et al.
3D reconstruction in dynamic scenes primarily relies on the combination of geometry estimation and matching modules where the latter task is pivotal for distinguishing dynamic regions which can help to mitigate the interference introduced by camera and object motion. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion understanding in complex scenarios. Recently, the proposed representation of pointmap in DUSt3R suggests a potential solution to unify both geometry estimation and matching in 3D space, but it still struggles with ambiguous matching in dynamic regions, which may hamper further improvement. In this work, we present POMATO, a unified framework for dynamic 3D reconstruction by marrying pointmap matching with temporal motion. Specifically, our method first learns an explicit matching relationship by mapping RGB pixels from both dynamic and static regions across different views to 3D pointmaps within a unified coordinate system. Furthermore, we introduce a temporal motion module for dynamic motions that ensures scale consistency across different frames and enhances performance in tasks requiring both precise geometry and reliable matching, most notably 3D point tracking. We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm by demonstrating the remarkable performance across multiple downstream tasks, including video depth estimation, 3D point tracking, and pose estimation. Code and models are publicly available at https://github.com/wyddmw/POMATO.
CVAug 11, 2025Code
Generative Video MattingYongtao Ge, Kangyang Xie, Guangkai Xu et al.
Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations, which must be composited to background images or videos during the training stage. Thus, the generalization capability of previous methods in real-world scenarios is typically poor. In this work, we propose to solve the problem from two perspectives. First, we emphasize the importance of large-scale pre-training by pursuing diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that can render diverse human bodies and fine-grained hairs, yielding around 200 video clips with a 3-second duration for fine-tuning. Second, we introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame-by-frame and use an independent decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach's superior performance, and present comprehensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. The code is available at https://github.com/aim-uofa/GVM.
CVMar 10, 2024
What Matters When Repurposing Diffusion Models for General Dense Perception Tasks?Guangkai Xu, Yongtao Ge, Mingyu Liu et al. · cmu
Extensive pre-training with large data is indispensable for downstream geometry and semantic visual perception tasks. Thanks to large-scale text-to-image (T2I) pretraining, recent works show promising results by simply fine-tuning T2I diffusion models for dense perception tasks. However, several crucial design decisions in this process still lack comprehensive justification, encompassing the necessity of the multi-step stochastic diffusion mechanism, training strategy, inference ensemble strategy, and fine-tuning data quality. In this work, we conduct a thorough investigation into critical factors that affect transfer efficiency and performance when using diffusion priors. Our key findings are: 1) High-quality fine-tuning data is paramount for both semantic and geometry perception tasks. 2) The stochastic nature of diffusion models has a slightly negative impact on deterministic visual perception tasks. 3) Apart from fine-tuning the diffusion model with only latent space supervision, task-specific image-level supervision is beneficial to enhance fine-grained details. These observations culminate in the development of GenPercept, an effective deterministic one-step fine-tuning paradigm tailed for dense visual perception tasks. Different from the previous multi-step methods, our paradigm has a much faster inference speed, and can be seamlessly integrated with customized perception decoders and loss functions for image-level supervision, which is critical to improving the fine-grained details of predictions. Comprehensive experiments on diverse dense visual perceptual tasks, including monocular depth estimation, surface normal estimation, image segmentation, and matting, are performed to demonstrate the remarkable adaptability and effectiveness of our proposed method.
CVApr 23
WorldMark: A Unified Benchmark Suite for Interactive Video World ModelsXiaojie Xu, Zhengyuan Lin, Kang He et al.
Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (warena.ai), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.
CVMar 17, 2024
Diffusion Models are Efficient Data Generators for Human Mesh RecoveryYongtao Ge, Wenjia Wang, Yongfan Chen et al.
Despite remarkable progress having been made on the problem of 3D human pose and shape estimation (HPS), current state-of-the-art methods rely heavily on either confined indoor mocap datasets or datasets generated by a rendering engine using computer graphics (CG). Both categories of datasets exhibit inadequacies in furnishing adequate human identities and authentic in-the-wild background scenes, which are crucial for accurately simulating real-world distributions. In this work, we show that synthetic data created by generative models is complementary to CG-rendered data for achieving remarkable generalization performance on diverse real-world scenes. We propose an effective data generation pipeline based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations. Specifically, we first collect a large-scale human-centric dataset with comprehensive annotations, e.g, text captions, the depth map, and surface normal images. To generate a wide variety of human images with initial labels, we train a customized, multi-condition ControlNet model. The key to this process is using a 3D parametric model, e.g, SMPL-X, to create various condition inputs easily. Our data generation pipeline is both flexible and customizable, making it adaptable to multiple real-world tasks, such as human interaction in complex scenes and humans captured by wide-angle lenses. By relying solely on generative models, we can produce large-scale, in-the-wild human images with high-quality annotations, significantly reducing the need for manual image collection and annotation. The generated dataset encompasses a wide range of viewpoints, environments, and human identities, ensuring its versatility across different scenarios. We hope that our work could pave the way for scaling up 3D human recovery to in-the-wild scenes.
CVJun 18, 2024
GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation ModelsYongtao Ge, Guangkai Xu, Zhiyue Zhao et al.
Recent advances in discriminative and generative pretraining have yielded geometry estimation models with strong generalization capabilities. While discriminative monocular geometry estimation methods rely on large-scale fine-tuning data to achieve zero-shot generalization, several generative-based paradigms show the potential of achieving impressive generalization performance on unseen scenes by leveraging pre-trained diffusion models and fine-tuning on even a small scale of synthetic training data. Frustratingly, these models are trained with different recipes on different datasets, making it hard to find out the critical factors that determine the evaluation performance. Besides, current geometry evaluation benchmarks have two main drawbacks that may prevent the development of the field, i.e., limited scene diversity and unfavorable label quality. To resolve the above issues, (1) we build fair and strong baselines in a unified codebase for evaluating and analyzing the geometry estimation models; (2) we evaluate monocular geometry estimators on more challenging benchmarks for geometry estimation task with diverse scenes and high-quality annotations. Our results reveal that pre-trained using large data, discriminative models such as DINOv2, can outperform generative counterparts with a small amount of high-quality synthetic data under the same training configuration, which suggests that fine-tuning data quality is a more important factor than the data scale and model architecture. Our observation also raises a question: if simply fine-tuning a general vision model such as DINOv2 using a small amount of synthetic depth data produces SOTA results, do we really need complex generative models for depth estimation? We believe this work can propel advancements in geometry estimation tasks as well as a wide range of downstream applications.
CVJan 19, 2022
Poseur: Direct Human Pose Regression with TransformersWeian Mao, Yongtao Ge, Chunhua Shen et al.
We propose a direct, regression-based approach to 2D human pose estimation from single images. We formulate the problem as a sequence prediction task, which we solve using a Transformer network. This network directly learns a regression mapping from images to the keypoint coordinates, without resorting to intermediate representations such as heatmaps. This approach avoids much of the complexity associated with heatmap-based approaches. To overcome the feature misalignment issues of previous regression-based methods, we propose an attention mechanism that adaptively attends to the features that are most relevant to the target keypoints, considerably improving the accuracy. Importantly, our framework is end-to-end differentiable, and naturally learns to exploit the dependencies between keypoints. Experiments on MS-COCO and MPII, two predominant pose-estimation datasets, demonstrate that our method significantly improves upon the state-of-the-art in regression-based pose estimation. More notably, ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.
CVMar 29, 2021
TFPose: Direct Human Pose Estimation with TransformersWeian Mao, Yongtao Ge, Chunhua Shen et al.
We propose a human pose estimation framework that solves the task in the regression-based fashion. Unlike previous regression-based methods, which often fall behind those state-of-the-art methods, we formulate the pose estimation task into a sequence prediction problem that can effectively be solved by transformers. Our framework is simple and direct, bypassing the drawbacks of the heatmap-based pose estimation. Moreover, with the attention mechanism in transformers, our proposed framework is able to adaptively attend to the features most relevant to the target keypoints, which largely overcomes the feature misalignment issue of previous regression-based methods and considerably improves the performance. Importantly, our framework can inherently take advantages of the structured relationship between keypoints. Experiments on the MS-COCO and MPII datasets demonstrate that our method can significantly improve the state-of-the-art of regression-based pose estimation and perform comparably with the best heatmap-based pose estimation methods.