Ruomei Wang

CV
h-index 21
8 papers
59 citations
Novelty 57%
AI Score 45

8 Papers

CV · Feb 4
Depth-Guided Metric-Aware Temporal Consistency for Monocular Video Human Mesh Recovery

Jiaxin Cen, Xudong Mao, Guanghui Yue et al.

Monocular video human mesh recovery faces fundamental challenges in maintaining metric consistency and temporal stability due to inherent depth ambiguities and scale uncertainties. While existing methods rely primarily on RGB features and temporal smoothing, they struggle with depth ordering, scale drift, and occlusion-induced instabilities. We propose a comprehensive depth-guided framework that achieves metric-aware temporal consistency through three synergistic components: (1) a Depth-Guided Multi-Scale Fusion module that adaptively integrates geometric priors with RGB features via confidence-aware gating; (2) a Depth-guided Metric-Aware Pose and Shape (D-MAPS) estimator that leverages depth-calibrated bone statistics for scale-consistent initialization; and (3) a Motion-Depth Aligned Refinement (MoDAR) module that enforces temporal coherence through cross-modal attention between motion dynamics and geometric cues. Our method achieves superior results on three challenging benchmarks, demonstrating significant improvements in spatial accuracy and in robustness against heavy occlusion while maintaining computational efficiency.
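As a rough illustration of what "confidence-aware gating" can mean, the sketch below blends an RGB feature vector with a depth feature vector using a scalar gate derived from a depth confidence score. This is an assumption for illustration only: the abstract does not give the gate's form, and the names `gated_fusion` and `depth_conf_logit` are hypothetical rather than taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb_feat, depth_feat, depth_conf_logit):
    """Blend two equal-length feature vectors with a scalar gate.

    Hypothetical form: fused = g * depth + (1 - g) * rgb, where
    g = sigmoid(depth_conf_logit). A confident depth estimate pushes
    the result toward the depth features; a poor one falls back to RGB.
    """
    if len(rgb_feat) != len(depth_feat):
        raise ValueError("feature vectors must have the same length")
    g = sigmoid(depth_conf_logit)
    return [g * d + (1.0 - g) * r for r, d in zip(rgb_feat, depth_feat)]


# A logit of 0 gives a gate of exactly 0.5, i.e. a plain average.
print(gated_fusion([1.0, 2.0], [3.0, 6.0], 0.0))  # [2.0, 4.0]
```

In a learned model the gate would typically be predicted per location or per channel by a small network; the scalar gate here only shows the blending arithmetic.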

CV · Aug 5, 2025
DepthGait: Multi-Scale Cross-Level Feature Fusion of RGB-Derived Depth and Silhouette Sequences for Robust Gait Recognition

Xinzhu Li, Juepeng Zheng, Yikun Chen et al.

Robust gait recognition requires highly discriminative representations, which are closely tied to input modalities. While binary silhouettes and skeletons have dominated recent literature, these 2D representations fall short of capturing the cues needed to handle viewpoint variations and the finer, meaningful details of gait. In this paper, we introduce a novel framework, termed DepthGait, that incorporates RGB-derived depth maps and silhouettes for enhanced gait recognition. Specifically, apart from the 2D silhouette representation of the human body, the proposed pipeline explicitly estimates depth maps from a given RGB image sequence and uses them as a new modality to capture discriminative features inherent in human locomotion. In addition, a novel multi-scale and cross-level fusion scheme is developed to bridge the modality gap between depth maps and silhouettes. Extensive experiments on standard benchmarks demonstrate that the proposed DepthGait achieves state-of-the-art performance compared to peer methods and attains an impressive mean rank-1 accuracy on challenging datasets.
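For readers unfamiliar with the rank-1 metric mentioned above, here is a minimal sketch of how rank-1 accuracy is commonly computed in retrieval-style recognition: a probe counts as correct when its nearest gallery feature has the same identity. The function name, the squared Euclidean distance, and the toy data are assumptions for illustration; gait benchmarks usually add protocol details (for example, averaging over view pairs to get a mean rank-1) that are not modelled here.

```python
def rank1_accuracy(probes, gallery):
    """Fraction of probes whose nearest gallery entry shares their identity.

    probes, gallery: lists of (identity, feature_vector) pairs.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    hits = 0
    for probe_id, probe_feat in probes:
        # Nearest gallery item under squared Euclidean distance.
        nearest_id, _ = min(gallery, key=lambda item: sq_dist(probe_feat, item[1]))
        hits += int(nearest_id == probe_id)
    return hits / len(probes)


gallery = [("A", [0.0, 0.0]), ("B", [10.0, 10.0])]
probes = [("A", [1.0, 0.0]), ("B", [9.0, 9.0]), ("B", [1.0, 1.0])]
# The third probe is labelled B but lies closest to A, so 2 of 3 are correct.
print(rank1_accuracy(probes, gallery))
```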

CV · Aug 5, 2025
VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering

Yiran Meng, Junhong Ye, Wei Zhou et al.

Cross-video question answering presents significant challenges beyond traditional single-video understanding, particularly in establishing meaningful connections across video streams and managing the complexity of multi-source information retrieval. We introduce VideoForest, a novel framework that addresses these challenges through person-anchored hierarchical reasoning. Our approach leverages person-level features as natural bridge points between videos, enabling effective cross-video understanding without requiring end-to-end training. VideoForest integrates three key innovations: 1) a human-anchored feature extraction mechanism that employs ReID and tracking algorithms to establish robust spatiotemporal relationships across multiple video sources; 2) a multi-granularity spanning tree structure that hierarchically organizes visual content around person-level trajectories; and 3) a multi-agent reasoning framework that efficiently traverses this hierarchical structure to answer complex cross-video queries. To evaluate our approach, we develop CrossVideoQA, a comprehensive benchmark dataset specifically designed for person-centric cross-video analysis. Experimental results demonstrate VideoForest's superior performance in cross-video reasoning tasks, achieving 71.93% accuracy in person recognition, 83.75% in behavior analysis, and 51.67% in summarization and reasoning, significantly outperforming existing methods. Our work establishes a new paradigm for cross-video understanding by unifying multiple video streams through person-level features, enabling sophisticated reasoning across distributed visual information while maintaining computational efficiency.

MM · Apr 26, 2024
MorphText: Deep Morphology Regularized Arbitrary-shape Scene Text Detection

Chengpei Xu, Wenjing Jia, Ruomei Wang et al.

Bottom-up text detection methods play an important role in arbitrary-shape scene text detection, but two restrictions prevent them from achieving their full potential: 1) the accumulation of false text segment detections, which affects subsequent processing, and 2) the difficulty of building reliable connections between text segments. Targeting these two problems, we propose a novel approach, named "MorphText", to capture the regularity of texts by embedding deep morphology for arbitrary-shape text detection. To this end, two deep morphological modules are designed to regularize text segments and determine the linkage between them. First, a Deep Morphological Opening (DMOP) module is constructed to remove false text segment detections generated in the feature extraction process. Then, a Deep Morphological Closing (DMCL) module is proposed to allow text instances of various shapes to stretch their morphology along their most significant orientation while deriving their connections. Extensive experiments conducted on four challenging benchmark datasets (CTW1500, Total-Text, MSRA-TD500 and ICDAR2017) demonstrate that our proposed MorphText outperforms both top-down and bottom-up state-of-the-art arbitrary-shape scene text detection approaches.
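The DMOP and DMCL modules are learned, but their names refer to classical morphological opening and closing. As background only, the sketch below shows the classical binary operators on a 1D strip with a fixed window of 3: opening removes an isolated false detection, and closing bridges a small gap between two segments. It is not the paper's deep modules; the function names and the border handling (out-of-range neighbours are ignored) are assumptions.

```python
def erode(bits, k=3):
    """Keep a 1 only where every in-range neighbour in the window is 1."""
    r, n = k // 2, len(bits)
    return [int(all(bits[j] for j in range(max(0, i - r), min(n, i + r + 1))))
            for i in range(n)]


def dilate(bits, k=3):
    """Set a 1 wherever any in-range neighbour in the window is 1."""
    r, n = k // 2, len(bits)
    return [int(any(bits[j] for j in range(max(0, i - r), min(n, i + r + 1))))
            for i in range(n)]


def opening(bits, k=3):
    """Erosion followed by dilation: removes runs shorter than the window."""
    return dilate(erode(bits, k), k)


def closing(bits, k=3):
    """Dilation followed by erosion: fills gaps narrower than the window."""
    return erode(dilate(bits, k), k)


# The isolated 1 at index 1 is dropped; the run of four 1s is preserved.
print(opening([0, 1, 0, 0, 1, 1, 1, 1, 0]))  # [0, 0, 0, 0, 1, 1, 1, 1, 0]
# The one-pixel gap at index 4 between the two runs is bridged.
print(closing([0, 0, 1, 1, 0, 1, 1, 0, 0]))  # [0, 0, 1, 1, 1, 1, 1, 0, 0]
```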

MM · Aug 4, 2021
What's Wrong with the Bottom-up Methods in Arbitrary-shape Scene Text Detection

Chengpei Xu, Wenjing Jia, Tingcheng Cui et al.

The latest trend in the bottom-up perspective for arbitrary-shape scene text detection is to reason about the links between text segments using a Graph Convolutional Network (GCN). Nevertheless, the best-performing bottom-up method is still inferior to the best-performing top-down method, even with the help of GCN. We argue that this is not mainly caused by the limited feature-capturing ability of the text proposal backbone or GCN, but by their failure to make full use of visual-relational features for suppressing false detections, as well as by the sub-optimal route-finding mechanism used for grouping text segments. In this paper, we revitalize the classic text detection frameworks by aggregating the visual-relational features of text with two effective false positive/negative suppression mechanisms. First, dense overlapping text segments depicting the 'characterness' and 'streamline' of text are generated for further relational reasoning and weakly supervised segment classification. Here, relational graph features are used for suppressing false positives/negatives. Then, to fuse the relational features with visual features, a Location-Aware Transfer (LAT) module is designed to transfer the text's relational features into visually compatible features, with a Fuse Decoding (FD) module to enhance the representation of text regions for the second suppression step. Finally, a novel multiple-text-map-aware contour-approximation strategy is developed to replace the widely used route-finding process. Experiments conducted on five benchmark datasets, i.e., CTW1500, Total-Text, ICDAR2015, MSRA-TD500, and MLT2017, demonstrate that our method achieves state-of-the-art performance when embedded in a classic text detection framework, which revitalizes the strength of bottom-up methods.

CV · Aug 15, 2019
SFSegNet: Parse Freehand Sketches using Deep Fully Convolutional Networks

Junkun Jiang, Ruomei Wang, Shujin Lin et al.

Parsing sketches via semantic segmentation is attractive but challenging, because (i) free-hand drawings are abstract, with large variances in how objects are depicted due to different drawing styles and skills; (ii) distorted lines drawn on a touchpad make sketches more difficult to recognize; and (iii) high-performance image segmentation via deep learning needs enormous annotated sketch datasets during the training stage. In this paper, we propose a Sketch-target deep FCN Segmentation Network (SFSegNet) for automatic free-hand sketch segmentation, labeling each sketch of a single object with multiple parts. SFSegNet is an end-to-end network from the input sketches to the segmentation results, composed of two parts: (i) a modified deep Fully Convolutional Network (FCN) that uses a reweighting strategy to ignore background pixels and classify which part each pixel belongs to; and (ii) affine transform encoders that attempt to canonicalize the shaking strokes. We train our network on a dataset of 10,000 annotated sketches to find a broadly applicable model that segments strokes semantically against a single ground truth. Extensive experiments are carried out, and the segmentation results show that our method outperforms other state-of-the-art networks.

CV · Jan 20, 2018
Structured Inhomogeneous Density Map Learning for Crowd Counting

Hanhui Li, Xiangjian He, Hefeng Wu et al.

In this paper, we aim to tackle the problem of crowd counting in extremely high-density scenes, which contain hundreds or even thousands of people. We begin with a comprehensive analysis of the most widely used density map-based methods, and demonstrate how easily existing methods are affected by the inhomogeneous density distribution problem, e.g., becoming sensitive to outliers or hard to optimize. We then present an extremely simple solution to the inhomogeneous density distribution problem, which can be intuitively summarized as extending the density map from 2D to 3D, with the extra dimension implicitly indicating the density level. Such a solution can be implemented by a single Density-Aware Network, which is not only easy to train but also achieves state-of-the-art performance on various challenging datasets.
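One way to picture "extending the density map from 2D to 3D" is to route each pixel's density value into a channel chosen by its density level, so that summing over the level axis recovers the original map and the crowd count is unchanged. The sketch below does this with explicit thresholds. This is an assumption for illustration: the abstract says the extra dimension indicates the level only implicitly, and `to_density_levels` and the thresholds are hypothetical.

```python
import bisect


def to_density_levels(density_map, thresholds):
    """Split an H x W density map into len(thresholds) + 1 level channels.

    Each pixel's value is written to exactly one channel, selected by
    where the value falls among the sorted thresholds. Summing the
    channels therefore reproduces the input map and the total count.
    """
    levels = len(thresholds) + 1
    height, width = len(density_map), len(density_map[0])
    volume = [[[0.0] * width for _ in range(height)] for _ in range(levels)]
    for y, row in enumerate(density_map):
        for x, value in enumerate(row):
            level = bisect.bisect_right(thresholds, value)
            volume[level][y][x] = value
    return volume


density = [[0.0, 0.2],
           [1.5, 4.0]]
volume = to_density_levels(density, thresholds=[0.5, 2.0])

count_2d = sum(v for row in density for v in row)
count_3d = sum(v for channel in volume for row in channel for v in row)
print(len(volume), count_2d, count_3d)  # 3 channels, identical totals
```

Separating sparse and dense regions this way keeps a few very dense pixels from dominating the regression target of the sparse ones, which is the kind of outlier sensitivity the abstract describes.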

CV · Dec 29, 2017
Learning Deep Similarity Models with Focus Ranking for Fabric Image Retrieval

Daiguo Deng, Ruomei Wang, Hefeng Wu et al.

Fabric image retrieval is beneficial to many applications, including clothing search, online shopping and cloth modeling. Learning pairwise image similarity is of great importance to an image retrieval task. With the resurgence of Convolutional Neural Networks (CNNs), recent works have achieved significant progress via deep representation learning with metric embedding, which drives similar examples close to each other in a feature space and dissimilar ones apart. In this paper, we propose a novel embedding method termed focus ranking that can be easily unified into a CNN for jointly learning image representations and metrics in the context of fine-grained fabric image retrieval. Focus ranking aims to rank similar examples higher than all dissimilar ones by penalizing ranking disorders, i.e., by minimizing the overall cost attributed to similar samples being ranked below dissimilar ones. At the training stage, training samples are organized into focus ranking units for efficient optimization. We build a large-scale fabric image retrieval dataset (FIRD) with about 25,000 images of 4,300 fabrics, and test the proposed model on it. Experimental results show the superiority of the proposed model over existing metric embedding models.
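The idea of penalizing similar samples that rank below dissimilar ones can be sketched with a simple pairwise hinge cost over one query and its similar and dissimilar sets. This is a hedged illustration, not the paper's loss: the hinge form, the margin, the squared Euclidean distance, and the name `ranking_cost` are all assumptions, and the abstract does not specify how a focus ranking unit is built.

```python
def ranking_cost(query, similar, dissimilar, margin=1.0):
    """Sum of hinge penalties over all (similar, dissimilar) pairs.

    A pair contributes a positive cost when the similar sample is not
    closer to the query than the dissimilar one by at least `margin`,
    i.e. when the ranking is violated or too tight.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    cost = 0.0
    for s in similar:
        d_sim = sq_dist(query, s)
        for d in dissimilar:
            cost += max(0.0, margin + d_sim - sq_dist(query, d))
    return cost


query = [0.0, 0.0]
similar = [[1.0, 0.0]]                   # squared distance 1.0
dissimilar = [[3.0, 0.0], [0.5, 0.0]]    # squared distances 9.0 and 0.25
# The far dissimilar sample costs nothing; the near one is ranked above
# the similar sample and contributes 1.0 + 1.0 - 0.25 = 1.75.
print(ranking_cost(query, similar, dissimilar))  # 1.75
```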