3 Papers

HC · May 13
Doppler Prompting for Stable mmWave-based Human Pose Estimation

Shuntian Zheng, Jiaqi Li, Xiaoman Lu et al.

Millimeter-wave (mmWave) enables privacy-preserving, illumination-robust human pose estimation (HPE), with each mmWave frame represented as a range-angle-Doppler tensor, providing spatial magnitude for localization and Doppler signatures for motion-related cues. However, existing mmWave-based HPE methods either underutilize or naïvely fuse Doppler signatures with spatial magnitude, disregarding their distinct physical semantics. As a result, non-human Doppler signatures can be misinterpreted as human motion cues, leading to jittery trajectories. We propose PULSE, which converts Doppler signatures into confidence-aware motion prompts and injects them into spatial magnitude reasoning through constrained interactions. By screening Doppler prompts before they influence prediction, PULSE first suppresses spurious spectral motion cues and then uses the screened prompts to stabilize prediction. Across three datasets spanning single- and multi-person settings, PULSE consistently improves pose accuracy and temporal stability, indicating that controlled Doppler prompting is a practical direction for stable mmWave HPE.
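The screening idea in this abstract (check a Doppler cue before it is allowed to influence prediction) can be illustrated with a toy example. This is only a sketch under stated assumptions, not the PULSE architecture: the function name `screen_doppler_cell`, the peak-power confidence measure, and the fixed threshold are all hypothetical choices made for illustration.

```python
def screen_doppler_cell(spectrum, threshold=0.5):
    """Screen the Doppler cue of one range-angle cell (illustrative only).

    spectrum: non-negative Doppler power values for the cell, ordered from
    the most negative to the most positive velocity bin.
    Returns (spatial_magnitude, motion_prompt, confidence).
    """
    total = sum(spectrum)  # spatial magnitude: power summed over Doppler bins
    if total <= 0.0:
        return 0.0, 0.0, 0.0

    center = len(spectrum) // 2
    # Power-weighted mean Doppler bin: a crude signed motion cue.
    centroid = sum((i - center) * p for i, p in enumerate(spectrum)) / total

    # Assumed confidence: share of the cell's power in its strongest bin.
    # A flat spectrum (noise-like) scores low; a sharp peak scores high.
    confidence = max(spectrum) / total

    # Screening: a low-confidence cue is zeroed instead of being passed on.
    prompt = centroid if confidence >= threshold else 0.0
    return total, prompt, confidence


# A sharp peak survives screening; a flat spectrum is suppressed.
peaked = screen_doppler_cell([0.0, 0.0, 0.0, 4.0])
flat = screen_doppler_cell([1.0, 1.0, 1.0, 1.0])
```

In a learned model the confidence would be predicted rather than hand-crafted; the hard threshold here only makes the "screen first, then use" ordering explicit.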

CVMay 8
A Two-Stage Motion-Aware Framework for mmWave-based Human Mesh Recovery

Hoang Hai Pham, Shuntian Zheng, Jiaqi Li et al.

Millimeter-wave (mmWave) radar has emerged as a promising sensing modality for human perception due to its robustness under challenging environmental conditions and strong privacy-preserving properties. However, recovering accurate 3D human body meshes from radar observations remains difficult due to severe signal clutter and the inherently partial nature of radar measurements. Previous works typically adopt end-to-end frameworks that directly regress human body parameters from raw radar data, without decoupling signal interpretation from geometric reasoning or exploiting temporal motion cues, limiting learning performance. To address this, we propose a two-stage framework for radar-based human body reconstruction. First, we introduce a human reflection extraction module that performs coarse-to-fine localization and voxel-wise segmentation to produce a confidence-weighted radar volume encoding voxel-level human likelihood. Second, we design a motion-aware mesh recovery network that reconstructs the human body by jointly modeling per-frame geometry and inter-frame dynamics using a dual-branch architecture. Extensive experiments demonstrate that the proposed method outperforms existing approaches while maintaining computational efficiency.
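As a rough illustration of the two stages described above, the sketch below weights each voxel by a human-likelihood score and then derives separate inputs for a per-frame branch and an inter-frame branch. It is a minimal sketch, not the paper's network: flattened voxel lists, the helper names `weight_volume` and `branch_inputs`, and the use of simple frame differencing as the motion cue are all assumptions made for illustration.

```python
def weight_volume(intensity, likelihood):
    """Scale each voxel's radar intensity by its human likelihood in [0, 1]."""
    return [i * p for i, p in zip(intensity, likelihood)]


def branch_inputs(frames, likelihoods):
    """Build inputs for a hypothetical dual-branch model.

    frames, likelihoods: lists of flattened voxel volumes, one per time step.
    Returns (weighted_frames, motion), where motion[t] is the voxel-wise
    difference between weighted frame t+1 and weighted frame t.
    """
    weighted = [weight_volume(f, l) for f, l in zip(frames, likelihoods)]
    # Assumed motion cue: difference between consecutive weighted volumes.
    motion = [
        [cur - prev for prev, cur in zip(a, b)]
        for a, b in zip(weighted, weighted[1:])
    ]
    return weighted, motion


weighted, motion = branch_inputs(
    frames=[[2.0, 4.0], [2.0, 8.0]],
    likelihoods=[[0.5, 0.0], [1.0, 0.5]],
)
```

Voxels with likelihood near zero (clutter) contribute little to either branch, which is the intuition behind separating signal interpretation from geometric reasoning.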

CVJan 28
Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization

Jiaqi Li, Guangming Wang, Shuntian Zheng et al.

Temporal Action Localization (TAL) requires identifying both the boundaries and categories of actions in untrimmed videos. While vision-language models (VLMs) offer rich semantics to complement visual evidence, existing approaches tend to overemphasize linguistic priors at the expense of visual performance, leading to a pronounced modality bias. We propose ActionVLM, a vision-language aggregation framework that systematically mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. To this end, we introduce (i) a debiasing reweighting module that estimates the language advantage (the incremental benefit of language over vision-only predictions) and dynamically reweights the language modality accordingly, and (ii) a residual aggregation strategy that treats language as a complementary refinement rather than the primary driver. This combination alleviates modality bias, reduces overconfidence from linguistic priors, and strengthens temporal reasoning. Experiments on THUMOS14 show that our model outperforms the state of the art by up to 3.2% mAP.
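The two components named in this abstract can be pictured with a small numeric sketch. It is not ActionVLM's implementation: the log-likelihood-gain definition of the advantage, the `tanh` squashing, and the function names `language_advantage` and `residual_aggregate` are assumptions chosen to show the idea that language is added as a residual and only in proportion to how much it helps.

```python
import math


def language_advantage(p_vision, p_fused, label):
    """Assumed proxy: log-likelihood gain on the true class from adding language.

    p_vision, p_fused: class probability lists; label: index of the true class.
    Positive means language helped; negative means it hurt.
    """
    return math.log(p_fused[label]) - math.log(p_vision[label])


def residual_aggregate(vision_logits, language_logits, advantage, scale=1.0):
    """Keep vision as the base signal and add language as a weighted residual."""
    # Weight lies in [0, 1) and is exactly 0 when language brings no benefit.
    w = math.tanh(scale * max(advantage, 0.0))
    return [v + w * l for v, l in zip(vision_logits, language_logits)]


adv = language_advantage([0.5, 0.5], [0.8, 0.2], label=0)  # language helped
fused = residual_aggregate([1.0, 2.0], [3.0, -1.0], adv)
unchanged = residual_aggregate([1.0, 2.0], [3.0, -1.0], advantage=-0.3)
```

With a non-positive advantage the output equals the vision logits, so language can refine a prediction but never override it on its own.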