CV · RO · Jun 26, 2025

Real-Time ESFP: Estimating, Smoothing, Filtering, and Pose-Mapping

arXiv:2506.21234v1 · 1 citation · h-index: 1 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of generating precise robotic arm movements from video input. The contribution is incremental: it builds on existing pose estimation and smoothing methods.

The paper tackles the problem of converting monocular RGB video into executable joint trajectories for a low-cost 4-DoF desktop arm. Its answer is ESFP, a real-time pipeline that estimates, smooths, filters, and maps poses.

This paper presents ESFP, an end-to-end pipeline that converts monocular RGB video into executable joint trajectories for a low-cost 4-DoF desktop arm. ESFP comprises four sequential modules. (1) Estimating: ROMP lifts each frame to a 24-joint 3-D skeleton. (2) Smoothing: the proposed HPSTM, a sequence-to-sequence Transformer with self-attention, combines long-range temporal context with a differentiable forward-kinematics decoder, enforcing constant bone lengths and anatomical plausibility while jointly predicting joint means and full covariances. (3) Filtering: root-normalized trajectories are variance-weighted according to HPSTM's uncertainty estimates, suppressing residual noise. (4) Pose-Mapping: a geometric retargeting layer transforms shoulder-elbow-wrist triples into the uArm's polar workspace, preserving wrist orientation.
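The Filtering and Pose-Mapping steps can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives no formulas, so the inverse-variance moving average, the reach-based scaling, the z-up axis convention, and the names `variance_weighted_smooth`, `arm_to_polar`, and `robot_reach` are all assumptions chosen for illustration. Wrist orientation, which the paper says is preserved, is left out here.

```python
import math


def variance_weighted_smooth(values, variances, window=1, eps=1e-9):
    """Inverse-variance weighted moving average over one coordinate.

    Assumed scheme (not taken from the paper): frames with a large
    predicted variance contribute little to the local average.
    """
    n = len(values)
    smoothed = []
    for t in range(n):
        lo, hi = max(0, t - window), min(n, t + window + 1)
        weights = [1.0 / (variances[k] + eps) for k in range(lo, hi)]
        weighted_sum = sum(w * values[k] for w, k in zip(weights, range(lo, hi)))
        smoothed.append(weighted_sum / sum(weights))
    return smoothed


def arm_to_polar(shoulder, elbow, wrist, robot_reach=0.3):
    """Map a shoulder-elbow-wrist triple to (rotation, stretch, height).

    Assumed convention: z is up, and the human arm is rescaled so that a
    fully extended arm matches the hypothetical robot_reach (metres).
    """
    human_reach = math.dist(shoulder, elbow) + math.dist(elbow, wrist)
    if human_reach == 0.0:
        raise ValueError("degenerate arm: all three joints coincide")
    scale = robot_reach / human_reach
    dx, dy, dz = (w - s for w, s in zip(wrist, shoulder))
    rotation = math.atan2(dy, dx)        # base angle in radians
    stretch = math.hypot(dx, dy) * scale  # radial distance from the base
    height = dz * scale                   # vertical offset
    return rotation, stretch, height


if __name__ == "__main__":
    # A noisy spike at frame 2 carries a very large variance and is suppressed.
    xs = variance_weighted_smooth(
        [0.0, 0.0, 10.0, 0.0, 0.0], [1.0, 1.0, 1e6, 1.0, 1.0]
    )
    print(xs)
    print(arm_to_polar((0, 0, 0), (0.3, 0, 0), (0.3, 0.3, 0)))
```

In a fuller pipeline the smoothing would run per joint and per axis, and a full covariance matrix (as HPSTM is said to predict) would replace the scalar variances used in this sketch.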

