RODec 30, 2025
Unified Embodied VLM Reasoning with Robotic Action via Autoregressive Discretized Pre-trainingYi Liu, Sukai Wang, Dafeng Wei et al.
General-purpose robotic systems operating in open-world environments must achieve both broad generalization and high-precision action execution, a combination that remains challenging for existing Vision-Language-Action (VLA) models. While large Vision-Language Models (VLMs) improve semantic generalization, insufficient embodied reasoning leads to brittle behavior, and conversely, strong reasoning alone is inadequate without precise control. To provide a decoupled and quantitative assessment of this bottleneck, we introduce Embodied Reasoning Intelligence Quotient (ERIQ), a large-scale embodied reasoning benchmark in robotic manipulation, comprising 6K+ question-answer pairs across four reasoning dimensions. By decoupling reasoning from execution, ERIQ enables systematic evaluation and reveals a strong positive correlation between embodied reasoning capability and end-to-end VLA generalization. To bridge the gap from reasoning to precise execution, we propose FACT, a flow-matching-based action tokenizer that converts continuous control into discrete sequences while preserving high-fidelity trajectory reconstruction. The resulting GenieReasoner jointly optimizes reasoning and action in a unified space, outperforming both continuous-action and prior discrete-action baselines in real-world tasks. Together, ERIQ and FACT provide a principled framework for diagnosing and overcoming the reasoning-precision trade-off, advancing robust, general-purpose robotic manipulation. Project page: https://geniereasoner.github.io/GenieReasoner/
CVNov 17, 2018
Explicit Spatiotemporal Joint Relation Learning for Tracking Human PoseXiao Sun, Chuankang Li, Stephen Lin
We present a method for human pose tracking that is based on learning spatiotemporal relationships among joints. Beyond generating the heatmap of a joint in a given frame, our system also learns to predict the offset of the joint from a neighboring joint in the frame. Additionally, it is trained to predict the displacement of the joint from its position in the previous frame, in a manner that can account for possibly changing joint appearance, unlike optical flow. These relational cues in the spatial domain and temporal domain are inferred in a robust manner by attending only to relevant areas in the video frames. By explicitly learning and exploiting these joint relationships, our system achieves state-of-the-art performance on standard benchmarks for various pose tracking tasks including 3D body pose tracking in RGB video, 3D hand pose tracking in depth sequences, and 3D hand gesture tracking in RGB video.
CVSep 17, 2018
An Integral Pose Regression System for the ECCV2018 PoseTrack ChallengeXiao Sun, Chuankang Li, Stephen Lin
For the ECCV 2018 PoseTrack Challenge, we present a 3D human pose estimation system based mainly on the integral human pose regression method. We show a comprehensive ablation study to examine the key performance factors of the proposed system. Our system obtains 47mm MPJPE on the CHALL_H80K test dataset, placing second in the ECCV2018 3D human pose estimation challenge. Code will be released to facilitate future work.