RO | AI | CV | LG | Dec 7, 2025

Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge

arXiv:2512.06951v1 | 3 citations | h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses the problem of enabling robots to execute complex, context-aware household tasks in photo-realistic simulation. It represents an incremental improvement over existing methods.

The paper tackled the challenge of performing diverse long-horizon household tasks in simulation by developing a vision-action policy, which achieved a 26% q-score across 50 tasks, winning first place in the 2025 BEHAVIOR Challenge.

We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge - a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation, requiring bimanual manipulation, navigation, and context-aware decision making. Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules. Our approach achieves 26% q-score across all 50 tasks on both public and private leaderboards.
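The abstract does not specify how the correlated noise is constructed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes an AR(1)-style temporal covariance over the action horizon (the parameter `rho` and all function names are hypothetical), the common linear interpolant x_t = (1 - t) * noise + t * action, and several independent (noise, t) draws per action chunk as one plausible reading of "multi-sample flow matching".

```python
import numpy as np


def ar1_cholesky(horizon: int, rho: float) -> np.ndarray:
    """Cholesky factor of an assumed AR(1) covariance: cov[i, j] = rho ** |i - j|."""
    idx = np.arange(horizon)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])
    return np.linalg.cholesky(cov)


def sample_correlated_noise(rng, chol, action_dim, num_samples):
    """Draw noise that is correlated across time steps, independent across action dims."""
    white = rng.standard_normal((num_samples, chol.shape[0], action_dim))
    # Mixing white noise along the time axis with L gives covariance L @ L.T.
    return np.einsum("ij,njd->nid", chol, white)


def flow_matching_pairs(rng, actions, chol, num_samples):
    """Build several (x_t, t, target velocity) training triples for one action chunk.

    actions: array of shape (horizon, action_dim).
    Returns x_t and target of shape (num_samples, horizon, action_dim),
    and t of shape (num_samples, 1, 1).
    """
    _, action_dim = actions.shape
    eps = sample_correlated_noise(rng, chol, action_dim, num_samples)
    t = rng.uniform(size=(num_samples, 1, 1))
    x_t = (1.0 - t) * eps + t * actions  # assumed linear interpolant
    target = actions - eps               # velocity of that interpolant
    return x_t, t, target


rng = np.random.default_rng(0)
chol = ar1_cholesky(horizon=8, rho=0.9)
actions = rng.standard_normal((8, 3))  # placeholder action chunk
x_t, t, target = flow_matching_pairs(rng, actions, chol, num_samples=4)
```

Averaging the regression loss over the `num_samples` triples for the same observation is one way such a scheme could reduce gradient variance. The covariance actually used in the paper may differ, for example one estimated from demonstration data.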
