ROAICLCVLGFeb 28, 2024

DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning

Tsinghua
arXiv:2402.18137v219 citationsh-index: 21ICML
Originality Incremental advance
AI Analysis

This work addresses representation learning for autonomous robots, offering a versatile solution for unified representation and reward learning, though it appears incremental as it builds on existing methods like InfoNCE and Bradley-Terry models.

The paper tackles the problem of multimodal pretraining for autonomous robots by proposing DecisionNCE, a unified objective that extracts task progression from image sequences and aligns them with language instructions, demonstrating effectiveness in simulated and real robot policy learning tasks.

Multimodal pretraining is an effective strategy for the trinity of goals of representation learning in autonomous robots: 1) extracting both local and global task progressions; 2) enforcing temporal consistency of visual representation; 3) capturing trajectory-level language grounding. Most existing methods approach these via separate objectives, which often reach sub-optimal solutions. In this paper, we propose a universal unified objective that can simultaneously extract meaningful task progression information from image sequences and seamlessly align them with language instructions. We discover that via implicit preferences, where a visual trajectory inherently aligns better with its corresponding language instruction than mismatched pairs, the popular Bradley-Terry model can transform into representation learning through proper reward reparameterizations. The resulted framework, DecisionNCE, mirrors an InfoNCE-style objective but is distinctively tailored for decision-making tasks, providing an embodied representation learning framework that elegantly extracts both local and global task progression features, with temporal consistency enforced through implicit time contrastive learning, while ensuring trajectory-level instruction grounding via multimodal joint encoding. Evaluation on both simulated and real robots demonstrates that DecisionNCE effectively facilitates diverse downstream policy learning tasks, offering a versatile solution for unified representation and reward learning. Project Page: https://2toinf.github.io/DecisionNCE/

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes