CV | Mar 1, 2025

EigenActor: Variant Body-Object Interaction Generation Evolved from Invariant Action Basis Reasoning

arXiv:2503.00382v2 | 1 citation | h-index: 9 | IEEE Trans Pattern Anal Mach Intell
Originality: Incremental advance
AI Analysis

This work addresses the cross-modality gap in text-to-HOI synthesis, which matters for applications such as virtual reality and robotics. It is incremental, however, since it builds on existing methods by refining the generation strategy.

This paper tackles the problem of generating 3D human-object interactions from text instructions. It proposes a two-stage method that first infers object-agnostic canonical body actions and then enriches them with object-specific interaction styles. The authors report that it significantly outperforms state-of-the-art methods on three large-scale datasets in semantic consistency and interaction realism.

This paper explores a cross-modality synthesis task that infers 3D human-object interactions (HOIs) from a given text-based instruction. Existing text-to-HOI synthesis methods mainly deploy a direct mapping from texts to object-specific 3D body motions, which may hit a performance bottleneck because of the huge cross-modality gap. In this paper, we observe that HOI samples with the same interaction intention toward different targets, e.g., "lift a chair" and "lift a cup", always share similar action-specific body motion patterns while exhibiting different object-specific interaction styles. Thus, learning effective action-specific motion priors and object-specific interaction priors is crucial for a text-to-HOI model and largely determines its performance in terms of text-HOI semantic consistency and body-object interaction realism. In light of this, we propose a novel body pose generation strategy for the text-to-HOI task: infer an object-agnostic canonical body action first, and then enrich it with object-specific interaction styles. Specifically, the first stage, canonical body action inference, focuses on learning intra-class shareable body motion priors and mapping the given text-based semantics to action-specific canonical 3D body motions. The second stage, object-specific interaction inference, focuses on object affordance learning and enriches object-specific interaction styles on top of the inferred action-specific body motion basis. Extensive experiments verify that our proposed text-to-HOI synthesis system significantly outperforms other SOTA methods on three large-scale datasets in both semantic consistency and interaction realism.
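The two-stage strategy described in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the linear maps, random weights, feature sizes, and the function names `infer_canonical_action` and `enrich_interaction_style` are all hypothetical stand-ins, chosen only to show how one shared action basis could be reused across different target objects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: frames, joints, text-feature dim, object-feature dim.
T, J, D_TEXT, D_OBJ = 8, 4, 16, 6

# Random weights stand in for learned models (assumption, illustration only).
W_action = rng.normal(size=(D_TEXT, T * J * 3)) * 0.1
W_style = rng.normal(size=(T * J * 3 + D_OBJ, T * J * 3)) * 0.1


def infer_canonical_action(text_feat):
    """Stage 1: map text semantics to an object-agnostic motion of shape (T, J, 3)."""
    return (text_feat @ W_action).reshape(T, J, 3)


def enrich_interaction_style(canonical, obj_feat):
    """Stage 2: add an object-conditioned residual on top of the canonical motion."""
    x = np.concatenate([canonical.ravel(), obj_feat])
    residual = (x @ W_style).reshape(T, J, 3)
    return canonical + residual


# Stand-ins for an encoded "lift" instruction and two object descriptors.
lift_feat = rng.normal(size=D_TEXT)
chair_feat = rng.normal(size=D_OBJ)
cup_feat = rng.normal(size=D_OBJ)

basis = infer_canonical_action(lift_feat)                 # shared by both targets
lift_chair = enrich_interaction_style(basis, chair_feat)  # "lift a chair"
lift_cup = enrich_interaction_style(basis, cup_feat)      # "lift a cup"
```

In this sketch, both outputs start from the same `basis` and differ only through the object-conditioned residual, mirroring the "lift a chair" versus "lift a cup" observation.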
