Pose-ICL: 3D-Aware In-Context Learning for Pose-Controllable Subject Customization
For users needing precise pose control in subject customization, this work addresses the key bottleneck of 3D understanding in 2D-native backbones.
Pose-ICL is a tuning-free framework for pose-controllable subject customization that achieves significantly higher pose accuracy and identity consistency than existing methods, as demonstrated on 3D assets and real-world subjects.
Subject customization is a foundational task in modern image generation. By providing a few reference images and a text prompt, users can generate images of a specific object in any desired scene. However, existing methods still struggle to achieve effective pose control for customized subjects. In practice, they often produce inaccurate poses or an appearance that is inconsistent across poses. These limitations suggest that understanding objects volumetrically remains a significant challenge for 2D-native backbones. To address this challenge, we propose Pose-ICL, a tuning-free framework that leverages 3D-aware In-Context Learning (ICL) to adapt directly to new subjects through multiple paired image-pose references. Its core mechanism, Surface-Anchored Position Embedding (SAPE), equips the model with explicit 3D awareness by anchoring image tokens to the surface coordinates of a volumetric bounding box. Dedicated refinements ensure seamless compatibility with existing DiT models. Extensive evaluations on both 3D assets and real-world subjects demonstrate that Pose-ICL significantly outperforms current methods in both pose accuracy and identity consistency.
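To make the idea of anchoring image tokens to the surface of a bounding box more concrete, the following is a minimal illustrative sketch and not the paper's implementation. It assumes an orthographic camera parameterized by azimuth and elevation, an axis-aligned cube centered at the origin, and one ray per token center; each token is assigned the 3D point where its ray first meets the cube. The function names (`camera_basis`, `surface_anchors`) and all parameters are hypothetical, and how such coordinates would be converted into position embeddings is not specified here.

```python
import numpy as np


def camera_basis(azimuth_deg, elevation_deg):
    """Right/up/forward axes of a camera orbiting the origin (assumed convention).

    Degenerate at elevation = +/-90 degrees, where world-up is parallel to the view.
    """
    az, el = np.deg2rad(azimuth_deg), np.deg2rad(elevation_deg)
    # Unit vector from the origin towards the camera.
    cam_dir = np.array([np.cos(el) * np.sin(az), np.sin(el), np.cos(el) * np.cos(az)])
    right = np.cross(np.array([0.0, 1.0, 0.0]), cam_dir)
    right /= np.linalg.norm(right)
    up = np.cross(cam_dir, right)
    return right, up, -cam_dir, cam_dir


def surface_anchors(grid_size, azimuth_deg, elevation_deg,
                    half_extent=0.5, view_half_width=1.0):
    """Return (anchors, hit): per-token 3D surface points and a hit mask.

    anchors has shape (grid_size, grid_size, 3); tokens whose ray misses the
    box get a zero anchor and hit == False.
    """
    right, up, forward, cam_dir = camera_basis(azimuth_deg, elevation_deg)

    # Token centers on the image plane; row 0 is the top of the image.
    ticks = (np.arange(grid_size) + 0.5) / grid_size * 2.0 - 1.0
    u, v = np.meshgrid(ticks * view_half_width, -ticks * view_half_width)
    origins = 2.0 * cam_dir + u[..., None] * right + v[..., None] * up

    # Slab-method ray/box intersection; near-zero direction components are
    # replaced by a small epsilon so parallel rays do not divide by zero.
    safe_dir = np.where(np.abs(forward) < 1e-9, 1e-9, forward)
    t0 = (-half_extent - origins) / safe_dir
    t1 = (half_extent - origins) / safe_dir
    t_near = np.minimum(t0, t1).max(axis=-1)
    t_far = np.maximum(t0, t1).min(axis=-1)
    hit = (t_near <= t_far) & (t_far > 0.0)

    # Entry point of each ray on the box surface.
    points = origins + t_near[..., None] * forward
    anchors = np.where(hit[..., None], points, 0.0)
    return anchors, hit


if __name__ == "__main__":
    anchors, hit = surface_anchors(grid_size=4, azimuth_deg=0.0, elevation_deg=0.0)
    print(hit.astype(int))
    print(anchors[1, 1])
```

In this sketch, two reference views of the same subject taken from different poses map tokens that see the same box face to nearby 3D coordinates, which is one plausible way a 2D backbone could be given a shared spatial frame across poses.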