GR CV · Jan 20

Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints

arXiv:2601.14207v12.31 citations · h-index: 1

Originality: Incremental advance

AI Analysis

The paper addresses accurate object-object alignment for content creation and scene assembly. Its contribution is incremental: it combines existing techniques rather than introducing a new one.

The paper tackles zero-shot 3D alignment of meshes using text prompts by optimizing relative pose with CLIP-driven gradients and geometry-aware objectives, outperforming baselines on a curated benchmark.

We study zero-shot 3D alignment of two given meshes, using a text prompt describing their spatial relation -- an essential capability for content creation and scene assembly. Earlier approaches primarily rely on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to model language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer, without training a new model. Our framework augments language supervision with geometry-aware objectives: a variant of the soft Iterative Closest Point (ICP) term to encourage surface attachment and a penetration loss to discourage interpenetration. A phased schedule strengthens contact constraints over time, and camera control concentrates the optimization on the interaction region. To enable evaluation, we curate a benchmark containing diverse categories and relations, and compare against baselines. Our method outperforms all alternatives, yielding semantically faithful and physically plausible alignments.
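
The geometry-aware part of the objective described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the softmin weighting, the `temperature` value, the linear ramp in `contact_weight`, and the use of a sphere as a stand-in for the target mesh's signed distance field are all assumptions made for illustration. The CLIP term and the differentiable renderer are omitted entirely.

```python
import numpy as np

def rotation_from_axis_angle(rotvec):
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rotvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def apply_pose(points, rotvec, translation, scale):
    """Apply isotropic scale, then rotation, then translation to (N, 3) points."""
    R = rotation_from_axis_angle(np.asarray(rotvec, dtype=float))
    return scale * points @ R.T + np.asarray(translation, dtype=float)

def soft_attachment_loss(src, tgt, temperature=0.05):
    """Soft nearest-neighbour term (assumed form): for each source point,
    a softmin-weighted average of squared distances to all target points."""
    d2 = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(axis=-1)  # (N, M)
    logits = -d2 / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)
    return float((w * d2).sum(axis=1).mean())

def sphere_penetration_loss(src, center, radius):
    """Penalise source points inside a sphere (hypothetical stand-in for the
    target's signed distance field); zero for points outside."""
    sdf = np.linalg.norm(src - np.asarray(center, dtype=float), axis=1) - radius
    return float((np.clip(-sdf, 0.0, None) ** 2).mean())

def contact_weight(step, total_steps, w_max=1.0):
    """Phased schedule (assumed linear ramp): contact terms start at zero
    and reach w_max at the final step."""
    return w_max * min(1.0, step / max(1, total_steps - 1))

def geometry_loss(src, tgt, center, radius, step, total_steps):
    """Weighted sum of attachment and penetration terms at a given step."""
    w = contact_weight(step, total_steps)
    return w * (soft_attachment_loss(src, tgt)
                + sphere_penetration_loss(src, center, radius))
```

In a full pipeline, `geometry_loss` would be added to a language-driven image loss and minimized over `rotvec`, `translation`, and `scale` with an autodiff framework; NumPy is used here only to keep the terms easy to inspect.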
