CV · Apr 14, 2025

Multi-Object Grounding via Hierarchical Contrastive Siamese Transformers

arXiv:2504.10048v16.21 citations · h-index: 1 · IJCNN

Originality: Incremental advance

AI Analysis

This addresses the need for multi-object grounding in real-world scenarios, representing an incremental advance over single-object methods.

The paper tackles the problem of localizing multiple objects in 3D scenes based on natural language input, achieving a 9.5% improvement over previous state-of-the-art methods on challenging benchmarks.

Multi-object grounding in 3D scenes involves localizing multiple objects based on natural language input. While previous work has primarily focused on single-object grounding, real-world scenarios often demand the localization of several objects. To tackle this challenge, we propose Hierarchical Contrastive Siamese Transformers (H-COST), which employs a Hierarchical Processing strategy to progressively refine object localization, enhancing the understanding of complex language instructions. Additionally, we introduce a Contrastive Siamese Transformer framework, in which two networks with identical structure are used: an auxiliary network processes robust object relations from ground-truth labels to guide and enhance the second network, the reference network, which operates on segmented point-cloud data. This contrastive mechanism strengthens the model's semantic understanding and significantly enhances its ability to process complex point-cloud data. Our approach outperforms previous state-of-the-art methods by 9.5% on challenging multi-object grounding benchmarks.
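
The abstract does not say which loss couples the auxiliary and reference networks. As a rough illustration of how one network's features can guide another's, the sketch below uses an InfoNCE-style contrastive loss, a common choice swapped in here as an assumption and not taken from the paper. The function names, the temperature value, and the toy feature vectors are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(reference_feats, auxiliary_feats, temperature=0.1):
    """InfoNCE-style loss (an assumption, not the paper's stated objective).

    The i-th reference feature (from segmented point-cloud input) is pulled
    toward the i-th auxiliary feature (from ground-truth labels) and pushed
    away from the auxiliary features of all other objects.
    """
    ref = [l2_normalize(v) for v in reference_feats]
    aux = [l2_normalize(v) for v in auxiliary_feats]
    total = 0.0
    for i, r in enumerate(ref):
        # Cosine similarity of reference object i to every auxiliary object.
        logits = [sum(a * b for a, b in zip(r, u)) / temperature for u in aux]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # Cross-entropy with the matching object as the positive.
        total += log_denom - logits[i]
    return total / len(ref)


# Toy features for two objects (hypothetical values).
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = info_nce(aligned, aligned)
loss_mismatched = info_nce(aligned, swapped)
```

With these toy inputs, the loss is small when each reference feature matches its auxiliary counterpart and large when the pairing is swapped, which is the behavior a guiding signal of this kind would need.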
