CV AI CL LG ROJun 7, 2024

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

Jianing Yang, Xuweiyi Chen, Nikhil Madaan, Madhavan Iyengar, Shengyi Qian, David F. Fouhey, Joyce Chai

arXiv:2406.05132v323.840 citationsHas Code

Originality Highly original

AI Analysis

This provides a foundational resource for the embodied AI community to develop more reliable 3D-LLMs, addressing a key bottleneck in robotics and agent systems.

The paper tackles the lack of large-scale datasets for integrating language with 3D perception in embodied AI by introducing 3D-GRAND, a dataset of 40,087 household scenes with 6.2 million densely-grounded instructions, which improves grounding and reduces hallucinations in 3D-LLMs, showing a scaling effect with dataset size and early sim-to-real transfer signals.

The integration of language and 3D perception is crucial for embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is a lack of large-scale datasets with dense grounding between language and 3D scenes. We introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs, enabling fair comparisons of models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, emphasizing the importance of large-scale 3D-text datasets for embodied AI research. Our results demonstrate early signals for effective sim-to-real transfer, indicating that models trained on large synthetic data can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with resources and insights to lead to more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io

View on arXiv PDF Code

Similar