3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering
This work addresses data scarcity for embodied AI tasks in 3D environments, offering a scalable solution with publicly released datasets and code, though it appears incremental as it builds on existing datasets and methods.
The paper tackles the need for diverse and scalable data in indoor scene tasks such as question answering and dense captioning. It proposes 3D-MoRe, a paradigm that generates large-scale 3D-language datasets, and reports a 2.15% increase in CIDEr on ScanQA and a 1.84% increase in CIDEr@0.5 on ScanRefer.
With the growing need for diverse and scalable data in indoor scene tasks, such as question answering and dense captioning, we propose 3D-MoRe, a novel paradigm designed to generate large-scale 3D-language datasets by leveraging the strengths of foundation models. The framework integrates three key components (multi-modal embedding, cross-modal interaction, and a language model decoder) to process natural language instructions and 3D scene data, which facilitates enhanced reasoning and response generation in complex 3D environments. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer (QA) pairs and 73,000 object descriptions across 1,513 scenes. We also employ various data augmentation techniques and implement semantic filtering to ensure high-quality data. Experiments on ScanQA demonstrate that 3D-MoRe significantly outperforms state-of-the-art baselines, with the CIDEr score improving by 2.15%. Similarly, on ScanRefer, our approach achieves a 1.84% increase in CIDEr@0.5, highlighting its effectiveness in both tasks. Our code and generated datasets will be publicly released to benefit the community, and both can be accessed at https://3D-MoRe.github.io.
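The abstract mentions semantic filtering of generated text but does not say how it is done. The following is only a minimal sketch of the general idea, assuming a bag-of-words cosine similarity as a stand-in for whatever similarity measure the authors actually use. The function names, the `threshold` value, and the example sentences are all hypothetical and do not come from the paper.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_filter(generated, reference, threshold=0.3):
    """Keep generated texts whose best similarity to any reference text
    reaches the threshold. The threshold is an assumed, illustrative value."""
    ref_vectors = [bag_of_words(r) for r in reference]
    kept = []
    for text in generated:
        vec = bag_of_words(text)
        best = max((cosine_similarity(vec, r) for r in ref_vectors), default=0.0)
        if best >= threshold:
            kept.append(text)
    return kept


# Hypothetical example: one human annotation and two generated candidates.
reference = ["the brown chair is next to the table"]
generated = [
    "a brown chair next to the table",
    "purple elephants sing loudly",
]
kept = semantic_filter(generated, reference)
```

In this toy example the first candidate shares most of its tokens with the reference and is kept, while the second shares none and is dropped. A real pipeline would more plausibly compare sentence embeddings from a pretrained encoder than raw token counts.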