Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering
This survey addresses the fragmented progress in 3D SQA by giving researchers a systematic foundation to guide future work, though, as a survey, its contribution is incremental.
This survey tackles the challenge of unifying analysis and comparison across datasets and baselines in 3D Scene Question Answering (3D SQA). It provides the first comprehensive review of the field, organizing existing work by datasets, methodologies, and evaluation metrics in order to identify shared patterns and propose future directions.
3D Scene Question Answering (3D SQA) is an interdisciplinary task that integrates 3D visual perception and natural language processing, enabling intelligent agents to comprehend and interact with complex 3D environments. Recent advances in large multimodal modeling have driven the creation of diverse datasets and spurred the development of instruction-tuning and zero-shot methods for 3D SQA. However, this rapid progress introduces challenges, particularly in achieving unified analysis and comparison across datasets and baselines. In this survey, we provide the first comprehensive and systematic review of 3D SQA. We organize existing work from three perspectives: datasets, methodologies, and evaluation metrics. Beyond basic categorization, we identify shared architectural patterns across methods. Our survey further synthesizes core limitations and discusses how current trends, such as instruction tuning, multimodal alignment, and zero-shot methods, can shape future developments. Finally, we propose a range of promising research directions covering dataset construction, task generalization, interaction modeling, and unified evaluation protocols. This work aims to serve as a foundation for future research and to foster progress toward more generalizable and intelligent 3D SQA systems.