Generating Context-Aware Natural Answers for Questions in 3D Scenes
This work addresses a limitation in 3D question answering by enabling more natural, context-aware responses. The contribution is incremental in that it builds on existing benchmarks, but it introduces a novel generation approach.
The paper tackles the problem of generating free-form natural answers for questions in 3D scenes, moving beyond pre-defined answer spaces. It achieves state-of-the-art results on the ScanQA benchmark, with CIDEr scores of 72.22 and 66.57 on the test sets.
3D question answering is a young field in 3D vision-language that remains largely unexplored. Previous methods are limited to a pre-defined answer space and cannot generate answers naturally. In this work, we recast question answering as a sequence generation task that produces free-form natural answers for questions in 3D scenes (Gen3DQA). To this end, we optimize our model directly on language rewards to secure the global sentence semantics. We also adapt a pragmatic language understanding reward to further improve sentence quality. Our method sets a new state of the art on the ScanQA benchmark (CIDEr score 72.22/66.57 on the test sets).
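To make the idea of optimizing directly on a language reward concrete, the following is a minimal sketch, not the paper's implementation. It assumes a REINFORCE-style objective with a greedy-decoding baseline, which the abstract does not specify. The names `toy_reward` and `policy_gradient_loss` are hypothetical, the unigram-overlap F1 is only a stand-in for a sentence-level metric such as CIDEr, and the token probabilities are made-up values.

```python
import math
from collections import Counter


def toy_reward(candidate, reference):
    """Unigram F1 overlap between two token lists.

    Assumption: a simple stand-in for a sentence-level reward such as CIDEr.
    """
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def policy_gradient_loss(token_probs, sampled_reward, baseline_reward):
    """REINFORCE-style loss: -(r_sampled - r_baseline) * sum(log p(token)).

    Minimizing this raises the likelihood of sampled answers that score
    above the baseline and lowers it for answers that score below.
    """
    log_prob = sum(math.log(p) for p in token_probs)
    advantage = sampled_reward - baseline_reward
    return -advantage * log_prob


# Hypothetical example: one question, one reference answer.
reference = "on the wooden table".split()
sampled = "on the table".split()      # answer sampled from the model
greedy = "the chair".split()          # greedy answer used as the baseline
sampled_probs = [0.5, 0.6, 0.4]       # made-up per-token probabilities

r_sampled = toy_reward(sampled, reference)
r_greedy = toy_reward(greedy, reference)
loss = policy_gradient_loss(sampled_probs, r_sampled, r_greedy)
print(f"reward(sampled)={r_sampled:.3f} reward(greedy)={r_greedy:.3f} loss={loss:.3f}")
```

Because the reward is computed on the whole generated sentence rather than per token, the training signal reflects global sentence semantics, which is the motivation stated above. In a real system the log-probabilities would come from the answer decoder and the loss would be backpropagated through it.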