GR, CV | Jul 11, 2025

Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation

Liu He, Xiao Zeng, Yizhi Song, Albert Y. C. Chen, Lu Xia, Shashwat Verma, Sankalp Dayal, Min Sun, Cheng-Hao Kuo, Daniel Aliaga

arXiv:2507.08513v2 | h-index: 5 | Has Code

Originality: Incremental advance

AI Analysis

The work addresses a specific bottleneck in MLLMs for applications that require precise spatial understanding. It is incremental, however, because it builds on existing synthetic data generation and instruction-tuning methods.

The paper tackles the difficulty that multimodal large language models (MLLMs) have with camera-object relations by generating a large-scale 3D visual instruction dataset. MLLMs fine-tuned on this dataset outperform commercial models, with an average accuracy improvement of 33.4% on camera-object relation recognition tasks.

Multimodal Large Language Models (MLLMs) struggle to capture camera-object relations accurately, especially object orientation, camera viewpoint, and camera shots. This stems from the fact that existing MLLMs are trained on images with limited diversity in camera-object relations and corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. Our framework takes 3D assets as input and uses rendering and diffusion-based image generation models to create photorealistic images that preserve precise camera-object relations. Additionally, large language models (LLMs) are used to generate text prompts for guiding visual instruction tuning and controlling image generation. We create Ultimate3D, a dataset of 240K VQAs with precise camera-object annotations, and a corresponding benchmark. MLLMs fine-tuned on our proposed dataset outperform commercial models by a large margin, achieving an average accuracy improvement of 33.4% on camera-object relation recognition tasks. Our code, dataset, and benchmark will contribute to broad MLLM applications.
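
The abstract gives no implementation details, so the following is only a minimal sketch of how a known camera pose could be turned into question-answer pairs once an image has been rendered from a 3D asset. The label names, the 45-degree orientation bins, the 15-degree elevation thresholds, the question wording, and every function and field name here are assumptions made for illustration; they are not taken from the paper or its code.

```python
from dataclasses import dataclass

# Hypothetical label set; the paper's actual taxonomy may differ.
ORIENTATIONS = ["front", "front-right", "right", "back-right",
                "back", "back-left", "left", "front-left"]


@dataclass
class CameraPose:
    # Assumed convention: 0 degrees means the camera faces the object's front,
    # and the azimuth increases toward the object's right side.
    azimuth_deg: float
    # Assumed convention: 0 is eye level, positive values are above the object.
    elevation_deg: float


def orientation_label(azimuth_deg: float) -> str:
    """Bin the azimuth into one of eight 45-degree sectors centered on each label."""
    sector = int(((azimuth_deg % 360) + 22.5) // 45) % 8
    return ORIENTATIONS[sector]


def viewpoint_label(elevation_deg: float) -> str:
    """Map elevation to a coarse viewpoint class (thresholds are illustrative)."""
    if elevation_deg > 15:
        return "high-angle"
    if elevation_deg < -15:
        return "low-angle"
    return "eye-level"


def make_vqa(image_path: str, category: str, pose: CameraPose) -> list:
    """Build two question-answer records from the ground-truth camera pose."""
    return [
        {"image": image_path,
         "question": f"Which side of the {category} is facing the camera?",
         "answer": orientation_label(pose.azimuth_deg)},
        {"image": image_path,
         "question": f"From what viewpoint is the {category} photographed?",
         "answer": viewpoint_label(pose.elevation_deg)},
    ]


if __name__ == "__main__":
    for record in make_vqa("render_0001.png", "car", CameraPose(90.0, 30.0)):
        print(record)
```

Because the answers are derived from the pose used at render time rather than from human labeling, annotations of this kind stay exact as long as the image generation step preserves the camera-object relation, which is the property the abstract emphasizes.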
