Probing Multimodal LLMs as World Models for Driving
This work evaluates MLLMs in dynamic driving scenarios and highlights gaps in their capabilities for autonomous driving applications. The contribution is incremental: it focuses on assessment rather than proposing new solutions.
The study assessed Multimodal Large Language Models (MLLMs) as world models for autonomous driving using in-car camera footage. It found that while they interpret individual images well, they struggle to synthesize coherent narratives across frames, leading to inaccuracies in understanding ego-vehicle dynamics, interactions with other road actors, trajectory planning, and open-set scene reasoning.
We provide a sober look at the application of Multimodal Large Language Models (MLLMs) in autonomous driving, challenging common assumptions about their ability to interpret dynamic driving scenarios. Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored. Our experimental study assesses various MLLMs as world models using in-car camera perspectives and reveals that while these models excel at interpreting individual images, they struggle to synthesize coherent narratives across frames, leading to considerable inaccuracies in understanding (i) ego vehicle dynamics, (ii) interactions with other road actors, (iii) trajectory planning, and (iv) open-set scene reasoning. We introduce the Eval-LLM-Drive dataset and DriveSim simulator to enhance our evaluation, highlighting gaps in current MLLM capabilities and the need for improved models in dynamic real-world environments.
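To make the four-way breakdown above concrete, here is a minimal sketch of how per-category accuracy could be tallied for a model's answers. It is not the authors' evaluation code: the category keys, the `(category, predicted, expected)` record layout, and the case-insensitive exact-match scoring are all assumptions chosen for illustration, and the sample records are invented.

```python
from collections import defaultdict

# Hypothetical short keys for the four areas named in the abstract.
CATEGORIES = (
    "ego_dynamics",
    "actor_interaction",
    "trajectory_planning",
    "open_set_reasoning",
)


def per_category_accuracy(records):
    """Return {category: accuracy} for (category, predicted, expected) records.

    Scoring is case-insensitive exact match, which is an assumption made
    for this sketch; categories with no records are omitted from the result.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, expected in records:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        total[category] += 1
        if predicted.strip().lower() == expected.strip().lower():
            correct[category] += 1
    return {c: correct[c] / total[c] for c in CATEGORIES if total[c]}


# Invented sample answers, for illustration only (not results from the paper).
records = [
    ("ego_dynamics", "Left", "left"),
    ("ego_dynamics", "right", "left"),
    ("trajectory_planning", "stop", "stop"),
]
scores = per_category_accuracy(records)
```

Reporting one score per category, rather than a single aggregate, keeps a weakness in one area (for example, ego-vehicle dynamics) from being masked by strength in another.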