CV, MM, RO · Dec 10, 2024

RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving

arXiv:2412.07689v5 · 24 citations · h-index: 12 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for more generalizable and capable models in autonomous driving, though it appears incremental as it builds on existing large multimodal model frameworks.

The paper tackles the problem of limited generalization in autonomous driving models by proposing RoboTron-Drive, an all-in-one large multimodal model that achieves state-of-the-art performance on six public benchmarks and demonstrates strong zero-shot transfer on three unseen datasets.

Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite these advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose RoboTron-Drive, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD datasets to finetune the model, resulting in an all-in-one LMM for autonomous driving. To assess its general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on three unseen datasets, where RoboTron-Drive achieves state-of-the-art performance across all tasks. We hope RoboTron-Drive can serve as a promising solution for AD in the real world. Project page with code: https://github.com/zhijian11/RoboTron-Drive.
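
As a rough illustration of what "standardizing" heterogeneous AD datasets into one fine-tuning format might look like, the sketch below maps records from two hypothetical sources into a shared question-answer sample. The `Sample` fields, the `<image>`/`<video>` placeholder tokens, and both source record layouts are assumptions made for illustration only; the abstract does not specify the actual format, which should be taken from the linked repository.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """Hypothetical unified fine-tuning sample (not the paper's actual schema)."""
    source: str
    task: str                # e.g. "perception", "prediction", "planning"
    modality: str            # "image", "video", "multi_view_image", "multi_view_video"
    media: List[List[str]]   # one list of frame paths per camera view
    question: str
    answer: str


def infer_modality(media: List[List[str]]) -> str:
    """Classify the input by number of views and number of frames per view."""
    base = "video" if any(len(frames) > 1 for frames in media) else "image"
    return f"multi_view_{base}" if len(media) > 1 else base


def build_sample(source: str, task: str, media: List[List[str]],
                 question: str, answer: str) -> Sample:
    modality = infer_modality(media)
    # Assumed placeholder token marking where visual features would be inserted.
    token = "<video>" if "video" in modality else "<image>"
    return Sample(source, task, modality, media, f"{token}\n{question}", answer)


def from_single_image_qa(rec: Dict) -> Sample:
    """Adapter for a hypothetical single-image QA dataset."""
    return build_sample("dataset_a", rec["task"], [[rec["img"]]], rec["q"], rec["a"])


def from_multi_view_clip(rec: Dict) -> Sample:
    """Adapter for a hypothetical multi-camera clip dataset."""
    # Sort camera names so the view order is deterministic across records.
    media = [rec["cams"][name] for name in sorted(rec["cams"])]
    return build_sample("dataset_b", rec["task"], media,
                        rec["instruction"], rec["response"])


image_sample = from_single_image_qa(
    {"img": "a/0001.jpg", "task": "perception",
     "q": "What is ahead?", "a": "A truck."}
)
clip_sample = from_multi_view_clip(
    {"cams": {"front": ["f0.jpg", "f1.jpg"], "back": ["b0.jpg", "b1.jpg"]},
     "task": "planning", "instruction": "What should the ego car do?",
     "response": "Slow down."}
)
print(image_sample.modality, clip_sample.modality)
```

Keeping one adapter per source dataset and a single shared sample type is one common way to let a single model train on mixed inputs; whether RoboTron-Drive organizes its data this way is not stated in the abstract.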

Code Implementations: 1 repo