CV · May 21, 2025

TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving

arXiv:2505.15564v1 · h-index: 10
Originality: Incremental advance
AI Analysis

The paper addresses the computational resource constraints that vehicles impose on autonomous driving applications, though it appears incremental because it builds on existing VLM methods with optimizations.

The paper tackles the problem of deploying vision language models for visual question-answering in autonomous driving by introducing TinyDrive, a lightweight model that achieves state-of-the-art performance on the DriveLM benchmark, with relative improvements of 11.1% in BLEU-4 and 35.4% in METEOR.

Vision Language Models (VLMs) employed for visual question-answering (VQA) in autonomous driving often require substantial computational resources that pose a challenge for their deployment in resource-constrained vehicles. To address this challenge, we introduce TinyDrive, a lightweight yet effective VLM for multi-view VQA in driving scenarios. Our model comprises two key components: a multiscale vision encoder and a dual-level prioritization mechanism for tokens and sequences. The multiscale encoder facilitates the processing of multi-view images at diverse resolutions through scale injection and cross-scale gating to generate enhanced visual representations. At the token level, we design a token routing mechanism that dynamically selects and processes the most informative tokens based on learned importance scores. At the sequence level, we propose integrating normalized loss, uncertainty estimates, and a diversity metric to formulate sequence scores that rank and preserve samples within a sequence priority buffer. Samples with higher scores are more frequently selected for training. TinyDrive is first evaluated on our custom-curated VQA dataset, and it is subsequently tested on the public DriveLM benchmark, where it achieves state-of-the-art language understanding performance. Notably, it achieves relative improvements of 11.1% and 35.4% in BLEU-4 and METEOR scores, respectively, despite having a significantly smaller parameter count.
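
The two prioritization levels described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the routing rule or the score formula, so the top-k selection, the weighted-sum score, and the score-proportional sampling below are assumptions, as are all the names (`route_tokens`, `sequence_score`, `sample_from_buffer`, `keep_ratio`, `weights`). In the paper the importance scores are learned; here they are simply passed in.

```python
import math
import random


def route_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of tokens, preserving their order.

    Assumption: routing is modelled as plain top-k selection on the scores.
    """
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    kept = sorted(top)  # restore original token order
    return [tokens[i] for i in kept], kept


def sequence_score(norm_loss, uncertainty, diversity, weights=(1.0, 1.0, 1.0)):
    """Combine the three signals named in the abstract into one score.

    Assumption: a weighted sum; the paper may combine them differently.
    """
    w_loss, w_unc, w_div = weights
    return w_loss * norm_loss + w_unc * uncertainty + w_div * diversity


def sample_from_buffer(buffer, n, rng):
    """Draw n sample ids from (sample_id, score) pairs.

    Assumption: selection probability is proportional to the score, so
    higher-scoring samples are picked more often.
    """
    ids = [sample_id for sample_id, _ in buffer]
    sample_weights = [max(score, 1e-8) for _, score in buffer]
    return rng.choices(ids, weights=sample_weights, k=n)


if __name__ == "__main__":
    kept_tokens, kept_idx = route_tokens(["a", "b", "c", "d"], [0.1, 0.9, 0.3, 0.8])
    print(kept_tokens, kept_idx)

    buffer = [("s1", sequence_score(0.2, 0.1, 0.1)),
              ("s2", sequence_score(0.9, 0.7, 0.5))]
    print(sample_from_buffer(buffer, 5, random.Random(0)))
```

Sorting the kept indices back into their original order keeps the reduced token sequence positionally consistent, which downstream attention layers typically rely on.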

