CV · Mar 14, 2025

DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models

arXiv:2503.11265v116.421 citations · h-index: 5

Originality: Incremental advance

AI Analysis

The work addresses a safety-critical problem in autonomous driving, namely the perception of small or distant objects, though it appears incremental because it builds on existing vision-language models.

The paper tackles the loss of fine detail that downsampling causes in vision-language models for autonomous driving, which can lead to missed small or distant objects. It proposes DynRsl-VLM, which combines dynamic-resolution image input processing with a new image-text alignment module, and reports enhanced perception without exceeding computational limits.

Visual Question Answering (VQA) models, which fall under the category of vision-language models, conventionally execute multiple downsampling processes on image inputs to strike a balance between computational efficiency and model performance. Although this approach aids in concentrating on salient features and diminishing computational burden, it incurs the loss of vital detailed information, a drawback that is particularly damaging in end-to-end autonomous driving scenarios. Downsampling can lead to an inadequate capture of distant or small objects such as pedestrians, road signs, or obstacles, all of which are crucial for safe navigation. This loss of features negatively impacts an autonomous driving system's capacity to accurately perceive the environment, potentially escalating the risk of accidents. To tackle this problem, we put forward the Dynamic Resolution Vision Language Model (DynRsl-VLM). DynRsl-VLM incorporates a dynamic resolution image input processing approach that captures all entity feature information within an image while ensuring that the image input remains computationally tractable for the Vision Transformer (ViT). Moreover, we devise a novel image-text alignment module to replace the Q-Former, enabling simple and efficient alignment with text when dealing with dynamic resolution image inputs. Our method enhances the environmental perception capabilities of autonomous driving systems without overstepping computational constraints.
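
To make the downsampling problem concrete, the sketch below computes how many pixels a small object keeps when a full camera frame is resized to a single ViT input, and compares that with a tiled, higher-resolution input. The abstract does not describe how DynRsl-VLM builds its dynamic-resolution input, so the aspect-ratio-matched tiling here is only an assumed stand-in, and the frame size (1600x900), tile size (224), tile budget, and all function names are hypothetical values chosen for illustration.

```python
def footprint_after_resize(obj_w, obj_h, img_w, img_h, out_w, out_h):
    """Pixel size of an object after the whole image is resized to out_w x out_h."""
    return obj_w * out_w / img_w, obj_h * out_h / img_h


def choose_tile_grid(img_w, img_h, max_tiles):
    """Pick (cols, rows) with cols * rows <= max_tiles whose aspect ratio
    is closest to the image's. Ties go to the grid with more tiles.

    NOTE: assumed tiling heuristic, not the method from the paper.
    """
    aspect = img_w / img_h
    best, best_err = (1, 1), abs(aspect - 1.0)
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            err = abs(aspect - cols / rows)
            more_tiles = cols * rows > best[0] * best[1]
            if err < best_err - 1e-9 or (abs(err - best_err) <= 1e-9 and more_tiles):
                best, best_err = (cols, rows), err
    return best


if __name__ == "__main__":
    IMG_W, IMG_H = 1600, 900   # hypothetical camera frame
    TILE = 224                 # hypothetical ViT input side length
    OBJ_W, OBJ_H = 40, 80      # hypothetical distant pedestrian, in pixels

    # Single fixed-resolution input: the whole frame is squeezed into one tile.
    w, h = footprint_after_resize(OBJ_W, OBJ_H, IMG_W, IMG_H, TILE, TILE)
    print(f"single {TILE}x{TILE} input: object is {w:.1f} x {h:.1f} px")

    # Tiled input: the frame is resized to a grid of tiles within a budget.
    cols, rows = choose_tile_grid(IMG_W, IMG_H, max_tiles=12)
    w, h = footprint_after_resize(
        OBJ_W, OBJ_H, IMG_W, IMG_H, cols * TILE, rows * TILE
    )
    print(f"{cols}x{rows} tile grid: object is {w:.1f} x {h:.1f} px")
```

Under these assumed numbers the object occupies several times more pixels in the tiled input, at the cost of a ViT token count that grows with the number of tiles. That trade-off is the computational constraint the abstract refers to.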
