CV · Dec 10, 2024

Driving with InternVL: Outstanding Champion in the Track on Driving with Language of the Autonomous Grand Challenge at CVPR 2024

arXiv:2412.07247v1 · 7.67 citations · h-index: 15 · Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses autonomous driving tasks requiring language understanding, but it is incremental as it applies an existing model to a competition dataset with minor adaptations.

The authors tackled the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge by fine-tuning the open-source multimodal model InternVL-1.5 on the DriveLM-nuScenes dataset, achieving a score of 0.6002 on the final leaderboard.

This technical report describes the methods we employed for the Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. We utilized a powerful open-source multimodal model, InternVL-1.5, and performed full-parameter fine-tuning on the competition dataset, DriveLM-nuScenes. To effectively handle the multi-view images of nuScenes and seamlessly inherit InternVL's outstanding multimodal understanding capabilities, we formatted and concatenated the multi-view images in a specific manner. This ensured that the final model could meet the specific requirements of the competition task while leveraging InternVL's powerful image understanding capabilities. Meanwhile, we designed a simple automatic annotation strategy that converts the center points of objects in DriveLM-nuScenes into corresponding bounding boxes. As a result, our single model achieved a score of 0.6002 on the final leaderboard.
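The two preprocessing ideas in the abstract (tiling the multi-view images into one input, and turning object center points into bounding boxes) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the tiling layout or how box sizes are chosen, so the 2x3 grid `GRID`, the fixed half-sizes `half_w` and `half_h`, and the helper names `to_canvas` and `center_to_box` are all assumptions made here for illustration. The only facts relied on are that nuScenes provides six camera views at 1600x900 pixels.

```python
from typing import Dict, Tuple

# nuScenes camera frames are 1600x900 pixels.
VIEW_W, VIEW_H = 1600, 900

# ASSUMPTION: a 2x3 tiling given as (row, column) per camera. The report only
# says the views were concatenated "in a specific manner".
GRID: Dict[str, Tuple[int, int]] = {
    "CAM_FRONT_LEFT": (0, 0), "CAM_FRONT": (0, 1), "CAM_FRONT_RIGHT": (0, 2),
    "CAM_BACK_LEFT": (1, 0), "CAM_BACK": (1, 1), "CAM_BACK_RIGHT": (1, 2),
}


def to_canvas(cam: str, x: float, y: float) -> Tuple[float, float]:
    """Map a pixel in one camera view to the tiled canvas (hypothetical layout)."""
    row, col = GRID[cam]
    return col * VIEW_W + x, row * VIEW_H + y


def center_to_box(
    cx: float,
    cy: float,
    half_w: float = 50.0,  # ASSUMPTION: fixed half-width, not from the report
    half_h: float = 50.0,  # ASSUMPTION: fixed half-height, not from the report
    width: int = VIEW_W,
    height: int = VIEW_H,
) -> Tuple[float, float, float, float]:
    """Expand a center point into an (x1, y1, x2, y2) box clipped to the image."""
    x1 = max(0.0, cx - half_w)
    y1 = max(0.0, cy - half_h)
    x2 = min(float(width), cx + half_w)
    y2 = min(float(height), cy + half_h)
    return x1, y1, x2, y2
```

Under these assumptions, a center point annotated in one camera view is first expanded into a box in that view's coordinates, and each corner can then be shifted with `to_canvas` so that it refers to the concatenated image the model actually sees.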
