RO AI CV Dec 21, 2023

LingoQA: Visual Question Answering for Autonomous Driving

Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, Elahe Arani, Oleg Sinavski

arXiv:2312.14115v435.3159 citations h-index: 8 Has Code ECCV

Originality: Synthesis-oriented

AI Analysis

The paper addresses the need for better evaluation of vision-language models in autonomous driving, though its contribution is incremental, centered on dataset creation and benchmarking.

The authors introduced LingoQA, a dataset and benchmark for visual question answering in autonomous driving, containing 28K video scenarios and 419K annotations, and found that state-of-the-art models like GPT-4V achieve only 59.6% truthfulness compared to 96.6% for humans.

We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios, and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient to human evaluations, surpassing existing techniques like METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving.
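The abstract reports agreement between an automatic metric and human evaluations as a Spearman correlation coefficient. As a rough illustration of what that statistic measures, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names `average_ranks` and `spearman` and the sample scores are hypothetical, chosen only for this example.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based position of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores, invented for this sketch (not data from the paper).
metric_scores = [0.91, 0.15, 0.78, 0.40, 0.66]  # automatic metric output
human_scores = [1.0, 0.0, 0.8, 0.6, 0.4]        # human ratings of the same items

rho = spearman(metric_scores, human_scores)
print(f"Spearman correlation: {rho:.2f}")
```

A coefficient near 1 means the metric orders the items almost the same way the human raters do, which is the sense in which a learned judge can be compared against metrics such as METEOR, BLEU, or CIDEr.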

