Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases
This paper addresses the problem of evaluating LVLMs for interpretable self-driving in severe corner cases, offering a new benchmark and a new model. The contribution is incremental, as it builds on existing LVLM and LLM methods.
The paper tackles the lack of automated evaluation for large vision-language models (LVLMs) in self-driving corner cases. It proposes CODA-LM, a benchmark that uses text-only LLMs as judges, which align better with human preferences than LVLM judges. It also builds CODA-VLM, which surpasses all open-source counterparts on CODA-LM and outperforms GPT-4V by +21.42% on the regional perception task.
Large Vision-Language Models (LVLMs) have received widespread attention for advancing interpretable self-driving. Existing evaluations of LVLMs primarily focus on multi-faceted capabilities in natural circumstances and lack automated, quantifiable assessment for self-driving, let alone for severe road corner cases. In this work, we propose CODA-LM, the very first benchmark for the automatic evaluation of LVLMs on self-driving corner cases. We adopt a hierarchical data structure and prompt powerful LVLMs to analyze complex driving scenes and generate high-quality pre-annotations for human annotators. For LVLM evaluation, we show that using text-only large language models (LLMs) as judges yields better alignment with human preferences than using LVLM judges. Moreover, with CODA-LM, we build CODA-VLM, a new driving LVLM that surpasses all open-source counterparts on CODA-LM. CODA-VLM performs comparably with GPT-4V, even surpassing it by +21.42% on the regional perception task. We hope CODA-LM can become a catalyst for interpretable self-driving empowered by LVLMs.
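The text-only judging idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the prompt wording, the 1 to 10 rating scale, the `Rating: <integer>` reply format, and every function name below are assumptions made for illustration. The judge is passed in as a plain callable, so any LLM client could be substituted; a stub stands in for it here.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical prompt template; the actual judge prompt is not given in the text.
JUDGE_TEMPLATE = (
    "You are grading a description of a driving scene.\n"
    "Reference answer:\n{reference}\n\n"
    "Model answer:\n{candidate}\n\n"
    "Rate the model answer from 1 to 10 for accuracy and completeness "
    "relative to the reference. Reply in the form 'Rating: <integer>'."
)


def build_judge_prompt(reference: str, candidate: str) -> str:
    """Fill the template with text only; no image is sent to the judge."""
    return JUDGE_TEMPLATE.format(reference=reference, candidate=candidate)


def parse_rating(reply: str, low: int = 1, high: int = 10) -> Optional[int]:
    """Extract the integer rating, or None if it is missing or out of range."""
    match = re.search(r"Rating:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


def judge_responses(
    pairs: Iterable[Tuple[str, str]],
    judge: Callable[[str], str],
) -> float:
    """Average the parseable ratings over (reference, candidate) text pairs."""
    scores = []
    for reference, candidate in pairs:
        rating = parse_rating(judge(build_judge_prompt(reference, candidate)))
        if rating is not None:
            scores.append(rating)
    if not scores:
        raise ValueError("judge returned no parseable ratings")
    return sum(scores) / len(scores)


def stub_judge(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: replace with an actual client)."""
    return "Rating: 7"
```

With a real judge, `judge_responses` would be called on pairs of human-verified annotations and LVLM outputs. Because the judge sees only text, the quality of the reference annotation determines how much of the scene the score can reflect.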