Challenges and Prospects in Vision and Language Research
This paper addresses the problem of misleading progress in AI evaluation, which concerns both researchers and developers, but its contribution is incremental because it builds on existing critiques.
The paper reviews how current vision-language systems achieve high performance by exploiting flaws in datasets and evaluation procedures rather than through genuine intelligence, and proposes a path toward more robust benchmarks.
Language-grounded image understanding tasks have often been proposed as a method for evaluating progress in artificial intelligence. Ideally, these tasks should test a wide range of capabilities that integrate computer vision, reasoning, and natural language understanding. However, rather than behaving as visual Turing tests, these tasks have been shown by recent studies to allow state-of-the-art systems to achieve good performance by exploiting flaws in datasets and evaluation procedures. We review the current state of affairs and outline a path forward.