LG AI CV · Feb 24, 2022

Measuring CLEVRness: Blackbox testing of Visual Reasoning Models

Spyridon Mouselinos, Henryk Michalewski, Mateusz Malinowski

arXiv:2202.12162v2 · 4.64 citations

Originality: Incremental advance

AI Analysis

This work addresses how to evaluate reasoning in AI systems, a question relevant to both researchers and practitioners. It highlights potential biases in data-driven approaches, though its contribution is incremental in that it extends existing testing frameworks.

The authors assess whether visual question answering models truly reason by proposing a black-box adversarial test for CLEVR models. They show that models performing at a human level can be easily fooled, which casts doubt on their reasoning capabilities.

How can we measure the reasoning capabilities of intelligence systems? Visual question answering provides a convenient framework for testing the model's abilities by interrogating the model through questions about the scene. However, despite scores of various visual QA datasets and architectures, which sometimes yield even a super-human performance, the question of whether those architectures can actually reason remains open to debate. To answer this, we extend the visual question answering framework and propose the following behavioral test in the form of a two-player game. We consider black-box neural models of CLEVR. These models are trained on a diagnostic dataset benchmarking reasoning. Next, we train an adversarial player that re-configures the scene to fool the CLEVR model. We show that CLEVR models, which otherwise could perform at a human level, can easily be fooled by our agent. Our results put in doubt whether data-driven approaches can do reasoning without exploiting the numerous biases that are often present in those datasets. Finally, we also propose a controlled experiment measuring the efficiency of such models to learn and perform reasoning.
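The two-player game described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the scene is a list of dictionaries, `biased_model` is a hypothetical stand-in for a black-box CLEVR model with a built-in spatial bias, and the adversary uses plain random search rather than the trained agent the paper describes. The adversary only moves objects, so the correct answer to a counting question never changes, and it only observes the model's answers.

```python
import random

def true_count(scene, color, shape):
    """Ground-truth answer to 'how many <color> <shape>s are there?'."""
    return sum(1 for o in scene if o["color"] == color and o["shape"] == shape)

def biased_model(scene, color, shape):
    """Hypothetical black-box model: it ignores objects with x >= 0.8,
    mimicking a positional bias picked up from training data."""
    return sum(
        1 for o in scene
        if o["color"] == color and o["shape"] == shape and o["x"] < 0.8
    )

def reconfigure(scene, rng):
    """Move one randomly chosen object to a new position.
    Colors and shapes are untouched, so the true answer is preserved."""
    new_scene = [dict(o) for o in scene]
    obj = rng.choice(new_scene)
    obj["x"], obj["y"] = rng.random(), rng.random()
    return new_scene

def attack(model, scene, color, shape, budget=200, seed=0):
    """Random-search adversary (an assumption, not the paper's trained agent).
    Queries the model as a black box and returns (fooling_scene, steps_used),
    or (None, budget) if no fooling scene was found within the budget."""
    rng = random.Random(seed)
    answer = true_count(scene, color, shape)
    current = scene
    for step in range(1, budget + 1):
        candidate = reconfigure(current, rng)
        if model(candidate, color, shape) != answer:
            return candidate, step
        current = candidate
    return None, budget

# Example scene: the biased model answers correctly here (all x < 0.8).
example_scene = [
    {"color": "red", "shape": "cube", "x": 0.2, "y": 0.3},
    {"color": "red", "shape": "cube", "x": 0.5, "y": 0.6},
    {"color": "blue", "shape": "sphere", "x": 0.4, "y": 0.1},
]
```

In this sketch, a model that actually counts the objects (for example, passing `true_count` itself as the model) can never be fooled, because the reconfiguration preserves the answer. A model that relies on a positional shortcut is exposed as soon as a relevant object is moved outside the region it attends to.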
