CV · AI · LG · Apr 16, 2025

FLIP Reasoning Challenge

ETH Zurich
arXiv:2504.12256v1 · 1 citation · h-index: 24 · Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the problem of evaluating reasoning capabilities in AI systems by providing researchers with a new multimodal benchmark. The advance is incremental, since it builds on existing datasets and methods.

This paper tackles the challenge of AI reasoning by introducing the FLIP dataset, a benchmark based on human verification tasks from the Idena blockchain, and finds that state-of-the-art models achieve up to 77.9% accuracy in zero-shot settings, significantly below human performance of 95.3%.

In recent years, advances in artificial intelligence (AI) have shown that AI can solve many perception and generation tasks, such as image classification and text writing, yet reasoning remains a challenge. This paper introduces the FLIP dataset, a benchmark for evaluating AI reasoning capabilities based on human verification tasks on the Idena blockchain. FLIP challenges present users with two orderings of 4 images and require them to identify the logically coherent one. By emphasizing sequential reasoning, visual storytelling, and common sense, FLIP provides a unique testbed for multimodal AI systems. Our experiments evaluate state-of-the-art models, leveraging both vision-language models (VLMs) and large language models (LLMs). Results reveal that even the best open-source and closed-source models achieve maximum accuracies of 75.5% and 77.9%, respectively, in zero-shot settings, compared to human performance of 95.3%. Captioning models aid reasoning models by providing text descriptions of images, which yields better results than using the raw images directly: 75.2% with captions vs. 69.6% with raw images for Gemini 1.5 Pro. Combining the predictions from 15 models in an ensemble increases the accuracy to 85.2%. These findings highlight the limitations of existing reasoning models and the need for robust multimodal benchmarks like FLIP. The full codebase and dataset will be available at https://github.com/aplesner/FLIP-Reasoning-Challenge.
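To make the ensemble result concrete, the following is a minimal sketch of how per-model predictions on binary FLIP-style challenges could be combined and scored. It assumes simple majority voting; the abstract does not say which combination rule the authors used. The labels "left" and "right" and the function names are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent label among the models' predictions.

    Assumption: plain majority voting. With an odd number of models and
    two possible labels, ties cannot occur.
    """
    return Counter(predictions).most_common(1)[0][0]


def ensemble_accuracy(
    model_preds: Sequence[Sequence[str]], gold: Sequence[str]
) -> float:
    """Accuracy of the majority-vote ensemble.

    model_preds[i][j] is model i's answer for challenge j (hypothetical
    labels "left" / "right" for the two candidate image orderings).
    """
    # zip(*model_preds) regroups the answers per challenge instead of per model.
    votes = [majority_vote(column) for column in zip(*model_preds)]
    correct = sum(vote == answer for vote, answer in zip(votes, gold))
    return correct / len(gold)


# Toy data: three imperfect models, four challenges.
gold = ["left", "right", "left", "right"]
model_a = ["left", "right", "right", "right"]
model_b = ["left", "left", "left", "right"]
model_c = ["right", "right", "left", "left"]

print(ensemble_accuracy([model_a, model_b, model_c], gold))
```

In this toy data each model errs on a different challenge, so the vote corrects every individual mistake. This illustrates why an ensemble can score higher than its best single member, as the 85.2% figure suggests.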

Code Implementations: 1 repo