CV, AI, CL, LG | Dec 10, 2024

MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models

arXiv:2412.07148v1 | 3 citations | h-index: 5 | Journal of Open Source Software

Originality: Incremental advance

AI Analysis

This work addresses a specific problem for AI researchers and practitioners: improving the reasoning of Vision-Language Models (VLMs) on complex visual question answering. It is an incremental advance, since it extends existing process-of-elimination (PoE) strategies to multi-modal contexts.

The paper aims to improve Vision-Language Models (VLMs) on multiple-choice visual reasoning tasks. It introduces MM-PoE, a method that uses a process of elimination to exclude implausible choices before selecting an answer, and reports significant performance gains in zero-shot and few-shot settings across three benchmark datasets.

This paper introduces Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models, herein referred to as Multi-Modal Process of Elimination (MM-PoE). This methodology is designed to improve the performance of Vision-Language Models (VLMs) on multiple-choice visual reasoning tasks. Unlike conventional approaches that evaluate each option independently, MM-PoE employs a two-step scoring paradigm that first identifies and excludes implausible choices, then concentrates on the most probable remaining options. This method emulates human test-taking strategies, where individuals typically eliminate clearly incorrect answers before selecting the best response. Our empirical evaluations, conducted across three benchmark datasets, show that MM-PoE significantly improves both the zero-shot and few-shot performance of contemporary state-of-the-art VLMs. Critically, this approach extends the elimination process to multi-modal contexts and also allows few-shot experiments, thereby addressing two principal limitations of PoE: its use only in zero-shot settings and only within a language-only framework. As a result, MM-PoE not only refines the reasoning capabilities of VLMs but also broadens their applicability to complex visual question-answering scenarios. All code and documentation supporting our work are available at https://pypi.org/project/mm-poe/, enabling researchers and practitioners to easily integrate and further develop these techniques.
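
The two-step scoring described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation and not the API of the mm-poe package: the `score_fn` callable, the function names `eliminate` and `poe_answer`, and the below-mean elimination rule are all hypothetical choices made for this sketch. In a multi-modal setting, `score_fn` would wrap a VLM together with the image; here it is any callable that returns a plausibility score for one option.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer: (question, options, index) -> plausibility of options[index].
# In a real setup this would wrap a VLM call that also sees the image.
ScoreFn = Callable[[str, Sequence[str], int], float]


def eliminate(scores: Sequence[float]) -> List[int]:
    """Step 1: return the indices of options that survive elimination.

    Assumed rule for this sketch: drop options scoring below the mean.
    """
    mean = sum(scores) / len(scores)
    kept = [i for i, s in enumerate(scores) if s >= mean]
    # Floating-point rounding can push the mean above every score,
    # so fall back to the single best option if nothing survives.
    return kept or [max(range(len(scores)), key=lambda i: scores[i])]


def poe_answer(question: str, options: Sequence[str], score_fn: ScoreFn) -> int:
    """Return the index (into `options`) of the chosen answer."""
    first_pass = [score_fn(question, options, i) for i in range(len(options))]
    kept = eliminate(first_pass)
    # Step 2: rescore only the surviving options and pick the best one.
    remaining = [options[i] for i in kept]
    second_pass = [score_fn(question, remaining, j) for j in range(len(remaining))]
    best = max(range(len(remaining)), key=lambda j: second_pass[j])
    return kept[best]
```

The second pass hands the scorer only the surviving options, so a scorer that conditions on the full option list can rank the survivors differently than it did in the first pass. The returned index always refers to the original `options` list.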
