CL CV · Jul 3, 2024

VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values

arXiv:2407.03000v2 · 15.4 · 26 citations · h-index: 5

Originality: Synthesis-oriented

AI Analysis

This paper addresses the problem of ensuring that VLMs incorporate human values in real-world decisions and provides a new benchmark for researchers, though the contribution is incremental because it builds on existing multimodal evaluation frameworks.

The paper introduces VIVA, a benchmark for evaluating vision-grounded decision-making based on human values, revealing limitations in large vision language models (VLMs) in this area through experiments on 1,240 annotated images.

Large vision language models (VLMs) have demonstrated significant potential for integration into daily life, making it crucial for them to incorporate human values when making decisions in real-world situations. This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues. While most large VLMs focus on physical-level skills, our work is the first to examine their multimodal capabilities in leveraging human values to make decisions under a vision-depicted situation. VIVA contains 1,240 images depicting diverse real-world situations and the manually annotated decisions grounded in them. Given an image from the benchmark, the model should select the most appropriate action to address the situation and provide the relevant human values and the reasoning underlying the decision. Extensive experiments based on VIVA show the limitations of VLMs in using human values to make multimodal decisions. Further analyses indicate the potential benefits of exploiting action consequences and predicted human values.
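The action-selection part of the task described above can be pictured as a multiple-choice evaluation. The following is a minimal sketch under stated assumptions, not the benchmark's actual data format or scoring code: the `Sample` class, its field names, and the `action_accuracy` function are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical record for one benchmark item (field names are assumptions)."""
    image_path: str               # image depicting a real-world situation
    candidate_actions: list[str]  # possible actions the model chooses from
    gold_action_index: int        # index of the annotated most appropriate action


def action_accuracy(samples: list[Sample], predictions: list[int]) -> float:
    """Fraction of samples where the predicted action index matches the annotation."""
    if len(samples) != len(predictions):
        raise ValueError("samples and predictions must have the same length")
    if not samples:
        return 0.0
    correct = sum(
        1 for sample, pred in zip(samples, predictions)
        if pred == sample.gold_action_index
    )
    return correct / len(samples)
```

This sketch scores only the selected action. Judging the human values and the reasoning a model gives for its choice would need a separate comparison against the annotations, which is not shown here.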
