HCAI
Feb 13

How Multimodal Large Language Models Support Access to Visual Information: A Diary Study With Blind and Low Vision People

arXiv:2602.13469v2
h-index: 34
Originality: Incremental advance
AI Analysis

The paper addresses the problem of unreliable visual assistance for BLV people in daily life, but the advance is incremental: it builds on existing MLLM applications and proposes guidelines for improving them.

Through a two-week diary study, the authors investigated how multimodal large language models (MLLMs) help Blind and Low Vision (BLV) people access visual information. Participants rated the AI's interpretations as trustworthy and somewhat satisfying, yet the AI often produced incorrect answers (22.2%) or abstained from responding (10.8%).

Multimodal large language models (MLLMs) are changing how Blind and Low Vision (BLV) people access visual information. Unlike traditional visual interpretation tools that only provide descriptions, MLLM-enabled applications offer conversational assistance, where users can ask questions to obtain goal-relevant details. However, evidence about their performance in the real world and their implications for BLV people's daily lives remains limited. To address this, we conducted a two-week diary study, where we captured 20 BLV participants' use of an MLLM-enabled visual interpretation application. Although participants rated the visual interpretations of the application as "trustworthy" (mean=3.76 out of 5, max=extremely trustworthy) and "somewhat satisfying" (mean=4.13 out of 5, max=very satisfying), the AI often produced incorrect answers (22.2%) or abstained (10.8%) from responding to users' requests. Our findings show that while MLLMs can improve visual interpretations' descriptive accuracy, supporting everyday use also depends on the "visual assistant" skill: behaviors for providing goal-directed, reliable assistance. We conclude by proposing the "visual assistant" skill and guidelines to help MLLM-enabled visual interpretation applications better support BLV people's access to visual information.

