HC · Mar 15

ViDscribe: Multimodal AI for Customizing Audio Description and Question Answering in Online Videos

Maryam Cheema, Sina Elahimanesh, Pooyan Fazli, Hasti Seifi

arXiv:2603.14662 · 72.4 · h-index: 14

AI Analysis

This work addresses personalized video access for blind and low vision users, offering an incremental improvement over prior AI systems.

The researchers address the lack of adaptive AI-generated audio descriptions for blind and low vision viewers with ViDscribe, a web platform that combines customization with conversational question answering. In a longitudinal study with eight participants, customized descriptions improved effectiveness, enjoyment, and immersion compared to default descriptions.

Advances in multimodal large language models enable automatic video narration and question answering (VQA), offering scalable alternatives to labor-intensive, human-authored audio descriptions (ADs) for blind and low vision (BLV) viewers. However, prior AI-driven AD systems rarely adapt to the diverse needs and preferences of BLV individuals across videos and are typically evaluated in controlled, single-session settings. We present ViDscribe, a web-based platform that integrates AI-generated ADs with six types of user customizations and a conversational VQA interface for YouTube videos. Through a longitudinal, in-the-wild study with eight BLV participants, we examine how users engage with customization and VQA features over time. Our results show sustained engagement with both features and that customized ADs improve effectiveness, enjoyment, and immersion compared to default ADs, highlighting the value of personalized, interactive video access for BLV users.
