HCAI · Feb 1

How well can VLMs rate audio descriptions: A multi-dimensional quantitative assessment framework

arXiv:2602.01390v1
Originality: Incremental advance
AI Analysis

This work addresses scalable quality control for audio descriptions, which benefits blind and low-vision audiences. It is an incremental advance: the new framework builds on existing evaluation methods.

The paper tackles the problem of systematically evaluating audio description quality for full-length videos. It develops a multi-dimensional assessment framework and finds that vision-language models can approximate expert ratings with high alignment, but that their reasoning is less reliable than that of human raters.

Digital video is central to communication, education, and entertainment, but without audio description (AD), blind and low-vision audiences are excluded. While crowdsourced platforms and vision-language models (VLMs) expand AD production, quality is rarely checked systematically. Existing evaluations rely on NLP metrics and short-clip guidelines, leaving open questions about what constitutes quality for full-length content and how to assess it at scale. To address these questions, we first developed a multi-dimensional assessment framework for uninterrupted, full-length video, grounded in professional guidelines and refined by accessibility specialists. Second, we integrated this framework into a comprehensive methodological workflow that uses Item Response Theory to assess the proficiency of VLM and human raters against expert-established ground truth. Findings suggest that while VLMs can approximate ground-truth ratings with high alignment, their reasoning is less reliable and actionable than that of human respondents. These insights show the potential of hybrid evaluation systems that leverage VLMs alongside human oversight, offering a path towards scalable AD quality control.
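The abstract does not say which Item Response Theory model the workflow uses. As a minimal sketch of the general idea only, the Python below assumes the simplest one, the Rasch (1PL) model. It also assumes each rating is reduced to a binary outcome (1 if the rater matched the expert ground truth on an item, 0 otherwise) and that item difficulties are already known. The function names `p_correct` and `estimate_ability` and all numbers are hypothetical illustrations, not the paper's method or data.

```python
import math

def p_correct(theta, b):
    """Rasch (1PL) probability that a rater of ability theta
    matches ground truth on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_ability(responses, difficulties, iters=50):
    """Maximum-likelihood ability estimate via Newton-Raphson,
    holding item difficulties fixed (an assumption of this sketch).
    Requires a mix of 0s and 1s; all-0 or all-1 has no finite MLE."""
    theta = 0.0
    for _ in range(iters):
        probs = [p_correct(theta, b) for b in difficulties]
        # First and second derivatives of the log-likelihood in theta.
        grad = sum(x - p for x, p in zip(responses, probs))
        hess = -sum(p * (1.0 - p) for p in probs)
        if abs(hess) < 1e-12:
            break
        step = grad / hess
        theta -= step
        if abs(step) < 1e-8:
            break
    return theta

# Hypothetical item difficulties and two hypothetical raters.
difficulties = [-1.0, -0.5, 0.5, 1.0]
rater_a = estimate_ability([1, 1, 1, 0], difficulties)  # matched 3 of 4
rater_b = estimate_ability([1, 0, 0, 0], difficulties)  # matched 1 of 4
print(rater_a > rater_b)
```

Under these assumptions, a rater who matches the ground truth on harder items receives a higher ability estimate, which puts VLM and human raters on a common proficiency scale.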
