A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models
This study addresses the shortage of visual art descriptions for blind and low-vision audiences, particularly in multilingual museum settings where privacy and intellectual-property constraints may favour on-premise deployment, by investigating small on-premise vision-language models. The authors frame the results as pilot-scale, deployment-oriented evidence rather than as general conclusions about multilingual accessibility.
This pilot study explores curator-guided multilingual art description for blind and low-vision (BLV) audiences using a small vision-language model (Qwen2.5-VL-3B-Instruct) in German, Romanian, and Serbian. Under a fixed backbone and training budget, language-specific LoRA adapters showed more stable controllability and visually grounded description quality for Romanian and Serbian, while a single multilingual adapter remained competitive for German.
Blind and low-vision (BLV) audiences remain underserved by visual art descriptions, particularly across languages and in museum settings where privacy and intellectual-property constraints may favour small on-premise vision-language models (VLMs). This pilot study investigates curator-guided multilingual art description with Qwen2.5-VL-3B-Instruct for German, Romanian, and Serbian. We construct a parallel BLV-oriented caption corpus from artwork images and metadata, and compare language-specific LoRA adapters with a single multilingual adapter under a fixed backbone and training budget. Evaluation combines automatic lexical and embedding-based metrics with an LLM-as-Judge protocol calibrated against a small Romanian BLV pilot study. Under our pilot setup, language-specific adapters show more stable controllability and visually grounded description quality for Romanian and Serbian, while multilingual adaptation remains competitive in German. We frame these findings as deployment-oriented evidence for small on-premise VLMs, and highlight the need for larger BLV user studies and broader language coverage before drawing general conclusions about multilingual accessibility.
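To make the comparison between language-specific and multilingual adapters concrete, the following is a minimal sketch of the standard LoRA update, in which a frozen weight matrix W is augmented by a scaled low-rank product, h = Wx + (alpha / r) BAx. It is illustrative only and is not the authors' training code: the names `LoRAAdapter`, `adapted_forward`, and `pick_adapter`, the toy dimensions, and the routing rule are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional

Matrix = list[list[float]]
Vector = list[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


@dataclass
class LoRAAdapter:
    """Hypothetical container for one low-rank adapter."""
    A: Matrix      # shape r x d_in (down-projection)
    B: Matrix      # shape d_out x r (up-projection)
    alpha: float   # LoRA scaling hyperparameter

    def delta(self, x: Vector) -> Vector:
        """Low-rank correction (alpha / r) * B @ (A @ x)."""
        r = len(self.A)
        low = matvec(self.A, x)
        return [self.alpha / r * v for v in matvec(self.B, low)]


def adapted_forward(W: Matrix, x: Vector, adapter: LoRAAdapter) -> Vector:
    """Frozen backbone output plus the adapter's low-rank correction."""
    base = matvec(W, x)
    return [b + d for b, d in zip(base, adapter.delta(x))]


def pick_adapter(
    lang: str,
    per_language: dict[str, LoRAAdapter],
    multilingual: Optional[LoRAAdapter] = None,
) -> LoRAAdapter:
    """Assumed routing rule: use the shared adapter in the multilingual
    condition, otherwise the adapter trained for the requested language."""
    if multilingual is not None:
        return multilingual
    return per_language[lang]


# Toy example: identity backbone weights and a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 2.0]
ro_adapter = LoRAAdapter(A=[[1.0, 0.0]], B=[[0.0], [1.0]], alpha=2.0)
adapters = {"ro": ro_adapter}
y = adapted_forward(W, x, pick_adapter("ro", adapters))
```

In both conditions the backbone weights `W` stay frozen. The conditions differ only in whether each language has its own `A` and `B` matrices or all three languages share one pair.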