CL | Oct 13, 2025

Template-Based Text-to-Image Alignment for Language Accessibility: A Study on Visualizing Text Simplifications

arXiv:2510.11314v1 | 1 citation | h-index: 1 | Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)
Originality: Incremental advance
AI Analysis

This work addresses visual accessibility for people with intellectual disabilities, though it is an incremental advance: it applies structured prompting to an existing vision-language model.

This paper tackles the problem of generating accessible images from simplified texts for individuals with intellectual disabilities, finding that a Basic Object Focus prompt template achieved the highest semantic alignment and Retro style was rated most accessible by experts.
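Basic Object Focus is one of five structured prompt templates, each pairing a spatial arrangement with shared accessibility constraints (the abstract lists object count limits, spatial separation, and content restrictions). The minimal Python sketch below shows how such a template might be filled for one simplified sentence. The paper's actual prompt wording is not given here, so the `PromptTemplate` class, the `build_prompt` helper, the layout phrasing, `MAX_OBJECTS = 3`, and the banned-content list are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical constraint values: the paper names these constraint types,
# but the numbers and wording here are illustrative assumptions.
MAX_OBJECTS = 3
BANNED_CONTENT = ("written text", "crowded backgrounds", "violence")


@dataclass(frozen=True)
class PromptTemplate:
    name: str    # template name as used in the paper
    layout: str  # hypothetical spatial-arrangement instruction


# Two of the five templates; the layout phrasing is invented for illustration.
TEMPLATES = {
    "basic_object_focus": PromptTemplate(
        name="Basic Object Focus",
        layout="Show only the main object, centred on a plain background.",
    ),
    "grid_layout": PromptTemplate(
        name="Grid Layout",
        layout="Place each object in its own cell of a simple grid.",
    ),
}


def build_prompt(template: PromptTemplate, simplified_text: str, style: str) -> str:
    """Combine a simplified sentence, a layout, a visual style and shared constraints."""
    constraints = [
        f"Use at most {MAX_OBJECTS} objects.",
        "Keep objects clearly separated by empty space.",
        "Do not include: " + ", ".join(BANNED_CONTENT) + ".",
    ]
    parts = [
        f'Illustrate this sentence: "{simplified_text}"',
        template.layout,
        f"Style: {style}.",
        *constraints,
    ]
    return " ".join(parts)


if __name__ == "__main__":
    print(build_prompt(TEMPLATES["basic_object_focus"], "The cat sleeps on the bed.", "Retro"))
```

Keeping the constraints outside the individual templates means every layout inherits the same accessibility rules; the other three templates (Contextual Scene, Educational Layout, Multi-Level Detail) would be added to `TEMPLATES` in the same way.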

Individuals with intellectual disabilities often have difficulty comprehending complex texts. While many text-to-image models prioritize aesthetics over accessibility, it is unclear how well visual illustrations generated from text simplifications (TS) align with those simplifications. This paper presents a structured vision-language model (VLM) prompting framework for generating accessible images from simplified texts. We designed five prompt templates (Basic Object Focus, Contextual Scene, Educational Layout, Multi-Level Detail, and Grid Layout), each following a distinct spatial arrangement while adhering to accessibility constraints such as object count limits, spatial separation, and content restrictions. Using 400 sentence-level simplifications from four established TS datasets (OneStopEnglish, SimPA, Wikipedia, and ASSET), we conducted a two-phase evaluation: Phase 1 assessed prompt template effectiveness with CLIPScore, and in Phase 2 four accessibility experts annotated images generated in ten visual styles. Results show that the Basic Object Focus template achieved the highest semantic alignment, indicating that visual minimalism enhances language accessibility. Expert evaluation further identified Retro as the most accessible style and Wikipedia as the most effective data source. Inter-annotator agreement varied across dimensions: Text Simplicity showed strong reliability, while Image Quality proved more subjective. Overall, our framework offers practical guidelines for accessible content generation and underscores the importance of structured prompting in AI-generated visual accessibility tools.
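Phase 1 scores image-text alignment with CLIPScore. Under its original definition (Hessel et al., 2021), CLIPScore rescales the cosine similarity between CLIP image and text embeddings as w · max(cos(E_I, E_C), 0) with w = 2.5; some libraries use a factor of 100 instead, and the scaling and CLIP model the paper uses are not stated here. The minimal Python sketch below assumes NumPy and embeddings already produced by some CLIP model, computes the score, and ranks templates by their mean score. `clip_score` and `rank_templates` are hypothetical helper names, and the numbers in the demo are toy values, not results from the paper.

```python
import numpy as np

# Rescaling weight from the original CLIPScore definition; pass w=100.0
# to mimic libraries that use that factor instead.
W = 2.5


def clip_score(image_emb: np.ndarray, text_emb: np.ndarray, w: float = W) -> float:
    """CLIPScore = w * max(cos(image_emb, text_emb), 0)."""
    cos = float(
        np.dot(image_emb, text_emb)
        / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb))
    )
    return w * max(cos, 0.0)


def rank_templates(scores: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Return (template, mean CLIPScore) pairs sorted from highest to lowest."""
    means = {name: float(np.mean(vals)) for name, vals in scores.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real CLIP outputs.
    print(clip_score(np.array([1.0, 0.0]), np.array([1.0, 0.0])))
    # Toy per-sentence scores, NOT the paper's results.
    print(rank_templates({"basic_object_focus": [0.8, 0.7], "grid_layout": [0.6, 0.5]}))
```

Clipping at zero means every negatively aligned image-text pair scores 0. Averaging per template over all 400 sentences is one simple way to compare templates, though the paper may aggregate scores differently.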
