Toward accessible comics for blind and low vision readers
This work addresses accessibility for blind and low vision readers by converting visual comic content into audio formats. The contribution is incremental, since it builds on existing methods.
This work tackles the problem of making comics accessible to blind and low vision readers. It generates accurate text descriptions of comic strips using fine-tuned large language models supplied with contextual information from computer vision and OCR. The descriptions can then be passed to speech synthesis tools to produce audiobooks and eBooks.
This work explores how to fine-tune large language models using prompt engineering techniques with contextual information to generate an accurate text description of the full story, ready to be forwarded to off-the-shelf speech synthesis tools. We propose to use existing computer vision and optical character recognition techniques to build a grounded context from the comic strip image content, such as panels, characters, text, reading order, and the association of bubbles and characters. We then infer character identification and generate a comic book script with context-aware panel descriptions, including each character's appearance, posture, mood, dialogue, etc. We believe that such enriched content descriptions can easily be used to produce audiobooks and eBooks with distinct voices for characters and captions, and with sound effects played back.
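As an illustration of the grounded-context idea, the following minimal Python sketch shows one way detected panels, characters, and bubble text could be serialized in reading order into a prompt for a language model. It is not the authors' implementation: the `Panel` and `Bubble` classes, the `build_context` and `build_prompt` functions, the placeholder labels such as `char_1`, and the instruction wording are all hypothetical, and the detection and OCR steps are assumed to have already run.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Bubble:
    text: str                      # OCR output for the bubble (assumed already extracted)
    speaker: Optional[str] = None  # label of the character linked to the bubble, if any


@dataclass
class Panel:
    order: int                                            # position in the reading order
    characters: List[str] = field(default_factory=list)   # hypothetical detector labels
    bubbles: List[Bubble] = field(default_factory=list)


def build_context(panels: List[Panel]) -> str:
    """Serialize the detected elements, panel by panel, in reading order."""
    lines = []
    for panel in sorted(panels, key=lambda p: p.order):
        lines.append(f"Panel {panel.order}:")
        cast = ", ".join(panel.characters) if panel.characters else "none detected"
        lines.append(f"  Characters: {cast}")
        for bubble in panel.bubbles:
            speaker = bubble.speaker or "unknown speaker"
            lines.append(f'  {speaker} says: "{bubble.text}"')
    return "\n".join(lines)


def build_prompt(panels: List[Panel]) -> str:
    """Combine an instruction (hypothetical wording) with the grounded context."""
    instruction = (
        "Write a comic book script for the strip below. For each panel, "
        "describe the characters' appearance, posture and mood, then give "
        "the dialogue. Use only the elements listed in the context."
    )
    return f"{instruction}\n\nContext:\n{build_context(panels)}"


if __name__ == "__main__":
    strip = [
        Panel(order=2, characters=["char_1"], bubbles=[Bubble("Bye!", "char_1")]),
        Panel(order=1, characters=["char_1", "char_2"],
              bubbles=[Bubble("Hello!", "char_2")]),
    ]
    print(build_prompt(strip))
```

Keeping the context as plain, ordered text makes it easy to inspect what the model is given, and the per-bubble speaker labels are the kind of information a downstream speech synthesis step could map to separate voices.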