AFRICAPTION: Establishing a New Paradigm for Image Captioning in African Languages
This work addresses the limited accessibility of AI for speakers of African languages by creating foundational resources, though it is incremental in that it applies existing methods to new data.
The authors address the lack of multimodal AI resources for African languages with AfriCaption, a framework for image captioning in 20 African languages. It comprises a curated dataset, a dynamic pipeline, and a 0.5B-parameter model, which the authors present as the first scalable resource for these under-represented languages.
Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages. Our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that maintains quality over time through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B-parameter vision-to-text architecture that integrates SigLIP and NLLB200 for caption generation across under-represented languages. This unified framework establishes the first scalable image-captioning resource for under-represented African languages, laying the groundwork for truly inclusive multimodal AI.
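The abstract does not spell out how model ensembling and adaptive substitution work, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that several translation systems each propose a caption, that some quality estimator assigns each proposal a score in [0, 1], and that a stored caption is replaced only when a new proposal beats it by a margin. The names `Candidate`, `select_best`, `substitute_if_better`, `threshold`, and `margin` are hypothetical, and the scores and caption strings are placeholder values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    system: str   # hypothetical name of the translation system that produced it
    text: str     # candidate caption in the target language
    score: float  # assumed quality estimate in [0, 1]; higher is better


def select_best(candidates: List[Candidate],
                threshold: float = 0.5) -> Optional[Candidate]:
    """Ensemble step: keep the top-scoring candidate if it clears the threshold."""
    best = max(candidates, key=lambda c: c.score, default=None)
    if best is None or best.score < threshold:
        return None  # no candidate is good enough; leave the slot empty
    return best


def substitute_if_better(current: Optional[Candidate],
                         candidates: List[Candidate],
                         margin: float = 0.05) -> Optional[Candidate]:
    """Substitution step: replace the stored caption only on a clear improvement."""
    best = max(candidates, key=lambda c: c.score, default=None)
    if best is None:
        return current
    if current is None or best.score > current.score + margin:
        return best
    return current


# Placeholder data: two hypothetical systems propose a caption for one image.
first_pass = [
    Candidate("system_a", "caption from system A", 0.62),
    Candidate("system_b", "caption from system B", 0.48),
]
stored = select_best(first_pass)

# Later, a newer hypothetical system produces a higher-scoring caption.
second_pass = [Candidate("system_c", "caption from system C", 0.81)]
updated = substitute_if_better(stored, second_pass)

print(stored.system, "->", updated.system)
```

The margin is a design choice in this sketch: it keeps the dataset from churning when a new candidate is only marginally better than the stored caption, which is one way a pipeline could stay stable while still improving over time.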