CL AI CV · Mar 14, 2025

RONA: Pragmatically Diverse Image Captioning with Coherence Relations

Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee

arXiv:2503.10997v218.813 citations · h-index: 3 · Has Code · Proceedings of the Fourth Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2025)

Originality: Incremental advance

AI Analysis

This work addresses the need for more human-like and pragmatically diverse image captions in writing assistants. The contribution appears incremental: it builds on existing MLLM methods and adds a new prompting approach.

The paper tackles the problem of generating diverse image captions by introducing RONA, a prompting strategy for Multi-modal Large Language Models that uses Coherence Relations to create pragmatic variations. The authors report better overall diversity and ground-truth alignment than MLLM baselines across multiple domains.

Writing Assistants (e.g., Grammarly, Microsoft Copilot) traditionally generate diverse image captions by employing syntactic and semantic variations to describe image components. However, human-written captions prioritize conveying a central message alongside visual descriptions using pragmatic cues. To enhance caption diversity, it is essential to explore alternative ways of communicating these messages in conjunction with visual content. We propose RONA, a novel prompting strategy for Multi-modal Large Language Models (MLLMs) that leverages Coherence Relations as a controllable axis for pragmatic variations. We demonstrate that RONA generates captions with better overall diversity and ground-truth alignment than MLLM baselines across multiple domains. Our code is available at: https://github.com/aashish2000/RONA
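To make the idea of a "controllable axis" concrete, the following is a minimal sketch of building one caption prompt per coherence relation, so that each prompt asks for a different pragmatic framing of the same image. It is not the authors' implementation: the relation labels, their glosses, the prompt wording, and the function name `build_relation_prompts` are all assumptions made for illustration, and the call to an actual MLLM (which would also receive the image) is deliberately left out.

```python
# Illustrative sketch only; not taken from the RONA codebase.
# The relation labels and glosses below are hypothetical placeholders,
# not the taxonomy used in the paper.
RELATIONS = {
    "restatement": "say what is directly visible in the image",
    "extension": "add background that goes beyond what is visible",
    "interpretation": "convey the message or mood the image suggests",
}


def build_relation_prompts(relations: dict, context: str = "") -> dict:
    """Return one prompt per relation, keyed by relation name.

    `context` is an optional piece of text (for example an existing
    caption) to include in every prompt; passing it is an assumption
    of this sketch, not something the source specifies.
    """
    prompts = {}
    for name, gloss in relations.items():
        parts = [
            f"Write a caption for the image using the '{name}' relation:",
            f"the caption should {gloss}.",
        ]
        if context:
            parts.append(f"Context: {context}")
        prompts[name] = " ".join(parts)
    return prompts


if __name__ == "__main__":
    # Each prompt would be sent, together with the image, to an MLLM
    # of the reader's choice; that step is outside this sketch.
    for name, prompt in build_relation_prompts(RELATIONS).items():
        print(f"[{name}] {prompt}")
```

Varying only the relation while holding the image fixed is what would make the resulting captions differ pragmatically rather than just lexically; how the relations are defined and how outputs are scored is described in the paper and its repository.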
