CV · CL · Jul 26, 2025

The Devil is in the EOS: Sequence Training for Detailed Image Captioning

arXiv:2507.20077v2 · h-index: 5
Originality: Incremental advance
AI Analysis

This paper addresses the lack of detail in captions produced by vision-language models. The advance is incremental, as it builds on existing models.

The paper tackles the problem of image captioning models producing short, generic captions. It identifies a bias towards the end-of-sequence token that is introduced during training and proposes an unsupervised method to reduce it. Across three benchmarks, the method substantially increases caption length and the number of relevant details, with an expected increase in the rate of hallucinations.

Despite significant advances in vision-language models (VLMs), image captioning often suffers from a lack of detail, with base models producing short, generic captions. This limitation persists even though VLMs are equipped with strong vision and language backbones. While supervised data and complex reward functions have been proposed to improve detailed image captioning, we identify a simpler underlying issue: a bias towards the end-of-sequence (EOS) token, which is introduced during cross-entropy training. We propose an unsupervised method to debias the model's tendency to predict the EOS token prematurely. By reducing this bias, we encourage the generation of longer, more detailed captions without the need for intricate reward functions or supervision. Our approach is straightforward, effective, and easily applicable to any pretrained model. We demonstrate its effectiveness through experiments with three VLMs and on three detailed captioning benchmarks. Our results show a substantial increase in caption length and relevant details, albeit with an expected increase in the rate of hallucinations.
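To make the EOS bias concrete, here is a minimal toy sketch of how lowering the EOS logit can lengthen a greedily decoded sequence. It is not the paper's method: the abstract describes an unsupervised training-time debiasing procedure whose details are not given here, so a simple decode-time logit penalty is swapped in purely for illustration. The names `penalize_eos`, `greedy_length`, `EOS_ID`, and `STEP_LOGITS`, as well as the logit values, are all hypothetical. The per-step logits are also fixed in advance, whereas in a real autoregressive model they would depend on the tokens already generated.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def penalize_eos(logits, eos_id, penalty):
    """Return a copy of `logits` with `penalty` subtracted from the EOS logit."""
    adjusted = list(logits)
    adjusted[eos_id] -= penalty
    return adjusted


def greedy_length(step_logits, eos_id, penalty=0.0):
    """Count the tokens greedy decoding emits before it selects EOS."""
    length = 0
    for logits in step_logits:
        probs = softmax(penalize_eos(logits, eos_id, penalty))
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == eos_id:
            break
        length += 1
    return length


EOS_ID = 0
# Hypothetical per-step logits over a 3-token vocabulary: [EOS, tok_a, tok_b].
# At the third step EOS narrowly wins, which ends the caption early.
STEP_LOGITS = [
    [0.0, 2.0, 1.0],
    [1.9, 1.0, 2.0],
    [2.1, 2.0, 1.0],
    [1.0, 1.5, 0.5],
    [3.0, 0.0, 0.0],
]

if __name__ == "__main__":
    print("no penalty:", greedy_length(STEP_LOGITS, EOS_ID))
    print("penalty 0.5:", greedy_length(STEP_LOGITS, EOS_ID, penalty=0.5))
```

In this toy setting, a small penalty is enough to flip the narrow EOS win at the third step, so decoding continues until EOS is strongly preferred. This mirrors the trade-off the abstract notes: a model pushed to keep generating produces more detail, but it also has more opportunity to hallucinate.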

