Surprisal reveals diversity gaps in image captioning and different scorers change the story
For researchers and practitioners, this work addresses the need for robust diversity evaluation in image captioning and shows that conclusions can invert depending on the scorer used.
The paper quantifies linguistic diversity in image captioning with a surprisal-based metric: under a caption-trained LM, human captions show roughly twice the surprisal variance of model captions, but the pattern reverses under a general-language model.
We quantify linguistic diversity in image captioning with surprisal variance, the spread of token-level negative log-probabilities within a caption set. On the MSCOCO test set, we compare five state-of-the-art vision-and-language LLMs, decoded with greedy and nucleus sampling, to human captions. Measured with a caption-trained n-gram LM, humans display roughly twice the surprisal variance of models, but rescoring the same captions with a general-language model reverses the pattern. Our analysis introduces a surprisal-based diversity metric for image captioning. We show that relying on a single scorer can completely invert conclusions; robust diversity evaluation must therefore report surprisal under several scorers.
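
As a rough illustration of the quantity described above, the sketch below computes surprisal variance for a caption set under an interchangeable scorer. It is a minimal sketch under stated assumptions, not the paper's implementation: the scorer is a toy add-alpha unigram model rather than the caption-trained n-gram LM or the general-language model, tokens are whitespace-split, surprisals are pooled over all tokens in the set, and population variance is used. The names `train_unigram` and `surprisal_variance` are hypothetical.

```python
import math
from collections import Counter
from statistics import pvariance


def train_unigram(corpus, alpha=1.0):
    """Build a toy add-alpha smoothed unigram scorer (an assumption, not the paper's LM).

    Returns a function mapping a token to its probability.
    """
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot reserved for unseen tokens

    def prob(token):
        return (counts[token] + alpha) / (total + alpha * vocab_size)

    return prob


def surprisal_variance(captions, prob):
    """Population variance of token-level surprisal (-log2 p), pooled over a caption set."""
    surprisals = [
        -math.log2(prob(tok))
        for caption in captions
        for tok in caption.split()
    ]
    return pvariance(surprisals)


# Toy "caption-domain" training text for the scorer.
scorer = train_unigram(["a dog runs", "a cat sits", "a dog sits"])

repetitive = ["a dog sits", "a dog sits"]
varied = ["a zebra gallops", "a dog sits"]

print("repetitive:", surprisal_variance(repetitive, scorer))
print("varied:    ", surprisal_variance(varied, scorer))
```

Because the scorer is passed in as a function, the same caption sets can be rescored by swapping in a model trained on different text, which is the comparison the abstract argues should always be reported.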