CL · AI · CV · May 8, 2025

Image-Text Relation Prediction for Multilingual Tweets

arXiv:2505.05040v111 citations · h-index: 18
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of understanding media-text relations in social networks, particularly for low-resource languages such as Latvian. It is incremental, however, because it builds on existing vision-language models and benchmarks.

The paper tackles the problem of predicting image-text relations in multilingual tweets by constructing a balanced benchmark dataset in Latvian and English. It finds that recent vision-language models are more capable at this task but still leave significant room for improvement.

Various social networks have allowed media uploads for over a decade. Still, it has not always been clear how the uploaded media relate to the posted text, or whether they relate to it at all. In this work, we explore how multilingual vision-language models handle the task of image-text relation prediction in different languages, and we construct a dedicated, balanced benchmark data set from Twitter posts in Latvian along with their manual translations into English. We compare our results to previous work and show that more recently released vision-language model checkpoints are becoming increasingly capable at this task, although there is still much room for improvement.

