Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability
This work addresses a specific challenge in news media that affects readers and platforms, but its contribution is incremental: it builds on existing vision-language models and adds a novel training approach.
This paper tackles the problem of assessing whether news thumbnail images represent the actors in the news text, finding that pretrained models like BLIP-2 struggle with this task. The authors propose CFT-CLIP, a contrastive learning framework that uses counterfactual text to enhance cross-modal matching, and show it boosts performance for this assessment.
This paper addresses the critical challenge of assessing the representativeness of news thumbnail images, which often serve as the first visual engagement for readers when an article is disseminated on social media. We focus on whether a news image represents the actors discussed in the news text. To address this challenge, we introduce NewsTT, a manually annotated dataset of 1,000 news thumbnail image and text pairs. We found that pretrained vision and language models, such as BLIP-2, struggle with this task. Since news subjects frequently involve named entities or proper nouns, pretrained models may have limited capability to match the visual and textual appearances of news actors. We hypothesize that learning to contrast news text with its counterfactual, in which named entities are replaced, can enhance the cross-modal matching ability of vision and language models. We propose CFT-CLIP, a contrastive learning framework that updates vision and language bi-encoders according to this hypothesis. We found that our simple method can boost performance in assessing news thumbnail representativeness, supporting our hypothesis. Code and data can be accessed at https://github.com/ssu-humane/news-images-acl24.
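The counterfactual contrast described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how entities are replaced or the exact loss, so the placeholder replacement (`"someone"`), the two-way softmax loss, the temperature value, and all function names below are assumptions made for illustration. Embeddings are plain lists of floats standing in for the outputs of the image and text encoders.

```python
import math


def make_counterfactual(text, entities, replacement="someone"):
    """Return `text` with each listed named entity replaced.

    Assumption: entities are given as exact substrings and are swapped
    for a fixed placeholder; the paper may use a different strategy.
    """
    for entity in entities:
        text = text.replace(entity, replacement)
    return text


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def counterfactual_contrastive_loss(image_emb, text_emb, cf_text_emb,
                                    temperature=0.05):
    """Cross-entropy over two candidates for one image.

    The original text is the positive and its counterfactual is the
    negative, so the loss is low when the image is closer to the
    original text than to the counterfactual.
    """
    pos = cosine(image_emb, text_emb) / temperature
    neg = cosine(image_emb, cf_text_emb) / temperature
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(pos, neg)
    log_denominator = m + math.log(math.exp(pos - m) + math.exp(neg - m))
    return log_denominator - pos


# Hypothetical example: build a counterfactual headline.
headline = "Joe Biden meets Emmanuel Macron in Paris"
counterfactual = make_counterfactual(headline, ["Joe Biden", "Emmanuel Macron"])

# Toy embeddings: the image is aligned with the original text only.
image_emb = [1.0, 0.0]
text_emb = [1.0, 0.0]
cf_text_emb = [0.0, 1.0]
loss = counterfactual_contrastive_loss(image_emb, text_emb, cf_text_emb)
```

In a real training loop the embeddings would come from the bi-encoders being updated, and the counterfactual would typically be combined with other in-batch negatives; this sketch isolates only the original-versus-counterfactual contrast.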