Multi-Modal Framing Analysis of News
This work addresses a limitation of frame analysis in computational social science, namely its narrow, text-only scope, by enabling multi-modal analysis for researchers who study media bias and political communication. The contribution is incremental in that it builds on existing framing theory and models.
The paper extends automated frame analysis of political communication to cover both the text and the images of news articles. It uses large vision-language models to extract and contrast the latent meanings conveyed by the two modalities, yielding a scalable method for identifying partisan framing and a more complete picture of media bias.
Automated frame analysis of political communication is a popular task in computational social science, used to study how authors select aspects of a topic to frame its reception. So far, such studies have been narrow: they use a fixed set of pre-defined frames and focus only on the text, ignoring the visual contexts in which those texts appear. Especially for framing in the news, this leaves out valuable information about editorial choices, which include not just the written article but also the accompanying photographs. To overcome these limitations, we present a method for conducting multi-modal, multi-label framing analysis at scale using large (vision-) language models. Grounding our work in framing theory, we extract the latent meaning embedded in images used to convey a certain point and contrast it with the text by comparing the respective frames used. We also identify highly partisan framing of topics using the kind of issue-specific frame analysis found in prior qualitative work. We demonstrate a method for scalable, integrative framing analysis of both text and image in news, providing a more complete picture for understanding media bias.
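As a rough illustration of what comparing the frames of an article's text and image could look like once multi-label frame predictions exist, the Python sketch below contrasts two label sets. It is not the method of the paper: the frame labels, the `compare_frames` helper, and the use of Jaccard overlap as an agreement score are all assumptions made for this example, and the step that would obtain the labels from a (vision-) language model is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameComparison:
    """Result of contrasting the frames of one article's text and image."""
    shared: frozenset      # frames present in both modalities
    text_only: frozenset   # frames found only in the text
    image_only: frozenset  # frames found only in the image
    jaccard: float         # |shared| / |union|, an assumed agreement score


def compare_frames(text_frames, image_frames):
    """Contrast two multi-label frame sets (hypothetical helper).

    Both arguments are iterables of frame labels, assumed to have been
    predicted beforehand for the text and the image of the same article.
    """
    text_set = frozenset(text_frames)
    image_set = frozenset(image_frames)
    union = text_set | image_set
    shared = text_set & image_set
    # Two empty label sets are treated as full agreement by convention.
    jaccard = len(shared) / len(union) if union else 1.0
    return FrameComparison(
        shared=shared,
        text_only=text_set - image_set,
        image_only=image_set - text_set,
        jaccard=jaccard,
    )


# Illustrative labels only; they are not taken from the paper's data.
example = compare_frames(
    text_frames=["economic", "policy"],
    image_frames=["policy", "security"],
)
```

Here `example.text_only` and `example.image_only` list the frames on which the two modalities diverge, which is the kind of contrast the abstract describes. A set-based score is only one possible choice and ignores how strongly each frame is expressed.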