CL CV Dec 8, 2020

Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation

Jeff Da, Maxwell Forbes, Rowan Zellers, Anthony Zheng, Jena D. Hwang, Antoine Bosselut, Yejin Choi

arXiv:2012.04726v2 27.77 18 citations

Originality: Highly original

AI Analysis

This work tackles the problem of identifying the intent behind edited media, which is central to combating multimodal disinformation and understanding its implications for society.

This paper addresses the challenge of distinguishing harmful visual misinformation from harmless media edits by focusing on the intent behind the edit. The authors introduce the Edited Media Understanding (EMU) task and a dataset of 48k question-answer pairs, and evaluate a range of vision-and-language models, including their new PELICAN model, whose answers human raters judge accurate 40.35% of the time.

Multimodal disinformation, from 'deepfakes' to simple edits that deceive, is an important societal problem. Yet at the same time, the vast majority of media edits are harmless -- such as a filtered vacation photo. The difference between this example, and harmful edits that spread disinformation, is one of intent. Recognizing and describing this intent is a major challenge for today's AI systems. We present the task of Edited Media Understanding, requiring models to answer open-ended questions that capture the intent and implications of an image edit. We introduce a dataset for our task, EMU, with 48k question-answer pairs written in rich natural language. We evaluate a wide variety of vision-and-language models for our task, and introduce a new model PELICAN, which builds upon recent progress in pretrained multimodal representations. Our model obtains promising results on our dataset, with humans rating its answers as accurate 40.35% of the time. At the same time, there is still much work to be done -- humans prefer human-annotated captions 93.56% of the time -- and we provide analysis that highlights areas for further progress.
