The Case for Perspective in Multimodal Datasets
This paper addresses the problem of biased or incomplete annotation in multimodal datasets, a concern for researchers in NLP and computer vision. The contribution appears incremental, since it builds on existing FrameNet methods.
The paper argues for perspective-aware annotation in multimodal datasets, showing through experiments on Multi30k and Flickr 30k Entities that frame semantic similarity varies based on whether captions are translations and whether images are annotated with or without captions.
This paper argues for annotation practices for multimodal datasets that recognize and represent the inherently perspectivized nature of multimodal communication. To support this claim, we present a set of annotation experiments in which FrameNet annotation is applied to the Multi30k and the Flickr 30k Entities datasets. We assess the cosine similarity between the semantic representations derived from annotating both pictures and captions for frames. Our findings indicate that: (i) frame semantic similarity between captions of the same picture produced in different languages is sensitive to whether the caption is a translation of another caption or not, and (ii) picture annotation for semantic frames is sensitive to whether the image is annotated in the presence of a caption or not.
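The similarity measure mentioned above can be illustrated with a minimal sketch. It assumes each annotated item (a caption or a picture) is represented as a bag of evoked frames, i.e. a vector of frame counts. The abstract does not specify the representation actually used, so this is one plausible reading, not the authors' implementation. The function name `frame_cosine` and the frame lists are hypothetical and are not drawn from the annotated data.

```python
from collections import Counter
from math import sqrt


def frame_cosine(frames_a, frames_b):
    """Cosine similarity between two bag-of-frames count vectors.

    Assumption: each item is represented only by how often each frame
    label occurs in its annotation; frame elements are ignored.
    """
    counts_a, counts_b = Counter(frames_a), Counter(frames_b)
    # Only frames present in both annotations contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[f] * counts_b[f] for f in shared)
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An item with no annotated frames is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical annotations for one picture (illustrative labels only).
caption_en = ["People", "Self_motion", "Roadways"]
caption_de = ["People", "Self_motion", "Buildings"]
picture = ["People", "Self_motion", "Roadways", "Clothing"]

print(frame_cosine(caption_en, caption_de))  # caption vs. caption
print(frame_cosine(caption_en, picture))     # caption vs. picture
```

Under this reading, a translated caption would be expected to share more frames with its source caption than an independently written one, which is the kind of contrast finding (i) reports.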