Evaluating Multimodal Representations on Sentence Similarity: vSTS, Visual Semantic Textual Similarity Dataset
This work provides a new benchmark for researchers working on multimodal textual similarity systems, but its contribution is incremental because it focuses on dataset creation.
The authors introduce vSTS, a new dataset for measuring textual similarity of sentences using multimodal information, and claim that it is a valid gold standard for evaluating such systems.
In this paper we introduce vSTS, a new dataset for measuring textual similarity of sentences using multimodal information. The dataset comprises images along with their respective textual captions. We describe the dataset both quantitatively and qualitatively, and claim that it is a valid gold standard for evaluating automatic multimodal textual similarity systems. We also describe initial experiments that combine the multimodal information.