LG CL CV | Feb 22, 2021

Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case

Adam Dahlgren Lindström, Suna Bensch, Johanna Björklund, Frank Drewes

arXiv:2102.11115v1 | 992 citations

Originality: Synthesis-oriented

AI Analysis

This work provides analysis tools for researchers working with multimodal embeddings, but the contribution is incremental: it extends existing probing methods to a new domain.

The paper addresses the poorly understood inner workings of visual-semantic embeddings by generalizing probing tasks to the multimodal case, and reports an accuracy increase of up to 12% over the corresponding unimodal embeddings.

Semantic embeddings have advanced the state of the art for countless natural language processing tasks, and various extensions to multimodal domains, such as visual-semantic embeddings, have been proposed. While the power of visual-semantic embeddings comes from the distillation and enrichment of information through machine learning, their inner workings are poorly understood and there is a shortage of analysis tools. To address this problem, we generalize the notion of probing tasks to the visual-semantic case. To this end, we (i) discuss the formalization of probing tasks for embeddings of image-caption pairs, (ii) define three concrete probing tasks within our general framework, (iii) train classifiers to probe for those properties, and (iv) compare various state-of-the-art embeddings under the lens of the proposed probing tasks. Our experiments reveal an increase in accuracy of up to 12% on visual-semantic embeddings compared to the corresponding unimodal embeddings, which suggests that the text and image dimensions represented in the former do complement each other.
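The idea of probing an embedding with a classifier, as described in the abstract, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the data are synthetic Gaussian vectors rather than real image-caption embeddings, the probed property is invented, and the probe is a plain logistic regression, which may differ from the classifiers and the three tasks used in the paper. The sketch only shows the general pattern of training the same probe on a text-only vector and on a joint text-plus-image vector, then comparing held-out accuracy.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(X, y, steps=300, lr=0.5):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    """Fraction of examples whose predicted class matches the label."""
    hits = sum(
        (sum(wj * xj for wj, xj in zip(w, xi)) + b > 0) == (yi == 1)
        for xi, yi in zip(X, y)
    )
    return hits / len(X)


def make_data(n, rng):
    """Synthetic stand-ins for text and image embeddings (hypothetical).

    The probed property depends on one text dimension AND one image
    dimension, so neither modality alone fully determines the label.
    """
    text, image, labels = [], [], []
    for _ in range(n):
        t = [rng.gauss(0, 1) for _ in range(2)]
        v = [rng.gauss(0, 1) for _ in range(2)]
        text.append(t)
        image.append(v)
        labels.append(1 if t[0] + v[0] > 0 else 0)
    return text, image, labels


rng = random.Random(0)
t_train, v_train, y_train = make_data(400, rng)
t_test, v_test, y_test = make_data(200, rng)

# Joint ("multimodal") vectors are a simple concatenation in this sketch.
joint_train = [t + v for t, v in zip(t_train, v_train)]
joint_test = [t + v for t, v in zip(t_test, v_test)]

acc_text = accuracy(*train_probe(t_train, y_train), t_test, y_test)
acc_joint = accuracy(*train_probe(joint_train, y_train), joint_test, y_test)

print(f"text-only probe accuracy: {acc_text:.3f}")
print(f"joint probe accuracy:     {acc_joint:.3f}")
```

In this toy setup the joint probe scores higher by construction, since the label was defined to depend on both modalities. The gap says nothing about the size of the effect on real embeddings. It only makes concrete what comparing probe accuracy across unimodal and multimodal embeddings means.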
