CV, CL, LG | Apr 11, 2019

UniVSE: Robust Visual Semantic Embeddings via Structured Semantic Representations

arXiv:1904.05521v2 | 9 citations
Originality: Incremental advance
AI Analysis

This work addresses the robustness of cross-modal retrieval. It appears incremental: it builds on existing visual-semantic embedding methods and adds structured semantic representations.

The authors learn a joint visual-semantic embedding that unifies concepts at multiple levels and is trained contrastively from image-caption pairs alone. They report robustness against text-domain adversarial attacks and improved cross-modal retrieval.

We propose Unified Visual-Semantic Embeddings (UniVSE) for learning a joint space of visual and textual concepts. The space unifies concepts at different levels, including objects, attributes, relations, and full scenes. A contrastive learning approach is proposed for fine-grained alignment from only image-caption pairs. Moreover, we present an effective approach for enforcing the coverage of semantic components that appear in the sentence. We demonstrate the robustness of Unified VSE in defending against text-domain adversarial attacks on cross-modal retrieval tasks. Such robustness also empowers the use of visual cues to resolve word dependencies in novel sentences.
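
To make the idea of contrastive alignment from image-caption pairs concrete, here is a minimal sketch of a generic bidirectional hinge ranking loss of the kind commonly used for visual-semantic embeddings. It is not the authors' implementation: the function names, the margin value of 0.2, the use of cosine similarity, and the choice to sum over all in-batch negatives are assumptions made for illustration, and the embeddings are plain Python lists rather than outputs of real encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(image_embs, caption_embs, margin=0.2):
    """Bidirectional hinge ranking loss over a batch (illustrative only).

    image_embs[i] and caption_embs[i] form a matched pair; every other
    item in the batch acts as a negative. The margin is an assumed value.
    """
    n = len(image_embs)
    sim = [[cosine(image_embs[i], caption_embs[j]) for j in range(n)]
           for i in range(n)]
    loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i should score its own caption above caption j.
            loss += max(0.0, margin + sim[i][j] - pos)
            # Caption i should score its own image above image j.
            loss += max(0.0, margin + sim[j][i] - pos)
    return loss / n


# Toy 2-D embeddings: aligned pairs versus swapped captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_captions = [[1.0, 0.0], [0.0, 1.0]]
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = contrastive_loss(images, aligned_captions)
swapped_loss = contrastive_loss(images, swapped_captions)
print(aligned_loss, swapped_loss)
```

With these toy vectors the aligned batch incurs no penalty, while the swapped batch is penalized because each image is closer to the wrong caption than to its own. A fine-grained variant in the spirit of the abstract would apply the same kind of loss to object, attribute, and relation embeddings as well as to full sentences.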

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
