CV | Jul 22, 2024

Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning

arXiv:2407.15613v2 | 5 citations | h-index: 29
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in document-based zero-shot learning for computer vision applications, offering an incremental improvement over existing methods.

The paper tackles suboptimal alignment between documents and images in zero-shot learning. It proposes a network that extracts multi-view semantic concepts from both sides and aligns only the matching parts, and it reports state-of-the-art performance with consistent improvements on three standard benchmarks.

Recent work shows that documents from encyclopedias serve as helpful auxiliary information for zero-shot learning. Existing methods align the entire semantics of a document with the corresponding images to transfer knowledge. However, they disregard that the semantic information in a document and in an image is not equivalent, resulting in a suboptimal alignment. In this work, we propose a novel network that extracts multi-view semantic concepts from documents and images and aligns only the matching concepts rather than all of them. Specifically, we propose a semantic decomposition module to generate multi-view semantic embeddings from the visual and textual sides, providing the basic concepts for partial alignment. To alleviate information redundancy among embeddings, we propose a local-to-semantic variance loss to capture distinct local details and a multiple semantic diversity loss to enforce orthogonality among embeddings. Subsequently, two losses are introduced to partially align visual-semantic embedding pairs according to their semantic relevance at the view and word-to-patch levels. Consequently, we consistently outperform state-of-the-art methods under two document sources on three standard benchmarks for document-based zero-shot learning. Qualitatively, we show that our model learns interpretable partial associations.
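To make the idea of enforcing orthogonality among multi-view embeddings concrete, the following is a minimal sketch of one common way to penalize redundancy: the mean squared cosine similarity over all distinct pairs of embeddings. It is an illustration under stated assumptions, not the paper's actual multiple semantic diversity loss, whose exact form the abstract does not give. The names `l2_normalize` and `diversity_penalty` and the toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def diversity_penalty(embeddings):
    """Mean squared cosine similarity over all distinct pairs of embeddings.

    Hypothetical stand-in for an orthogonality-style loss: the value is 0.0
    when every pair of embeddings is orthogonal and grows toward 1.0 as the
    embeddings become collinear (redundant).
    """
    unit = [l2_normalize(e) for e in embeddings]
    k = len(unit)
    if k < 2:
        return 0.0  # a single view has nothing to be redundant with
    total = 0.0
    pairs = 0
    for i in range(k):
        for j in range(i + 1, k):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += cos * cos
            pairs += 1
    return total / pairs


# Toy multi-view embeddings (made-up values for illustration only).
orthogonal_views = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
redundant_views = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]

print(diversity_penalty(orthogonal_views))
print(diversity_penalty(redundant_views))
```

In this sketch the penalty is zero for the mutually orthogonal views and positive for the set in which two views point in the same direction, so minimizing it pushes each view toward a distinct semantic concept. A training setup would typically compute the same quantity on batched tensors in a deep learning framework rather than on Python lists.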

Code Implementations: 1 repo
