CV · LG · May 17, 2022

Disentangling Visual Embeddings for Attributes and Objects

arXiv:2205.08536v1 · 97 citations · h-index: 45 · Has Code
Originality: Highly original
AI Analysis

This paper addresses the challenge of recognizing novel attribute-object compositions in images, which matters for applications such as robotics and image understanding. It appears incremental, however, in that it builds on prior linguistic-based approaches by shifting the focus back to visual features.

The paper tackles compositional zero-shot learning for object-attribute recognition. It proposes a novel architecture that disentangles attribute and object features in the visual space, and it reports outperforming existing work by a significant margin on three datasets: MIT-States, UT-Zappos, and a new benchmark based on VAW.

We study the problem of compositional zero-shot learning for object-attribute recognition. Prior works use visual features extracted with a backbone network pre-trained for object classification, and thus do not capture the subtly distinct features associated with attributes. To overcome this challenge, these studies employ supervision from the linguistic space, and use pre-trained word embeddings to better separate and compose attribute-object pairs for recognition. Analogous to the linguistic embedding space, which already has unique and agnostic embeddings for objects and attributes, we shift the focus back to the visual space and propose a novel architecture that can disentangle attribute and object features in the visual space. We use the decomposed visual features to hallucinate embeddings that are representative of the seen and novel compositions to better regularize the learning of our model. Extensive experiments show that our method outperforms existing work by a significant margin on three datasets: MIT-States, UT-Zappos, and a new benchmark created based on VAW. The code, models, and dataset splits are publicly available at https://github.com/nirat1606/OADis.
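To make the idea of recognizing a novel composition from separate attribute and object features concrete, here is a minimal toy sketch. It is not the paper's architecture: the concatenation-based `compose` function, the two-dimensional prototype vectors, and the names `attr_protos`, `obj_protos`, and `predict_pair` are all illustrative assumptions. It only shows how an unseen pair such as "wet car" can be scored once attribute and object features are available separately.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def compose(attr_vec, obj_vec):
    """Hypothetical composition: concatenate attribute and object vectors.

    Assumption for illustration only; a learned model would use a trained
    composition network instead.
    """
    return list(attr_vec) + list(obj_vec)


def predict_pair(img_attr_feat, img_obj_feat, attr_protos, obj_protos, candidates):
    """Score every candidate (attribute, object) pair against the image features."""
    query = compose(img_attr_feat, img_obj_feat)
    scores = {
        (attr, obj): cosine(query, compose(attr_protos[attr], obj_protos[obj]))
        for attr, obj in candidates
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy prototypes (made-up values, not learned from any dataset).
attr_protos = {"wet": [1.0, 0.0], "dry": [0.0, 1.0]}
obj_protos = {"dog": [1.0, 0.0], "car": [0.0, 1.0]}

# Pairs seen in training; every other combination is a novel composition.
seen_pairs = {("wet", "dog"), ("dry", "car")}
candidates = list(product(attr_protos, obj_protos))

# Disentangled features of a test image that looks "wet" and like a "car".
best, scores = predict_pair([0.9, 0.1], [0.2, 0.8], attr_protos, obj_protos, candidates)
print(best, "novel" if best not in seen_pairs else "seen")
```

Because the attribute and object parts are scored through separate prototypes, the pair ("wet", "car") wins here even though it is absent from `seen_pairs`. In an entangled representation there would be no prototype for that pair to compare against.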

Code Implementations: 1 repo
