CV | IR | LG | Apr 9, 2024

A Dataset and Framework for Learning State-invariant Object Representations

arXiv:2404.06470v2 | 1 citation | h-index: 55
Originality: Incremental advance
AI Analysis

This work addresses fine-grained object recognition and retrieval for 3D objects with state changes, which is an incremental advancement in computer vision.

The paper tackles the problem of learning object representations that are invariant to state changes (e.g., folded vs. open umbrella) by introducing a new dataset, ObjectsWithStateChange, and a curriculum learning strategy, resulting in improvements of 7.9% in recognition accuracy and 9.2% in retrieval mAP over state-of-the-art methods.

We add one more invariance, state invariance, to the invariances more commonly used when learning object representations for recognition and retrieval. By state invariance, we mean robustness to changes in the structural form of an object, such as when an umbrella is folded or when an item of clothing is tossed on the floor. In this work, we present a novel dataset, ObjectsWithStateChange, which captures state and pose variations in object images recorded from arbitrary viewpoints. We believe that this dataset will facilitate research in fine-grained object recognition and retrieval of 3D objects that are capable of state changes. The goal of such research would be to train models capable of learning discriminative object embeddings that remain invariant to state changes while also staying invariant to transformations induced by changes in viewpoint, pose, illumination, etc. A major challenge in this regard is that instances of different objects (both within and across different categories) under various state changes may share similar visual characteristics and therefore may be close to one another in the learned embedding space, which would make it more difficult to discriminate between them. To address this, we propose a curriculum learning strategy that progressively selects object pairs with smaller inter-object distances in the learned embedding space during the training phase. This approach gradually samples harder-to-distinguish examples of visually similar objects, both within and across different categories. Our ablation on the role of curriculum learning indicates an improvement in object recognition accuracy of 7.9% and retrieval mAP of 9.2% over the state-of-the-art on our new dataset, as well as on three other challenging multi-view datasets: ModelNet40, ObjectPI, and FG3D.
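The curriculum idea in the abstract, selecting object pairs with progressively smaller inter-object distances, can be pictured with a minimal sketch. This is not the authors' implementation. The function names, the use of per-object mean embeddings, Euclidean distance, the linear schedule, and the `min_frac` parameter are all assumptions made for illustration.

```python
import numpy as np


def object_centroids(embeddings, object_ids):
    """Average the embeddings of each object identity (assumed summary statistic)."""
    ids = np.unique(object_ids)
    centroids = np.stack([embeddings[object_ids == i].mean(axis=0) for i in ids])
    return ids, centroids


def select_curriculum_pairs(embeddings, object_ids, progress, min_frac=0.1):
    """Return (id_a, id_b, distance) for the closest pairs of distinct objects.

    progress is a hypothetical schedule value in [0, 1]: at 0.0 every pair is
    kept, and at 1.0 only the closest `min_frac` fraction of pairs remains,
    so later training stages see harder-to-distinguish object pairs.
    """
    ids, centroids = object_centroids(embeddings, object_ids)

    # Pairwise Euclidean distances between object centroids.
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Each unordered pair of distinct objects appears once (upper triangle).
    row, col = np.triu_indices(len(ids), k=1)
    pair_dist = dist[row, col]

    # Linearly shrink the kept fraction from 1.0 down to min_frac.
    frac = 1.0 - progress * (1.0 - min_frac)
    n_keep = max(1, int(np.ceil(frac * len(pair_dist))))

    # Keep the n_keep pairs with the smallest inter-object distance.
    order = np.argsort(pair_dist, kind="stable")[:n_keep]
    return [(int(ids[row[k]]), int(ids[col[k]]), float(pair_dist[k])) for k in order]
```

In a training loop, `progress` could be set to the current epoch divided by the total number of epochs, and the returned pairs used to build batches for a metric-learning loss. Both choices are again assumptions, not details taken from the paper.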

Code Implementations: 1 repo
