CV | AI | CL | LG | Feb 27, 2024

OSCaR: Object State Captioning and State Change Representation

arXiv:2402.17128v4 | 35 citations | h-index: 14 | Has Code | NAACL-HLT
Originality: Synthesis-oriented
AI Analysis

This work addresses an important challenge in AI, interpreting object state changes in real-world settings, but its contribution is incremental: it primarily provides a new dataset and benchmark for evaluation.

The paper tackles the problem of understanding object state changes in dynamic visual environments by introducing the OSCaR dataset and benchmark, which includes 14,084 annotated video segments with nearly 1,000 unique objects, and shows that multimodal large language models lack full comprehension of these changes.

The capability of intelligent models to extrapolate and comprehend changes in object states is a crucial yet demanding aspect of AI research, particularly through the lens of human interaction in real-world settings. This task involves describing complex visual environments, identifying active objects, and interpreting their changes as conveyed through language. Traditional methods, which isolate object captioning and state change detection, offer a limited view of dynamic environments. Moreover, relying on a small set of symbolic words to represent changes has restricted the expressiveness of the language. To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark. OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections. It sets a new testbed for evaluating multimodal large language models (MLLMs). Our experiments demonstrate that while MLLMs show some skill, they lack a full understanding of object state changes. The benchmark includes a fine-tuned model that, despite initial capabilities, requires significant improvements in accuracy and generalization ability for effective understanding of these changes. Our code and dataset are available at https://github.com/nguyennm1024/OSCaR.

Code Implementations: 1 repo
