CV · AI · CL · Mar 12

OSCBench: Benchmarking Object State Change in Text-to-Video Generation

arXiv:2603.11698v137.61 citations · h-index: 6 · Has Code
Predicted impact: top 8% in CV · last 90 days · Originality: Incremental advance
AI Analysis

OSCBench addresses a critical bottleneck in action understanding for text-to-video generation and establishes a diagnostic benchmark for advancing state-aware models.

The paper tackles the problem of object state change (OSC) in text-to-video generation, which is largely unexplored in existing benchmarks, and finds that current models struggle with accurate and temporally consistent OSC, especially in novel and compositional settings.

Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both a human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.
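The split into regular, novel, and compositional scenarios can be illustrated with a small sketch. This is not the authors' construction procedure, which the abstract does not spell out. It assumes, purely for illustration, that "regular" means action-object pairs observed in the source data, "novel" means unobserved pairings of known actions and objects, and "compositional" means two observed actions applied in sequence to the same object. The function name `build_scenarios` and the toy data are hypothetical.

```python
from itertools import permutations, product


def build_scenarios(seen_pairs, actions, objects):
    """Group action-object interactions into three scenario types.

    Assumed definitions (illustrative only, not taken from the paper):
      regular       -> (action, object) pairs observed in the source data
      novel         -> pairs of known actions and objects that were not observed
      compositional -> two distinct observed actions chained on one object
    """
    seen = set(seen_pairs)
    regular = sorted(seen)
    novel = sorted(set(product(actions, objects)) - seen)
    compositional = sorted(
        (first, second, obj)
        for obj in objects
        for first, second in permutations(actions, 2)
        if (first, obj) in seen and (second, obj) in seen
    )
    return {"regular": regular, "novel": novel, "compositional": compositional}


# Toy data standing in for pairs mined from instructional cooking videos.
actions = ["peel", "slice"]
objects = ["potato", "lemon"]
seen_pairs = [("peel", "potato"), ("slice", "potato"), ("slice", "lemon")]

scenarios = build_scenarios(seen_pairs, actions, objects)
for name, items in scenarios.items():
    print(name, items)
```

Under these assumptions, "peel a lemon" lands in the novel set because that pairing never appears in the toy data, while the potato, which has two observed actions, yields the compositional entries.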

