CVSep 6, 2025

OOTSM: A Decoupled Linguistic Framework for Effective Scene Graph Anticipation

arXiv:2509.05661v11 citationsh-index: 6Has Code
Originality Incremental advance
AI Analysis

This work addresses the challenge of integrating commonsense knowledge into SGA for applications like intelligent surveillance, though it is incremental as it builds on existing visual approaches by adding a linguistic component.

The paper tackles the problem of Scene Graph Anticipation (SGA) by proposing a decoupled linguistic framework that uses a text-based model to predict future scene graphs, achieving a 3.4% increase in short-term mean-Recall (@10) and a 21.9% improvement in long-term mean-Recall (@50) compared to state-of-the-art methods.

A scene graph is a structured represention of objects and their relationships in a scene. Scene Graph Anticipation (SGA) involves predicting future scene graphs from video clips, enabling applications as intelligent surveillance and human-machine collaboration. Existing SGA approaches primarily leverage visual cues, often struggling to integrate valuable commonsense knowledge, thereby limiting long-term prediction robustness. To explicitly leverage such commonsense knowledge, we propose a new approach to better understand the objects, concepts, and relationships in a scene graph. Our approach decouples the SGA task in two steps: first a scene graph capturing model is used to convert a video clip into a sequence of scene graphs, then a pure text-based model is used to predict scene graphs in future frames. Our focus in this work is on the second step, and we call it Linguistic Scene Graph Anticipation (LSGA) and believes it should have independent interest beyond the use in SGA discussed here. For LSGA, we introduce an Object-Oriented Two-Staged Method (OOTSM) where an Large Language Model (LLM) first forecasts object appearances and disappearances before generating detailed human-object relations. We conduct extensive experiments to evaluate OOTSM in two settings. For LSGA, we evaluate our fine-tuned open-sourced LLMs against zero-shot APIs (i.e., GPT-4o, GPT-4o-mini, and DeepSeek-V3) on a benchmark constructed from Action Genome annotations. For SGA, we combine our OOTSM with STTran++ from, and our experiments demonstrate effective state-of-the-art performance: short-term mean-Recall (@10) increases by 3.4% while long-term mean-Recall (@50) improves dramatically by 21.9%. Code is available at https://github.com/ZhuXMMM/OOTSM.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes