RO · Apr 8

Spatio-Temporal Grounding of Large Language Models from Perception Streams

arXiv:2604.07592 · 83.4 · h-index: 41
AI Analysis

This work addresses fine-grained spatio-temporal reasoning for embodied AI agents, proposing a new method for a known bottleneck in the field.

The paper aims to enable embodied AI agents to reason about object movements and interactions in 3D space over time. It introduces the FESTS framework, which injects verifiable spatio-temporal supervision into LLMs using Spatial Regular Expressions (SpRE). Training a 3-billion-parameter model on 27k tuples boosts frame-level F1 from 48.5% to 87.5%, matching GPT-4.1 on complex spatio-temporal reasoning while remaining two orders of magnitude smaller.
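The frame-level F1 figure can be made concrete with a small sketch. It assumes the metric is computed over sets of frame indices (predicted matching frames versus reference matching frames); the summary does not spell out the exact definition, so the function name `frame_level_f1` and the handling of empty sets are illustrative assumptions, not the paper's evaluation code.

```python
def frame_level_f1(predicted, reference):
    """F1 over frame indices (assumed definition, not taken from the paper).

    Both arguments are iterables of integer frame indices.
    """
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # assumption: nothing to find and nothing reported counts as perfect
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Three of four predicted frames overlap the four reference frames.
score = frame_level_f1([3, 4, 5, 6], [4, 5, 6, 7])
```

Here precision and recall are both 3/4, so `score` is 0.75.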

Embodied-AI agents must reason about how objects move and interact in 3-D space over time, yet existing smaller frontier Large Language Models (LLMs) still mishandle fine-grained spatial relations, metric distances, and temporal orderings. We introduce Formally Explainable Spatio-Temporal Scenes (FESTS), a general framework that injects verifiable spatio-temporal supervision into an LLM by compiling natural-language queries into Spatial Regular Expressions (SpRE), a language combining regular expression syntax with S4u spatial logic, extended here with universal and existential quantification. The pipeline matches each SpRE against any structured video log and exports aligned (query, frames, match, explanation) tuples, enabling unlimited training data without manual labels. Training a 3-billion-parameter model on 27k such tuples boosts frame-level F1 from 48.5% to 87.5%, matching GPT-4.1 on complex spatio-temporal reasoning while remaining two orders of magnitude smaller, and hence enabling spatio-temporal intelligence for Video LLMs.
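The match-and-export step can be illustrated with a toy stand-in. The sketch below is not SpRE and does not implement S4u logic or quantifiers: it assumes a hypothetical log format (one dict of object positions per frame), reduces each frame to a single symbol using one metric-distance predicate, and uses Python's ordinary `re` module in place of a spatial pattern matcher. The natural-language query is paired with its pattern by hand, whereas the paper compiles it. All names (`LOG`, `symbolize`, `make_tuple`) are made up for illustration.

```python
import math
import re

# Hypothetical structured log: one dict per frame, object name -> (x, y, z) in metres.
LOG = [
    {"car": (0.0, 0.0, 0.0), "person": (10.0, 0.0, 0.0)},
    {"car": (3.0, 0.0, 0.0), "person": (10.0, 0.0, 0.0)},
    {"car": (7.0, 0.0, 0.0), "person": (10.0, 0.0, 0.0)},
    {"car": (9.0, 0.0, 0.0), "person": (10.0, 0.0, 0.0)},
    {"car": (2.0, 0.0, 0.0), "person": (10.0, 0.0, 0.0)},
]


def symbolize(log, a, b, threshold):
    """Map each frame to 'N' (a within threshold of b) or 'F' (farther away)."""
    return "".join(
        "N" if math.dist(frame[a], frame[b]) <= threshold else "F"
        for frame in log
    )


def make_tuple(query, pattern, log, a, b, threshold):
    """Return a (query, frames, match, explanation) tuple for one query."""
    symbols = symbolize(log, a, b, threshold)
    hit = re.search(pattern, symbols)
    if hit is None:
        return (query, [], False, f"no span of '{symbols}' matches {pattern}")
    frames = list(range(hit.start(), hit.end()))
    explanation = f"frames {frames[0]}-{frames[-1]} of '{symbols}' match {pattern}"
    return (query, frames, True, explanation)


# "Far for one or more frames, then near for one or more frames."
example = make_tuple(
    "the car approaches and gets within 5 m of the person",
    r"F+N+", LOG, "car", "person", 5.0,
)
```

With this log the per-frame symbols are `FFNNF`, so the pattern matches frames 0 through 3 and the tuple records both the matched frames and a textual explanation. Because every tuple is derived mechanically from the log, labels of this kind need no manual annotation, which is the property the abstract relies on.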

