CV | AI | Oct 7, 2025

LogSTOP: Temporal Scores over Prediction Sequences for Matching and Retrieval

arXiv:2510.06512v1 | h-index: 86
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient temporal scoring in video and audio analysis, enabling better query matching and retrieval for applications like surveillance or media indexing, though it is incremental as it builds on existing temporal logic methods.

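The paper tackles the problem of assigning scores to temporal properties over sequences, given potentially noisy local detectors, and proposes LogSTOP, a scoring function that efficiently computes these scores for properties expressed in Linear Temporal Logic. With YOLO and HuBERT as local detectors, LogSTOP outperforms baselines by at least 16% on query matching. On ranked retrieval, with Grounding DINO and SlowR50, it improves mean average precision by at least 19% and recall by at least 16% over zero-shot text-to-video retrieval baselines.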

Neural models such as YOLO and HuBERT can be used to detect local properties such as objects ("car") and emotions ("angry") in individual frames of videos and audio clips respectively. The likelihood of these detections is indicated by scores in [0, 1]. Lifting these scores to temporal properties over sequences can be useful for several downstream applications such as query matching (e.g., "does the speaker eventually sound happy in this audio clip?"), and ranked retrieval (e.g., "retrieve top 5 videos with a 10 second scene where a car is detected until a pedestrian is detected"). In this work, we formalize this problem of assigning Scores for TempOral Properties (STOPs) over sequences, given potentially noisy score predictors for local properties. We then propose a scoring function called LogSTOP that can efficiently compute these scores for temporal properties represented in Linear Temporal Logic. Empirically, LogSTOP, with YOLO and HuBERT, outperforms Large Vision / Audio Language Models and other Temporal Logic-based baselines by at least 16% on query matching with temporal properties over objects-in-videos and emotions-in-speech respectively. Similarly, on ranked retrieval with temporal properties over objects and actions in videos, LogSTOP with Grounding DINO and SlowR50 reports at least a 19% and 16% increase in mean average precision and recall over zero-shot text-to-video retrieval baselines respectively.
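
To make the idea of lifting per-frame scores to temporal properties concrete, here is a minimal sketch in Python. It is not the paper's LogSTOP definition, which the abstract does not spell out. It assumes a simple probabilistic reading in which per-frame scores are treated as independent probabilities and combined in log space for numerical stability. All function names (`to_log`, `log_not`, `always`, `eventually`, `until`) are hypothetical and chosen for illustration only.

```python
import math

NEG_INF = float("-inf")


def to_log(scores):
    """Convert per-frame scores in [0, 1] to log-scores."""
    return [math.log(p) if p > 0.0 else NEG_INF for p in scores]


def log_not(lp):
    """Return log(1 - exp(lp)) for a log-score lp <= 0."""
    if lp == NEG_INF:
        return 0.0
    x = math.exp(lp)
    if x >= 1.0:  # covers lp == 0 and values that round to 1.0
        return NEG_INF
    return math.log1p(-x)


def log_and(la, lb):
    """Conjunction under the (assumed) independence of the two events."""
    return la + lb


def log_or(la, lb):
    """Disjunction via De Morgan: a or b == not (not a and not b)."""
    return log_not(log_and(log_not(la), log_not(lb)))


def always(log_scores):
    """Log-score that the property holds at every step."""
    return sum(log_scores)


def eventually(log_scores):
    """Log-score that the property holds at some step."""
    return log_not(sum(log_not(lp) for lp in log_scores))


def until(log_a, log_b):
    """Log-score of 'a until b' at the first step.

    Uses the recurrence U[t] = b[t] or (a[t] and U[t+1]), evaluated
    backwards, with the property failing beyond the end of the sequence.
    """
    result = NEG_INF
    for la, lb in zip(reversed(log_a), reversed(log_b)):
        result = log_or(lb, log_and(la, result))
    return result


# Hypothetical per-frame detector scores for "car" and "pedestrian".
car = to_log([0.9, 0.9])
pedestrian = to_log([0.1, 0.8])
score = math.exp(until(car, pedestrian))
print(f"car until pedestrian: {score:.3f}")
```

Under these assumptions, query matching amounts to thresholding such a score, and ranked retrieval to sorting sequences by it. The independence assumption is a simplification made for this sketch; noisy real detectors produce correlated frame scores, which is part of what makes the scoring problem non-trivial.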

