CV · Nov 28, 2023

Scene Summarization: Clustering Scene Videos into Spatially Diverse Frames

arXiv:2311.17940v3 · 1 citation · h-index: 12
Originality: Incremental advance
AI Analysis

The work addresses the need for efficient spatial understanding in applications such as real estate browsing and navigation, though the advance is incremental because it builds on existing video summarization techniques.

The paper tackles the problem of summarizing long scene videos into a few spatially diverse keyframes to aid global spatial reasoning, and shows that its method, SceneSum, outperforms existing video summarization baselines on real and simulated indoor datasets.

Humans are remarkably efficient at forming spatial understanding from just a few visual observations. When browsing real estate or navigating unfamiliar spaces, they intuitively select a small set of views that summarize the spatial layout. Inspired by this ability, we introduce scene summarization, the task of condensing long, continuous scene videos into a compact set of spatially diverse keyframes that facilitate global spatial reasoning. Unlike conventional video summarization, which focuses on user-edited, fragmented clips and often ignores spatial continuity, our goal is to mimic how humans abstract spatial layout from sparse views. We propose SceneSum, a two-stage self-supervised pipeline that first clusters video frames using visual place recognition to promote spatial diversity, then selects representative keyframes from each cluster under resource constraints. When camera trajectories are available, a lightweight supervised loss further refines clustering and selection. Experiments on real and simulated indoor datasets show that SceneSum produces more spatially informative summaries and outperforms existing video summarization baselines.
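The two-stage idea in the abstract (cluster frames by place descriptors, then pick one representative per cluster) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: plain k-means stands in for the clustering stage, the frame nearest each centroid stands in for the selection stage, the keyframe budget `k` stands in for the resource constraint, and the 2-D descriptors are synthetic placeholders for real visual place recognition features. The helper names (`kmeans`, `select_keyframes`) are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def assign(descriptors, centroids):
    """Label each descriptor with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(d, centroids[c]))
        for d in descriptors
    ]


def kmeans(descriptors, k, iterations=20, seed=0):
    """Stage 1 (assumed): group frames whose descriptors are similar."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        labels = assign(descriptors, centroids)
        for c in range(k):
            members = [d for d, l in zip(descriptors, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = mean_vector(members)
    return assign(descriptors, centroids), centroids


def select_keyframes(descriptors, k, seed=0):
    """Stage 2 (assumed): keep the frame closest to each cluster centroid."""
    labels, centroids = kmeans(descriptors, k, seed=seed)
    keyframes = []
    for c in range(k):
        members = [i for i, l in enumerate(labels) if l == c]
        if members:
            keyframes.append(
                min(members, key=lambda i: squared_distance(descriptors[i], centroids[c]))
            )
    return sorted(keyframes)


# Synthetic stand-in for place descriptors: 30 frames from each of 3 "rooms".
rng = random.Random(42)
room_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
frames = [
    [cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)]
    for cx, cy in room_centers
    for _ in range(30)
]

keyframes = select_keyframes(frames, k=3)
print("selected frame indices:", keyframes)
```

The returned indices point back into the original frame sequence, so at most `k` frames are kept regardless of video length. A real pipeline would replace the synthetic vectors with learned descriptors and, as the abstract notes, could refine both stages with a supervised loss when camera trajectories are available.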

