CVApr 10, 2025

SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos

arXiv:2504.07867v12 citationsh-index: 72025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Originality Incremental advance
AI Analysis

This addresses the challenge of zero-shot video scene graph generation for egocentric kitchen environments, offering an incremental improvement over existing methods.

The paper tackled the problem of generating temporally consistent scene graphs in dynamic kitchen videos without training, by combining SAM2's tracking with Gemini's semantic understanding, resulting in an 8.33% improvement in mean recall over Gemini on EPIC-KITCHENS datasets.

Video Scene Graph Generation (VidSGG) is an important topic in understanding dynamic kitchen environments. Current models for VidSGG require extensive training to produce scene graphs. Recently, Vision Language Models (VLM) and Vision Foundation Models (VFM) have demonstrated impressive zero-shot capabilities in a variety of tasks. However, VLMs like Gemini struggle with the dynamics for VidSGG, failing to maintain stable object identities across frames. To overcome this limitation, we propose SAMJAM, a zero-shot pipeline that combines SAM2's temporal tracking with Gemini's semantic understanding. SAM2 also improves upon Gemini's object grounding by producing more accurate bounding boxes. In our method, we first prompt Gemini to generate a frame-level scene graph. Then, we employ a matching algorithm to map each object in the scene graph with a SAM2-generated or SAM2-propagated mask, producing a temporally-consistent scene graph in dynamic environments. Finally, we repeat this process again in each of the following frames. We empirically demonstrate that SAMJAM outperforms Gemini by 8.33% in mean recall on the EPIC-KITCHENS and EPIC-KITCHENS-100 datasets.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes