CV · Jul 4, 2024

QueryMamba: A Mamba-Based Encoder-Decoder Architecture with a Statistical Verb-Noun Interaction Module for Video Action Forecasting @ Ego4D Long-Term Action Anticipation Challenge 2024

arXiv:2407.04184v1 · 6 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses action forecasting in egocentric videos. The contribution is incremental: it builds on existing methods and adds a new module intended to improve accuracy.

The paper tackles video action forecasting by proposing QueryMamba, a Mamba-based encoder-decoder architecture with a statistical verb-noun interaction module, achieving second place in the Ego4D LTA challenge and first in noun prediction accuracy.

This report presents a novel Mamba-based encoder-decoder architecture, QueryMamba, featuring an integrated verb-noun interaction module that utilizes a statistical verb-noun co-occurrence matrix to enhance video action forecasting. This architecture not only predicts verbs and nouns likely to occur based on historical data but also considers their joint occurrence to improve forecast accuracy. The efficacy of this approach is substantiated by experimental results, with the method achieving second place in the Ego4D LTA challenge and ranking first in noun prediction accuracy.
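The abstract does not spell out how the co-occurrence matrix is built or combined with the model's predictions, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes the matrix is estimated by counting (verb, noun) pairs in training annotations, and that noun probabilities are reweighted by a verb-conditioned prior. The function names `cooccurrence_matrix` and `rescore_nouns`, the add-k smoothing, and the multiplicative combination are all hypothetical choices made for illustration.

```python
from collections import Counter


def cooccurrence_matrix(pairs, num_verbs, num_nouns, smoothing=1.0):
    """Estimate P(noun | verb) from (verb_id, noun_id) pairs.

    Add-k smoothing (an assumption, not from the report) keeps unseen
    pairs from receiving exactly zero probability.
    """
    counts = Counter(pairs)
    matrix = []
    for v in range(num_verbs):
        row = [counts[(v, n)] + smoothing for n in range(num_nouns)]
        total = sum(row)
        matrix.append([c / total for c in row])
    return matrix


def rescore_nouns(verb_probs, noun_probs, cond):
    """Reweight noun probabilities by a verb-weighted co-occurrence prior.

    prior[n] = sum_v P(verb = v) * P(noun = n | verb = v)
    The product of model probability and prior is then renormalised.
    """
    num_verbs = len(verb_probs)
    prior = [
        sum(verb_probs[v] * cond[v][n] for v in range(num_verbs))
        for n in range(len(noun_probs))
    ]
    scores = [p * q for p, q in zip(noun_probs, prior)]
    total = sum(scores)
    return [s / total for s in scores]


# Toy data: verb 0 mostly co-occurs with noun 0, verb 1 with noun 1.
train_pairs = [(0, 0), (0, 0), (1, 1)]
cond = cooccurrence_matrix(train_pairs, num_verbs=2, num_nouns=2)

# The model is certain about verb 0 but undecided between the two nouns.
rescored = rescore_nouns([1.0, 0.0], [0.5, 0.5], cond)
```

In this toy case the co-occurrence prior breaks the tie in favour of noun 0, which is the kind of effect joint verb-noun statistics are meant to have on a forecast.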

