CVCL
Dec 5, 2023

DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding

MIT
arXiv:2312.02549v1135 citations
h-index: 34
EMNLP
Originality: Incremental advance
AI Analysis

This work addresses the challenge of accurately localizing video moments based on natural language queries, which is important for video understanding applications, and presents an incremental improvement over existing methods.

The paper tackles temporal language grounding with two contributions: an energy-based model framework, and a novel Transformer architecture that uses a damped exponential moving average to improve moment-query relation learning. It reports state-of-the-art results on four public datasets.

Temporal Language Grounding seeks to localize video moments that semantically correspond to a natural language query. Recent advances employ the attention mechanism to learn the relations between video moments and the text query. However, naive attention might not be able to appropriately capture such relations, resulting in ineffective distributions where target video moments are difficult to separate from the remaining ones. To resolve the issue, we propose an energy-based model framework to explicitly learn moment-query distributions. Moreover, we propose DemaFormer, a novel Transformer-based architecture that utilizes exponential moving average with a learnable damping factor to effectively encode moment-query inputs. Comprehensive experiments on four public temporal language grounding datasets showcase the superiority of our methods over the state-of-the-art baselines.
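To make the "exponential moving average with a learnable damping factor" concrete, here is a minimal sketch of one common damped EMA recurrence, y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}. The abstract does not state the exact formulation, so this form is an assumption. The function name `damped_ema` and the parameters `alpha` and `delta` are hypothetical, and in a trained model the damping factor would be a learned parameter rather than a fixed argument.

```python
def damped_ema(xs, alpha, delta):
    """Apply a damped exponential moving average to a sequence of floats.

    Assumed recurrence (not taken from the paper's text):
        y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1},  with y_0 = 0.
    alpha is the smoothing weight; delta is the damping factor.
    delta = 1 recovers the standard EMA, while delta < 1 lets more of the
    previous state carry over to the next step.
    """
    if not (0.0 < alpha <= 1.0 and 0.0 < delta <= 1.0):
        raise ValueError("alpha and delta must lie in (0, 1]")
    outputs = []
    prev = 0.0  # state before the first input
    for x in xs:
        prev = alpha * x + (1.0 - alpha * delta) * prev
        outputs.append(prev)
    return outputs


# Usage: the same inputs under a standard EMA and a damped EMA.
standard = damped_ema([1.0, 1.0], alpha=0.5, delta=1.0)
damped = damped_ema([1.0, 1.0], alpha=0.5, delta=0.5)
```

In this sketch the inputs are scalars; applying the same recurrence independently to each feature dimension of the moment-query encodings would be the natural extension, again as an assumption about how such a layer could be arranged.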

