CV, LG · Mar 27, 2024

PLOT-TAL: Prompt Learning with Optimal Transport for Few-Shot Temporal Action Localization

arXiv:2403.18915v2 · 2 citations · h-index: 4 · 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality: Highly original
AI Analysis

This work addresses the challenge of localizing actions precisely in videos when only a few labeled examples are available. It represents an incremental improvement over single-prompt tuning methods.

The paper tackles imprecise temporal boundaries in few-shot temporal action localization by proposing a multi-prompt ensemble framework aligned with Optimal Transport, and reports state-of-the-art results on the THUMOS'14 and EPIC-Kitchens benchmarks.

Few-shot temporal action localization (TAL) methods that adapt large models via single-prompt tuning often fail to produce precise temporal boundaries. This stems from the model learning a non-discriminative mean representation of an action from sparse data, which compromises generalization. We address this by proposing a new paradigm based on multi-prompt ensembles, where a set of diverse, learnable prompts for each action is encouraged to specialize on compositional sub-events. To enforce this specialization, we introduce PLOT-TAL, a framework that leverages Optimal Transport (OT) to find a globally optimal alignment between the prompt ensemble and the video's temporal features. Our method establishes a new state-of-the-art on the challenging few-shot benchmarks of THUMOS'14 and EPIC-Kitchens, without requiring complex meta-learning. The significant performance gains, particularly at high IoU thresholds, validate our hypothesis and demonstrate the superiority of learning distributed, compositional representations for precise temporal localization.
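The abstract describes scoring an action by optimally aligning a set of learnable prompts with a video's temporal features. The sketch below shows one common way to compute such an alignment, under assumptions the abstract does not state: entropic OT solved with Sinkhorn iterations, a cosine-distance cost, and uniform mass over prompts and over time steps. The function names (`sinkhorn`, `ot_prompt_score`), the regularization `eps`, and the array shapes are hypothetical; this illustrates the idea and is not the authors' implementation.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def sinkhorn(cost: np.ndarray, a: np.ndarray, b: np.ndarray,
             eps: float = 0.1, n_iters: int = 200) -> np.ndarray:
    """Entropy-regularized OT via Sinkhorn scaling (assumed solver).

    Returns a plan P >= 0 whose row sums approach `a` and whose
    column sums equal `b` after the final update.
    """
    kernel = np.exp(-cost / eps)      # Gibbs kernel, strictly positive
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (kernel @ v)          # rescale rows toward marginal a
        v = b / (kernel.T @ u)        # rescale columns to marginal b
    return u[:, None] * kernel * v[None, :]


def ot_prompt_score(prompts, frames, eps=0.1, n_iters=200):
    """Score one action class by aligning its prompt ensemble with a video.

    prompts: (M, D) prompt embeddings for one class (hypothetical shape).
    frames:  (T, D) temporal video features (hypothetical shape).
    Returns (score, plan): the cosine similarity averaged under the OT plan.
    """
    p = l2_normalize(np.asarray(prompts, dtype=float))
    f = l2_normalize(np.asarray(frames, dtype=float))
    sim = p @ f.T                                   # (M, T) cosine similarity
    a = np.full(p.shape[0], 1.0 / p.shape[0])       # uniform mass over prompts
    b = np.full(f.shape[0], 1.0 / f.shape[0])       # uniform mass over time
    plan = sinkhorn(1.0 - sim, a, b, eps, n_iters)  # cost = 1 - cosine
    return float(np.sum(plan * sim)), plan


if __name__ == "__main__":
    # Toy video: two sub-events, each occupying half of 10 time steps.
    e1, e2 = np.eye(4)[0], np.eye(4)[1]
    frames = np.stack([e1] * 5 + [e2] * 5)

    # Two specialized prompts versus one averaged ("mean") prompt.
    ensemble_score, plan = ot_prompt_score(np.stack([e1, e2]), frames)
    mean_score, _ = ot_prompt_score(((e1 + e2) / 2)[None, :], frames)
    print(f"ensemble: {ensemble_score:.3f}, mean prompt: {mean_score:.3f}")
    print("prompt 0 mass on first half:", round(float(plan[0, :5].sum()), 3))
```

In this toy case each specialized prompt absorbs one half of the video, so the OT score comes out close to 1, while a single averaged prompt matches every frame only at cosine 1/√2 ≈ 0.71. This mirrors, in miniature, the abstract's argument against a single mean representation; it is not evidence for the reported benchmark results.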
