CV · Mar 7, 2020

TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation

arXiv:2003.03530v1 · 22 citations
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in video action anticipation for computer vision applications, offering an incremental improvement over existing methods.

The paper tackles the problem of inefficient long-term information capture in video action anticipation by proposing a Temporal Transformer with Progressive Prediction (TTPP) framework, which outperforms state-of-the-art methods on three datasets (TVSeries, THUMOS-14, TV-Human-Interaction) and is more efficient.

Video action anticipation aims to predict future action categories from observed frames. Current state-of-the-art approaches mainly resort to recurrent neural networks to encode history information into hidden states, and predict future actions from the hidden representations. It is well known that the recurrent pipeline is inefficient at capturing long-term information, which may limit its performance in prediction tasks. To address this problem, this paper proposes a simple yet efficient Temporal Transformer with Progressive Prediction (TTPP) framework, which repurposes a Transformer-style architecture to aggregate observed features, and then leverages a light-weight network to progressively predict future features and actions. Specifically, predicted features along with predicted probabilities are accumulated into the inputs of subsequent predictions. We evaluate our approach on three action datasets, namely TVSeries, THUMOS-14, and TV-Human-Interaction. Additionally, we conduct a comprehensive study of several popular aggregation and prediction strategies. Extensive results show that TTPP not only outperforms the state-of-the-art methods but is also more efficient.
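
The aggregate-then-predict loop described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single-query scaled dot-product attention (the last observed frame attends over all observed frames) as a stand-in for the Transformer-style aggregator, and a single linear layer with tanh as a stand-in for the light-weight predictor. The names `attend`, `progressive_predict`, `W_feat`, and `W_cls` are hypothetical. So are the weight shapes, the uniform initial probabilities, and the choice to feed only the latest predicted feature and probabilities into the next step, which is one possible reading of "accumulated".

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(query, feats):
    """Scaled dot-product attention with one query over a list of features."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, f)) / math.sqrt(d) for f in feats]
    weights = softmax(scores)
    return [sum(w * f[i] for w, f in zip(weights, feats)) for i in range(d)]


def progressive_predict(observed, W_feat, W_cls, steps):
    """Aggregate observed features, then predict future steps one at a time.

    Each step's predicted feature and class probabilities are fed back
    into the input of the next step (an assumed feedback scheme).
    """
    agg = attend(observed[-1], observed)  # assumed aggregator: last frame as query
    num_classes = len(W_cls)
    prev_feat = agg
    prev_prob = [1.0 / num_classes] * num_classes  # assumed uniform start
    outputs = []
    for _ in range(steps):
        x = agg + prev_feat + prev_prob  # list concatenation of the three inputs
        feat = [math.tanh(v) for v in matvec(W_feat, x)]
        prob = softmax(matvec(W_cls, feat))
        outputs.append((feat, prob))
        prev_feat, prev_prob = feat, prob
    return outputs


# Toy usage with random, untrained weights (illustrative values only).
random.seed(0)
d, num_classes = 4, 3
observed = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]
W_feat = [[random.gauss(0, 0.1) for _ in range(2 * d + num_classes)] for _ in range(d)]
W_cls = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(num_classes)]
outputs = progressive_predict(observed, W_feat, W_cls, steps=3)
for t, (feat, prob) in enumerate(outputs, start=1):
    print(t, [round(p, 3) for p in prob])
```

Because the aggregation is a single attention pass rather than a step-by-step recurrence over the history, every observed frame contributes directly to the aggregated feature, which is the efficiency argument the abstract makes against recurrent pipelines.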
