CV · Jul 19, 2022

Action Quality Assessment with Temporal Parsing Transformer

arXiv:2207.09270v1 · 77 citations · h-index: 60
Originality: Incremental advance
AI Analysis

This work addresses fine-grained action understanding for applications like sports analysis, but it is incremental as it builds on existing contrastive regression methods.

The paper tackles the problem of Action Quality Assessment (AQA) by proposing a temporal parsing transformer to decompose holistic video features into part-level representations, achieving state-of-the-art performance with significant improvements on three public benchmarks.

Action Quality Assessment (AQA) is important for action understanding, and resolving the task poses unique challenges due to subtle visual differences. Existing state-of-the-art methods typically rely on holistic video representations for score regression or ranking, which limits the generalization to capture fine-grained intra-class variation. To overcome this limitation, we propose a temporal parsing transformer to decompose the holistic feature into temporal part-level representations. Specifically, we utilize a set of learnable queries to represent the atomic temporal patterns for a specific action. Our decoding process converts the frame representations to a fixed number of temporally ordered part representations. To obtain the quality score, we adopt state-of-the-art contrastive regression based on the part representations. Since existing AQA datasets do not provide temporal part-level labels or partitions, we propose two novel loss functions on the cross-attention responses of the decoder: a ranking loss to ensure that the learnable queries satisfy the temporal order in cross attention, and a sparsity loss to encourage the part representations to be more discriminative. Extensive experiments show that our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
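To make the two attention-based losses more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's exact formulation: it assumes each query's cross-attention over frames is summarized by an attention "center" (the expected normalized frame position), that the ranking loss is a hinge penalty on out-of-order centers, and that the sparsity loss penalizes attention mass far from the center. The function names, the `margin` parameter, and the toy attention matrices are all hypothetical.

```python
def attention_centers(attn):
    """attn[k][t] is the attention weight of query k on frame t (rows sum to 1).

    Returns, for each query, the expected frame position normalized to [0, 1].
    Assumes at least two frames.
    """
    num_frames = len(attn[0])
    return [
        sum(w * (t / (num_frames - 1)) for t, w in enumerate(row))
        for row in attn
    ]


def ranking_loss(centers, margin=0.0):
    """Hinge penalty (assumed form) when query k's center is not before query k+1's."""
    return sum(
        max(0.0, centers[k] - centers[k + 1] + margin)
        for k in range(len(centers) - 1)
    )


def sparsity_loss(attn, centers):
    """Attention-weighted distance to each query's center (assumed form).

    Small when each query attends to a compact window of frames.
    """
    num_frames = len(attn[0])
    return sum(
        w * abs(t / (num_frames - 1) - c)
        for row, c in zip(attn, centers)
        for t, w in enumerate(row)
    )


# Toy cross-attention maps: 2 queries over 4 frames.
ordered = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]   # query 0 fires before query 1
swapped = [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0]]   # temporal order violated
diffuse = [[0.25, 0.25, 0.25, 0.25]]                      # attention spread over all frames

rank_ordered = ranking_loss(attention_centers(ordered))
rank_swapped = ranking_loss(attention_centers(swapped))
sparse_peaked = sparsity_loss(ordered, attention_centers(ordered))
sparse_diffuse = sparsity_loss(diffuse, attention_centers(diffuse))
```

In this toy setup the ordered, sharply peaked attention incurs no penalty from either term, while the swapped queries are penalized by the ranking term and the diffuse attention by the sparsity term. In a real model both terms would be computed on differentiable tensors and added to the regression objective.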

Code Implementations: 1 repo
