CV · IV · Mar 8, 2025

End-to-End Action Segmentation Transformer

arXiv:2503.06316v3 · 2 citations · h-index: 1 · 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality: Highly original
AI Analysis

This work addresses action segmentation in video, offering a novel end-to-end approach aimed at improving both efficiency and accuracy.

The paper tackles action segmentation in videos by introducing an end-to-end transformer that processes raw frames directly, eliminating the need for pre-extracted features, and reports state-of-the-art performance on benchmarks including GTEA, 50Salads, Breakfast, and Assembly-101.

Most recent work on action segmentation relies on pre-computed frame features from models trained on other tasks and typically focuses on framewise encoding and labeling without explicitly modeling action segments. To overcome these limitations, we introduce the End-to-End Action Segmentation Transformer (EAST), which processes raw video frames directly -- eliminating the need for pre-extracted features and enabling true end-to-end training. Our contributions are as follows: (1) a lightweight adapter design for effective fine-tuning of large backbones; (2) an efficient segmentation-by-detection framework for leveraging action proposals predicted over a coarsely downsampled video; and (3) a novel action-proposal-based data augmentation strategy. EAST achieves SOTA performance on standard benchmarks, including GTEA, 50Salads, Breakfast, and Assembly-101.
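As a rough illustration of the segmentation-by-detection idea in contribution (2), the sketch below shows one way proposals predicted on a coarsely downsampled timeline could be painted back onto the full-resolution frames and then collapsed into segments. It is not EAST's implementation: the `Proposal` fields, the `stride` parameter, the background label, and the "highest score wins on overlap" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Proposal:
    """Hypothetical action proposal on the downsampled timeline."""
    start: float  # start position, in coarse (downsampled) frame units
    end: float    # end position (exclusive), in coarse frame units
    label: int    # predicted action class
    score: float  # confidence of the proposal


def proposals_to_frame_labels(
    proposals: List[Proposal],
    num_frames: int,
    stride: int,
    background: int = 0,
) -> List[int]:
    """Map coarse proposals to one label per full-resolution frame.

    Frames covered by no proposal keep the background label; where
    proposals overlap, the one with the higher score wins (an assumed rule).
    """
    labels = [background] * num_frames
    best = [float("-inf")] * num_frames
    for p in proposals:
        # Rescale coarse boundaries to full-resolution frame indices.
        lo = max(0, int(round(p.start * stride)))
        hi = min(num_frames, int(round(p.end * stride)))
        for t in range(lo, hi):
            if p.score > best[t]:
                best[t] = p.score
                labels[t] = p.label
    return labels


def labels_to_segments(labels: List[int]) -> List[Tuple[int, int, int]]:
    """Collapse framewise labels into (start, end_exclusive, label) runs."""
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((start, t, labels[start]))
            start = t
    return segments


if __name__ == "__main__":
    # Two overlapping proposals on a timeline downsampled by a factor of 4.
    props = [Proposal(0, 2, 1, 0.9), Proposal(1.5, 4, 2, 0.6)]
    frame_labels = proposals_to_frame_labels(props, num_frames=16, stride=4)
    print(labels_to_segments(frame_labels))
```

Working on the downsampled timeline keeps the number of positions the detector must consider small, while the final rescaling step still yields a label for every original frame.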

