End-to-End Action Segmentation Transformer
This work addresses temporal action segmentation in video with an end-to-end approach that trains directly on raw frames rather than on pre-computed features.
The paper tackles action segmentation in videos by introducing an end-to-end transformer that processes raw frames directly, eliminating the need for pre-extracted features, and reports state-of-the-art performance on the GTEA, 50Salads, Breakfast, and Assembly-101 benchmarks.
Most recent work on action segmentation relies on pre-computed frame features from models trained on other tasks, and typically focuses on framewise encoding and labeling without explicitly modeling action segments. To overcome these limitations, we introduce the End-to-End Action Segmentation Transformer (EAST), which processes raw video frames directly, eliminating the need for pre-extracted features and enabling true end-to-end training. Our contributions are as follows: (1) a lightweight adapter design for effective fine-tuning of large backbones; (2) an efficient segmentation-by-detection framework that leverages action proposals predicted over a coarsely downsampled video; and (3) a novel action-proposal-based data augmentation strategy. EAST achieves state-of-the-art (SOTA) performance on standard benchmarks, including GTEA, 50Salads, Breakfast, and Assembly-101.
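To make the segmentation-by-detection idea in contribution (2) concrete, the following is a minimal sketch of how scored action proposals could be turned into a framewise labeling at full temporal resolution. It is an illustration under stated assumptions, not the EAST implementation: the `Proposal` type, the normalized time coordinates, the background class index, and the rule that higher-scoring proposals overwrite lower-scoring ones are all hypothetical choices made for this example.

```python
from typing import List, NamedTuple, Tuple


class Proposal(NamedTuple):
    """Hypothetical action proposal; fields are assumptions for this sketch."""
    start: float  # normalized start time in [0, 1]
    end: float    # normalized end time in [0, 1]
    label: int    # action class index
    score: float  # detector confidence


def proposals_to_frame_labels(
    proposals: List[Proposal], num_frames: int, background: int = 0
) -> List[int]:
    """Paint proposals onto a full-resolution timeline of num_frames frames."""
    labels = [background] * num_frames
    # Paint low-confidence proposals first so higher-confidence ones overwrite them.
    for p in sorted(proposals, key=lambda q: q.score):
        first = max(0, int(round(p.start * num_frames)))
        last = min(num_frames, int(round(p.end * num_frames)))
        for t in range(first, last):
            labels[t] = p.label
    return labels


def labels_to_segments(labels: List[int]) -> List[Tuple[int, int, int]]:
    """Collapse framewise labels into (start, end_exclusive, label) segments."""
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((start, t, labels[start]))
            start = t
    return segments


if __name__ == "__main__":
    # Two overlapping proposals, as might be predicted on a coarsely downsampled video.
    props = [Proposal(0.0, 0.5, 1, 0.9), Proposal(0.4, 1.0, 2, 0.6)]
    frame_labels = proposals_to_frame_labels(props, num_frames=10)
    print(frame_labels)
    print(labels_to_segments(frame_labels))
```

Because the proposals are expressed in normalized time, they can be predicted on a short, downsampled clip and still be projected onto the original frame count, which is the efficiency argument behind detecting first and labeling frames second.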