CVFeb 15, 2025

Improving action segmentation via explicit similarity measurement

arXiv:2502.10713v11 citationsh-index: 7
Originality Incremental advance
AI Analysis

This work addresses action segmentation for video analysis, presenting an incremental improvement over existing methods.

The paper tackles the problem of action segmentation by proposing ASESM, which enhances segmentation accuracy through explicit similarity measurement across frames and predictions, achieving improved results on 50Salads, GTEA, and Breakfast datasets.

Existing supervised action segmentation methods depend on the quality of frame-wise classification using attention mechanisms or temporal convolutions to capture temporal dependencies. Even boundary detection-based methods primarily depend on the accuracy of an initial frame-wise classification, which can overlook precise identification of segments and boundaries in case of low-quality prediction. To address this problem, this paper proposes ASESM (Action Segmentation via Explicit Similarity Measurement) to enhance the segmentation accuracy by incorporating explicit similarity evaluation across frames and predictions. Our supervised learning architecture uses frame-level multi-resolution features as input to multiple Transformer encoders. The resulting multiple frame-wise predictions are used for similarity voting to obtain high quality initial prediction. We apply a newly proposed boundary correction algorithm that operates based on feature similarity between consecutive frames to adjust the boundary locations iteratively through the learning process. The corrected prediction is then further refined through multiple stages of temporal convolutions. As post-processing, we optionally apply boundary correction again followed by a segment smoothing method that removes outlier classes within segments using similarity measurement between consecutive predictions. Additionally, we propose a fully unsupervised boundary detection-correction algorithm that identifies segment boundaries based solely on feature similarity without any training. Experiments on 50Salads, GTEA, and Breakfast datasets show the effectiveness of both the supervised and unsupervised algorithms. Code and models are made available on Github.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes