Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation
This work addresses the need for efficient and accurate temporal action segmentation in video analysis, offering an incremental improvement through a simple loss design that can be integrated into existing models.
The paper improves fine-grained action segmentation without resorting to complex architectures by introducing a lightweight dual-loss training framework that combines boundary supervision with segment-level regularization. The method yields higher F1 and Edit scores across three benchmark datasets and three models, with minimal impact on frame-wise accuracy.
Recent progress in Temporal Action Segmentation (TAS) has increasingly relied on complex architectures, which can hinder practical deployment. We present a lightweight dual-loss training framework that improves fine-grained segmentation quality with only one additional output channel and two auxiliary loss terms, requiring minimal architectural modification. Our approach combines two terms: a boundary-regression loss that promotes accurate temporal localization via a single-channel boundary prediction, and a CDF-based segment-level regularization loss that encourages coherent within-segment structure by matching cumulative distributions over predicted and ground-truth segments. The framework is architecture-agnostic and can be integrated into existing TAS models (e.g., MS-TCN, C2F-TCN, FACT) as a training-time loss function. Across three benchmark datasets and three different models, the proposed method improves segment-level consistency and boundary quality, yielding higher F1 and Edit scores. Frame-wise accuracy remains largely unchanged, highlighting that precise segmentation can be achieved through simple loss design rather than heavier architectures or inference-time refinements.
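To make the two auxiliary terms concrete, the following is a minimal sketch in plain Python and not the authors' implementation. The abstract does not give the exact formulation, so everything here is an assumption chosen for illustration: boundary targets are 0/1 indicators of label changes, the boundary term is a mean squared error, the CDF term is a mean absolute gap between normalized cumulative sums of per-class mass along time, and all function names are hypothetical.

```python
from itertools import accumulate

def boundary_targets(labels):
    """Assumed target: 1.0 where the label differs from the previous frame."""
    return [0.0] + [1.0 if cur != prev else 0.0
                    for cur, prev in zip(labels[1:], labels[:-1])]

def boundary_loss(pred_boundary, labels):
    """Assumed boundary term: MSE between the single boundary channel and the targets."""
    target = boundary_targets(labels)
    return sum((p - t) ** 2 for p, t in zip(pred_boundary, target)) / len(target)

def temporal_cdf(mass):
    """Normalized cumulative sum of per-frame mass along the time axis."""
    total = sum(mass)
    if total == 0:
        return [0.0] * len(mass)  # class absent from the video
    return [c / total for c in accumulate(mass)]

def cdf_loss(probs, labels, num_classes):
    """Assumed segment term: mean absolute gap between predicted and
    ground-truth temporal CDFs, averaged over frames and classes."""
    num_frames = len(labels)
    gap = 0.0
    for c in range(num_classes):
        pred_cdf = temporal_cdf([frame[c] for frame in probs])
        true_cdf = temporal_cdf([1.0 if y == c else 0.0 for y in labels])
        gap += sum(abs(p - t) for p, t in zip(pred_cdf, true_cdf)) / num_frames
    return gap / num_classes

# Toy video: four frames, two classes, one boundary at frame 2.
labels = [0, 0, 1, 1]
perfect = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
shifted = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]  # boundary one frame early

print(boundary_loss([0.0, 0.0, 1.0, 0.0], labels))  # exact boundary prediction
print(cdf_loss(perfect, labels, num_classes=2))
print(cdf_loss(shifted, labels, num_classes=2))
```

In this sketch the CDF term is zero for the perfectly aligned prediction and positive for the shifted one, even though only a single frame is misclassified. That matches the intuition in the abstract that segment-level scores can improve while frame-wise accuracy barely moves. In an actual training setup both terms would be added, with weights, to the base model's frame-wise classification loss; those weights are not specified here.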