Temporal Segment Transformer for Action Segmentation
This work addresses the challenge of long-range temporal modeling in video activity understanding for computer vision applications, representing an incremental improvement over existing predict-and-refine methods.
The paper tackles the problem of noisy and inaccurate segment representations in action segmentation from untrimmed videos by proposing a temporal segment transformer that uses attention for joint segment relation modeling and denoising, achieving state-of-the-art accuracy on the 50Salads, GTEA, and Breakfast benchmarks.
Recognizing human actions from untrimmed videos is an important task in activity understanding and poses unique challenges in modeling long-range temporal relations. Recent works adopt a predict-and-refine strategy, which converts an initial prediction into action segments for global context modeling. However, the generated segment representations are often noisy and suffer from inaccurate segment boundaries, over-segmentation, and other problems. To deal with these issues, we propose an attention-based approach, which we call the \textit{temporal segment transformer}, for joint segment relation modeling and denoising. The main idea is to denoise segment representations using attention between segment and frame representations, and to capture temporal correlations between segments using inter-segment attention. The refined segment representations are used to predict action labels and adjust segment boundaries, and the final action segmentation is produced by voting over the segment masks. We show that this architecture achieves state-of-the-art accuracy on the popular 50Salads, GTEA, and Breakfast benchmarks. We also conduct extensive ablations to demonstrate the effectiveness of the different components of our design.
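To make the pipeline described above concrete, the following is a minimal NumPy sketch of its three stages: converting an initial frame-wise prediction into segments, refining segment features with segment-frame and inter-segment attention, and voting over segment masks. It is an illustration under simplifying assumptions, not the model itself: the attention is single-head with no learned projections, and the function names (`predictions_to_segments`, `refine_segments`, `vote_from_masks`) and the particular voting rule (class probabilities weighted by mask values) are hypothetical choices made for this example.

```python
import numpy as np

def predictions_to_segments(frame_labels):
    """Group runs of identical frame labels into (start, end, label); end is exclusive."""
    segments = []
    start = 0
    n = len(frame_labels)
    for t in range(1, n + 1):
        # Close the current run at the end of the video or when the label changes.
        if t == n or frame_labels[t] != frame_labels[start]:
            segments.append((start, t, int(frame_labels[start])))
            start = t
    return segments

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention (assumption: single head, no learned weights)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values

def refine_segments(segment_feats, frame_feats):
    """segment_feats: (S, D), frame_feats: (T, D) -> refined segment features (S, D)."""
    # Segment-frame attention: each segment queries all frames (the denoising step).
    denoised = segment_feats + attention(segment_feats, frame_feats, frame_feats)
    # Inter-segment attention: segments attend to one another for temporal context.
    return denoised + attention(denoised, denoised, denoised)

def vote_from_masks(masks, class_probs):
    """masks: (S, T) segment masks, class_probs: (S, C) -> frame labels (T,).

    Assumed voting rule: each segment votes for its class distribution on every
    frame, weighted by that frame's mask value; the top-scoring class wins.
    """
    votes = masks.T @ class_probs  # (T, C)
    return votes.argmax(axis=1)
```

As a usage example, `predictions_to_segments(np.array([0, 0, 1, 1, 1, 0, 2, 2]))` yields four segments, the third of which is a single-frame run of the kind that over-segmentation produces. In a full model, the refined features from `refine_segments` would feed learned heads that predict the class probabilities and masks consumed by `vote_from_masks`; those heads are omitted here.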