Weakly-Supervised Dense Action Anticipation
This paper addresses the problem of reducing annotation costs for video action anticipation, though the contribution is incremental, as it builds on existing weakly-supervised approaches.
The paper tackles dense action anticipation, which forecasts future actions and their durations, by introducing a weakly-supervised method that uses a small set of fully-labeled sequences together with a larger set of sequences in which only the upcoming action is labeled. The results show competitive performance against fully supervised state-of-the-art models on the Breakfast and 50Salads benchmarks.
Dense anticipation aims to forecast future actions and their durations for long horizons. Existing approaches rely on fully-labelled data, i.e. sequences labelled with all future actions and their durations. We present a (semi-) weakly supervised method using only a small number of fully-labelled sequences and predominantly sequences in which only the (one) upcoming action is labelled. To this end, we propose a framework that generates pseudo-labels for future actions and their durations and adaptively refines them through a refinement module. Given only the upcoming action label as input, these pseudo-labels guide action/duration prediction for the future. We further design an attention mechanism to predict context-aware durations. Experiments on the Breakfast and 50Salads benchmarks verify our method's effectiveness; we are competitive even when compared to fully supervised state-of-the-art models. We will make our code available at: https://github.com/zhanghaotong1/WSLVideoDenseAnticipation.
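To make the two supervision regimes concrete, the following is a minimal sketch of how training targets might be assembled for a fully-labelled sequence versus a weakly-labelled one. It is an illustration only, not the authors' implementation: the `Sample` class, the `build_targets` function, the action names, and the assumption that the duration of the upcoming action is also taken from the pseudo-labels are all hypothetical. The pseudo-labels are passed in as plain lists; how they are generated and refined is not modelled here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Target = Tuple[str, float, str]  # (action, relative duration, label source)


@dataclass
class Sample:
    # All future actions if fully labelled; only the upcoming action otherwise.
    future_actions: List[str]
    # Relative durations of the future actions; None for weakly-labelled samples.
    future_durations: Optional[List[float]]


def build_targets(
    sample: Sample,
    pseudo_actions: List[str],
    pseudo_durations: List[float],
) -> List[Target]:
    """Assemble per-step training targets (hypothetical helper)."""
    if sample.future_durations is not None:
        # Fully-labelled sequence: use the annotations directly.
        return [
            (action, duration, "ground_truth")
            for action, duration in zip(sample.future_actions, sample.future_durations)
        ]

    targets: List[Target] = []
    for step, (action, duration) in enumerate(zip(pseudo_actions, pseudo_durations)):
        if step == 0:
            # The upcoming action is annotated, so its label overrides the
            # pseudo-label; its duration is assumed to be unlabelled.
            targets.append((sample.future_actions[0], duration, "weak_label"))
        else:
            # Later steps have no annotation and rely on pseudo-labels only.
            targets.append((action, duration, "pseudo_label"))
    return targets


# Hypothetical action names, chosen for illustration only.
full = Sample(["pour_milk", "stir"], [0.6, 0.4])
weak = Sample(["crack_egg"], None)

full_targets = build_targets(full, [], [])
weak_targets = build_targets(weak, ["fry_egg", "add_salt"], [0.7, 0.3])
```

In this sketch the single annotated label corrects the first pseudo-label, while every later step is supervised purely by pseudo-labels, which is why their quality, and hence a refinement step, matters for the weakly-labelled majority of the data.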