CV, Apr 29

HOI-aware Adaptive Network for Weakly-supervised Action Segmentation

arXiv:2604.26227 42.810 citations h-index: 60
AI Analysis

For researchers in action segmentation, this work introduces a method that uses HOI to disambiguate similar actions; the contribution is incremental, since it builds on existing weakly-supervised frameworks.

The paper tackles weakly-supervised action segmentation, specifically the ambiguity between similar actions (e.g., pouring juice vs. pouring coffee). The proposed AdaAct network uses human-object interactions (HOI) as video-level prior knowledge and adapts its parameters at test time, achieving state-of-the-art performance on the Breakfast and 50Salads datasets.

In this paper, we propose an HOI-aware adaptive network named AdaAct for weakly-supervised action segmentation. Most existing methods learn a fixed network that predicts the action of each frame from its neighboring frames. However, this leads to ambiguity when estimating similar actions, such as pouring juice and pouring coffee. To address this, we exploit temporally global but spatially local human-object interactions (HOI) as video-level prior knowledge for action segmentation. The long-term HOI sequence provides crucial contextual information for distinguishing ambiguous actions, and our network dynamically adapts to the given HOI sequence at test time. More specifically, we first design a video HOI encoder that extracts, selects, and integrates the most representative HOI throughout the video. Then, we propose a two-branch HyperNetwork to learn an adaptive temporal encoder, which automatically adjusts its parameters on the fly based on the HOI information of each video. Extensive experiments on two widely used datasets, Breakfast and 50Salads, demonstrate the effectiveness of our method under different evaluation metrics.
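The idea of a HyperNetwork that sets the temporal encoder's parameters per video can be illustrated with a small sketch. This is not the AdaAct architecture: the abstract does not specify the two branches, the layer types, or the dimensions, so everything below is an assumption for illustration. The names `make_hypernetwork`, `generate_kernel`, and `adaptive_temporal_encoder` are hypothetical, the hypernetwork is a single linear layer, and the encoder is a depthwise temporal convolution written in NumPy.

```python
import numpy as np


def make_hypernetwork(hoi_dim, feat_dim, kernel_size, rng):
    """Randomly initialise a single-layer hypernetwork (hypothetical layout)."""
    return {
        "W": rng.normal(0.0, 0.1, size=(hoi_dim, kernel_size * feat_dim)),
        "b": np.zeros(kernel_size * feat_dim),
        "kernel_size": kernel_size,
        "feat_dim": feat_dim,
    }


def generate_kernel(hyper, hoi_embedding):
    """Map a video-level HOI embedding (hoi_dim,) to a temporal kernel (K, D)."""
    flat = hoi_embedding @ hyper["W"] + hyper["b"]
    return flat.reshape(hyper["kernel_size"], hyper["feat_dim"])


def adaptive_temporal_encoder(frames, kernel):
    """Depthwise temporal convolution over frames (T, D).

    Uses zero padding so the output keeps T frames; assumes an odd kernel size.
    """
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(frames, ((pad, pad), (0, 0)))
    out = np.zeros_like(frames)
    for offset in range(k):
        # Each temporal tap scales every feature channel independently.
        out += padded[offset:offset + frames.shape[0]] * kernel[offset]
    return out


rng = np.random.default_rng(0)
hyper = make_hypernetwork(hoi_dim=8, feat_dim=4, kernel_size=3, rng=rng)

frames = rng.normal(size=(10, 4))   # the same frame features for both videos
hoi_juice = rng.normal(size=8)      # stand-in HOI embedding for one video
hoi_coffee = rng.normal(size=8)     # stand-in HOI embedding for another video

out_juice = adaptive_temporal_encoder(frames, generate_kernel(hyper, hoi_juice))
out_coffee = adaptive_temporal_encoder(frames, generate_kernel(hyper, hoi_coffee))
print(out_juice.shape, np.allclose(out_juice, out_coffee))
```

The point of the sketch is that identical frame features are encoded differently once the video-level HOI embedding changes, which is how a per-video prior could help separate visually similar actions. In a trained system the hypernetwork weights would be learned jointly with the rest of the model rather than sampled at random.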

