CV, Dec 23, 2024

Hierarchical Vector Quantization for Unsupervised Action Segmentation

arXiv:2412.17640v2, 17 citations, h-index: 5, AAAI
Originality: Incremental advance
AI Analysis

This work addresses segmenting long, untrimmed videos into semantic segments that are consistent across videos, a task relevant to video analysis. The advance is incremental: it builds on existing approaches that combine representation learning and clustering.

The paper proposes Hierarchical Vector Quantization (HVQ) for unsupervised temporal action segmentation, to handle large variations within temporal segments of the same class. It reports improvements over state-of-the-art methods on three public datasets (Breakfast, YouTube Instructional and IKEA ASM) in terms of F1 score, recall, and a new metric based on the Jensen-Shannon Distance (JSD).

In this work, we address unsupervised temporal action segmentation, which segments a set of long, untrimmed videos into semantically meaningful segments that are consistent across videos. While recent approaches combine representation learning and clustering in a single step for this task, they do not cope with large variations within temporal segments of the same class. To address this limitation, we propose a novel method, termed Hierarchical Vector Quantization (HVQ), that consists of two subsequent vector quantization modules. This results in a hierarchical clustering where the additional subclusters cover the variations within a cluster. We demonstrate that our approach captures the distribution of segment lengths much better than the state of the art. To this end, we introduce a new metric based on the Jensen-Shannon Distance (JSD) for unsupervised temporal action segmentation. We evaluate our approach on three public datasets, namely Breakfast, YouTube Instructional and IKEA ASM. Our approach outperforms the state of the art in terms of F1 score, recall and JSD.
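The abstract describes two ideas that a toy example can make concrete: two consecutive quantization steps (frame to subcluster, subcluster to cluster) and a comparison of segment-length distributions with the Jensen-Shannon Distance. The following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not specify how codebooks are learned, how subclusters map to clusters, or how the JSD metric bins and aggregates lengths, so the fixed codebooks, the nearest-prototype mapping, the clipped histogram, and all function names here are hypothetical.

```python
import math


def nearest(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared Euclidean)."""
    dists = [sum((v - c) ** 2 for v, c in zip(vec, entry)) for entry in codebook]
    return dists.index(min(dists))


def hierarchical_quantize(frames, sub_codebook, top_codebook):
    """Two consecutive quantization steps: frame -> subcluster -> cluster.

    Assumption: a subcluster is mapped to a cluster by quantizing its
    prototype against the coarse codebook.
    """
    labels = []
    for frame in frames:
        sub = nearest(frame, sub_codebook)               # fine-grained subcluster
        top = nearest(sub_codebook[sub], top_codebook)   # coarse cluster
        labels.append(top)
    return labels


def segment_lengths(labels):
    """Lengths of runs of identical consecutive frame labels."""
    lengths = []
    for i, lab in enumerate(labels):
        if i > 0 and lab == labels[i - 1]:
            lengths[-1] += 1
        else:
            lengths.append(1)
    return lengths


def histogram(values, bins):
    """Normalized histogram over lengths 1..bins; longer values fall in the last bin."""
    counts = [0] * bins
    for v in values:
        counts[min(v, bins) - 1] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


# Toy data: four subcluster prototypes grouped under two coarse clusters.
sub_codebook = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]]
top_codebook = [[0.5, 0.0], [10.5, 10.0]]
frames = [[0.1, 0.0], [0.9, 0.1], [1.1, 0.0], [10.2, 9.9], [10.9, 10.1], [0.0, 0.1]]

labels = hierarchical_quantize(frames, sub_codebook, top_codebook)
predicted = histogram(segment_lengths(labels), bins=4)
reference = histogram([3, 3], bins=4)  # hypothetical ground-truth segment lengths
print(labels, js_distance(predicted, reference))
```

In this toy setup, frames near `[0, 0]` and `[1, 0]` fall into different subclusters but the same coarse cluster, which is the sense in which subclusters can absorb variation within a cluster. A distance of 0 means the predicted and reference length distributions match; with base-2 logarithms the distance is bounded by 1.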

Code Implementations: 1 repo
