Multi Activity Sequence Alignment via Implicit Clustering
This work addresses the need to train a separate model for each activity in sequence alignment, offering a more efficient solution for applications in computer vision and robotics.
Existing self-supervised temporal sequence alignment methods require a separate model for each activity. The paper addresses this limitation with a framework that aligns sequences across multiple activities via implicit clustering and dual augmentation, achieving state-of-the-art results on three diverse datasets.
Self-supervised temporal sequence alignment can provide rich and effective representations for a wide range of applications. However, to achieve optimal performance, existing methods are mostly limited to aligning sequences of the same activity and require a separate model to be trained for each activity. We propose a novel framework that overcomes these limitations using sequence alignment via implicit clustering. Specifically, our key idea is to perform implicit clip-level clustering while aligning frames in sequences. This, coupled with our proposed dual augmentation technique, enhances the network's ability to learn generalizable and discriminative representations. Our experiments show that our method outperforms state-of-the-art methods, and they highlight the generalization capability of our framework across multiple activities and different modalities on three diverse datasets: H2O, PennAction, and IKEA ASM. We will release our code upon acceptance.