Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport
This paper addresses procedure learning from videos, with applications such as robotics and activity recognition, but it appears incremental because it builds on prior optimal transport methods.
The paper tackles self-supervised procedure learning from unlabeled videos. It proposes a framework that combines fused Gromov-Wasserstein optimal transport with contrastive regularization to handle order variations and avoid degenerate solutions, and it reports better performance than prior work on egocentric and third-person benchmarks.
We study self-supervised procedure learning, which discovers key steps and their order from a set of unlabeled videos. Previous methods typically learn frame-to-frame correspondences between videos before determining key steps and their order. However, their performance often suffers from order variations, background/redundant frames, and repeated actions. To overcome these challenges, we propose a self-supervised framework that uses a fused Gromov-Wasserstein optimal transport formulation with a structural prior for frame-to-frame mapping. Optimizing only for this temporal alignment, however, may lead to degenerate solutions, where all frames are mapped to a small cluster in the embedding space and thus every video is assigned to just one key step. To address that issue, we integrate a contrastive regularization, which maps different frames to different points in the embedding space, avoiding trivial solutions. Finally, extensive experiments on egocentric and third-person benchmarks demonstrate our superior performance over prior works, including OPEL, which relies on a classical Kantorovich optimal transport formulation with an optimality prior.
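To make the fused Gromov-Wasserstein idea concrete, here is a minimal sketch that evaluates the fused objective for a given frame-to-frame coupling. It is not the paper's implementation. The weighting convention `(1 - alpha) * feature + alpha * structure`, the squared-difference structure loss, the normalized temporal distance used as the structure cost, and all function names are assumptions chosen for illustration. The paper's structural prior and contrastive regularizer are not modeled.

```python
def fgw_objective(C, Dx, Dy, T, alpha=0.5):
    """Fused Gromov-Wasserstein cost of a coupling T (illustrative convention).

    C  : n x m cost between frame features of video X and video Y
    Dx : n x n structure cost within video X
    Dy : m x m structure cost within video Y
    T  : n x m coupling (mass moved from frame i of X to frame j of Y)
    """
    n, m = len(C), len(C[0])
    # Feature (Wasserstein) term: how dissimilar matched frames are.
    feature = sum(C[i][j] * T[i][j] for i in range(n) for j in range(m))
    # Structure (Gromov-Wasserstein) term: how much pairwise relations
    # inside X disagree with those inside Y under the same coupling.
    structure = sum(
        (Dx[i][k] - Dy[j][l]) ** 2 * T[i][j] * T[k][l]
        for i in range(n) for j in range(m)
        for k in range(n) for l in range(m)
    )
    return (1 - alpha) * feature + alpha * structure


def temporal_structure(n):
    """Hypothetical structure cost: normalized temporal distance |i - k| / (n - 1)."""
    return [[abs(i - k) / (n - 1) for k in range(n)] for i in range(n)]


def permutation_coupling(perm):
    """Coupling that sends frame i to frame perm[i] with uniform mass 1/n."""
    n = len(perm)
    T = [[0.0] * n for _ in range(n)]
    for i, j in enumerate(perm):
        T[i][j] = 1.0 / n
    return T


# Toy example: two 3-frame videos whose frames match index-to-index.
n = 3
C = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
D = temporal_structure(n)

aligned = fgw_objective(C, D, D, permutation_coupling([0, 1, 2]))
shuffled = fgw_objective(C, D, D, permutation_coupling([1, 0, 2]))
print(aligned, shuffled)
```

In this toy setting the order-preserving coupling scores lower than the coupling that swaps the first two frames, because the swap pays both a feature penalty and a structure penalty. A practical solver would minimize this objective over couplings rather than compare two hand-picked ones.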