LG | AI | CL | Oct 6, 2020

SHERLock: Self-Supervised Hierarchical Event Representation Learning

arXiv:2010.02556v2
AI Analysis

The paper addresses the challenge of efficiently representing complex, long-horizon experiences in AI. The contribution appears incremental, since it builds on existing self-supervised and hierarchical learning approaches.

The paper tackles the problem of learning hierarchical temporal event representations from long-horizon visual demonstrations and associated textual descriptions, without explicit temporal supervision. Its representations align more closely with human-annotated events than state-of-the-art unsupervised baselines (+15.3), and its results are comparable to heavily supervised baselines on the Chess Openings, YouCook2 and TutorialVQA datasets.

Temporal event representations are an essential aspect of learning among humans. They allow for succinct encoding of the experiences we have through a variety of sensory inputs. Also, they are believed to be arranged hierarchically, allowing for an efficient representation of complex long-horizon experiences. Additionally, these representations are acquired in a self-supervised manner. Analogously, here we propose a model that learns temporal representations from long-horizon visual demonstration data and associated textual descriptions, without explicit temporal supervision. Our method produces a hierarchy of representations that align more closely with ground-truth human-annotated events (+15.3) than state-of-the-art unsupervised baselines. Our results are comparable to heavily-supervised baselines in complex visual domains such as Chess Openings, YouCook2 and TutorialVQA datasets. Finally, we perform ablation studies illustrating the robustness of our approach. We release our code and demo visualizations in the Supplementary Material.

Code Implementations: 1 repo