CV, LG, ML · Dec 14, 2016

Disentangling Space and Time in Video with Hierarchical Variational Auto-encoders

arXiv:1612.04440v2 · 22 citations
AI Analysis

This work addresses the challenge of learning separable representations from unsupervised video data. Its contribution is incremental: it builds on existing generative models to factor out temporal invariances.

The paper tackles the problem of disentangling static object identity from dynamic pose and style information in video data. It takes a probabilistic approach based on a hierarchical variational auto-encoder and reports improved performance in transfer learning tasks on datasets of moving characters and rotating 3D objects.

There are many forms of feature information present in video data. Principal among them are object identity information, which is largely static across multiple video frames, and object pose and style information, which continuously transforms from frame to frame. Most existing models confound these two types of representation by mapping them to a shared feature space. In this paper we propose a probabilistic approach for learning separable representations of object identity and pose information using unsupervised video data. Our approach leverages a deep generative model with a factored prior distribution that encodes properties of temporal invariances in the hidden feature set. Learning is achieved via variational inference. We present results of learning identity and pose information on a dataset of moving characters as well as a dataset of rotating 3D objects. Our experimental results demonstrate our model's success in factoring its representation and show that the model achieves improved performance in transfer learning tasks.
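To make the idea of a factored prior concrete, here is a minimal sketch of one way such a prior could be sampled. It is an illustration under stated assumptions, not the paper's model: the abstract does not give the exact form of the prior, so the Gaussian identity code shared across frames, the Gaussian random walk over pose, and every name and parameter below (`sample_factored_prior`, `id_dim`, `pose_dim`, `step_std`) are hypothetical.

```python
import random


def sample_factored_prior(num_frames, id_dim=4, pose_dim=2, step_std=0.1, seed=0):
    """Sample per-frame latents from an illustrative factored prior.

    Assumption (not from the paper): identity is one Gaussian draw shared
    by all frames; pose follows a Gaussian random walk between frames.
    """
    rng = random.Random(seed)

    # Identity latent: drawn once per clip, so it is invariant over time.
    z_id = [rng.gauss(0.0, 1.0) for _ in range(id_dim)]

    # Pose latent: initial draw, then small Gaussian steps frame to frame.
    pose = [rng.gauss(0.0, 1.0) for _ in range(pose_dim)]

    frames = []
    for _ in range(num_frames):
        frames.append((list(z_id), list(pose)))
        pose = [p + rng.gauss(0.0, step_std) for p in pose]
    return frames


frames = sample_factored_prior(num_frames=5)
for t, (z_id, z_pose) in enumerate(frames):
    print(t, [round(v, 2) for v in z_id], [round(v, 2) for v in z_pose])
```

In this sketch the identity vector is identical in every frame while the pose vector drifts slowly, which is the kind of temporal invariance a factored prior is meant to encode. A full model would pair such a prior with a decoder and an approximate posterior trained by variational inference.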
