$β$-Multivariational Autoencoder for Entangled Representation Learning in Video Frames
This paper addresses posterior collapse when learning entangled representations for video-based object tracking, though it appears incremental because it builds on existing autoencoder and U-Net methods.
The paper tackles the challenge of learning a Multivariate Gaussian prior from video frames for single-object tracking. It proposes the β-Multivariational Autoencoder (βMVAE) and its U-Net variant (βMVUnet); the latter improves posterior estimation and segmentation on a test set after training on over 85k frames.
When learning a sequential decision-making process in which a set of actions is expected given the states and the previous reward, it is crucial to choose actions from an appropriate distribution. Yet if there are more than two latent variables and each pair of variables has its own covariance, learning a known prior from data becomes challenging, because many posterior estimation methods suffer posterior collapse when the data are large and diverse. In this paper, we propose the $β$-Multivariational Autoencoder ($β$MVAE) to learn a Multivariate Gaussian prior from video frames for use in single-object tracking formulated as a decision-making process. We present a novel formulation of object motion in videos, based on a set of dependent parameters, to address the single-object tracking task. The true values of the motion parameters are obtained through data analysis on the training set, and the population of parameters is then assumed to follow a Multivariate Gaussian distribution. The $β$MVAE is developed to learn this entangled prior $p = N(μ, Σ)$ directly from frame patches, with the object masks of those patches as the output. We devise a bottleneck to estimate the posterior's parameters, i.e., $μ'$ and $Σ'$. Via a new reparameterization trick, we learn the likelihood $p(\hat{x}|z)$ as the object mask of the input. Furthermore, we replace the neural network of the $β$MVAE with the U-Net architecture and name the new network the $β$-Multivariational U-Net ($β$MVUnet). Our networks are trained from scratch on over 85k video frames for 24 million ($β$MVUnet) and 78 million ($β$MVAE) steps. We show that $β$MVUnet improves both posterior estimation and segmentation performance on the test set. Our code and the trained networks are publicly released.
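As a rough illustration of the quantities named in the abstract, the sketch below (not the authors' released code) shows one standard way to sample from a full-covariance Gaussian posterior $N(μ', Σ')$ and to penalize its divergence from the prior $N(μ, Σ)$. It assumes the common Cholesky-based reparameterization and the usual $β$-weighted objective; the paper's own "new reparameterization trick" and loss may well differ. All function names are hypothetical, and plain Python lists stand in for tensors.

```python
import math
import random


def cholesky(a):
    """Return lower-triangular L with L L^T == a (a symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L y = b for lower-triangular L by forward substitution."""
    y = []
    for i, row in enumerate(L):
        y.append((b[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    return y


def reparameterize(mu, L, rng=random):
    """Draw z = mu + L eps with eps ~ N(0, I), so that z ~ N(mu, L L^T).

    Assumption: this is the textbook full-covariance reparameterization,
    used here only as a stand-in for the paper's own trick.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(L[i][k] * eps[k] for k in range(i + 1))
            for i, m in enumerate(mu)]


def kl_mvn(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL( N(mu_q, sigma_q) || N(mu_p, sigma_p) )."""
    k = len(mu_q)
    Lq, Lp = cholesky(sigma_q), cholesky(sigma_p)
    # tr(sigma_p^{-1} sigma_q) equals the squared Frobenius norm of Lp^{-1} Lq.
    trace = 0.0
    for j in range(k):
        col = forward_solve(Lp, [Lq[i][j] for i in range(k)])
        trace += sum(c * c for c in col)
    # Mahalanobis term (mu_p - mu_q)^T sigma_p^{-1} (mu_p - mu_q).
    d = forward_solve(Lp, [p - q for p, q in zip(mu_p, mu_q)])
    maha = sum(c * c for c in d)
    # log det(sigma) = 2 * sum(log(diag(L))).
    logdet_p = 2.0 * sum(math.log(Lp[i][i]) for i in range(k))
    logdet_q = 2.0 * sum(math.log(Lq[i][i]) for i in range(k))
    return 0.5 * (trace + maha - k + logdet_p - logdet_q)


def beta_objective(reconstruction_loss, kl, beta):
    """Assumed beta-VAE-style objective: reconstruction + beta * KL."""
    return reconstruction_loss + beta * kl
```

When the estimated posterior matches the prior exactly, `kl_mvn` returns zero and the objective reduces to the reconstruction (mask) loss. The full matrix $Σ$ is what separates this setting from the diagonal-covariance case commonly used in variational autoencoders.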