CV · Jan 17, 2018

RED-Net: A Recurrent Encoder-Decoder Network for Video-based Face Alignment

arXiv:1801.06066v1 · 23 citations
Originality: Highly original
AI Analysis

This paper addresses accurate and efficient face alignment in videos, for applications such as video analysis, with incremental improvements in method design.

The paper tackles real-time face alignment in videos by proposing a recurrent encoder-decoder network that combines spatial and temporal recurrent learning with feature disentangling, achieving higher accuracy than state-of-the-art methods on standard datasets.

We propose a novel method for real-time face alignment in videos based on a recurrent encoder-decoder network model. Our proposed model predicts 2D facial point heat maps regularized by both detection and regression loss, while uniquely exploiting recurrent learning at both spatial and temporal dimensions. At the spatial level, we add a feedback loop connection between the combined output response map and the input, in order to enable iterative coarse-to-fine face alignment using a single network model, instead of relying on traditional cascaded model ensembles. At the temporal level, we first decouple the features in the bottleneck of the network into temporal-variant factors, such as pose and expression, and temporal-invariant factors, such as identity information. Temporal recurrent learning is then applied to the decoupled temporal-variant features. We show that such feature disentangling yields better generalization and significantly more accurate results at test time. We perform a comprehensive experimental analysis, showing the importance of each component of our proposed model, as well as superior results over the state of the art and several variations of our method in standard datasets.
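The three mechanisms named in the abstract (heat map prediction, the spatial feedback loop, and temporal recurrence over disentangled features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the argmax decoding, the max-based combination of per-point maps, the plain tanh recurrence, and the `split` index are all assumptions made for illustration, and `network` stands in for any encoder-decoder that maps an image plus one feedback channel to K heat maps.

```python
import numpy as np

def heatmaps_to_points(heatmaps):
    """Decode (K, H, W) heat maps into K (x, y) points via per-map argmax.

    Argmax decoding is an assumption; the abstract only says the model
    predicts 2D facial point heat maps.
    """
    k, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(k, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1)

def spatial_recurrence(image, network, steps=2):
    """Run one network several times, feeding its combined response map
    back to the input (coarse-to-fine refinement with a single model).

    image:   (H, W, C) array
    network: hypothetical callable, (H, W, C + 1) -> (K, H, W) heat maps
    """
    h, w, _ = image.shape
    response = np.zeros((h, w, 1), dtype=image.dtype)  # empty map on first pass
    heatmaps = None
    for _ in range(steps):
        x = np.concatenate([image, response], axis=2)  # image + feedback channel
        heatmaps = network(x)
        # Combine per-point maps into one response map (max is an assumption).
        response = heatmaps.max(axis=0)[..., None]
    return heatmaps

def temporal_step(code, hidden, w_in, w_rec, split):
    """Split a bottleneck code into temporal-invariant and temporal-variant
    parts, and apply a recurrent update to the variant part only.

    A plain tanh RNN cell is used here as a stand-in recurrent unit.
    """
    invariant, variant = code[:split], code[split:]  # e.g. identity vs. pose/expression
    hidden = np.tanh(w_in @ variant + w_rec @ hidden)
    return invariant, hidden
```

In a per-frame loop, `temporal_step` would carry `hidden` from one frame to the next while the invariant part passes through unchanged, and `heatmaps_to_points` would be applied to the final output of `spatial_recurrence`.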

