CV · Dec 5, 2017

Learning to Forecast Videos of Human Activity with Multi-granularity Models and Adaptive Rendering

arXiv:1712.01955v1 · 8 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of generating realistic future video frames for complex human interactions. The advance is incremental because it builds on existing video forecasting techniques.

The paper tackles video forecasting of complex multi-person human activities. It proposes a hierarchical temporal model for pose prediction together with adaptive appearance rendering networks, and reports generated videos that are superior to those of state-of-the-art methods.

We propose an approach for forecasting video of complex human activity involving multiple people. Direct pixel-level prediction is too simple to handle the appearance variability in complex activities. Hence, we develop novel intermediate representations. An architecture combining a hierarchical temporal model for predicting human poses and encoder-decoder convolutional neural networks for rendering target appearances is proposed. Our hierarchical model captures interactions among people by adopting a dynamic group-based interaction mechanism. Next, our appearance rendering network encodes the targets' appearances by learning adaptive appearance filters using a fully convolutional network. Finally, these filters are placed in encoder-decoder neural networks to complete the rendering. We demonstrate that our model can generate videos that are superior to state-of-the-art methods, and can handle complex human activity scenarios in video forecasting.
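The idea of adaptive appearance filters can be illustrated with a toy sketch: instead of using fixed convolution kernels, a small generator predicts the kernel from a target's appearance, and that kernel is then applied during rendering. The code below is not the authors' network. It assumes a single channel, replaces the fully convolutional filter generator with a plain linear map, and all names (`predict_filter`, `conv2d_valid`, `weights`, `pose_map`) are hypothetical.

```python
# Toy illustration of input-dependent ("adaptive") filters.
# Assumption: a linear map stands in for the filter-generating network.

def predict_filter(appearance, weights, k=3):
    """Map an appearance vector to a k x k kernel (hypothetical generator)."""
    flat = [sum(w * a for w, a in zip(row, appearance)) for row in weights]
    return [flat[i * k:(i + 1) * k] for i in range(k)]

def conv2d_valid(image, kernel):
    """Apply the kernel with no padding (cross-correlation, as in most CNN code)."""
    k = len(kernel)
    height, width = len(image), len(image[0])
    out = []
    for y in range(height - k + 1):
        row = []
        for x in range(width - k + 1):
            row.append(sum(kernel[i][j] * image[y + i][x + j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out

# One weight row per kernel entry (9 entries), one column per appearance feature.
# Feature 0 switches on the centre tap only; feature 1 switches on all taps.
weights = [[0.0, 1.0] for _ in range(9)]
weights[4] = [1.0, 1.0]

# A stand-in for a predicted pose map (4 x 4, single channel).
pose_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 2.0, 0.0],
    [0.0, 3.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

# Two targets with different appearance codes get different kernels.
rendered_a = conv2d_valid(pose_map, predict_filter([1.0, 0.0], weights))
rendered_b = conv2d_valid(pose_map, predict_filter([0.0, 1.0], weights))
```

Here the same pose map is rendered differently for each target because the kernel depends on the appearance code: the first code yields an identity-like kernel, the second a summing kernel. In the approach described in the abstract, the filters are learned by a fully convolutional network and placed inside encoder-decoder networks, which this sketch does not attempt to reproduce.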
