Future Video Synthesis with Object Motion Prediction
This work addresses video synthesis for applications like autonomous driving or surveillance by improving prediction quality, though it appears incremental as it builds on existing decoupling approaches.
The paper predicts future video frames by decoupling the background from moving objects: the background is modeled with non-rigid deformation and each moving object with an affine transformation. This reduces tearing and distortion artifacts, and the method outperforms state-of-the-art methods on the Cityscapes and KITTI datasets in visual quality and accuracy.
We present an approach to predict future video frames given a sequence of consecutive past video frames. Instead of synthesizing images directly, our approach is designed to understand the complex scene dynamics by decoupling the background scene and moving objects. The future appearance of the scene components is predicted by non-rigid deformation of the background and affine transformation of moving objects. The anticipated appearances are combined to create a plausible future video. With this procedure, our method exhibits far fewer tearing or distortion artifacts than other approaches. Experimental results on the Cityscapes and KITTI datasets show that our model outperforms the state of the art in terms of visual quality and accuracy.
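The decoupling idea can be illustrated with a minimal NumPy sketch. It is not the authors' model: the function names `affine_warp` and `composite_frame` are hypothetical, the affine parameters are supplied by hand rather than predicted, the background is left unchanged instead of being non-rigidly deformed, and images are single-channel with nearest-neighbour sampling. It only shows how an object layer moved by an affine transformation might be pasted back over a background layer.

```python
import numpy as np


def affine_warp(image, matrix, offset):
    """Warp a 2-D array with the forward map p_out = matrix @ p_in + offset.

    Coordinates are (row, col). Uses inverse mapping with nearest-neighbour
    sampling; output pixels whose source falls outside the image stay zero.
    """
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    out_coords = np.stack([rows.ravel(), cols.ravel()]).astype(float)  # 2 x N
    shift = np.asarray(offset, dtype=float)[:, None]
    # For every output pixel, find the input pixel it comes from.
    in_coords = np.linalg.inv(np.asarray(matrix, dtype=float)) @ (out_coords - shift)
    r = np.rint(in_coords[0]).astype(int)
    c = np.rint(in_coords[1]).astype(int)
    valid = (r >= 0) & (r < h) & (c >= 0) & (c < w)
    flat = np.zeros(h * w, dtype=image.dtype)
    flat[valid] = image[r[valid], c[valid]]
    return flat.reshape(h, w)


def composite_frame(background, obj, obj_mask, matrix, offset):
    """Move the object layer with an affine transform and paste it on the background."""
    moved_obj = affine_warp(obj, matrix, offset)
    moved_mask = affine_warp(obj_mask, matrix, offset)
    return np.where(moved_mask, moved_obj, background)


# Toy scene: a flat background and a one-pixel "object" at (row=1, col=1).
background = np.ones((5, 5), dtype=np.uint8)
obj = np.zeros((5, 5), dtype=np.uint8)
obj[1, 1] = 9
obj_mask = obj > 0

# Assumed motion: pure translation by 2 rows and 1 column (identity linear part).
future = composite_frame(background, obj, obj_mask, np.eye(2), (2, 1))
```

Because the object is moved as a whole by a single transformation, its pixels stay together, which is the intuition behind the reduced tearing the abstract describes.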