Hashing Neural Video Decomposition with Multiplicative Residuals in Space-Time
This work addresses efficient and consistent video editing for creators, though the contribution is incremental because it builds on existing neural representation techniques.
The paper tackles video decomposition for layer-based editing with a neural model that separates a video into layers, each with a texture map, a mask, and a multiplicative residual that captures lighting variations. The model learns a 1080p video in 25s per frame and renders propagated edits at 71 fps on a single GPU.
We present a video decomposition method that facilitates layer-based editing of videos with spatiotemporally varying lighting and motion effects. Our neural model decomposes an input video into multiple layered representations, each comprising a 2D texture map, a mask for the original video, and a multiplicative residual characterizing the spatiotemporal variations in lighting conditions. A single edit on a texture map can be propagated to the corresponding locations in all video frames while keeping the rest of the content consistent. Our method efficiently learns the layer-based neural representations of a 1080p video in 25s per frame via coordinate hashing and allows real-time rendering of the edited result at 71 fps on a single GPU. Qualitatively, we run our method on various videos to show its effectiveness in generating high-quality editing effects. Quantitatively, we propose to adopt feature-tracking evaluation metrics for objectively assessing the consistency of video editing. Project page: https://lightbulb12294.github.io/hashing-nvd/
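As a rough illustration of how such a layered representation can be rendered, the sketch below composites one frame from per-layer texture maps, masks, and multiplicative residuals. It is a minimal NumPy sketch under stated assumptions, not the method's actual implementation: the names `sample_texture` and `render_frame` are hypothetical, the per-pixel texture coordinates (`uvs`) are assumed to be given, textures are sampled by nearest-neighbour lookup, and the masks are assumed to sum to 1 at every pixel.

```python
import numpy as np


def sample_texture(texture, uv):
    """Nearest-neighbour lookup of a (Ht, Wt, 3) texture at uv in [0, 1]^2.

    uv[..., 0] is the horizontal coordinate, uv[..., 1] the vertical one.
    """
    ht, wt = texture.shape[:2]
    x = np.clip(np.rint(uv[..., 0] * (wt - 1)).astype(int), 0, wt - 1)
    y = np.clip(np.rint(uv[..., 1] * (ht - 1)).astype(int), 0, ht - 1)
    return texture[y, x]


def render_frame(textures, uvs, masks, residuals):
    """Composite one frame from per-layer components (hypothetical helper).

    textures:  list of (Ht, Wt, 3) arrays, one editable texture per layer
    uvs:       list of (H, W, 2) arrays, texture coordinates per pixel
    masks:     list of (H, W) arrays, assumed to sum to 1 at each pixel
    residuals: list of (H, W, 3) arrays, multiplicative lighting factors
    """
    frame = np.zeros(uvs[0].shape[:2] + (3,))
    for tex, uv, mask, res in zip(textures, uvs, masks, residuals):
        # The residual scales the sampled texture colour, so a lighting
        # change stays attached to the pixel even after the texture is edited.
        frame += mask[..., None] * sample_texture(tex, uv) * res
    return frame
```

Under these assumptions, propagating an edit amounts to modifying one entry of `textures` and calling `render_frame` again for every frame. Because the lighting term enters as a multiplicative factor rather than being baked into the texture, the edited colours inherit the original shading.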