CV · GR · May 15, 2019

Synthetic Defocus and Look-Ahead Autofocus for Casual Videography

arXiv:1905.06326v3 · 50 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge faced by casual videographers who want cinematic focus but lack the resources of a professional cinema setup. Its solution is novel but incremental, building on existing machine learning methods.

The paper tackles the problem of achieving cinematic shallow depth of field and accurate focus in casual videography. It presents a system that synthetically renders refocusable video from deep depth-of-field smartphone footage and chooses focus with look-ahead autofocus backed by AI-assist modules, so that focus can transition onto a speaker before that person begins to speak.

In cinema, large camera lenses create beautiful shallow depth of field (DOF), but make focusing difficult and expensive. Accurate cinema focus usually relies on a script and a person to control focus in realtime. Casual videographers often crave cinematic focus, but fail to achieve it. We either sacrifice shallow DOF, as in smartphone videos; or we struggle to deliver accurate focus, as in videos from larger cameras. This paper is about a new approach in the pursuit of cinematic focus for casual videography. We present a system that synthetically renders refocusable video from a deep DOF video shot with a smartphone, and analyzes future video frames to deliver context-aware autofocus for the current frame. To create refocusable video, we extend recent machine learning methods designed for still photography, contributing a new dataset for machine training, a rendering model better suited to cinema focus, and a filtering solution for temporal coherence. To choose focus accurately for each frame, we demonstrate autofocus that looks at upcoming video frames and applies AI-assist modules such as motion, face, audio and saliency detection. We also show that autofocus benefits from machine learning and a large-scale video dataset with focus annotation, where we use our RVR-LAAF GUI to create this sizable dataset efficiently. We deliver, for example, a shallow DOF video where the autofocus transitions onto each person before she begins to speak. This is impossible for conventional camera autofocus because it would require seeing into the future.
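The look-ahead idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function name `look_ahead_focus`, the `events` list (a focus depth in meters at each frame where a hypothetical detector, such as an audio or face module, reports that a subject should be in focus), and the linear ramp over `lead` frames are all assumptions made for illustration. The point it shows is that, because the video is refocused after capture, the focus pull can start before the triggering event and be settled on the frame where the event occurs.

```python
from typing import List, Optional


def look_ahead_focus(
    events: List[Optional[float]],
    initial_depth: float,
    lead: int,
) -> List[float]:
    """Return a focus depth per frame that arrives at each event on time.

    events[t] is the depth (hypothetical units: meters) that must be in
    focus at frame t, or None if no detector fired on that frame.
    lead is how many frames before an event the focus pull may begin.
    """
    if lead < 1:
        raise ValueError("lead must be at least 1 frame")

    # Keyframes are the frames where focus must already be settled.
    keys = [(t, d) for t, d in enumerate(events) if d is not None]

    focus: List[float] = []
    current = initial_depth  # depth of the most recently reached keyframe
    settled_at = 0           # frame at which `current` was reached
    k = 0                    # index of the next keyframe still in the future

    for t in range(len(events)):
        # Consume any keyframe that has been reached by frame t.
        while k < len(keys) and keys[k][0] <= t:
            settled_at, current = keys[k]
            k += 1

        if k < len(keys):
            t_next, d_next = keys[k]
            # Start the pull `lead` frames early, but never before the
            # previous keyframe, so the focus track stays continuous.
            start = max(t_next - lead, settled_at)
            alpha = max(0.0, (t - start) / (t_next - start))
            focus.append(current + alpha * (d_next - current))
        else:
            # No upcoming event: hold the current focus depth.
            focus.append(current)

    return focus


# A speaker at 2 m starts talking at frame 4; focus begins at 4 m.
track = look_ahead_focus(
    [None, None, None, None, 2.0, None, None, None],
    initial_depth=4.0,
    lead=4,
)
print(track)
```

A conventional autofocus sees only the current frame, so at best it could begin its pull at frame 4. Here the pull is already complete at frame 4, which is only possible because the focus decision is made offline with access to later frames.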

