Floors are Flat: Leveraging Semantics for Real-Time Surface Normal Prediction
This work addresses the need for real-time surface normal prediction in computer vision applications, but it is incremental as it builds on existing methods with specific optimizations.
The paper tackles the problem of predicting surface normals and semantic labels from single RGB images by proposing four insights to improve deep learning model performance, resulting in consistently improved results on several datasets with a model running at 12 fps on a standard mobile phone.
We propose 4 insights that help to significantly improve the performance of deep learning models that predict surface normals and semantic labels from a single RGB image. These insights are: (1) denoise the "ground truth" surface normals in the training set to ensure consistency with the semantic labels; (2) concurrently train on a mix of real and synthetic data, instead of pretraining on synthetic and finetuning on real; (3) jointly predict normals and semantics using a shared model, but only backpropagate errors on pixels that have valid training labels; (4) slim down the model and use grayscale instead of color inputs. Despite the simplicity of these steps, we demonstrate consistently improved results on several datasets, using a model that runs at 12 fps on a standard mobile phone.