Orazio Gallo

CV
h-index29
27papers
3,586citations
Novelty61%
AI Score45

27 Papers

CVOct 15, 2024Code
nvTorchCam: An Open-source Library for Camera-Agnostic Differentiable Geometric Vision

Daniel Lichy, Hang Su, Abhishek Badki et al.

We introduce nvTorchCam, an open-source library under the Apache 2.0 license, designed to make deep learning algorithms camera model-independent. nvTorchCam abstracts critical camera operations such as projection and unprojection, allowing developers to implement algorithms once and apply them across diverse camera models--including pinhole, fisheye, and 360 equirectangular panoramas, which are commonly used in automotive and real estate capture applications. Built on PyTorch, nvTorchCam is fully differentiable and supports GPU acceleration and batching for efficient computation. Furthermore, deep learning models trained for one camera type can be directly transferred to other camera types without requiring additional modification. In this paper, we provide an overview of nvTorchCam, its functionality, and present various code examples and diagrams to demonstrate its usage. Source code and installation instructions can be found on the nvTorchCam GitHub page at https://github.com/NVlabs/nvTorchCam.

CVApr 16, 2021Code
Noise-Aware Video Saliency Prediction

Ekta Prashnani, Orazio Gallo, Joohwan Kim et al.

We tackle the problem of predicting saliency maps for videos of dynamic scenes. We note that the accuracy of the maps reconstructed from the gaze data of a fixed number of observers varies with the frame, as it depends on the content of the scene. This issue is particularly pressing when a limited number of observers are available. In such cases, directly minimizing the discrepancy between the predicted and measured saliency maps, as traditional deep-learning methods do, results in overfitting to the noisy data. We propose a noise-aware training (NAT) paradigm that quantifies and accounts for the uncertainty arising from frame-specific gaze data inaccuracy. We show that NAT is especially advantageous when limited training data is available, with experiments across different models, loss functions, and datasets. We also introduce a video game-based saliency dataset, with rich temporal semantics, and multiple gaze attractors per frame. The dataset and source code are available at https://github.com/NVlabs/NAT-saliency.

CVOct 23, 2018Code
A Fusion Approach for Multi-Frame Optical Flow Estimation

Zhile Ren, Orazio Gallo, Deqing Sun et al.

To date, top-performing optical flow estimation methods only take pairs of consecutive frames into account. While elegant and appealing, the idea of using more than two frames has not yet produced state-of-the-art results. We present a simple, yet effective fusion approach for multi-frame optical flow that benefits from longer-term temporal cues. Our method first warps the optical flow from previous frames to the current, thereby yielding multiple plausible estimates. It then fuses the complementary information carried by these estimates into a new optical flow field. At the time of writing, our method ranks first among published results in the MPI Sintel and KITTI 2015 benchmarks. Our models will be available on https://github.com/NVlabs/PWC-Net.

CVJan 17, 2025
FoundationStereo: Zero-Shot Stereo Matching

Bowen Wen, Matthew Trepte, Joseph Aribido et al.

Tremendous progress has been made in deep stereo matching to excel on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization - a hallmark of foundation models in other computer vision tasks - remains challenging for stereo matching. We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. To this end, we first construct a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism, followed by an automatic self-curation pipeline to remove ambiguous samples. We then design a number of network architecture components to enhance scalability, including a side-tuning feature backbone that adapts rich monocular priors from vision foundation models to mitigate the sim-to-real gap, and long-range context reasoning for effective cost volume filtering. Together, these components lead to strong robustness and accuracy across domains, establishing a new standard in zero-shot stereo depth estimation. Project page: https://nvlabs.github.io/FoundationStereo/

CVJan 17, 2025
Zero-Shot Monocular Scene Flow Estimation in the Wild

Yiqing Liang, Abhishek Badki, Hang Su et al.

Large models have shown generalization across datasets for many low-level vision tasks, like depth estimation, but no such general models exist for scene flow. Even though scene flow has wide potential use, it is not used in practice because current predictive models do not generalize well. We identify three key challenges and propose solutions for each. First, we create a method that jointly estimates geometry and motion for accurate prediction. Second, we alleviate scene flow data scarcity with a data recipe that affords us 1M annotated training samples across diverse synthetic scenes. Third, we evaluate different parameterizations for scene flow prediction and adopt a natural and effective parameterization. Our resulting model outperforms existing methods as well as baselines built on large-scale models in terms of 3D end-point error, and shows zero-shot generalization to the casually captured videos from DAVIS and the robotic manipulation scenes from RoboTAP. Overall, our approach makes scene flow prediction more practical in-the-wild.

CVFeb 18, 2025
L4P: Towards Unified Low-Level 4D Vision Perception

Abhishek Badki, Hang Su, Bowen Wen et al.

The spatio-temporal relationship between the pixels of a video carries critical information for low-level 4D perception tasks. A single model that reasons about it should be able to solve several such tasks well. Yet, most state-of-the-art methods rely on architectures specialized for the task at hand. We present L4P, a feedforward, general-purpose architecture that solves low-level 4D perception tasks in a unified framework. L4P leverages a pre-trained ViT-based video encoder and combines it with per-task heads that are lightweight and therefore do not require extensive training. Despite its general and feedforward formulation, our method is competitive with existing specialized methods on both dense tasks, such as depth or optical flow estimation, and sparse tasks, such as 2D/3D tracking. Moreover, it solves all tasks at once in a time comparable to that of single-task methods.

CVOct 3, 2025
Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing

Danial Samadi Vahdati, Tai Duc Nguyen, Ekta Prashnani et al.

AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim's likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios.

CVJan 24, 2024
FoVA-Depth: Field-of-View Agnostic Depth Estimation for Cross-Dataset Generalization

Daniel Lichy, Hang Su, Abhishek Badki et al.

Wide field-of-view (FoV) cameras efficiently capture large portions of the scene, which makes them attractive in multiple domains, such as automotive and robotics. For such applications, estimating depth from multiple images is a critical task, and therefore, a large amount of ground truth (GT) data is available. Unfortunately, most of the GT data is for pinhole cameras, making it impossible to properly train depth estimation models for large-FoV cameras. We propose the first method to train a stereo depth estimation model on the widely available pinhole data, and to generalize it to data captured with larger FoVs. Our intuition is simple: We warp the training data to a canonical, large-FoV representation and augment it to allow a single network to reason about diverse types of distortions that otherwise would prevent generalization. We show strong generalization ability of our approach on both indoor and outdoor datasets, which was not possible with previous methods.

CVMay 31, 2023
Zero-shot Pose Transfer for Unrigged Stylized 3D Characters

Jiashun Wang, Xueting Li, Sifei Liu et al.

Transferring the pose of a reference avatar to stylized 3D characters of various shapes is a fundamental task in computer graphics. Existing methods either require the stylized characters to be rigged, or they use the stylized character in the desired pose as ground truth at training. We present a zero-shot approach that requires only the widely available deformed non-stylized avatars in training, and deforms stylized characters of significantly different shapes at inference. Classical methods achieve strong generalization by deforming the mesh at the triangle level, but this requires labelled correspondences. We leverage the power of local deformation, but without requiring explicit correspondence labels. We introduce a semi-supervised shape-understanding module to bypass the need for explicit correspondences at test time, and an implicit pose deformation module that deforms individual surface points to match the target pose. Furthermore, to encourage realistic and accurate deformation of stylized characters, we introduce an efficient volume-based test-time training procedure. Because it does not need rigging, nor the deformed stylized character at training time, our model generalizes to categories with scarce annotation, such as stylized quadrupeds. Extensive experiments demonstrate the effectiveness of the proposed method compared to the state-of-the-art approaches trained with comparable or more supervision. Our project page is available at https://jiashunwang.github.io/ZPT

CVMay 5, 2023
Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos

Ekta Prashnani, Koki Nagano, Shalini De Mello et al.

Modern avatar generators allow anyone to synthesize photorealistic real-time talking avatars, ushering in a new era of avatar-based human communication, such as with immersive AR/VR interactions or videoconferencing with limited bandwidths. Their safe adoption, however, requires a mechanism to verify if the rendered avatar is trustworthy: does it use the appearance of an individual without their consent? We term this task avatar fingerprinting. To tackle it, we first introduce a large-scale dataset of real and synthetic videos of people interacting on a video call, where the synthetic videos are generated using the facial appearance of one person and the expressions of another. We verify the identity driving the expressions in a synthetic video, by learning motion signatures that are independent of the facial appearance shown. Our solution, the first in this space, achieves an average AUC of 0.85. Critical to its practical use, it also generalizes to new generators never seen in training (average AUC of 0.83). The proposed dataset and other resources can be found at: https://research.nvidia.com/labs/nxp/avatar-fingerprinting/.

CVDec 21, 2021
Watch It Move: Unsupervised Discovery of 3D Joints for Re-Posing of Articulated Objects

Atsuhiro Noguchi, Umar Iqbal, Jonathan Tremblay et al.

Rendering articulated objects while controlling their poses is critical to applications such as virtual reality or animation for movies. Manipulating the pose of an object, however, requires the understanding of its underlying structure, that is, its joints and how they interact with each other. Unfortunately, assuming the structure to be known, as existing methods do, precludes the ability to work on new object categories. We propose to learn both the appearance and the structure of previously unseen articulated objects by observing them move from multiple views, with no joints annotation supervision, or information about the structure. We observe that 3D points that are static relative to one another should belong to the same part, and that adjacent parts that move relative to each other must be connected by a joint. To leverage this insight, we model the object parts in 3D as ellipsoids, which allows us to identify joints. We combine this explicit representation with an implicit one that compensates for the approximation introduced. We show that our method works for different structures, from quadrupeds, to single-arm robots, to humans.

CVDec 15, 2021
Efficient Geometry-aware 3D Generative Adversarial Networks

Eric R. Chan, Connor Z. Lin, Matthew A. Chan et al.

Unsupervised generation of high-quality multi-view-consistent images and 3D shapes using only collections of single-view 2D photographs has been a long-standing challenge. Existing 3D GANs are either compute-intensive or make approximations that are not 3D-consistent; the former limits quality and resolution of the generated images and the latter adversely affects multi-view consistency and shape quality. In this work, we improve the computational efficiency and image quality of 3D GANs without overly relying on these approximations. We introduce an expressive hybrid explicit-implicit network architecture that, together with other design choices, synthesizes not only high-resolution multi-view-consistent images in real time but also produces high-quality 3D geometry. By decoupling feature generation and neural rendering, our framework is able to leverage state-of-the-art 2D CNN generators, such as StyleGAN2, and inherit their efficiency and expressiveness. We demonstrate state-of-the-art 3D-aware synthesis with FFHQ and AFHQ Cats, among other experiments.

CVMay 12, 2021
Neural Trajectory Fields for Dynamic Novel View Synthesis

Chaoyang Wang, Ben Eckart, Simon Lucey et al.

Recent approaches to render photorealistic views from a limited set of photographs have pushed the boundaries of our interactions with pictures of static scenes. The ability to recreate moments, that is, time-varying sequences, is perhaps an even more interesting scenario, but it remains largely unsolved. We introduce DCT-NeRF, a coordinatebased neural representation for dynamic scenes. DCTNeRF learns smooth and stable trajectories over the input sequence for each point in space. This allows us to enforce consistency between any two frames in the sequence, which results in high quality reconstruction, particularly in dynamic regions.

CVJan 12, 2021
Binary TTC: A Temporal Geofence for Autonomous Navigation

Abhishek Badki, Orazio Gallo, Jan Kautz et al.

Time-to-contact (TTC), the time for an object to collide with the observer's plane, is a powerful tool for path planning: it is potentially more informative than the depth, velocity, and acceleration of objects in the scene -- even for humans. TTC presents several advantages, including requiring only a monocular, uncalibrated camera. However, regressing TTC for each pixel is not straightforward, and most existing methods make over-simplifying assumptions about the scene. We address this challenge by estimating TTC via a series of simpler, binary classifications. We predict with low latency whether the observer will collide with an obstacle within a certain time, which is often more critical than knowing exact, per-pixel TTC. For such scenarios, our method offers a temporal geofence in 6.4 ms -- over 25x faster than existing methods. Our approach can also estimate per-pixel TTC with arbitrarily fine quantization (including continuous values), when the computational budget allows for it. To the best of our knowledge, our method is the first to offer TTC information (binary or coarsely quantized) at sufficiently high frame-rates for practical use.

CVAug 20, 2020
Generative View Synthesis: From Single-view Semantics to Novel-view Images

Tewodros Habtegebrial, Varun Jampani, Orazio Gallo et al.

Content creation, central to applications such as virtual reality, can be a tedious and time-consuming. Recent image synthesis methods simplify this task by offering tools to generate new views from as little as a single input image, or by converting a semantic map into a photorealistic image. We propose to push the envelope further, and introduce Generative View Synthesis (GVS), which can synthesize multiple photorealistic views of a scene given a single semantic map. We show that the sequential application of existing techniques, e.g., semantics-to-image translation followed by monocular view synthesis, fail at capturing the scene's structure. In contrast, we solve the semantics-to-image translation in concert with the estimation of the 3D layout of the scene, thus producing geometrically consistent novel views that preserve semantic structures. We first lift the input 2D semantic map onto a 3D layered representation of the scene in feature space, thereby preserving the semantic labels of 3D geometric structures. We then project the layered features onto the target views to generate the final novel-view images. We verify the strengths of our method and compare it with several advanced baselines on three different datasets. Our approach also allows for style manipulation and image editing operations, such as the addition or removal of objects, with simple manipulations of the input style images and semantic maps respectively. Visit the project page at https://gvsnet.github.io.

CVMay 14, 2020
Bi3D: Stereo Depth Estimation via Binary Classifications

Abhishek Badki, Alejandro Troccoli, Kihwan Kim et al.

Stereo-based depth estimation is a cornerstone of computer vision, with state-of-the-art methods delivering accurate results in real time. For several applications such as autonomous navigation, however, it may be useful to trade accuracy for lower latency. We present Bi3D, a method that estimates depth via a series of binary classifications. Rather than testing if objects are at a particular depth $D$, as existing stereo methods do, it classifies them as being closer or farther than $D$. This property offers a powerful mechanism to balance accuracy and latency. Given a strict time budget, Bi3D can detect objects closer than a given distance in as little as a few milliseconds, or estimate depth with arbitrarily coarse quantization, with complexity linear with the number of quantization levels. Bi3D can also use the allotted quantization levels to get continuous depth, but in a specific depth range. For standard stereo (i.e., continuous depth on the whole range), our method is close to or on par with state-of-the-art, finely tuned stereo methods.

CVApr 2, 2020
Novel View Synthesis of Dynamic Scenes with Globally Coherent Depths from a Monocular Camera

Jae Shin Yoon, Kihwan Kim, Orazio Gallo et al.

This paper presents a new method to synthesize an image from arbitrary views and times given a collection of images of a dynamic scene. A key challenge for the novel view synthesis arises from dynamic scene reconstruction where epipolar geometry does not apply to the local motion of dynamic contents. To address this challenge, we propose to combine the depth from single view (DSV) and the depth from multi-view stereo (DMV), where DSV is complete, i.e., a depth is assigned to every pixel, yet view-variant in its scale, while DMV is view-invariant yet incomplete. Our insight is that although its scale and quality are inconsistent with other views, the depth estimation from a single view can be used to reason about the globally coherent geometry of dynamic contents. We cast this problem as learning to correct the scale of DSV, and to refine each depth with locally consistent motions between views to form a coherent depth estimation. We integrate these tasks into a depth fusion network in a self-supervised fashion. Given the fused depth maps, we synthesize a photorealistic virtual view in a specific location and time with our deep blending network that completes the scene and renders the virtual view. We evaluate our method of depth estimation and view synthesis on diverse real-world dynamic scenes and show the outstanding performance over existing methods.

CVJan 6, 2020
Meshlet Priors for 3D Mesh Reconstruction

Abhishek Badki, Orazio Gallo, Jan Kautz et al.

Estimating a mesh from an unordered set of sparse, noisy 3D points is a challenging problem that requires carefully selected priors. Existing hand-crafted priors, such as smoothness regularizers, impose an undesirable trade-off between attenuating noise and preserving local detail. Recent deep-learning approaches produce impressive results by learning priors directly from the data. However, the priors are learned at the object level, which makes these algorithms class-specific and even sensitive to the pose of the object. We introduce meshlets, small patches of mesh that we use to learn local shape priors. Meshlets act as a dictionary of local features and thus allow to use learned priors to reconstruct object meshes in any pose and from unseen classes, even when the noise is large and the samples sparse.

CVJul 31, 2019
Video Stitching for Linear Camera Arrays

Wei-Sheng Lai, Orazio Gallo, Jinwei Gu et al.

Despite the long history of image and video stitching research, existing academic and commercial solutions still produce strong artifacts. In this work, we propose a wide-baseline video stitching algorithm for linear camera arrays that is temporally stable and tolerant to strong parallax. Our key insight is that stitching can be cast as a problem of learning a smooth spatial interpolation between the input videos. To solve this problem, inspired by pushbroom cameras, we introduce a fast pushbroom interpolation layer and propose a novel pushbroom stitching network, which learns a dense flow field to smoothly align the multiple input videos for spatial interpolation. Our approach outperforms the state-of-the-art by a significant margin, as we show with a user study, and has immediate applications in many areas such as virtual reality, immersive telepresence, autonomous driving, and video surveillance.

CVApr 10, 2019
Pixel-Adaptive Convolutional Neural Networks

Hang Su, Varun Jampani, Deqing Sun et al.

Convolutions are the fundamental building block of CNNs. The fact that their weights are spatially shared is one of the main reasons for their widespread use, but it also is a major limitation, as it makes convolutions content agnostic. We propose a pixel-adaptive convolution (PAC) operation, a simple yet effective modification of standard convolutions, in which the filter weights are multiplied with a spatially-varying kernel that depends on learnable, local pixel features. PAC is a generalization of several popular filtering techniques and thus can be used for a wide range of use cases. Specifically, we demonstrate state-of-the-art performance when PAC is used for deep joint image upsampling. PAC also offers an effective alternative to fully-connected CRF (Full-CRF), called PAC-CRF, which performs competitively, while being considerably faster. In addition, we also demonstrate that PAC can be used as a drop-in replacement for convolution layers in pre-trained networks, resulting in consistent performance improvements.

CVDec 12, 2018
Extreme View Synthesis

Inchang Choi, Orazio Gallo, Alejandro Troccoli et al.

We present Extreme View Synthesis, a solution for novel view extrapolation that works even when the number of input images is small--as few as two. In this context, occlusions and depth uncertainty are two of the most pressing issues, and worsen as the degree of extrapolation increases. We follow the traditional paradigm of performing depth-based warping and refinement, with a few key improvements. First, we estimate a depth probability volume, rather than just a single depth value for each pixel of the novel view. This allows us to leverage depth uncertainty in challenging regions, such as depth discontinuities. After using it to get an initial estimate of the novel view, we explicitly combine learned image priors and the depth uncertainty to synthesize a refined image with less artifacts. Our method is the first to show visually pleasing results for baseline magnifications of up to 30X.

CVJul 26, 2018
Tackling 3D ToF Artifacts Through Learning and the FLAT Dataset

Qi Guo, Iuri Frosio, Orazio Gallo et al.

Scene motion, multiple reflections, and sensor noise introduce artifacts in the depth reconstruction performed by time-of-flight cameras. We propose a two-stage, deep-learning approach to address all of these sources of artifacts simultaneously. We also introduce FLAT, a synthetic dataset of 2000 ToF measurements that capture all of these nonidealities, and allows to simulate different camera hardware. Using the Kinect 2 camera as a baseline, we show improved reconstruction errors over state-of-the-art methods, on both simulated and real data.

CVJan 16, 2018
Reblur2Deblur: Deblurring Videos via Self-Supervised Learning

Huaijin Chen, Jinwei Gu, Orazio Gallo et al.

Motion blur is a fundamental problem in computer vision as it impacts image quality and hinders inference. Traditional deblurring algorithms leverage the physics of the image formation model and use hand-crafted priors: they usually produce results that better reflect the underlying scene, but present artifacts. Recent learning-based methods implicitly extract the distribution of natural images directly from the data and use it to synthesize plausible images. Their results are impressive, but they are not always faithful to the content of the latent image. We present an approach that bridges the two. Our method fine-tunes existing deblurring neural networks in a self-supervised fashion by enforcing that the output, when blurred based on the optical flow between subsequent frames, matches the input blurry image. We show that our method significantly improves the performance of existing methods on several datasets both visually and in terms of image quality metrics. The supplementary material is https://goo.gl/nYPjEQ

CVDec 6, 2017
Separating Reflection and Transmission Images in the Wild

Patrick Wieschollek, Orazio Gallo, Jinwei Gu et al.

The reflections caused by common semi-reflectors, such as glass windows, can impact the performance of computer vision algorithms. State-of-the-art methods can remove reflections on synthetic data and in controlled scenarios. However, they are based on strong assumptions and do not generalize well to real-world images. Contrary to a common misconception, real-world images are challenging even when polarization information is used. We present a deep learning approach to separate the reflected and the transmitted components of the recorded irradiance, which explicitly uses the polarization properties of light. To train it, we introduce an accurate synthetic data generation pipeline, which simulates realistic reflections, including those generated by curved and non-ideal surfaces, non-static scenes, and high-dynamic-range scenes.

CVDec 3, 2016
Deep Learning with Energy-efficient Binary Gradient Cameras

Suren Jayasuriya, Orazio Gallo, Jinwei Gu et al.

Power consumption is a critical factor for the deployment of embedded computer vision systems. We explore the use of computational cameras that directly output binary gradient images to reduce the portion of the power consumption allocated to image sensing. We survey the accuracy of binary gradient cameras on a number of computer vision tasks using deep learning. These include object recognition, head pose regression, face detection, and gesture recognition. We show that, for certain applications, accuracy can be on par or even better than what can be achieved on traditional images. We are also the first to recover intensity information from binary spatial gradient images--useful for applications with a human observer in the loop, such as surveillance. Our results, which we validate with a prototype binary gradient camera, point to the potential of gradient-based computer vision systems.

CVNov 28, 2015
Loss Functions for Neural Networks for Image Processing

Hang Zhao, Orazio Gallo, Iuri Frosio et al.

Neural networks are becoming central in several areas of computer vision and image processing and different architectures have been proposed to solve specific problems. The impact of the loss layer of neural networks, however, has not received much attention in the context of image processing: the default and virtually only choice is L2. In this paper, we bring attention to alternative choices for image restoration. In particular, we show the importance of perceptually-motivated losses when the resulting image is to be evaluated by a human observer. We compare the performance of several losses, and propose a novel, differentiable error function. We show that the quality of the results improves significantly with better loss functions, even when the network architecture is left unchanged.

CVApr 7, 2015
Locally Non-rigid Registration for Mobile HDR Photography

Orazio Gallo, Alejandro Troccoli, Jun Hu et al.

Image registration for stack-based HDR photography is challenging. If not properly accounted for, camera motion and scene changes result in artifacts in the composite image. Unfortunately, existing methods to address this problem are either accurate, but too slow for mobile devices, or fast, but prone to failing. We propose a method that fills this void: our approach is extremely fast---under 700ms on a commercial tablet for a pair of 5MP images---and prevents the artifacts that arise from insufficient registration quality.