IV Aug 1, 2022
Fast Two-step Blind Optical Aberration Correction
Thomas Eboli, Jean-Michel Morel, Gabriele Facciolo
The optics of any camera degrades the sharpness of photographs, which is a key visual quality criterion. This degradation is characterized by the point-spread function (PSF), which depends on the wavelengths of light and is variable across the imaging field. In this paper, we propose a two-step scheme to correct optical aberrations in a single raw or JPEG image, i.e., without any prior information on the camera or lens. First, we estimate local Gaussian blur kernels for overlapping patches and sharpen them with a non-blind deblurring technique. Based on the measurements of the PSFs of dozens of lenses, these blur kernels are modeled as RGB Gaussians defined by seven parameters. Second, we remove the remaining lateral chromatic aberrations (not contemplated in the first step) with a convolutional neural network, trained to minimize the red/green and blue/green residual images. Experiments on both synthetic and real images show that the combination of these two stages yields a fast state-of-the-art blind optical aberration compensation technique that competes with commercial non-blind algorithms.
CV Jul 4, 2023 Code
IAdet: Simplest human-in-the-loop object detection
Franco Marchesoni-Acland, Gabriele Facciolo
This work proposes a strategy for training models while annotating data named Intelligent Annotation (IA). IA involves three modules: (1) assisted data annotation, (2) background model training, and (3) active selection of the next datapoints. Under this framework, we open-source the IAdet tool, which is specific to single-class object detection. Additionally, we devise a method for automatically evaluating such a human-in-the-loop system. For the PASCAL VOC dataset, the IAdet tool reduces the database annotation time by 25% while providing a trained model for free. These results are obtained for a deliberately very simple IAdet design. As a consequence, IAdet is susceptible to multiple easy improvements, paving the way for powerful human-in-the-loop object detection systems.
32.5 CV Jun 1
Generalization Limits in Vehicle Re-Identification
Anis Yassine Ben Mabrouk, Antoine Tadros, Rafael Grompone von Gioi et al.
Vehicle re-identification focuses on retrieving images of the same vehicle from a gallery given a query image. Upon closer inspection of commonly used datasets, we observe that vehicles with few visual differences (e.g., the same make, model, and color) appear in both the training and test sets. As a result, methods that effectively memorize the training data tend to perform well on these test sets but struggle to generalize to other datasets. In this paper, we address this issue by proposing a novel evaluation approach that more effectively measures generalization capability to unseen vehicle types. To further study generalization performance, we also propose splitting the evaluation based on view, allowing us to differentiate the effect of viewpoint robustness from that of same-view re-identification. Our findings reveal that most state-of-the-art methods struggle with unseen vehicle types, and that their robustness to viewpoint changes and attention to detail are limited to vehicle types seen during training.
CV Mar 16, 2022
Sat-NeRF: Learning Multi-View Satellite Photogrammetry With Transient Objects and Shadow Modeling Using RPC Cameras
Roger Marí, Gabriele Facciolo, Thibaud Ehret
We introduce the Satellite Neural Radiance Field (Sat-NeRF), a new end-to-end model for learning multi-view satellite photogrammetry in the wild. Sat-NeRF combines some of the latest trends in neural rendering with native satellite camera models, represented by rational polynomial coefficient (RPC) functions. The proposed method renders new views and infers surface models of similar quality to those obtained with traditional state-of-the-art stereo pipelines. Multi-date images exhibit significant changes in appearance, mainly due to varying shadows and transient objects (cars, vegetation). Robustness to these challenges is achieved by a shadow-aware irradiance model and uncertainty weighting to deal with transient phenomena that cannot be explained by the position of the sun. We evaluate Sat-NeRF using WorldView-3 images from different locations and stress the advantages of applying a bundle adjustment to the satellite camera models prior to training. This boosts the network performance and can optionally be used to extract additional cues for depth supervision.
CV May 4, 2022
Self-Supervised Super-Resolution for Multi-Exposure Push-Frame Satellites
Ngoc Long Nguyen, Jérémy Anger, Axel Davy et al.
Modern Earth observation satellites capture multi-exposure bursts of push-frame images that can be super-resolved via computational means. In this work, we propose a super-resolution method for such multi-exposure sequences, a problem that has received very little attention in the literature. The proposed method can handle the signal-dependent noise in the inputs, process sequences of any length, and be robust to inaccuracies in the exposure times. Furthermore, it can be trained end-to-end with self-supervision, without requiring ground truth high resolution frames, which makes it especially suited to handle real data. Central to our method are three key contributions: i) a base-detail decomposition for handling errors in the exposure times, ii) a noise-level-aware feature encoding for improved fusion of frames with varying signal-to-noise ratio and iii) a permutation invariant fusion strategy by temporal pooling operators. We evaluate the proposed method on synthetic and real data and show that it outperforms by a significant margin existing single-exposure approaches that we adapted to the multi-exposure case.
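To make the third contribution above concrete, here is a minimal sketch of a permutation-invariant fusion by temporal pooling. It is not the authors' network: the function name `temporal_pool`, the choice of mean and max as pooling operators, and the plain-list feature vectors are all assumptions made for illustration.

```python
import math


def temporal_pool(frame_features):
    """Fuse per-frame feature vectors into one fixed-size vector.

    Concatenates the element-wise mean and max over the time axis, so the
    result depends neither on the order nor on the number of frames.
    (Hypothetical helper; the pooling operators are an assumption.)
    """
    if not frame_features:
        raise ValueError("need at least one frame")
    n = len(frame_features)
    # Transpose: one tuple per feature channel, holding that channel over time.
    columns = list(zip(*frame_features))
    # math.fsum is exactly rounded, so the mean is independent of frame order.
    mean = [math.fsum(col) / n for col in columns]
    peak = [max(col) for col in columns]
    return mean + peak


burst = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
fused = temporal_pool(burst)
shuffled = temporal_pool(burst[::-1])
```

Because the pooled vector has the same size whatever the burst length, the same downstream decoder could in principle process sequences of any length, which matches the property claimed in the abstract.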
IV Mar 10, 2023
Handheld Burst Super-Resolution Meets Multi-Exposure Satellite Imagery
Jamy Lafenetre, Ngoc Long Nguyen, Gabriele Facciolo et al.
Image resolution is an important criterion for many applications based on satellite imagery. In this work, we adapt a state-of-the-art kernel regression technique for smartphone camera burst super-resolution to satellites. This technique leverages the local structure of the image to optimally steer the fusion kernels, limiting blur in the final high-resolution prediction, denoising the image, and recovering details up to a zoom factor of 2. We extend this approach to the multi-exposure case to predict a high-resolution, noise-free image from a sequence of multi-exposure low-resolution frames. Experiments on both single and multi-exposure scenarios show the merits of the approach. Since the fusion is learning-free, the proposed method is guaranteed not to hallucinate details, which is crucial for many remote sensing applications.
IV Apr 25, 2022
Self-supervision versus synthetic datasets: which is the lesser evil in the context of video denoising?
Valéry Dewil, Arnaud Barral, Gabriele Facciolo et al.
Supervised training has led to state-of-the-art results in image and video denoising. However, its application to real data is limited since it requires large datasets of noisy-clean pairs that are difficult to obtain. For this reason, networks are often trained on realistic synthetic data. More recently, some self-supervised frameworks have been proposed for training such denoising networks directly on the noisy data without requiring ground truth. On synthetic denoising problems, supervised training outperforms self-supervised approaches; however, in recent years the gap has narrowed, especially for video. In this paper, we propose a study aiming to determine which is the best approach to train denoising networks for real raw videos: supervision on synthetic realistic data or self-supervision on real data. A complete study with quantitative results in the case of natural videos with real motion is impossible since no dataset with clean-noisy pairs exists. We address this issue by considering three independent experiments in which we compare the two frameworks. We found that self-supervision on the real data outperforms supervision on synthetic data, and that in normal illumination conditions the drop in performance is due to the synthetic ground truth generation, not the noise model.
33.0 LG May 26
SPHERE-JEPA: Spherical Prediction with Homogeneous Embeddings
Léo Nicollier, Max Dunitz, Marc Pic et al.
A fundamental open question in self-supervised learning (SSL) is the explicit characterization of the optimal geometry of the learned representations. Recently, LeJEPA identified isotropic Gaussian embeddings as optimal for minimizing downstream prediction risk in Euclidean spaces. However, the corresponding problem for distributions supported on lower-dimensional manifolds, such as the hypersphere, remains unexplored. In this work, we demonstrate that extending this minimax analysis to smooth distributions on Riemannian manifolds fundamentally changes the optimal solution. We show that, under a worst-case formulation, both k-nearest neighbors and kernel ridge regression induce hyperspherical uniformity. More precisely, we show that uniform distributions on manifolds are optimal for k-nearest neighbors, and that the uniform distribution on the sphere is optimal for kernel ridge regression with both the exponential dot-product kernel and the linear kernel. This theoretical insight reveals a fundamental limitation of Gaussian embeddings: their non-uniform density induces anisotropic k-NN neighborhoods, severely biasing the estimator. To correct this, we introduce SPHERE-JEPA, a theoretically grounded SSL framework. We adapt LeJEPA's Cramér-Wold projection mechanism to enforce hyperspherical uniformity rather than a Gaussian prior. Empirically, SPHERE-JEPA yields significant improvements, boosting texture retrieval mAP by over 6%, while consistently matching or outperforming LeJEPA on standard benchmarks, including a +1.8% linear probing gain on ImageNet-1K (ViT-B/14).
CV Apr 14, 2023
L1BSR: Exploiting Detector Overlap for Self-Supervised Single-Image Super-Resolution of Sentinel-2 L1B Imagery
Ngoc Long Nguyen, Jérémy Anger, Axel Davy et al.
High-resolution satellite imagery is a key element for many Earth monitoring applications. Satellites such as Sentinel-2 feature characteristics, such as aliasing and band misalignment, that are favorable for super-resolution algorithms. Unfortunately, the lack of reliable high-resolution (HR) ground truth limits the application of deep learning methods to this task. In this work, we propose L1BSR, a deep learning-based method for single-image super-resolution and band alignment of Sentinel-2 L1B 10m bands. The method is trained with self-supervision directly on real L1B data by leveraging overlapping areas in L1B images produced by adjacent CMOS detectors, thus not requiring HR ground truth. Our self-supervised loss is designed to enforce the super-resolved output image to have all the bands correctly aligned. This is achieved via a novel cross-spectral registration network (CSR) which computes an optical flow between images of different spectral bands. The CSR network is also trained with self-supervision using an Anchor-Consistency loss, which we also introduce in this work. We demonstrate the performance of the proposed approach on synthetic and real L1B data, where we show that it obtains comparable results to supervised methods.
IV Aug 13, 2024
How to Best Combine Demosaicing and Denoising?
Yu Guo, Qiyu Jin, Jean-Michel Morel et al.
Image demosaicing and denoising play a critical role in the raw imaging pipeline. These processes have often been treated as independent, without considering their interactions. Indeed, most classic denoising methods handle noisy RGB images, not raw images. Conversely, most demosaicing methods address the demosaicing of noise-free images. The real problem is to jointly denoise and demosaic noisy raw images, but how best to proceed remains unclear. In this paper, we carry out extensive experiments and a mathematical analysis to tackle this problem with low-complexity algorithms. So far, the two problems have only been addressed jointly by heavyweight end-to-end convolutional neural networks (CNNs), which are currently incompatible with low-power portable imaging devices and remain by nature domain (or device) dependent. Our study leads us to conclude that, with moderate noise, demosaicing should be applied first, followed by denoising. This requires a simple adaptation of classic denoising algorithms to demosaiced noise, which we justify and specify. Although our main conclusion is "demosaic first, then denoise", we also discover that for high noise, there is a moderate PSNR gain from a more complex strategy: partial CFA denoising followed by demosaicing, and then a second denoising on the RGB image. These surprising results are obtained by a black-box optimization of the pipeline, which could be applied to any other pipeline. We validate our results on simulated and real noisy CFA images obtained from several benchmarks.
CV Nov 18, 2022
Improving Pixel-Level Contrastive Learning by Leveraging Exogenous Depth Information
Ahmed Ben Saad, Kristina Prokopetc, Josselin Kherroubi et al.
Self-supervised representation learning based on Contrastive Learning (CL) has been the subject of much attention in recent years. This is due to the excellent results obtained on a variety of subsequent tasks (in particular classification), without requiring a large amount of labeled samples. However, most reference CL algorithms (such as SimCLR and MoCo, but also BYOL and Barlow Twins) are not adapted to pixel-level downstream tasks. One existing solution known as PixPro proposes a pixel-level approach that is based on filtering of pairs of positive/negative image crops of the same image using the distance between the crops in the whole image. We argue that this idea can be further enhanced by incorporating semantic information provided by exogenous data as an additional selection filter, which can be used (at training time) to improve the selection of the pixel-level positive/negative samples. In this paper, we focus on depth information, which can be obtained by using a depth estimation network or measured from available data (stereovision, parallax motion, LiDAR, etc.). Scene depth can provide meaningful cues to distinguish pixels belonging to different objects based on their depth. We show that using this exogenous information in the contrastive loss leads to improved results and that the learned representations better follow the shapes of objects. In addition, we introduce a multi-scale loss that alleviates the issue of finding the training parameters adapted to different object sizes. We demonstrate the effectiveness of our ideas on Breakout Segmentation on Borehole Images, where we achieve an improvement of 1.9% over PixPro and nearly 5% over the supervised baseline. We further validate our technique on indoor scene segmentation with ScanNet and outdoor scenes with CityScapes (1.6% and 1.1% improvement over PixPro, respectively).
CV Nov 20, 2023
On the Importance of Large Objects in CNN Based Object Detection Algorithms
Ahmed Ben Saad, Gabriele Facciolo, Axel Davy
Object detection models, a prominent class of machine learning algorithms, aim to identify and precisely locate objects in images or videos. However, performance on this task can be uneven, sometimes because of object sizes and the quality of the images and labels used for training. In this paper, we highlight the importance of large objects in learning features that are critical for all sizes. Given these findings, we propose to introduce a weighting term into the training loss. This term is a function of the object area. We show that giving more weight to large objects leads to improved detection scores across all object sizes and thus to an overall improvement in object detector performance (+2 p.p. of mAP on small objects, +2 p.p. on medium and +4 p.p. on large on COCO val 2017 with InternImage-T). Additional experiments and ablation studies with different models and on a different dataset further confirm the robustness of our findings.
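As an illustration of an area-dependent weighting term, the sketch below weights each object's loss by a slowly growing function of its box area. The abstract only states that the term is a function of the object area, so the logarithmic form, the normalization by the sum of weights, and the names `area_weight` and `weighted_detection_loss` are assumptions, not the authors' formulation.

```python
import math


def area_weight(width, height):
    """Hypothetical weight that grows logarithmically with the box area."""
    return math.log(1.0 + width * height)


def weighted_detection_loss(per_object_losses, boxes):
    """Weighted mean of per-object losses.

    `boxes` holds one (width, height) pair per object, in pixels.
    Larger objects contribute more to the total (assumed design choice).
    """
    weights = [area_weight(w, h) for w, h in boxes]
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(wt * loss for wt, loss in zip(weights, per_object_losses)) / total


# The large object has the higher loss, so the weighted loss is pulled
# above the plain mean of 2.0.
loss = weighted_detection_loss([1.0, 3.0], [(10, 10), (100, 100)])
```

A logarithm is used here only so that very large boxes do not dominate the batch; any increasing function of the area would fit the description in the abstract.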
CV Feb 22, 2023
On The Role of Alias and Band-Shift for Sentinel-2 Super-Resolution
Ngoc Long Nguyen, Jérémy Anger, Lara Raad et al.
In this work, we study the problem of single-image super-resolution (SISR) of Sentinel-2 imagery. We show that, thanks to its unique sensor specification, namely the inter-band shift and alias, deep-learning methods are able to recover fine details. By training a model using a simple $L_1$ loss, results are free of hallucinated details. For this study, we build a dataset of Sentinel-2/PlanetScope image pairs to train and evaluate our super-resolution (SR) model.
CV Jun 29, 2022
Regularization of NeRFs using differential geometry
Thibaud Ehret, Roger Marí, Gabriele Facciolo
Neural radiance fields, or NeRF, represent a breakthrough in the field of novel view synthesis and 3D modeling of complex scenes from multi-view image collections. Numerous recent works have shown the importance of making NeRF models more robust, by means of regularization, in order to train with possibly inconsistent and/or very sparse data. In this work, we explore how differential geometry can provide elegant regularization tools for robustly training NeRF-like models, which are modified so as to represent continuous and infinitely differentiable functions. In particular, we present a generic framework for regularizing different types of NeRF observations to improve the performance in challenging conditions. We also show how the same formalism can be used to natively encourage the regularity of surfaces by means of Gaussian or mean curvatures.
CV Aug 22, 2024
Adapting MIMO video restoration networks to low latency constraints
Valéry Dewil, Zhe Zheng, Arnaud Barral et al.
MIMO (multiple input, multiple output) approaches are a recent trend in neural network architectures for video restoration problems, where each network evaluation produces multiple output frames. The video is split into non-overlapping stacks of frames that are processed independently, resulting in a very appealing trade-off between output quality and computational cost. In this work we focus on the low-latency setting by limiting the number of available future frames. We find that MIMO architectures suffer from problems that have received little attention so far, namely (1) the performance drops significantly due to the reduced temporal receptive field, particularly for frames at the borders of the stack, (2) there are strong temporal discontinuities at stack transitions which induce a step-wise motion artifact. We propose two simple solutions to alleviate these problems: recurrence across MIMO stacks to boost the output quality by implicitly increasing the temporal receptive field, and overlapping of the output stacks to smooth the temporal discontinuity at stack transitions. These modifications can be applied to any MIMO architecture. We test them on three state-of-the-art video denoising networks with different computational cost. The proposed contributions result in a new state-of-the-art for low-latency networks, both in terms of reconstruction error and temporal consistency. As an additional contribution, we introduce a new benchmark consisting of drone footage that highlights temporal consistency issues that are not apparent in the standard benchmarks.
LG Jul 4, 2023
Optimal and Efficient Binary Questioning for Human-in-the-Loop Annotation
Franco Marchesoni-Acland, Jean-Michel Morel, Josselin Kherroubi et al.
Even though data annotation is extremely important for interpretability, research and development of artificial intelligence solutions, most research efforts such as active learning or few-shot learning focus on the sample efficiency problem. This paper studies the neglected complementary problem of getting annotated data given a predictor. For the simple binary classification setting, we present the spectrum ranging from optimal general solutions to practical efficient methods. The problem is framed as the full annotation of a binary classification dataset with the minimal number of yes/no questions when a predictor is available. For the case of general binary questions, the solution is found in coding theory, where the optimal questioning strategy is given by the Huffman encoding of the possible labelings. However, this approach is computationally intractable even for small dataset sizes. We propose an alternative practical solution based on several heuristics and lookahead minimization of proxy cost functions. The proposed solution is analysed, compared with optimal solutions and evaluated on several synthetic and real-world datasets. On these datasets, the method allows a significant improvement (23-86%) in annotation efficiency.
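The link to coding theory can be illustrated with a small sketch: each possible labeling of the dataset is a symbol, its probability comes from the predictor, and the Huffman codeword length is the number of yes/no questions needed to identify it. The helper names and the assumption that the predictor's per-sample probabilities are independent are illustrative choices, not details taken from the paper.

```python
import heapq
from itertools import count


def labeling_probs(p_positive):
    """Probability of every labeling, assuming independent per-sample predictions."""
    probs = [1.0]
    for p in p_positive:
        probs = [q * w for q in probs for w in (1.0 - p, p)]
    return probs  # 2**n entries: this is what makes the approach intractable


def huffman_code_lengths(probs):
    """Return the Huffman codeword length (number of questions) per outcome."""
    if len(probs) == 1:
        return [0]
    tie = count()  # tie-breaker so the heap never compares the index lists
    heap = [(p, next(tie), [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        for i in group1 + group2:
            lengths[i] += 1  # every merge adds one question for these outcomes
        heapq.heappush(heap, (p1 + p2, next(tie), group1 + group2))
    return lengths


def expected_questions(probs):
    return sum(p * n for p, n in zip(probs, huffman_code_lengths(probs)))


# Two samples, each predicted positive with probability 0.9 (made-up values).
probs = labeling_probs([0.9, 0.9])
avg = expected_questions(probs)
```

With a confident predictor the expected number of questions falls below one per sample, while the table of labelings doubles with every added sample, which is consistent with the intractability noted in the abstract.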
CV Jul 9, 2023
Reducing False Alarms in Video Surveillance by Deep Feature Statistical Modeling
Xavier Bou, Aitor Artola, Thibaud Ehret et al.
Detecting relevant changes is a fundamental problem of video surveillance. Because of the high variability of data and the difficulty of properly annotating changes, unsupervised methods dominate the field. Arguably one of the most critical issues to make them practical is to reduce their false alarm rate. In this work, we develop a method-agnostic weakly supervised a-contrario validation process, based on high dimensional statistical modeling of deep features, to reduce the number of false alarms of any change detection algorithm. We also point out the insufficiency of the conventionally used pixel-wise evaluation, as it fails to precisely capture the performance needs of most real applications. For this reason, we complement pixel-wise metrics with object-wise metrics and evaluate the impact of our approach at both pixel and object levels, on six methods and several sequences from different datasets. Experimental results reveal that the proposed a-contrario validation is able to largely reduce the number of false alarms at both pixel and object levels.
16.0 CV Mar 23
Deep S2P: Integrating Learning Based Stereo Matching Into the Satellite Stereo Pipeline
Elías Masquil, Thibaud Ehret, Pablo Musé et al.
Digital Surface Model generation from satellite imagery is a core task in Earth observation and is commonly addressed using classical stereoscopic matching algorithms in satellite pipelines as in the Satellite Stereo Pipeline (S2P). While recent learning-based stereo matchers achieve state-of-the-art performance on standard benchmarks, their integration into operational satellite pipelines remains challenging due to differences in viewing geometry and disparity assumptions. In this work, we integrate several modern learning-based stereo matchers, including StereoAnywhere, MonSter, Foundation Stereo, and a satellite fine-tuned variant of MonSter, into the Satellite Stereo Pipeline, adapting the rectification stage to enforce compatible disparity polarity and range. We release the corresponding code to enable reproducible use of these methods in large-scale Earth observation workflows. Experiments on satellite imagery show consistent improvements over classical cost-volume-based approaches in terms of Digital Surface Model accuracy, although commonly used metrics such as mean absolute error exhibit saturation effects. Qualitative results reveal substantially improved geometric detail and sharper structures, highlighting the need for evaluation strategies that better reflect perceptual and structural fidelity. At the same time, performance over challenging surface types such as vegetation remains limited across all evaluated models, indicating open challenges for learning-based stereo in natural environments.
CV Jan 30
Diachronic Stereo Matching for Multi-Date Satellite Imagery
Elías Masquil, Luca Savant Aira, Roger Marí et al.
Recent advances in image-based satellite 3D reconstruction have progressed along two complementary directions. On one hand, multi-date approaches using NeRF or Gaussian-splatting jointly model appearance and geometry across many acquisitions, achieving accurate reconstructions on opportunistic imagery with numerous observations. On the other hand, classical stereoscopic reconstruction pipelines deliver robust and scalable results for simultaneous or quasi-simultaneous image pairs. However, when the two images are captured months apart, strong seasonal, illumination, and shadow changes violate standard stereoscopic assumptions, causing existing pipelines to fail. This work presents the first Diachronic Stereo Matching method for satellite imagery, enabling reliable 3D reconstruction from temporally distant pairs. Two advances make this possible: (1) fine-tuning a state-of-the-art deep stereo network that leverages monocular depth priors, and (2) exposing it to a dataset specifically curated to include a diverse set of diachronic image pairs. In particular, we start from a pretrained MonSter model, trained initially on a mix of synthetic and real datasets such as SceneFlow and KITTI, and fine-tune it on a set of stereo pairs derived from the DFC2019 remote sensing challenge. This dataset contains both synchronic and diachronic pairs under diverse seasonal and illumination conditions. Experiments on multi-date WorldView-3 imagery demonstrate that our approach consistently surpasses classical pipelines and unadapted deep stereo models on both synchronic and diachronic settings. Fine-tuning on temporally diverse images, together with monocular priors, proves essential for enabling 3D reconstruction from previously incompatible acquisition dates.
Figure 1. Output geometry for a winter-autumn image pair from Omaha (OMA 331 test scene). Panels: left image (winter), right image (autumn), and DSM geometry for ours (1.23 m), zero-shot (3.99 m), and LiDAR GT. Our method recovers accurate geometry despite the diachronic nature of the pair, exhibiting strong appearance changes, which cause existing zero-shot methods to fail. Missing values due to perspective shown in black. Mean altitude error in parentheses; lower is better.
CV Feb 17
An Industrial Dataset for Scene Acquisitions and Functional Schematics Alignment
Flavien Armangeon, Thibaud Ehret, Enric Meinhardt-Llopis et al.
Aligning functional schematics with 2D and 3D scene acquisitions is crucial for building digital twins, especially for old industrial facilities that lack native digital models. Current manual alignment using images and LiDAR data does not scale because the task is tedious and industrial sites are complex. Inconsistencies between schematics and reality, and the scarcity of public industrial datasets, make the problem both challenging and underexplored. This paper introduces IRIS-v2, a comprehensive dataset to support further research. It includes images, point clouds, 2D annotated boxes and segmentation masks, a CAD model, 3D pipe routing information, and the P&ID (Piping and Instrumentation Diagram). The alignment is tested on a practical case study, aiming at reducing the time required for this task by combining segmentation and graph matching.
CV Jan 5
Remote Sensing Change Detection via Weak Temporal Supervision
Xavier Bou, Elliot Vincent, Gabriele Facciolo et al.
Semantic change detection in remote sensing aims to identify land cover changes between bi-temporal image pairs. Progress in this area has been limited by the scarcity of annotated datasets, as pixel-level annotation is costly and time-consuming. To address this, recent methods leverage synthetic data or generate artificial change pairs, but out-of-domain generalization remains limited. In this work, we introduce a weak temporal supervision strategy that leverages additional temporal observations of existing single-temporal datasets, without requiring any new annotations. Specifically, we extend single-date remote sensing datasets with new observations acquired at different times and train a change detection model by assuming that real bi-temporal pairs mostly contain no change, while pairing images from different locations to generate change examples. To handle the inherent noise in these weak labels, we employ an object-aware change map generation and an iterative refinement process. We validate our approach on extended versions of the FLAIR and IAILD aerial datasets, achieving strong zero-shot and low-data regime performance across different benchmarks. Lastly, we showcase results over large areas in France, highlighting the scalability potential of our method.
66.4 CV Mar 25
Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection
Adhemar de Senneville, Xavier Bou, Jérémy Anger et al.
Current Large Vision Language Models (LVLMs) excel at many zero-shot tasks like image captioning, visual question answering and OCR. However, these same models suffer from poor performance at image classification tasks, underperforming against CLIP-based methods. Notably, this gap is surprising because many LVLMs use CLIP-pretrained vision encoders. Yet LVLMs are not inherently limited by CLIP's architecture with independent vision and text encoders. In CLIP, this separation biases classification toward class-name matching rather than joint visual-text reasoning. In this paper we show that, despite their poor raw performance, LVLMs can improve visual feature class separability at inference using prompt conditioning, and LVLMs' internal representations, especially attention heads, can outperform the model itself at zero-shot and few-shot classification. We introduce Head Ensemble Classifiers (HEC) to bridge the performance gap between CLIP-based and LVLM-based classification methods. Inspired by Gaussian Discriminant Analysis, HEC ranks the most discriminative vision and text heads and combines them into a training-free classifier. We show that HEC achieves state-of-the-art performance in few-shot and zero-shot classification across 12 datasets.
36.1 IV Mar 25
Comparative analysis of dual-form networks for live land monitoring using multi-modal satellite image time series
Iris Dumeur, Jérémy Anger, Gabriele Facciolo
Multi-modal Satellite Image Time Series (SITS) analysis faces significant computational challenges for live land monitoring applications. While Transformer architectures excel at capturing temporal dependencies and fusing multi-modal data, their quadratic computational complexity and the need to reprocess entire sequences for each new acquisition limit their deployment for regular, large-area monitoring. This paper studies various dual-form attention mechanisms for efficient multi-modal SITS analysis that enable parallel training while supporting recurrent inference for incremental processing. We compare linear attention and retention mechanisms within a multi-modal spectro-temporal encoder. To address SITS-specific challenges of temporal irregularity and misalignment, we develop temporal adaptations of dual-form mechanisms that compute token distances based on actual acquisition dates rather than sequence indices. Our approach is evaluated on two tasks using Sentinel-1 and Sentinel-2 data: multi-modal SITS forecasting as a proxy task, and real-world solar panel construction monitoring. Experimental results demonstrate that dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. The multimodal framework consistently outperforms mono-modal approaches across both tasks, demonstrating the effectiveness of dual mechanisms for sensor fusion. The results presented in this work open new opportunities for operational land monitoring systems requiring regular updates over large geographic areas.
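The "dual form" idea can be sketched with a retention-style mechanism whose decay depends on acquisition dates instead of sequence indices. The code below is an illustrative assumption, not the paper's encoder: the exponential decay `gamma ** (date difference)`, the absence of normalization, and the function names are all hypothetical. It shows that the all-pairs form (used for training) and the recurrent form (used for incremental inference) give the same outputs.

```python
def parallel_form(q, k, v, dates, gamma=0.9):
    """All-pairs form: each output looks at every past token explicitly."""
    d_v = len(v[0])
    outputs = []
    for t in range(len(q)):
        out = [0.0] * d_v
        for s in range(t + 1):
            # Decay driven by the real time gap, not by the index gap t - s.
            decay = gamma ** (dates[t] - dates[s])
            score = decay * sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(d_v):
                out[j] += score * v[s][j]
        outputs.append(out)
    return outputs


def recurrent_form(q, k, v, dates, gamma=0.9):
    """Recurrent form: a fixed-size state is updated once per new acquisition."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs, previous_date = [], None
    for q_t, k_t, v_t, date in zip(q, k, v, dates):
        decay = 1.0 if previous_date is None else gamma ** (date - previous_date)
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = decay * state[i][j] + k_t[i] * v_t[j]
        outputs.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
        previous_date = date
    return outputs


# Irregularly spaced acquisition dates in days (made-up values).
dates = [0, 5, 6, 30]
q = [[1.0, 0.5], [0.2, -1.0], [0.3, 0.3], [-0.7, 0.1]]
k = [[0.4, 0.1], [1.0, 0.0], [-0.5, 0.8], [0.6, 0.6]]
v = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [2.0, -1.0, 0.5], [1.5, 1.5, 1.5]]
out_parallel = parallel_form(q, k, v, dates)
out_recurrent = recurrent_form(q, k, v, dates)
```

In the recurrent form, a new acquisition only requires one state update, so earlier images never need to be reprocessed, which is the property the abstract highlights for live monitoring.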
CV Dec 2, 2025
Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone
Tristan Amadei, Enric Meinhardt-Llopis, Benedicte Bascle et al.
Image-based localization in GNSS-denied environments is critical for UAV autonomy. Existing state-of-the-art approaches rely on matching UAV images to geo-referenced satellite images; however, they typically require large-scale, paired UAV-satellite datasets for training. Such data are costly to acquire and often unavailable, limiting their applicability. To address this challenge, we adopt a training paradigm that removes the need for UAV imagery during training by learning directly from satellite-view reference images. This is achieved through a dedicated augmentation strategy that simulates the visual domain shift between satellite and real-world UAV views. We introduce CAEVL, an efficient model designed to exploit this paradigm, and validate it on ViLD, a new and challenging dataset of real-world UAV images that we release to the community. Our method achieves competitive performance compared to approaches trained with paired data, demonstrating its effectiveness and strong generalization capabilities.
CVNov 15, 2024Code
Structure Tensor Representation for Robust Oriented Object DetectionXavier Bou, Gabriele Facciolo, Rafael Grompone von Gioi et al.
Oriented object detection predicts orientation in addition to object location and bounding box. Precisely predicting orientation remains challenging due to angular periodicity, which introduces boundary discontinuity issues and symmetry ambiguities. Inspired by classical works on edge and corner detection, this paper proposes to represent orientation in oriented bounding boxes as a structure tensor. This representation combines the strengths of Gaussian-based methods and angle-coder solutions, providing a simple yet efficient approach that is robust to angular periodicity issues without additional hyperparameters. Extensive evaluations across five datasets demonstrate that the proposed structure tensor representation outperforms previous methods in both fully-supervised and weakly supervised tasks, achieving high precision in angular prediction with minimal computational overhead. Thus, this work establishes structure tensors as a robust and modular alternative for encoding orientation in oriented object detection. We make our code publicly available, allowing for seamless integration into existing object detectors.
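To make the periodicity argument above concrete, here is a minimal sketch of one plausible way to encode an oriented box as a symmetric 2x2 tensor and decode it back. The exact parameterization (eigenvalues taken directly as width and height, with `w >= h`) is an assumption for illustration and may differ from the paper's.

```python
import math


def encode(w, h, theta):
    """Encode (w, h, theta) as the entries (a, b, d) of the tensor [[a, b], [b, d]]."""
    c, s = math.cos(theta), math.sin(theta)
    return (w * c * c + h * s * s, (w - h) * c * s, w * s * s + h * c * c)


def decode(a, b, d):
    """Recover (w, h, theta) assuming w >= h; theta lies in (-pi/2, pi/2]."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    return mean + radius, mean - radius, theta


# theta and theta + pi describe the same box and give the same tensor.
t1 = encode(4.0, 2.0, 0.3)
t2 = encode(4.0, 2.0, 0.3 + math.pi)

# Angles on either side of the +/- pi/2 boundary give nearby tensors,
# so a regression target built this way has no jump at the boundary.
t3 = encode(4.0, 2.0, 1.56)
t4 = encode(4.0, 2.0, -1.56)
```

When `w == h` the tensor is isotropic and the decoded angle is arbitrary, which matches the symmetry ambiguity of a square box rather than forcing the network to pick one of several equivalent angles.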
IVMar 6, 2024
Fast, nonlocal and neural: a lightweight high quality solution to image denoisingYu Guo, Axel Davy, Gabriele Facciolo et al.
With the widespread application of convolutional neural networks (CNNs), the traditional model based denoising algorithms are now outperformed. However, CNNs face two problems. First, they are computationally demanding, which makes their deployment especially difficult for mobile terminals. Second, experimental evidence shows that CNNs often over-smooth regular textures present in images, in contrast to traditional non-local models. In this letter, we propose a solution to both issues by combining a nonlocal algorithm with a lightweight residual CNN. This solution gives full latitude to the advantages of both models. We apply this framework to two GPU implementations of classic nonlocal algorithms (NLM and BM3D) and observe a substantial gain in both cases, performing better than the state-of-the-art with low computational requirements. Our solution is between 10 and 20 times faster than CNNs with equivalent performance and attains higher PSNR. In addition the final method shows a notable gain on images containing complex textures like the ones of the MIT Moire dataset.
CVDec 20, 2023
Radar Fields: An Extension of Radiance Fields to SARThibaud Ehret, Roger Marí, Dawa Derksen et al.
Radiance fields have been a major breakthrough in the field of inverse rendering, novel view synthesis and 3D modeling of complex scenes from multi-view image collections. Since their introduction, it was shown that they could be extended to other modalities such as LiDAR, radio frequencies, X-ray or ultrasound. In this paper, we show that, despite the important difference between optical and synthetic aperture radar (SAR) image formation models, it is possible to extend radiance fields to radar images thus presenting the first "radar fields". This allows us to learn surface models using only collections of radar images, similar to how regular radiance fields are learned and with the same computational complexity on average. Thanks to similarities in how both fields are defined, this work also shows a potential for hybrid methods combining both optical and SAR images.
CVMar 8, 2024
Exploring Robust Features for Few-Shot Object Detection in Satellite ImageryXavier Bou, Gabriele Facciolo, Rafael Grompone von Gioi et al.
The goal of this paper is to perform object detection in satellite imagery with only a few examples, thus enabling users to specify any object class with minimal annotation. To this end, we explore recent methods and ideas from open-vocabulary detection for the remote sensing domain. We develop a few-shot object detector based on a traditional two-stage architecture, where the classification block is replaced by a prototype-based classifier. A large-scale pre-trained model is used to build class-reference embeddings or prototypes, which are compared to region proposal contents for label prediction. In addition, we propose to fine-tune prototypes on available training images to boost performance and learn differences between similar classes, such as aircraft types. We perform extensive evaluations on two remote sensing datasets containing challenging and rare objects. Moreover, we study the performance of both visual and image-text features, namely DINOv2 and CLIP, including two CLIP models specifically tailored for remote sensing applications. Results indicate that visual features are largely superior to vision-language models, as the latter lack the necessary domain-specific vocabulary. Lastly, the developed detector outperforms fully supervised and few-shot methods evaluated on the SIMD and DIOR datasets, despite minimal training parameters.
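The prototype-based classification step described above can be sketched as follows. This is a simplified illustration that assumes prototypes are normalized means of a few example embeddings and that labels are assigned by cosine similarity; the helper names and the toy 3-dimensional embeddings are hypothetical, and real features would come from a pre-trained backbone such as DINOv2 or CLIP.

```python
import math


def normalize(x):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in x))
    return [a / norm for a in x]


def build_prototype(embeddings):
    """Average the few-shot example embeddings of one class, then normalize."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return normalize(mean)


def classify(region_embedding, prototypes):
    """Return the best class and the cosine similarity to every prototype."""
    r = normalize(region_embedding)
    scores = {name: sum(a * b for a, b in zip(r, p))
              for name, p in prototypes.items()}
    return max(scores, key=scores.get), scores


# Toy embeddings standing in for backbone features of annotated examples.
prototypes = {
    "aircraft": build_prototype([[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]),
    "ship": build_prototype([[0.0, 1.0, 0.2], [0.1, 0.9, 0.1]]),
}
label, scores = classify([0.8, 0.3, 0.0], prototypes)
```

Fine-tuning the prototypes, as the abstract proposes, would amount to treating these vectors as trainable parameters instead of fixed averages.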
CVJul 11, 2025
Normalized vs Diplomatic Annotation: A Case Study of Automatic Information Extraction from Handwritten Uruguayan Birth CertificatesNatalia Bottaioli, Solène Tarride, Jérémy Anger et al.
This study evaluates the recently proposed Document Attention Network (DAN) for extracting key-value information from Uruguayan birth certificates, handwritten in Spanish. We investigate two annotation strategies for automatically transcribing handwritten documents, fine-tuning DAN with minimal training data and annotation effort. Experiments were conducted on two datasets containing the same images (201 scans of birth certificates written by more than 15 different writers) but with different annotation methods. Our findings indicate that normalized annotation is more effective for fields that can be standardized, such as dates and places of birth, whereas diplomatic annotation performs much better for fields containing names and surnames, which cannot be standardized.
CVDec 17, 2024
Gaussian Splatting for Efficient Satellite Image PhotogrammetryLuca Savant Aira, Gabriele Facciolo, Thibaud Ehret
Recently, Gaussian splatting has emerged as a strong alternative to NeRF, demonstrating impressive 3D modeling capabilities while requiring only a fraction of the training and rendering time. In this paper, we show how the standard Gaussian splatting framework can be adapted for remote sensing, retaining its high efficiency. This enables us to achieve state-of-the-art performance in just a few minutes, compared to the day-long optimization required by the best-performing NeRF-based Earth observation methods. The proposed framework incorporates remote-sensing improvements from EO-NeRF, such as radiometric correction and shadow modeling, while introducing novel components, including sparsity, view consistency, and opacity regularizations.
CVApr 9, 2025
S-EO: A Large-Scale Dataset for Geometry-Aware Shadow Detection in Remote Sensing ApplicationsElías Masquil, Roger Marí, Thibaud Ehret et al.
We introduce the S-EO dataset: a large-scale, high-resolution dataset, designed to advance geometry-aware shadow detection. Collected from diverse public-domain sources, including challenge datasets and government providers such as USGS, our dataset comprises 702 georeferenced tiles across the USA, each covering 500x500 m. Each tile includes multi-date, multi-angle WorldView-3 pansharpened RGB images, panchromatic images, and a ground-truth DSM of the area obtained from LiDAR scans. For each image, we provide a shadow mask derived from geometry and sun position, a vegetation mask based on the NDVI index, and a bundle-adjusted RPC model. With approximately 20,000 images, the S-EO dataset establishes a new public resource for shadow detection in remote sensing imagery and its applications to 3D reconstruction. To demonstrate the dataset's impact, we train and evaluate a shadow detector, showcasing its ability to generalize, even to aerial images. Finally, we extend EO-NeRF - a state-of-the-art NeRF approach for satellite imagery - to leverage our shadow predictions for improved 3D reconstructions.
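As a small illustration of the vegetation mask mentioned above, the NDVI is conventionally computed per pixel as (NIR - Red) / (NIR + Red). The sketch below assumes reflectance-like band values and a threshold of 0.3; the threshold and the function names are assumptions for illustration, not values taken from the dataset.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel."""
    denom = nir + red
    return 0.0 if denom == 0 else (nir - red) / denom


def vegetation_mask(nir_band, red_band, threshold=0.3):
    """Boolean mask that is True where NDVI exceeds the (assumed) threshold."""
    return [[ndvi(n, r) > threshold for n, r in zip(nir_row, red_row)]
            for nir_row, red_row in zip(nir_band, red_band)]


# Toy 2x2 bands: only the top-left pixel has a strong NIR response.
nir = [[0.5, 0.2], [0.0, 0.4]]
red = [[0.1, 0.2], [0.0, 0.3]]
mask = vegetation_mask(nir, red)
```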
CVNov 20, 2025
EOGS++: Earth Observation Gaussian Splatting with Internal Camera Refinement and Direct Panchromatic RenderingPierrick Bournez, Luca Savant Aira, Thibaud Ehret et al.
Recently, 3D Gaussian Splatting has been introduced as a compelling alternative to NeRF for Earth observation, offering competitive reconstruction quality with significantly reduced training times. In this work, we extend the Earth Observation Gaussian Splatting (EOGS) framework to propose EOGS++, a novel method tailored for satellite imagery that directly operates on raw high-resolution panchromatic data without requiring external preprocessing. Furthermore, leveraging optical flow techniques we embed bundle adjustment directly within the training process, avoiding reliance on external optimization tools while improving camera pose estimation. We also introduce several improvements to the original implementation, including early stopping and TSDF post-processing, all contributing to sharper reconstructions and better geometric accuracy. Experiments on the IARPA 2016 and DFC2019 datasets demonstrate that EOGS++ achieves state-of-the-art performance in terms of reconstruction quality and efficiency, outperforming the original EOGS method and other NeRF-based methods while maintaining the computational advantages of Gaussian Splatting. Our model demonstrates an improvement from 1.33 to 1.19 mean MAE errors on buildings compared to the original EOGS models.
CVJul 24, 2025
Towards Large Scale Geostatistical Methane Monitoring with Part-based Object DetectionAdhemar de Senneville, Xavier Bou, Thibaud Ehret et al.
Object detection is one of the main applications of computer vision in remote sensing imagery. Despite its increasing availability, the sheer volume of remote sensing data poses a challenge when detecting rare objects across large geographic areas. Paradoxically, this common challenge is crucial to many applications, such as estimating environmental impact of certain human activities at scale. In this paper, we propose to address the problem by investigating the methane production and emissions of bio-digesters in France. We first introduce a novel dataset containing bio-digesters, with small training and validation sets, and a large test set with a high imbalance towards observations without objects since such sites are rare. We develop a part-based method that considers essential bio-digester sub-elements to boost initial detections. To this end, we apply our method to new, unseen regions to build an inventory of bio-digesters. We then compute geostatistical estimates of the quantity of methane produced that can be attributed to these infrastructures in a given area at a given time.
CVApr 9, 2024
Leveraging edge detection and neural networks for better UAV localizationTheo Di Piazza, Enric Meinhardt-Llopis, Gabriele Facciolo et al.
We propose a novel method for geolocalizing Unmanned Aerial Vehicles (UAVs) in environments lacking Global Navigation Satellite Systems (GNSS). Current state-of-the-art techniques employ an offline-trained encoder to generate a vector representation (embedding) of the UAV's current view, which is then compared with pre-computed embeddings of geo-referenced images to determine the UAV's position. Here, we demonstrate that the performance of these methods can be significantly enhanced by preprocessing the images to extract their edges, which exhibit robustness to seasonal and illumination variations. Furthermore, we establish that utilizing edges enhances resilience to orientation and altitude inaccuracies. Additionally, we introduce a confidence criterion for localization. Our findings are substantiated through synthetic experiments.
CVMay 25, 2023
Collaborative Blind Image DeblurringThomas Eboli, Jean-Michel Morel, Gabriele Facciolo
Blurry images usually exhibit similar blur at various locations across the image domain, a property barely captured in today's blind deblurring neural networks. We show that when extracting patches of similar underlying blur is possible, jointly processing the stack of patches yields higher accuracy than handling them separately. Our collaborative scheme is implemented in a neural architecture with a pooling layer on the stack dimension. We present three practical patch extraction strategies for image sharpening, camera shake removal and optical aberration correction, and validate the proposed approach on both synthetic and real-world benchmarks. For each blur instance, the proposed collaborative strategy yields significant quantitative and qualitative improvements.
CVMar 1, 2021
Automatic Stockpile Volume Monitoring using Multi-view Stereo from SkySat ImageryRoger Marí, Carlo de Franchis, Enric Meinhardt-Llopis et al.
This paper proposes a system for automatic surface volume monitoring from time series of SkySat pushframe imagery. A specific challenge of building and comparing large 3D models from SkySat data is to correct inconsistencies between the camera models associated to the multiple views that are necessary to cover the area at a given time, where these camera models are represented as Rational Polynomial Cameras (RPCs). We address the problem by proposing a date-wise RPC refinement, able to handle dynamic areas covered by sets of partially overlapping views. The cameras are refined by means of a rotation that compensates for errors due to inaccurate knowledge of the satellite attitude. The refined RPCs are then used to reconstruct multiple consistent Digital Surface Models (DSMs) from different stereo pairs at each date. RPC refinement strengthens the consistency between the DSMs of each date, which is extremely beneficial to accurately measure volumes in the 3D surface models. The system is tested in a real case scenario, to monitor large coal stockpiles. Our volume estimates are validated with measurements collected on site in the same period of time.
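Once consistent DSMs are available, the volume measurement itself reduces to integrating heights above a reference surface. The sketch below is a simplified illustration that assumes a regular grid, a known base surface, and a square cell size in meters; the function name and the toy values are hypothetical and not taken from the paper.

```python
def stockpile_volume(dsm, base, cell_size):
    """Sum the positive height above the base surface, times the area of one cell."""
    cell_area = cell_size * cell_size
    total_height = sum(max(z - b, 0.0)
                       for dsm_row, base_row in zip(dsm, base)
                       for z, b in zip(dsm_row, base_row))
    return total_height * cell_area


# Toy 2x2 DSM (elevations in meters) over a flat base at 10 m, with 2 m cells.
dsm = [[10.0, 12.0], [11.0, 10.0]]
base = [[10.0, 10.0], [10.0, 10.0]]
volume = stockpile_volume(dsm, base, cell_size=2.0)  # (2 + 1) m * 4 m^2 = 12 m^3
```

Because the result is a sum over all cells, a small vertical bias between the DSMs of a given date accumulates over the whole area, which is why the RPC refinement described above matters for volume accuracy.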
IVFeb 26, 2021
Robust Rational Polynomial Camera Modelling for SAR and Pushbroom ImagingRoland Akiki, Roger Marí, Carlo de Franchis et al.
The Rational Polynomial Camera (RPC) model can be used to describe a variety of image acquisition systems in remote sensing, notably optical and Synthetic Aperture Radar (SAR) sensors. RPC functions relate 3D to 2D coordinates and vice versa, regardless of physical sensor specificities, which has made them an essential tool to harness satellite images in a generic way. This article describes a terrain-independent algorithm to accurately derive a RPC model from a set of 3D-2D point correspondences based on a regularized least squares fit. The performance of the method is assessed by varying the point correspondences and the size of the area that they cover. We test the algorithm on SAR and optical data, to derive RPCs from physical sensor models or from other RPC models after composition with corrective functions.
CVFeb 3, 2021
Parallax estimation for push-frame satellite imagery: application to super-resolution and 3D surface modeling from Skysat productsJérémy Anger, Thibaud Ehret, Gabriele Facciolo
Recent constellations of satellites, including the Skysat constellation, are able to acquire bursts of images. This new acquisition mode allows for modern image restoration techniques, including multi-frame super-resolution. As the satellite moves during the acquisition of the burst, elevation changes in the scene translate into noticeable parallax. This parallax hinders the results of the restoration. To cope with this issue, we propose a novel parallax estimation method. The method is composed of a linear Plane+Parallax decomposition of the apparent motion and a multi-frame optical flow algorithm that exploits all frames simultaneously. Using SkySat L1A images, we show that the estimated per-pixel displacements are important for applying multi-frame super-resolution on scenes containing elevation changes and that they can also be used to estimate a coarse 3D surface model.
CVJan 25, 2021
Proba-V-ref: Repurposing the Proba-V challenge for reference-aware super resolutionNgoc Long Nguyen, Jérémy Anger, Axel Davy et al.
The PROBA-V Super-Resolution challenge distributes real low-resolution image series and corresponding high-resolution targets to advance research on Multi-Image Super Resolution (MISR) for satellite images. However, in the PROBA-V dataset the low-resolution image corresponding to the high-resolution target is not identified. We argue that in doing so, the challenge ranks the proposed methods not only by their MISR performance, but mainly by the heuristics used to guess which image in the series is the most similar to the high-resolution target. We demonstrate this by improving the performance obtained by the two winners of the challenge only by using a different reference image, which we compute following a simple heuristic. Based on this, we propose PROBA-V-REF, a variant of the PROBA-V dataset in which the reference image in the low-resolution series is provided, and show that the ranking between the methods changes in this setting. This is relevant to many practical use cases of MISR where the goal is to super-resolve a specific image of the series, i.e. the reference is known. The proposed PROBA-V-REF should better reflect the performance of the different methods for this reference-aware MISR problem.
CVSep 14, 2020
Joint Demosaicking and Denoising Benefits from a Two-stage Training StrategyYu Guo, Qiyu Jin, Gabriele Facciolo et al.
Image demosaicking and denoising are the first two key steps of the color image production pipeline. The classical processing sequence has for a long time consisted of applying denoising first, and then demosaicking. Applying the operations in this order leads to oversmoothing and checkerboard effects. Yet, it was difficult to change this order, because once the image is demosaicked, the statistical properties of the noise are dramatically changed and hard to handle by traditional denoising models. In this paper, we address this problem by a hybrid machine learning method. We invert the traditional color filter array (CFA) processing pipeline by first demosaicking and then denoising. Our demosaicking algorithm, trained on noiseless images, combines a traditional method and a residual convolutional neural network (CNN). This first stage retains all known information, which is the key point to obtain faithful final results. The noisy demosaicked image is then passed through a second CNN restoring a noiseless full-color image. This pipeline order completely avoids checkerboard effects and restores fine image detail. Although CNNs can be trained to solve jointly demosaicking-denoising end-to-end, we find that this two-stage training performs better and is less prone to failure. It is shown experimentally to improve on the state of the art, both quantitatively and in terms of visual quality.
IVApr 24, 2020
A Review of an Old Dilemma: Demosaicking First, or Denoising First?Qiyu Jin, Gabriele Facciolo, Jean-Michel Morel
Image denoising and demosaicking are the most important early stages in digital camera pipelines. They constitute a severely ill-posed problem that aims at reconstructing a full color image from a noisy color filter array (CFA) image. In most of the literature, denoising and demosaicking are treated as two independent problems, without considering their interaction, or asking which should be applied first. Several recent works have started addressing them jointly in works that involve heavyweight CNNs, thus incompatible with low power portable imaging devices. Hence, the question of how to combine denoising and demosaicking to reconstruct full color images remains very relevant: Is denoising to be applied first, or should that be demosaicking first? In this paper, we review the main variants of these strategies and carry out an extensive evaluation to find the best way to reconstruct full color images from a noisy mosaic. We conclude that demosaicking should be applied first, followed by denoising. Yet we prove that this requires an adaptation of classic denoising algorithms to demosaicked noise, which we justify and specify.
CVApr 15, 2020
Self-Supervised training for blind multi-frame video denoisingValéry Dewil, Jérémy Anger, Axel Davy et al.
We propose a self-supervised approach for training multi-frame video denoising networks. These networks predict frame t from a window of frames around t. Our self-supervised approach benefits from the video temporal consistency by penalizing a loss between the predicted frame t and a neighboring target frame, which are aligned using an optical flow. We use the proposed strategy for online internal learning, where a pre-trained network is fine-tuned to denoise a new unknown noise type from a single video. After a few frames, the proposed fine-tuning reaches and sometimes surpasses the performance of a state-of-the-art network trained with supervision. In addition, for a wide range of noise types, it can be applied blindly without knowing the noise distribution. We demonstrate this by showing results on blind denoising of different synthetic and realistic noises.
CVMay 13, 2019
Joint Demosaicking and Denoising by Fine-Tuning of Bursts of Raw ImagesThibaud Ehret, Axel Davy, Pablo Arias et al.
Demosaicking and denoising are the first steps of any camera image processing pipeline and are key for obtaining high quality RGB images. A promising current research trend aims at solving these two problems jointly using convolutional neural networks. Due to the unavailability of ground truth data these networks cannot be currently trained using real RAW images. Instead, they resort to simulated data. In this paper we present a method to learn demosaicking directly from mosaicked images, without requiring ground truth RGB data. We apply this to learn joint demosaicking and denoising only from RAW images, thus enabling the use of real data. In addition we show that for this application fine-tuning a network to a specific burst improves the quality of restoration for both demosaicking and denoising.
CVApr 19, 2019
Assessing the Sharpness of Satellite Images: Study of the PlanetScope ConstellationJérémy Anger, Carlo de Franchis, Gabriele Facciolo
New micro-satellite constellations enable unprecedented systematic monitoring applications thanks to their wide coverage and short revisit capabilities. However, the large volumes of images that they produce have uneven qualities, creating the need for automatic quality assessment methods. In this work, we quantify the sharpness of images from the PlanetScope constellation by estimating the blur kernel from each image. Once the kernel has been estimated, it is possible to compute an absolute measure of sharpness, which makes it possible to discard low quality images and deconvolve blurry images before any further processing. The method is fully blind and automatic, and since it does not require the knowledge of any satellite specifications it can be ported to other constellations.
CVApr 19, 2019
Efficient Blind Deblurring under High Noise LevelsJérémy Anger, Mauricio Delbracio, Gabriele Facciolo
The goal of blind image deblurring is to recover a sharp image from a motion blurred one without knowing the camera motion. Current state-of-the-art methods have a remarkably good performance on images with no noise or very low noise levels. However, the noiseless assumption is not realistic considering that low light conditions are the main reason for the presence of motion blur due to requiring longer exposure times. In fact, motion blur and high to moderate noise often appear together. Most works approach this problem by first estimating the blur kernel $k$ and then deconvolving the noisy blurred image. In this work, we first show that current state-of-the-art kernel estimation methods based on the $\ell_0$ gradient prior can be adapted to handle high noise levels while keeping their efficiency. Then, we show that a fast non-blind deconvolution method can be significantly improved by first denoising the blurry image. The proposed approach yields results that are equivalent to those obtained with much more computationally demanding methods.
CVNov 30, 2018
Model-blind Video Denoising Via Frame-to-frame TrainingThibaud Ehret, Axel Davy, Jean-Michel Morel et al.
Modeling the processing chain that has produced a video is a difficult reverse engineering task, even when the camera is available. This makes model based video processing a still more complex task. In this paper we propose a fully blind video denoising method, with two versions, off-line and on-line. This is achieved by fine-tuning a pre-trained AWGN denoising network to the video with a novel frame-to-frame training strategy. Our denoiser can be used without knowledge of the origin of the video or burst and the post processing steps applied from the camera sensor. The on-line process only requires a couple of frames before achieving visually-pleasing results for a wide range of perturbations. It nonetheless reaches state of the art performance for standard Gaussian noise, and can be used off-line with still better performance.
CVNov 30, 2018
Non-Local Video Denoising by CNNAxel Davy, Thibaud Ehret, Jean-Michel Morel et al.
Non-local patch based methods were until recently state-of-the-art for image denoising but are now outperformed by CNNs. Yet they are still the state-of-the-art for video denoising, as video redundancy is a key factor to attain high denoising performance. The problem is that CNN architectures are hardly compatible with the search for self-similarities. In this work we propose a new and efficient way to feed video self-similarities to a CNN. The non-locality is incorporated into the network via a first non-trainable layer which finds for each patch in the input image its most similar patches in a search region. The central values of these patches are then gathered in a feature vector which is assigned to each image pixel. This information is presented to a CNN which is trained to predict the clean image. We apply the proposed architecture to image and video denoising. For the latter, patches are searched for in a 3D spatio-temporal volume. The proposed architecture achieves state-of-the-art results. To the best of our knowledge, this is the first successful application of a CNN to video denoising.
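The non-trainable first layer described above can be sketched for a single grayscale image as follows. This is a brute-force illustration under assumed settings (squared L2 patch distance, replicate padding at the borders, `k = 3` neighbors); the function names and parameters are hypothetical, and an efficient implementation would run on the GPU and, for video, search a 3D spatio-temporal volume.

```python
def patch(img, y, x, r):
    """Flattened (2r+1)x(2r+1) patch centered at (y, x), with replicate padding."""
    h, w = len(img), len(img[0])
    return [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
            for dy in range(-r, r + 1) for dx in range(-r, r + 1)]


def nonlocal_features(img, k=3, patch_radius=1, search_radius=2):
    """For each pixel, gather the central values of its k most similar patches."""
    h, w = len(img), len(img[0])
    feats = []
    for y in range(h):
        row = []
        for x in range(w):
            ref = patch(img, y, x, patch_radius)
            candidates = []
            for yy in range(max(0, y - search_radius), min(h, y + search_radius + 1)):
                for xx in range(max(0, x - search_radius), min(w, x + search_radius + 1)):
                    cand = patch(img, yy, xx, patch_radius)
                    dist = sum((a - b) ** 2 for a, b in zip(ref, cand))
                    candidates.append((dist, img[yy][xx]))
            candidates.sort(key=lambda c: c[0])  # most similar patches first
            row.append([value for _, value in candidates[:k]])
        feats.append(row)
    return feats


# Toy 5x5 image; each pixel ends up with a k-dimensional feature vector.
image = [[(3 * y + 2 * x) % 7 for x in range(5)] for y in range(5)]
features = nonlocal_features(image)
```

The resulting per-pixel vectors would then be stacked as input channels of the CNN, so the network sees the self-similar values without having to search for them itself.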
CVJun 4, 2018
Modeling Realistic Degradations in Non-blind DeconvolutionJérémy Anger, Mauricio Delbracio, Gabriele Facciolo
Most image deblurring methods assume an over-simplistic image formation model and as a result are sensitive to more realistic image degradations. We propose a novel variational framework that explicitly handles pixel saturation, noise, quantization, as well as non-linear camera response function due to e.g., gamma correction. We show that accurately modeling a more realistic image acquisition pipeline leads to significant improvements, both in terms of image quality and PSNR. Furthermore, we show that incorporating the non-linear response in both the data and the regularization terms of the proposed energy leads to a more detailed restoration than a naive inversion of the non-linear curve. The minimization of the proposed energy is performed using stochastic optimization. A dataset consisting of realistically degraded images is created in order to evaluate the method.
CVJan 19, 2017
Accurate Motion Estimation through Random Sample Aggregated ConsensusMartin Rais, Gabriele Facciolo, Enric Meinhardt-Llopis et al.
We reconsider the classic problem of estimating accurately a 2D transformation from point matches between images containing outliers. RANSAC discriminates outliers by randomly generating minimalistic sampled hypotheses and verifying their consensus over the input data. Its response is based on the single hypothesis that obtained the largest inlier support. In this article we show that the resulting accuracy can be improved by aggregating all generated hypotheses. This yields RANSAAC, a framework that improves systematically over RANSAC and its state-of-the-art variants by statistically aggregating hypotheses. To this end, we introduce a simple strategy that allows rapid averaging of 2D transformations, leading to an almost negligible extra computational cost. We give practical applications on projective transforms and homography+distortion models and demonstrate a significant performance gain in both cases.
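The difference between keeping the single best hypothesis and aggregating all of them can be sketched on the simplest possible model, a 2D translation, where one match is a minimal sample. The weighting rule (inlier count raised to a power), the parameter values, and the synthetic data below are assumptions for illustration; the paper addresses richer models such as homographies and uses its own aggregation strategy.

```python
import random


def count_inliers(t, matches, thresh):
    """Number of matches (p, q) whose residual |p + t - q| is below thresh."""
    return sum(1 for (px, py), (qx, qy) in matches
               if ((px + t[0] - qx) ** 2 + (py + t[1] - qy) ** 2) ** 0.5 < thresh)


def ransac_and_aggregate(matches, n_hyp=30, thresh=0.3, power=4, seed=0):
    """Return (best single hypothesis, weighted average of all hypotheses)."""
    rng = random.Random(seed)
    hyps = []
    for p, q in rng.sample(matches, n_hyp):  # one match = one translation hypothesis
        t = (q[0] - p[0], q[1] - p[1])
        hyps.append((t, count_inliers(t, matches, thresh)))
    best = max(hyps, key=lambda hyp: hyp[1])[0]
    weights = [count ** power for _, count in hyps]  # assumed weighting rule
    total = sum(weights)
    aggregated = (sum(w * t[0] for (t, _), w in zip(hyps, weights)) / total,
                  sum(w * t[1] for (t, _), w in zip(hyps, weights)) / total)
    return best, aggregated


def make_matches(true_t=(3.0, -2.0), n_in=40, n_out=10, seed=1):
    """Synthetic matches: small bounded noise for inliers, large for outliers."""
    rng = random.Random(seed)
    matches = []
    for i in range(n_in + n_out):
        p = (rng.uniform(0, 100), rng.uniform(0, 100))
        spread = 0.1 if i < n_in else 20.0
        q = (p[0] + true_t[0] + rng.uniform(-spread, spread),
             p[1] + true_t[1] + rng.uniform(-spread, spread))
        matches.append((p, q))
    return matches


matches = make_matches()
best, aggregated = ransac_and_aggregate(matches)
```

Raising the inlier count to a power keeps poorly supported hypotheses from pulling the average away, while the many well-supported ones average out part of their individual noise.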