CV | Sep 6, 2023 | Code
PDiscoNet: Semantically consistent part discovery for fine-grained recognition
Robert van der Klis, Stephan Alaniz, Massimiliano Mancini et al.
Fine-grained classification often requires recognizing specific object parts, such as beak shape and wing patterns for birds. Encouraging a fine-grained classification model to first detect such parts and then using them to infer the class could help us gauge whether the model is indeed looking at the right details better than with interpretability methods that provide a single attribution map. We propose PDiscoNet to discover object parts by using only image-level class labels along with priors encouraging the parts to be: discriminative, compact, distinct from each other, equivariant to rigid transforms, and active in at least some of the images. In addition to using the appropriate losses to encode these priors, we propose to use part-dropout, where full part feature vectors are dropped at once to prevent a single part from dominating in the classification, and part feature vector modulation, which makes the information coming from each part distinct from the perspective of the classifier. Our results on CUB, CelebA, and PartImageNet show that the proposed method provides substantially better part discovery performance than previous methods while not requiring any additional hyper-parameter tuning and without penalizing the classification performance. The code is available at https://github.com/robertdvdk/part_detection.
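As a rough illustration of the part-dropout idea described above (dropping a part's whole feature vector at once rather than individual activations), here is a minimal sketch. The function name `part_dropout`, the per-part drop probability, and the absence of any rescaling are assumptions made for illustration; the abstract does not specify these details, and the authors' repository is the reference for the actual implementation.

```python
import random

def part_dropout(part_features, drop_prob, rng):
    """Zero out whole part feature vectors, each with probability drop_prob.

    part_features: list of K vectors (lists of floats), one per discovered part.
    rng: a random.Random instance, passed in so the behaviour is reproducible.
    Returns a new list; a dropped part becomes an all-zero vector.
    Assumption: no rescaling of the kept parts is applied.
    """
    out = []
    for vec in part_features:
        if rng.random() < drop_prob:
            out.append([0.0] * len(vec))  # drop the entire part at once
        else:
            out.append(list(vec))         # keep the part unchanged
    return out

# Toy example: 4 hypothetical parts with 3-dimensional features.
rng = random.Random(0)
parts = [[0.5, 1.0, -0.2], [1.5, 0.1, 0.3], [0.0, 0.7, 0.9], [2.0, -1.0, 0.4]]
dropped = part_dropout(parts, drop_prob=0.5, rng=rng)
```

Because entire parts disappear during training, a classifier fed these vectors cannot rely on one dominant part, which is the motivation the abstract gives for the technique.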
LG | Aug 21, 2023
Sparse Linear Concept Discovery Models
Konstantinos P. Panousis, Dino Ienco, Diego Marcos
The recent mass adoption of DNNs, even in safety-critical scenarios, has shifted the focus of the research community towards the creation of inherently interpretable models. Concept Bottleneck Models (CBMs) constitute a popular approach where hidden layers are tied to human-understandable concepts, allowing for investigation and correction of the network's decisions. However, CBMs usually suffer from: (i) performance degradation and (ii) lower interpretability than intended due to the sheer number of concepts contributing to each decision. In this work, we propose a simple yet highly intuitive interpretable framework based on Contrastive Language Image models and a single sparse linear layer. In stark contrast to related approaches, the sparsity in our framework is achieved via principled Bayesian arguments by inferring concept presence via a data-driven Bernoulli distribution. As we experimentally show, our framework not only outperforms recent CBM approaches accuracy-wise, but it also yields high per-example concept sparsity, facilitating the individual investigation of the emerging concepts.
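To make the combination of per-example concept gating and a single linear layer concrete, here is a minimal inference-time sketch. Everything named here is hypothetical: `gated_linear`, the toy similarity scores (standing in for image-text similarities from a contrastive model), the gate probabilities (standing in for the inferred Bernoulli parameters), and the hard threshold of 0.5. The abstract does not describe how gates are inferred or applied, so this only shows the general shape of such a layer.

```python
def gated_linear(similarities, gate_probs, weights, threshold=0.5):
    """Apply binary concept gates, then a linear layer.

    similarities: per-concept scores for one example (assumed given).
    gate_probs:   per-concept presence probabilities for this example
                  (assumed to come from some learned inference step).
    weights:      one row of per-concept weights for each class.
    Assumption: a concept is kept only if its probability exceeds threshold.
    """
    gates = [1.0 if p > threshold else 0.0 for p in gate_probs]
    active = [g * s for g, s in zip(gates, similarities)]
    logits = [sum(w * a for w, a in zip(row, active)) for row in weights]
    return logits, gates

# Toy example: 3 concepts, 2 classes.
similarities = [0.9, 0.1, 0.4]
gate_probs = [0.95, 0.05, 0.2]
weights = [[1.0, 2.0, 3.0], [-1.0, 0.5, 0.0]]
logits, gates = gated_linear(similarities, gate_probs, weights)
```

In this toy case only the first concept survives the gate, so the decision for this example can be inspected through a single concept, which is the kind of per-example sparsity the abstract refers to.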
CV | Aug 23, 2023
Masking Strategies for Background Bias Removal in Computer Vision Models
Ananthu Aniraj, Cassio F. Dantas, Dino Ienco et al.
Models for fine-grained image classification tasks, where the difference between some classes can be extremely subtle and the number of samples per class tends to be low, are particularly prone to picking up background-related biases and demand robust methods to handle potential examples with out-of-distribution (OOD) backgrounds. To gain deeper insights into this critical problem, our research investigates the impact of background-induced bias on fine-grained image classification, evaluating standard backbone models such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). We explore two masking strategies to mitigate background-induced bias: early masking, which removes background information at the (input) image level, and late masking, which selectively masks high-level spatial features corresponding to the background. Extensive experiments assess the behavior of CNN and ViT models under different masking strategies, with a focus on their generalization to OOD backgrounds. The findings demonstrate that both proposed strategies enhance OOD performance compared to the baseline models, with early masking consistently exhibiting the best OOD performance. Notably, a ViT variant employing GAP-Pooled Patch token-based classification combined with early masking achieves the highest OOD robustness.
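The difference between the two strategies can be sketched on toy data as follows. This is only an illustration under assumptions: the foreground mask is taken as given, `early_mask` and `late_mask_pool` are hypothetical names, and aggregating the masked features by averaging over foreground locations is one plausible choice that the abstract does not confirm.

```python
def early_mask(image, mask):
    """Early masking: zero out background pixels before the model sees them.

    image: H x W nested list of pixel values; mask: H x W nested list of 0/1.
    """
    return [[p * m for p, m in zip(prow, mrow)]
            for prow, mrow in zip(image, mask)]

def late_mask_pool(features, mask):
    """Late masking: ignore background locations in a high-level feature map.

    features: H x W x C nested list; mask: H x W of 0/1 at feature resolution.
    Assumption: the kept locations are aggregated by average pooling.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for frow, mrow in zip(features, mask):
        for vec, m in zip(frow, mrow):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total  # no foreground: return zeros rather than divide by zero
    return [t / count for t in total]

# Toy 2 x 2 example; the mask marks the diagonal as foreground.
mask = [[1, 0], [0, 1]]
image = [[1, 2], [3, 4]]
features = [[[1.0, 2.0], [10.0, 10.0]], [[10.0, 10.0], [3.0, 4.0]]]
masked_image = early_mask(image, mask)
pooled = late_mask_pool(features, mask)
```

One practical distinction: with early masking the backbone never sees background pixels, whereas with late masking the retained features may already have mixed in background information through the network's receptive field.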
CV | Dec 4, 2025 | Code
RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation
Nicolas Houdré, Diego Marcos, Hugo Riffaud de Turckheim et al.
Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low-resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders, limiting generalization across heterogeneous EO modalities. To overcome these limitations, we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, containing various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at https://github.com/nicolashoudre/RAMEN.
LG | Jan 4, 2023
Towards Explainable Land Cover Mapping: a Counterfactual-based Strategy
Cassio F. Dantas, Diego Marcos, Dino Ienco
Counterfactual explanations are an emerging tool to enhance interpretability of deep learning models. Given a sample, these methods seek to find and display to the user similar samples across the decision boundary. In this paper, we propose a generative adversarial counterfactual approach for satellite image time series in a multi-class setting for the land cover classification task. One of the distinctive features of the proposed approach is the lack of prior assumption on the targeted class for a given counterfactual explanation. This inherent flexibility allows for the discovery of interesting information on the relationship between land cover classes. The other feature consists of encouraging the counterfactual to differ from the original sample only in a small and compact temporal segment. These time-contiguous perturbations allow for a much sparser and, thus, interpretable solution. Furthermore, plausibility/realism of the generated counterfactual explanations is enforced via the proposed adversarial learning strategy.
LG | Oct 3, 2023
Coarse-to-Fine Concept Bottleneck Models
Konstantinos P. Panousis, Dino Ienco, Diego Marcos
Deep learning algorithms have recently gained significant attention due to their impressive performance. However, their high complexity and un-interpretable mode of operation hinder their confident deployment in real-world safety-critical tasks. This work targets ante hoc interpretability, and specifically Concept Bottleneck Models (CBMs). Our goal is to design a framework that admits a highly interpretable decision-making process with respect to human-understandable concepts, on two levels of granularity. To this end, we propose a novel two-level concept discovery formulation leveraging: (i) recent advances in vision-language models, and (ii) an innovative formulation for coarse-to-fine concept selection via data-driven and sparsity-inducing Bayesian arguments. Within this framework, concept information does not solely rely on the similarity between the whole image and general unstructured concepts; instead, we introduce the notion of concept hierarchy to uncover and exploit more granular concept information residing in patch-specific regions of the image scene. As we experimentally show, the proposed construction not only outperforms recent CBM approaches, but also yields a principled framework towards interpretability.
LG | Aug 5, 2024
DisCoM-KD: Cross-Modal Knowledge Distillation via Disentanglement Representation and Adversarial Learning
Dino Ienco, Cassio Fraga Dantas
Cross-modal knowledge distillation (CMKD) refers to the scenario in which a learning framework must handle training and test data that exhibit a modality mismatch; more precisely, training and test data do not cover the same set of data modalities. Traditional approaches for CMKD are based on a teacher/student paradigm where a teacher is trained on multi-modal data with the aim to successively distill knowledge from a multi-modal teacher to a single-modal student. Despite the widespread adoption of this paradigm, recent research has highlighted its inherent limitations in the context of cross-modal knowledge transfer. Taking a step beyond the teacher/student paradigm, here we introduce a new framework for cross-modal knowledge distillation, named DisCoM-KD (Disentanglement-learning based Cross-Modal Knowledge Distillation), that explicitly models different types of per-modality information with the aim to transfer knowledge from multi-modal data to a single-modal classifier. To this end, DisCoM-KD effectively combines disentanglement representation learning with adversarial domain adaptation to simultaneously extract, for each modality, domain-invariant, domain-informative and domain-irrelevant features according to a specific downstream task. Unlike the traditional teacher/student paradigm, our framework simultaneously learns all single-modal classifiers, eliminating the need to learn each student model separately as well as the teacher classifier. We evaluated DisCoM-KD on three standard multi-modal benchmarks and compared its behaviour with recent SOTA knowledge distillation frameworks. The findings clearly demonstrate the effectiveness of DisCoM-KD over competitors considering mismatch scenarios involving both overlapping and non-overlapping modalities. These results offer insights to reconsider the traditional paradigm for distilling information from multi-modal data to single-modal neural networks.
CV | Jul 5, 2024
PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers
Ananthu Aniraj, Cassio F. Dantas, Dino Ienco et al.
Computer vision methods that explicitly detect object parts and reason on them are a step towards inherently interpretable models. Existing approaches that perform part discovery driven by a fine-grained classification task make very restrictive assumptions on the geometric properties of the discovered parts: they should be small and compact. Although this prior is useful in some cases, in this paper we show that pre-trained transformer-based vision models, such as self-supervised DINOv2 ViT, enable the relaxation of these constraints. In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work. We test our approach on three fine-grained classification benchmarks: CUB, PartImageNet and Oxford Flowers, and compare our results to previously published methods as well as a re-implementation of the state-of-the-art method PDiscoNet with a transformer-based backbone. We consistently obtain substantial improvements across the board, both on part discovery metrics and the downstream classification task, showing that the strong inductive biases in self-supervised ViT models require rethinking the geometric priors that can be used for unsupervised part discovery.
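To illustrate why a total variation prior tolerates large or multi-component parts while still penalizing fragmented ones, here is a minimal sketch of an anisotropic discrete TV on a single 2D attention map. The exact form of the TV term used in the paper is not given in the abstract, so the function `total_variation` and the choice of absolute differences between horizontal and vertical neighbours are assumptions.

```python
def total_variation(att):
    """Anisotropic total variation of a 2D map (nested list of floats).

    Sums absolute differences between each cell and its right and lower
    neighbours. Low values mean large, piecewise-constant regions.
    """
    h, w = len(att), len(att[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                tv += abs(att[i + 1][j] - att[i][j])
            if j + 1 < w:
                tv += abs(att[i][j + 1] - att[i][j])
    return tv

# Two maps with the same number of active cells (8 each).
two_blobs = [[1, 1, 0, 0],
             [1, 1, 0, 0],
             [0, 0, 1, 1],
             [0, 0, 1, 1]]
checkerboard = [[1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1]]
tv_blobs = total_variation(two_blobs)        # 8.0
tv_checker = total_variation(checkerboard)   # 24.0
```

Both maps activate the same area, but the map made of two separate blobs has a much lower TV than the fragmented one. A penalty of this kind therefore does not force a part to be small or to form a single component; it only discourages noisy boundaries.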
CV | Jun 10, 2025 | Code
Inherently Faithful Attention Maps for Vision Transformers
Ananthu Aniraj, Cassio F. Dantas, Dino Ienco et al.
We introduce an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction. Context can strongly affect object perception, sometimes leading to biased representations, particularly when objects appear in out-of-distribution backgrounds. At the same time, many image-level object-centric tasks require identifying relevant regions, often requiring context. To address this conundrum, we propose a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. Extensive experiments across diverse benchmarks demonstrate that our approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds. Code: https://github.com/ananthu-aniraj/ifam
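A generic way to guarantee that masked-out regions cannot influence an attention output is to exclude them from the softmax entirely, so their weight is exactly zero. The sketch below shows that standard masked-softmax mechanism for a single query over a handful of tokens. It is not taken from the paper: `masked_softmax`, the toy scores, and the binary `keep` mask are illustrative assumptions, and the authors' repository is the reference for how their stage 2 actually applies the masks.

```python
import math

def masked_softmax(scores, keep):
    """Softmax over attention scores restricted to tokens where keep is True.

    scores: raw attention logits for one query, one per token.
    keep:   binary mask (e.g. produced by an earlier stage); masked tokens
            receive a weight of exactly 0.0.
    """
    kept = [s for s, k in zip(scores, keep) if k]
    if not kept:
        raise ValueError("at least one token must be kept")
    m = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - m) if k else 0.0 for s, k in zip(scores, keep)]
    z = sum(exps)
    return [e / z for e in exps]

# Toy example: the highest-scoring token is masked out and gets zero weight.
scores = [2.0, 1.0, 0.5, 3.0]
keep = [True, True, False, False]
weights = masked_softmax(scores, keep)
```

Since the masked tokens contribute exactly zero, any prediction computed from these weights depends only on the kept regions, which matches the faithfulness property the abstract describes.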
CV | Aug 16, 2025 | Code
TimeSenCLIP: A Vision-Language Model for Remote Sensing Using Single-Pixel Time Series
Pallavi Jain, Diego Marcos, Dino Ienco et al.
Vision-language models have shown significant promise in remote sensing applications, particularly for land-use and land-cover (LULC) mapping via zero-shot classification and retrieval. However, current approaches face two key challenges: reliance on large spatial tiles that increase computational cost, and dependence on text-based supervision, which is often not readily available. In this work, we present TimeSenCLIP, a lightweight framework that reevaluates the role of spatial context by assessing how well a single pixel, through its temporal and spectral dimensions, can be used to classify LULC and ecosystem types. By leveraging spectral and temporal information from Sentinel-2 imagery and cross-view learning with geo-tagged ground-level photos, we minimise the need for caption-based training while preserving semantic alignment between overhead (satellite) and ground perspectives. Our approach is grounded in the LUCAS and Sen4Map datasets, and evaluated on classification tasks including LULC, crop type, and ecosystem type. We demonstrate that single-pixel inputs, when combined with temporal and spectral cues, are sufficient for thematic mapping, offering a scalable and efficient alternative for large-scale remote sensing applications. Code is available at https://github.com/pallavijain-pj/TimeSenCLIP
AI | Jun 20, 2024 | Code
Semi Supervised Heterogeneous Domain Adaptation via Disentanglement and Pseudo-Labelling
Cassio F. Dantas, Raffaele Gaetano, Dino Ienco
Semi-supervised domain adaptation methods leverage information from a source labelled domain with the goal of generalizing over a scarcely labelled target domain. While this setting already poses challenges due to potential distribution shifts between domains, an even more complex scenario arises when source and target data differ in modality representation (e.g. they are acquired by sensors with different characteristics). For instance, in remote sensing, images may be collected via various acquisition modes (e.g. optical or radar), different spectral characteristics (e.g. RGB or multi-spectral) and spatial resolutions. Such a setting is denoted as Semi-Supervised Heterogeneous Domain Adaptation (SSHDA) and it exhibits an even more severe distribution shift due to modality heterogeneity across domains. To cope with the challenging SSHDA setting, here we introduce SHeDD (Semi-supervised Heterogeneous Domain Adaptation via Disentanglement), an end-to-end neural framework tailored to learning a target domain classifier by leveraging both labelled and unlabelled data from heterogeneous data sources. SHeDD is designed to effectively disentangle domain-invariant representations, relevant for the downstream task, from domain-specific information that can hinder the cross-modality transfer. Additionally, SHeDD adopts an augmentation-based consistency regularization mechanism that takes advantage of reliable pseudo-labels on the unlabelled target samples to further boost its generalization ability on the target domain. Empirical evaluations on two remote sensing benchmarks, encompassing heterogeneous data in terms of acquisition modes and spectral/spatial resolutions, demonstrate the quality of SHeDD compared to both baseline and state-of-the-art competing approaches. Our code is publicly available here: https://github.com/tanodino/SSHDA/
CV | Jun 19, 2024 | Code
Towards a multimodal framework for remote sensing image change retrieval and captioning
Roger Ferrod, Luigi Di Caro, Dino Ienco
Recently, there has been increasing interest in multimodal applications that integrate text with other modalities, such as images, audio and video, to facilitate natural language interactions with multimodal AI systems. While applications involving standard modalities have been extensively explored, there is still a lack of investigation into specific data modalities such as remote sensing (RS) data. Despite the numerous potential applications of RS data, including environmental protection, disaster monitoring and land planning, available solutions are predominantly focused on specific tasks like classification, captioning and retrieval. These solutions often overlook the unique characteristics of RS data, such as its capability to systematically provide information on the same geographical areas over time. This ability enables continuous monitoring of changes in the underlying landscape. To address this gap, we propose a novel foundation model for bi-temporal RS image pairs, in the context of change detection analysis, leveraging Contrastive Learning and the LEVIR-CC dataset for both captioning and text-image retrieval. By jointly training a contrastive encoder and captioning decoder, our model adds text-image retrieval capabilities, in the context of bi-temporal change detection, while maintaining captioning performance that is comparable to the state of the art. We release the source code and pretrained weights at: https://github.com/rogerferrod/RSICRC.
CV | May 7
Metonymy in vision models undermines attention-based interpretability
Ananthu Aniraj, Cassio F. Dantas, Dino Ienco et al.
Part-based reasoning is a classical strategy to make a computer vision model directly focus on the object parts that are relevant to the downstream task. In the context of deep learning, this also serves to improve by-design interpretability, often by using part-centric attention mechanisms on top of a latent image representation provided by a standard, black-box model. This approach is based on a locality assumption: that the latent representation of an object part encodes primarily information about the corresponding image region. In this work, we test this basic assumption, measuring intra-object leakage in vision models using part-based attribute annotations. Through a comprehensive experimental evaluation, we show that modern pretrained vision transformers violate the locality assumption and exhibit a strong intra-object leakage, in which each part encodes information from the whole object, a visual metonymy that compromises the faithfulness of attention-based interpretable-by-design methods for part-based reasoning, ultimately rendering them uninterpretable. In addition, we establish an upper bound using a two-stage approach that prevents leakage by design. We then show that this inherently disentangled feature extraction improves attribute-driven part discovery on a variety of tasks, confirming the practical impact of intra-object leakage. Our results uncover a neglected issue affecting the interpretability of part-based representations, such as those in CBMs relying on part-centric concepts, highlighting that two-stage approaches offer a promising way to mitigate it.
CV | Dec 11, 2024
SenCLIP: Enhancing zero-shot land-use mapping for Sentinel-2 with ground-level prompting
Pallavi Jain, Dino Ienco, Roberto Interdonato et al.
Pre-trained vision-language models (VLMs), such as CLIP, demonstrate impressive zero-shot classification capabilities with free-form prompts and even show some generalization in specialized domains. However, their performance on satellite imagery is limited due to the underrepresentation of such data in their training sets, which predominantly consist of ground-level images. Existing prompting techniques for satellite imagery are often restricted to generic phrases like "a satellite image of ...", limiting their effectiveness for zero-shot land-use and land-cover (LULC) mapping. To address these challenges, we introduce SenCLIP, which transfers CLIP's representation to Sentinel-2 imagery by leveraging a large dataset of Sentinel-2 images paired with geotagged ground-level photos from across Europe. We evaluate SenCLIP alongside other SOTA remote sensing VLMs on zero-shot LULC mapping tasks using the EuroSAT and BigEarthNet datasets with both aerial and ground-level prompting styles. Our approach, which aligns ground-level representations with satellite imagery, demonstrates significant improvements in classification accuracy across both prompt styles, opening new possibilities for applying free-form textual descriptions in zero-shot LULC mapping.
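For readers unfamiliar with how zero-shot classification with prompts works in CLIP-style models, the core step is to compare an image embedding with one text embedding per class prompt and pick the most similar. The sketch below shows only that step with made-up 3-dimensional vectors; the embeddings, class names, and the function `zero_shot_classify` are hypothetical stand-ins, not outputs of CLIP or SenCLIP.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, prompt_embs):
    """Return the label whose prompt embedding is closest to the image.

    prompt_embs: dict mapping a class label to the embedding of its prompt
    (in practice produced by the text encoder; toy values are used here).
    """
    scores = {label: cosine(image_emb, emb)
              for label, emb in prompt_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings standing in for encoder outputs.
image_emb = [0.9, 0.1, 0.0]
prompt_embs = {
    "forest": [1.0, 0.0, 0.0],
    "water": [0.0, 1.0, 0.0],
    "urban": [0.0, 0.0, 1.0],
}
best, scores = zero_shot_classify(image_emb, prompt_embs)
```

Because the class is decided purely by similarity to the prompt embeddings, the wording of the prompts (for example, aerial versus ground-level descriptions) directly changes the classifier, which is why the abstract treats prompting style as an evaluation axis.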
LG | Apr 17, 2024
Reuse out-of-year data to enhance land cover mapping via feature disentanglement and contrastive learning
Cassio F. Dantas, Raffaele Gaetano, Claudia Paris et al.
Timely, up-to-date land use/land cover (LULC) maps play a pivotal role in supporting agricultural territory management, environmental monitoring and well-informed, sustainable decision-making. Typically, when creating a land cover (LC) map, precise ground truth data is collected through time-consuming and expensive field campaigns. This data is then used in conjunction with satellite image time series (SITS) through advanced machine learning algorithms to produce the final map. Unfortunately, each time this process is repeated (e.g., annually over a region to estimate agricultural production or potential biodiversity loss), new ground truth data must be collected, leading to the complete disregard of previously gathered reference data despite the substantial financial and time investment they required. How to make use of historical data, from the same or similar study sites, to enhance the current LULC mapping process is a significant challenge; addressing it would allow the financial and human-resource efforts invested in previous data campaigns to be valued again. Aiming to tackle this important challenge, we here propose a deep learning framework based on recent advances in domain adaptation and generalization to combine remote sensing and reference data coming from two different domains (e.g. historical data and fresh data) to improve the current LC mapping process. Our approach, namely REFeD (data Reuse with Effective Feature Disentanglement for land cover mapping), leverages a disentanglement strategy, based on contrastive learning, where invariant and specific per-domain features are derived to recover the intrinsic information related to the downstream LC mapping task and alleviate possible distribution shifts between domains. Additionally, REFeD is equipped with an effective supervision scheme where feature disentanglement is further enforced via multiple levels of supervision at different granularities.
The experimental assessment over two study areas covering extremely diverse and contrasted landscapes, namely Koumbia (located in the West-Africa region, in Burkina Faso) and Centre Val de Loire (located in centre Europe, France), underlines the quality of our framework and the obtained findings demonstrate that out-of-year information coming from the same (or similar) study site, at different periods of time, can constitute a valuable additional source of information to enhance the LC mapping process.
CV | Oct 22, 2025
Multi-modal Co-learning for Earth Observation: Enhancing single-modality models via modality collaboration
Francisco Mena, Dino Ienco, Cassio F. Dantas et al.
Multi-modal co-learning is emerging as an effective paradigm in machine learning, enabling models to collaboratively learn from different modalities to enhance single-modality predictions. Earth Observation (EO) represents a quintessential domain for multi-modal data analysis, wherein diverse remote sensors collect data to sense our planet. This unprecedented volume of data introduces novel challenges. Specifically, the access to the same sensor modalities at both training and inference stages becomes increasingly complex based on real-world constraints affecting remote sensing platforms. In this context, multi-modal co-learning presents a promising strategy to leverage the vast amount of sensor-derived data available at the training stage to improve single-modality models for inference-time deployment. Most current research efforts focus on designing customized solutions for either particular downstream tasks or specific modalities available at the inference stage. To address this, we propose a novel multi-modal co-learning framework capable of generalizing across various tasks without targeting a specific modality for inference. Our approach combines contrastive and modality discriminative learning together to guide single-modality models to structure the internal model manifold into modality-shared and modality-specific information. We evaluate our framework on four EO benchmarks spanning classification and regression tasks across different sensor modalities, where only one of the modalities available during training is accessible at inference time. Our results demonstrate consistent predictive improvements over state-of-the-art approaches from the recent machine learning and computer vision literature, as well as EO-specific methods. The obtained findings validate our framework in the single-modality inference scenarios across a diverse range of EO applications.
CV | May 30, 2025
Revisiting Cross-Modal Knowledge Distillation: A Disentanglement Approach for RGBD Semantic Segmentation
Roger Ferrod, Cássio F. Dantas, Luigi Di Caro et al.
Multi-modal RGB and Depth (RGBD) data are predominant in many domains such as robotics, autonomous driving and remote sensing. The combination of these multi-modal data enhances environmental perception by providing 3D spatial context, which is absent in standard RGB images. Although RGBD multi-modal data can be available to train computer vision models, accessing all sensor modalities during the inference stage may be infeasible due to sensor failures or resource constraints, leading to a mismatch between data modalities available during training and inference. Traditional Cross-Modal Knowledge Distillation (CMKD) frameworks, developed to address this task, are typically based on a teacher/student paradigm, where a multi-modal teacher distills knowledge into a single-modality student model. However, these approaches face challenges in teacher architecture choices and distillation process selection, thus limiting their adoption in real-world scenarios. To overcome these issues, we introduce CroDiNo-KD (Cross-Modal Disentanglement: a New Outlook on Knowledge Distillation), a novel cross-modal knowledge distillation framework for RGBD semantic segmentation. Our approach simultaneously learns single-modality RGB and Depth models by exploiting disentanglement representation, contrastive learning and decoupled data augmentation with the aim to structure the internal manifolds of neural network models through interaction and collaboration. We evaluated CroDiNo-KD on three RGBD datasets across diverse domains, considering recent CMKD frameworks as competitors. Our findings illustrate the quality of CroDiNo-KD, and they suggest reconsidering the conventional teacher/student paradigm to distill information from multi-modal data to single-modality neural networks.
CV | Apr 16, 2025
Geographical Context Matters: Bridging Fine and Coarse Spatial Information to Enhance Continental Land Cover Mapping
Babak Ghassemi, Cassio Fraga-Dantas, Raffaele Gaetano et al.
Land use and land cover mapping from Earth Observation (EO) data is a critical tool for sustainable land and resource management. While advanced machine learning and deep learning algorithms excel at analyzing EO imagery data, they often overlook crucial geospatial metadata information that could enhance scalability and accuracy across regional, continental, and global scales. To address this limitation, we propose BRIDGE-LC (Bi-level Representation Integration for Disentangled GEospatial Land Cover), a novel deep learning framework that integrates multi-scale geospatial information into the land cover classification process. By simultaneously leveraging fine-grained (latitude/longitude) and coarse-grained (biogeographical region) spatial information, our lightweight multi-layer perceptron architecture learns from both during training but only requires fine-grained information for inference, allowing it to disentangle region-specific from region-agnostic land cover features while maintaining computational efficiency. To assess the quality of our framework, we use an open-access in-situ dataset and adopt several competing classification approaches commonly considered for large-scale land cover mapping. We evaluated all approaches through two scenarios: an extrapolation scenario in which training data encompasses samples from all biogeographical regions, and a leave-one-region-out scenario where one region is excluded from training. We also explore the spatial representation learned by our model, highlighting a connection between its internal manifold and the geographical information used during training. Our results demonstrate that integrating geospatial information improves land cover mapping performance, with the most substantial gains achieved by jointly leveraging both fine- and coarse-grained spatial information.
CV | Apr 30, 2020
Attentive Weakly Supervised land cover mapping for object-based satellite image time series data with spatial interpretation
Dino Ienco, Yawogan Jean Eudes Gbodjo, Roberto Interdonato et al.
Modern Earth Observation systems continuously collect massive amounts of satellite information. The unprecedented possibility to acquire high-resolution Satellite Image Time Series (SITS) data (series of images of the same geographical area acquired with a high revisit frequency) is opening new opportunities to monitor different aspects of the Earth's surface but, at the same time, it is raising new challenges in terms of suitable methods to analyze and exploit such a huge amount of rich and complex image data. One of the main tasks associated with SITS data analysis is land cover mapping, where satellite data are exploited via learning methods to recover the status of the Earth's surface, i.e., the corresponding land cover classes. Due to operational constraints, the collected label information on which machine learning strategies are trained is often limited in volume and obtained at coarse granularity, carrying inexact and weak knowledge that can affect the whole process. To cope with such issues, in the context of object-based SITS land cover mapping, we propose a new deep learning framework, named TASSEL (aTtentive weAkly Supervised Satellite image time sEries cLassifier), that is able to intelligently exploit the weak supervision provided by the coarse-granularity labels. Furthermore, our framework also produces additional side information that supports model interpretability, with the aim of making the black box gray. Such side information makes it possible to associate a spatial interpretation with the model decision via visual inspection.
CV | Apr 4, 2020
Fine grained classification for multi-source land cover mapping
Yawogan Jean Eudes Gbodjo, Dino Ienco, Louise Leroux et al.
Nowadays, there is a general agreement on the need to better characterize agricultural monitoring systems in response to global changes. Timely and accurate land use/land cover mapping can support this vision by providing useful information at fine scale. Here, a deep learning approach is proposed to deal with multi-source land cover mapping at object level. The approach is based on an extension of Recurrent Neural Network enriched via an attention mechanism dedicated to the multi-temporal data context. Moreover, a new hierarchical pretraining strategy, designed to exploit specific domain knowledge available under hierarchical relationships within land cover classes, is introduced. Experiments carried out on Reunion Island, a French overseas department, demonstrate the significance of the proposal compared to standard remote sensing approaches for land cover mapping.
LGNov 20, 2019
Object-based multi-temporal and multi-source land cover mapping leveraging hierarchical class relationshipsYawogan Jean Eudes Gbodjo, Dino Ienco, Louise Leroux et al.
European satellite missions Sentinel-1 (S1) and Sentinel-2 (S2) provide, at high spatial resolution and high revisit time, radar and optical images, respectively, that support a wide range of Earth surface monitoring tasks such as Land Use/Land Cover mapping. A long-standing challenge in the remote sensing community is how to efficiently exploit multiple sources of information and leverage their complementarity; in this particular case, how to get the most out of radar and optical satellite image time series (SITS). Here, we propose to deal with land cover mapping through a deep learning framework especially tailored to leverage the multi-source complementarity provided by radar and optical SITS. The proposed architecture is based on an extension of Recurrent Neural Network (RNN) enriched via a customized attention mechanism capable of fitting the specificity of SITS data. In addition, we propose a new pretraining strategy that exploits domain expert knowledge to guide the model parameter initialization. Thorough experimental evaluations involving several machine learning competitors, on two contrasted study sites, have demonstrated the suitability of our new attention mechanism combined with the extended RNN model, as well as the benefits and limits of injecting domain expert knowledge into the neural network training process.
LGNov 4, 2019
Supervised level-wise pretraining for recurrent neural network initialization in multi-class classificationDino Ienco, Roberto Interdonato, Raffaele Gaetano
Recurrent Neural Networks (RNNs) can be seriously impacted by the initial parameter assignment, which may result in poor generalization performance on new unseen data. With the objective of tackling this crucial issue, in the context of RNN-based classification, we propose a new supervised layer-wise pretraining strategy to initialize network parameters. The proposed approach leverages a data-aware strategy that sets up a taxonomy of classification problems automatically derived from the model behavior. To the best of our knowledge, despite the great interest in RNN-based classification, this is the first data-aware strategy dealing with the initialization of such models. The proposed strategy has been tested on four benchmarks coming from two different domains, i.e., Speech Recognition and Remote Sensing. Results underline the significance of our approach and point out that data-aware strategies positively support the initialization of Recurrent Neural Network based classification models.
CVDec 13, 2018
Combining Sentinel-1 and Sentinel-2 Time Series via RNN for object-based land cover classificationDino Ienco, Raffaele Gaetano, Roberto Interdonato et al.
Radar and Optical Satellite Image Time Series (SITS) are sources of information that are commonly employed to monitor earth surfaces for tasks related to ecology, agriculture, mobility, land management planning and land cover monitoring. Many studies have been conducted using one of the two sources, but how to smartly combine the complementary information provided by radar and optical SITS is still an open challenge. In this context, we propose a new neural architecture for the combination of Sentinel-1 (S1) and Sentinel-2 (S2) imagery at object level, applied to a real-world land cover classification task. Experiments carried out on Reunion Island, an overseas department of France in the Indian Ocean, demonstrate the significance of our proposal.
CVSep 20, 2018
DuPLO: A DUal view Point deep Learning architecture for time series classificatiOnRoberto Interdonato, Dino Ienco, Raffaele Gaetano et al.
Nowadays, modern Earth Observation systems continuously generate huge amounts of data. A notable example is represented by the Sentinel-2 mission, which provides images at high spatial resolution (up to 10m) with high temporal revisit period (every 5 days), which can be organized in Satellite Image Time Series (SITS). While the use of SITS has been proved to be beneficial in the context of Land Use/Land Cover (LULC) map generation, unfortunately, machine learning approaches commonly leveraged in the remote sensing field fail to take advantage of spatio-temporal dependencies present in such data. Recently, new generation deep learning methods have significantly advanced research in this field. These approaches have generally focused on a single type of neural network, i.e., Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), which model different but complementary information: spatial autocorrelation (CNNs) and temporal dependencies (RNNs). In this work, we propose the first deep learning architecture for the analysis of SITS data, namely DuPLO (DUal view Point deep Learning architecture for time series classificatiOn), that combines Convolutional and Recurrent neural networks to exploit their complementarity. Our hypothesis is that, since CNNs and RNNs capture different aspects of the data, a combination of both models would produce a more diverse and complete representation of the information for the underlying land cover classification task. Experiments carried out on two study sites characterized by different land cover characteristics (i.e., the Gard site in France and the Reunion Island in the Indian Ocean), demonstrate the significance of our proposal.
CVJun 29, 2018
MRFusion: A Deep Learning architecture to fuse PAN and MS imagery for land cover mappingRaffaele Gaetano, Dino Ienco, Kenji Ose et al.
Nowadays, Earth Observation systems provide a multitude of heterogeneous remote sensing data. How to manage such richness while leveraging its complementarity is a crucial challenge in modern remote sensing analysis. Data Fusion techniques deal with this point by proposing methods to combine and exploit complementarity among the different data sensors. Considering optical Very High Spatial Resolution (VHSR) images, satellites obtain both Multi Spectral (MS) and panchromatic (PAN) images at different spatial resolutions. VHSR images are extensively exploited to produce land cover maps to deal with agricultural, ecological, and socioeconomic issues, as well as for assessing ecosystem status, monitoring biodiversity and providing inputs to conceive food risk monitoring systems. Common techniques to produce land cover maps from such VHSR images typically opt for a prior pansharpening of the multi-resolution source for a full resolution processing. Here, we propose a new deep learning architecture to jointly use PAN and MS imagery for a direct classification without any prior image fusion or resampling process. By managing the spectral information at its native spatial resolution, our method, named MRFusion, aims at avoiding the possible information loss induced by pansharpening or any other hand-crafted preprocessing. Moreover, the proposed architecture is suitably designed to learn non-linear transformations of the sources with the explicit aim of taking as much advantage as possible of the complementarity of PAN and MS imagery. Experiments are carried out on two real-world scenarios depicting large areas with different land cover characteristics. The characteristics of the proposed scenarios underline the applicability and the generality of our method in operational settings.
CVAug 11, 2017
Deep Recurrent Neural Networks for mapping winter vegetation quality coverage via multi-temporal SAR Sentinel-1Dinh Ho Tong Minh, Dino Ienco, Raffaele Gaetano et al.
Mapping winter vegetation quality coverage is a challenging problem in remote sensing. This is due to the cloud coverage in the winter period, which leads to the use of radar rather than optical images. The objective of this paper is to provide a better understanding of the capabilities of radar Sentinel-1 and deep learning for mapping winter vegetation quality coverage. The analysis presented in this paper is carried out on multi-temporal Sentinel-1 data over the site of La Rochelle, France, during the campaign in December 2016. This dataset was processed in order to produce an intensity radar data stack from October 2016 to February 2017. Two deep Recurrent Neural Network (RNN) based classifier methods were employed. We found that the RNNs clearly outperformed the classical machine learning approaches (Support Vector Machine and Random Forest). This study confirms that Sentinel-1 radar time series and RNNs could be exploited for winter vegetation quality cover mapping.
CVApr 13, 2017
Land Cover Classification via Multi-temporal Spatial Data by Recurrent Neural NetworksDino Ienco, Raffaele Gaetano, Claire Dupaquier et al.
Nowadays, modern earth observation programs produce huge volumes of satellite image time series (SITS) that can be useful to monitor geographical areas through time. How to efficiently analyze this kind of information is still an open question in the remote sensing field. Recently, deep learning methods proved suitable to deal with remote sensing data mainly for scene classification (i.e., Convolutional Neural Networks - CNNs - on single images) while only very few studies exist involving temporal deep learning approaches (i.e., Recurrent Neural Networks - RNNs) to deal with remote sensing time series. In this letter we evaluate the ability of Recurrent Neural Networks, in particular the Long Short-Term Memory (LSTM) model, to perform land cover classification considering multi-temporal spatial data derived from a time series of satellite images. We carried out experiments on two different datasets considering both pixel-based and object-based classification. The obtained results show that Recurrent Neural Networks are competitive compared to state-of-the-art classifiers, and may outperform classical approaches in the presence of poorly represented and/or highly mixed classes. We also show that using the alternative feature representation generated by LSTM can improve the performance of standard classifiers.
MMNov 16, 2015
Deep learning is a good steganalysis tool when embedding key is reused for different images, even if there is a cover source-mismatchLionel Pibre, Pasquet Jérôme, Dino Ienco et al.
Since the BOSS competition, in 2010, most steganalysis approaches use a learning methodology involving two steps: feature extraction, such as the Rich Models (RM), for the image representation, and use of the Ensemble Classifier (EC) for the learning step. In 2015, Qian et al. showed that a deep learning approach that jointly learns and computes the features is very promising for steganalysis. In this paper, we follow up the study of Qian et al., and show that, due to intrinsic joint minimization, the results obtained from a Convolutional Neural Network (CNN) or a Fully Connected Neural Network (FNN), if well parameterized, surpass the conventional use of an RM with an EC. First, numerous experiments were conducted in order to find the best "shape" of the CNN. Second, experiments were carried out in the clairvoyant scenario in order to compare the CNN and FNN to an RM with an EC. The results show more than 16% reduction in the classification error with our CNN or FNN. Third, experiments were also performed in a cover-source mismatch setting. The results show that the CNN and FNN are naturally robust to the mismatch problem. In addition to the experiments, we provide discussions on the internal mechanisms of a CNN, and weave links with some previously stated ideas, in order to understand the impressive results we obtained.