CV | Apr 29, 2023 | Code
MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer
Yifang Xu, Yunzhuo Sun, Yang Li et al.
With the increasing demand for video understanding, video moment and highlight detection (MHD) has emerged as a critical research topic. MHD aims to localize all moments and predict clip-wise saliency scores simultaneously. Despite progress made by existing DETR-based methods, we observe that these methods coarsely fuse features from different modalities, which weakens the temporal intra-modal context and results in insufficient cross-modal interaction. To address this issue, we propose MH-DETR (Moment and Highlight Detection Transformer) tailored for MHD. Specifically, we introduce a simple yet efficient pooling operator within the uni-modal encoder to capture global intra-modal context. Moreover, to obtain temporally aligned cross-modal features, we design a plug-and-play cross-modal interaction module between the encoder and decoder, seamlessly integrating visual and textual features. Comprehensive experiments on QVHighlights, Charades-STA, Activity-Net, and TVSum datasets show that MH-DETR outperforms existing state-of-the-art methods, demonstrating its effectiveness and superiority. Our code is available at https://github.com/YoucanBaby/MH-DETR.
CV | Feb 12
EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data
Nils Lehmann, Yi Wang, Zhitong Xiong et al.
State-of-the-art generative image and video models rely heavily on tokenizers that compress high-dimensional inputs into more efficient latent representations. While this paradigm has revolutionized RGB generation, Earth observation (EO) data presents unique challenges due to diverse sensor specifications and variable spectral channels. We propose EO-VAE, a multi-sensor variational autoencoder designed to serve as a foundational tokenizer for the EO domain. Unlike prior approaches that train separate tokenizers for each modality, EO-VAE utilizes a single model to encode and reconstruct flexible channel combinations via dynamic hypernetworks. Our experiments on the TerraMesh dataset demonstrate that EO-VAE achieves superior reconstruction fidelity compared to the TerraMind tokenizers, establishing a robust baseline for latent generative modeling in remote sensing.
CV | Nov 14, 2024 | Code
LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation
Zhenshi Li, Dilxat Muhtar, Feng Gu et al.
Automatically and rapidly understanding Earth's surface is fundamental to our grasp of the living environment and informed decision-making. This underscores the need for a unified system with comprehensive capabilities in analyzing Earth's surface to address a wide range of human needs. The emergence of multimodal large language models (MLLMs) has great potential in boosting the efficiency and convenience of intelligent Earth observation. These models can engage in human-like conversations, serve as unified platforms for understanding images, follow diverse instructions, and provide insightful feedback. In this study, we introduce LHRS-Bot-Nova, an MLLM specialized in understanding remote sensing (RS) images, designed to expertly perform a wide range of RS understanding tasks aligned with human instructions. LHRS-Bot-Nova features an enhanced vision encoder and a novel bridge layer, enabling efficient visual compression and better language-vision alignment. To further enhance RS-oriented vision-language alignment, we propose a large-scale RS image-caption dataset, generated through feature-guided image recaptioning. Additionally, we introduce an instruction dataset specifically designed to improve spatial recognition abilities. Extensive experiments demonstrate superior performance of LHRS-Bot-Nova across various RS image understanding tasks. We also evaluate the performance of different MLLMs in complex RS perception and instruction following using a complicated multi-choice question evaluation benchmark, providing a reliable guide for future model selection and improvement. Data, code, and models will be available at https://github.com/NJU-LHRS/LHRS-Bot.
CV | Mar 26
GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing
Xuran Hu, Zhitong Xiong, Zhongcheng Hong et al.
Current Large Multimodal Models (LMMs) in Earth Observation typically neglect the critical "vertical" dimension, limiting their reasoning capabilities in complex remote sensing geometries and disaster scenarios where physical spatial structures often outweigh planar visual textures. To bridge this gap, we introduce a comprehensive evaluation framework dedicated to height-aware remote sensing understanding. First, to overcome the severe scarcity of annotated data, we develop a scalable, VLM-driven data generation pipeline utilizing systematic prompt engineering and metadata extraction. This pipeline constructs two complementary benchmarks: GeoHeight-Bench for relative height analysis, and a more challenging GeoHeight-Bench+ for holistic, terrain-aware reasoning. Furthermore, to validate the necessity of height perception, we propose GeoHeightChat, the first height-aware remote sensing LMM baseline. Serving as a strong proof of concept, our baseline demonstrates that synergizing visual semantics with implicitly injected height geometric features effectively mitigates the "vertical blind spot", successfully unlocking a new paradigm of interactive height reasoning in existing optical models.
IV | Jul 10, 2020 | Code
Cross-Attention in Coupled Unmixing Nets for Unsupervised Hyperspectral Super-Resolution
Jing Yao, Danfeng Hong, Jocelyn Chanussot et al.
The recent advancement of deep learning techniques has made great progress on hyperspectral image super-resolution (HSI-SR). Yet the development of unsupervised deep networks remains challenging for this task. To this end, we propose a novel coupled unmixing network with a cross-attention mechanism, CUCaNet for short, to enhance the spatial resolution of HSI by means of a higher-spatial-resolution multispectral image (MSI). Inspired by coupled spectral unmixing, a two-stream convolutional autoencoder framework is taken as the backbone to jointly decompose MS and HS data into a spectrally meaningful basis and corresponding coefficients. CUCaNet is capable of adaptively learning spectral and spatial response functions from HS-MS correspondences by enforcing reasonable consistency assumptions on the networks. Moreover, a cross-attention module is devised to yield more effective spatial-spectral information transfer in networks. Extensive experiments are conducted on three widely used HS-MS datasets in comparison with state-of-the-art HSI-SR models, demonstrating the superiority of CUCaNet in the HSI-SR application. Furthermore, the codes and datasets will be available at: https://github.com/danfenghong/ECCV2020_CUCaNet.
LG | Apr 9
Bias-Constrained Diffusion Schedules for PDE Emulations: Reconstruction Error Minimization and Efficient Unrolled Training
Constantin Le Cleï, Nils Thürey, Xiaoxiang Zhu
Conditional Diffusion Models are powerful surrogates for emulating complex spatiotemporal dynamics, yet they often fail to match the accuracy of deterministic neural emulators for high-precision tasks. In this work, we address two critical limitations of autoregressive PDE diffusion models: their sub-optimal single-step accuracy and the prohibitive computational cost of unrolled training. First, we characterize the relationship between the noise schedule, the reconstruction error reduction rate and the diffusion exposure bias, demonstrating that standard schedules lead to suboptimal reconstruction error. Leveraging this insight, we propose an Adaptive Noise Schedule framework that minimizes inference reconstruction error by dynamically constraining the model's exposure bias. We further show that this optimized schedule enables a fast Proxy Unrolled Training method to stabilize long-term rollouts without the cost of full Markov Chain sampling. Both proposed methods enable significant improvements in short-term accuracy and long-term stability over diffusion and deterministic baselines on diverse benchmarks, including forced Navier-Stokes, Kuramoto-Sivashinsky and Transonic Flow.
CL | Mar 28, 2025
Generating Synthetic Oracle Datasets to Analyze Noise Impact: A Study on Building Function Classification Using Tweets
Shanshan Bai, Anna Kruspe, Xiaoxiang Zhu
Tweets provide valuable semantic context for earth observation tasks and serve as a complementary modality to remote sensing imagery. In building function classification (BFC), tweets are often collected using geographic heuristics and labeled via external databases, an inherently weakly supervised process that introduces both label noise and sentence-level feature noise (e.g., irrelevant or uninformative tweets). While label noise has been widely studied, the impact of sentence-level feature noise remains underexplored, largely due to the lack of clean benchmark datasets for controlled analysis. In this work, we propose a method for generating a synthetic oracle dataset using an LLM, designed to contain only tweets that are both correctly labeled and semantically relevant to their associated buildings. This oracle dataset enables systematic investigation of noise impacts that are otherwise difficult to isolate in real-world data. To assess its utility, we compare model performance using Naive Bayes and mBERT classifiers under three configurations: real vs. synthetic training data, and cross-domain generalization. Results show that noise in real tweets significantly degrades the contextual learning capacity of mBERT, reducing its performance to that of a simple keyword-based model. In contrast, the clean synthetic dataset allows mBERT to learn effectively, outperforming Naive Bayes by a large margin. These findings highlight that addressing feature noise is more critical than model complexity in this task. Our synthetic dataset offers a novel experimental environment for future noise injection studies and is publicly available on GitHub.
CV | Apr 15, 2024
Contrastive Pretraining for Visual Concept Explanations of Socioeconomic Outcomes
Ivica Obadic, Alex Levering, Lars Pennig et al.
Predicting socioeconomic indicators from satellite imagery with deep learning has become an increasingly popular research direction. Post-hoc concept-based explanations can be an important step towards broader adoption of these models in policy-making as they enable the interpretation of socioeconomic outcomes based on visual concepts that are intuitive to humans. In this paper, we study the interplay between representation learning using an additional task-specific contrastive loss and post-hoc concept explainability for socioeconomic studies. Our results on two different geographical locations and tasks indicate that the task-specific pretraining imposes a continuous ordering of the latent space embeddings according to the socioeconomic outcomes. This improves the model's interpretability as it enables the latent space of the model to associate concepts encoding typical urban and natural area patterns with continuous intervals of socioeconomic outcomes. Further, we illustrate how analyzing the model's conceptual sensitivity for the intervals of socioeconomic outcomes can shed light on new insights for urban studies.
CV | May 8, 2023
Building Footprint Extraction with Graph Convolutional Network
Yilei Shi, Qinyu Li, Xiaoxiang Zhu
Building footprint information is an essential ingredient for 3-D reconstruction of urban models. The automatic generation of building footprints from satellite images presents a considerable challenge due to the complexity of building shapes. Recent developments in deep convolutional neural networks (DCNNs) have enabled accurate pixel-level labeling tasks. One central issue remains: the precise delineation of boundaries. Deep architectures generally fail to produce fine-grained segmentation with accurate boundaries due to progressive downsampling. In this work, we propose an end-to-end framework to overcome this issue, which uses a graph convolutional network (GCN) for the building footprint extraction task. Our proposed framework outperforms state-of-the-art methods.
CV | Jan 26, 2022
ReforesTree: A Dataset for Estimating Tropical Forest Carbon Stock with Deep Learning and Aerial Imagery
Gyri Reiersen, David Dao, Björn Lütjens et al.
Forest biomass is a key influence for future climate, and the world urgently needs highly scalable financing schemes, such as carbon offsetting certifications, to protect and restore forests. Current manual forest carbon stock inventory methods of measuring single trees by hand are time-, labour-, and cost-intensive and have been shown to be subjective. They can lead to substantial overestimation of the carbon stock and ultimately distrust in forest financing. The potential for impact and scale of leveraging advancements in machine learning and remote sensing technologies is promising, but these methods need to be of high quality in order to replace the current forest stock protocols for certifications. In this paper, we present ReforesTree, a benchmark dataset of forest carbon stock in six agro-forestry carbon offsetting sites in Ecuador. Furthermore, we show that a deep learning-based end-to-end model using individual tree detection from low-cost RGB-only drone imagery accurately estimates forest carbon stock within official carbon offsetting certification standards. Additionally, our baseline CNN model outperforms state-of-the-art satellite-based forest biomass and carbon stock estimates for this type of small-scale, tropical agro-forestry site. We present this dataset to encourage machine learning research in this area to increase accountability and transparency of monitoring, verification and reporting (MVR) in carbon offsetting projects, as well as scaling global reforestation financing through accurate remote sensing.
CV | Jan 24, 2022
SEN12MS-CR-TS: A Remote Sensing Data Set for Multi-modal Multi-temporal Cloud Removal
Patrick Ebel, Yajin Xu, Michael Schmitt et al.
About half of all optical observations collected via spaceborne satellites are affected by haze or clouds. Consequently, cloud coverage affects the remote sensing practitioner's capabilities of a continuous and seamless monitoring of our planet. This work addresses the challenge of optical satellite image reconstruction and cloud removal by proposing a novel multi-modal and multi-temporal data set called SEN12MS-CR-TS. We propose two models highlighting the benefits and use cases of SEN12MS-CR-TS: First, a multi-modal multi-temporal 3D-Convolution Neural Network that predicts a cloud-free image from a sequence of cloudy optical and radar images. Second, a sequence-to-sequence translation model that predicts a cloud-free time series from a cloud-covered time series. Both approaches are evaluated experimentally, with their respective models trained and tested on SEN12MS-CR-TS. The conducted experiments highlight the contribution of our data set to the remote sensing community as well as the benefits of multi-modal and multi-temporal information to reconstruct noisy information. Our data set is available at https://patrickTUM.github.io/cloud_removal
CV | Dec 12, 2021
Change Detection Meets Visual Question Answering
Zhenghang Yuan, Lichao Mou, Zhitong Xiong et al.
The Earth's surface is continually changing, and identifying changes plays an important role in urban planning and sustainability. Although change detection techniques have been successfully developed for many years, these techniques are still limited to experts and facilitators in related fields. In order to provide every user with flexible access to change information and help them better understand land-cover changes, we introduce a novel task: change detection-based visual question answering (CDVQA) on multi-temporal aerial images. In particular, multi-temporal images can be queried to obtain high-level change-based information according to content changes between two input images. We first build a CDVQA dataset including multi-temporal image-question-answer triplets using an automatic question-answer generation method. Then, a baseline CDVQA framework is devised in this work, and it contains four parts: multi-temporal feature encoding, multi-temporal fusion, multi-modal fusion, and answer prediction. In addition, we also introduce a change enhancing module to multi-temporal feature encoding, aiming at incorporating more change-related information. Finally, the effects of different backbones and multi-temporal fusion strategies on the performance of the CDVQA task are studied. The experimental results provide useful insights for developing better CDVQA models, which are important for future research on this task.
CV | Nov 22, 2021
Lebanon Solar Rooftop Potential Assessment using Buildings Segmentation from Aerial Images
Hasan Nasrallah, Abed Ellatif Samhat, Yilei Shi et al.
Estimating solar rooftop potential at a national level is a fundamental building block for every country to utilize solar power efficiently. Solar rooftop potential assessment relies on several features such as building geometry, location, and surrounding facilities. Hence, national-level approximations that do not take these factors into deep consideration are often inaccurate. This paper introduces Lebanon's first comprehensive footprint and solar rooftop potential maps using deep learning-based instance segmentation to extract buildings' footprints from satellite images. A photovoltaic panel placement algorithm that considers the morphology of each roof is proposed. We show that the average rooftop's solar potential can fulfill the yearly electric needs of a single-family residence while using only 5% of the roof surface. The usage of 50% of a residential apartment rooftop area would achieve energy security for up to 8 households. We also compute the average and total solar rooftop potential per district to localize regions corresponding to the highest and lowest solar rooftop potential yield. Factors such as size, ground coverage ratio and PV_out are carefully investigated for each district. Baalbeck district yielded the highest total solar rooftop potential despite its low built-up area, while Beirut, the capital city, has the highest average solar rooftop potential due to its extremely populated urban nature. Reported results and analysis reveal solar rooftop potential urban patterns and provide policymakers and key stakeholders with tangible insights. Lebanon's total solar rooftop potential is about 28.1 TWh/year, two times larger than the national energy consumption in 2019.
CV | Aug 15, 2021
SCIDA: Self-Correction Integrated Domain Adaptation from Single- to Multi-label Aerial Images
Tianze Yu, Jianzhe Lin, Lichao Mou et al.
Most publicly available datasets for image classification have single labels, while images are inherently multi-labeled in our daily life. Such an annotation gap makes many pre-trained single-label classification models fail in practical scenarios. This annotation issue is more concerning for aerial images: aerial data collected from sensors naturally cover a relatively large land area with multiple labels, while publicly available annotated aerial datasets (e.g., UCM, AID) are single-labeled. As manually annotating multi-label aerial images would be time- and labor-consuming, we propose a novel self-correction integrated domain adaptation (SCIDA) method for automatic multi-label learning. SCIDA is weakly supervised, i.e., it automatically learns the multi-label image classification model using massive, publicly available single-label images. To achieve this goal, we propose a novel Label-Wise self-Correction (LWC) module to better explore underlying label correlations. This module also makes the unsupervised domain adaptation (UDA) from single- to multi-label data possible. For model training, the proposed model uses only single-label information and requires no prior knowledge of multi-labeled data, yet it predicts labels for multi-label aerial images. In our experiments, trained with single-labeled MAI-AID-s and MAI-UCM-s datasets, the proposed model is tested directly on our collected Multi-scene Aerial Image (MAI) dataset.
CV | Jul 23, 2021
Tackling the Overestimation of Forest Carbon with Deep Learning and Aerial Imagery
Gyri Reiersen, David Dao, Björn Lütjens et al.
Forest carbon offsets are increasingly popular and can play a significant role in financing climate mitigation, forest conservation, and reforestation. Measuring how much carbon is stored in forests is, however, still largely done via expensive, time-consuming, and sometimes unaccountable field measurements. To overcome these limitations, many verification bodies are leveraging machine learning (ML) algorithms to estimate forest carbon from satellite or aerial imagery. Aerial imagery allows for tree species or family classification, which improves the satellite imagery-based forest type classification. However, aerial imagery is significantly more expensive to collect and it is unclear by how much the higher resolution improves the forest carbon estimation. This proposal paper describes the first systematic comparison of forest carbon estimation from aerial imagery, satellite imagery, and ground-truth field measurements via deep learning-based algorithms for a tropical reforestation project. Our initial results show that forest carbon estimates from satellite imagery can overestimate above-ground biomass by up to 10 times for tropical reforestation projects. The significant difference between aerial and satellite-derived forest carbon measurements shows the potential for aerial imagery-based ML algorithms and underscores the importance of extending this study to a global benchmark between options for carbon measurements.
IV | Sep 16, 2020
Multi-Sensor Data Fusion for Cloud Removal in Global and All-Season Sentinel-2 Imagery
Patrick Ebel, Andrea Meraner, Michael Schmitt et al.
This work has been accepted by IEEE TGRS for publication. The majority of optical observations acquired via spaceborne earth imagery are affected by clouds. While there is a large body of prior work on reconstructing cloud-covered information, previous studies are oftentimes confined to narrowly defined regions of interest, raising the question of whether an approach can generalize to a diverse set of observations acquired at variable cloud coverage or in different regions and seasons. We target the challenge of generalization by curating a large novel data set for training new cloud removal approaches and evaluate on two recently proposed performance metrics of image quality and diversity. Our data set is the first publicly available to contain a global sample of co-registered radar and optical observations, cloudy as well as cloud-free. Based on the observation that cloud coverage varies widely between clear skies and absolute coverage, we propose a novel model that can deal with either extreme and evaluate its performance on our proposed data set. Finally, we demonstrate the superiority of training models on real over synthetic data, underlining the need for a carefully curated data set of real observations. To facilitate future research, our data set is made available online.
CV | May 18, 2020
Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition
Di Hu, Xuhong Li, Lichao Mou et al.
Aerial scene recognition is a fundamental task in remote sensing and has recently received increased interest. While the visual information from overhead images with powerful models and efficient algorithms yields considerable performance on scene recognition, it still suffers from the variation of ground objects, lighting conditions, etc. Inspired by the multi-channel perception theory in cognition science, this paper explores a novel audiovisual aerial scene recognition task that uses both images and sounds as input to improve aerial scene recognition performance. Based on the observation that some specific sound events are more likely to be heard at a given geographic location, we propose to exploit the knowledge from the sound events to improve the performance of aerial scene recognition. For this purpose, we have constructed a new dataset named AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE). With the help of this dataset, we evaluate three proposed approaches for transferring the sound event knowledge to the aerial scene recognition task in a multimodal learning framework, and show the benefit of exploiting the audio information for aerial scene recognition. The source code is publicly available for reproducibility purposes.
CV | Nov 7, 2018
GeoSay: A Geometric Saliency for Extracting Buildings in Remote Sensing Images
Gui-Song Xia, Jin Huang, Nan Xue et al.
Automatic extraction of buildings in remote sensing images is an important but challenging task and finds many applications in different fields such as urban planning and navigation. This paper addresses the problem of building extraction in very high-spatial-resolution (VHSR) remote sensing (RS) images, whose spatial resolution is often up to half a meter and which provide rich information about buildings. Based on the observation that buildings in VHSR-RS images are always more distinguishable in geometry than in the texture or spectral domain, this paper proposes a geometric building index (GBI) for accurate building extraction, by computing the geometric saliency from VHSR-RS images. More precisely, given an image, the geometric saliency is derived from a mid-level geometric representation based on meaningful junctions that can locally describe geometrical structures of images. The resulting GBI is finally measured by integrating the derived geometric saliency of buildings. Experiments on three public and commonly used datasets demonstrate that the proposed GBI achieves state-of-the-art performance and shows impressive generalization capability. Additionally, GBI preserves both the exact position and accurate shape of single buildings compared to existing methods.
LG | Aug 15, 2018
Joint & Progressive Learning from High-Dimensional Data for Multi-Label Classification
Danfeng Hong, Naoto Yokoya, Jian Xu et al.
Despite the fact that nonlinear subspace learning techniques (e.g. manifold learning) have been successfully applied to data representation, there is still room for improvement in explainability (explicit mapping), generalization (out-of-samples), and cost-effectiveness (linearization). To this end, a novel linearized subspace learning technique is developed in a joint and progressive way, called joint and progressive learning strategy (J-Play), with its application to multi-label classification. J-Play learns high-level and semantically meaningful feature representation from high-dimensional data by 1) jointly performing multiple subspace learning and classification to find a latent subspace where samples are expected to be better classified; 2) progressively learning multi-coupled projections to linearly approach the optimal mapping bridging the original space with the most discriminative subspace; 3) locally embedding manifold structure in each learnable latent subspace. Extensive experiments are performed to demonstrate the superiority and effectiveness of the proposed method in comparison with previous state-of-the-art methods.