LGOct 15, 2022Code
Augmentation-Free Graph Contrastive Learning of Invariant-Discriminative RepresentationsHaifeng Li, Jun Cao, Jiawei Zhu et al.
The pretasks are mainly built on mutual information estimation, which requires data augmentation to construct positive samples with similar semantics to learn invariant signals and negative samples with dissimilar semantics in order to empower representation discriminability. However, an appropriate data augmentation configuration depends heavily on lots of empirical trials such as choosing the compositions of data augmentation techniques and the corresponding hyperparameter settings. We propose an augmentation-free graph contrastive learning method, invariant-discriminative graph contrastive learning (iGCL), that does not intrinsically require negative samples. iGCL designs the invariant-discriminative loss (ID loss) to learn invariant and discriminative representations. On the one hand, ID loss learns invariant signals by directly minimizing the mean square error between the target samples and positive samples in the representation space. On the other hand, ID loss ensures that the representations are discriminative by an orthonormal constraint forcing the different dimensions of representations to be independent of each other. This prevents representations from collapsing to a point or subspace. Our theoretical analysis explains the effectiveness of ID loss from the perspectives of the redundancy reduction criterion, canonical correlation analysis, and information bottleneck principle. The experimental results demonstrate that iGCL outperforms all baselines on 5 node classification benchmark datasets. iGCL also shows superior performance for different label ratios and is capable of resisting graph attacks, which indicates that iGCL has excellent generalization and robustness. The source code is available at https://github.com/lehaifeng/T-GCN/tree/master/iGCL.
CVApr 15
Seedance 2.0: Advancing Video Generation for World ComplexityTeam Seedance, De Chen, Liyang Chen et al. · gatech
Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large-scale architecture for multi-modal audio-video joint generation. This allows it to support four input modalities: text, image, audio, and video, by integrating one of the most comprehensive suites of multi-modal content reference and editing capabilities available in the industry to date. It delivers substantial, well-rounded improvements across all key sub-dimensions of video and audio generation. In both expert evaluations and public user tests, the model has demonstrated performance on par with the leading levels in the field. Seedance 2.0 supports direct generation of audio-video content with durations ranging from 4 to 15 seconds, with native output resolutions of 480p and 720p. For multi-modal inputs as reference, its current open platform supports up to 3 video clips, 9 images, and 3 audio clips. In addition, we provide Seedance 2.0 Fast version, an accelerated variant of Seedance 2.0 designed to boost generation speed for low-latency scenarios. Seedance 2.0 has delivered significant improvements to its foundational generation capabilities and multi-modal generation performance, bringing an enhanced creative experience for end users.
CVNov 15, 2022Code
False: False Negative Samples Aware Contrastive Learning for Semantic Segmentation of High-Resolution Remote Sensing ImageZhaoyang Zhang, Xuying Wang, Xiaoming Mei et al.
The existing SSCL of RSI is built based on constructing positive and negative sample pairs. However, due to the richness of RSI ground objects and the complexity of the RSI contextual semantics, the same RSI patches have the coexistence and imbalance of positive and negative samples, which causing the SSCL pushing negative samples far away while pushing positive samples far away, and vice versa. We call this the sample confounding issue (SCI). To solve this problem, we propose a False negAtive sampLes aware contraStive lEarning model (FALSE) for the semantic segmentation of high-resolution RSIs. Since the SSCL pretraining is unsupervised, the lack of definable criteria for false negative sample (FNS) leads to theoretical undecidability, we designed two steps to implement the FNS approximation determination: coarse determination of FNS and precise calibration of FNS. We achieve coarse determination of FNS by the FNS self-determination (FNSD) strategy and achieve calibration of FNS by the FNS confidence calibration (FNCC) loss function. Experimental results on three RSI semantic segmentation datasets demonstrated that the FALSE effectively improves the accuracy of the downstream RSI semantic segmentation task compared with the current three models, which represent three different types of SSCL models. The mean Intersection-over-Union on ISPRS Potsdam dataset is improved by 0.7\% on average; on CVPR DGLC dataset is improved by 12.28\% on average; and on Xiangtan dataset this is improved by 1.17\% on average. This indicates that the SSCL model has the ability to self-differentiate FNS and that the FALSE effectively mitigates the SCI in self-supervised contrastive learning. The source code is available at https://github.com/GeoX-Lab/FALSE.
LGJun 28, 2023Code
GraSS: Contrastive Learning with Gradient Guided Sampling Strategy for Remote Sensing Image Semantic SegmentationZhaoyang Zhang, Zhen Ren, Chao Tao et al.
Self-supervised contrastive learning (SSCL) has achieved significant milestones in remote sensing image (RSI) understanding. Its essence lies in designing an unsupervised instance discrimination pretext task to extract image features from a large number of unlabeled images that are beneficial for downstream tasks. However, existing instance discrimination based SSCL suffer from two limitations when applied to the RSI semantic segmentation task: 1) Positive sample confounding issue; 2) Feature adaptation bias. It introduces a feature adaptation bias when applied to semantic segmentation tasks that require pixel-level or object-level features. In this study, We observed that the discrimination information can be mapped to specific regions in RSI through the gradient of unsupervised contrastive loss, these specific regions tend to contain singular ground objects. Based on this, we propose contrastive learning with Gradient guided Sampling Strategy (GraSS) for RSI semantic segmentation. GraSS consists of two stages: Instance Discrimination warm-up (ID warm-up) and Gradient guided Sampling contrastive training (GS training). The ID warm-up aims to provide initial discrimination information to the contrastive loss gradients. The GS training stage aims to utilize the discrimination information contained in the contrastive loss gradients and adaptively select regions in RSI patches that contain more singular ground objects, in order to construct new positive and negative samples. Experimental results on three open datasets demonstrate that GraSS effectively enhances the performance of SSCL in high-resolution RSI semantic segmentation. Compared to seven baseline methods from five different types of SSCL, GraSS achieves an average improvement of 1.57\% and a maximum improvement of 3.58\% in terms of mean intersection over the union. The source code is available at https://github.com/GeoX-Lab/GraSS
CVSep 15, 2024Code
Towards Single-Lens Controllable Depth-of-Field Imaging via Depth-Aware Point Spread FunctionsXiaolong Qian, Qi Jiang, Yao Gao et al.
Controllable Depth-of-Field (DoF) imaging commonly produces amazing visual effects based on heavy and expensive high-end lenses. However, confronted with the increasing demand for mobile scenarios, it is desirable to achieve a lightweight solution with Minimalist Optical Systems (MOS). This work centers around two major limitations of MOS, i.e., the severe optical aberrations and uncontrollable DoF, for achieving single-lens controllable DoF imaging via computational methods. A Depth-aware Controllable DoF Imaging (DCDI) framework is proposed equipped with All-in-Focus (AiF) aberration correction and monocular depth estimation, where the recovered image and corresponding depth map are utilized to produce imaging results under diverse DoFs of any high-end lens via patch-wise convolution. To address the depth-varying optical degradation, we introduce a Depth-aware Degradation-adaptive Training (DA2T) scheme. At the dataset level, a Depth-aware Aberration MOS (DAMOS) dataset is established based on the simulation of Point Spread Functions (PSFs) under different object distances. Additionally, we design two plug-and-play depth-aware mechanisms to embed depth information into the aberration image recovery for better tackling depth-aware degradation. Furthermore, we propose a storage-efficient Omni-Lens-Field model to represent the 4D PSF library of various lenses. With the predicted depth map, recovered image, and depth-aware PSF map inferred by Omni-Lens-Field, single-lens controllable DoF imaging is achieved. Comprehensive experimental results demonstrate that the proposed framework enhances the recovery performance, and attains impressive single-lens controllable DoF imaging results, providing a seminal baseline for this field. The source code and the established dataset will be publicly available at https://github.com/XiaolongQian/DCDI.
CVNov 15, 2022
Self-supervised remote sensing feature learning: Learning Paradigms, Challenges, and Future WorksChao Tao, Ji Qi, Mingning Guo et al.
Deep learning has achieved great success in learning features from massive remote sensing images (RSIs). To better understand the connection between feature learning paradigms (e.g., unsupervised feature learning (USFL), supervised feature learning (SFL), and self-supervised feature learning (SSFL)), this paper analyzes and compares them from the perspective of feature learning signals, and gives a unified feature learning framework. Under this unified framework, we analyze the advantages of SSFL over the other two learning paradigms in RSIs understanding tasks and give a comprehensive review of the existing SSFL work in RS, including the pre-training dataset, self-supervised feature learning signals, and the evaluation methods. We further analyze the effect of SSFL signals and pre-training data on the learned features to provide insights for improving the RSI feature learning. Finally, we briefly discuss some open problems and possible research directions.
CVApr 10, 2022
TOV: The Original Vision Model for Optical Remote Sensing Image Understanding via Self-supervised LearningChao Tao, Ji Qia, Guo Zhang et al.
Do we on the right way for remote sensing image understanding (RSIU) by training models via supervised data-dependent and task-dependent way, instead of human vision in a label-free and task-independent way? We argue that a more desirable RSIU model should be trained with intrinsic structure from data rather that extrinsic human labels to realize generalizability across a wide range of RSIU tasks. According to this hypothesis, we proposed \textbf{T}he \textbf{O}riginal \textbf{V}ision model (TOV) in remote sensing filed. Trained by massive unlabeled optical data along a human-like self-supervised learning (SSL) path that is from general knowledge to specialized knowledge, TOV model can be easily adapted to various RSIU tasks, including scene classification, object detection, and semantic segmentation, and outperforms dominant ImageNet supervised pretrained method as well as two recently proposed SSL pretrained methods on majority of 12 publicly available benchmarks. Moreover, we analyze the influences of two key factors on the performance of building TOV model for RSIU, including the influence of using different data sampling methods and the selection of learning paths during self-supervised optimization. We believe that a general model which is trained by a label-free and task-independent way may be the next paradigm for RSIU and hope the insights distilled from this study can help to foster the development of an original vision model for RSIU.
CVMay 19Code
iGSP:Implicit Gradient Subspace Projection for Efficient Continual Learning of Vision-Language ModelsXuezhi Cui, Dongbo Zhou, Wang Guo et al.
Vision-Language Models require efficient adaptation to continually emerging downstream tasks. While Parameter-Efficient Fine-Tuning mitigates catastrophic forgetting, assigning isolated modules per task leads to parameter explosion. Conversely, recent similarity-driven sharing mechanisms falsely equate superficial visual similarity with underlying alignment consistency. This fundamental mismatch triggers severe negative transfer between visually similar but logically distinct tasks and fails to exploit alignment reuse across visually diverse ones. We argue thatalignment sharing is fundamentally a geometric problem of overlapping optimization trajectories within shared low-rank subspaces. Grounded in this insight, we propose iGSP, a novel framework that achieves efficient adaptation via implicit gradient subspace projection. Leveraging the early convergence of MoE routers to establish the subspace basis, iGSP bifurcates the adaptation process into two phases. First, the Subspace Identification phase introduces candidate experts via basis pre-expansion, applies a novel subspace-constrained regularization to implicitly project new task gradients onto the historical subspace, and precisely prunes redundant dimensions by treating routing probabilities as gradient flow indicators, ultimately to maximize knowledge reuse. Second, the Orthogonal Subspace Fine-Tuning phase fixes this structural basis and removes the regularization to rapidly fit the task-specific residual loss. Extensive experiments on the MTIL benchmark demonstrate that iGSP achieves state-of-the-art accuracy while significantly improving training efficiency, reducing the average trainable parameters by 42.7\% compared to current SOTA methods, and decreasing the final total parameters by 86.9\% relative to counterparts. The source code is available at https://github.com/GeoX-Lab/iGSP.
SYMar 22, 2019
Bayesian Estimation Based Parameter Estimation for Composite LoadChang Fu, Zhe Yu, Di Shi et al.
Accurate identification of parameters of load models is essential in power system computations, including simulation, prediction, and stability and reliability analysis. Conventional point estimation based composite load modeling approaches suffer from disturbances and noises and provide limited information of the system dynamics. In this work, a statistic (Bayesian Estimation) based distribution estimation approach is proposed for both static (ZIP) and dynamic (Induction Motor) load modeling. When dealing with multiple parameters, Gibbs sampling method is employed. In each iteration, the proposal samples each parameter while keeps others fixed. The proposed method provides a distribution estimation of load models coefficients and is robust to measurement errors.
LGOct 30, 2022
STGC-GNNs: A GNN-based traffic prediction framework with a spatial-temporal Granger causality graphSilu He, Qinyao Luo, Ronghua Du et al.
The key to traffic prediction is to accurately depict the temporal dynamics of traffic flow traveling in a road network, so it is important to model the spatial dependence of the road network. The essence of spatial dependence is to accurately describe how traffic information transmission is affected by other nodes in the road network, and the GNN-based traffic prediction model, as a benchmark for traffic prediction, has become the most common method for the ability to model spatial dependence by transmitting traffic information with the message passing mechanism. However, existing methods model a local and static spatial dependence, which cannot transmit the global-dynamic traffic information (GDTi) required for long-term prediction. The challenge is the difficulty of detecting the precise transmission of GDTi due to the uncertainty of individual transport, especially for long-term transmission. In this paper, we propose a new hypothesis\: GDTi behaves macroscopically as a transmitting causal relationship (TCR) underlying traffic flow, which remains stable under dynamic changing traffic flow. We further propose spatial-temporal Granger causality (STGC) to express TCR, which models global and dynamic spatial dependence. To model global transmission, we model the causal order and causal lag of TCRs global transmission by a spatial-temporal alignment algorithm. To capture dynamic spatial dependence, we approximate the stable TCR underlying dynamic traffic flow by a Granger causality test. The experimental results on three backbone models show that using STGC to model the spatial dependence has better results than the original model for 45 min and 1 h long-term prediction.
CLAug 13, 2024Code
IFShip: Interpretable Fine-grained Ship Classification with Domain Knowledge-Enhanced Vision-Language ModelsMingning Guo, Mengwei Wu, Yuxiang Shen et al.
End-to-end interpretation currently dominates the remote sensing fine-grained ship classification (RS-FGSC) task. However, the inference process remains uninterpretable, leading to criticisms of these models as "black box" systems. To address this issue, we propose a domain knowledge-enhanced Chain-of-Thought (CoT) prompt generation mechanism, which is used to semi-automatically construct a task-specific instruction-following dataset, TITANIC-FGS. By training on TITANIC-FGS, we adapt general-domain vision-language models (VLMs) to the FGSC task, resulting in a model named IFShip. Building upon IFShip, we develop an FGSC visual chatbot that redefines the FGSC problem as a step-by-step reasoning task and conveys the reasoning process in natural language. Experimental results show that IFShip outperforms state-of-the-art FGSC algorithms in both interpretability and classification accuracy. Furthermore, compared to VLMs such as LLaVA and MiniGPT-4, IFShip demonstrates superior performance on the FGSC task. It provides an accurate chain of reasoning when fine-grained ship types are recognizable to the human eye and offers interpretable explanations when they are not. Our dataset is publicly available at: https://github.com/lostwolves/IFShip.
CVApr 14, 2023
BCE-Net: Reliable Building Footprints Change Extraction based on Historical Map and Up-to-Date Images using Contrastive LearningCheng Liao, Han Hu, Xuekun Yuan et al.
Automatic and periodic recompiling of building databases with up-to-date high-resolution images has become a critical requirement for rapidly developing urban environments. However, the architecture of most existing approaches for change extraction attempts to learn features related to changes but ignores objectives related to buildings. This inevitably leads to the generation of significant pseudo-changes, due to factors such as seasonal changes in images and the inclination of building façades. To alleviate the above-mentioned problems, we developed a contrastive learning approach by validating historical building footprints against single up-to-date remotely sensed images. This contrastive learning strategy allowed us to inject the semantics of buildings into a pipeline for the detection of changes, which is achieved by increasing the distinguishability of features of buildings from those of non-buildings. In addition, to reduce the effects of inconsistencies between historical building polygons and buildings in up-to-date images, we employed a deformable convolutional neural network to learn offsets intuitively. In summary, we formulated a multi-branch building extraction method that identifies newly constructed and removed buildings, respectively. To validate our method, we conducted comparative experiments using the public Wuhan University building change detection dataset and a more practical dataset named SI-BU that we established. Our method achieved F1 scores of 93.99% and 70.74% on the above datasets, respectively. Moreover, when the data of the public dataset were divided in the same manner as in previous related studies, our method achieved an F1 score of 94.63%, which surpasses that of the state-of-the-art method.
CVApr 19Code
RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images UnderstandingGaozhi Zhou, Hu He, Peng Shen et al.
Reinforcement learning (RL) post-training substantially improves remote sensing vision-language models (RS-VLMs). However, when handling complex remote sensing imagery (RSI) requiring exhaustive visual scanning, models tend to rely on localized salient cues for rapid inference. We term this RL-induced bias "perceptual inertia". Driven by reward maximization, models favor quick outcome fitting, leading to two limitations: cognitively, overreliance on specific features impedes complete evidence construction; operationally, models struggle to flexibly shift visual focus across tasks. To address this bias and encourage comprehensive visual evidence mining, we propose RS-HyRe-R1, a hybrid reward framework for RSI understanding. It introduces: (1) a spatial reasoning activation reward that enforces structured visual reasoning; (2) a perception correctness reward that provides adaptive quality anchors across RS tasks, ensuring accurate geometric and semantic alignment; and (3) a visual-semantic path evolution reward that penalizes repetitive reasoning and promotes exploration of complementary cues to build richer evidence chains. Experiments show RS-HyRe-R1 effectively mitigates "perceptual inertia", encouraging deeper, more diverse reasoning. With only 3B parameters, it achieves state-of-the-art performance on REC, OVD, and VQA tasks, outperforming models up to 7B parameters. It also demonstrates strong zero-shot generalization, surpassing the second-best model by 3.16%, 3.97%, and 2.72% on VQA, OVD, and REC, respectively. Code and datasets are available at https://github.com/geox-lab/RS-HyRe-R1.
CVMar 29, 2022
Image Segmentation with Adaptive Spatial Priors from Joint RegistrationHaifeng Li, Weihong Guo, Jun Liu et al.
Image segmentation is a crucial but challenging task that has many applications. In medical imaging for instance, intensity inhomogeneity and noise are common. In thigh muscle images, different muscles are closed packed together and there are often no clear boundaries between them. Intensity based segmentation models cannot separate one muscle from another. To solve such problems, in this work we present a segmentation model with adaptive spatial priors from joint registration. This model combines segmentation and registration in a unified framework to leverage their positive mutual influence. The segmentation is based on a modified Gaussian mixture model (GMM), which integrates intensity inhomogeneity and spacial smoothness. The registration plays the role of providing a shape prior. We adopt a modified sum of squared difference (SSD) fidelity term and Tikhonov regularity term for registration, and also utilize Gaussian pyramid and parametric method for robustness. The connection between segmentation and registration is guaranteed by the cross entropy metric that aims to make the segmentation map (from segmentation) and deformed atlas (from registration) as similar as possible. This joint framework is implemented within a constraint optimization framework, which leads to an efficient algorithm. We evaluate our proposed model on synthetic and thigh muscle MR images. Numerical results show the improvement as compared to segmentation and registration performed separately and other joint models.
LGAug 7, 2023
AdaER: An Adaptive Experience Replay Approach for Continual Lifelong LearningXingyu Li, Bo Tang, Haifeng Li
Continual lifelong learning is an machine learning framework inspired by human learning, where learners are trained to continuously acquire new knowledge in a sequential manner. However, the non-stationary nature of streaming training data poses a significant challenge known as catastrophic forgetting, which refers to the rapid forgetting of previously learned knowledge when new tasks are introduced. While some approaches, such as experience replay (ER), have been proposed to mitigate this issue, their performance remains limited, particularly in the class-incremental scenario which is considered natural and highly challenging. In this paper, we present a novel algorithm, called adaptive-experience replay (AdaER), to address the challenge of continual lifelong learning. AdaER consists of two stages: memory replay and memory update. In the memory replay stage, AdaER introduces a contextually-cued memory recall (C-CMR) strategy, which selectively replays memories that are most conflicting with the current input data in terms of both data and task. Additionally, AdaER incorporates an entropy-balanced reservoir sampling (E-BRS) strategy to enhance the performance of the memory buffer by maximizing information entropy. To evaluate the effectiveness of AdaER, we conduct experiments on established supervised continual lifelong learning benchmarks, specifically focusing on class-incremental learning scenarios. The results demonstrate that AdaER outperforms existing continual lifelong learning baselines, highlighting its efficacy in mitigating catastrophic forgetting and improving learning performance.
CVDec 8, 2025Code
Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task PromptsMingning Guo, Mengwei Wu, Shaoxian Li et al.
Existing image perception methods based on VLMs generally follow a paradigm wherein models extract and analyze image content based on user-provided textual task prompts. However, such methods face limitations when applied to UAV imagery, which presents challenges like target confusion, scale variations, and complex backgrounds. These challenges arise because VLMs' understanding of image content depends on the semantic alignment between visual and textual tokens. When the task prompt is simplistic and the image content is complex, achieving effective alignment becomes difficult, limiting the model's ability to focus on task-relevant information. To address this issue, we introduce AerialVP, the first agent framework for task prompt enhancement in UAV image perception. AerialVP proactively extracts multi-dimensional auxiliary information from UAV images to enhance task prompts, overcoming the limitations of traditional VLM-based approaches. Specifically, the enhancement process includes three stages: (1) analyzing the task prompt to identify the task type and enhancement needs, (2) selecting appropriate tools from the tool repository, and (3) generating enhanced task prompts based on the analysis and selected tools. To evaluate AerialVP, we introduce AerialSense, a comprehensive benchmark for UAV image perception that includes Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding tasks. AerialSense provides a standardized basis for evaluating model generalization and performance across diverse resolutions, lighting conditions, and both urban and natural scenes. Experimental results demonstrate that AerialVP significantly enhances task prompt guidance, leading to stable and substantial performance improvements in both open-source and proprietary VLMs. Our work will be available at https://github.com/lostwolves/AerialVP.
SPNov 8, 2017
An Extended Kalman Filter Enhanced Hilbert-Huang Transform in Oscillation DetectionZhe Yu, Di Shi, Haifeng Li et al.
Hilbert-Huang transform (HHT) has drawn great attention in power system analysis due to its capability to deal with dynamic signal and provide instantaneous characteristics such as frequency, damping, and amplitudes. However, its shortcomings, including mode mixing and end effects, are as significant as its advantages. A preliminary result of an extended Kalman filter (EKF) method to enhance HHT and hopefully to overcome these disadvantages is presented in this paper. The proposal first removes dynamic DC components in signals using empirical mode decomposition. Then an EKF model is applied to extract instant coefficients. Numerical results using simulated and real-world low-frequency oscillation data suggest the proposal can help to overcome the mode mixing and end effects with a properly chosen number of modes.
LGMar 25, 2024Code
LSTTN: A Long-Short Term Transformer-based Spatio-temporal Neural Network for Traffic Flow ForecastingQinyao Luo, Silu He, Xing Han et al.
Accurate traffic forecasting is a fundamental problem in intelligent transportation systems and learning long-range traffic representations with key information through spatiotemporal graph neural networks (STGNNs) is a basic assumption of current traffic flow prediction models. However, due to structural limitations, existing STGNNs can only utilize short-range traffic flow data; therefore, the models cannot adequately learn the complex trends and periodic features in traffic flow. Besides, it is challenging to extract the key temporal information from the long historical traffic series and obtain a compact representation. To solve the above problems, we propose a novel LSTTN (Long-Short Term Transformer-based Network) framework comprehensively considering the long- and short-term features in historical traffic flow. First, we employ a masked subseries Transformer to infer the content of masked subseries from a small portion of unmasked subseries and their temporal context in a pretraining manner, forcing the model to efficiently learn compressed and contextual subseries temporal representations from long historical series. Then, based on the learned representations, long-term trend is extracted by using stacked 1D dilated convolution layers, and periodic features are extracted by dynamic graph convolution layers. For the difficulties in making time-step level prediction, LSTTN adopts a short-term trend extractor to learn fine-grained short-term temporal features. Finally, LSTTN fuses the long-term trend, periodic features and short-term features to obtain the prediction results. Experiments on four real-world datasets show that in 60-minute-ahead long-term forecasting, the LSTTN model achieves a minimum improvement of 5.63\% and a maximum improvement of 16.78\% over baseline models. The source code is available at https://github.com/GeoX-Lab/LSTTN.
AIMar 16Code
RS-WorldModel: a Unified Model for Remote Sensing Understanding and Future Sense ForecastingLinrui Xu, Zhongan Wang, Fei Shen et al.
Remote sensing world models aim to both explain observed changes and forecast plausible futures, two tasks that share spatiotemporal priors. Existing methods, however, typically address them separately, limiting cross-task transfer. We present RS-WorldModel, a unified world model for remote sensing that jointly handles spatiotemporal change understanding and text-guided future scene forecasting, and we build RSWBench-1.1M, a 1.1 million sample dataset with rich language annotations covering both tasks. RS-WorldModel is trained in three stages: (1) Geo-Aware Generative Pre-training (GAGP) conditions forecasting on geographic and acquisition metadata; (2) synergistic instruction tuning (SIT) jointly trains understanding and forecasting; (3) verifiable reinforcement optimization (VRO) refines outputs with verifiable, task-specific rewards. With only 2B parameters, RS-WorldModel surpasses open-source models up to 120$ \times $ larger on most spatiotemporal change question-answering metrics. It achieves an FID of 43.13 on text-guided future scene forecasting, outperforming all open-source baselines as well as the closed-source Gemini-2.5-Flash Image (Nano Banana).
LGDec 8, 2022
Alleviating neighbor bias: augmenting graph self-supervise learning with structural equivalent positive samplesJiawei Zhu, Mei Hong, Ronghua Du et al.
In recent years, using a self-supervised learning framework to learn the general characteristics of graphs has been considered a promising paradigm for graph representation learning. The core of self-supervised learning strategies for graph neural networks lies in constructing suitable positive sample selection strategies. However, existing GNNs typically aggregate information from neighboring nodes to update node representations, leading to an over-reliance on neighboring positive samples, i.e., homophilous samples; while ignoring long-range positive samples, i.e., positive samples that are far apart on the graph but structurally equivalent samples, a problem we call "neighbor bias." This neighbor bias can reduce the generalization performance of GNNs. In this paper, we argue that the generalization properties of GNNs should be determined by combining homogeneous samples and structurally equivalent samples, which we call the "GC combination hypothesis." Therefore, we propose a topological signal-driven self-supervised method. It uses a topological information-guided structural equivalence sampling strategy. First, we extract multiscale topological features using persistent homology. Then we compute the structural equivalence of node pairs based on their topological features. In particular, we design a topological loss function to pull in non-neighboring node pairs with high structural equivalence in the representation space to alleviate neighbor bias. Finally, we use the joint training mechanism to adjust the effect of structural equivalence on the model to fit datasets with different characteristics. We conducted experiments on the node classification task across seven graph datasets. The results show that the model performance can be effectively improved using a strategy of topological signal enhancement.
CVOct 16, 2023
Using Global Land Cover Product as Prompt for Cropland Mapping via Visual Foundation ModelChao Tao, Aoran Hu, Rong Xiao et al.
Data-driven deep learning methods have shown great potential in cropland mapping. However, due to multiple factors such as attributes of cropland (topography, climate, crop type) and imaging conditions (viewing angle, illumination, scale), croplands under different scenes demonstrate a great domain gap. This makes it difficult for models trained in the specific scenes to directly generalize to other scenes. A common way to handle this problem is through the "Pretrain+Fine-tuning" paradigm. Unfortunately, considering the variety of features of cropland that are affected by multiple factors, it is hardly to handle the complex domain gap between pre-trained data and target data using only sparse fine-tuned samples as general constraints. Moreover, as the number of model parameters grows, fine-tuning is no longer an easy and low-cost task. With the emergence of prompt learning via visual foundation models, the "Pretrain+Prompting" paradigm redesigns the optimization target by introducing individual prompts for each single sample. This simplifies the domain adaption from generic to specific scenes during model reasoning processes. Therefore, we introduce the "Pretrain+Prompting" paradigm to interpreting cropland scenes and design the auto-prompting (APT) method based on freely available global land cover product. It can achieve a fine-grained adaptation process from generic scenes to specialized cropland scenes without introducing additional label costs. To our best knowledge, this work pioneers the exploration of the domain adaption problems for cropland mapping under prompt learning perspectives. Our experiments using two sub-meter cropland datasets from southern and northern China demonstrated that the proposed method via visual foundation models outperforms traditional supervised learning and fine-tuning approaches in the field of remote sensing.
CVOct 23, 2023
MSFormer: A Skeleton-multiview Fusion Method For Tooth Instance SegmentationYuan Li, Huan Liu, Yubo Tao et al.
Recently, deep learning-based tooth segmentation methods have been limited by the expensive and time-consuming processes of data collection and labeling. Achieving high-precision segmentation with limited datasets is critical. A viable solution to this entails fine-tuning pre-trained multiview-based models, thereby enhancing performance with limited data. However, relying solely on two-dimensional (2D) images for three-dimensional (3D) tooth segmentation can produce suboptimal outcomes because of occlusion and deformation, i.e., incomplete and distorted shape perception. To improve this fine-tuning-based solution, this paper advocates 2D-3D joint perception. The fundamental challenge in employing 2D-3D joint perception with limited data is that the 3D-related inputs and modules must follow a lightweight policy instead of using huge 3D data and parameter-rich modules that require extensive training data. Following this lightweight policy, this paper selects skeletons as the 3D inputs and introduces MSFormer, a novel method for tooth segmentation. MSFormer incorporates two lightweight modules into existing multiview-based models: a 3D-skeleton perception module to extract 3D perception from skeletons and a skeleton-image contrastive learning module to obtain the 2D-3D joint perception by fusing both multiview and skeleton perceptions. The experimental results reveal that MSFormer paired with large pre-trained multiview models achieves state-of-the-art performance, requiring only 100 training meshes. Furthermore, the segmentation accuracy is improved by 2.4%-5.5% with the increasing volume of training data.
CVJan 30Code
FarmMind: Reasoning-Query-Driven Dynamic Segmentation for Farmland Remote Sensing ImagesHaiyang Wu, Weiliang Mu, Jipeng Zhang et al.
Existing methods for farmland remote sensing image (FRSI) segmentation generally follow a static segmentation paradigm, where analysis relies solely on the limited information contained within a single input patch. Consequently, their reasoning capability is limited when dealing with complex scenes characterized by ambiguity and visual uncertainty. In contrast, human experts, when interpreting remote sensing images in such ambiguous cases, tend to actively query auxiliary images (such as higher-resolution, larger-scale, or temporally adjacent data) to conduct cross-verification and achieve more comprehensive reasoning. Inspired by this, we propose a reasoning-query-driven dynamic segmentation framework for FRSIs, named FarmMind. This framework breaks through the limitations of the static segmentation paradigm by introducing a reasoning-query mechanism, which dynamically and on-demand queries external auxiliary images to compensate for the insufficient information in a single input image. Unlike direct queries, this mechanism simulates the thinking process of human experts when faced with segmentation ambiguity: it first analyzes the root causes of segmentation ambiguities through reasoning, and then determines what type of auxiliary image needs to be queried based on this analysis. Extensive experiments demonstrate that FarmMind achieves superior segmentation performance and stronger generalization ability compared with existing methods. The source code and dataset used in this work are publicly available at: https://github.com/WithoutOcean/FarmMind.
AIDec 31, 2023Code
AllSpark: A Multimodal Spatio-Temporal General Intelligence Model with Ten Modalities via Language as a Reference FrameworkRun Shao, Cheng Yang, Qiujun Li et al.
Leveraging multimodal data is an inherent requirement for comprehending geographic objects. However, due to the high heterogeneity in structure and semantics among various spatio-temporal modalities, the joint interpretation of multimodal spatio-temporal data has long been an extremely challenging problem. The primary challenge resides in striking a trade-off between the cohesion and autonomy of diverse modalities. This trade-off becomes progressively nonlinear as the number of modalities expands. Inspired by the human cognitive system and linguistic philosophy, where perceptual signals from the five senses converge into language, we introduce the Language as Reference Framework (LaRF), a fundamental principle for constructing a multimodal unified model. Building upon this, we propose AllSpark, a multimodal spatio-temporal general artificial intelligence model. Our model integrates ten different modalities into a unified framework. To achieve modal cohesion, AllSpark introduces a modal bridge and multimodal large language model (LLM) to map diverse modal features into the language feature space. To maintain modality autonomy, AllSpark uses modality-specific encoders to extract the tokens of various spatio-temporal modalities. Finally, observing a gap between the model's interpretability and downstream tasks, we designed modality-specific prompts and task heads, enhancing the model's generalization capability across specific tasks. Experiments indicate that the incorporation of language enables AllSpark to excel in few-shot classification tasks for RGB and point cloud modalities without additional training, surpassing baseline performance by up to 41.82\%. The source code is available at https://github.com/GeoX-Lab/AllSpark.
CVMay 18
UAVFF3D: A Geometry-Aware Benchmark for Feed-Forward UAV 3D ReconstructionXiang Yang, Yongli Wang, HaiFeng Li et al.
Feed-forward 3D reconstruction has recently demonstrated strong generalization across diverse scenes, yet its performance in UAV imagery remains underexplored due to distinctive acquisition geometries, large viewpoint variations, and ambiguity between horizontal field of view and flight height. We present UAVFF3D, a geometry-aware benchmark for feed-forward UAV 3D reconstruction, comprising over 170K real UAV images and more than 370K high-quality synthetic images. The benchmark also includes a challenging diagnostic test subset designed to analyze model behavior under UAV-specific geometric ambiguities.Building on UAVFF3D, we propose an evaluation protocol that jointly assesses camera-geometry estimation and reconstruction accuracy, addressing limitations of existing evaluations that rely on separate alignments. Experiments on four representative feed-forward reconstruction models show that UAV-domain adaptation substantially improves performance, reducing Ray Error by up to 84.2%, Pose ATE by up to 76.0%, and Chamfer Distance by up to 41.1%. Further analysis reveals that domain adaptation mitigates rotation-estimation degradation in oblique-view scenes and improves robustness under horizontal-field-of-view/height ambiguity. Incorporating camera priors further enhances reconstruction performance under UAV-specific acquisition geometries.
CVMar 27, 2024Code
Homogeneous Tokenizer Matters: Homogeneous Visual Tokenizer for Remote Sensing Image UnderstandingRun Shao, Zhaoyang Zhang, Chao Tao et al.
The tokenizer, as one of the fundamental components of large models, has long been overlooked or even misunderstood in visual tasks. One key factor of the great comprehension power of the large language model is that natural language tokenizers utilize meaningful words or subwords as the basic elements of language. In contrast, mainstream visual tokenizers, represented by patch-based methods such as Patch Embed, rely on meaningless rectangular patches as basic elements of vision, which cannot serve as effectively as words or subwords in language. Starting from the essence of the tokenizer, we defined semantically independent regions (SIRs) for vision. We designed a simple HOmogeneous visual tOKenizer: HOOK. HOOK mainly consists of two modules: the Object Perception Module (OPM) and the Object Vectorization Module (OVM). To achieve homogeneity, the OPM splits the image into 4*4 pixel seeds and then utilizes the attention mechanism to perceive SIRs. The OVM employs cross-attention to merge seeds within the same SIR. To achieve adaptability, the OVM defines a variable number of learnable vectors as cross-attention queries, allowing for the adjustment of token quantity. We conducted experiments on the NWPU-RESISC45, WHU-RS19 classification dataset, and GID5 segmentation dataset for sparse and dense tasks. The results demonstrate that the visual tokens obtained by HOOK correspond to individual objects, which demonstrates homogeneity. HOOK outperformed Patch Embed by 6\% and 10\% in the two tasks and achieved state-of-the-art performance compared to the baselines used for comparison. Compared to Patch Embed, which requires more than one hundred tokens for one image, HOOK requires only 6 and 8 tokens for sparse and dense tasks, respectively, resulting in efficiency improvements of 1.5 to 2.8 times. The code is available at https://github.com/GeoX-Lab/Hook.
LGDec 14, 2023Code
CAT: A Causally Graph Attention Network for Trimming Heterophilic GraphSilu He, Qinyao Luo, Xinsha Fu et al.
Local Attention-guided Message Passing Mechanism (LAMP) adopted in Graph Attention Networks (GATs) is designed to adaptively learn the importance of neighboring nodes for better local aggregation on the graph, which can bring the representations of similar neighbors closer effectively, thus showing stronger discrimination ability. However, existing GATs suffer from a significant discrimination ability decline in heterophilic graphs because the high proportion of dissimilar neighbors can weaken the self-attention of the central node, jointly resulting in the deviation of the central node from similar nodes in the representation space. This kind of effect generated by neighboring nodes is called the Distraction Effect (DE) in this paper. To estimate and weaken the DE of neighboring nodes, we propose a Causally graph Attention network for Trimming heterophilic graph (CAT). To estimate the DE, since the DE are generated through two paths (grab the attention assigned to neighbors and reduce the self-attention of the central node), we use Total Effect to model DE, which is a kind of causal estimand and can be estimated from intervened data; To weaken the DE, we identify the neighbors with the highest DE (we call them Distraction Neighbors) and remove them. We adopt three representative GATs as the base model within the proposed CAT framework and conduct experiments on seven heterophilic datasets in three different sizes. Comparative experiments show that CAT can improve the node classification accuracy of all base GAT models. Ablation experiments and visualization further validate the enhancement of discrimination ability brought by CAT. The source code is available at https://github.com/GeoX-Lab/CAT.
CVAug 25, 2024
PAM: A Propagation-Based Model for Segmenting Any 3D Objects across Multi-Modal Medical ImagesZifan Chen, Xinyu Nan, Jiazheng Li et al.
Volumetric segmentation is important in medical imaging, but current methods face challenges like requiring lots of manual annotations and being tailored to specific tasks, which limits their versatility. General segmentation models used for natural images don't perform well with the unique features of medical images. There's a strong need for an adaptable approach that can effectively handle different 3D medical structures and imaging modalities. In this study, we present PAM (Propagating Anything Model), a segmentation approach that uses a 2D prompt, like a bounding box or sketch, to create a complete 3D segmentation of medical image volumes. PAM works by modeling relationships between slices, maintaining information flow across the 3D structure. It combines a CNN-based UNet for processing within slices and a Transformer-based attention module for propagating information between slices, leading to better generalizability across various imaging modalities. PAM significantly outperformed existing models like MedSAM and SegVol, with an average improvement of over 18.1% in dice similarity coefficient (DSC) across 44 medical datasets and various object types. It also showed stable performance despite prompt deviations and different propagation setups, and faster inference speeds compared to other models. PAM's one-view prompt design made it more efficient, reducing interaction time by about 63.6% compared to two-view prompts. Thanks to its focus on structural relationships, PAM handled unseen and complex objects well, showing a unique ability to generalize to new situations. PAM represents an advancement in medical image segmentation, effectively reducing the need for extensive manual work and specialized training. Its adaptability makes it a promising tool for more automated and reliable analysis in clinical settings.
AIJan 27
LocationAgent: A Hierarchical Agent for Image Geolocation via Decoupling Strategy and Evidence from Parametric KnowledgeQiujun Li, Zijin Xiao, Xulin Wang et al.
Image geolocation aims to infer capture locations based on visual content. Fundamentally, this constitutes a reasoning process composed of \textit{hypothesis-verification cycles}, requiring models to possess both geospatial reasoning capabilities and the ability to verify evidence against geographic facts. Existing methods typically internalize location knowledge and reasoning patterns into static memory via supervised training or trajectory-based reinforcement fine-tuning. Consequently, these methods are prone to factual hallucinations and generalization bottlenecks in open-world settings or scenarios requiring dynamic knowledge. To address these challenges, we propose a Hierarchical Localization Agent, called LocationAgent. Our core philosophy is to retain hierarchical reasoning logic within the model while offloading the verification of geographic evidence to external tools. To implement hierarchical reasoning, we design the RER architecture (Reasoner-Executor-Recorder), which employs role separation and context compression to prevent the drifting problem in multi-step reasoning. For evidence verification, we construct a suite of clue exploration tools that provide diverse evidence to support location reasoning. Furthermore, to address data leakage and the scarcity of Chinese data in existing datasets, we introduce CCL-Bench (China City Location Bench), an image geolocation benchmark encompassing various scene granularities and difficulty levels. Extensive experiments demonstrate that LocationAgent significantly outperforms existing methods by at least 30\% in zero-shot settings.
CVJul 13, 2024
SeFi-CD: A Semantic First Change Detection Paradigm That Can Detect Any Change You WantLing Zhao, Zhenyang Huang, Dongsheng Kuang et al.
The existing change detection(CD) methods can be summarized as the visual-first change detection (ViFi-CD) paradigm, which first extracts change features from visual differences and then assigns them specific semantic information. However, CD is essentially dependent on change regions of interest (CRoIs), meaning that the CD results are directly determined by the semantics changes of interest, making its primary image factor semantic of interest rather than visual. The ViFi-CD paradigm can only assign specific semantics of interest to specific change features extracted from visual differences, leading to the inevitable omission of potential CRoIs and the inability to adapt to different CRoI CD tasks. In other words, changes in other CRoIs cannot be detected by the ViFi-CD method without retraining the model or significantly modifying the method. This paper introduces a new CD paradigm, the semantic-first CD (SeFi-CD) paradigm. The core idea of SeFi-CD is to first perceive the dynamic semantics of interest and then visually search for change features related to the semantics. Based on the SeFi-CD paradigm, we designed Anything You Want Change Detection (AUWCD). Experiments on public datasets demonstrate that the AUWCD outperforms the current state-of-the-art CD methods, achieving an average F1 score 5.01\% higher than that of these advanced supervised baselines on the SECOND dataset, with a maximum increase of 13.17\%. The proposed SeFi-CD offers a novel CD perspective and approach.
ROMay 23, 2025Code
BEDI: A Comprehensive Benchmark for Evaluating Embodied Agents on UAVsMingning Guo, Mengwei Wu, Jiarun He et al.
With the rapid advancement of low-altitude remote sensing and Vision-Language Models (VLMs), Embodied Agents based on Unmanned Aerial Vehicles (UAVs) have shown significant potential in autonomous tasks. However, current evaluation methods for UAV-Embodied Agents (UAV-EAs) remain constrained by the lack of standardized benchmarks, diverse testing scenarios and open system interfaces. To address these challenges, we propose BEDI (Benchmark for Embodied Drone Intelligence), a systematic and standardized benchmark designed for evaluating UAV-EAs. Specifically, we introduce a novel Dynamic Chain-of-Embodied-Task paradigm based on the perception-decision-action loop, which decomposes complex UAV tasks into standardized, measurable subtasks. Building on this paradigm, we design a unified evaluation framework encompassing five core sub-skills: semantic perception, spatial perception, motion control, tool utilization, and task planning. Furthermore, we construct a hybrid testing platform that integrates static real-world environments with dynamic virtual scenarios, enabling comprehensive performance assessment of UAV-EAs across varied contexts. The platform also offers open and standardized interfaces, allowing researchers to customize tasks and extend scenarios, thereby enhancing flexibility and scalability in the evaluation process. Finally, through empirical evaluations of several state-of-the-art (SOTA) VLMs, we reveal their limitations in embodied UAV tasks, underscoring the critical role of the BEDI benchmark in advancing embodied intelligence research and model optimization. By filling the gap in systematic and standardized evaluation within this field, BEDI facilitates objective model comparison and lays a robust foundation for future development in this field. Our benchmark will be released at https://github.com/lostwolves/BEDI .
AIMay 13
RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing AgentsLiangtian Liu, Zeyuan Wang, Ziyu Li et al.
The rise of multi-modal large language models (MLLMs) is shifting remote sensing (RS) intelligence from "see" to "action", as OpenClaw-style frameworks enable agents to autonomously operate massive RS image-processing tools for complex tasks. Existing RS agents adopt a passive selection paradigm for tool invocation, relying on either full tool registration (Flat) or retrieval-augmented generation (RAG). However, in the massive and multi-source heterogeneous RS tool ecosystem, such passive mechanisms struggle to dynamically balance "context load" and "toolset completeness" throughout task reasoning, thus exhibiting inherent limitations: full tool registration triggers context space deficits during long-horizon tasks, whereas RAG retrieval may omit critical tools in essential steps. To overcome these bottlenecks, this paper redefines tool selection by arguing that the agent should act as an active explorer within the tool space. Based on this perspective, we propose RS-Claw, a novel RS agent architecture. By leveraging Skill encapsulation technology at the tool end, this architecture hierarchically structures tool descriptions, enabling the agent to execute on-demand sequential decision-making: initially selecting relevant skill branches by reading only tool summaries, then dynamically loading detailed descriptions, and ultimately achieving precise invocation. This active paradigm not only significantly liberates the agent's context space but also effectively ensures the accurate hit rate of critical tools during long-horizon reasoning. Systematic experiments on the Earth-Bench benchmark demonstrate that RS-Claw's active exploration mechanism effectively filters semantic noise and substantially frees up reasoning space, achieving an input token compression ratio of up to 86%, and comprehensively outperforming existing Flat and RAG baselines across complex reasoning evaluations.
AIApr 15
GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial AnalysisBo Yu, Cheng Yang, Dongyang Hou et al.
The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and interactive evaluation benchmark tailored for tool-augmented GIS agents. GABench provides a realistic execution sandbox integrating 117 atomic GIS tools, encompassing 53 typical spatial analysis tasks across 6 core GIS domains. Recognizing that precise parameter configuration is the primary determinant of execution success in dynamic GIS environments, we designed the Parameter Execution Accuracy (PEA) metric, which utilizes a "Last-Attempt Alignment" strategy to quantify the fidelity of implicit parameter inference. Complementing this, a Vision-Language Model (VLM) based verification is proposed to assess data-spatial accuracy and cartographic style adherence. Furthermore, to address the frequent task failures caused by parameter misalignments and runtime anomalies, we developed a novel agent architecture, Plan-and-React, that mimics expert cognitive workflows by decoupling global orchestration from step-wise reactive execution. Extensive experiments with seven representative LLMs demonstrate that the Plan-and-React paradigm significantly outperforms traditional frameworks, achieving the optimal balance between logical rigor and execution robustness, particularly in multi-step reasoning and error recovery. Our findings highlight current capability boundaries and establish a robust standard for assessing and advancing the next generation of autonomous GeoAI.
CVOct 19, 2022
HAVANA: Hard negAtiVe sAmples aware self-supervised coNtrastive leArning for Airborne laser scanning point clouds semantic segmentationYunsheng Zhang, Jianguo Yao, Ruixiang Zhang et al.
Deep Neural Network (DNN) based point cloud semantic segmentation has presented significant achievements on large-scale labeled aerial laser point cloud datasets. However, annotating such large-scaled point clouds is time-consuming. Due to density variations and spatial heterogeneity of the Airborne Laser Scanning (ALS) point clouds, DNNs lack generalization capability and thus lead to unpromising semantic segmentation, as the DNN trained in one region underperform when directly utilized in other regions. However, Self-Supervised Learning (SSL) is a promising way to solve this problem by pre-training a DNN model utilizing unlabeled samples followed by a fine-tuned downstream task involving very limited labels. Hence, this work proposes a hard-negative sample aware self-supervised contrastive learning method to pre-train the model for semantic segmentation. The traditional contrastive learning for point clouds selects the hardest negative samples by solely relying on the distance between the embedded features derived from the learning process, potentially evolving some negative samples from the same classes to reduce the contrastive learning effectiveness. Therefore, we design an AbsPAN (Absolute Positive And Negative samples) strategy based on k-means clustering to filter the possible false-negative samples. Experiments on two typical ALS benchmark datasets demonstrate that the proposed method is more appealing than supervised training schemes without pre-training. Especially when the labels are severely inadequate (10% of the ISPRS training set), the results obtained by the proposed HAVANA method still exceed 94% of the supervised paradigm performance with full training set.
AIMay 10
The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?Zhaoyang Zhang, Run Shao, Dongyue Wu et al.
Understanding why independently trained neural networks from different modalities converge toward shared representations, and where this convergence leads, remains an open question in representation learning. All existing evidence relies on symmetric similarity measures, which can detect convergence but are structurally blind to its direction. We introduce directional convergence analysis using cycle-kNN, an asymmetric alignment measure, applied across dozens of independently trained unimodal models spanning point clouds, vision, and language. We uncover a consistent directional asymmetry: non-language modalities move toward the neighborhood structure of language significantly more than the reverse, and this pattern holds across all model families and scales--yet is entirely invisible to symmetric measures. Mechanistic analysis traces the directionality to feature density asymmetry, whereby language representations occupy the most compact regions of representational space. The Information Bottleneck framework provides a principled interpretation: optimization under compression drives representations toward discrete, compositional structures characteristic of language. We formalize this as the Wittgensteinian Representation Hypothesis: the semantic structure of language is the asymptotic attractor of multimodal representation convergence.
CVJul 2, 2025Code
A Gift from the Integration of Discriminative and Diffusion-based Generative Learning: Boundary Refinement Remote Sensing Semantic SegmentationHao Wang, Keyan Hu, Xin Guo et al.
Remote sensing semantic segmentation must address both what the ground objects are within an image and where they are located. Consequently, segmentation models must ensure not only the semantic correctness of large-scale patches (low-frequency information) but also the precise localization of boundaries between patches (high-frequency information). However, most existing approaches rely heavily on discriminative learning, which excels at capturing low-frequency features, while overlooking its inherent limitations in learning high-frequency features for semantic segmentation. Recent studies have revealed that diffusion generative models excel at generating high-frequency details. Our theoretical analysis confirms that the diffusion denoising process significantly enhances the model's ability to learn high-frequency features; however, we also observe that these models exhibit insufficient semantic inference for low-frequency features when guided solely by the original image. Therefore, we integrate the strengths of both discriminative and generative learning, proposing the Integration of Discriminative and diffusion-based Generative learning for Boundary Refinement (IDGBR) framework. The framework first generates a coarse segmentation map using a discriminative backbone model. This map and the original image are fed into a conditioning guidance network to jointly learn a guidance representation subsequently leveraged by an iterative denoising diffusion process refining the coarse segmentation. Extensive experiments across five remote sensing semantic segmentation datasets (binary and multi-class segmentation) confirm our framework's capability of consistent boundary refinement for coarse results from diverse discriminative architectures. The source code will be available at https://github.com/KeyanHu-git/IDGBR.
CVJun 25, 2025Code
SFNet: Fusion of Spatial and Frequency-Domain Features for Remote Sensing Image Forgery DetectionJi Qi, Xinchang Zhang, Dingqi Ye et al.
The rapid advancement of generative artificial intelligence is producing fake remote sensing imagery (RSI) that is increasingly difficult to detect, potentially leading to erroneous intelligence, fake news, and even conspiracy theories. Existing forgery detection methods typically rely on single visual features to capture predefined artifacts, such as spatial-domain cues to detect forged objects like roads or buildings in RSI, or frequency-domain features to identify artifacts from up-sampling operations in adversarial generative networks (GANs). However, the nature of artifacts can significantly differ depending on geographic terrain, land cover types, or specific features within the RSI. Moreover, these complex artifacts evolve as generative models become more sophisticated. In short, over-reliance on a single visual cue makes existing forgery detectors struggle to generalize across diverse remote sensing data. This paper proposed a novel forgery detection framework called SFNet, designed to identify fake images in diverse remote sensing data by leveraging spatial and frequency domain features. Specifically, to obtain rich and comprehensive visual information, SFNet employs two independent feature extractors to capture spatial and frequency domain features from input RSIs. To fully utilize the complementary domain features, the domain feature mapping module and the hybrid domain feature refinement module(CBAM attention) of SFNet are designed to successively align and fuse the multi-domain features while suppressing redundant information. Experiments on three datasets show that SFNet achieves an accuracy improvement of 4%-15.18% over the state-of-the-art RS forgery detection methods and exhibits robust generalization capabilities. The code is available at https://github.com/GeoX-Lab/RSTI/tree/main/SFNet.
CVJun 18, 2024Code
RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image UnderstandingLinrui Xu, Ling Zhao, Wang Guo et al.
The remote sensing image intelligence understanding model is undergoing a new profound paradigm shift which has been promoted by multi-modal large language model (MLLM), i.e. from the paradigm learning a domain model (LaDM) shifts to paradigm learning a pre-trained general foundation model followed by an adaptive domain model (LaGD). Under the new LaGD paradigm, the old datasets, which have led to advances in RSI intelligence understanding in the last decade, are no longer suitable for fire-new tasks. We argued that a new dataset must be designed to lighten tasks with the following features: 1) Generalization: training model to learn shared knowledge among tasks and to adapt to different tasks; 2) Understanding complex scenes: training model to understand the fine-grained attribute of the objects of interest, and to be able to describe the scene with natural language; 3) Reasoning: training model to be able to realize high-level visual reasoning. In this paper, we designed a high-quality, diversified, and unified multimodal instruction-following dataset for RSI understanding produced by GPT-4V and existing datasets, which we called RS-GPT4V. To achieve generalization, we used a (Question, Answer) which was deduced from GPT-4V via instruction-following to unify the tasks such as captioning and localization; To achieve complex scene, we proposed a hierarchical instruction description with local strategy in which the fine-grained attributes of the objects and their spatial relationships are described and global strategy in which all the local information are integrated to yield detailed instruction descript; To achieve reasoning, we designed multiple-turn QA pair to provide the reasoning ability for a model. The empirical results show that the fine-tuned MLLMs by RS-GPT4V can describe fine-grained information. The dataset is available at: https://github.com/GeoX-Lab/RS-GPT4V.
CVJan 8, 2025Code
Enhancing Scene Classification in Cloudy Image Scenarios: A Collaborative Transfer Method with Information Regulation Mechanism using Optical Cloud-Covered and SAR Remote Sensing ImagesYuze Wang, Rong Xiao, Haifeng Li et al.
In remote sensing scene classification, leveraging the transfer methods with well-trained optical models is an efficient way to overcome label scarcity. However, cloud contamination leads to optical information loss and significant impacts on feature distribution, challenging the reliability and stability of transferred target models. Common solutions include cloud removal for optical data or directly using Synthetic aperture radar (SAR) data in the target domain. However, cloud removal requires substantial auxiliary data for support and pre-training, while directly using SAR disregards the unobstructed portions of optical data. This study presents a scene classification transfer method that synergistically combines multi-modality data, which aims to transfer the source domain model trained on cloudfree optical data to the target domain that includes both cloudy optical and SAR data at low cost. Specifically, the framework incorporates two parts: (1) the collaborative transfer strategy, based on knowledge distillation, enables the efficient prior knowledge transfer across heterogeneous data; (2) the information regulation mechanism (IRM) is proposed to address the modality imbalance issue during transfer. It employs auxiliary models to measure the contribution discrepancy of each modality, and automatically balances the information utilization of modalities during the target model learning process at the sample-level. The transfer experiments were conducted on simulated and real cloud datasets, demonstrating the superior performance of the proposed method compared to other solutions in cloud-covered scenarios. We also verified the importance and limitations of IRM, and further discussed and visualized the modality imbalance problem during the model transfer. Codes are available at https://github.com/wangyuze-csu/ESCCS
CVMar 31, 2022Code
Deformation and Correspondence Aware Unsupervised Synthetic-to-Real Scene Flow Estimation for Point CloudsZhao Jin, Yinjie Lei, Naveed Akhtar et al.
Point cloud scene flow estimation is of practical importance for dynamic scene navigation in autonomous driving. Since scene flow labels are hard to obtain, current methods train their models on synthetic data and transfer them to real scenes. However, large disparities between existing synthetic datasets and real scenes lead to poor model transfer. We make two major contributions to address that. First, we develop a point cloud collector and scene flow annotator for GTA-V engine to automatically obtain diverse realistic training samples without human intervention. With that, we develop a large-scale synthetic scene flow dataset GTA-SF. Second, we propose a mean-teacher-based domain adaptation framework that leverages self-generated pseudo-labels of the target domain. It also explicitly incorporates shape deformation regularization and surface correspondence refinement to address distortions and misalignments in domain transfer. Through extensive experiments, we show that our GTA-SF dataset leads to a consistent boost in model generalization to three real datasets (i.e., Waymo, Lyft and KITTI) as compared to the most widely used FT3D dataset. Moreover, our framework achieves superior adaptation performance on six source-target dataset pairs, remarkably closing the average domain gap by 60%. Data and codes are available at https://github.com/leolyj/DCA-SRSFE
CVJun 20, 2021Code
Global and Local Contrastive Self-Supervised Learning for Semantic Segmentation of HR Remote Sensing ImagesHaifeng Li, Yi Li, Guo Zhang et al.
Supervised learning for semantic segmentation requires a large number of labeled samples, which is difficult to obtain in the field of remote sensing. Self-supervised learning (SSL), can be used to solve such problems by pre-training a general model with a large number of unlabeled images and then fine-tuning it on a downstream task with very few labeled samples. Contrastive learning is a typical method of SSL that can learn general invariant features. However, most existing contrastive learning methods are designed for classification tasks to obtain an image-level representation, which may be suboptimal for semantic segmentation tasks requiring pixel-level discrimination. Therefore, we propose a global style and local matching contrastive learning network (GLCNet) for remote sensing image semantic segmentation. Specifically, 1) the global style contrastive learning module is used to better learn an image-level representation, as we consider that style features can better represent the overall image features. 2) The local features matching contrastive learning module is designed to learn representations of local regions, which is beneficial for semantic segmentation. The experimental results show that our method mostly outperforms SOTA self-supervised methods and the ImageNet pre-training method. Specifically, with 1\% annotation from the original dataset, our approach improves Kappa by 6\% on the ISPRS Potsdam dataset relative to the existing baseline. Moreover, our method outperforms supervised learning methods when there are some differences between the datasets of upstream tasks and downstream tasks. Since SSL could directly learn the essential characteristics of data from unlabeled data, which is easy to obtain in the remote sensing field, this may be of great significance for tasks such as global mapping. The source code is available at https://github.com/GeoX-Lab/G-RSIM.
LGNov 26, 2020Code
KST-GCN: A Knowledge-Driven Spatial-Temporal Graph Convolutional Network for Traffic ForecastingJiawei Zhu, Xin Han, Hanhan Deng et al.
While considering the spatial and temporal features of traffic, capturing the impacts of various external factors on travel is an essential step towards achieving accurate traffic forecasting. However, existing studies seldom consider external factors or neglect the effect of the complex correlations among external factors on traffic. Intuitively, knowledge graphs can naturally describe these correlations. Since knowledge graphs and traffic networks are essentially heterogeneous networks, it is challenging to integrate the information in both networks. On this background, this study presents a knowledge representation-driven traffic forecasting method based on spatial-temporal graph convolutional networks. We first construct a knowledge graph for traffic forecasting and derive knowledge representations by a knowledge representation learning method named KR-EAR. Then, we propose the Knowledge Fusion Cell (KF-Cell) to combine the knowledge and traffic features as the input of a spatial-temporal graph convolutional backbone network. Experimental results on the real-world dataset show that our strategy enhances the forecasting performances of backbones at various prediction horizons. The ablation and perturbation analysis further verify the effectiveness and robustness of the proposed method. To the best of our knowledge, this is the first study that constructs and utilizes a knowledge graph to facilitate traffic forecasting; it also offers a promising direction to integrate external information and spatial-temporal information for traffic forecasting. The source code is available at https://github.com/lehaifeng/T-GCN/tree/master/KST-GCN.
LGJun 20, 2020Code
A3T-GCN: Attention Temporal Graph Convolutional Network for Traffic ForecastingJiawei Zhu, Yujiao Song, Ling Zhao et al.
Accurate real-time traffic forecasting is a core technological problem against the implementation of the intelligent transportation system. However, it remains challenging considering the complex spatial and temporal dependencies among traffic flows. In the spatial dimension, due to the connectivity of the road network, the traffic flows between linked roads are closely related. In terms of the temporal factor, although there exists a tendency among adjacent time points in general, the importance of distant past points is not necessarily smaller than that of recent past points since traffic flows are also affected by external factors. In this study, an attention temporal graph convolutional network (A3T-GCN) traffic forecasting method was proposed to simultaneously capture global temporal dynamics and spatial correlations. The A3T-GCN model learns the short-time trend in time series by using the gated recurrent units and learns the spatial dependence based on the topology of the road network through the graph convolutional network. Moreover, the attention mechanism was introduced to adjust the importance of different time points and assemble global temporal information to improve prediction accuracy. Experimental results in real-world datasets demonstrate the effectiveness and robustness of proposed A3T-GCN. The source code can be visited at https://github.com/lehaifeng/T-GCN/A3T.
CVApr 18, 2020Code
DAPnet: A Double Self-attention Convolutional Network for Point Cloud Semantic LabelingLi Chen, Zewei Xu, Yongjian Fu et al.
Airborne Laser Scanning (ALS) point clouds have complex structures, and their 3D semantic labeling has been a challenging task. It has three problems: (1) the difficulty of classifying point clouds around boundaries of objects from different classes, (2) the diversity of shapes within the same class, and (3) the scale differences between classes. In this study, we propose a novel double self-attention convolutional network called the DAPnet. The double self-attention includes the point attention module (PAM) and the group attention module (GAM). For problem (1), the PAM can effectively assign different weights based on the relevance of point clouds in adjacent areas. Meanwhile, for the problem (2), the GAM enhances the correlation between groups, i.e., grouped features within the same classes. To solve the problem (3), we adopt a multiscale radius to construct the groups and concatenate extracted hierarchical features with the output of the corresponding upsampling process. Under the ISPRS 3D Semantic Labeling Contest dataset, the DAPnet outperforms the benchmark by 85.2\% with an overall accuracy of 90.7\%. By conducting ablation comparisons, we find that the PAM effectively improves the model than the GAM. The incorporation of the double self-attention module has an average of 7\% improvement on the pre-class accuracy. Plus, the DAPnet consumes a similar training time to those without the attention modules for model convergence. The DAPnet can assign different weights to features based on the relevance between point clouds and their neighbors, which effectively improves classification performance. The source codes are available at: https://github.com/RayleighChen/point-attention.
CVMar 7, 2020Code
DASNet: Dual attentive fully convolutional siamese networks for change detection of high resolution satellite imagesJie Chen, Ziyang Yuan, Jian Peng et al.
Change detection is a basic task of remote sensing image processing. The research objective is to identity the change information of interest and filter out the irrelevant change information as interference factors. Recently, the rise of deep learning has provided new tools for change detection, which have yielded impressive results. However, the available methods focus mainly on the difference information between multitemporal remote sensing images and lack robustness to pseudo-change information. To overcome the lack of resistance of current methods to pseudo-changes, in this paper, we propose a new method, namely, dual attentive fully convolutional Siamese networks (DASNet) for change detection in high-resolution images. Through the dual-attention mechanism, long-range dependencies are captured to obtain more discriminant feature representations to enhance the recognition performance of the model. Moreover, the imbalanced sample is a serious problem in change detection, i.e. unchanged samples are much more than changed samples, which is one of the main reasons resulting in pseudo-changes. We put forward the weighted double margin contrastive loss to address this problem by punishing the attention to unchanged feature pairs and increase attention to changed feature pairs. The experimental results of our method on the change detection dataset (CDD) and the building change detection dataset (BCDD) demonstrate that compared with other baseline methods, the proposed method realizes maximum improvements of 2.1\% and 3.6\%, respectively, in the F1 score. Our Pytorch implementation is available at https://github.com/lehaifeng/DASNet.
CVDec 19, 2019Code
SCAttNet: Semantic Segmentation Network with Spatial and Channel Attention Mechanism for High-Resolution Remote Sensing ImagesHaifeng Li, Kaijian Qiu, Li Chen et al.
High-resolution remote sensing images (HRRSIs) contain substantial ground object information, such as texture, shape, and spatial location. Semantic segmentation, which is an important task for element extraction, has been widely used in processing mass HRRSIs. However, HRRSIs often exhibit large intraclass variance and small interclass variance due to the diversity and complexity of ground objects, thereby bringing great challenges to a semantic segmentation task. In this paper, we propose a new end-to-end semantic segmentation network, which integrates lightweight spatial and channel attention modules that can refine features adaptively. We compare our method with several classic methods on the ISPRS Vaihingen and Potsdam datasets. Experimental results show that our method can achieve better semantic segmentation results. The source codes are available at https://github.com/lehaifeng/SCAttNet.
LGDec 19, 2019Code
Overcoming Long-term Catastrophic Forgetting through Adversarial Neural Pruning and Synaptic ConsolidationJian Peng, Bo Tang, Hao Jiang et al.
Artificial neural networks face the well-known problem of catastrophic forgetting. What's worse, the degradation of previously learned skills becomes more severe as the task sequence increases, known as the long-term catastrophic forgetting. It is due to two facts: first, as the model learns more tasks, the intersection of the low-error parameter subspace satisfying for these tasks becomes smaller or even does not exist; second, when the model learns a new task, the cumulative error keeps increasing as the model tries to protect the parameter configuration of previous tasks from interference. Inspired by the memory consolidation mechanism in mammalian brains with synaptic plasticity, we propose a confrontation mechanism in which Adversarial Neural Pruning and synaptic Consolidation (ANPyC) is used to overcome the long-term catastrophic forgetting issue. The neural pruning acts as long-term depression to prune task-irrelevant parameters, while the novel synaptic consolidation acts as long-term potentiation to strengthen task-relevant parameters. During the training, this confrontation achieves a balance in that only crucial parameters remain, and non-significant parameters are freed to learn subsequent tasks. ANPyC avoids forgetting important information and makes the model efficient to learn a large number of tasks. Specifically, the neural pruning iteratively relaxes the current task's parameter conditions to expand the common parameter subspace of the task; the synaptic consolidation strategy, which consists of a structure-aware parameter-importance measurement and an element-wise parameter updating strategy, decreases the cumulative error when learning new tasks. The full source code is available at https://github.com/GeoX-Lab/ANPyC.
CVOct 26, 2019Code
MAP-Net: Multi Attending Path Neural Network for Building Footprint Extraction from Remote Sensed ImageryQing Zhu, Cheng Liao, Han Hu et al.
Accurately and efficiently extracting building footprints from a wide range of remote sensed imagery remains a challenge due to their complex structure, variety of scales and diverse appearances. Existing convolutional neural network (CNN)-based building extraction methods are complained that they cannot detect the tiny buildings because the spatial information of CNN feature maps are lost during repeated pooling operations of the CNN, and the large buildings still have inaccurate segmentation edges. Moreover, features extracted by a CNN are always partial which restricted by the size of the respective field, and large-scale buildings with low texture are always discontinuous and holey when extracted. This paper proposes a novel multi attending path neural network (MAP-Net) for accurately extracting multiscale building footprints and precise boundaries. MAP-Net learns spatial localization-preserved multiscale features through a multi-parallel path in which each stage is gradually generated to extract high-level semantic features with fixed resolution. Then, an attention module adaptively squeezes channel-wise features from each path for optimization, and a pyramid spatial pooling module captures global dependency for refining discontinuous building footprints. Experimental results show that MAP-Net outperforms state-of-the-art (SOTA) algorithms in boundary localization accuracy as well as continuity of large buildings. Specifically, our method achieved 0.68\%, 1.74\%, 1.46\% precision, and 1.50\%, 1.53\%, 0.82\% IoU score improvement without increasing computational complexity compared with the latest HRNetv2 on the Urban 3D, Deep Globe and WHU datasets, respectively. The TensorFlow implementation is available at https://github.com/lehaifeng/MAPNet.
CVDec 25, 2018Code
A Data-driven Adversarial Examples Recognition Framework via Adversarial Feature GenomeLi Chen, Qi Li, Weiye Chen et al.
Adversarial examples pose many security threats to convolutional neural networks (CNNs). Most defense algorithms prevent these threats by finding differences between the original images and adversarial examples. However, the found differences do not contain features about the classes, so these defense algorithms can only detect adversarial examples without recovering the correct labels. In this regard, we propose the Adversarial Feature Genome (AFG), a novel type of data that contains both the differences and features about classes. This method is inspired by an observed phenomenon, namely the Adversarial Feature Separability (AFS), where the difference between the feature maps of the original images and adversarial examples becomes larger with deeper layers. On top of that, we further develop an adversarial example recognition framework that detects adversarial examples and can recover the correct labels. In the experiments, the detection and classification of adversarial examples by AFGs has an accuracy of more than 90.01\% in various attack scenarios. To the best of our knowledge, our method is the first method that focuses on both attack detecting and recovering. AFG gives a new data-driven perspective to improve the robustness of CNNs. The source code is available at https://github.com/GeoX-Lab/Adv_Fea_Genome.
LGDec 4, 2018Code
Overcoming Catastrophic Forgetting by Soft Parameter PruningJian Peng, Jiang Hao, Zhuo Li et al.
Catastrophic forgetting is a challenge issue in continual learning when a deep neural network forgets the knowledge acquired from the former task after learning on subsequent tasks. However, existing methods try to find the joint distribution of parameters shared with all tasks. This idea can be questionable because this joint distribution may not present when the number of tasks increase. On the other hand, It also leads to "long-term" memory issue when the network capacity is limited since adding tasks will "eat" the network capacity. In this paper, we proposed a Soft Parameters Pruning (SPP) strategy to reach the trade-off between short-term and long-term profit of a learning model by freeing those parameters less contributing to remember former task domain knowledge to learn future tasks, and preserving memories about previous tasks via those parameters effectively encoding knowledge about tasks at the same time. The SPP also measures the importance of parameters by information entropy in a label free manner. The experiments on several tasks shows SPP model achieved the best performance compared with others state-of-the-art methods. Experiment results also indicate that our method is less sensitive to hyper-parameter and better generalization. Our research suggests that a softer strategy, i.e. approximate optimize or sub-optimal solution, will benefit alleviating the dilemma of memory. The source codes are available at https://github.com/lehaifeng/Learning_by_memory.