CV | Sep 27, 2022 | Code
SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery
Jiaqing Zhang, Jie Lei, Weiying Xie et al.
Accurate and timely detection of multiscale small objects that contain tens of pixels from remote sensing images (RSI) remains challenging. Most of the existing solutions primarily design complex deep neural networks to learn strong feature representations for objects separated from the background, which often results in a heavy computation burden. In this article, we propose an accurate yet fast object detection method for RSI, named SuperYOLO, which fuses multimodal data and performs high-resolution (HR) object detection on multiscale objects by utilizing the assisted super resolution (SR) learning and considering both the detection accuracy and computation cost. First, we utilize a symmetric compact multimodal fusion (MF) to extract supplementary information from various data for improving small object detection in RSI. Furthermore, we design a simple and flexible SR branch to learn HR feature representations that can discriminate small objects from vast backgrounds with low-resolution (LR) input, thus further improving the detection accuracy. Moreover, to avoid introducing additional computation, the SR branch is discarded in the inference stage, and the computation of the network model is reduced due to the LR input. Experimental results show that, on the widely used VEDAI RS dataset, SuperYOLO achieves an accuracy of 75.09% (in terms of mAP50), which is more than 10% higher than the SOTA large models, such as YOLOv5l, YOLOv5x, and RS-designed YOLOrs. Meanwhile, the parameter size and GFLOPs of SuperYOLO are about 18 times and 3.8 times less than YOLOv5x. Our proposed model shows a favorable accuracy and speed tradeoff compared to the state-of-the-art models. The code will be open-sourced at https://github.com/icey-zhang/SuperYOLO.
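The mAP50 figure quoted above counts a detection as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5. A minimal sketch of that matching test, assuming axis-aligned boxes given as (x1, y1, x2, y2); the function names are illustrative and are not taken from the SuperYOLO code:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle (empty if the boxes do not intersect).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, thr=0.5):
    """True when the prediction overlaps the ground truth enough for mAP50."""
    return box_iou(pred, gt) >= thr
```

For small objects of a few tens of pixels, a shift of only a few pixels can push the IoU below 0.5, which is one reason small-object detection is hard to score well on.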
CV | Dec 31, 2022 | Code
Guided Hybrid Quantization for Object detection in Multimodal Remote Sensing Imagery via One-to-one Self-teaching
Jiaqing Zhang, Jie Lei, Weiying Xie et al.
Considering the computation complexity, we propose a Guided Hybrid Quantization with One-to-one Self-Teaching (GHOST) framework. More concretely, we first design a structure called guided quantization self-distillation (GQSD), which is an innovative idea for achieving a lightweight design through the synergy of quantization and distillation. The training process of the quantization model is guided by its full-precision model, which is time-saving and cost-saving without preparing a huge pre-trained model in advance. Second, we put forward a hybrid quantization (HQ) module to obtain the optimal bit width automatically under a constrained condition where a threshold for distribution distance between the center and samples is applied in the weight value search space. Third, in order to improve information transformation, we propose a one-to-one self-teaching (OST) module to give the student network an ability of self-judgment. A switch control machine (SCM) builds a bridge between the student network and teacher network in the same location to help the teacher to reduce wrong guidance and impart vital knowledge to the student. This distillation method allows a model to learn from itself and gain substantial improvement without any additional supervision. Extensive experiments on a multimodal dataset (VEDAI) and single-modality datasets (DOTA, NWPU, and DIOR) show that object detection based on GHOST outperforms the existing detectors. The tiny parameters (<9.7 MB) and Bit-Operations (BOPs) (<2158 G) compared with any remote sensing-based, lightweight or distillation-based algorithms demonstrate the superiority in the lightweight design domain. Our code and model will be released at https://github.com/icey-zhang/GHOST.
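To make the role of bit width concrete, here is a minimal sketch of a generic symmetric uniform quantizer. It is not the GQSD or HQ module described above (the abstract does not specify them); `quantize_uniform` and its parameters are hypothetical names used only for illustration:

```python
def quantize_uniform(weights, bits):
    """Symmetric uniform quantization of a list of floats to `bits` bits.

    Returns the de-quantized weights and the step size (scale).
    """
    if bits < 2:
        raise ValueError("bits must be >= 2 for a signed symmetric grid")
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights), 1.0
    scale = max_abs / qmax                  # width of one quantization step
    # Round to the nearest integer level and clamp to the signed range.
    levels = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in levels], scale
```

Each weight moves by at most half a step, so fewer bits mean a larger step and a larger rounding error. A hybrid scheme would pick `bits` per layer rather than globally.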
CV | Jul 19, 2023 | Code
BSDM: Background Suppression Diffusion Model for Hyperspectral Anomaly Detection
Jitao Ma, Weiying Xie, Yunsong Li et al.
Hyperspectral anomaly detection (HAD) is widely used in Earth observation and deep space exploration. A major challenge for HAD is the complex background of the input hyperspectral images (HSIs), resulting in anomalies confused in the background. On the other hand, the lack of labeled samples for HSIs leads to poor generalization of existing HAD methods. This paper starts the first attempt to study a new and generalizable background learning problem without labeled samples. We present a novel solution BSDM (background suppression diffusion model) for HAD, which can simultaneously learn latent background distributions and generalize to different datasets for suppressing complex background. It is featured in three aspects: (1) For the complex background of HSIs, we design pseudo background noise and learn the potential background distribution in it with a diffusion model (DM). (2) For the generalizability problem, we apply a statistical offset module so that the BSDM adapts to datasets of different domains without labeling samples. (3) For achieving background suppression, we innovatively improve the inference process of DM by feeding the original HSIs into the denoising network, which removes the background as noise. Our work paves a new background suppression way for HAD that can improve HAD performance without the prerequisite of manually labeled data. Assessments and generalization experiments of four HAD methods on several real HSI datasets demonstrate the above three unique properties of the proposed method. The code is available at https://github.com/majitao-xd/BSDM-HAD.
CV | Mar 16, 2023 | Code
Contrastive Semi-supervised Learning for Underwater Image Restoration via Reliable Bank
Shirui Huang, Keyan Wang, Huan Liu et al.
Despite the remarkable achievement of recent underwater image restoration techniques, the lack of labeled data has become a major hurdle for further progress. In this work, we propose a mean-teacher based Semi-supervised Underwater Image Restoration (Semi-UIR) framework to incorporate the unlabeled data into network training. However, the naive mean-teacher method suffers from two main problems: (1) The consistency loss used in training might become ineffective when the teacher's prediction is wrong. (2) Using L1 distance may cause the network to overfit wrong labels, resulting in confirmation bias. To address the above problems, we first introduce a reliable bank to store the "best-ever" outputs as pseudo ground truth. To assess the quality of outputs, we conduct an empirical analysis based on the monotonicity property to select the most trustworthy NR-IQA method. Besides, in view of the confirmation bias problem, we incorporate contrastive regularization to prevent the overfitting on wrong labels. Experimental results on both full-reference and non-reference underwater benchmarks demonstrate that our algorithm has obvious improvement over SOTA methods quantitatively and qualitatively. Code has been released at https://github.com/Huang-ShiRui/Semi-UIR.
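The two mechanisms named above can be sketched in a few lines: the mean-teacher weight update and a bank that keeps the best output seen so far for each sample. This is a simplified illustration, assuming a scalar quality score where higher is better (the abstract selects an NR-IQA metric for this role); the class and function names are hypothetical and not taken from the Semi-UIR repository:

```python
def ema_update(teacher, student, decay=0.999):
    """Mean-teacher rule: teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


class ReliableBank:
    """Keeps, per sample id, the output with the best quality score so far."""

    def __init__(self):
        self._best = {}  # sample_id -> (score, output)

    def update(self, sample_id, output, score):
        """Store `output` if its score beats the stored one; return True if stored."""
        current = self._best.get(sample_id)
        if current is None or score > current[0]:
            self._best[sample_id] = (score, output)
            return True
        return False

    def pseudo_label(self, sample_id):
        """Best-ever output for this sample, used as pseudo ground truth."""
        return self._best[sample_id][1]
```

Because a worse teacher prediction never overwrites a better one, the pseudo ground truth can only improve under the chosen score.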
CV | Jul 10, 2024 | Code
IRSAM: Advancing Segment Anything Model for Infrared Small Target Detection
Mingjin Zhang, Yuchun Wang, Jie Guo et al.
The recent Segment Anything Model (SAM) is a significant advancement in natural image segmentation, exhibiting potent zero-shot performance suitable for various downstream image segmentation tasks. However, directly utilizing the pretrained SAM for the Infrared Small Target Detection (IRSTD) task falls short in achieving satisfying performance due to a notable domain gap between natural and infrared images. Unlike a visible light camera, a thermal imager reveals an object's temperature distribution by capturing infrared radiation. Small targets often show a subtle temperature transition at the object's boundaries. To address this issue, we propose the IRSAM model for IRSTD, which improves SAM's encoder-decoder architecture to learn better feature representation of infrared small objects. Specifically, we design a Perona-Malik diffusion (PMD)-based block and incorporate it into multiple levels of SAM's encoder to help it capture essential structural features while suppressing noise. Additionally, we devise a Granularity-Aware Decoder (GAD) to fuse the multi-granularity feature from the encoder to capture structural information that may be lost in long-distance modeling. Extensive experiments on the public datasets, including NUAA-SIRST, NUDT-SIRST, and IRSTD-1K, validate the design choice of IRSAM and its significant superiority over representative state-of-the-art methods. The source code is available at github.com/IPIC-Lab/IRSAM.
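Classical Perona-Malik diffusion smooths small fluctuations while leaving strong edges almost untouched, which is the property the PMD-based block relies on. A minimal sketch of one explicit update step on a 1-D signal, assuming the exponential edge-stopping function; it illustrates the classical scheme only and is not the learned block used in IRSAM:

```python
import math


def perona_malik_step(signal, kappa=0.1, lam=0.2):
    """One explicit Perona-Malik step on a 1-D signal with reflecting borders."""
    n = len(signal)
    out = []
    for i in range(n):
        left = signal[i - 1] if i > 0 else signal[i]
        right = signal[i + 1] if i < n - 1 else signal[i]
        d_left, d_right = left - signal[i], right - signal[i]
        # Conductance g(d) = exp(-(d / kappa)^2): close to 1 for small
        # differences (noise gets smoothed), close to 0 for large ones
        # (edges are preserved).
        g_left = math.exp(-(d_left / kappa) ** 2)
        g_right = math.exp(-(d_right / kappa) ** 2)
        out.append(signal[i] + lam * (g_left * d_left + g_right * d_right))
    return out
```

Here `kappa` sets which differences count as edges and `lam` is the step size; both values are illustrative defaults.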
CV | Sep 7, 2024 | Code
Unleashing the Power of Generic Segmentation Models: A Simple Baseline for Infrared Small Target Detection
Mingjin Zhang, Chi Zhang, Qiming Zhang et al.
Recent advancements in deep learning have greatly advanced the field of infrared small object detection (IRSTD). Despite their remarkable success, a notable gap persists between these IRSTD methods and generic segmentation approaches in natural image domains. This gap primarily arises from the significant modality differences and the limited availability of infrared data. In this study, we aim to bridge this divergence by investigating the adaptation of generic segmentation models, such as the Segment Anything Model (SAM), to IRSTD tasks. Our investigation reveals that many generic segmentation models can achieve comparable performance to state-of-the-art IRSTD methods. However, their full potential in IRSTD remains untapped. To address this, we propose a simple, lightweight, yet effective baseline model for segmenting small infrared objects. Through appropriate distillation strategies, we empower smaller student models to outperform state-of-the-art methods, even surpassing fine-tuned teacher results. Furthermore, we enhance the model's performance by introducing a novel query design comprising dense and sparse queries to effectively encode multi-scale features. Through extensive experimentation across four popular IRSTD datasets, our model demonstrates significantly improved performance in both accuracy and throughput compared to existing approaches, surpassing SAM and Semantic-SAM by over 14 IoU on NUDT and 4 IoU on IRSTD1k. The source code and models will be released at https://github.com/O937-blip/SimIR.
CV | Jul 23, 2024 | Code
FoRA: Low-Rank Adaptation Model beyond Multimodal Siamese Network
Weiying Xie, Yusi Zhang, Tianlin Hui et al.
Multimodal object detection offers a promising prospect to facilitate robust detection in various visual conditions. However, existing two-stream backbone networks are challenged by complex fusion and substantial parameter increments. This is primarily due to large data distribution biases of multimodal homogeneous information. In this paper, we propose a novel multimodal object detector, named Low-rank Modal Adaptors (LMA) with a shared backbone. The shared parameters enhance the consistency of homogeneous information, while lightweight modal adaptors focus on modality unique features. Furthermore, we design an adaptive rank allocation strategy to adapt to the varying heterogeneity at different feature levels. When applied to two multimodal object detection datasets, experiments validate the effectiveness of our method. Notably, on DroneVehicle, LMA attains a 10.4% accuracy improvement over the state-of-the-art method with a 149M-parameters reduction. The code is available at https://github.com/zyszxhy/FoRA. Our work was submitted to ACM MM in April 2024, but was rejected. We will continue to refine our work and paper writing next, mainly including proof of theory and multi-task applications of FoRA.
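The idea of a shared backbone plus lightweight modal adaptors can be illustrated with a generic low-rank update: a shared weight W is used by every modality, and each modality adds a rank-r correction B(Ax). This is a LoRA-style sketch under that assumption; the actual LMA design and the adaptive rank allocation are not detailed in the abstract, and all names below are hypothetical:

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def low_rank_adapter_forward(x, w_shared, a_down, b_up):
    """y = W x + B (A x): shared weight plus a rank-r, modality-specific update."""
    base = matvec(w_shared, x)                 # shared across modalities
    delta = matvec(b_up, matvec(a_down, x))    # modality-specific correction
    return [p + q for p, q in zip(base, delta)]


def adapter_param_count(d_out, d_in, rank):
    """Parameters in the pair A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)
```

For a 256 x 256 layer, a rank-4 adaptor adds 2,048 parameters, compared with 65,536 for a second full copy of the weight, which is where the parameter saving over a two-stream backbone comes from.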
CV | Nov 16, 2023
FedDiff: Diffusion Model Driven Federated Learning for Multi-Modal and Multi-Clients
DaiXun Li, Weiying Xie, ZiXuan Wang et al.
With the rapid development of imaging sensor technology in the field of remote sensing, multi-modal remote sensing data fusion has emerged as a crucial research direction for land cover classification tasks. While diffusion models have made great progress in generative models and image classification tasks, existing models primarily focus on single-modality and single-client control, that is, the diffusion process is driven by a single modal in a single computing node. To facilitate the secure fusion of heterogeneous data from clients, it is necessary to enable distributed multi-modal control, such as merging the hyperspectral data of organization A and the LiDAR data of organization B privately on each base station client. In this study, we propose a multi-modal collaborative diffusion federated learning framework called FedDiff. Our framework establishes a dual-branch diffusion model feature extraction setup, where the two modal data are inputted into separate branches of the encoder. Our key insight is that diffusion models driven by different modalities are inherently complementary in terms of potential denoising steps on which bilateral connections can be built. Considering the challenge of private and efficient communication between multiple clients, we embed the diffusion model into the federated learning communication structure, and introduce a lightweight communication module. Qualitative and quantitative experiments validate the superiority of our framework in terms of image quality and conditional consistency.
CV | Nov 16, 2023
FedFusion: Manifold Driven Federated Learning for Multi-satellite and Multi-modality Fusion
DaiXun Li, Weiying Xie, Yunsong Li et al.
Multi-satellite, multi-modality in-orbit fusion is a challenging task as it explores the fusion representation of complex high-dimensional data under limited computational resources. Deep neural networks can reveal the underlying distribution of multi-modal remote sensing data, but the in-orbit fusion of multimodal data is more difficult because of the limitations of different sensor imaging characteristics, especially when the multimodal data follows non-independent and identically distributed (Non-IID) distributions. To address this problem while maintaining classification performance, this paper proposes a manifold-driven multi-modality fusion framework, FedFusion, which randomly samples local data on each client to jointly estimate the prominent manifold structure of shallow features of each client and explicitly compresses the feature matrices into a low-rank subspace through cascading and additive approaches, which is used as the feature input of the subsequent classifier. Considering the physical space limitations of the satellite constellation, we developed a multimodal federated learning module designed specifically for manifold data in a deep latent space. This module achieves iterative updating of the sub-network parameters of each client through global weighted averaging, constructing a framework that can represent compact representations of each client. The proposed framework surpasses existing methods in terms of performance on three multimodal datasets, achieving a classification average accuracy of 94.35% while compressing communication costs by a factor of 4. Furthermore, extensive numerical evaluations of real-world satellite images were conducted on the orbiting edge computing architecture based on Jetson TX2 industrial modules, which demonstrated that FedFusion significantly reduced training time by 48.4 minutes (15.18%) while optimizing accuracy.
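The global weighted averaging step mentioned above can be sketched as follows, assuming each client is weighted by its number of local samples (the usual FedAvg convention; the abstract does not state which weights FedFusion uses). The function name is hypothetical:

```python
def weighted_average(client_params, client_sizes):
    """Average parameter vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("client_sizes must sum to a positive number")
    dim = len(client_params[0])
    return [
        sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]
```

Only the parameter vectors travel between clients and the server, so compressing features into a low-rank subspace before this step directly reduces the communication cost.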
CV | Nov 22, 2023
Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning
Xin Zhang, Jiawei Du, Yunsong Li et al.
Dataset pruning aims to construct a coreset capable of achieving performance comparable to the original, full dataset. Most existing dataset pruning methods rely on snapshot-based criteria to identify representative samples, often resulting in poor generalization across various pruning and cross-architecture scenarios. Recent studies have addressed this issue by expanding the scope of training dynamics considered, including factors such as forgetting event and probability change, typically using an averaging approach. However, these works struggle to integrate a broader range of training dynamics without overlooking well-generalized samples, which may not be sufficiently highlighted in an averaging manner. In this study, we propose a novel dataset pruning method termed as Temporal Dual-Depth Scoring (TDDS), to tackle this problem. TDDS utilizes a dual-depth strategy to achieve a balance between incorporating extensive training dynamics and identifying representative samples for dataset pruning. In the first depth, we estimate the series of each sample's individual contributions spanning the training progress, ensuring comprehensive integration of training dynamics. In the second depth, we focus on the variability of the sample-wise contributions identified in the first depth to highlight well-generalized samples. Extensive experiments conducted on CIFAR and ImageNet datasets verify the superiority of TDDS over previous SOTA methods. Specifically on CIFAR-100, our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.
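The dual-depth idea can be illustrated with a deliberately simplified sketch: record a per-epoch contribution for every sample (first depth), then score each sample by how much that series varies (second depth) and keep the highest-scoring fraction. The use of plain population variance and the choice to keep high-variability samples are assumptions made for illustration; the exact TDDS estimator is not given in the abstract:

```python
from statistics import pvariance


def variability_scores(contributions):
    """contributions[i] is sample i's per-epoch contribution series."""
    return [pvariance(series) for series in contributions]


def keep_top_fraction(scores, fraction):
    """Indices of the highest-scoring samples, returned in ascending order."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

An averaging criterion would rank a sample with series [0, 2, 4] and one with [2, 2, 2] identically, whereas a variability score separates them.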
CV | Nov 16, 2023
MDFL: Multi-domain Diffusion-driven Feature Learning
Daixun Li, Weiying Xie, Jiaqing Zhang et al.
High-dimensional images, known for their rich semantic information, are widely applied in remote sensing and other fields. The spatial information in these images reflects the object's texture features, while the spectral information reveals the potential spectral representations across different bands. Currently, the understanding of high-dimensional images remains limited to a single-domain perspective with performance degradation. Motivated by the masking texture effect observed in the human visual system, we present a multi-domain diffusion-driven feature learning network (MDFL), a scheme to redefine the effective information domain that the model really focuses on. This method employs diffusion-based posterior sampling to explicitly consider joint information interactions between the high-dimensional manifold structures in the spectral, spatial, and frequency domains, thereby eliminating the influence of masking texture effects in visual models. Additionally, we introduce a feature reuse mechanism to gather deep and raw features of high-dimensional data. We demonstrate that MDFL significantly improves the feature extraction performance of high-dimensional data, thereby providing a powerful aid for revealing the intrinsic patterns and structures of such data. The experimental results on three multi-modal remote sensing datasets show that MDFL reaches an average overall accuracy of 98.25%, outperforming various state-of-the-art baseline schemes. The code will be released, contributing to the computer vision community.
CV | Aug 26, 2024
FusionSAM: Visual Multi-Modal Learning with Segment Anything
Daixun Li, Weiying Xie, Mingxiang Cao et al.
Multimodal image fusion and semantic segmentation are critical for autonomous driving. Despite advancements, current models often struggle with segmenting densely packed elements due to a lack of comprehensive fusion features for guidance during training. While the Segment Anything Model (SAM) allows precise control during fine-tuning through its flexible prompting encoder, its potential remains largely unexplored in the context of multimodal segmentation for natural images. In this paper, we introduce SAM into multimodal image segmentation for the first time, proposing a novel framework that combines Latent Space Token Generation (LSTG) and Fusion Mask Prompting (FMP) modules. This approach transforms the training methodology for multimodal segmentation from a traditional black-box approach to a controllable, prompt-based mechanism. Specifically, we obtain latent space features for both modalities through vector quantization and embed them into a cross-attention-based inter-domain fusion module to establish long-range dependencies between modalities. We then use these comprehensive fusion features as prompts to guide precise pixel-level segmentation. Extensive experiments on multiple public datasets demonstrate that our method significantly outperforms SAM and SAM2 in multimodal autonomous driving scenarios, achieving an average improvement of 4.1% over the state-of-the-art method in segmentation mIoU, and the performance is also optimized in other multi-modal visual scenes.
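Vector quantization, used above to obtain latent space features, replaces each feature vector with its nearest entry in a codebook. A minimal sketch of that lookup, assuming squared Euclidean distance and a fixed codebook (in practice the codebook is learned); the function names are illustrative:

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared L2 distance)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def vector_quantize(features, codebook):
    """Map every feature to its nearest code; return indices and quantized vectors."""
    indices = [nearest_code(f, codebook) for f in features]
    return indices, [codebook[k] for k in indices]
```

The resulting indices act as discrete tokens, which is what makes them convenient inputs for a cross-attention fusion module.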
CV | Jul 27, 2024
Reducing Spurious Correlation for Federated Domain Generalization
Shuran Ma, Weiying Xie, Daixun Li et al.
The rapid development of multimedia has provided a large amount of data with different distributions for visual tasks, forming different domains. Federated Learning (FL) can efficiently use this diverse data distributed on different client media in a decentralized manner through model sharing. However, in open-world scenarios, there is a challenge: global models may struggle to predict well on entirely new domain data captured by certain media, which were not encountered during training. Existing methods still rely on strong statistical correlations between samples and labels to address this issue, which can be misleading, as some features may establish spurious short-cut correlations with the predictions. To comprehensively address this challenge, we introduce FedCD (Cross-Domain Invariant Federated Learning), an overall optimization framework at both the local and global levels. We introduce the Spurious Correlation Intervener (SCI), which employs invariance theory to locally generate interventers for features in a self-supervised manner to reduce the model's susceptibility to spurious correlated features. Our approach requires no sharing of data or features, only the gradients related to the model. Additionally, we develop the simple yet effective Risk Extrapolation Aggregation strategy (REA), determining aggregation coefficients through mathematical optimization to facilitate global causal invariant predictions. Extensive experiments and ablation studies highlight the effectiveness of our approach. In both classification and object detection generalization tasks, our method outperforms the baselines by an average of at least 1.45% in Acc, 4.8% and 1.27% in mAP50.
CV | Jan 6, 2024 | Code
Multimodal Informative ViT: Information Aggregation and Distribution for Hyperspectral and LiDAR Classification
Jiaqing Zhang, Jie Lei, Weiying Xie et al.
In multimodal land cover classification (MLCC), a common challenge is the redundancy in data distribution, where irrelevant information from multiple modalities can hinder the effective integration of their unique features. To tackle this, we introduce the Multimodal Informative Vit (MIVit), a system with an innovative information aggregate-distributing mechanism. This approach redefines redundancy levels and integrates performance-aware elements into the fused representation, facilitating the learning of semantics in both forward and backward directions. MIVit stands out by significantly reducing redundancy in the empirical distribution of each modality's separate and fused features. It employs oriented attention fusion (OAF) for extracting shallow local features across modalities in horizontal and vertical dimensions, and a Transformer feature extractor for extracting deep global features through long-range attention. We also propose an information aggregation constraint (IAC) based on mutual information, designed to remove redundant information and preserve complementary information within embedded features. Additionally, the information distribution flow (IDF) in MIVit enhances performance-awareness by distributing global classification information across different modalities' feature maps. This architecture also addresses missing modality challenges with lightweight independent modality classifiers, reducing the computational load typically associated with Transformers. Our results show that MIVit's bidirectional aggregate-distributing mechanism between modalities is highly effective, achieving an average overall accuracy of 95.56% across three multimodal datasets. This performance surpasses current state-of-the-art methods in MLCC. The code for MIVit is accessible at https://github.com/icey-zhang/MIViT.
MM | Apr 21, 2024 | Code
Beyond Alignment: Blind Video Face Restoration via Parsing-Guided Temporal-Coherent Transformer
Kepeng Xu, Li Xu, Gang He et al.
Multiple complex degradations are coupled in low-quality video faces in the real world. Therefore, blind video face restoration is a highly challenging ill-posed problem, requiring not only hallucinating high-fidelity details but also enhancing temporal coherence across diverse pose variations. Restoring each frame independently in a naive manner inevitably introduces temporal incoherence and artifacts from pose changes and keypoint localization errors. To address this, we propose the first blind video face restoration approach with a novel parsing-guided temporal-coherent transformer (PGTFormer) without pre-alignment. PGTFormer leverages semantic parsing guidance to select optimal face priors for generating temporally coherent artifact-free results. Specifically, we pre-train a temporal-spatial vector quantized auto-encoder on high-quality video face datasets to extract expressive context-rich priors. Then, the temporal parse-guided codebook predictor (TPCP) restores faces in different poses based on face parsing context cues without performing face pre-alignment. This strategy reduces artifacts and mitigates jitter caused by cumulative errors from face pre-alignment. Finally, the temporal fidelity regulator (TFR) enhances fidelity through temporal feature interaction and improves video temporal consistency. Extensive experiments on face videos show that our method outperforms previous face restoration baselines. The code will be released on https://github.com/kepengxu/PGTFormer.
CV | Jan 6, 2024 | Code
Distribution-aware Interactive Attention Network and Large-scale Cloud Recognition Benchmark on FY-4A Satellite Image
Jiaqing Zhang, Jie Lei, Weiying Xie et al.
Accurate cloud recognition and warning are crucial for various applications, including in-flight support, weather forecasting, and climate research. However, recent deep learning algorithms have predominantly focused on detecting cloud regions in satellite imagery, with insufficient attention to the specificity required for accurate cloud recognition. This limitation inspired us to develop the novel FY-4A-Himawari-8 (FYH) dataset, which includes nine distinct cloud categories and uses precise domain adaptation methods to align 70,419 image-label pairs in terms of projection, temporal resolution, and spatial resolution, thereby facilitating the training of supervised deep learning networks. Given the complexity and diversity of cloud formations, we have thoroughly analyzed the challenges inherent to cloud recognition tasks, examining the intricate characteristics and distribution of the data. To effectively address these challenges, we designed a Distribution-aware Interactive-Attention Network (DIAnet), which preserves pixel-level details through a high-resolution branch and a parallel multi-resolution cross-branch. We also integrated a distribution-aware loss (DAL) to mitigate the imbalance across cloud categories. An Interactive Attention Module (IAM) further enhances the robustness of feature extraction combined with spatial and channel information. Empirical evaluations on the FYH dataset demonstrate that our method outperforms other cloud recognition networks, achieving superior performance in terms of mean Intersection over Union (mIoU). The code for implementing DIAnet is available at https://github.com/icey-zhang/DIAnet.
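The mean Intersection over Union (mIoU) metric used for evaluation above can be computed from a class confusion matrix. A minimal sketch, assuming rows index the true class and columns the predicted class, and that classes absent from both maps are skipped (conventions differ between codebases); the function name is illustrative:

```python
def mean_iou(confusion):
    """mIoU from a square matrix where confusion[t][p] counts pixels of
    true class t predicted as class p."""
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                          # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp     # wrongly assigned to c
        denom = tp + fp + fn
        if denom > 0:                                        # skip absent classes
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0
```

Because every class contributes equally to the mean, rare cloud categories weigh as much as common ones, which is why class imbalance matters for this metric.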
CV | Jan 2, 2024 | Code
Exploring Hyperspectral Anomaly Detection with Human Vision: A Small Target Aware Detector
Jitao Ma, Weiying Xie, Yunsong Li
Hyperspectral anomaly detection (HAD) aims to localize pixel points whose spectral features differ from the background. HAD is essential in scenarios of unknown or camouflaged target features, such as water quality monitoring, crop growth monitoring and camouflaged target detection, where prior information of targets is difficult to obtain. Existing HAD methods aim to objectively detect and distinguish background and anomalous spectra, which can be achieved almost effortlessly by human perception. However, the underlying processes of human visual perception are thought to be quite complex. In this paper, we analyze hyperspectral image (HSI) features under human visual perception, and transfer the solution process of HAD to the more robust feature space for the first time. Specifically, we propose a small target aware detector (STAD), which introduces saliency maps to capture HSI features closer to human visual perception. STAD not only extracts more anomalous representations, but also reduces the impact of low-confidence regions through a proposed small target filter (STF). Furthermore, considering the possibility of HAD algorithms being applied to edge devices, we propose a full connected network to convolutional network knowledge distillation strategy. It can learn the spectral and spatial features of the HSI while lightening the network. We train the network on the HAD100 training set and validate the proposed method on the HAD100 test set. Our method provides a new solution space for HAD that is closer to human visual perception with high confidence. Sufficient experiments on real HSI with multiple method comparisons demonstrate the excellent performance and unique potential of the proposed method. The code is available at https://github.com/majitao-xd/STAD-HAD.
CV | Mar 9, 2025 | Code
M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification
Mingxiang Cao, Weiying Xie, Xin Zhang et al.
Multi-modal fusion holds great promise for integrating information from different modalities. However, due to a lack of consideration for modal consistency, existing multi-modal fusion methods in the field of remote sensing still face challenges of incomplete semantic information and low computational efficiency in their fusion designs. Inspired by the observation that the visual language pre-training model CLIP can effectively extract strong semantic information from visual features, we propose M$^3$amba, a novel end-to-end CLIP-driven Mamba model for multi-modal fusion to address these challenges. Specifically, we introduce CLIP-driven modality-specific adapters in the fusion architecture to avoid the bias of understanding specific domains caused by direct inference, giving the original CLIP encoder modality-specific perception. This unified framework enables minimal training to achieve a comprehensive semantic understanding of different modalities, thereby guiding cross-modal feature fusion. To further enhance the consistent association between modality mappings, a multi-modal Mamba fusion architecture with linear complexity and a cross-attention module Cross-SS2D are designed, which fully consider effective and efficient information interaction to achieve complete fusion. Extensive experiments have shown that M$^3$amba has an average performance improvement of at least 5.98% compared with the state-of-the-art methods in multi-modal hyperspectral image classification tasks in the remote sensing field, while also demonstrating excellent training efficiency, achieving a double improvement in accuracy and efficiency. The code is released at https://github.com/kaka-Cao/M3amba.
CV | Dec 10, 2024 | Code
DiffCLIP: Few-shot Language-driven Multimodal Classifier
Jiaqing Zhang, Mingxiang Cao, Xue Yang et al.
Visual language models like Contrastive Language-Image Pretraining (CLIP) have shown impressive performance in analyzing natural images with language information. However, these models often encounter challenges when applied to specialized domains such as remote sensing due to the limited availability of image-text pairs for training. To tackle this issue, we introduce DiffCLIP, a novel framework that extends CLIP to effectively convey comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images. DiffCLIP is a few-shot learning method that leverages unlabeled images for pretraining. It employs unsupervised mask diffusion learning to capture the distribution of diverse modalities without requiring labels. The modality-shared image encoder maps multimodal data into a unified subspace, extracting shared features with consistent parameters across modalities. A well-trained image encoder further enhances learning by aligning visual representations with class-label text information from CLIP. By integrating these approaches, DiffCLIP significantly boosts CLIP performance using a minimal number of image-text pairs. We evaluate DiffCLIP on widely used high-dimensional multimodal datasets, demonstrating its effectiveness in addressing few-shot annotated classification tasks. DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP, while utilizing only 2-shot image-text pairs. The code has been released at https://github.com/icey-zhang/DiffCLIP.
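Aligning visual representations with class-label text, as described above, lets classification reduce to a similarity lookup: the predicted class is the one whose text embedding is closest to the image embedding. A minimal sketch of that CLIP-style decision rule, assuming the embeddings have already been produced by some encoder; the toy vectors and function names are illustrative and are not DiffCLIP outputs:

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def classify(image_embedding, text_embeddings):
    """Return the label whose text embedding is most similar to the image."""
    return max(
        text_embeddings,
        key=lambda label: cosine(image_embedding, text_embeddings[label]),
    )
```

For example, with text embeddings `{"water": [1.0, 0.0], "forest": [0.0, 1.0]}`, an image embedding of `[0.9, 0.1]` is assigned to "water".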
CV | Sep 10, 2025 | Code
Hyperspectral Mamba for Hyperspectral Object Tracking
Long Gao, Yunhe Zhang, Yan Jiang et al.
Hyperspectral object tracking holds great promise due to the rich spectral information and fine-grained material distinctions in hyperspectral images, which are beneficial in challenging scenarios. While existing hyperspectral trackers have made progress by either transforming hyperspectral data into false-color images or incorporating modality fusion strategies, they often fail to capture the intrinsic spectral information, temporal dependencies, and cross-depth interactions. To address these limitations, a new hyperspectral object tracking network equipped with Mamba (HyMamba) is proposed. It unifies spectral, cross-depth, and temporal modeling through state space modules (SSMs). The core of HyMamba lies in the Spectral State Integration (SSI) module, which enables progressive refinement and propagation of spectral features with cross-depth and temporal spectral information. Embedded within each SSI, the Hyperspectral Mamba (HSM) module is introduced to learn spatial and spectral information synchronously via three directional scanning SSMs. Based on SSI and HSM, HyMamba constructs joint features from false-color and hyperspectral inputs, and enhances them through interaction with original spectral features extracted from raw hyperspectral images. Extensive experiments conducted on seven benchmark datasets demonstrate that HyMamba achieves state-of-the-art performance. For instance, it achieves an AUC score of 73.0% and a DP@20 score of 96.3% on the HOTC2020 dataset. The code will be released at https://github.com/lgao001/HyMamba.
CV | Mar 14, 2024 | Code
E2E-MFD: Towards End-to-End Synchronous Multimodal Fusion Detection
Jiaqing Zhang, Mingxiang Cao, Weiying Xie et al.
Multimodal image fusion and object detection are crucial for autonomous driving. While current methods have advanced the fusion of texture details and semantic information, their complex training processes hinder broader applications. Addressing this challenge, we introduce E2E-MFD, a novel end-to-end algorithm for multimodal fusion detection. E2E-MFD streamlines the process, achieving high performance with a single training phase. It employs synchronous joint optimization across components to avoid suboptimal solutions tied to individual tasks. Furthermore, it implements a comprehensive optimization strategy in the gradient matrix for shared parameters, ensuring convergence to an optimal fusion detection configuration. Our extensive testing on multiple public datasets reveals E2E-MFD's superior capabilities, showcasing not only visually appealing image fusion but also impressive detection outcomes, such as a 3.9% and 2.0% mAP50 increase on horizontal object detection dataset M3FD and oriented object detection dataset DroneVehicle, respectively, compared to state-of-the-art approaches. The code is released at https://github.com/icey-zhang/E2E-MFD.
CV | Dec 15, 2021 | Code
Transcoded Video Restoration by Temporal Spatial Auxiliary Network
Li Xu, Gang He, Jinjia Zhou et al.
On most video platforms, such as YouTube and TikTok, the played videos have usually undergone multiple video encodings, such as hardware encoding by recording devices, software encoding by video editing apps, and single/multiple video transcoding by video application servers. Previous works in compressed video restoration typically assume the compression artifacts are caused by one-time encoding. Thus, the derived solution usually does not work very well in practice. In this paper, we propose a new method, temporal spatial auxiliary network (TSAN), for transcoded video restoration. Our method considers the unique traits between video encoding and transcoding, and we consider the initial shallow encoded videos as the intermediate labels to assist the network to conduct self-supervised attention training. In addition, we employ adjacent multi-frame information and propose the temporal deformable alignment and pyramidal spatial fusion for transcoded video restoration. The experimental results demonstrate that the performance of the proposed method is superior to that of the previous techniques. The code is available at https://github.com/icecherylXuli/TSAN.
IV | May 19, 2020 | Code
Adaptive Weighted Attention Network with Camera Spectral Sensitivity Prior for Spectral Reconstruction from RGB Images
Jiaojiao Li, Chaoxiong Wu, Rui Song et al.
Recent promising efforts in spectral reconstruction (SR) focus on learning a complicated mapping using deeper and wider convolutional neural networks (CNNs). Nevertheless, most CNN-based SR algorithms neglect to explore the camera spectral sensitivity (CSS) prior and interdependencies among intermediate features, thus limiting the representation ability of the network and performance of SR. To address these issues, we propose a novel adaptive weighted attention network (AWAN) for SR, whose backbone is stacked with multiple dual residual attention blocks (DRAB) decorated with long and short skip connections to form the dual residual learning. Concretely, we investigate an adaptive weighted channel attention (AWCA) module to reallocate channel-wise feature responses via integrating correlations between channels. Furthermore, a patch-level second-order non-local (PSNL) module is developed to capture long-range spatial contextual information by second-order non-local operations for more powerful feature representations. Based on the fact that the recovered RGB images can be projected by the reconstructed hyperspectral image (HSI) and the given CSS function, we incorporate the discrepancies of the RGB images and HSIs as a finer constraint for more accurate reconstruction. Experimental results demonstrate the effectiveness of our proposed AWAN network in terms of quantitative comparison and perceptual quality over other state-of-the-art SR methods. In the NTIRE 2020 Spectral Reconstruction Challenge, our entries ranked 1st on the Clean track and 3rd on the Real World track. Codes are available at https://github.com/Deep-imagelab/AWAN.
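The CSS-based constraint mentioned in this abstract rests on a simple linear projection: each RGB channel is a sensitivity-weighted sum over the spectral bands of a pixel. The following is a minimal sketch of that projection for a single pixel, not the AWAN implementation. The function names, the toy 4-band CSS matrix, and the use of mean absolute error as the discrepancy are all assumptions made for illustration; the abstract does not specify the loss form.

```python
def project_to_rgb(spectrum, css):
    """Project one hyperspectral pixel to RGB.

    spectrum: list of B band intensities.
    css: 3 rows (R, G, B), each a list of B sensitivity weights (hypothetical values).
    """
    return [sum(s * w for s, w in zip(spectrum, row)) for row in css]


def rgb_discrepancy(rgb_in, spectrum, css):
    """Mean absolute error between the input RGB pixel and the re-projected one.

    The choice of mean absolute error is an assumption for this sketch.
    """
    projected = project_to_rgb(spectrum, css)
    return sum(abs(a - b) for a, b in zip(rgb_in, projected)) / len(rgb_in)


# Toy 4-band example; the CSS weights are made up for illustration.
css = [
    [0.0, 0.0, 0.5, 0.5],  # red responds to the two longest bands
    [0.0, 1.0, 0.0, 0.0],  # green responds to band 1
    [1.0, 0.0, 0.0, 0.0],  # blue responds to band 0
]
spectrum = [0.2, 0.4, 0.6, 0.8]
rgb = project_to_rgb(spectrum, css)
loss = rgb_discrepancy(rgb, spectrum, css)
```

A reconstructed spectrum that is consistent with the input RGB pixel gives a discrepancy near zero, which is what makes the projection usable as an extra training constraint.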
CV | Jan 10, 2024
SwiMDiff: Scene-wide Matching Contrastive Learning with Diffusion Constraint for Remote Sensing Image
Jiayuan Tian, Jie Lei, Jiaqing Zhang et al.
With recent advancements in aerospace technology, the volume of unlabeled remote sensing image (RSI) data has increased dramatically. Effectively leveraging this data through self-supervised learning (SSL) is vital in the field of remote sensing. However, current methodologies, particularly contrastive learning (CL), a leading SSL method, encounter specific challenges in this domain. Firstly, CL often mistakenly identifies geographically adjacent samples with similar semantic content as negative pairs, leading to confusion during model training. Secondly, as an instance-level discriminative task, it tends to neglect the essential fine-grained features and complex details inherent in unstructured RSIs. To overcome these obstacles, we introduce SwiMDiff, a novel self-supervised pre-training framework designed for RSIs. SwiMDiff employs a scene-wide matching approach that effectively recalibrates labels to recognize data from the same scene as false negatives. This adjustment makes CL more applicable to the nuances of remote sensing. Additionally, SwiMDiff seamlessly integrates CL with a diffusion model. Through the implementation of pixel-level diffusion constraints, we enhance the encoder's ability to capture both the global semantic information and the fine-grained features of the images more comprehensively. Our proposed framework significantly enriches the information available for downstream tasks in remote sensing. Demonstrating exceptional performance in change detection and land-cover classification tasks, SwiMDiff proves its substantial utility and value in the field of remote sensing.
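The false-negative problem described in this abstract can be illustrated with a small masking sketch: samples cropped from the same scene are removed from an anchor's negative set. This is only an illustration of the idea, not the SwiMDiff code. The function names and the `scene_ids` input are hypothetical, and how the paper actually uses the matched samples (for example, re-weighting them rather than excluding them) is not stated in the abstract.

```python
def same_scene_mask(scene_ids):
    """mask[i][j] is True when samples i and j (i != j) come from the same scene."""
    n = len(scene_ids)
    return [
        [i != j and scene_ids[i] == scene_ids[j] for j in range(n)]
        for i in range(n)
    ]


def negatives_for(anchor, scene_ids):
    """Indices that remain valid negatives for the anchor after scene-wide matching."""
    mask = same_scene_mask(scene_ids)
    return [
        j for j in range(len(scene_ids))
        if j != anchor and not mask[anchor][j]
    ]


# Five patches drawn from three scenes (labels are illustrative).
scene_ids = ["A", "A", "B", "C", "B"]
neg_0 = negatives_for(0, scene_ids)
neg_2 = negatives_for(2, scene_ids)
```

In plain instance-level contrastive learning, patch 1 would be pushed away from patch 0 even though both come from scene "A"; the mask removes exactly those pairs.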
IV | Nov 16, 2024
An End-to-End Real-World Camera Imaging Pipeline
Kepeng Xu, Zijia Ma, Li Xu et al.
Recent advances in neural camera imaging pipelines have demonstrated notable progress. Nevertheless, the real-world imaging pipeline still faces challenges including the lack of joint optimization in system components, computational redundancies, and optical distortions such as lens shading. In light of this, we propose an end-to-end camera imaging pipeline (RealCamNet) to enhance real-world camera imaging performance. Our methodology diverges from conventional, fragmented multi-stage image signal processing towards an end-to-end architecture. This architecture facilitates joint optimization across the full pipeline and the restoration of coordinate-biased distortions. RealCamNet is designed for high-quality conversion from RAW to RGB and compact image compression. Specifically, we deeply analyze coordinate-dependent optical distortions, e.g., vignetting and dark shading, and design a novel Coordinate-Aware Distortion Restoration (CADR) module to restore coordinate-biased distortions. Furthermore, we propose a Coordinate-Independent Mapping Compression (CIMC) module to implement tone mapping and redundant information compression. Existing datasets suffer from misalignment and overly idealized conditions, making them inadequate for training real-world imaging pipelines. Therefore, we collected a real-world imaging dataset. Experimental results show that RealCamNet achieves the best rate-distortion performance with lower inference latency.
CL | Oct 11, 2024
AMPO: Automatic Multi-Branched Prompt Optimization
Sheng Yang, Yurong Wu, Yan Gao et al.
Prompt engineering is very important to enhance the performance of large language models (LLMs). When dealing with complex issues, prompt engineers tend to distill multiple patterns from examples and inject relevant solutions to optimize the prompts, achieving satisfying results. However, existing automatic prompt optimization techniques are only limited to producing single flow instructions, struggling with handling diverse patterns. In this paper, we present AMPO, an automatic prompt optimization method that can iteratively develop a multi-branched prompt using failure cases as feedback. Our goal is to explore a novel way of structuring prompts with multi-branches to better handle multiple patterns in complex tasks, for which we introduce three modules: Pattern Recognition, Branch Adjustment, and Branch Pruning. In experiments across five tasks, AMPO consistently achieves the best results. Additionally, our approach demonstrates significant optimization efficiency due to our adoption of a minimal search strategy.
CV | Oct 15, 2024
SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via Contrast Learning for Multimodal Object Detection
Shuhan Dong, Yunsong Li, Weiying Xie et al.
Multimodal object detection leverages diverse modal information to enhance the accuracy and robustness of detectors. By learning long-term dependencies, Transformer can effectively integrate multimodal features in the feature extraction stage, which greatly improves the performance of multimodal object detection. However, current methods merely stack Transformer-guided fusion techniques without exploring their capability to extract features at various depth layers of network, thus limiting the improvements in detection performance. In this paper, we introduce an accurate and efficient object detection method named SeaDATE. Initially, we propose a novel dual attention Feature Fusion (DTF) module that, under Transformer's guidance, integrates local and global information through a dual attention mechanism, strengthening the fusion of modal features from orthogonal perspectives using spatial and channel tokens. Meanwhile, our theoretical analysis and empirical validation demonstrate that the Transformer-guided fusion method, treating images as sequences of pixels for fusion, performs better on shallow features' detail information compared to deep semantic information. To address this, we designed a contrastive learning (CL) module aimed at learning features of multimodal samples, remedying the shortcomings of Transformer-guided fusion in extracting deep semantic features, and effectively utilizing cross-modal information. Extensive experiments and ablation studies on the FLIR, LLVIP, and M3FD datasets have proven our method to be effective, achieving state-of-the-art detection performance.
CV | Apr 20, 2024
Hyperspectral Anomaly Detection with Self-Supervised Anomaly Prior
Yidan Liu, Weiying Xie, Kai Jiang et al.
The majority of existing hyperspectral anomaly detection (HAD) methods use the low-rank representation (LRR) model to separate the background and anomaly components, where the anomaly component is optimized by handcrafted sparse priors (e.g., $\ell_{2,1}$-norm). However, this may not be ideal since they overlook the spatial structure present in anomalies and make the detection result largely dependent on manually set sparsity. To tackle these problems, we redefine the optimization criterion for the anomaly component in the LRR model with a self-supervised network called self-supervised anomaly prior (SAP). This prior is obtained by the pretext task of self-supervised learning, which is customized to learn the characteristics of hyperspectral anomalies. Specifically, this pretext task is a classification task to distinguish the original hyperspectral image (HSI) and the pseudo-anomaly HSI, where the pseudo-anomaly is generated from the original HSI and designed as a prism with arbitrary polygon bases and arbitrary spectral bands. In addition, a dual-purified strategy is proposed to provide a more refined background representation with an enriched background dictionary, facilitating the separation of anomalies from complex backgrounds. Extensive experiments on various hyperspectral datasets demonstrate that the proposed SAP offers a more accurate and interpretable solution than other advanced HAD methods.
CV | Jan 13, 2024
DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception
Kai Jiang, Jiaxing Huang, Weiying Xie et al.
Camera-only Bird's Eye View (BEV) has demonstrated great potential in environment perception in a 3D space. However, most existing studies were conducted under a supervised setup, which cannot scale well when handling various new data. Unsupervised domain adaptive BEV, which learns effectively from various unlabelled target data, is far under-explored. In this work, we design DA-BEV, the first domain adaptive camera-only BEV framework that addresses domain adaptive BEV challenges by exploiting the complementary nature of image-view features and BEV features. DA-BEV introduces the idea of query into the domain adaptation framework to derive useful information from image-view and BEV features. It consists of two query-based designs, namely, query-based adversarial learning (QAL) and query-based self-training (QST), which exploit image-view features or BEV features to regularize the adaptation of the other. Extensive experiments show that DA-BEV achieves superior domain adaptive BEV perception performance consistently across multiple datasets and tasks such as 3D object detection and 3D scene segmentation.
CV | Jan 13, 2024
Domain Adaptation for Large-Vocabulary Object Detectors
Kai Jiang, Jiaxing Huang, Weiying Xie et al.
Large-vocabulary object detectors (LVDs) aim to detect objects of many categories, which learn super objectness features and can locate objects accurately while applied to various downstream data. However, LVDs often struggle in recognizing the located objects due to domain discrepancy in data distribution and object vocabulary. At the other end, recent vision-language foundation models such as CLIP demonstrate superior open-vocabulary recognition capability. This paper presents KGD, a Knowledge Graph Distillation technique that exploits the implicit knowledge graphs (KG) in CLIP for effectively adapting LVDs to various downstream domains. KGD consists of two consecutive stages: 1) KG extraction that employs CLIP to encode downstream domain data as nodes and their feature distances as edges, constructing KG that inherits the rich semantic relations in CLIP explicitly; and 2) KG encapsulation that transfers the extracted KG into LVDs to enable accurate cross-domain object classification. In addition, KGD can extract both visual and textual KG independently, providing complementary vision and language knowledge for object localization and object classification in detection tasks over various downstream domains. Experiments over multiple widely adopted detection benchmarks show that KGD outperforms the state-of-the-art consistently by large margins.
CV | Dec 28, 2023
Multi-scale direction-aware SAR object detection network via global information fusion
Mingxiang Cao, Weiying Xie, Jie Lei et al.
Deep learning has driven significant progress in object detection using Synthetic Aperture Radar (SAR) imagery. Existing methods, while achieving promising results, often struggle to effectively integrate local and global information, particularly direction-aware features. This paper proposes SAR-Net, a novel framework specifically designed for global fusion of direction-aware information in SAR object detection. SAR-Net leverages two key innovations: the Unity Compensation Mechanism (UCM) and the Direction-aware Attention Module (DAM). UCM facilitates the establishment of complementary relationships among features across different scales, enabling efficient global information fusion and transmission. Additionally, DAM, through bidirectional attention polymerization, captures direction-aware information, effectively eliminating background interference. Extensive experiments demonstrate the effectiveness of SAR-Net, achieving state-of-the-art results on aircraft (SAR-AIRcraft-1.0) and ship datasets (SSDD, HRSID), confirming its generalization capability and robustness.
CV | Dec 29, 2023
RS-DGC: Exploring Neighborhood Statistics for Dynamic Gradient Compression on Remote Sensing Image Interpretation
Weiying Xie, Zixuan Wang, Jitao Ma et al.
Distributed deep learning has recently been attracting more attention in remote sensing (RS) applications due to the challenges posed by the increased amount of open data that are produced daily by Earth observation programs. However, the high communication costs of sending model updates among multiple nodes are a significant bottleneck for scalable distributed learning. Gradient sparsification has been validated as an effective gradient compression (GC) technique for reducing communication costs and thus accelerating the training speed. Existing state-of-the-art gradient sparsification methods are mostly based on the "larger-absolute-more-important" criterion, ignoring the importance of small gradients, which is generally observed to affect the performance. Inspired by informative representation of manifold structures from neighborhood information, we propose a simple yet effective dynamic gradient compression scheme leveraging neighborhood statistics indicator for RS image interpretation, termed RS-DGC. We first enhance the interdependence between gradients by introducing the gradient neighborhood to reduce the effect of random noise. The key component of RS-DGC is a Neighborhood Statistical Indicator (NSI), which can quantify the importance of gradients within a specified neighborhood on each node to sparsify the local gradients before gradient transmission in each iteration. Further, a layer-wise dynamic compression scheme is proposed to track the importance changes of each layer in real time. Extensive downstream tasks validate the superiority of our method in terms of intelligent interpretation of RS images. For example, we achieve an accuracy improvement of 0.51% with more than 50 times communication compression on the NWPU-RESISC45 dataset using VGG-19 network.
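For context, the "larger-absolute-more-important" criterion that this abstract contrasts against is ordinary top-k magnitude sparsification. The sketch below shows that baseline only; it is not the NSI-based scheme of RS-DGC, whose details are not given in the abstract. The function name and the `ratio` parameter are assumptions for illustration.

```python
def topk_sparsify(grad, ratio):
    """Keep the k largest-magnitude entries of a flat gradient and zero the rest.

    grad: list of floats (a flattened gradient).
    ratio: fraction of entries to keep, in (0, 1].
    Returns the sparsified gradient and the sorted indices that were kept.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    k = max(1, int(len(grad) * ratio))
    # Rank indices by descending absolute value and keep the first k.
    ranked = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(ranked[:k])
    sparse = [g if i in keep else 0.0 for i, g in enumerate(grad)]
    return sparse, sorted(keep)


grad = [0.02, -0.9, 0.1, 0.005, 0.4, -0.03, 0.0, 0.25]
sparse, kept_idx = topk_sparsify(grad, ratio=0.25)
```

Only the kept indices and values need to be transmitted between nodes. Note that every small gradient is dropped regardless of its neighborhood, which is the behavior the abstract identifies as a limitation.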
LG | Nov 17, 2024
Towards Accurate and Efficient Sub-8-Bit Integer Training
Wenjin Guo, Donglai Liu, Weiying Xie et al.
Neural network training is a memory- and compute-intensive task. Quantization, which enables low-bitwidth formats in training, can significantly mitigate the workload. To reduce quantization error, recent methods have developed new data formats and additional pre-processing operations on quantizers. However, it remains quite challenging to achieve high accuracy and efficiency simultaneously. In this paper, we explore sub-8-bit integer training from its essence of gradient descent optimization. Our integer training framework includes two components: ShiftQuant to realize accurate gradient estimation, and L1 normalization to smoothen the loss landscape. ShiftQuant attains performance that approaches the theoretical upper bound of group quantization. Furthermore, it liberates group quantization from inefficient memory rearrangement. The L1 normalization facilitates the implementation of fully quantized normalization layers with impressive convergence accuracy. Our method frees sub-8-bit integer training from pre-processing and supports general devices. This framework achieves negligible accuracy loss across various neural networks and tasks ($0.92\%$ on 4-bit ResNets, $0.61\%$ on 6-bit Transformers). The prototypical implementation of ShiftQuant achieves more than $1.85\times/15.3\%$ performance improvement on CPU/GPU compared to its FP16 counterparts, and $33.9\%$ resource consumption reduction on FPGA compared to the FP16 counterparts. The proposed fully-quantized L1 normalization layers achieve more than $35.54\%$ improvement in throughput on CPU compared to traditional L2 normalization layers. Moreover, theoretical analysis verifies the advancement of our method.
CV | Nov 14, 2025
Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification
Qinghao Gao, Jianhai Qu, Yunsong Li et al.
Multimodal classification in remote sensing often suffers from missing modalities caused by environmental interference, sensor failures, or atmospheric effects, which severely degrade classification performance. Existing two-stage adaptation methods are computationally expensive and assume complete multimodal data during training, limiting their generalization to real-world incompleteness. To overcome these issues, we propose a Missing-aware Mixture-of-Loras (MaMOL) framework that reformulates modality missing as a multi-task learning problem. MaMOL introduces a dual-routing mechanism: a task-oriented dynamic router that adaptively activates experts for different missing patterns, and a modality-specific-shared static router that maintains stable cross-modal knowledge sharing. Unlike prior methods that train separate networks for each missing configuration, MaMOL achieves parameter-efficient adaptation via lightweight expert updates and shared expert reuse. Experiments on multiple remote sensing benchmarks demonstrate superior robustness and generalization under varying missing rates, with minimal computational overhead. Moreover, transfer experiments on natural image datasets validate its scalability and cross-domain applicability, highlighting MaMOL as a general and efficient solution for incomplete multimodal learning.
LG | Sep 16, 2025
High-Energy Concentration for Federated Learning in Frequency Domain
Haozhi Shi, Weiying Xie, Hangyu Ye et al.
Federated Learning (FL) presents significant potential for collaborative optimization without data sharing. Since synthetic data is sent to the server, leveraging the popular concept of dataset distillation, this FL framework protects real data privacy while alleviating data heterogeneity. However, such methods are still challenged by the redundant information and noise in entire spatial-domain designs, which inevitably increases the communication burden. In this paper, we propose a novel Frequency-Domain aware FL method with high-energy concentration (FedFD) to address this problem. Our FedFD is inspired by the discovery that the discrete cosine transform predominantly distributes energy to specific regions, referred to as high-energy concentration. The principle behind FedFD is that low-energy components, such as high-frequency ones, usually contain redundant information and noise, thus filtering them helps reduce communication costs and optimize performance. Our FedFD is mathematically formulated to preserve the low-frequency components using a binary mask, facilitating an optimal solution through frequency-domain distribution alignment. In particular, real data-driven synthetic classification is imposed into the loss to enhance the quality of the low-frequency components. On five image and speech datasets, FedFD achieves better performance than state-of-the-art methods while reducing communication costs. For example, on the CIFAR-10 dataset with Dirichlet coefficient $\alpha = 0.01$, FedFD achieves a minimum reduction of 37.78% in the communication cost, while attaining a 10.88% performance gain.
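The idea of keeping low-frequency DCT coefficients with a binary mask can be sketched on a 1-D signal as follows. This is a simplified illustration under stated assumptions, not the FedFD formulation: it uses a hand-written orthonormal 1-D DCT-II (images would need a 2-D transform), and the function names and the `keep` cutoff are hypothetical.

```python
import math


def dct(x):
    """Orthonormal 1-D DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)
        ))
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above (a DCT-III with matching scaling)."""
    n = len(c)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += scale * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def low_freq_mask(n, keep):
    """Binary mask that keeps the first `keep` (lowest-frequency) coefficients."""
    return [1 if k < keep else 0 for k in range(n)]


def compress(x, keep):
    """Transform to the frequency domain and zero the high-frequency coefficients."""
    mask = low_freq_mask(len(x), keep)
    return [c * m for c, m in zip(dct(x), mask)]


signal = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.0]
kept = compress(signal, keep=3)   # only 3 of 8 coefficients are non-zero
approx = idct(kept)               # smoothed reconstruction of the signal
```

Only the coefficients selected by the mask would need to be communicated; the receiver reconstructs an approximation with `idct`. How much can be dropped without hurting accuracy depends on how strongly the data's energy concentrates in the low frequencies.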
CV | Mar 28, 2025
Hyperspectral Adapter for Object Tracking based on Hyperspectral Video
Long Gao, Yunhe Zhang, Langkun Chen et al.
Object tracking based on hyperspectral video is attracting increasing attention owing to the rich material and motion information in hyperspectral videos. The prevailing hyperspectral methods adapt pretrained RGB-based object tracking networks for hyperspectral tasks by fine-tuning the entire network on hyperspectral datasets, which achieves impressive results in challenging scenarios. However, the performance of hyperspectral trackers is limited by the loss of spectral information during the transformation, and fine-tuning the entire pretrained network is inefficient for practical applications. To address these issues, a new hyperspectral object tracking method, hyperspectral adapter for tracking (HyA-T), is proposed in this work. The hyperspectral adapter for the self-attention (HAS) and the hyperspectral adapter for the multilayer perceptron (HAM) are proposed to generate the adaption information and to transfer the multi-head self-attention (MSA) module and the multilayer perceptron (MLP) in the pretrained network to the hyperspectral object tracking task by augmenting the adaption information into the calculation of the MSA and MLP. Additionally, the hyperspectral enhancement of input (HEI) is proposed to augment the original spectral information into the input of the tracking network. The proposed methods extract spectral information directly from the hyperspectral images, which prevents the loss of the spectral information. Moreover, only the parameters in the proposed methods are fine-tuned, which is more efficient than the existing methods. Extensive experiments were conducted on four datasets with various spectral bands, verifying the effectiveness of the proposed methods. The HyA-T achieves state-of-the-art performance on all the datasets.
CV | May 23, 2023
SAR-to-Optical Image Translation via Thermodynamics-inspired Network
Mingjin Zhang, Jiamin Xu, Chengyu He et al.
Synthetic aperture radar (SAR) is prevalent in the remote sensing field but is difficult to interpret in human visual perception. Recently, SAR-to-optical (S2O) image conversion methods have provided a prospective solution for interpretation. However, since there is a huge domain difference between optical and SAR images, they suffer from low image quality and geometric distortion in the produced optical images. Motivated by the analogy between pixels during the S2O image translation and molecules in a heat field, Thermodynamics-inspired Network for SAR-to-Optical Image Translation (S2O-TDN) is proposed in this paper. Specifically, we design a Third-order Finite Difference (TFD) residual structure in light of the TFD equation of thermodynamics, which allows us to efficiently extract inter-domain invariant features and facilitate the learning of the nonlinear translation mapping. In addition, we exploit the first law of thermodynamics (FLT) to devise an FLT-guided branch that promotes the state transition of the feature values from the unstable diffusion state to the stable one, aiming to regularize the feature diffusion and preserve image structures during S2O image translation. S2O-TDN follows an explicit design principle derived from thermodynamic theory and enjoys the advantage of explainability. Experiments on the public SEN1-2 dataset show the advantages of the proposed S2O-TDN over the current methods with more delicate textures and higher quantitative results.
IV | Dec 29, 2021
HPRN: Holistic Prior-embedded Relation Network for Spectral Super-Resolution
Chaoxiong Wu, Jiaojiao Li, Rui Song et al.
Spectral super-resolution (SSR) refers to the hyperspectral image (HSI) recovery from an RGB counterpart. Due to the one-to-many nature of the SSR problem, a single RGB image can be reprojected to many HSIs. The key to tackle this ill-posed problem is to plug into multi-source prior information such as the natural spatial context-prior of RGB images, deep feature-prior or inherent statistical-prior of HSIs, etc., so as to effectively alleviate the degree of ill-posedness. However, most current approaches only consider the general and limited priors in their customized convolutional neural networks (CNNs), which leads to the inability to guarantee the confidence and fidelity of reconstructed spectra. In this paper, we propose a novel holistic prior-embedded relation network (HPRN) to integrate comprehensive priors to regularize and optimize the solution space of SSR. Basically, the core framework is delicately assembled by several multi-residual relation blocks (MRBs) that fully facilitate the transmission and utilization of the low-frequency content prior of RGBs. Innovatively, the semantic prior of RGB inputs is introduced to mark category attributes, and a semantic-driven spatial relation module (SSRM) is invented to perform the feature aggregation of clustered similar range for refining recovered characteristics. Additionally, we develop a transformer-based channel relation module (TCRM), which breaks the habit of employing scalars as the descriptors of channel-wise relations in the previous deep feature-prior, and replaces them with certain vectors to make the mapping function more robust and smoother. In order to maintain the mathematical correlation and spectral consistency between hyperspectral bands, the second-order prior constraints (SOPC) are incorporated into the loss function to guide the HSI reconstruction.