Shuchang Lyu

CV
h-index28
26papers
2,139citations
Novelty44%
AI Score58

26 Papers

CVOct 22, 2023Code
OV-VG: A Benchmark for Open-Vocabulary Visual Grounding

Chunlei Wang, Wenquan Feng, Xiangtai Li et al.

Open-vocabulary learning has emerged as a cutting-edge research area, particularly in light of the widespread adoption of vision-based foundational models. Its primary objective is to comprehend novel concepts that are not encompassed within a predefined vocabulary. One key facet of this endeavor is Visual Grounding, which entails locating a specific region within an image based on a corresponding language description. While current foundational models excel at various visual language tasks, there's a noticeable absence of models specifically tailored for open-vocabulary visual grounding. This research endeavor introduces novel and challenging OV tasks, namely Open-Vocabulary Visual Grounding and Open-Vocabulary Phrase Localization. The overarching aim is to establish connections between language descriptions and the localization of novel objects. To facilitate this, we have curated a comprehensive annotated benchmark, encompassing 7,272 OV-VG images and 1,000 OV-PL images. In our pursuit of addressing these challenges, we delved into various baseline methodologies rooted in existing open-vocabulary object detection, VG, and phrase localization frameworks. Surprisingly, we discovered that state-of-the-art methods often falter in diverse scenarios. Consequently, we developed a novel framework that integrates two critical components: Text-Image Query Selection and Language-Guided Feature Attention. These modules are designed to bolster the recognition of novel categories and enhance the alignment between visual and linguistic information. Extensive experiments demonstrate the efficacy of our proposed framework, which consistently attains SOTA performance across the OV-VG task. Additionally, ablation studies provide further evidence of the effectiveness of our innovative models. Codes and datasets will be made publicly available at https://github.com/cv516Buaa/OV-VG.

CVJan 13, 2023Code
Self-Training Guided Disentangled Adaptation for Cross-Domain Remote Sensing Image Semantic Segmentation

Qi Zhao, Shuchang Lyu, Binghao Liu et al.

Deep convolutional neural networks (DCNNs) based remote sensing (RS) image semantic segmentation technology has achieved great success used in many real-world applications such as geographic element analysis. However, strong dependency on annotated data of specific scene makes it hard for DCNNs to fit different RS scenes. To solve this problem, recent works gradually focus on cross-domain RS image semantic segmentation task. In this task, different ground sampling distance, remote sensing sensor variation and different geographical landscapes are three main factors causing dramatic domain shift between source and target images. To decrease the negative influence of domain shift, we propose a self-training guided disentangled adaptation network (ST-DASegNet). We first propose source student backbone and target student backbone to respectively extract the source-style and target-style feature for both source and target images. Towards the intermediate output feature maps of each backbone, we adopt adversarial learning for alignment. Then, we propose a domain disentangled module to extract the universal feature and purify the distinct feature of source-style and target-style features. Finally, these two features are fused and served as input of source student decoder and target student decoder to generate final predictions. Based on our proposed domain disentangled module, we further propose exponential moving average (EMA) based cross-domain separated self-training mechanism to ease the instability and disadvantageous effect during adversarial optimization. Extensive experiments and analysis on benchmark RS datasets show that ST-DASegNet outperforms previous methods on cross-domain RS image semantic segmentation task and achieves state-of-the-art (SOTA) results. Our code is available at https://github.com/cv516Buaa/ST-DASegNet.

CVJul 23, 2023Code
Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision

Menghao Li, Chunlei Wang, Wenquan Feng et al.

Visual Grounding (VG) aims at localizing target objects from an image based on given expressions and has made significant progress with the development of detection and vision transformer. However, existing VG methods tend to generate false-alarm objects when presented with inaccurate or irrelevant descriptions, which commonly occur in practical applications. Moreover, existing methods fail to capture fine-grained features, accurate localization, and sufficient context comprehension from the whole image and textual descriptions. To address both issues, we propose an Iterative Robust Visual Grounding (IR-VG) framework with Masked Reference based Centerpoint Supervision (MRCS). The framework introduces iterative multi-level vision-language fusion (IMVF) for better alignment. We use MRCS to ahieve more accurate localization with point-wised feature supervision. Then, to improve the robustness of VG, we also present a multi-stage false-alarm sensitive decoder (MFSD) to prevent the generation of false-alarm objects when presented with inaccurate expressions. The proposed framework is evaluated on five regular VG datasets and two newly constructed robust VG datasets. Extensive experiments demonstrate that IR-VG achieves new state-of-the-art (SOTA) results, with improvements of 25\% and 10\% compared to existing SOTA approaches on the two newly proposed robust VG datasets. Moreover, the proposed framework is also verified effective on five regular VG datasets. Codes and models will be publicly at https://github.com/cv516Buaa/IR-VG.

CVJul 14, 2022Code
MMOTU: A Multi-Modality Ovarian Tumor Ultrasound Image Dataset for Unsupervised Cross-Domain Semantic Segmentation

Qi Zhao, Shuchang Lyu, Wenpei Bai et al.

Ovarian cancer is one of the most harmful gynecological diseases. Detecting ovarian tumors in early stage with computer-aided techniques can efficiently decrease the mortality rate. With the improvement of medical treatment standard, ultrasound images are widely applied in clinical treatment. However, recent notable methods mainly focus on single-modality ultrasound ovarian tumor segmentation or recognition, which means there still lacks researches on exploring the representation capability of multi-modality ultrasound ovarian tumor images. To solve this problem, we propose a Multi-Modality Ovarian Tumor Ultrasound (MMOTU) image dataset containing 1469 2d ultrasound images and 170 contrast enhanced ultrasonography (CEUS) images with pixel-wise and global-wise annotations. Based on MMOTU, we mainly focus on unsupervised cross-domain semantic segmentation task. To solve the domain shift problem, we propose a feature alignment based architecture named Dual-Scheme Domain-Selected Network (DS2Net). Specifically, we first design source-encoder and target-encoder to extract two-style features of source and target images. Then, we propose Domain-Distinct Selected Module (DDSM) and Domain-Universal Selected Module (DUSM) to extract the distinct and universal features in two styles (source-style or target-style). Finally, we fuse these two kinds of features and feed them into the source-decoder and target-decoder to generate final predictions. Extensive comparison experiments and analysis on MMOTU image dataset show that DS2Net can boost the segmentation performance for bidirectional cross-domain adaptation of 2d ultrasound images and CEUS images. Our proposed dataset and code are all available at https://github.com/cv516Buaa/MMOTU_DS2Net.

CVFeb 16, 2023
Local-to-Global Information Communication for Real-Time Semantic Segmentation Network Search

Guangliang Cheng, Peng Sun, Ting-Bing Xu et al.

Neural Architecture Search (NAS) has shown great potentials in automatically designing neural network architectures for real-time semantic segmentation. Unlike previous works that utilize a simplified search space with cell-sharing way, we introduce a new search space where a lightweight model can be more effectively searched by replacing the cell-sharing manner with cell-independent one. Based on this, the communication of local to global information is achieved through two well-designed modules. For local information exchange, a graph convolutional network (GCN) guided module is seamlessly integrated as a communication deliver between cells. For global information aggregation, we propose a novel dense-connected fusion module (cell) which aggregates long-range multi-level features in the network automatically. In addition, a latency-oriented constraint is endowed into the search process to balance the accuracy and latency. We name the proposed framework as Local-to-Global Information Communication Network Search (LGCNet). Extensive experiments on Cityscapes and CamVid datasets demonstrate that LGCNet achieves the new state-of-the-art trade-off between accuracy and speed. In particular, on Cityscapes dataset, LGCNet achieves the new best performance of 74.0\% mIoU with the speed of 115.2 FPS on Titan Xp.

CVAug 17, 2022
Look in Different Views: Multi-Scheme Regression Guided Cell Instance Segmentation

Menghao Li, Wenquan Feng, Shuchang Lyu et al.

Cell instance segmentation is a new and challenging task aiming at joint detection and segmentation of every cell in an image. Recently, many instance segmentation methods have applied in this task. Despite their great success, there still exists two main weaknesses caused by uncertainty of localizing cell center points. First, densely packed cells can easily be recognized into one cell. Second, elongated cell can easily be recognized into two cells. To overcome these two weaknesses, we propose a novel cell instance segmentation network based on multi-scheme regression guidance. With multi-scheme regression guidance, the network has the ability to look each cell in different views. Specifically, we first propose a gaussian guidance attention mechanism to use gaussian labels for guiding the network's attention. We then propose a point-regression module for assisting the regression of cell center. Finally, we utilize the output of the above two modules to further guide the instance segmentation. With multi-scheme regression guidance, we can take full advantage of the characteristics of different regions, especially the central region of the cell. We conduct extensive experiments on benchmark datasets, DSB2018, CA2.5 and SCIS. The encouraging results show that our network achieves SOTA (state-of-the-art) performance. On the DSB2018 and CA2.5, our network surpasses previous methods by 1.2% (AP50). Particularly on SCIS dataset, our network performs stronger by large margin (3.0% higher AP50). Visualization and analysis further prove that our proposed method is interpretable.

CVMay 1, 2025Code
Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook

Muyi Bao, Shuchang Lyu, Zhaoyang Xu et al.

Deep learning has profoundly transformed remote sensing, yet prevailing architectures like Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) remain constrained by critical trade-offs: CNNs suffer from limited receptive fields, while ViTs grapple with quadratic computational complexity, hindering their scalability for high-resolution remote sensing data. State Space Models (SSMs), particularly the recently proposed Mamba architecture, have emerged as a paradigm-shifting solution, combining linear computational scaling with global context modeling. This survey presents a comprehensive review of Mamba-based methodologies in remote sensing, systematically analyzing about 120 Mamba-based remote sensing studies to construct a holistic taxonomy of innovations and applications. Our contributions are structured across five dimensions: (i) foundational principles of vision Mamba architectures, (ii) micro-architectural advancements such as adaptive scan strategies and hybrid SSM formulations, (iii) macro-architectural integrations, including CNN-Transformer-Mamba hybrids and frequency-domain adaptations, (iv) rigorous benchmarking against state-of-the-art methods in multiple application tasks, such as object detection, semantic segmentation, change detection, etc. and (v) critical analysis of unresolved challenges with actionable future directions. By bridging the gap between SSM theory and remote sensing practice, this survey establishes Mamba as a transformative framework for remote sensing analysis. To our knowledge, this paper is the first systematic review of Mamba architectures in remote sensing. Our work provides a structured foundation for advancing research in remote sensing systems through SSM-based methods. We curate an open-source repository (https://github.com/BaoBao0926/Awesome-Mamba-in-Remote-Sensing) to foster community-driven advancements.

CVNov 8, 2024Code
Joint-Optimized Unsupervised Adversarial Domain Adaptation in Remote Sensing Segmentation with Prompted Foundation Model

Shuchang Lyu, Qi Zhao, Guangliang Cheng et al.

Unsupervised Domain Adaptation for Remote Sensing Semantic Segmentation (UDA-RSSeg) addresses the challenge of adapting a model trained on source domain data to target domain samples, thereby minimizing the need for annotated data across diverse remote sensing scenes. This task presents two principal challenges: (1) severe inconsistencies in feature representation across different remote sensing domains, and (2) a domain gap that emerges due to the representation bias of source domain patterns when translating features to predictive logits. To tackle these issues, we propose a joint-optimized adversarial network incorporating the "Segment Anything Model (SAM) (SAM-JOANet)" for UDA-RSSeg. Our approach integrates SAM to leverage its robust generalized representation capabilities, thereby alleviating feature inconsistencies. We introduce a finetuning decoder designed to convert SAM-Encoder features into predictive logits. Additionally, a feature-level adversarial-based prompted segmentor is employed to generate class-agnostic maps, which guide the finetuning decoder's feature representations. The network is optimized end-to-end, combining the prompted segmentor and the finetuning decoder. Extensive evaluations on benchmark datasets, including ISPRS (Potsdam/Vaihingen) and CITY-OSM (Paris/Chicago), demonstrate the effectiveness of our method. The results, supported by visualization and analysis, confirm the method's interpretability and robustness. The code of this paper is available at https://github.com/CV-ShuchangLyu/SAM-JOANet.

CVMar 13Code
Think and Answer ME: Benchmarking and Exploring Multi-Entity Reasoning Grounding in Remote Sensing

Shuchang Lyu, Haiquan Wen, Guangliang Cheng et al.

Recent advances in reasoning language models and reinforcement learning with verifiable rewards have significantly enhanced multi-step reasoning capabilities. This progress motivates the extension of reasoning paradigms to remote sensing visual grounding task. However, existing remote sensing grounding methods remain largely confined to perception-level matching and single-entity formulations, limiting the role of explicit reasoning and inter-entity modeling. To address this challenge, we introduce a new benchmark dataset for Multi-Entity Reasoning Grounding in Remote Sensing (ME-RSRG). Based on ME-RSRG, we reformulate remote sensing grounding as a multi-entity reasoning task and propose an Entity-Aware Reasoning (EAR) framework built upon visual-linguistic foundation models. EAR generates structured reasoning traces and subject-object grounding outputs. It adopts supervised fine-tuning for cold-start initialization and is further optimized via entity-aware reward-driven Group Relative Policy Optimization (GRPO). Extensive experiments on ME-RSRG demonstrate the challenges of multi-entity reasoning and verify the effectiveness of our proposed EAR framework. Our dataset, code, and models will be available at https://github.com/CV-ShuchangLyu/ME-RSRG.

CVSep 28, 2025Code
Adversarial Versus Federated: An Adversarial Learning based Multi-Modality Cross-Domain Federated Medical Segmentation

You Zhou, Lijiang Chen, Shuchang Lyu et al.

Federated learning enables collaborative training of machine learning models among different clients while ensuring data privacy, emerging as the mainstream for breaking data silos in the healthcare domain. However, the imbalance of medical resources, data corruption or improper data preservation may lead to a situation where different clients possess medical images of different modality. This heterogeneity poses a significant challenge for cross-domain medical image segmentation within the federated learning framework. To address this challenge, we propose a new Federated Domain Adaptation (FedDA) segmentation training framework. Specifically, we propose a feature-level adversarial learning among clients by aligning feature maps across clients through embedding an adversarial training mechanism. This design can enhance the model's generalization on multiple domains and alleviate the negative impact from domain-shift. Comprehensive experiments on three medical image datasets demonstrate that our proposed FedDA substantially achieves cross-domain federated aggregation, endowing single modality client with cross-modality processing capabilities, and consistently delivers robust performance compared to state-of-the-art federated aggregation algorithms in objective and subjective assessment. Our code are available at https://github.com/GGbond-study/FedDA.

CVAug 6, 2025Code
RAIDX: A Retrieval-Augmented Generation and GRPO Reinforcement Learning Framework for Explainable Deepfake Detection

Tianxiao Li, Zhenglin Huang, Haiquan Wen et al.

The rapid advancement of AI-generation models has enabled the creation of hyperrealistic imagery, posing ethical risks through widespread misinformation. Current deepfake detection methods, categorized as face specific detectors or general AI-generated detectors, lack transparency by framing detection as a classification task without explaining decisions. While several LLM-based approaches offer explainability, they suffer from coarse-grained analyses and dependency on labor-intensive annotations. This paper introduces RAIDX (Retrieval-Augmented Image Deepfake Detection and Explainability), a novel deepfake detection framework integrating Retrieval-Augmented Generation (RAG) and Group Relative Policy Optimization (GRPO) to enhance detection accuracy and decision explainability. Specifically, RAIDX leverages RAG to incorporate external knowledge for improved detection accuracy and employs GRPO to autonomously generate fine-grained textual explanations and saliency maps, eliminating the need for extensive manual annotations. Experiments on multiple benchmarks demonstrate RAIDX's effectiveness in identifying real or fake, and providing interpretable rationales in both textual descriptions and saliency maps, achieving state-of-the-art detection performance while advancing transparency in deepfake identification. RAIDX represents the first unified framework to synergize RAG and GRPO, addressing critical gaps in accuracy and explainability. Our code and models will be publicly available.

IVJul 6, 2025Code
ViTaL: A Multimodality Dataset and Benchmark for Multi-pathological Ovarian Tumor Recognition

You Zhou, Lijiang Chen, Guangxia Cui et al.

Ovarian tumor, as a common gynecological disease, can rapidly deteriorate into serious health crises when undetected early, thus posing significant threats to the health of women. Deep neural networks have the potential to identify ovarian tumors, thereby reducing mortality rates, but limited public datasets hinder its progress. To address this gap, we introduce a vital ovarian tumor pathological recognition dataset called \textbf{ViTaL} that contains \textbf{V}isual, \textbf{T}abular and \textbf{L}inguistic modality data of 496 patients across six pathological categories. The ViTaL dataset comprises three subsets corresponding to different patient data modalities: visual data from 2216 two-dimensional ultrasound images, tabular data from medical examinations of 496 patients, and linguistic data from ultrasound reports of 496 patients. It is insufficient to merely distinguish between benign and malignant ovarian tumors in clinical practice. To enable multi-pathology classification of ovarian tumor, we propose a ViTaL-Net based on the Triplet Hierarchical Offset Attention Mechanism (THOAM) to minimize the loss incurred during feature fusion of multi-modal data. This mechanism could effectively enhance the relevance and complementarity between information from different modalities. ViTaL-Net serves as a benchmark for the task of multi-pathology, multi-modality classification of ovarian tumors. In our comprehensive experiments, the proposed method exhibited satisfactory performance, achieving accuracies exceeding 90\% on the two most common pathological types of ovarian tumor and an overall performance of 85\%. Our dataset and code are available at https://github.com/GGbond-study/vitalnet.

IVMar 25, 2025Code
ASP-VMUNet: Atrous Shifted Parallel Vision Mamba U-Net for Skin Lesion Segmentation

Muyi Bao, Shuchang Lyu, Zhaoyang Xu et al.

Skin lesion segmentation is a critical challenge in computer vision, and it is essential to separate pathological features from healthy skin for diagnostics accurately. Traditional Convolutional Neural Networks (CNNs) are limited by narrow receptive fields, and Transformers face significant computational burdens. This paper presents a novel skin lesion segmentation framework, the Atrous Shifted Parallel Vision Mamba UNet (ASP-VMUNet), which integrates the efficient and scalable Mamba architecture to overcome limitations in traditional CNNs and computationally demanding Transformers. The framework introduces an atrous scan technique that minimizes background interference and expands the receptive field, enhancing Mamba's scanning capabilities. Additionally, the inclusion of a Parallel Vision Mamba (PVM) layer and a shift round operation optimizes feature segmentation and fosters rich inter-segment information exchange. A supplementary CNN branch with a Selective-Kernel (SK) Block further refines the segmentation by blending local and global contextual information. Tested on four benchmark datasets (ISIC16/17/18 and PH2), ASP-VMUNet demonstrates superior performance in skin lesion segmentation, validated by comprehensive ablation studies. This approach not only advances medical image segmentation but also highlights the benefits of hybrid architectures in medical imaging technology. Our code is available at https://github.com/BaoBao0926/ASP-VMUNet/tree/main.

CVAug 14, 2021Code
A Self-Distillation Embedded Supervised Affinity Attention Model for Few-Shot Segmentation

Qi Zhao, Binghao Liu, Shuchang Lyu et al.

Few-shot segmentation focuses on the generalization of models to segment unseen object with limited annotated samples. However, existing approaches still face two main challenges. First, huge feature distinction between support and query images causes knowledge transferring barrier, which harms the segmentation performance. Second, limited support prototypes cannot adequately represent features of support objects, hard to guide high-quality query segmentation. To deal with the above two issues, we propose self-distillation embedded supervised affinity attention model to improve the performance of few-shot segmentation task. Specifically, the self-distillation guided prototype module uses self-distillation to align the features of support and query. The supervised affinity attention module generates high-quality query attention map to provide sufficient object information. Extensive experiments prove that our model significantly improves the performance compared to existing methods. Comprehensive ablation experiments and visualization studies also show the significant effect of our method on few-shot segmentation task. On COCO-20i dataset, we achieve new state-of-the-art results. Training code and pretrained models are available at https://github.com/cv516Buaa/SD-AANet.

CRDec 29, 2023
MVPatch: More Vivid Patch for Adversarial Camouflaged Attacks on Object Detectors in the Physical World

Zheng Zhou, Hongbo Zhao, Ju Liu et al.

Recent studies have shown that Adversarial Patches (APs) can effectively manipulate object detection models. However, the conspicuous patterns often associated with these patches tend to attract human attention, posing a significant challenge. Existing research has primarily focused on enhancing attack efficacy in the physical domain while often neglecting the optimization of stealthiness and transferability. Furthermore, applying APs in real-world scenarios faces major challenges related to transferability, stealthiness, and practicality. To address these challenges, we introduce generalization theory into the context of APs, enabling our iterative process to simultaneously enhance transferability and refine visual correlation with realistic images. We propose a Dual-Perception-Based Framework (DPBF) to generate the More Vivid Patch (MVPatch), which enhances transferability, stealthiness, and practicality. The DPBF integrates two key components: the Model-Perception-Based Module (MPBM) and the Human-Perception-Based Module (HPBM), along with regularization terms. The MPBM employs ensemble strategy to reduce object confidence scores across multiple detectors, thereby improving AP transferability with robust theoretical support. Concurrently, the HPBM introduces a lightweight method for achieving visual similarity, creating natural and inconspicuous adversarial patches without relying on additional generative models. The regularization terms further enhance the practicality of the generated APs in the physical domain. Additionally, we introduce naturalness and transferability scores to provide an unbiased assessment of APs. Extensive experimental validation demonstrates that MVPatch achieves superior transferability and a natural appearance in both digital and physical domains, underscoring its effectiveness and stealthiness.

CVNov 14, 2024
BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation

Zheng Zhou, Wenquan Feng, Shuchang Lyu et al.

Dataset Distillation (DD) is an emerging technique that compresses large-scale datasets into significantly smaller synthesized datasets while preserving high test performance and enabling the efficient training of large models. However, current research primarily focuses on enhancing evaluation accuracy under limited compression ratios, often overlooking critical security concerns such as adversarial robustness. A key challenge in evaluating this robustness lies in the complex interactions between distillation methods, model architectures, and adversarial attack strategies, which complicate standardized assessments. To address this, we introduce BEARD, an open and unified benchmark designed to systematically assess the adversarial robustness of DD methods, including DM, IDM, and BACON. BEARD encompasses a variety of adversarial attacks (e.g., FGSM, PGD, C&W) on distilled datasets like CIFAR-10/100 and TinyImageNet. Utilizing an adversarial game framework, it introduces three key metrics: Robustness Ratio (RR), Attack Efficiency Ratio (AE), and Comprehensive Robustness-Efficiency Index (CREI). Our analysis includes unified benchmarks, various Images Per Class (IPC) settings, and the effects of adversarial training. Results are available on the BEARD Leaderboard, along with a library providing model and dataset pools to support reproducible research. Access the code at BEARD.

CVOct 17, 2025
Deep Learning Based Domain Adaptation Methods in Remote Sensing: A Comprehensive Survey

Shuchang Lyu, Qi Zhao, Zheng Zhou et al.

Domain adaptation is a crucial and increasingly important task in remote sensing, aiming to transfer knowledge from a source domain a differently distributed target domain. It has broad applications across various real-world applications, including remote sensing element interpretation, ecological environment monitoring, and urban/rural planning. However, domain adaptation in remote sensing poses significant challenges due to differences in data, such as variations in ground sampling distance, imaging modes from various sensors, geographical landscapes, and environmental conditions. In recent years, deep learning has emerged as a powerful tool for feature representation and cross-domain knowledge transfer, leading to widespread adoption in remote sensing tasks. In this paper, we present a comprehensive survey of significant advancements in deep learning based domain adaptation for remote sensing. We first introduce the preliminary knowledge to clarify key concepts, mathematical notations, and the taxonomy of methodologies. We then organize existing algorithms from multiple perspectives, including task categorization, input mode, supervision paradigm, and algorithmic granularity, providing readers with a structured understanding of the field. Next, we review widely used datasets and summarize the performance of state-of-the-art methods to provide an overview of current progress. We also identify open challenges and potential directions to guide future research in domain adaptation for remote sensing. Compared to previous surveys, this work addresses a broader range of domain adaptation tasks in remote sensing, rather than concentrating on a few subfields. It also presents a systematic taxonomy, providing a more comprehensive and organized understanding of the field. As a whole, this survey can inspire the research community, foster understanding, and guide future work in the field.

CVJun 3, 2024
BACON: Bayesian Optimal Condensation Framework for Dataset Distillation

Zheng Zhou, Hongbo Zhao, Guangliang Cheng et al.

Dataset Distillation (DD) aims to distill knowledge from extensive datasets into more compact ones while preserving performance on the test set, thereby reducing storage costs and training expenses. However, existing methods often suffer from computational intensity, particularly exhibiting suboptimal performance with large dataset sizes due to the lack of a robust theoretical framework for analyzing the DD problem. To address these challenges, we propose the BAyesian optimal CONdensation framework (BACON), which is the first work to introduce the Bayesian theoretical framework to the literature of DD. This framework provides theoretical support for enhancing the performance of DD. Furthermore, BACON formulates the DD problem as the minimization of the expected risk function in joint probability distributions using the Bayesian framework. Additionally, by analyzing the expected risk function for optimal condensation, we derive a numerically feasible lower bound based on specific assumptions, providing an approximate solution for BACON. We validate BACON across several datasets, demonstrating its superior performance compared to existing state-of-the-art methods. For instance, under the IPC-10 setting, BACON achieves a 3.46% accuracy gain over the IDM method on the CIFAR-10 dataset and a 3.10% gain on the TinyImageNet dataset. Our extensive experiments confirm the effectiveness of BACON and its seamless integration with existing methods, thereby enhancing their performance for the DD task. Code and distilled datasets are available at BACON.

CVMay 9, 2023
Change Detection Methods for Remote Sensing in the Last Decade: A Comprehensive Review

Guangliang Cheng, Yunmeng Huang, Xiangtai Li et al.

Change detection is an essential and widely utilized task in remote sensing that aims to detect and analyze changes occurring in the same geographical area over time, which has broad applications in urban development, agricultural surveys, and land cover monitoring. Detecting changes in remote sensing images is a complex challenge due to various factors, including variations in image quality, noise, registration errors, illumination changes, complex landscapes, and spatial heterogeneity. In recent years, deep learning has emerged as a powerful tool for feature extraction and addressing these challenges. Its versatility has resulted in its widespread adoption for numerous image-processing tasks. This paper presents a comprehensive survey of significant advancements in change detection for remote sensing images over the past decade. We first introduce some preliminary knowledge for the change detection task, such as problem definition, datasets, evaluation metrics, and transformer basics, as well as provide a detailed taxonomy of existing algorithms from three different perspectives: algorithm granularity, supervision modes, and learning frameworks in the methodology section. This survey enables readers to gain systematic knowledge of change detection tasks from various angles. We then summarize the state-of-the-art performance on several dominant change detection datasets, providing insights into the strengths and limitations of existing algorithms. Based on our survey, some future research directions for change detection in remote sensing are well identified. This survey paper will shed some light on the community and inspire further research efforts in the change detection task.

CVNov 29, 2021
Attention-based Feature Decomposition-Reconstruction Network for Scene Text Detection

Qi Zhao, Yufei Wang, Shuchang Lyu et al.

Recently, scene text detection has been a challenging task. Texts with arbitrary shape or large aspect ratio are usually hard to detect. Previous segmentation-based methods can describe curve text more accurately but suffer from over segmentation and text adhesion. In this paper, we propose attention-based feature decomposition-reconstruction network for scene text detection, which utilizes contextual information and low-level feature to enhance the performance of segmentation-based text detector. In the phase of feature fusion, we introduce cross level attention module to enrich contextual information of text by adding attention mechanism on fused multi-scaled feature. In the phase of probability map generation, a feature decomposition-reconstruction module is proposed to alleviate the over segmentation problem of large aspect ratio text, which decomposes text feature according to their frequency characteristic and then reconstructs it by adding low-level feature. Experiments have been conducted on two public benchmark datasets and results show that our proposed method achieves state-of-the-art performance.

CVOct 9, 2021
A Feature Consistency Driven Attention Erasing Network for Fine-Grained Image Retrieval

Qi Zhao, Xu Wang, Shuchang Lyu et al.

Large-scale fine-grained image retrieval has two main problems. First, low dimensional feature embedding can fasten the retrieval process but bring accuracy reduce due to overlooking the feature of significant attention regions of images in fine-grained datasets. Second, fine-grained images lead to the same category query hash codes mapping into the different cluster in database hash latent space. To handle these two issues, we propose a feature consistency driven attention erasing network (FCAENet) for fine-grained image retrieval. For the first issue, we propose an adaptive augmentation module in FCAENet, which is selective region erasing module (SREM). SREM makes the network more robust on subtle differences of fine-grained task by adaptively covering some regions of raw images. The feature extractor and hash layer can learn more representative hash code for fine-grained images by SREM. With regard to the second issue, we fully exploit the pair-wise similarity information and add the enhancing space relation loss (ESRL) in FCAENet to make the vulnerable relation stabler between the query hash code and database hash code. We conduct extensive experiments on five fine-grained benchmark datasets (CUB2011, Aircraft, NABirds, VegFru, Food101) for 12bits, 24bits, 32bits, 48bits hash code. The results show that FCAENet achieves the state-of-the-art (SOTA) fine-grained retrieval performance compared with other methods.

CVAug 26, 2021
TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios

Xingkui Zhu, Shuchang Lyu, Xu Wang et al.

Object detection on drone-captured scenarios is a recent popular task. As drones always navigate in different altitudes, the object scale varies violently, which burdens the optimization of networks. Moreover, high-speed and low-altitude flight bring in the motion blur on the densely packed objects, which leads to great challenge of object distinction. To solve the two issues mentioned above, we propose TPH-YOLOv5. Based on YOLOv5, we add one more prediction head to detect different-scale objects. Then we replace the original prediction heads with Transformer Prediction Heads (TPH) to explore the prediction potential with self-attention mechanism. We also integrate convolutional block attention model (CBAM) to find attention region on scenarios with dense objects. To achieve more improvement of our proposed TPH-YOLOv5, we provide bags of useful strategies such as data augmentation, multiscale testing, multi-model integration and utilizing extra classifier. Extensive experiments on dataset VisDrone2021 show that TPH-YOLOv5 have good performance with impressive interpretability on drone-captured scenarios. On DET-test-challenge dataset, the AP result of TPH-YOLOv5 are 39.18%, which is better than previous SOTA method (DPNetV3) by 1.81%. On VisDrone Challenge 2021, TPHYOLOv5 wins 5th place and achieves well-matched results with 1st place model (AP 39.43%). Compared to baseline model (YOLOv5), TPH-YOLOv5 improves about 7%, which is encouraging and competitive.

CVApr 1, 2021
Embedded Self-Distillation in Compact Multi-Branch Ensemble Network for Remote Sensing Scene Classification

Qi Zhao, Yujing Ma, Shuchang Lyu et al.

Remote sensing (RS) image scene classification task faces many challenges due to the interference from different characteristics of different geographical elements. To solve this problem, we propose a multi-branch ensemble network to enhance the feature representation ability by fusing features in final output logits and intermediate feature maps. However, simply adding branches will increase the complexity of models and decline the inference efficiency. On this issue, we embed self-distillation (SD) method to transfer knowledge from ensemble network to main-branch in it. Through optimizing with SD, main-branch will have close performance as ensemble network. During inference, we can cut other branches to simplify the whole model. In this paper, we first design compact multi-branch ensemble network, which can be trained in an end-to-end manner. Then, we insert SD method on output logits and feature maps. Compared to previous methods, our proposed architecture (ESD-MBENet) performs strongly on classification accuracy with compact design. Extensive experiments are applied on three benchmark RS datasets AID, NWPU-RESISC45 and UC-Merced with three classic baseline models, VGG16, ResNet50 and DenseNet121. Results prove that our proposed ESD-MBENet can achieve better accuracy than previous state-of-the-art (SOTA) complex models. Moreover, abundant visualization analysis make our method more convincing and interpretable.

CVMar 1, 2021
Embedded Knowledge Distillation in Depth-Level Dynamic Neural Network

Qi Zhao, Shuchang Lyu, Zhiwei Zhang et al.

In real applications, different computation-resource devices need different-depth networks (e.g., ResNet-18/34/50) with high-accuracy. Usually, existing methods either design multiple networks and train them independently, or construct depth-level/width-level dynamic neural networks which is hard to prove the accuracy of each sub-net. In this article, we propose an elegant Depth-Level Dynamic Neural Network (DDNN) integrated different-depth sub-nets of similar architectures. To improve the generalization of sub-nets, we design the Embedded-Knowledge-Distillation (EKD) training mechanism for the DDNN to implement knowledge transfer from the teacher (full-net) to multiple students (sub-nets). Specifically, the Kullback-Leibler (KL) divergence is introduced to constrain the posterior class probability consistency between full-net and sub-nets, and self-attention distillation on the same resolution feature of different depth is addressed to drive more abundant feature representations of sub-nets. Thus, we can obtain multiple high-accuracy sub-nets simultaneously in a DDNN via the online knowledge distillation in each training iteration without extra computation cost. Extensive experiments on CIFAR-10/100, and ImageNet datasets demonstrate that sub-nets in DDNN with EKD training achieve better performance than individually training networks while preserving the original performance of full-nets.

CVDec 30, 2020
MM-FSOD: Meta and metric integrated few-shot object detection

Yuewen Li, Wenquan Feng, Shuchang Lyu et al.

In the object detection task, CNN (Convolutional neural networks) models always need a large amount of annotated examples in the training process. To reduce the dependency of expensive annotations, few-shot object detection has become an increasing research focus. In this paper, we present an effective object detection framework (MM-FSOD) that integrates metric learning and meta-learning to tackle the few-shot object detection task. Our model is a class-agnostic detection model that can accurately recognize new categories, which are not appearing in training samples. Specifically, to fast learn the features of new categories without a fine-tuning process, we propose a meta-representation module (MR module) to learn intra-class mean prototypes. MR module is trained with a meta-learning method to obtain the ability to reconstruct high-level features. To further conduct similarity of features between support prototype with query RoIs features, we propose a Pearson metric module (PR module) which serves as a classifier. Compared to the previous commonly used metric method, cosine distance metric. PR module enables the model to align features into discriminative embedding space. We conduct extensive experiments on benchmark datasets FSOD, MS COCO, and PASCAL VOC to demonstrate the feasibility and efficiency of our model. Comparing with the previous method, MM-FSOD achieves state-of-the-art (SOTA) results.

CVDec 29, 2020
MGML: Multi-Granularity Multi-Level Feature Ensemble Network for Remote Sensing Scene Classification

Qi Zhao, Shuchang Lyu, Yuewen Li et al.

Remote sensing (RS) scene classification is a challenging task to predict scene categories of RS images. RS images have two main characters: large intra-class variance caused by large resolution variance and confusing information from large geographic covering area. To ease the negative influence from the above two characters. We propose a Multi-granularity Multi-Level Feature Ensemble Network (MGML-FENet) to efficiently tackle RS scene classification task in this paper. Specifically, we propose Multi-granularity Multi-Level Feature Fusion Branch (MGML-FFB) to extract multi-granularity features in different levels of network by channel-separate feature generator (CS-FG). To avoid the interference from confusing information, we propose Multi-granularity Multi-Level Feature Ensemble Module (MGML-FEM) which can provide diverse predictions by full-channel feature generator (FC-FG). Compared to previous methods, our proposed networks have ability to use structure information and abundant fine-grained features. Furthermore, through ensemble learning method, our proposed MGML-FENets can obtain more convincing final predictions. Extensive classification experiments on multiple RS datasets (AID, NWPU-RESISC45, UC-Merced and VGoogle) demonstrate that our proposed networks achieve better performance than previous state-of-the-art (SOTA) networks. The visualization analysis also shows the good interpretability of MGML-FENet.