Boxue Du

CV
h-index53
59papers
2,759citations
Novelty49%
AI Score60

59 Papers

25.9CVNov 19, 2022Code
DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting

Maoyuan Ye, Jing Zhang, Shanshan Zhao et al.

End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Dealing with the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although Transformer-based methods eliminate the heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and low training efficiency. In this paper, we present DeepSolo, a simple DETR-like baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing a single decoder, the point queries have encoded requisite text semantics and locations, thus can be further decoded to the center line, boundary, script, and confidence of text via very simple prediction heads in parallel. Besides, we also introduce a text-matching criterion to deliver more accurate supervisory signals, thus enabling more efficient training. Quantitative experiments on public benchmarks demonstrate that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. In addition, DeepSolo is also compatible with line annotations, which require much less annotation cost than polygons. The code is available at https://github.com/ViTAE-Transformer/DeepSolo.

30.8CVApr 6, 2022Code
An Empirical Study of Remote Sensing Pretraining

Di Wang, Jing Zhang, Bo Du et al.

Deep learning has largely reshaped remote sensing (RS) research for aerial image understanding and made a great success. Nevertheless, most of the existing deep models are initialized with the ImageNet pretrained weights. Since natural images inevitably present a large domain gap relative to aerial images, probably limiting the finetuning performance on downstream aerial scene tasks. This issue motivates us to conduct an empirical study of remote sensing pretraining (RSP) on aerial images. To this end, we train different networks from scratch with the help of the largest RS scene recognition dataset up to now -- MillionAID, to obtain a series of RS pretrained backbones, including both convolutional neural networks (CNN) and vision transformers such as Swin and ViTAE, which have shown promising performance on computer vision tasks. Then, we investigate the impact of RSP on representative downstream tasks including scene recognition, semantic segmentation, object detection, and change detection using these CNN and vision transformer backbones. Empirical study shows that RSP can help deliver distinctive performances in scene recognition tasks and in perceiving RS related semantics such as "Bridge" and "Airplane". We also find that, although RSP mitigates the data discrepancies of traditional ImageNet pretraining on RS images, it may still suffer from task discrepancies, where downstream tasks require different representations from scene recognition tasks. These findings call for further research efforts on both large-scale pretraining datasets and effective pretraining methods. The codes and pretrained models will be released at https://github.com/ViTAE-Transformer/ViTAE-Transformer-Remote-Sensing.

22.1CVJul 10, 2022Code
DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer

Maoyuan Ye, Jing Zhang, Shanshan Zhao et al.

Recently, Transformer-based methods, which predict polygon points or Bezier curve control points for localizing texts, are popular in scene text detection. However, these methods built upon detection transformer framework might achieve sub-optimal training efficiency and performance due to coarse positional query modeling.In addition, the point label form exploited in previous works implies the reading order of humans, which impedes the detection robustness from our observation. To address these challenges, this paper proposes a concise Dynamic Point Text DEtection TRansformer network, termed DPText-DETR. In detail, DPText-DETR directly leverages explicit point coordinates to generate position queries and dynamically updates them in a progressive way. Moreover, to improve the spatial inductive bias of non-local self-attention in Transformer, we present an Enhanced Factorized Self-Attention module which provides point queries within each instance with circular shape guidance. Furthermore, we design a simple yet effective positional label form to tackle the side effect of the previous form. To further evaluate the impact of different label forms on the detection robustness in real-world scenario, we establish an Inverse-Text test set containing 500 manually labeled images. Extensive experiments prove the high training efficiency, robustness, and state-of-the-art performance of our method on popular benchmarks. The code and the Inverse-Text test set are available at https://github.com/ymy-k/DPText-DETR.

31.7CVMar 5, 2022Code
Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers

Lixiang Ru, Yibing Zhan, Baosheng Yu et al.

Weakly-supervised semantic segmentation (WSSS) with image-level labels is an important and challenging task. Due to the high training efficiency, end-to-end solutions for WSSS have received increasing attention from the community. However, current methods are mainly based on convolutional neural networks and fail to explore the global information properly, thus usually resulting in incomplete object regions. In this paper, to address the aforementioned problem, we introduce Transformers, which naturally integrate global information, to generate more integral initial pseudo labels for end-to-end WSSS. Motivated by the inherent consistency between the self-attention in Transformers and the semantic affinity, we propose an Affinity from Attention (AFA) module to learn semantic affinity from the multi-head self-attention (MHSA) in Transformers. The learned affinity is then leveraged to refine the initial pseudo labels for segmentation. In addition, to efficiently derive reliable affinity labels for supervising AFA and ensure the local consistency of pseudo labels, we devise a Pixel-Adaptive Refinement module that incorporates low-level image appearance information to refine the pseudo labels. We perform extensive experiments and our method achieves 66.0% and 38.9% mIoU on the PASCAL VOC 2012 and MS COCO 2014 datasets, respectively, significantly outperforming recent end-to-end methods and several multi-stage competitors. Code is available at https://github.com/rulixiang/afa.

13.6CVApr 1, 2022Code
An End-to-end Supervised Domain Adaptation Framework for Cross-Domain Change Detection

Jia Liu, Wenjie Xuan, Yuhang Gan et al.

Existing deep learning-based change detection methods try to elaborately design complicated neural networks with powerful feature representations, but ignore the universal domain shift induced by time-varying land cover changes, including luminance fluctuations and season changes between pre-event and post-event images, thereby producing sub-optimal results. In this paper, we propose an end-to-end Supervised Domain Adaptation framework for cross-domain Change Detection, namely SDACD, to effectively alleviate the domain shift between bi-temporal images for better change predictions. Specifically, our SDACD presents collaborative adaptations from both image and feature perspectives with supervised learning. Image adaptation exploits generative adversarial learning with cycle-consistency constraints to perform cross-domain style transformation, effectively narrowing the domain gap in a two-side generation fashion. As to feature adaptation, we extract domain-invariant features to align different feature distributions in the feature space, which could further reduce the domain gap of cross-domain images. To further improve the performance, we combine three types of bi-temporal images for the final change prediction, including the initial input bi-temporal images and two generated bi-temporal images from the pre-event and post-event domains. Extensive experiments and analyses on two benchmarks demonstrate the effectiveness and universality of our proposed framework. Notably, our framework pushes several representative baseline models up to new State-Of-The-Art records, achieving 97.34% and 92.36% on the CDD and WHU building datasets, respectively. The source code and models are publicly available at https://github.com/Perfect-You/SDACD.

27.0CVMar 2, 2023Code
Token Contrast for Weakly-Supervised Semantic Segmentation

Lixiang Ru, Heliang Zheng, Yibing Zhan et al.

Weakly-Supervised Semantic Segmentation (WSSS) using image-level labels typically utilizes Class Activation Map (CAM) to generate the pseudo labels. Limited by the local structure perception of CNN, CAM usually cannot identify the integral object regions. Though the recent Vision Transformer (ViT) can remedy this flaw, we observe it also brings the over-smoothing issue, \ie, the final patch tokens incline to be uniform. In this work, we propose Token Contrast (ToCo) to address this issue and further explore the virtue of ViT for WSSS. Firstly, motivated by the observation that intermediate layers in ViT can still retain semantic diversity, we designed a Patch Token Contrast module (PTC). PTC supervises the final patch tokens with the pseudo token relations derived from intermediate layers, allowing them to align the semantic regions and thus yield more accurate CAM. Secondly, to further differentiate the low-confidence regions in CAM, we devised a Class Token Contrast module (CTC) inspired by the fact that class tokens in ViT can capture high-level semantics. CTC facilitates the representation consistency between uncertain local regions and global objects by contrasting their class tokens. Experiments on the PASCAL VOC and MS COCO datasets show the proposed ToCo can remarkably surpass other single-stage competitors and achieve comparable performance with state-of-the-art multi-stage methods. Code is available at https://github.com/rulixiang/ToCo.

32.6CVAug 8, 2022Code
Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

Di Wang, Qiming Zhang, Yufei Xu et al.

Large-scale vision foundation models have made significant progress in visual tasks on natural images, with vision transformers being the primary choice due to their good scalability and representation ability. However, large-scale models in remote sensing (RS) have not yet been sufficiently explored. In this paper, we resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models tailored to RS tasks and investigate how such large models perform. To handle the large sizes and objects of arbitrary orientations in RS images, we propose a new rotated varied-size window attention to replace the original full attention in transformers, which can significantly reduce the computational cost and memory footprint while learning better object representation by extracting rich context from the generated diverse windows. Experiments on detection tasks show the superiority of our model over all state-of-the-art models, achieving 81.24% mAP on the DOTA-V1.0 dataset. The results of our models on downstream classification and segmentation tasks also show competitive performance compared to existing advanced methods. Further experiments show the advantages of our models in terms of computational complexity and data efficiency in transferring.

12.8IVJul 7, 2022Code
What Makes for Automatic Reconstruction of Pulmonary Segments

Kaiming Kuang, Li Zhang, Jingyu Li et al.

3D reconstruction of pulmonary segments plays an important role in surgical treatment planning of lung cancer, which facilitates preservation of pulmonary function and helps ensure low recurrence rates. However, automatic reconstruction of pulmonary segments remains unexplored in the era of deep learning. In this paper, we investigate what makes for automatic reconstruction of pulmonary segments. First and foremost, we formulate, clinically and geometrically, the anatomical definitions of pulmonary segments, and propose evaluation metrics adhering to these definitions. Second, we propose ImPulSe (Implicit Pulmonary Segment), a deep implicit surface model designed for pulmonary segment reconstruction. The automatic reconstruction of pulmonary segments by ImPulSe is accurate in metrics and visually appealing. Compared with canonical segmentation methods, ImPulSe outputs continuous predictions of arbitrary resolutions with higher training efficiency and fewer parameters. Lastly, we experiment with different network inputs to analyze what matters in the task of pulmonary segment reconstruction. Our code is available at https://github.com/M3DV/ImPulSe.

11.6CVApr 19, 2023Code
DCN-T: Dual Context Network with Transformer for Hyperspectral Image Classification

Di Wang, Jing Zhang, Bo Du et al.

Hyperspectral image (HSI) classification is challenging due to spatial variability caused by complex imaging conditions. Prior methods suffer from limited representation ability, as they train specially designed networks from scratch on limited annotated data. We propose a tri-spectral image generation pipeline that transforms HSI into high-quality tri-spectral images, enabling the use of off-the-shelf ImageNet pretrained backbone networks for feature extraction. Motivated by the observation that there are many homogeneous areas with distinguished semantic and geometric properties in HSIs, which can be used to extract useful contexts, we propose an end-to-end segmentation network named DCN-T. It adopts transformers to effectively encode regional adaptation and global aggregation spatial contexts within and between the homogeneous areas discovered by similarity-based clustering. To fully exploit the rich spectrums of the HSI, we adopt an ensemble approach where all segmentation results of the tri-spectral images are integrated into the final prediction through a voting scheme. Extensive experiments on three public benchmarks show that our proposed method outperforms state-of-the-art methods for HSI classification.

9.5IVAug 3, 2022Code
LSSANet: A Long Short Slice-Aware Network for Pulmonary Nodule Detection

Rui Xu, Yong Luo, Bo Du et al.

Convolutional neural networks (CNNs) have been demonstrated to be highly effective in the field of pulmonary nodule detection. However, existing CNN based pulmonary nodule detection methods lack the ability to capture long-range dependencies, which is vital for global information extraction. In computer vision tasks, non-local operations have been widely utilized, but the computational cost could be very high for 3D computed tomography (CT) images. To address this issue, we propose a long short slice-aware network (LSSANet) for the detection of pulmonary nodules. In particular, we develop a new non-local mechanism termed long short slice grouping (LSSG), which splits the compact non-local embeddings into a short-distance slice grouped one and a long-distance slice grouped counterpart. This not only reduces the computational burden, but also keeps long-range dependencies among any elements across slices and in the whole feature map. The proposed LSSG is easy-to-use and can be plugged into many pulmonary nodule detection networks. To verify the performance of LSSANet, we compare with several recently proposed and competitive detection approaches based on 2D/3D CNN. Promising evaluation results on the large-scale PN9 dataset demonstrate the effectiveness of our method. Code is at https://github.com/Ruixxxx/LSSANet.

11.2CVOct 3, 2022Code
Unsupervised Multimodal Change Detection Based on Structural Relationship Graph Representation Learning

Hongruixuan Chen, Naoto Yokoya, Chen Wu et al.

Unsupervised multimodal change detection is a practical and challenging topic that can play an important role in time-sensitive emergency applications. To address the challenge that multimodal remote sensing images cannot be directly compared due to their modal heterogeneity, we take advantage of two types of modality-independent structural relationships in multimodal images. In particular, we present a structural relationship graph representation learning framework for measuring the similarity of the two structural relationships. Firstly, structural graphs are generated from preprocessed multimodal image pairs by means of an object-based image analysis approach. Then, a structural relationship graph convolutional autoencoder (SR-GCAE) is proposed to learn robust and representative features from graphs. Two loss functions aiming at reconstructing vertex information and edge information are presented to make the learned representations applicable for structural relationship similarity measurement. Subsequently, the similarity levels of two structural relationships are calculated from learned graph representations and two difference images are generated based on the similarity levels. After obtaining the difference images, an adaptive fusion strategy is presented to fuse the two difference images. Finally, a morphological filtering-based postprocessing approach is employed to refine the detection results. Experimental results on five datasets with different modal combinations demonstrate the effectiveness of the proposed method.

19.8CVDec 12, 2022Code
Diff-Font: Diffusion Model for Robust One-Shot Font Generation

Haibin He, Xinyuan Chen, Chaoyue Wang et al.

Font generation is a difficult and time-consuming task, especially in those languages using ideograms that have complicated structures with a large number of characters, such as Chinese. To solve this problem, few-shot font generation and even one-shot font generation have attracted a lot of attention. However, most existing font generation methods may still suffer from (i) large cross-font gap challenge; (ii) subtle cross-font variation problem; and (iii) incorrect generation of complicated characters. In this paper, we propose a novel one-shot font generation method based on a diffusion model, named Diff-Font, which can be stably trained on large datasets. The proposed model aims to generate the entire font library by giving only one sample as the reference. Specifically, a large stroke-wise dataset is constructed, and a stroke-wise diffusion model is proposed to preserve the structure and the completion of each generated character. To our best knowledge, the proposed Diff-Font is the first work that developed diffusion models to handle the font generation task. The well-trained Diff-Font is not only robust to font gap and font variation, but also achieved promising performance on difficult character generation. Compared to previous font generation methods, our model reaches state-of-the-art performance both qualitatively and quantitatively.

2.8CVApr 23, 2023Code
HKNAS: Classification of Hyperspectral Imagery Based on Hyper Kernel Neural Architecture Search

Di Wang, Bo Du, Liangpei Zhang et al.

Recent neural architecture search (NAS) based approaches have made great progress in hyperspectral image (HSI) classification tasks. However, the architectures are usually optimized independently of the network weights, increasing searching time and restricting model performances. To tackle these issues, in this paper, different from previous methods that extra define structural parameters, we propose to directly generate structural parameters by utilizing the specifically designed hyper kernels, ingeniously converting the original complex dual optimization problem into easily implemented one-tier optimizations, and greatly shrinking searching costs. Then, we develop a hierarchical multi-module search space whose candidate operations only contain convolutions, and these operations can be integrated into unified kernels. Using the above searching strategy and searching space, we obtain three kinds of networks to separately conduct pixel-level or image-level classifications with 1-D or 3-D convolutions. In addition, by combining the proposed hyper kernel searching scheme with the 3-D convolution decomposition mechanism, we obtain diverse architectures to simulate 3-D convolutions, greatly improving network flexibilities. A series of quantitative and qualitative experiments on six public datasets demonstrate that the proposed methods achieve state-of-the-art results compared with other advanced NAS-based HSI classification approaches.

8.9IVMar 7, 2023Code
SGDA: Towards 3D Universal Pulmonary Nodule Detection via Slice Grouped Domain Attention

Rui Xu, Zhi Liu, Yong Luo et al.

Lung cancer is the leading cause of cancer death worldwide. The best solution for lung cancer is to diagnose the pulmonary nodules in the early stage, which is usually accomplished with the aid of thoracic computed tomography (CT). As deep learning thrives, convolutional neural networks (CNNs) have been introduced into pulmonary nodule detection to help doctors in this labor-intensive task and demonstrated to be very effective. However, the current pulmonary nodule detection methods are usually domain-specific, and cannot satisfy the requirement of working in diverse real-world scenarios. To address this issue, we propose a slice grouped domain attention (SGDA) module to enhance the generalization capability of the pulmonary nodule detection networks. This attention module works in the axial, coronal, and sagittal directions. In each direction, we divide the input feature into groups, and for each group, we utilize a universal adapter bank to capture the feature subspaces of the domains spanned by all pulmonary nodule datasets. Then the bank outputs are combined from the perspective of domain to modulate the input group. Extensive experiments demonstrate that SGDA enables substantially better multi-domain pulmonary nodule detection performance compared with the state-of-the-art multi-domain learning methods.

7.6CVJun 8, 2023
An Efficient Transformer for Simultaneous Learning of BEV and Lane Representations in 3D Lane Detection

Ziye Chen, Kate Smith-Miles, Bo Du et al.

Accurately detecting lane lines in 3D space is crucial for autonomous driving. Existing methods usually first transform image-view features into bird-eye-view (BEV) by aid of inverse perspective mapping (IPM), and then detect lane lines based on the BEV features. However, IPM ignores the changes in road height, leading to inaccurate view transformations. Additionally, the two separate stages of the process can cause cumulative errors and increased complexity. To address these limitations, we propose an efficient transformer for 3D lane detection. Different from the vanilla transformer, our model contains a decomposed cross-attention mechanism to simultaneously learn lane and BEV representations. The mechanism decomposes the cross-attention between image-view and BEV features into the one between image-view and lane features, and the one between lane and BEV features, both of which are supervised with ground-truth lane lines. Our method obtains 2D and 3D lane predictions by applying the lane features to the image-view and BEV features, respectively. This allows for a more accurate view transformation than IPM-based methods, as the view transformation is learned from data with a supervised cross-attention. Additionally, the cross-attention between lane and BEV features enables them to adjust to each other, resulting in more accurate lane detection than the two separate stages. Finally, the decomposed cross-attention is more efficient than the original one. Experimental results on OpenLane and ONCE-3DLanes demonstrate the state-of-the-art performance of our method.

12.6CVFeb 21, 2023Code
HCGMNET: A Hierarchical Change Guiding Map Network For Change Detection

Chengxi Han, Chen Wu, Bo Du

Very-high-resolution (VHR) remote sensing (RS) image change detection (CD) has been a challenging task for its very rich spatial information and sample imbalance problem. In this paper, we have proposed a hierarchical change guiding map network (HCGMNet) for change detection. The model uses hierarchical convolution operations to extract multiscale features, continuously merges multi-scale features layer by layer to improve the expression of global and local information, and guides the model to gradually refine edge features and comprehensive performance by a change guide module (CGM), which is a self-attention with changing guide map. Extensive experiments on two CD datasets show that the proposed HCGMNet architecture achieves better CD performance than existing state-of-the-art (SOTA) CD methods.

1.5CVAug 11, 2023
Rethinking the Localization in Weakly Supervised Object Localization

Rui Xu, Yong Luo, Han Hu et al.

Weakly supervised object localization (WSOL) is one of the most popular and challenging tasks in computer vision. This task is to localize the objects in the images given only the image-level supervision. Recently, dividing WSOL into two parts (class-agnostic object localization and object classification) has become the state-of-the-art pipeline for this task. However, existing solutions under this pipeline usually suffer from the following drawbacks: 1) they are not flexible since they can only localize one object for each image due to the adopted single-class regression (SCR) for localization; 2) the generated pseudo bounding boxes may be noisy, but the negative impact of such noise is not well addressed. To remedy these drawbacks, we first propose to replace SCR with a binary-class detector (BCD) for localizing multiple objects, where the detector is trained by discriminating the foreground and background. Then we design a weighted entropy (WE) loss using the unlabeled data to reduce the negative impact of noisy bounding boxes. Extensive experiments on the popular CUB-200-2011 and ImageNet-1K datasets demonstrate the effectiveness of our method.

5.7CVJul 27, 2022Code
Leveraging GAN Priors for Few-Shot Part Segmentation

Mengya Han, Heliang Zheng, Chaoyue Wang et al.

Few-shot part segmentation aims to separate different parts of an object given only a few annotated samples. Due to the challenge of limited data, existing works mainly focus on learning classifiers over pre-trained features, failing to learn task-specific features for part segmentation. In this paper, we propose to learn task-specific features in a "pre-training"-"fine-tuning" paradigm. We conduct prompt designing to reduce the gap between the pre-train task (i.e., image generation) and the downstream task (i.e., part segmentation), so that the GAN priors for generation can be leveraged for segmentation. This is achieved by projecting part segmentation maps into the RGB space and conducting interpolation between RGB segmentation maps and original images. Specifically, we design a fine-tuning strategy to progressively tune an image generator into a segmentation generator, where the supervision of the generator varying from images to segmentation maps by interpolation. Moreover, we propose a two-stream architecture, i.e., a segmentation stream to generate task-specific features, and an image stream to provide spatial constraints. The image stream can be regarded as a self-supervised auto-encoder, and this enables our model to benefit from large-scale support images. Overall, this work is an attempt to explore the internal relevance between generation tasks and perception tasks by prompt designing. Extensive experiments show that our model can achieve state-of-the-art performance on several part segmentation datasets.

2.8CVMar 24, 2023
EMS-Net: Efficient Multi-Temporal Self-Attention For Hyperspectral Change Detection

Meiqi Hu, Chen Wu, Bo Du

Hyperspectral change detection plays an essential role of monitoring the dynamic urban development and detecting precise fine object evolution and alteration. In this paper, we have proposed an original Efficient Multi-temporal Self-attention Network (EMS-Net) for hyperspectral change detection. The designed EMS module cuts redundancy of those similar and containing-no-changes feature maps, computing efficient multi-temporal change information for precise binary change map. Besides, to explore the clustering characteristics of the change detection, a novel supervised contrastive loss is provided to enhance the compactness of the unchanged. Experiments implemented on two hyperspectral change detection datasets manifests the out-standing performance and validity of proposed method.

6.8CVFeb 16, 2023
Visible-Infrared Person Re-Identification via Patch-Mixed Cross-Modality Learning

Zhihao Qian, Yutian Lin, Bo Du

Visible-infrared person re-identification (VI-ReID) aims to retrieve images of the same pedestrian from different modalities, where the challenges lie in the significant modality discrepancy. To alleviate the modality gap, recent methods generate intermediate images by GANs, grayscaling, or mixup strategies. However, these methods could introduce extra data distribution, and the semantic correspondence between the two modalities is not well learned. In this paper, we propose a Patch-Mixed Cross-Modality framework (PMCM), where two images of the same person from two modalities are split into patches and stitched into a new one for model learning. A part-alignment loss is introduced to regularize representation learning, and a patch-mixed modality learning loss is proposed to align between the modalities. In this way, the model learns to recognize a person through patches of different styles, thereby the modality semantic correspondence can be inferred. In addition, with the flexible image generation strategy, the patch-mixed images freely adjust the ratio of different modality patches, which could further alleviate the modality imbalance problem. On two VI-ReID datasets, we report new state-of-the-art performance with the proposed method.

1.5CVApr 3, 2023Code
Unsupervised Cross-domain Pulmonary Nodule Detection without Source Data

Rui Xu, Yong Luo, Bo Du

Cross-domain pulmonary nodule detection suffers from performance degradation due to a large shift of data distributions between the source and target domain. Besides, considering the high cost of medical data annotation, it is often assumed that the target images are unlabeled. Existing approaches have made much progress for this unsupervised domain adaptation setting. However, this setting is still rarely plausible in medical applications since the source medical data are often not accessible due to privacy concerns. This motivates us to propose a Source-free Unsupervised cross-domain method for Pulmonary nodule detection (SUP), named Instance-level Contrastive Instruction fine-tuning framework (ICI). It first adapts the source model to the target domain by utilizing instance-level contrastive learning. Then the adapted model is trained in a teacher-student interaction manner, and a weighted entropy loss is incorporated to further improve the accuracy. We establish a benchmark by adapting a pre-trained source model to three popular datasets for pulmonary nodule detection. To the best of our knowledge, this represents the first exploration of source-free unsupervised domain adaptation in medical image object detection. Our extensive evaluations reveal that SUP-ICI substantially surpasses existing state-of-the-art approaches, achieving FROC score improvements ranging from 8.98% to 16.05%. This breakthrough not only sets a new precedent for domain adaptation techniques in medical imaging but also significantly advances the field toward overcoming challenges posed by data privacy and availability. Code: https://github.com/Ruixxxx/SFUDA.

7.3CVMar 3, 2022
Multi-Tailed Vision Transformer for Efficient Inference

Yunke Wang, Bo Du, Wenyuan Wang et al.

Recently, Vision Transformer (ViT) has achieved promising performance in image recognition and gradually serves as a powerful backbone in various vision tasks. To satisfy the sequential input of Transformer, the tail of ViT first splits each image into a sequence of visual tokens with a fixed length. Then the following self-attention layers constructs the global relationship between tokens to produce useful representation for the downstream tasks. Empirically, representing the image with more tokens leads to better performance, yet the quadratic computational complexity of self-attention layer to the number of tokens could seriously influence the efficiency of ViT's inference. For computational reduction, a few pruning methods progressively prune uninformative tokens in the Transformer encoder, while leaving the number of tokens before the Transformer untouched. In fact, fewer tokens as the input for the Transformer encoder can directly reduce the following computational cost. In this spirit, we propose a Multi-Tailed Vision Transformer (MT-ViT) in the paper. MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder. A tail predictor is introduced to decide which tail is the most efficient for the image to produce accurate prediction. Both modules are optimized in an end-to-end fashion, with the Gumbel-Softmax trick. Experiments on ImageNet-1K demonstrate that MT-ViT can achieve a significant reduction on FLOPs with no degradation of the accuracy and outperform other compared methods in both accuracy and FLOPs.

3.7CVMay 23, 2022
Multi-Temporal Spatial-Spectral Comparison Network for Hyperspectral Anomalous Change Detection

Meiqi Hu, Chen Wu, Bo Du

Hyperspectral anomalous change detection has been a challenging task for its emphasis on the dynamics of small and rare objects against the prevalent changes. In this paper, we have proposed a Multi-Temporal spatial-spectral Comparison Network for hyperspectral anomalous change detection (MTC-NET). The whole model is a deep siamese network, aiming at learning the prevalent spectral difference resulting from the complex imaging conditions from the hyperspectral images by contrastive learning. A three-dimensional spatial spectral attention module is designed to effectively extract the spatial semantic information and the key spectral differences. Then the gaps between the multi-temporal features are minimized, boosting the alignment of the semantic and spectral features and the suppression of the multi-temporal background spectral difference. The experiments on the "Viareggio 2013" datasets demonstrate the effectiveness of proposed MTC-NET.

20.9CVJan 31, 2024Code
Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

Maoyuan Ye, Jing Zhang, Juhua Liu et al.

The Segment Anything Model (SAM), a profound vision foundation model pretrained on a large-scale dataset, breaks the boundaries of general segmentation and sparks various downstream applications. This paper introduces Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation. Hi-SAM excels in segmentation across four hierarchies, including pixel-level text, word, text-line, and paragraph, while realizing layout analysis as well. Specifically, we first turn SAM into a high-quality pixel-level text segmentation (TS) model through a parameter-efficient fine-tuning approach. We use this TS model to iteratively generate the pixel-level text labels in a semi-automatical manner, unifying labels across the four text hierarchies in the HierText dataset. Subsequently, with these complete labels, we launch the end-to-end trainable Hi-SAM based on the TS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both automatic mask generation (AMG) mode and promptable segmentation (PS) mode. In the AMG mode, Hi-SAM segments pixel-level text foreground masks initially, then samples foreground points for hierarchical text mask generation and achieves layout analysis in passing. As for the PS mode, Hi-SAM provides word, text-line, and paragraph masks with a single point click. Experimental results show the state-of-the-art performance of our TS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg for pixel-level text segmentation. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 on the text-line level, 5.49% PQ and 7.39% F1 on the paragraph level layout analysis, requiring $20\times$ fewer training epochs. The code is available at https://github.com/ymy-k/Hi-SAM.

2.0CVAug 11, 2024Code
FADE: A Dataset for Detecting Falling Objects around Buildings in Video

Zhigang Tu, Zhengbo Zhang, Zitao Gao et al.

Objects falling from buildings, a frequently occurring event in daily life, can cause severe injuries to pedestrians due to the high impact force they exert. Surveillance cameras are often installed around buildings to detect falling objects, but such detection remains challenging due to the small size and fast motion of the objects. Moreover, the field of falling object detection around buildings (FODB) lacks a large-scale dataset for training learning-based detection methods and for standardized evaluation. To address these challenges, we propose a large and diverse video benchmark dataset named FADE. Specifically, FADE contains 2,611 videos from 25 scenes, featuring 8 falling object categories, 4 weather conditions, and 4 video resolutions. Additionally, we develop a novel detection method for FODB that effectively leverages motion information and generates small-sized yet high-quality detection proposals. The efficacy of our method is evaluated on the proposed FADE dataset by comparing it with state-of-the-art approaches in generic object detection, video object detection, and moving object detection. The dataset and code are publicly available at https://fadedataset.github.io/FADE.github.io/.

8.4CVDec 2, 2025Code
GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

Zixuan Song, Jing Zhang, Di Wang et al.

Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, and multi-view aligned dataset comprising over 50,000 pairs of drone, street-view panorama, and satellite images as well as their textual descriptions, collected from 36 countries, ensuring both geographic and semantic alignment. We performed broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The dataset, source code, and pretrained models were released at https://github.com/MiliLab/GeoBridge.

15.7LGNov 18, 2024Code
Aligning Few-Step Diffusion Models with Dense Reward Difference Learning

Ziyi Zhang, Li Shen, Sen Zhang et al.

Aligning diffusion models with downstream objectives is essential for their practical applications. However, standard alignment methods often struggle with step generalization when directly applied to few-step diffusion models, leading to inconsistent performance across different denoising step scenarios. To address this, we introduce Stepwise Diffusion Policy Optimization (SDPO), a novel alignment method tailored for few-step diffusion models. Unlike prior approaches that rely on a single sparse reward from only the final step of each denoising trajectory for trajectory-level optimization, SDPO incorporates dense reward feedback at every intermediate step. By learning the differences in dense rewards between paired samples, SDPO facilitates stepwise optimization of few-step diffusion models, ensuring consistent alignment across all denoising steps. To promote stable and efficient training, SDPO introduces an online reinforcement learning framework featuring several novel strategies designed to effectively exploit the stepwise granularity of dense rewards. Experimental results demonstrate that SDPO consistently outperforms prior methods in reward-based alignment across diverse step configurations, underscoring its robust step generalization capabilities. Code is avaliable at https://github.com/ZiyiZhang27/sdpo.

13.4LGOct 24, 2024Code
What If the Input is Expanded in OOD Detection?

Boxuan Zhang, Jianing Zhu, Zengmao Wang et al.

Out-of-distribution (OOD) detection aims to identify OOD inputs from unknown classes, which is important for the reliable deployment of machine learning models in the open world. Various scoring functions are proposed to distinguish it from in-distribution (ID) data. However, existing methods generally focus on excavating the discriminative information from a single input, which implicitly limits its representation dimension. In this work, we introduce a novel perspective, i.e., employing different common corruptions on the input space, to expand that. We reveal an interesting phenomenon termed confidence mutation, where the confidence of OOD data can decrease significantly under the corruptions, while the ID data shows a higher confidence expectation considering the resistance of semantic features. Based on that, we formalize a new scoring method, namely, Confidence aVerage (CoVer), which can capture the dynamic differences by simply averaging the scores obtained from different corrupted inputs and the original ones, making the OOD and ID distributions more separable in detection tasks. Extensive experiments and analyses have been conducted to understand and verify the effectiveness of CoVer. The code is publicly available at: https://github.com/tmlr-group/CoVer.

4.3NCDec 22, 2023Code
Joint Learning Neuronal Skeleton and Brain Circuit Topology with Permutation Invariant Encoders for Neuron Classification

Minghui Liao, Guojia Wan, Bo Du

Determining the types of neurons within a nervous system plays a significant role in the analysis of brain connectomics and the investigation of neurological diseases. However, the efficiency of utilizing anatomical, physiological, or molecular characteristics of neurons is relatively low and costly. With the advancements in electron microscopy imaging and analysis techniques for brain tissue, we are able to obtain whole-brain connectome consisting neuronal high-resolution morphology and connectivity information. However, few models are built based on such data for automated neuron classification. In this paper, we propose NeuNet, a framework that combines morphological information of neurons obtained from skeleton and topological information between neurons obtained from neural circuit. Specifically, NeuNet consists of three components, namely Skeleton Encoder, Connectome Encoder, and Readout Layer. Skeleton Encoder integrates the local information of neurons in a bottom-up manner, with a one-dimensional convolution in neural skeleton's point data; Connectome Encoder uses a graph neural network to capture the topological information of neural circuit; finally, Readout Layer fuses the above two information and outputs classification results. We reprocess and release two new datasets for neuron classification task from volume electron microscopy(VEM) images of human brain cortex and Drosophila brain. Experiments on these two datasets demonstrated the effectiveness of our model with accuracy of 0.9169 and 0.9363, respectively. Code and data are available at: https://github.com/WHUminghui/NeuNet.

11.8CVMay 19, 2025Code
Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues?

Haibin He, Maoyuan Ye, Jing Zhang et al.

Large Multimodal Models (LMMs) have become increasingly versatile, accompanied by impressive Optical Character Recognition (OCR) related capabilities. Existing OCR-related benchmarks emphasize evaluating LMMs' abilities of relatively simple visual question answering, visual-text parsing, etc. However, the extent to which LMMs can deal with complex logical reasoning problems based on OCR cues is relatively unexplored. To this end, we introduce the Reasoning-OCR benchmark, which challenges LMMs to solve complex reasoning problems based on the cues that can be extracted from rich visual-text. Reasoning-OCR covers six visual scenarios and encompasses 150 meticulously designed questions categorized into six reasoning challenges. Additionally, Reasoning-OCR minimizes the impact of field-specialized knowledge. Our evaluation offers some insights for proprietary and open-source LMMs in different reasoning challenges, underscoring the urgent to improve the reasoning performance. We hope Reasoning-OCR can inspire and facilitate future research on enhancing complex reasoning ability based on OCR cues. Reasoning-OCR is publicly available at https://github.com/Hxyz-123/ReasoningOCR.

3.6CVJan 23, 2025Code
Leveraging Textual Anatomical Knowledge for Class-Imbalanced Semi-Supervised Multi-Organ Segmentation

Yuliang Gu, Weilun Tsao, Bo Du et al.

Annotating 3D medical images demands substantial time and expertise, driving the adoption of semi-supervised learning (SSL) for segmentation tasks. However, the complex anatomical structures of organs often lead to significant class imbalances, posing major challenges for deploying SSL in real-world scenarios. Despite the availability of valuable prior information, such as inter-organ relative positions and organ shape priors, existing SSL methods have yet to fully leverage these insights. To address this gap, we propose a novel approach that integrates textual anatomical knowledge (TAK) into the segmentation model. Specifically, we use GPT-4o to generate textual descriptions of anatomical priors, which are then encoded using a CLIP-based model. These encoded priors are injected into the segmentation model as parameters of the segmentation head. Additionally, contrastive learning is employed to enhance the alignment between textual priors and visual features. Extensive experiments demonstrate the superior performance of our method, significantly surpassing state-of-the-art approaches. The source code will be available at: https://github.com/Lunn88/TAK-Semi.

8.3CLAug 11, 2025Code
REX-RAG: Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation

Wentao Jiang, Xiang Feng, Zengmao Wang et al.

Reinforcement learning (RL) is emerging as a powerful paradigm for enabling large language models (LLMs) to perform complex reasoning tasks. Recent advances indicate that integrating RL with retrieval-augmented generation (RAG) allows LLMs to dynamically incorporate external knowledge, leading to more informed and robust decision making. However, we identify a critical challenge during policy-driven trajectory sampling: LLMs are frequently trapped in unproductive reasoning paths, which we refer to as "dead ends", committing to overconfident yet incorrect conclusions. This severely hampers exploration and undermines effective policy optimization. To address this challenge, we propose REX-RAG (Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation), a novel framework that explores alternative reasoning paths while maintaining rigorous policy learning through principled distributional corrections. Our approach introduces two key innovations: (1) Mixed Sampling Strategy, which combines a novel probe sampling method with exploratory prompts to escape dead ends; and (2) Policy Correction Mechanism, which employs importance sampling to correct distribution shifts induced by mixed sampling, thereby mitigating gradient estimation bias. We evaluate it on seven question-answering benchmarks, and the experimental results show that REX-RAG achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B over strong baselines, demonstrating competitive results across multiple datasets. The code is publicly available at https://github.com/MiliLab/REX-RAG.

8.4CVMay 28, 2025Code
GoMatching++: Parameter- and Data-Efficient Arbitrary-Shaped Video Text Spotting and Benchmarking

Haibin He, Jing Zhang, Maoyuan Ye et al.

Video text spotting (VTS) extends image text spotting (ITS) by adding text tracking, significantly increasing task complexity. Despite progress in VTS, existing methods still fall short of the performance seen in ITS. This paper identifies a key limitation in current video text spotters: limited recognition capability, even after extensive end-to-end training. To address this, we propose GoMatching++, a parameter- and data-efficient method that transforms an off-the-shelf image text spotter into a video specialist. The core idea lies in freezing the image text spotter and introducing a lightweight, trainable tracker, which can be optimized efficiently with minimal training data. Our approach includes two key components: (1) a rescoring mechanism to bridge the domain gap between image and video data, and (2) the LST-Matcher, which enhances the frozen image text spotter's ability to handle video text. We explore various architectures for LST-Matcher to ensure efficiency in both parameters and training data. As a result, GoMatching++ sets new performance records on challenging benchmarks such as ICDAR15-video, DSText, and BOVText, while significantly reducing training costs. To address the lack of curved text datasets in VTS, we introduce ArTVideo, a new benchmark featuring over 30% curved text with detailed annotations. We also provide a comprehensive statistical analysis and experimental results for ArTVideo. We believe that GoMatching++ and the ArTVideo benchmark will drive future advancements in video text spotting. The source code, models and dataset are publicly available at https://github.com/Hxyz-123/GoMatching.

3.7CVMay 9, 2024Code
Exploring Text-Guided Single Image Editing for Remote Sensing Images

Fangzhou Han, Lingyu Si, Zhizhuo Jiang et al.

Artificial intelligence generative content (AIGC) has significantly impacted image generation in the field of remote sensing. However, the equally important area of remote sensing image (RSI) editing has not received sufficient attention. Deep learning based editing methods generally involve two sequential stages: generation and editing. For natural images, these stages primarily rely on generative backbones pre-trained on large-scale benchmark datasets and text guidance facilitated by vision-language models (VLMs). However, it become less viable for RSIs: First, existing generative RSI benchmark datasets do not fully capture the diversity of RSIs, and is often inadequate for universal editing tasks. Second, the single text semantic corresponds to multiple image semantics, leading to the introduction of incorrect semantics. To solve above problems, this paper proposes a text-guided RSI editing method and can be trained using only a single image. A multi-scale training approach is adopted to preserve consistency without the need for training on extensive benchmarks, while leveraging RSI pre-trained VLMs and prompt ensembling (PE) to ensure accuracy and controllability. Experimental results on multiple RSI editing tasks show that the proposed method offers significant advantages in both CLIP scores and subjective evaluations compared to existing methods. Additionally, we explore the ability of the edited RSIs to support disaster assessment tasks in order to validate their practicality. Codes will be released at https://github.com/HIT-PhilipHan/remote_sensing_image_editing.

6.8CVMay 31, 2023Code
DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting

Maoyuan Ye, Jing Zhang, Shanshan Zhao et al.

End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Dealing with the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although Transformer-based methods eliminate the heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and low training efficiency. Besides, they overlook the exploring on multilingual text spotting which requires an extra script identification task. In this paper, we present DeepSolo++, a simple DETR-like baseline that lets a single decoder with explicit points solo for text detection, recognition, and script identification simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing a single decoder, the point queries have encoded requisite text semantics and locations, thus can be further decoded to the center line, boundary, script, and confidence of text via very simple prediction heads in parallel. Furthermore, we show the surprisingly good extensibility of our method, in terms of character class, language type, and task. On the one hand, our method not only performs well in English scenes but also masters the transcription with complex font structure and a thousand-level character classes, such as Chinese. On the other hand, our DeepSolo++ achieves better performance on the additionally introduced script identification task with a simpler training pipeline compared with previous methods. In addition, our models are also compatible with line annotations, which require much less annotation cost than polygons. The code is available at \url{https://github.com/ViTAE-Transformer/DeepSolo}.

29.2CVMay 3, 2023Code
SAMRS: Scaling-up Remote Sensing Segmentation Dataset with Segment Anything Model

Di Wang, Jing Zhang, Bo Du et al.

The success of the Segment Anything Model (SAM) demonstrates the significance of data-centric machine learning. However, due to the difficulties and high costs associated with annotating Remote Sensing (RS) images, a large amount of valuable RS data remains unlabeled, particularly at the pixel level. In this study, we leverage SAM and existing RS object detection datasets to develop an efficient pipeline for generating a large-scale RS segmentation dataset, dubbed SAMRS. SAMRS totally possesses 105,090 images and 1,668,241 instances, surpassing existing high-resolution RS segmentation datasets in size by several orders of magnitude. It provides object category, location, and instance information that can be used for semantic segmentation, instance segmentation, and object detection, either individually or in combination. We also provide a comprehensive analysis of SAMRS from various aspects. Moreover, preliminary experiments highlight the importance of conducting segmentation pre-training with SAMRS to address task discrepancies and alleviate the limitations posed by limited training data during fine-tuning. The code and dataset will be available at https://github.com/ViTAE-Transformer/SAMRS.

3.9AIMay 3, 2023Code
Empowering Agrifood System with Artificial Intelligence: A Survey of the Progress, Challenges and Opportunities

Tao Chen, Liang Lv, Di Wang et al.

With the world population rapidly increasing, transforming our agrifood systems to be more productive, efficient, safe, and sustainable is crucial to mitigate potential food shortages. Recently, artificial intelligence (AI) techniques such as deep learning (DL) have demonstrated their strong abilities in various areas, including language, vision, remote sensing (RS), and agrifood systems applications. However, the overall impact of AI on agrifood systems remains unclear. In this paper, we thoroughly review how AI techniques can transform agrifood systems and contribute to the modern agrifood industry. Firstly, we summarize the data acquisition methods in agrifood systems, including acquisition, storage, and processing techniques. Secondly, we present a progress review of AI methods in agrifood systems, specifically in agriculture, animal husbandry, and fishery, covering topics such as agrifood classification, growth monitoring, yield prediction, and quality assessment. Furthermore, we highlight potential challenges and promising research opportunities for transforming modern agrifood systems with AI. We hope this survey could offer an overall picture to newcomers in the field and serve as a starting point for their further research. The project website is https://github.com/Frenkie14/Agrifood-Survey.

11.6CVMay 2, 2023Code
Scalable Mask Annotation for Video Text Spotting

Haibin He, Jing Zhang, Mengyang Xu et al.

Video text spotting refers to localizing, recognizing, and tracking textual elements such as captions, logos, license plates, signs, and other forms of text within consecutive video frames. However, current datasets available for this task rely on quadrilateral ground truth annotations, which may result in including excessive background content and inaccurate text boundaries. Furthermore, methods trained on these datasets often produce prediction results in the form of quadrilateral boxes, which limits their ability to handle complex scenarios such as dense or curved text. To address these issues, we propose a scalable mask annotation pipeline called SAMText for video text spotting. SAMText leverages the SAM model to generate mask annotations for scene text images or video frames at scale. Using SAMText, we have created a large-scale dataset, SAMText-9M, that contains over 2,400 video clips sourced from existing datasets and over 9 million mask annotations. We have also conducted a thorough statistical analysis of the generated masks and their quality, identifying several research topics that could be further explored based on this dataset. The code and dataset will be released at \url{https://github.com/ViTAE-Transformer/SAMText}.

14.0CVDec 24, 2021Code
Visual Semantics Allow for Textual Reasoning Better in Scene Text Recognition

Yue He, Chen Chen, Jing Zhang et al.

Existing Scene Text Recognition (STR) methods typically use a language model to optimize the joint probability of the 1D character sequence predicted by a visual recognition (VR) model, which ignore the 2D spatial context of visual semantics within and between character instances, making them not generalize well to arbitrary shape scene text. To address this issue, we make the first attempt to perform textual reasoning based on visual semantics in this paper. Technically, given the character segmentation maps predicted by a VR model, we construct a subgraph for each instance, where nodes represent the pixels in it and edges are added between nodes based on their spatial similarity. Then, these subgraphs are sequentially connected by their root nodes and merged into a complete graph. Based on this graph, we devise a graph convolutional network for textual reasoning (GTR) by supervising it with a cross-entropy loss. GTR can be easily plugged in representative STR models to improve their performance owing to better textual reasoning. Specifically, we construct our model, namely S-GTR, by paralleling GTR to the language model in a segmentation-based STR baseline, which can effectively exploit the visual-linguistic complementarity via mutual learning. S-GTR sets new state-of-the-art on six challenging STR benchmarks and generalizes well to multi-linguistic datasets. Code is available at https://github.com/adeline-cs/GTR.

8.0CVAug 3, 2021Code
I3CL:Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection

Bo Du, Jian Ye, Jing Zhang et al.

Existing methods for arbitrary-shaped text detection in natural scenes face two critical issues, i.e., 1) fracture detections at the gaps in a text instance; and 2) inaccurate detections of arbitrary-shaped text instances with diverse background context. To address these issues, we propose a novel method named Intra- and Inter-Instance Collaborative Learning (I3CL). Specifically, to address the first issue, we design an effective convolutional module with multiple receptive fields, which is able to collaboratively learn better character and gap feature representations at local and long ranges inside a text instance. To address the second issue, we devise an instance-based transformer module to exploit the dependencies between different text instances and a global context module to exploit the semantic context from the shared background, which are able to collaboratively learn more discriminative text feature representation. In this way, I3CL can effectively exploit the intra- and inter-instance dependencies together in a unified end-to-end trainable framework. Besides, to make full use of the unlabeled data, we design an effective semi-supervised learning method to leverage the pseudo labels via an ensemble strategy. Without bells and whistles, experimental results show that the proposed I3CL sets new state-of-the-art results on three challenging public benchmarks, i.e., an F-measure of 77.5% on ICDAR2019-ArT, 86.9% on Total-Text, and 86.4% on CTW-1500. Notably, our I3CL with the ResNeSt-101 backbone ranked 1st place on the ICDAR2019-ArT leaderboard. The source code will be available at https://github.com/ViTAE-Transformer/ViTAE-Transformer-Scene-Text-Detection.

5.6CVApr 8, 2021Code
Robust Self-Ensembling Network for Hyperspectral Image Classification

Yonghao Xu, Bo Du, Liangpei Zhang

Recent research has shown the great potential of deep learning algorithms in the hyperspectral image (HSI) classification task. Nevertheless, training these models usually requires a large amount of labeled data. Since the collection of pixel-level annotations for HSI is laborious and time-consuming, developing algorithms that can yield good performance in the small sample size situation is of great significance. In this study, we propose a robust self-ensembling network (RSEN) to address this problem. The proposed RSEN consists of two subnetworks including a base network and an ensemble network. With the constraint of both the supervised loss from the labeled data and the unsupervised loss from the unlabeled data, the base network and the ensemble network can learn from each other, achieving the self-ensembling mechanism. To the best of our knowledge, the proposed method is the first attempt to introduce the self-ensembling technique into the HSI classification task, which provides a different view on how to utilize the unlabeled data in HSI to assist the network training. We further propose a novel consistency filter to increase the robustness of self-ensembling learning. Extensive experiments on three benchmark HSI datasets demonstrate that the proposed algorithm can yield competitive performance compared with the state-of-the-art methods. Code is available online (\url{https://github.com/YonghaoXu/RSEN}).

21.6LGOct 29, 2024
Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging

Li Shen, Anke Tang, Enneng Yang et al.

Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer. Recent research on task arithmetic-based MTL demonstrates that merging the parameters of independently fine-tuned models can effectively achieve MTL. However, existing merging methods primarily seek a static optimal solution within the original model parameter space, which often results in performance degradation due to the inherent diversity among tasks and potential interferences. To address this challenge, in this paper, we propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging. Specifically, we first identify critical (or sensitive) modules by analyzing parameter variations in core modules of Transformer-based models before and after finetuning. Then, our WEMoE statically merges non-critical modules while transforming critical modules into a mixture-of-experts (MoE) structure. During inference, expert modules in the MoE are dynamically merged based on input samples, enabling a more flexible and adaptive merging approach. Building on WEMoE, we further introduce an efficient-and-effective WEMoE (E-WEMoE) method, whose core mechanism involves eliminating non-essential elements in the critical modules of WEMoE and implementing shared routing across multiple MoE modules, thereby significantly reducing both the trainable parameters, the overall parameter count, and computational overhead of the merged model by WEMoE. Experimental results across various architectures and tasks demonstrate that both WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.

21.8CVDec 11, 2024
CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs

Yihan Cao, Jiazhao Zhang, Zhinan Yu et al.

Object goal navigation (ObjectNav) is a fundamental task in embodied AI, requiring an agent to locate a target object in previously unseen environments. This task is particularly challenging because it requires both perceptual and cognitive processes, including object recognition and decision-making. While substantial advancements in perception have been driven by the rapid development of visual foundation models, progress on the cognitive aspect remains constrained, primarily limited to either implicit learning through simulator rollouts or explicit reliance on predefined heuristic rules. Inspired by neuroscientific findings demonstrating that humans maintain and dynamically update fine-grained cognitive states during object search tasks in novel environments, we propose CogNav, a framework designed to mimic this cognitive process using large language models. Specifically, we model the cognitive process using a finite state machine comprising fine-grained cognitive states, ranging from exploration to identification. Transitions between states are determined by a large language model based on a dynamically constructed heterogeneous cognitive map, which contains spatial and semantic information about the scene being explored. Extensive evaluations on the HM3D, MP3D, and RoboTHOR benchmarks demonstrate that our cognitive process modeling significantly improves the success rate of ObjectNav at least by relative 14% over the state-of-the-arts.

22.3CVJan 22, 2025
UniUIR: Considering Underwater Image Restoration as An All-in-One Learner

Xu Zhang, Huan Zhang, Guoli Wang et al.

Existing underwater image restoration (UIR) methods generally only handle color distortion or jointly address color and haze issues, but they often overlook the more complex degradations that can occur in underwater scenes. To address this limitation, we propose a Universal Underwater Image Restoration method, termed as UniUIR, considering the complex scenario of real-world underwater mixed distortions as an all-in-one manner. To decouple degradation-specific issues and explore the inter-correlations among various degradations in UIR task, we designed the Mamba Mixture-of-Experts module. This module enables each expert to identify distinct types of degradation and collaboratively extract task-specific priors while maintaining global feature representation based on linear complexity. Building upon this foundation, to enhance degradation representation and address the task conflicts that arise when handling multiple types of degradation, we introduce the spatial-frequency prior generator. This module extracts degradation prior information in both spatial and frequency domains, and adaptively selects the most appropriate task-specific prompts based on image content, thereby improving the accuracy of image restoration. Finally, to more effectively address complex, region-dependent distortions in UIR task, we incorporate depth information derived from a large-scale pre-trained depth prediction model, thereby enabling the network to perceive and leverage depth variations across different image regions to handle localized degradation. Extensive experiments demonstrate that UniUIR can produce more attractive results across qualitative and quantitative comparisons, and shows strong generalization than state-of-the-art methods.

13.1CVMay 5, 2025
Advances in Automated Fetal Brain MRI Segmentation and Biometry: Insights from the FeTA 2024 Challenge

Vladyslav Zalevskyi, Thomas Sanchez, Misha Kaandorp et al.

Accurate fetal brain tissue segmentation and biometric analysis are essential for studying brain development in utero. The FeTA Challenge 2024 advanced automated fetal brain MRI analysis by introducing biometry prediction as a new task alongside tissue segmentation. For the first time, our diverse multi-centric test set included data from a new low-field (0.55T) MRI dataset. Evaluation metrics were also expanded to include the topology-specific Euler characteristic difference (ED). Sixteen teams submitted segmentation methods, most of which performed consistently across both high- and low-field scans. However, longitudinal trends indicate that segmentation accuracy may be reaching a plateau, with results now approaching inter-rater variability. The ED metric uncovered topological differences that were missed by conventional metrics, while the low-field dataset achieved the highest segmentation scores, highlighting the potential of affordable imaging systems when paired with high-quality reconstruction. Seven teams participated in the biometry task, but most methods failed to outperform a simple baseline that predicted measurements based solely on gestational age, underscoring the challenge of extracting reliable biometric estimates from image data alone. Domain shift analysis identified image quality as the most significant factor affecting model generalization, with super-resolution pipelines also playing a substantial role. Other factors, such as gestational age, pathology, and acquisition site, had smaller, though still measurable, effects. Overall, FeTA 2024 offers a comprehensive benchmark for multi-class segmentation and biometry estimation in fetal brain MRI, underscoring the need for data-centric approaches, improved topological evaluation, and greater dataset diversity to enable clinically robust and generalizable AI tools.

28.2CVMay 27, 2025Code
GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution

Fengxiang Wang, Mingshuo Chen, Yueying Li et al.

Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but pose challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA (avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies: Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key semantics.Integrating these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench.

3.9CVDec 12, 2023
Shifted Autoencoders for Point Annotation Restoration in Object Counting

Yuda Zou, Xin Xiao, Peilin Zhou et al.

Object counting typically uses 2D point annotations. The complexity of object shapes and the subjectivity of annotators may lead to annotation inconsistency, potentially confusing counting model training. Some sophisticated noise-resistance counting methods have been proposed to alleviate this issue. Differently, we aim to directly refine the initial point annotations before training counting models. For that, we propose the Shifted Autoencoders (SAE), which enhances annotation consistency. Specifically, SAE applies random shifts to initial point annotations and employs a UNet to restore them to their original positions. Similar to MAE reconstruction, the trained SAE captures general position knowledge and ignores specific manual offset noise. This allows to restore the initial point annotations to more general and thus consistent positions. Extensive experiments show that using such refined consistent annotations to train some advanced (including noise-resistance) object counting models steadily/significantly boosts their performances. Remarkably, the proposed SAE helps to set new records on nine datasets. We will make codes and refined point annotations available.

10.9CLMay 6, 2025
Faster MoE LLM Inference for Extremely Large Models

Haoqi Yang, Luohe Shi, Qiwei Li et al.

Sparse Mixture of Experts (MoE) large language models (LLMs) are gradually becoming the mainstream approach for ultra-large-scale models. Existing optimization efforts for MoE models have focused primarily on coarse-grained MoE architectures. With the emergence of DeepSeek Models, fine-grained MoE models are gaining popularity, yet research on them remains limited. Therefore, we want to discuss the efficiency dynamic under different service loads. Additionally, fine-grained models allow deployers to reduce the number of routed experts, both activated counts and total counts, raising the question of how this reduction affects the trade-off between MoE efficiency and performance. Our findings indicate that while deploying MoE models presents greater challenges, it also offers significant optimization opportunities. Reducing the number of activated experts can lead to substantial efficiency improvements in certain scenarios, with only minor performance degradation. Reducing the total number of experts provides limited efficiency gains but results in severe performance degradation. Our method can increase throughput by at least 10\% without any performance degradation. Overall, we conclude that MoE inference optimization remains an area with substantial potential for exploration and improvement.

17.4AISep 23, 2025
MAPO: Mixed Advantage Policy Optimization

Wenke Huang, Quan Zhang, Yiyang Fang et al.

Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking the trajectory importance. However, existing explorations encounter both advantage reversion and advantage mirror problems, which hinder the reasonable advantage allocation across different query samples. In this work, we propose an easy but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that the trajectory appears with different certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparison with related state-of-the-art methods, along with ablation studies on different advantage variants, validates the effectiveness of our approach.

4.9CLMay 31, 2025
How Significant Are the Real Performance Gains? An Unbiased Evaluation Framework for GraphRAG

Qiming Zeng, Xiao Yan, Hao Luo et al.

By retrieving contexts from knowledge graphs, graph-based retrieval-augmented generation (GraphRAG) enhances large language models (LLMs) to generate quality answers for user questions. Many GraphRAG methods have been proposed and reported inspiring performance in answer quality. However, we observe that the current answer evaluation framework for GraphRAG has two critical flaws, i.e., unrelated questions and evaluation biases, which may lead to biased or even wrong conclusions on performance. To tackle the two flaws, we propose an unbiased evaluation framework that uses graph-text-grounded question generation to produce questions that are more related to the underlying dataset and an unbiased evaluation procedure to eliminate the biases in LLM-based answer assessment. We apply our unbiased framework to evaluate 3 representative GraphRAG methods and find that their performance gains are much more moderate than reported previously. Although our evaluation framework may still have flaws, it calls for scientific evaluations to lay solid foundations for GraphRAG research.