RemoteCLIP: A Vision Language Foundation Model for Remote SensingFan Liu, Delong Chen, Zhangqingyun Guan et al.
General-purpose foundation models have led to recent breakthroughs in artificial intelligence. In remote sensing, self-supervised learning (SSL) and Masked Image Modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they are inapplicable for retrieval and zero-shot applications due to the lack of language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing that aims to learn robust visual features with rich semantics and aligned text embeddings for seamless downstream application. To address the scarcity of pre-training data, we leverage data scaling which converts heterogeneous annotations into a unified image-caption data format based on Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion. By further incorporating UAV imagery, we produce a 12 $\times$ larger pretraining dataset than the combination of all available datasets. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, $\textit{k}$-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images. Evaluation on 16 datasets, including a newly introduced RemoteCount benchmark to test the object counting ability, shows that RemoteCLIP consistently outperforms baseline foundation models across different model scales. Impressively, RemoteCLIP beats the state-of-the-art method by 9.14% mean recall on the RSITMD dataset and 8.92% on the RSICD dataset. For zero-shot classification, our RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets. Project website: https://github.com/ChenDelong1999/RemoteCLIP
Bkd-FedGNN: A Benchmark for Classification Backdoor Attacks on Federated Graph Neural NetworkFan Liu, Siqi Lai, Yansong Ning et al. · gatech
Federated Graph Neural Network (FedGNN) has recently emerged as a rapidly growing research topic, as it integrates the strengths of graph neural networks and federated learning to enable advanced machine learning applications without direct access to sensitive data. Despite its advantages, the distributed nature of FedGNN introduces additional vulnerabilities, particularly backdoor attacks stemming from malicious participants. Although graph backdoor attacks have been explored, the compounded complexity introduced by the combination of GNNs and federated learning has hindered a comprehensive understanding of these attacks, as existing research lacks extensive benchmark coverage and in-depth analysis of critical factors. To address these limitations, we propose Bkd-FedGNN, a benchmark for backdoor attacks on FedGNN. Specifically, Bkd-FedGNN decomposes the graph backdoor attack into trigger generation and injection steps, and extending the attack to the node-level federated setting, resulting in a unified framework that covers both node-level and graph-level classification tasks. Moreover, we thoroughly investigate the impact of multiple critical factors in backdoor attacks on FedGNN. These factors are categorized into global-level and local-level factors, including data distribution, the number of malicious attackers, attack time, overlapping rate, trigger size, trigger type, trigger position, and poisoning rate. Finally, we conduct comprehensive evaluations on 13 benchmark datasets and 13 critical factors, comprising 1,725 experimental configurations for node-level and graph-level tasks from six domains. These experiments encompass over 8,000 individual tests, allowing us to provide a thorough evaluation and insightful observations that advance our understanding of backdoor attacks on FedGNN.The Bkd-FedGNN benchmark is publicly available at https://github.com/usail-hkust/BkdFedGCN.
ProtoCLIP: Prototypical Contrastive Language Image PretrainingDelong Chen, Zhao Wu, Fan Liu et al.
Contrastive Language Image Pretraining (CLIP) has received widespread attention, since its learned representations can be transferred well to various downstream tasks. During the training process of the CLIP model, the InfoNCE objective aligns positive image-text pairs and separates negative ones. We show an underlying representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerged within-modal anchors. Based on this understanding, in this paper, Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. Further, Prototypical Back Translation (PBT) is proposed to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under large modality gap. The PBT also enables us to introduce additional external teachers with richer prior language knowledge. ProtoCLIP is trained with an online episodic training strategy, which makes it can be scaled up to unlimited amounts of data. We train our ProtoCLIP on Conceptual Captions and achieved an +5.81% ImageNet linear probing improvement and an +2.01% ImageNet zero-shot classification improvement. On the larger YFCC-15M dataset, ProtoCLIP matches the performance of CLIP with 33% of training time. Codes are available at https://github.com/megvii-research/protoclip.
5.9CVFeb 4, 2023
Learning to Agree on Vision Attention for Visual Commonsense ReasoningZhenyang Li, Yangyang Guo, Kejie Wang et al.
Visual Commonsense Reasoning (VCR) remains a significant yet challenging research problem in the realm of visual reasoning. A VCR model generally aims at answering a textual question regarding an image, followed by the rationale prediction for the preceding answering process. Though these two processes are sequential and intertwined, existing methods always consider them as two independent matching-based instances. They, therefore, ignore the pivotal relationship between the two processes, leading to sub-optimal model performance. This paper presents a novel visual attention alignment method to efficaciously handle these two processes in a unified framework. To achieve this, we first design a re-attention module for aggregating the vision attention map produced in each process. Thereafter, the resultant two sets of attention maps are carefully aligned to guide the two processes to make decisions based on the same image regions. We apply this method to both conventional attention and the recent Transformer models and carry out extensive experiments on the VCR benchmark dataset. The results demonstrate that with the attention alignment module, our method achieves a considerable improvement over the baseline methods, evidently revealing the feasibility of the coupling of the two processes as well as the effectiveness of the proposed method.
5.0CVApr 30, 2023
Few-shot Classification via Ensemble Learning with Multi-Order StatisticsSai Yang, Fan Liu, Delong Chen et al.
Transfer learning has been widely adopted for few-shot classification. Recent studies reveal that obtaining good generalization representation of images on novel classes is the key to improving the few-shot classification accuracy. To address this need, we prove theoretically that leveraging ensemble learning on the base classes can correspondingly reduce the true error in the novel classes. Following this principle, a novel method named Ensemble Learning with Multi-Order Statistics (ELMOS) is proposed in this paper. In this method, after the backbone network, we use multiple branches to create the individual learners in the ensemble learning, with the goal to reduce the storage cost. We then introduce different order statistics pooling in each branch to increase the diversity of the individual learners. The learners are optimized with supervised losses during the pre-training phase. After pre-training, features from different branches are concatenated for classifier evaluation. Extensive experiments demonstrate that each branch can complement the others and our method can produce a state-of-the-art performance on multiple few-shot classification benchmark datasets.
8.7CVAug 21, 2024
Making Large Vision Language Models to be Good Few-shot LearnersFan Liu, Wenwen Cai, Jian Huo et al.
Few-shot classification (FSC) is a fundamental yet challenging task in computer vision that involves recognizing novel classes from limited data. While previous methods have focused on enhancing visual features or incorporating additional modalities, Large Vision Language Models (LVLMs) offer a promising alternative due to their rich knowledge and strong visual perception. However, LVLMs risk learning specific response formats rather than effectively extracting useful information from support data in FSC tasks. In this paper, we investigate LVLMs' performance in FSC and identify key issues such as insufficient learning and the presence of severe positional biases. To tackle the above challenges, we adopt the meta-learning strategy to teach models "learn to learn". By constructing a rich set of meta-tasks for instruction fine-tuning, LVLMs enhance the ability to extract information from few-shot support data for classification. Additionally, we further boost LVLM's few-shot learning capabilities through label augmentation and candidate selection in the fine-tuning and inference stage, respectively. Label augmentation is implemented via a character perturbation strategy to ensure the model focuses on support information. Candidate selection leverages attribute descriptions to filter out unreliable candidates and simplify the task. Extensive experiments demonstrate that our approach achieves superior performance on both general and fine-grained datasets. Furthermore, our candidate selection strategy has been proven beneficial for training-free LVLMs.
11.3CVAug 21, 2024
Domain-invariant Progressive Knowledge Distillation for UAV-based Object DetectionLiang Yao, Fan Liu, Chuanyi Zhang et al.
Knowledge distillation (KD) is an effective method for compressing models in object detection tasks. Due to limited computational capability, UAV-based object detection (UAV-OD) widely adopt the KD technique to obtain lightweight detectors. Existing methods often overlook the significant differences in feature space caused by the large gap in scale between the teacher and student models. This limitation hampers the efficiency of knowledge transfer during the distillation process. Furthermore, the complex backgrounds in UAV images make it challenging for the student model to efficiently learn the object features. In this paper, we propose a novel knowledge distillation framework for UAV-OD. Specifically, a progressive distillation approach is designed to alleviate the feature gap between teacher and student models. Then a new feature alignment method is provided to extract object-related features for enhancing student model's knowledge reception efficiency. Finally, extensive experiments are conducted to validate the effectiveness of our proposed approach. The results demonstrate that our proposed method achieves state-of-the-art (SoTA) performance in two UAV-OD datasets.
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational FusionCaixia Dong, Duwei Dai, Xinyi Han et al.
Accurate coronary artery segmentation is critical for computeraided diagnosis of coronary artery disease (CAD), yet it remains challenging due to the small size, complex morphology, and low contrast with surrounding tissues. To address these challenges, we propose a novel segmentation framework that leverages the power of vision foundation models (VFMs) through a parallel encoding architecture. Specifically, a vision transformer (ViT) encoder within the VFM captures global structural features, enhanced by the activation of the final two ViT blocks and the integration of an attention-guided enhancement (AGE) module, while a convolutional neural network (CNN) encoder extracts local details. These complementary features are adaptively fused using a cross-branch variational fusion (CVF) module, which models latent distributions and applies variational attention to assign modality-specific weights. Additionally, we introduce an evidential-learning uncertainty refinement (EUR) module, which quantifies uncertainty using evidence theory and refines uncertain regions by incorporating multi-scale feature aggregation and attention mechanisms, further enhancing segmentation accuracy. Extensive evaluations on one in-house and two public datasets demonstrate that the proposed framework significantly outperforms state-of-the-art methods, achieving superior performance in accurate coronary artery segmentation and showcasing strong generalization across multiple datasets. The code is available at https://github.com/d1c2x3/CAseg.
20.5CVJan 3, 2024
Few-shot Adaptation of Multi-modal Foundation Models: A SurveyFan Liu, Tianshu Zhang, Wenwen Dai et al.
Multi-modal (vision-language) models, such as CLIP, are replacing traditional supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of visual foundation models. These models with robust and aligned semantic representations learned from billions of internet image-text pairs and can be applied to various downstream tasks in a zero-shot manner. However, in some fine-grained domains like medical imaging and remote sensing, the performance of multi-modal foundation models often leaves much to be desired. Consequently, many researchers have begun to explore few-shot adaptation methods for these models, gradually deriving three main technical approaches: 1) prompt-based methods, 2) adapter-based methods, and 3) external knowledge-based methods. Nevertheless, this rapidly developing field has produced numerous results without a comprehensive survey to systematically organize the research progress. Therefore, in this survey, we introduce and analyze the research advancements in few-shot adaptation methods for multi-modal models, summarizing commonly used datasets and experimental setups, and comparing the results of different methods. In addition, due to the lack of reliable theoretical support for existing methods, we derive the few-shot adaptation generalization error bound for multi-modal models. The theorem reveals that the generalization error of multi-modal foundation models is constrained by three factors: domain gap, model capacity, and sample size. Based on this, we propose three possible solutions from the following aspects: 1) adaptive domain generalization, 2) adaptive model selection, and 3) adaptive knowledge utilization.
9.6CLFeb 26, 2025
PEToolLLM: Towards Personalized Tool Learning in Large Language ModelsQiancheng Xu, Yongqi Li, Heming Xia et al.
Tool learning has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external tools. Existing tool learning studies primarily focus on the general-purpose tool-use capability, which addresses explicit user requirements in instructions. However, they overlook the importance of personalized tool-use capability, leading to an inability to handle implicit user preferences. To address the limitation, we first formulate the task of personalized tool learning, which integrates user's interaction history towards personalized tool usage. To fill the gap of missing benchmarks, we construct PEToolBench, featuring diverse user preferences reflected in interaction history under three distinct personalized settings, and encompassing a wide range of tool-use scenarios. Moreover, we propose a framework PEToolLLaMA to adapt LLMs to the personalized tool learning task, which is trained through supervised fine-tuning and direct preference optimization. Extensive experiments on PEToolBench demonstrate the superiority of PEToolLLaMA over existing LLMs.
2.0CVDec 9, 2024
Robust Noisy Correspondence Learning via Self-Drop and Dual-WeightFan Liu, Chenwei Dong, Chuanyi Zhang et al.
Many researchers collect data from the internet through crowd-sourcing or web crawling to alleviate the data-hungry challenge associated with cross-modal matching. Although such practice does not require expensive annotations, it inevitably introduces mismatched pairs and results in a noisy correspondence problem. Current approaches leverage the memorization effect of deep neural networks to distinguish noise and perform re-weighting. However, briefly lowering the weight of noisy pairs cannot eliminate the negative impact of noisy correspondence in the training process. In this paper, we propose a novel self-drop and dual-weight approach, which achieves elaborate data processing by qua-partitioning the data. Specifically, our approach partitions all data into four types: clean and significant, clean yet insignificant, vague, and noisy. We analyze the effect of noisy and clean data pairs and find that for vision-language pre-training models, a small number of clean samples is more valuable than a majority of noisy ones. Based on this observation, we employ self-drop to discard noisy samples to effectively mitigate the impact of noise. In addition, we adopt a dual-weight strategy to ensure that the model focuses more on significant samples while appropriately leveraging vague samples. Compared to the prior works, our approach is more robust and demonstrates relatively more stable performance on noisy datasets, especially under a high noise ratio. Extensive experiments on three widely used datasets, including Flickr30K, MS-COCO, and Conceptual Captions, validate the effectiveness of our approach.
2.6CVJul 28, 2021
VirtualConductor: Music-driven Conducting Video Generation SystemDelong Chen, Fan Liu, Zewen Li et al.
In this demo, we present VirtualConductor, a system that can generate conducting video from any given music and a single user's image. First, a large-scale conductor motion dataset is collected and constructed. Then, we propose Audio Motion Correspondence Network (AMCNet) and adversarial-perceptual learning to learn the cross-modal relationship and generate diverse, plausible, music-synchronized motion. Finally, we combine 3D animation rendering and a pose transfer model to synthesize conducting video from a single given user's image. Therefore, any user can become a virtual conductor through the system.
5.5LGJul 20, 2021
Significant Wave Height Prediction based on Wavelet Graph Neural NetworkDelong Chen, Fan Liu, Zheqi Zhang et al.
Computational intelligence-based ocean characteristics forecasting applications, such as Significant Wave Height (SWH) prediction, are crucial for avoiding social and economic loss in coastal cities. Compared to the traditional empirical-based or numerical-based forecasting models, "soft computing" approaches, including machine learning and deep learning models, have shown numerous success in recent years. In this paper, we focus on enabling the deep learning model to learn both short-term and long-term spatial-temporal dependencies for SWH prediction. A Wavelet Graph Neural Network (WGNN) approach is proposed to integrate the advantages of wavelet transform and graph neural network. Several parallel graph neural networks are separately trained on wavelet decomposed data, and the reconstruction of each model's prediction forms the final SWH prediction. Experimental results show that the proposed WGNN approach outperforms other models, including the numerical models, the machine learning models, and several deep learning models.
1.2CVJun 9, 2020
Deep Learning Based Single Sample Per Person Face Recognition: A SurveyFan Liu, Delong Chen, Fei Wang et al.
Face recognition has long been an active research area in the field of artificial intelligence, particularly since the rise of deep learning in recent years. In some practical situations, each identity has only a single sample available for training. Face recognition under this situation is referred to as single sample face recognition and poses significant challenges to the effective training of deep models. Therefore, in recent years, researchers have attempted to unleash more potential of deep learning and improve the model recognition performance in the single sample situation. While several comprehensive surveys have been conducted on traditional single sample face recognition approaches, emerging deep learning based methods are rarely involved in these reviews. Accordingly, we focus on the deep learning-based methods in this paper, classifying them into virtual sample methods and generic learning methods. In the former category, virtual images or virtual features are generated to benefit the training of the deep model. In the latter one, additional multi-sample generic sets are used. There are three types of generic learning methods: combining traditional methods and deep features, improving the loss function, and improving network structure, all of which are covered in our analysis. Moreover, we review face datasets that have been commonly used for evaluating single sample face recognition models and go on to compare the results of different types of models. Additionally, we discuss problems with existing single sample face recognition methods, including identity information preservation in virtual sample methods, domain adaption in generic learning methods. Furthermore, we regard developing unsupervised methods is a promising future direction, and point out that the semantic gap as an important issue that needs to be further considered.
39.6CVApr 1, 2020
A Survey of Convolutional Neural Networks: Analysis, Applications, and ProspectsZewen Li, Wenjie Yang, Shouheng Peng et al.
Convolutional Neural Network (CNN) is one of the most significant networks in the deep learning field. Since CNN made impressive achievements in many areas, including but not limited to computer vision and natural language processing, it attracted much attention both of industry and academia in the past few years. The existing reviews mainly focus on the applications of CNN in different scenarios without considering CNN from a general perspective, and some novel ideas proposed recently are not covered. In this review, we aim to provide novel ideas and prospects in this fast-growing field as much as possible. Besides, not only two-dimensional convolution but also one-dimensional and multi-dimensional ones are involved. First, this review starts with a brief introduction to the history of CNN. Second, we provide an overview of CNN. Third, classic and advanced CNN models are introduced, especially those key points making them reach state-of-the-art results. Fourth, through experimental analysis, we draw some conclusions and provide several rules of thumb for function selection. Fifth, the applications of one-dimensional, two-dimensional, and multi-dimensional convolution are covered. Finally, some open issues and promising directions for CNN are discussed to serve as guidelines for future work.