Fan Liu

CV
h-index77
81papers
6,342citations
Novelty47%
AI Score61

81 Papers

82.5SPJun 4
From Ground to Sky: Architectures, Applications, and Challenges Shaping Low-Altitude Wireless Networks

Weijie Yuan, Yuanhao Cui, Jiacheng Wang et al.

In this article, we introduce a novel low-altitude wireless network (LAWN), which is a reconfigurable, three-dimensional (3D) layered architecture. In particular, the LAWN integrates connectivity, sensing, control, and computing across aerial and terrestrial nodes that enable seamless operation in complex, dynamic, and mission-critical environments. Different from the conventional aerial communication systems, LAWN's distinctive feature is its tight integration of functional planes in which multiple functionalities continually reshape themselves to operate safely and efficiently in the low-altitude sky. With the LAWN, we discuss several enabling technologies, such as integrated sensing and communication (ISAC), semantic communication, and fully-actuated control systems. Finally, we identify potential applications and key cross-layer challenges. This article offers a comprehensive roadmap for future research and development in the low-altitude airspace.

92.1AIJun 2Code
EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

Zherui Yang, Fan Liu, Yansong Ning et al.

Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively managing long-term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two-stage multi-agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS's hierarchical design reduces tool-selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state-of-the-art open-source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out-of-token failures. Our code and data are available at https://github.com/usail-hkust/EvoDS.

CVJun 19, 2023Code
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

Fan Liu, Delong Chen, Zhangqingyun Guan et al.

General-purpose foundation models have led to recent breakthroughs in artificial intelligence. In remote sensing, self-supervised learning (SSL) and Masked Image Modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they are inapplicable for retrieval and zero-shot applications due to the lack of language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing that aims to learn robust visual features with rich semantics and aligned text embeddings for seamless downstream application. To address the scarcity of pre-training data, we leverage data scaling which converts heterogeneous annotations into a unified image-caption data format based on Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion. By further incorporating UAV imagery, we produce a 12 $\times$ larger pretraining dataset than the combination of all available datasets. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, $\textit{k}$-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images. Evaluation on 16 datasets, including a newly introduced RemoteCount benchmark to test the object counting ability, shows that RemoteCLIP consistently outperforms baseline foundation models across different model scales. Impressively, RemoteCLIP beats the state-of-the-art method by 9.14% mean recall on the RSITMD dataset and 8.92% on the RSICD dataset. For zero-shot classification, our RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets. Project website: https://github.com/ChenDelong1999/RemoteCLIP

LGJun 17, 2023Code
Bkd-FedGNN: A Benchmark for Classification Backdoor Attacks on Federated Graph Neural Network

Fan Liu, Siqi Lai, Yansong Ning et al. · gatech

Federated Graph Neural Network (FedGNN) has recently emerged as a rapidly growing research topic, as it integrates the strengths of graph neural networks and federated learning to enable advanced machine learning applications without direct access to sensitive data. Despite its advantages, the distributed nature of FedGNN introduces additional vulnerabilities, particularly backdoor attacks stemming from malicious participants. Although graph backdoor attacks have been explored, the compounded complexity introduced by the combination of GNNs and federated learning has hindered a comprehensive understanding of these attacks, as existing research lacks extensive benchmark coverage and in-depth analysis of critical factors. To address these limitations, we propose Bkd-FedGNN, a benchmark for backdoor attacks on FedGNN. Specifically, Bkd-FedGNN decomposes the graph backdoor attack into trigger generation and injection steps, and extending the attack to the node-level federated setting, resulting in a unified framework that covers both node-level and graph-level classification tasks. Moreover, we thoroughly investigate the impact of multiple critical factors in backdoor attacks on FedGNN. These factors are categorized into global-level and local-level factors, including data distribution, the number of malicious attackers, attack time, overlapping rate, trigger size, trigger type, trigger position, and poisoning rate. Finally, we conduct comprehensive evaluations on 13 benchmark datasets and 13 critical factors, comprising 1,725 experimental configurations for node-level and graph-level tasks from six domains. These experiments encompass over 8,000 individual tests, allowing us to provide a thorough evaluation and insightful observations that advance our understanding of backdoor attacks on FedGNN.The Bkd-FedGNN benchmark is publicly available at https://github.com/usail-hkust/BkdFedGCN.

LGOct 5, 2022Code
Practical Adversarial Attacks on Spatiotemporal Traffic Forecasting Models

Fan Liu, Hao Liu, Wenzhao Jiang

Machine learning based traffic forecasting models leverage sophisticated spatiotemporal auto-correlations to provide accurate predictions of city-wide traffic states. However, existing methods assume a reliable and unbiased forecasting environment, which is not always available in the wild. In this work, we investigate the vulnerability of spatiotemporal traffic forecasting models and propose a practical adversarial spatiotemporal attack framework. Specifically, instead of simultaneously attacking all geo-distributed data sources, an iterative gradient-guided node saliency method is proposed to identify the time-dependent set of victim nodes. Furthermore, we devise a spatiotemporal gradient descent based scheme to generate real-valued adversarial traffic states under a perturbation constraint. Meanwhile, we theoretically demonstrate the worst performance bound of adversarial traffic forecasting attacks. Extensive experiments on two real-world datasets show that the proposed two-step framework achieves up to $67.8\%$ performance degradation on various advanced spatiotemporal forecasting models. Remarkably, we also show that adversarial training with our proposed attacks can significantly improve the robustness of spatiotemporal traffic forecasting models. Our code is available in \url{https://github.com/luckyfan-cs/ASTFA}.

CVDec 22, 2022Code
Multi-queue Momentum Contrast for Microvideo-Product Retrieval

Yali Du, Yinwei Wei, Wei Ji et al.

The booming development and huge market of micro-videos bring new e-commerce channels for merchants. Currently, more micro-video publishers prefer to embed relevant ads into their micro-videos, which not only provides them with business income but helps the audiences to discover their interesting products. However, due to the micro-video recording by unprofessional equipment, involving various topics and including multiple modalities, it is challenging to locate the products related to micro-videos efficiently, appropriately, and accurately. We formulate the microvideo-product retrieval task, which is the first attempt to explore the retrieval between the multi-modal and multi-modal instances. A novel approach named Multi-Queue Momentum Contrast (MQMC) network is proposed for bidirectional retrieval, consisting of the uni-modal feature and multi-modal instance representation learning. Moreover, a discriminative selection strategy with a multi-queue is used to distinguish the importance of different negatives based on their categories. We collect two large-scale microvideo-product datasets (MVS and MVS-large) for evaluation and manually construct the hierarchical category ontology, which covers sundry products in daily life. Extensive experiments show that MQMC outperforms the state-of-the-art baselines. Our replication package (including code, dataset, etc.) is publicly available at https://github.com/duyali2000/MQMC.

CVJun 22, 2022Code
ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Delong Chen, Zhao Wu, Fan Liu et al.

Contrastive Language Image Pretraining (CLIP) has received widespread attention, since its learned representations can be transferred well to various downstream tasks. During the training process of the CLIP model, the InfoNCE objective aligns positive image-text pairs and separates negative ones. We show an underlying representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerged within-modal anchors. Based on this understanding, in this paper, Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. Further, Prototypical Back Translation (PBT) is proposed to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under large modality gap. The PBT also enables us to introduce additional external teachers with richer prior language knowledge. ProtoCLIP is trained with an online episodic training strategy, which makes it can be scaled up to unlimited amounts of data. We train our ProtoCLIP on Conceptual Captions and achieved an +5.81% ImageNet linear probing improvement and an +2.01% ImageNet zero-shot classification improvement. On the larger YFCC-15M dataset, ProtoCLIP matches the performance of CLIP with 33% of training time. Codes are available at https://github.com/megvii-research/protoclip.

LGJun 25, 2023Code
Robust Spatiotemporal Traffic Forecasting with Reinforced Dynamic Adversarial Training

Fan Liu, Weijia Zhang, Hao Liu

Machine learning-based forecasting models are commonly used in Intelligent Transportation Systems (ITS) to predict traffic patterns and provide city-wide services. However, most of the existing models are susceptible to adversarial attacks, which can lead to inaccurate predictions and negative consequences such as congestion and delays. Therefore, improving the adversarial robustness of these models is crucial for ITS. In this paper, we propose a novel framework for incorporating adversarial training into spatiotemporal traffic forecasting tasks. We demonstrate that traditional adversarial training methods designated for static domains cannot be directly applied to traffic forecasting tasks, as they fail to effectively defend against dynamic adversarial attacks. Then, we propose a reinforcement learning-based method to learn the optimal node selection strategy for adversarial examples, which simultaneously strengthens the dynamic attack defense capability and reduces the model overfitting. Additionally, we introduce a self-knowledge distillation regularization module to overcome the "forgetting issue" caused by continuously changing adversarial nodes during training. We evaluate our approach on two real-world traffic datasets and demonstrate its superiority over other baselines. Our method effectively enhances the adversarial robustness of spatiotemporal traffic forecasting models. The source code for our framework is available at https://github.com/usail-hkust/RDAT.

CVFeb 4, 2023
Learning to Agree on Vision Attention for Visual Commonsense Reasoning

Zhenyang Li, Yangyang Guo, Kejie Wang et al.

Visual Commonsense Reasoning (VCR) remains a significant yet challenging research problem in the realm of visual reasoning. A VCR model generally aims at answering a textual question regarding an image, followed by the rationale prediction for the preceding answering process. Though these two processes are sequential and intertwined, existing methods always consider them as two independent matching-based instances. They, therefore, ignore the pivotal relationship between the two processes, leading to sub-optimal model performance. This paper presents a novel visual attention alignment method to efficaciously handle these two processes in a unified framework. To achieve this, we first design a re-attention module for aggregating the vision attention map produced in each process. Thereafter, the resultant two sets of attention maps are carefully aligned to guide the two processes to make decisions based on the same image regions. We apply this method to both conventional attention and the recent Transformer models and carry out extensive experiments on the VCR benchmark dataset. The results demonstrate that with the attention alignment module, our method achieves a considerable improvement over the baseline methods, evidently revealing the feasibility of the coupling of the two processes as well as the effectiveness of the proposed method.

CVNov 26, 2023
Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models

Yu-Wei Zhan, Fan Liu, Xin Luo et al.

Human-Object Interaction (HOI) detection aims at detecting human-object pairs and predicting their interactions. However, conventional HOI detection methods often struggle to fully capture the contextual information needed to accurately identify these interactions. While large Vision-Language Models (VLMs) show promise in tasks involving human interactions, they are not tailored for HOI detection. The complexity of human behavior and the diverse contexts in which these interactions occur make it further challenging. Contextual cues, such as the participants involved, body language, and the surrounding environment, play crucial roles in predicting these interactions, especially those that are unseen or ambiguous. Moreover, large VLMs are trained on vast image and text data, enabling them to generate contextual cues that help in understanding real-world contexts, object relationships, and typical interactions. Building on this, in this paper we introduce ConCue, a novel approach for improving visual feature extraction in HOI detection. Specifically, we first design specialized prompts to utilize large VLMs to generate contextual cues within an image. To fully leverage these cues, we develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors. Extensive experiments and analyses demonstrate the effectiveness of using these contextual cues for HOI detection. The experimental results show that integrating ConCue with existing state-of-the-art methods significantly enhances their performance on two widely used datasets.

LGJul 24, 2023
Landslide Surface Displacement Prediction Based on VSXC-LSTM Algorithm

Menglin Kong, Ruichen Li, Fan Liu et al.

Landslide is a natural disaster that can easily threaten local ecology, people's lives and property. In this paper, we conduct modelling research on real unidirectional surface displacement data of recent landslides in the research area and propose a time series prediction framework named VMD-SegSigmoid-XGBoost-ClusterLSTM (VSXC-LSTM) based on variational mode decomposition, which can predict the landslide surface displacement more accurately. The model performs well on the test set. Except for the random item subsequence that is hard to fit, the root mean square error (RMSE) and the mean absolute percentage error (MAPE) of the trend item subsequence and the periodic item subsequence are both less than 0.1, and the RMSE is as low as 0.006 for the periodic item prediction module based on XGBoost\footnote{Accepted in ICANN2023}.

CVSep 7, 2024Code
Power Line Aerial Image Restoration under dverse Weather: Datasets and Baselines

Sai Yang, Bin Hu, Bojun Zhou et al.

Power Line Autonomous Inspection (PLAI) plays a crucial role in the construction of smart grids due to its great advantages of low cost, high efficiency, and safe operation. PLAI is completed by accurately detecting the electrical components and defects in the aerial images captured by Unmanned Aerial Vehicles (UAVs). However, the visible quality of aerial images is inevitably degraded by adverse weather like haze, rain, or snow, which are found to drastically decrease the detection accuracy in our research. To circumvent this problem, we propose a new task of Power Line Aerial Image Restoration under Adverse Weather (PLAIR-AW), which aims to recover clean and high-quality images from degraded images with bad weather thus improving detection performance for PLAI. In this context, we are the first to release numerous corresponding datasets, namely, HazeCPLID, HazeTTPLA, HazeInsPLAD for power line aerial image dehazing, RainCPLID, RainTTPLA, RainInsPLAD for power line aerial image deraining, SnowCPLID, SnowInsPLAD for power line aerial image desnowing, which are synthesized upon the public power line aerial image datasets of CPLID, TTPLA, InsPLAD following the mathematical models. Meanwhile, we select numerous state-of-the-art methods from image restoration community as the baseline methods for PLAIR-AW. At last, we conduct large-scale empirical experiments to evaluate the performance of baseline methods on the proposed datasets. The proposed datasets and trained models are available at https://github.com/ntuhubin/PLAIR-AW.

CVSep 3, 2024
DiVE: DiT-based Video Generation with Enhanced Control

Junpeng Jiang, Gangyi Hong, Lijun Zhou et al.

Generating high-fidelity, temporally consistent videos in autonomous driving scenarios faces a significant challenge, e.g. problematic maneuvers in corner cases. Despite recent video generation works are proposed to tackcle the mentioned problem, i.e. models built on top of Diffusion Transformers (DiT), works are still missing which are targeted on exploring the potential for multi-view videos generation scenarios. Noticeably, we propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos which precisely match the given bird's-eye view layouts control. Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee the cross-view consistency, where joint cross-attention modules and ControlNet-Transformer are integrated to further improve the precision of control. To demonstrate our advantages, we extensively investigate the qualitative comparisons on nuScenes dataset, particularly in some most challenging corner cases. In summary, the effectiveness of our proposed method in producing long, controllable, and highly consistent videos under difficult conditions is proven to be effective.

CVJul 24, 2024
(PASS) Visual Prompt Locates Good Structure Sparsity through a Recurrent HyperNetwork

Tianjin Huang, Fang Meng, Li Shen et al.

Large-scale neural networks have demonstrated remarkable performance in different domains like vision and language processing, although at the cost of massive computation resources. As illustrated by compression literature, structural model pruning is a prominent algorithm to encourage model efficiency, thanks to its acceleration-friendly sparsity patterns. One of the key questions of structural pruning is how to estimate the channel significance. In parallel, work on data-centric AI has shown that prompting-based techniques enable impressive generalization of large language models across diverse downstream tasks. In this paper, we investigate a charming possibility - \textit{leveraging visual prompts to capture the channel importance and derive high-quality structural sparsity}. To this end, we propose a novel algorithmic framework, namely \texttt{PASS}. It is a tailored hyper-network to take both visual prompts and network weight statistics as input, and output layer-wise channel sparsity in a recurrent manner. Such designs consider the intrinsic channel dependency between layers. Comprehensive experiments across multiple network architectures and six datasets demonstrate the superiority of \texttt{PASS} in locating good structural sparsity. For example, at the same FLOPs level, \texttt{PASS} subnetworks achieve $1\%\sim 3\%$ better accuracy on Food101 dataset; or with a similar performance of $80\%$ accuracy, \texttt{PASS} subnetworks obtain $0.35\times$ more speedup than the baselines.

92.9CLApr 12
TInR: Exploring Tool-Internalized Reasoning in Large Language Models

Qiancheng Xu, Yongqi Li, Fan Liu et al.

Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning. However, this leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), aiming at facilitating reasoning with tool knowledge internalized into LLMs. Achieving this goal presents notable requirements, including tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. TInR-U is trained through a three-phase pipeline: 1) tool internalization with a bidirectional knowledge alignment strategy; 2) supervised fine-tuning warm-up using high-quality reasoning annotations, and 3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method across in-domain and out-of-domain settings. Experiment results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency.

LGSep 24, 2024
Spatial-Temporal Mixture-of-Graph-Experts for Multi-Type Crime Prediction

Ziyang Wu, Fan Liu, Jindong Han et al.

As various types of crime continue to threaten public safety and economic development, predicting the occurrence of multiple types of crimes becomes increasingly vital for effective prevention measures. Although extensive efforts have been made, most of them overlook the heterogeneity of different crime categories and fail to address the issue of imbalanced spatial distribution. In this work, we propose a Spatial-Temporal Mixture-of-Graph-Experts (ST-MoGE) framework for collective multiple-type crime prediction. To enhance the model's ability to identify diverse spatial-temporal dependencies and mitigate potential conflicts caused by spatial-temporal heterogeneity of different crime categories, we introduce an attentive-gated Mixture-of-Graph-Experts (MGEs) module to capture the distinctive and shared crime patterns of each crime category. Then, we propose Cross-Expert Contrastive Learning(CECL) to update the MGEs and force each expert to focus on specific pattern modeling, thereby reducing blending and redundancy. Furthermore, to address the issue of imbalanced spatial distribution, we propose a Hierarchical Adaptive Loss Re-weighting (HALR) approach to eliminate biases and insufficient learning of data-scarce regions. To evaluate the effectiveness of our methods, we conduct comprehensive experiments on two real-world crime datasets and compare our results with twelve advanced baselines. The experimental results demonstrate the superiority of our methods.

CVApr 30, 2023
Few-shot Classification via Ensemble Learning with Multi-Order Statistics

Sai Yang, Fan Liu, Delong Chen et al.

Transfer learning has been widely adopted for few-shot classification. Recent studies reveal that obtaining good generalization representation of images on novel classes is the key to improving the few-shot classification accuracy. To address this need, we prove theoretically that leveraging ensemble learning on the base classes can correspondingly reduce the true error in the novel classes. Following this principle, a novel method named Ensemble Learning with Multi-Order Statistics (ELMOS) is proposed in this paper. In this method, after the backbone network, we use multiple branches to create the individual learners in the ensemble learning, with the goal to reduce the storage cost. We then introduce different order statistics pooling in each branch to increase the diversity of the individual learners. The learners are optimized with supervised losses during the pre-training phase. After pre-training, features from different branches are concatenated for classifier evaluation. Extensive experiments demonstrate that each branch can complement the others and our method can produce a state-of-the-art performance on multiple few-shot classification benchmark datasets.

65.2ITMar 30
Simultaneous Sensing Data Acquisition and Sharing in Low-Altitude Wireless Networks: Fundamental Limits and Optimal Signaling

Fuwang Dong, Fan Liu, Yifeng Xiong et al.

In the low-altitude wireless networks, the simultaneous sensing data acquisition and sharing (SDAS) through an ISAC signaling strategy becomes a typical application scenario. In this paper, we mainly investigate three primary aspects of the SDAS system, namely, the information-theoretic framework, the optimal distribution of channel input, and the optimal waveform design for Gaussian signaling. First, we establish the information-theoretic framework and develop a modified source-channel separation theorem (MSST) tailored for the SDAS systems. The proposed MSST elucidates the relationship between achievable distortion, coding rate, and communication channel capacity in cases where the distortion metric is separable for sensing and communication (S\&C) processes. Second, we present an optimal channel input design for dual-functional signaling, which aims to minimize SDAS distortion under the constraints of the MSST and resource budget. We then conceive a two-step Blahut-Arimoto (BA)-based optimal search algorithm to numerically solve the functional optimization problem. Third, to provide practical design insights, we further propose an optimal waveform design for Gaussian signaling in multi-input multi-output (MIMO) SDAS systems. The associated covariance matrix optimization problem is addressed using a successive convex approximation (SCA)-based waveform design algorithm. Finally, we provide numerical simulation results to demonstrate the effectiveness of the proposed algorithms, which characterize the unique performance tradeoff between S&C processes.

89.6NIMay 23
Low-Altitude Wireless Networks: The Next Horizon of Wireless Infrastructure

Yuanhao Cui, Jiali Nie, Weijie Yuan et al.

Low-altitude airspace, roughly defined as the region up to 3000 meters above ground level, is envisioned as a new spatial domain for daily human and machine activities. This article introduces the concept of the Low-Altitude Wireless Network (LAWN), which represents a paradigm shift from the current ground-based communication-only network to a three-dimensional (3D) multifunctional network. We analyze the key driving forces, network architecture, and limiting factors of LAWN, with a particular focus on the tight integration of communication, sensing, and control in highly dynamic airspace environments. By establishing the coupling between airspace capacity and wireless channel capacity, we reveal the intrinsic limits of airspace management and identify the fundamental challenges and opportunities associated with its evolution.

CVAug 21, 2024
Making Large Vision Language Models to be Good Few-shot Learners

Fan Liu, Wenwen Cai, Jian Huo et al.

Few-shot classification (FSC) is a fundamental yet challenging task in computer vision that involves recognizing novel classes from limited data. While previous methods have focused on enhancing visual features or incorporating additional modalities, Large Vision Language Models (LVLMs) offer a promising alternative due to their rich knowledge and strong visual perception. However, LVLMs risk learning specific response formats rather than effectively extracting useful information from support data in FSC tasks. In this paper, we investigate LVLMs' performance in FSC and identify key issues such as insufficient learning and the presence of severe positional biases. To tackle the above challenges, we adopt the meta-learning strategy to teach models "learn to learn". By constructing a rich set of meta-tasks for instruction fine-tuning, LVLMs enhance the ability to extract information from few-shot support data for classification. Additionally, we further boost LVLM's few-shot learning capabilities through label augmentation and candidate selection in the fine-tuning and inference stage, respectively. Label augmentation is implemented via a character perturbation strategy to ensure the model focuses on support information. Candidate selection leverages attribute descriptions to filter out unreliable candidates and simplify the task. Extensive experiments demonstrate that our approach achieves superior performance on both general and fine-grained datasets. Furthermore, our candidate selection strategy has been proven beneficial for training-free LVLMs.

87.4CVMay 6
RemoteZero: Geospatial Reasoning with Zero Human Annotations

Liang Yao, Fan Liu, Shengxiang Xu et al.

Geospatial reasoning requires models to resolve complex spatial semantics and user intent into precise target locations for Earth observation. Recent progress has liberated the reasoning path from manual curation, allowing models to generate their own inference chains. Yet a final dependency remains: they are still supervised by human-annotated ground-truth coordinates. This leaves the reasoning process autonomous, but not its spatial endpoint, and prevents true self-evolution on abundant unlabeled remote sensing data. To break this bottleneck, we introduce RemoteZero, a box-supervision-free framework for geospatial reasoning. RemoteZero is motivated by a simple asymmetry: an MLLM is typically better at verifying whether a region satisfies a query than at directly generating precise coordinates. Leveraging this stronger discriminative ability, RemoteZero replaces geometric supervision with intrinsic semantic verification and enables GRPO training without box annotations. The resulting framework further supports iterative self-evolution, allowing the model to improve from unlabeled remote sensing imagery through its own verification signal. Experiments show that RemoteZero achieves competitive performance against strong supervised methods, demonstrating the potential of self-verifying training for geospatial reasoning localization.

CVJan 13, 2025Code
Dynamic Multimodal Fusion via Meta-Learning Towards Micro-Video Recommendation

Han Liu, Yinwei Wei, Fan Liu et al.

Multimodal information (e.g., visual, acoustic, and textual) has been widely used to enhance representation learning for micro-video recommendation. For integrating multimodal information into a joint representation of micro-video, multimodal fusion plays a vital role in the existing micro-video recommendation approaches. However, the static multimodal fusion used in previous studies is insufficient to model the various relationships among multimodal information of different micro-videos. In this paper, we develop a novel meta-learning-based multimodal fusion framework called Meta Multimodal Fusion (MetaMMF), which dynamically assigns parameters to the multimodal fusion function for each micro-video during its representation learning. Specifically, MetaMMF regards the multimodal fusion of each micro-video as an independent task. Based on the meta information extracted from the multimodal features of the input task, MetaMMF parameterizes a neural network as the item-specific fusion function via a meta learner. We perform extensive experiments on three benchmark datasets, demonstrating the significant improvements over several state-of-the-art multimodal recommendation models, like MMGCN, LATTICE, and InvRL. Furthermore, we lighten our model by adopting canonical polyadic decomposition to improve the training efficiency, and validate its effectiveness through experimental results. Codes are available at https://github.com/hanliu95/MetaMMF.

AIFeb 11, 2025Code
Bag of Tricks for Inference-time Computation of LLM Reasoning

Fan Liu, Wenshuo Chao, Naiqiang Tan et al.

With the advancement of large language models (LLMs), solving complex reasoning tasks has gained increasing attention. Inference-time computation methods (e.g., Best-of-N, beam search, et al.) are particularly valuable as they can enhance reasoning performance without modifying model parameters or requiring additional training. However, these techniques come with implementation challenges, and most existing methods remain at the proof-of-concept stage with limited practical adoption due to their computational complexity and varying effectiveness across different tasks. In this paper, we investigate and benchmark diverse inference-time computation strategies across reasoning tasks of varying complexity. Since most current methods rely on a proposer-verifier pipeline that first generates candidate solutions (e.g., reasoning solutions) and then selects the best one based on reward signals (e.g., RLHF rewards, process rewards), our research focuses on optimizing both candidate solution generation (e.g., instructing prompts, hyperparameters such as temperature and top-p) and reward mechanisms (e.g., self-evaluation, reward types). Through extensive experiments (more than 20,000 A100-80G GPU hours with over 1,000 experiments) across a variety of models (e.g., Llama, Qwen, and Mistral families) of various sizes, our ablation studies reveal that previously overlooked strategies can significantly enhance performance (e.g., tuning temperature can improve reasoning task performance by up to 5%). Furthermore, we establish a standardized benchmark for inference-time computation by systematically evaluating six representative methods across eight reasoning tasks. These findings provide a stronger foundation for future research. The code is available at https://github.com/usail-hkust/benchmark_inference_time_computation_LLM

CVAug 21, 2024
Domain-invariant Progressive Knowledge Distillation for UAV-based Object Detection

Liang Yao, Fan Liu, Chuanyi Zhang et al.

Knowledge distillation (KD) is an effective method for compressing models in object detection tasks. Due to limited computational capability, UAV-based object detection (UAV-OD) widely adopt the KD technique to obtain lightweight detectors. Existing methods often overlook the significant differences in feature space caused by the large gap in scale between the teacher and student models. This limitation hampers the efficiency of knowledge transfer during the distillation process. Furthermore, the complex backgrounds in UAV images make it challenging for the student model to efficiently learn the object features. In this paper, we propose a novel knowledge distillation framework for UAV-OD. Specifically, a progressive distillation approach is designed to alleviate the feature gap between teacher and student models. Then a new feature alignment method is provided to extract object-related features for enhancing student model's knowledge reception efficiency. Finally, extensive experiments are conducted to validate the effectiveness of our proposed approach. The results demonstrate that our proposed method achieves state-of-the-art (SoTA) performance in two UAV-OD datasets.

87.3CVApr 19
RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation

Rui Min, Liang Yao, Shiyu Miao et al.

A robust Multimodal Large Language Model (MLLM) for Earth Observation should maintain consistent interpretation and reasoning under realistic input variations. However, current Remote Sensing MLLMs fail to meet this requirement. Trained on carefully curated clean datasets, they learn brittle mappings that do not generalize to noisy conditions in operational Earth Observation. Consequently, their performance degrades when confronted with imperfect inputs in deployment. To quantify this vulnerability, we construct a realistic set of multimodal perturbations, including visual degradations such as cloud and fog cover, together with diverse human-centric textual variations ranging from colloquialisms to vague or omitted instructions. Empirical evaluations show that these perturbations significantly impair the visual-semantic reasoning capabilities of leading RS foundation models. To address this limitation, we introduce RemoteShield, a robust Remote Sensing MLLM trained to maintain consistent outputs across realistic input variations. During training, each clean sample is paired with its image-text perturbed variants to form a semantic equivalence cluster. Rather than directly fitting noisy samples, RemoteShield is optimized through preference learning over clean and perturbed conditions within the same cluster. By comparing model responses to clean and corrupted inputs, the model is encouraged to favor stable responses over perturbation-induced failures. This cross-condition alignment helps the model focus on underlying task semantics despite visual degradations and textual noise. Experiments on three Earth Observation tasks show that RemoteShield consistently delivers stronger robustness and cross-condition consistency than representative baselines under realistic multimodal perturbations.

CVMay 23, 2025Code
RemoteSAM: Towards Segment Anything for Earth Observation

Liang Yao, Fan Liu, Delong Chen et al.

We aim to develop a robust yet flexible visual foundation model for Earth observation. It should possess strong capabilities in recognizing and localizing diverse visual targets while providing compatibility with various input-output interfaces required across different task scenarios. Current systems cannot meet these requirements, as they typically utilize task-specific architecture trained on narrow data domains with limited semantic coverage. Our study addresses these limitations from two aspects: data and modeling. We first introduce an automatic data engine that enjoys significantly better scalability compared to previous human annotation or rule-based approaches. It has enabled us to create the largest dataset of its kind to date, comprising 270K image-text-mask triplets covering an unprecedented range of diverse semantic categories and attribute specifications. Based on this data foundation, we further propose a task unification paradigm that centers around referring expression segmentation. It effectively handles a wide range of vision-centric perception tasks, including classification, detection, segmentation, grounding, etc, using a single model without any task-specific heads. Combining these innovations on data and modeling, we present RemoteSAM, a foundation model that establishes new SoTA on several earth observation perception benchmarks, outperforming other foundation models such as Falcon, GeoChat, and LHRS-Bot with significantly higher efficiency. Models and data are publicly available at https://github.com/1e12Leon/RemoteSAM.

92.2SPMay 18
From Coverage to Sensing: ISAC meets FR3

Ahmad Bazzi, Florian Gast, Fan Liu et al.

Future 6G systems are expected to exploit upper midband spectrum in frequency range 3 (FR3) not only for high throughput communications, but also for sensing services such as localization, detection, and situational awareness. The following paper develops a concrete path from today's coverage-oriented deployments to FR3 networks that treat sensing as a native function. We first show how existing FR2 radars can be time-multiplexed and coordinated under a $6$G medium access control as radar-as-a-service, forming a bridge between legacy sensing and network-managed integrated sensing and communications (ISAC). We then propose a hierarchical FR3 beam-alignment strategy in which coarse access occurs at lower frequencies and refinement occurs at upper FR3, and quantify the resulting sensing and communication capabilities via range-angle Cram{é}r-Rao bounds in the near field. We identify intra- and inter-beam squint phenomena specific to wideband FR3 arrays, and discuss design approaches to mitigate them. On the signal-processing side, we argue that FR3 sensing cannot rely solely on pilot resources and discuss how much sensing information can be extracted from payload resource elements. We further highlight the role of calibrated FR3 channel simulators and real-time models as the core of wireless digital twins for training and evaluating ISAC algorithms, and discuss how massive MIMO and dense or distributed deployments at FR3 naturally act as large reconfigurable sensor arrays.

CVJun 16, 2022
A Simple Baseline for Adversarial Domain Adaptation-based Unsupervised Flood Forecasting

Delong Chen, Ruizhi Zhou, Yanling Pan et al.

Flood disasters cause enormous social and economic losses. However, both traditional physical models and learning-based flood forecasting models require massive historical flood data to train the model parameters. When come to some new site that does not have sufficient historical data, the model performance will drop dramatically due to overfitting. This technical report presents a Flood Domain Adaptation Network (FloodDAN), a baseline of applying Unsupervised Domain Adaptation (UDA) to the flood forecasting problem. Specifically, training of FloodDAN includes two stages: in the first stage, we train a rainfall encoder and a prediction head to learn general transferable hydrological knowledge on large-scale source domain data; in the second stage, we transfer the knowledge in the pretrained encoder into the rainfall encoder of target domain through adversarial domain alignment. During inference, we utilize the target domain rainfall encoder trained in the second stage and the prediction head trained in the first stage to get flood forecasting predictions. Experimental results on Tunxi and Changhua flood dataset show that FloodDAN can perform flood forecasting effectively with zero target domain supervision. The performance of the FloodDAN is on par with supervised models that uses 450-500 hours of supervision.

AIMay 20, 2025Code
MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem

Fan Liu, Zherui Yang, Cancheng Liu et al.

Mathematical modeling is a cornerstone of scientific discovery and engineering practice, enabling the translation of real-world problems into formal systems across domains such as physics, biology, and economics. Unlike mathematical reasoning, which assumes a predefined formulation, modeling requires open-ended problem analysis, abstraction, and principled formalization. While Large Language Models (LLMs) have shown strong reasoning capabilities, they fall short in rigorous model construction, limiting their utility in real-world problem-solving. To this end, we formalize the task of LLM-powered real-world mathematical modeling, where agents must analyze problems, construct domain-appropriate formulations, and generate complete end-to-end solutions. We introduce MM-Bench, a curated benchmark of 111 problems from the Mathematical Contest in Modeling (MCM/ICM), spanning the years 2000 to 2025 and across ten diverse domains such as physics, biology, and economics. To tackle this task, we propose MM-Agent, an expert-inspired framework that decomposes mathematical modeling into four stages: open-ended problem analysis, structured model formulation, computational problem solving, and report generation. Experiments on MM-Bench show that MM-Agent significantly outperforms baseline agents, achieving an 11.88\% improvement over human expert solutions while requiring only 15 minutes and \$0.88 per task using GPT-4o. Furthermore, under official MCM/ICM protocols, MM-Agent assisted two undergraduate teams in winning the Finalist Award (\textbf{top 2.0\% among 27,456 teams}) in MCM/ICM 2025, demonstrating its practical effectiveness as a modeling copilot. Our code is available at https://github.com/usail-hkust/LLM-MM-Agent

88.1CVApr 22Code
Evaluating Remote Sensing Image Captions Beyond Metric Biases

Ziyun Chen, Fan Liu, Liang Yao et al.

The core objective of image captioning is to achieve lossless semantic compression from visual signals into textual modalities. However, the reliance on manually curated reference texts for evaluation essentially forces models to mimic specific human annotation styles, thereby masking the true descriptive capabilities of advanced foundation models. This systemic misalignment prompts a critical question: Is task-specific fine-tuning truly necessary for Remote Sensing Image Captioning, or is the perceived performance gap merely an artifact of flawed evaluation criteria? To investigate this discrepancy, we propose ReconScore, a novel reference-free evaluation metric. Rather than computing textual similarities, we assess caption quality by its capability to reconstruct the original visual elements solely from the generated text, effectively neutralizing human annotation biases. Applying this metric, we uncover a profound, counterintuitive truth: inherently powerful, unfine-tuned MLLMs surpass their fine-tuned counterparts in authentic zero-shot RSIC tasks. Driven by this structural discovery, we introduce RemoteDescriber, a completely training-free generation methodology. By employing ReconScore as a self-correction mechanism, we iteratively refine the semantic precision of MLLM outputs without any computational fine-tuning overhead. Comprehensive experiments demonstrate that RemoteDescriber achieves state-of-the-art performance on three datasets. Furthermore, we validate ReconScore's reliability and analyze the flaws of traditional metrics. Our code is available at https://github.com/hhu-czy/RemoteDescriber.

78.4CVApr 9
RemoteAgent: Bridging Vague Human Intents and Earth Observation with RL-based Agentic MLLMs

Liang Yao, Shengxiang Xu, Fan Liu et al.

Earth Observation (EO) systems are essentially designed to support domain experts who often express their requirements through vague natural language rather than precise, machine-friendly instructions. Depending on the specific application scenario, these vague queries can demand vastly different levels of visual precision. Consequently, a practical EO AI system must bridge the gap between ambiguous human queries and the appropriate multi-granularity visual analysis tasks, ranging from holistic image interpretation to fine-grained pixel-wise predictions. While Multi-modal Large Language Models (MLLMs) demonstrate strong semantic understanding, their text-based output format is inherently ill-suited for dense, precision-critical spatial predictions. Existing agentic frameworks address this limitation by delegating tasks to external tools, but indiscriminate tool invocation is computationally inefficient and underutilizes the MLLM's native capabilities. To this end, we propose RemoteAgent, an agentic framework that strategically respects the intrinsic capability boundaries of MLLMs. To empower this framework to understand real user intents, we construct VagueEO, a human-centric instruction dataset pairing EO tasks with simulated vague natural-language queries. By leveraging VagueEO for reinforcement fine-tuning, we align an MLLM into a robust cognitive core that directly resolves image- and sparse region-level tasks. Consequently, RemoteAgent processes suitable tasks internally while intelligently orchestrating specialized tools via the Model Context Protocol exclusively for dense predictions. Extensive experiments demonstrate that RemoteAgent achieves robust intent recognition capabilities while delivering highly competitive performance across diverse EO tasks.

98.2ITMay 14
CP-OFDM Achieves Lower Ranging CRB Than Frequency-Spread Waveforms in the Large-Sample Regime

Fan Liu, Yifeng Xiong, Ya-Feng Liu et al.

The inherent randomness of communication symbols creates a fundamental tension in Integrated Sensing and Communications (ISAC). On the one hand, they enable data transmission while allowing sensing to fully reuse communication resources. On the other hand, their randomness induces waveform-dependent fluctuations that directly affect sensing accuracy. This paper investigates a foundational question arising from this tradeoff: \textit{How does the modulation waveform affect the ranging Cramér--Rao Bound (CRB) when sensing reuses random data symbols?} We address this question by revealing a structural factorization of the Fisher information matrix (FIM) for joint delay-amplitude estimation, which separates the deterministic Jacobian of the target geometry from the random frequency-domain signal power induced by the data symbols. This structure yields a Jensen-type universal lower bound on the CRB, which is exactly attained by CP-OFDM under PSK constellations. For QAM and broader sub-Gaussian constellations, we develop an asymptotic perturbation analysis of the inverse FIM and prove that, when the number of transmitted symbols $N$ grows large, CP-OFDM achieves a lower ranging CRB than any frequency-spread orthogonal waveform over the almost-sure event where the random FIM is invertible. This superiority is further extended to amplitude estimation and full joint delay-amplitude estimation. We also characterize the local geometry of the stochastic CRB minimization problem over the unitary group. The analysis reveals that CP-OFDM is a stationary point for finite $N$, and its Riemannian Hessian is positive semidefinite for sufficiently large $N$, establishing its asymptotic local optimality. Numerical results confirm that OFDM outperforms representative waveforms including SC, OTFS, and AFDM.

LGOct 17, 2025Code
Foundation Models for Scientific Discovery: From Paradigm Enhancement to Paradigm Transition

Fan Liu, Jindong Han, Tengfei Lyu et al.

Foundation models (FMs), such as GPT-4 and AlphaFold, are reshaping the landscape of scientific research. Beyond accelerating tasks such as hypothesis generation, experimental design, and result interpretation, they prompt a more fundamental question: Are FMs merely enhancing existing scientific methodologies, or are they redefining the way science is conducted? In this paper, we argue that FMs are catalyzing a transition toward a new scientific paradigm. We introduce a three-stage framework to describe this evolution: (1) Meta-Scientific Integration, where FMs enhance workflows within traditional paradigms; (2) Hybrid Human-AI Co-Creation, where FMs become active collaborators in problem formulation, reasoning, and discovery; and (3) Autonomous Scientific Discovery, where FMs operate as independent agents capable of generating new scientific knowledge with minimal human intervention. Through this lens, we review current applications and emerging capabilities of FMs across existing scientific paradigms. We further identify risks and future directions for FM-enabled scientific discovery. This position paper aims to support the scientific community in understanding the transformative role of FMs and to foster reflection on the future of scientific discovery. Our project is available at https://github.com/usail-hkust/Awesome-Foundation-Models-for-Scientific-Discovery.

LGAug 13, 2025Code
Combating Noisy Labels via Dynamic Connection Masking

Xinlei Zhang, Fan Liu, Chuanyi Zhang et al.

Noisy labels are inevitable in real-world scenarios. Due to the strong capacity of deep neural networks to memorize corrupted labels, these noisy labels can cause significant performance degradation. Existing research on mitigating the negative effects of noisy labels has mainly focused on robust loss functions and sample selection, with comparatively limited exploration of regularization in model architecture. Inspired by the sparsity regularization used in Kolmogorov-Arnold Networks (KANs), we propose a Dynamic Connection Masking (DCM) mechanism for both Multi-Layer Perceptron Networks (MLPs) and KANs to enhance the robustness of classifiers against noisy labels. The mechanism can adaptively mask less important edges during training by evaluating their information-carrying capacity. Through theoretical analysis, we demonstrate its efficiency in reducing gradient error. Our approach can be seamlessly integrated into various noise-robust training methods to build more robust deep networks, including robust loss functions, sample selection strategies, and regularization techniques. Extensive experiments on both synthetic and real-world benchmarks demonstrate that our method consistently outperforms state-of-the-art (SOTA) approaches. Furthermore, we are also the first to investigate KANs as classifiers against noisy labels, revealing their superior noise robustness over MLPs in real-world noisy scenarios. Our code will soon be publicly available.

MAJul 1, 2025Code
Large Language Model Powered Intelligent Urban Agents: Concepts, Capabilities, and Applications

Jindong Han, Yansong Ning, Zirui Yuan et al.

The long-standing vision of intelligent cities is to create efficient, livable, and sustainable urban environments using big data and artificial intelligence technologies. Recently, the advent of Large Language Models (LLMs) has opened new ways toward realizing this vision. With powerful semantic understanding and reasoning capabilities, LLMs can be deployed as intelligent agents capable of autonomously solving complex problems across domains. In this article, we focus on Urban LLM Agents, which are LLM-powered agents that are semi-embodied within the hybrid cyber-physical-social space of cities and used for system-level urban decision-making. First, we introduce the concept of urban LLM agents, discussing their unique capabilities and features. Second, we survey the current research landscape from the perspective of agent workflows, encompassing urban sensing, memory management, reasoning, execution, and learning. Third, we categorize the application domains of urban LLM agents into five groups: urban planning, transportation, environment, public safety, and urban society, presenting representative works in each group. Finally, we discuss trustworthiness and evaluation issues that are critical for real-world deployment, and identify several open problems for future research. This survey aims to establish a foundation for the emerging field of urban LLM agents and to provide a roadmap for advancing the intersection of LLMs and urban intelligence. A curated list of relevant papers and open-source resources is maintained and continuously updated at https://github.com/usail-hkust/Awesome-Urban-LLM-Agents.

AIJun 1, 2025Code
FLEx: Personalized Federated Learning for Mixture-of-Experts LLMs via Expert Grafting

Fan Liu, Bikang Pan, Zhongyi Wang et al.

Federated instruction tuning of large language models (LLMs) is challenged by significant data heterogeneity across clients, demanding robust personalization. The Mixture of Experts (MoE) architecture, where experts can specialize in distinct data patterns, presents a natural architectural solution to this challenge. The inherent sparsity of the MoE architecture, achieved by selectively activating experts, poses a significant challenge to its integration with federated learning (FL). Conventional FL frameworks, designed for dense models, naively aggregate all expert parameters irrespective of their local activation patterns. This naive approach not only undermines MoE's dynamic sparsity but also risks corrupting the world knowledge within pretrained experts. To address this, we propose FLEx (Federated LLMs with Personalized Experts), a novel framework that leverages pretrained MoE-based LLMs for efficient personalization. By aggregating only the shared non-expert parameters, FLEx significantly reduces communication overhead and preserves the world knowledge stored within the frozen pretrained experts. For personalization, we introduce a novel expert grafting mechanism that leverages dynamic sparsity to construct a client-specific expert from selected components of pretrained experts, tailored to local data. This grafted expert is then fine-tuned locally alongside the gating mechanism. This joint training enables the model to learn when to leverage the shared knowledge from frozen experts and when to employ the personalized one. Evaluations on diverse, non-IID instruction tuning datasets show that FLEx consistently outperforms federated baselines on average, while demonstrating strong knowledge preservation on the knowledge-driven benchmark MMLU. Our code is available at \href{https://anonymous.4open.science/r/FLEx-8F12}{\texttt{https://anonymous.4open.science/r/FLEx-8F12}}.

CVJan 14
Disentangle Object and Non-object Infrared Features via Language Guidance

Fan Liu, Ting Wu, Chuanyi Zhang et al.

Infrared object detection focuses on identifying and locating objects in complex environments (\eg, dark, snow, and rain) where visible imaging cameras are disabled by poor illumination. However, due to low contrast and weak edge information in infrared images, it is challenging to extract discriminative object features for robust detection. To deal with this issue, we propose a novel vision-language representation learning paradigm for infrared object detection. An additional textual supervision with rich semantic information is explored to guide the disentanglement of object and non-object features. Specifically, we propose a Semantic Feature Alignment (SFA) module to align the object features with the corresponding text features. Furthermore, we develop an Object Feature Disentanglement (OFD) module that disentangles text-aligned object features and non-object features by minimizing their correlation. Finally, the disentangled object features are entered into the detection head. In this manner, the detection performance can be remarkably enhanced via more discriminative and less noisy features. Extensive experimental results demonstrate that our approach achieves superior performance on two benchmarks: M\textsuperscript{3}FD (83.7\% mAP), FLIR (86.1\% mAP). Our code will be publicly available once the paper is accepted.

IVJul 17, 2025Code
Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion

Caixia Dong, Duwei Dai, Xinyi Han et al.

Accurate coronary artery segmentation is critical for computeraided diagnosis of coronary artery disease (CAD), yet it remains challenging due to the small size, complex morphology, and low contrast with surrounding tissues. To address these challenges, we propose a novel segmentation framework that leverages the power of vision foundation models (VFMs) through a parallel encoding architecture. Specifically, a vision transformer (ViT) encoder within the VFM captures global structural features, enhanced by the activation of the final two ViT blocks and the integration of an attention-guided enhancement (AGE) module, while a convolutional neural network (CNN) encoder extracts local details. These complementary features are adaptively fused using a cross-branch variational fusion (CVF) module, which models latent distributions and applies variational attention to assign modality-specific weights. Additionally, we introduce an evidential-learning uncertainty refinement (EUR) module, which quantifies uncertainty using evidence theory and refines uncertain regions by incorporating multi-scale feature aggregation and attention mechanisms, further enhancing segmentation accuracy. Extensive evaluations on one in-house and two public datasets demonstrate that the proposed framework significantly outperforms state-of-the-art methods, achieving superior performance in accurate coronary artery segmentation and showcasing strong generalization across multiple datasets. The code is available at https://github.com/d1c2x3/CAseg.

CRJun 13, 2024Code
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Zhao Xu, Fan Liu, Hao Liu

Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work primarily overlooks the diverse key factors of jailbreak attacks, with most studies concentrating on LLM vulnerabilities and lacking exploration of defense-enhanced LLMs. To address these issues, we introduced $\textbf{JailTrickBench}$ to evaluate the impact of various attack settings on LLM performance and provide a baseline for jailbreak attacks, encouraging the adoption of a standardized evaluation framework. Specifically, we evaluate the eight key factors of implementing jailbreak attacks on LLMs from both target-level and attack-level perspectives. We further conduct seven representative jailbreak attacks on six defense methods across two widely used datasets, encompassing approximately 354 experiments with about 55,000 GPU hours on A800-80G. Our experimental results highlight the need for standardized benchmarking to evaluate these attacks on defense-enhanced LLMs. Our code is available at https://github.com/usail-hkust/JailTrickBench.

CVJun 10, 2024Code
UEMM-Air: Make Unmanned Aerial Vehicles Perform More Multi-modal Tasks

Liang Yao, Fan Liu, Shengxiang Xu et al.

The development of multi-modal learning for Unmanned Aerial Vehicles (UAVs) typically relies on a large amount of pixel-aligned multi-modal image data. However, existing datasets face challenges such as limited modalities, high construction costs, and imprecise annotations. To this end, we propose a synthetic multi-modal UAV-based multi-task dataset, UEMM-Air. Specifically, we simulate various UAV flight scenarios and object types using the Unreal Engine (UE). Then we design the UAV's flight logic to automatically collect data from different scenarios, perspectives, and altitudes. Furthermore, we propose a novel heuristic automatic annotation algorithm to generate accurate object detection labels. Finally, we utilize labels to generate text descriptions of images to make our UEMM-Air support more cross-modality tasks. In total, our UEMM-Air consists of 120k pairs of images with 6 modalities and precise annotations. Moreover, we conduct numerous experiments and establish new benchmark results on our dataset. We also found that models pre-trained on UEMM-Air exhibit better performance on downstream tasks compared to other similar datasets. The dataset is publicly available (https://github.com/1e12Leon/UEMM-Air) to support the research of multi-modal tasks on UAVs.

61.0IVMay 10
Uncertainty-Guided Dual-Domain Learning for Reliable Skin Lesion Segmentation

Duwei Dai, Caixia Dong, Guowei Dai et al.

Accurate skin lesion segmentation is vital for dermoscopic Computer-Aided Diagnosis. However, visual ambiguity and morphological irregularity often defeat spatial modeling, necessitating multi-domain architectures. Existing paradigms frequently overlook the active use of prediction uncertainty, leading to deterministic frameworks that suffer from blind cross-domain fusion and overfit to label noise. To address these issues, we propose the Uncertainty-Guided Dual-Domain Network (UGDD-Net). UGDD-Net introduces a novel "Glance-and-Gaze" mechanism to transform uncertainty into an active guiding signal. Specifically, the Uncertainty-Guided Bi-directional Feature Fusion (UGBFF) module uses pixel-level uncertainty to modulate spatial-spectral interactions. The Uncertainty-Guided Graph Refinement (UGGR) module constructs a topology-aware graph to propagate reliable semantic consensus and refine uncertain nodes. Finally, the Uncertainty-Guided Margin-Adaptive Loss (UGML) enforces strict constraints on confident pixels while relaxing penalties on uncertain ones to improve statistical calibration. Extensive experiments on ISIC2017, ISIC2018, PH2, and HAM10000 datasets demonstrate that UGDD-Net achieves state-of-the-art performance, especially on "Hard Samples". Our uncertainty maps align with expert inter-observer variability, providing robust interpretability for human-machine collaborative diagnosis.

43.5LGMar 18
Variational Rectification Inference for Learning with Noisy Labels

Haoliang Sun, Qi Wei, Lei Feng et al.

Label noise has been broadly observed in real-world datasets. To mitigate the negative impact of overfitting to label noise for deep models, effective strategies (\textit{e.g.}, re-weighting, or loss rectification) have been broadly applied in prevailing approaches, which have been generally learned under the meta-learning scenario. Despite the robustness of noise achieved by the probabilistic meta-learning models, they usually suffer from model collapse that degenerates generalization performance. In this paper, we propose variational rectification inference (VRI) to formulate the adaptive rectification for loss functions as an amortized variational inference problem and derive the evidence lower bound under the meta-learning framework. Specifically, VRI is constructed as a hierarchical Bayes by treating the rectifying vector as a latent variable, which can rectify the loss of the noisy sample with the extra randomness regularization and is, therefore, more robust to label noise. To achieve the inference of the rectifying vector, we approximate its conditional posterior with an amortization meta-network. By introducing the variational term in VRI, the conditional posterior is estimated accurately and avoids collapsing to a Dirac delta function, which can significantly improve the generalization performance. The elaborated meta-network and prior network adhere to the smoothness assumption, enabling the generation of reliable rectification vectors. Given a set of clean meta-data, VRI can be efficiently meta-learned within the bi-level optimization programming. Besides, theoretical analysis guarantees that the meta-network can be efficiently learned with our algorithm. Comprehensive comparison experiments and analyses validate its effectiveness for robust learning with noisy labels, particularly in the presence of open-set noise.

CVJan 3, 2024
Few-shot Adaptation of Multi-modal Foundation Models: A Survey

Fan Liu, Tianshu Zhang, Wenwen Dai et al.

Multi-modal (vision-language) models, such as CLIP, are replacing traditional supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of visual foundation models. These models with robust and aligned semantic representations learned from billions of internet image-text pairs and can be applied to various downstream tasks in a zero-shot manner. However, in some fine-grained domains like medical imaging and remote sensing, the performance of multi-modal foundation models often leaves much to be desired. Consequently, many researchers have begun to explore few-shot adaptation methods for these models, gradually deriving three main technical approaches: 1) prompt-based methods, 2) adapter-based methods, and 3) external knowledge-based methods. Nevertheless, this rapidly developing field has produced numerous results without a comprehensive survey to systematically organize the research progress. Therefore, in this survey, we introduce and analyze the research advancements in few-shot adaptation methods for multi-modal models, summarizing commonly used datasets and experimental setups, and comparing the results of different methods. In addition, due to the lack of reliable theoretical support for existing methods, we derive the few-shot adaptation generalization error bound for multi-modal models. The theorem reveals that the generalization error of multi-modal foundation models is constrained by three factors: domain gap, model capacity, and sample size. Based on this, we propose three possible solutions from the following aspects: 1) adaptive domain generalization, 2) adaptive model selection, and 3) adaptive knowledge utilization.

CLOct 11, 2024
JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework

Fan Liu, Yue Feng, Zhao Xu et al.

Despite advancements in enhancing LLM safety against jailbreak attacks, evaluating LLM defenses remains a challenge, with current methods often lacking explainability and generalization to complex scenarios, leading to incomplete assessments (e.g., direct judgment without reasoning, low F1 score of GPT-4 in complex cases, bias in multilingual scenarios). To address this, we present JAILJUDGE, a comprehensive benchmark featuring diverse risk scenarios, including synthetic, adversarial, in-the-wild, and multilingual prompts, along with high-quality human-annotated datasets. The JAILJUDGE dataset includes over 35k+ instruction-tune data with reasoning explainability and JAILJUDGETEST, a 4.5k+ labeled set for risk scenarios, and a 6k+ multilingual set across ten languages. To enhance evaluation with explicit reasoning, we propose the JailJudge MultiAgent framework, which enables explainable, fine-grained scoring (1 to 10). This framework supports the construction of instruction-tuning ground truth and facilitates the development of JAILJUDGE Guard, an end-to-end judge model that provides reasoning and eliminates API costs. Additionally, we introduce JailBoost, an attacker-agnostic attack enhancer, and GuardShield, a moderation defense, both leveraging JAILJUDGE Guard. Our experiments demonstrate the state-of-the-art performance of JailJudge methods (JailJudge MultiAgent, JAILJUDGE Guard) across diverse models (e.g., GPT-4, Llama-Guard) and zero-shot scenarios. JailBoost and GuardShield significantly improve jailbreak attack and defense tasks under zero-shot settings, with JailBoost enhancing performance by 29.24% and GuardShield reducing defense ASR from 40.46% to 0.15%.

CVMay 21, 2024
Single Image Unlearning: Efficient Machine Unlearning in Multimodal Large Language Models

Jiaqi Li, Qianshan Wei, Chuanyi Zhang et al.

Machine unlearning empowers individuals with the `right to be forgotten' by removing their private or sensitive information encoded in machine learning models. However, it remains uncertain whether MU can be effectively applied to Multimodal Large Language Models (MLLMs), particularly in scenarios of forgetting the leaked visual data of concepts. To overcome the challenge, we propose an efficient method, Single Image Unlearning (SIU), to unlearn the visual recognition of a concept by fine-tuning a single associated image for few steps. SIU consists of two key aspects: (i) Constructing Multifaceted fine-tuning data. We introduce four targets, based on which we construct fine-tuning data for the concepts to be forgotten; (ii) Jointly training loss. To synchronously forget the visual recognition of concepts and preserve the utility of MLLMs, we fine-tune MLLMs through a novel Dual Masked KL-divergence Loss combined with Cross Entropy loss. Alongside our method, we establish MMUBench, a new benchmark for MU in MLLMs and introduce a collection of metrics for its evaluation. Experimental results on MMUBench show that SIU completely surpasses the performance of existing methods. Furthermore, we surprisingly find that SIU can avoid invasive membership inference attacks and jailbreak attacks. To the best of our knowledge, we are the first to explore MU in MLLMs. We will release the code and benchmark in the near future.

82.2CVMar 18
VisionNVS: Self-Supervised Inpainting for Novel View Synthesis under the Virtual-Shift Paradigm

Hongbo Lu, Liang Yao, Chenghao He et al.

A fundamental bottleneck in Novel View Synthesis (NVS) for autonomous driving is the inherent supervision gap on novel trajectories: models are tasked with synthesizing unseen views during inference, yet lack ground truth images for these shifted poses during training. In this paper, we propose VisionNVS, a camera-only framework that fundamentally reformulates view synthesis from an ill-posed extrapolation problem into a self-supervised inpainting task. By introducing a ``Virtual-Shift'' strategy, we use monocular depth proxies to simulate occlusion patterns and map them onto the original view. This paradigm shift allows the use of raw, recorded images as pixel-perfect supervision, effectively eliminating the domain gap inherent in previous approaches. Furthermore, we address spatial consistency through a Pseudo-3D Seam Synthesis strategy, which integrates visual data from adjacent cameras during training to explicitly model real-world photometric discrepancies and calibration errors. Experiments demonstrate that VisionNVS achieves superior geometric fidelity and visual quality compared to LiDAR-dependent baselines, offering a robust solution for scalable driving simulation.

CVMar 11, 2025
A Survey on Wi-Fi Sensing Generalizability: Taxonomy, Techniques, Datasets, and Future Research Prospects

Fei Wang, Tingting Zhang, Wei Xi et al.

Wi-Fi sensing has emerged as a powerful non-intrusive technology for recognizing human activities, monitoring vital signs, and enabling context-aware applications using commercial wireless devices. However, the performance of Wi-Fi sensing often degrades when applied to new users, devices, or environments due to significant domain shifts. To address this challenge, researchers have proposed a wide range of generalization techniques aimed at enhancing the robustness and adaptability of Wi-Fi sensing systems. In this survey, we provide a comprehensive and structured review of over 200 papers published since 2015, categorizing them according to the Wi-Fi sensing pipeline: experimental setup, signal preprocessing, feature learning, and model deployment. We analyze key techniques, including signal preprocessing, domain adaptation, meta-learning, metric learning, data augmentation, cross-modal alignment, federated learning, and continual learning. Furthermore, we summarize publicly available datasets across various tasks,such as activity recognition, user identification, indoor localization, and pose estimation, and provide insights into their domain diversity. We also discuss emerging trends and future directions, including large-scale pretraining, integration with multimodal foundation models, and continual deployment. To foster community collaboration, we introduce the Sensing Dataset Platform (SDP) for sharing datasets and models. This survey aims to serve as a valuable reference and practical guide for researchers and practitioners dedicated to improving the generalizability of Wi-Fi sensing systems.

SPDec 17, 2025
QoS-Aware Hierarchical Reinforcement Learning for Joint Link Selection and Trajectory Optimization in SAGIN-Supported UAV Mobility Management

Jiayang Wan, Ke He, Yafei Wang et al.

Due to the significant variations in unmanned aerial vehicle (UAV) altitude and horizontal mobility, it becomes difficult for any single network to ensure continuous and reliable threedimensional coverage. Towards that end, the space-air-ground integrated network (SAGIN) has emerged as an essential architecture for enabling ubiquitous UAV connectivity. To address the pronounced disparities in coverage and signal characteristics across heterogeneous networks, this paper formulates UAV mobility management in SAGIN as a constrained multi-objective joint optimization problem. The formulation couples discrete link selection with continuous trajectory optimization. Building on this, we propose a two-level multi-agent hierarchical deep reinforcement learning (HDRL) framework that decomposes the problem into two alternately solvable subproblems. To map complex link selection decisions into a compact discrete action space, we conceive a double deep Q-network (DDQN) algorithm in the top-level, which achieves stable and high-quality policy learning through double Q-value estimation. To handle the continuous trajectory action space while satisfying quality of service (QoS) constraints, we integrate the maximum-entropy mechanism of the soft actor-critic (SAC) and employ a Lagrangian-based constrained SAC (CSAC) algorithm in the lower-level that dynamically adjusts the Lagrange multipliers to balance constraint satisfaction and policy optimization. Moreover, the proposed algorithm can be extended to multi-UAV scenarios under the centralized training and decentralized execution (CTDE) paradigm, which enables more generalizable policies. Simulation results demonstrate that the proposed scheme substantially outperforms existing benchmarks in throughput, link switching frequency and QoS satisfaction.

CLJun 3, 2025
CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG

Yang Tian, Fan Liu, Jingyuan Zhang et al.

Multimodal Retrieval-Augmented Generation (MMRAG) has been introduced to enhance Multimodal Large Language Models by incorporating externally retrieved multimodal knowledge, but it introduces two challenges: Parametric-Retrieved Knowledge Inconsistency (PRKI), where discrepancies between parametric and retrieved knowledge create uncertainty in determining reliability, and Visual-Textual Knowledge Inconsistency (VTKI), where misalignment between visual and textual sources disrupts entity representation. To address these challenges, we propose Cross-source knowledge \textbf{Re}conciliation for Multimodal RAG (CoRe-MMRAG), a novel end-to-end framework that effectively reconciles inconsistencies across knowledge sources. CoRe-MMRAG follows a four-stage pipeline: it first generates an internal response from parametric knowledge, then selects the most relevant multimodal evidence via joint similarity assessment, generates an external response, and finally integrates both to produce a reliable answer. Additionally, a specialized training paradigm enhances knowledge source discrimination, multimodal integration, and unified answer generation. Experiments on KB-VQA benchmarks show that CoRe-MMRAG achieves substantial improvements over baseline methods, achieving 5.6% and 9.3% performance gains on InfoSeek and Encyclopedic-VQA, respectively.

89.5ITApr 21
Uplink Signal Detection For Large-Scale MIMO-ISAC Systems

Jian Wang, Qiqiang Chen, Zheng Wang et al.

Next-generation wireless communication systems are unifying large-scale multiple-input multiple-output (MIMO) and integrated sensing and communication (ISAC) to enhance sensing and communication performance. In this paper, the signal detection problem for MIMO-ISAC systems is modeled as a mixed-integer least squares (MILS) problem. To solve it efficiently, we propose a projection-based neighborhood search-aided alternating direction method of multipliers (P-NS-ADMM) detection scheme. By theoretical analysis, we demonstrate that P-NS-ADMM achieves the same received diversity order as maximum likelihood (ML) detection. For further complexity reduction, an iteration-based NS-ADMM (I-NS-ADMM) is proposed to remove the complex projection operation. Complexity analysis shows its complexity advantage compared with P-NS-ADMM. Moreover, to better estimate the sensing signals for I-NS-ADMM, a flexible mechanism of ADMM iterations is given. Finally, simulations demonstrate the proposed NS-aided ADMM detection schemes have significant performance advantages in terms of both BER and NMSE.