CVNov 13, 2022
Visual Semantic Segmentation Based on Few/Zero-Shot Learning: An OverviewWenqi Ren, Yang Tang, Qiyu Sun et al.
Visual semantic segmentation aims at separating a visual sample into diverse blocks with specific semantic attributes and identifying the category for each block, and it plays a crucial role in environmental perception. Conventional learning-based visual semantic segmentation approaches count heavily on large-scale training data with dense annotations and consistently fail to estimate accurate semantic labels for unseen categories. This obstruction spurs a craze for studying visual semantic segmentation with the assistance of few/zero-shot learning. The emergence and rapid progress of few/zero-shot visual semantic segmentation make it possible to learn unseen-category from a few labeled or zero-labeled samples, which advances the extension to practical applications. Therefore, this paper focuses on the recently published few/zero-shot visual semantic segmentation methods varying from 2D to 3D space and explores the commonalities and discrepancies of technical settlements under different segmentation circumstances. Specifically, the preliminaries on few/zero-shot visual semantic segmentation, including the problem definitions, typical datasets, and technical remedies, are briefly reviewed and discussed. Moreover, three typical instantiations are involved to uncover the interactions of few/zero-shot learning with visual semantic segmentation, including image semantic segmentation, video object segmentation, and 3D segmentation. Finally, the future challenges of few/zero-shot visual semantic segmentation are discussed.
CVJul 3, 2023
LXL: LiDAR Excluded Lean 3D Object Detection with 4D Imaging Radar and Camera FusionWeiyi Xiong, Jianan Liu, Tao Huang et al.
As an emerging technology and a relatively affordable device, the 4D imaging radar has already been confirmed effective in performing 3D object detection in autonomous driving. Nevertheless, the sparsity and noisiness of 4D radar point clouds hinder further performance improvement, and in-depth studies about its fusion with other modalities are lacking. On the other hand, as a new image view transformation strategy, "sampling" has been applied in a few image-based detectors and shown to outperform the widely applied "depth-based splatting" proposed in Lift-Splat-Shoot (LSS), even without image depth prediction. However, the potential of "sampling" is not fully unleashed. This paper investigates the "sampling" view transformation strategy on the camera and 4D imaging radar fusion-based 3D object detection. LiDAR Excluded Lean (LXL) model, predicted image depth distribution maps and radar 3D occupancy grids are generated from image perspective view (PV) features and radar bird's eye view (BEV) features, respectively. They are sent to the core of LXL, called "radar occupancy-assisted depth-based sampling", to aid image view transformation. We demonstrated that more accurate view transformation can be performed by introducing image depths and radar information to enhance the "sampling" strategy. Experiments on VoD and TJ4DRadSet datasets show that the proposed method outperforms the state-of-the-art 3D object detection methods by a significant margin without bells and whistles. Ablation studies demonstrate that our method performs the best among different enhancement settings.
CVOct 5, 2023
Vehicle-to-Everything Cooperative Perception for Autonomous DrivingTao Huang, Jianan Liu, Xi Zhou et al.
Achieving fully autonomous driving with enhanced safety and efficiency relies on vehicle-to-everything cooperative perception, which enables vehicles to share perception data, thereby enhancing situational awareness and overcoming the limitations of the sensing ability of individual vehicles. Vehicle-to-everything cooperative perception plays a crucial role in extending the perception range, increasing detection accuracy, and supporting more robust decision-making and control in complex environments. This paper provides a comprehensive survey of recent developments in vehicle-to-everything cooperative perception, introducing mathematical models that characterize the perception process under different collaboration strategies. Key techniques for enabling reliable perception sharing, such as agent selection, data alignment, and feature fusion, are examined in detail. In addition, major challenges are discussed, including differences in agents and models, uncertainty in perception outputs, and the impact of communication constraints such as transmission delay and data loss. The paper concludes by outlining promising research directions, including privacy-preserving artificial intelligence methods, collaborative intelligence, and integrated sensing frameworks to support future advancements in vehicle-to-everything cooperative perception.
SYJun 21, 2022
GNN-PMB: A Simple but Effective Online 3D Multi-Object Tracker without Bells and WhistlesJianan Liu, Liping Bai, Yuxuan Xia et al.
Multi-object tracking (MOT) is among crucial applications in modern advanced driver assistance systems (ADAS) and autonomous driving (AD) systems. The global nearest neighbor (GNN) filter, as the earliest random vector-based Bayesian tracking framework, has been adopted in most of state-of-the-arts trackers in the automotive industry. The development of random finite set (RFS) theory facilitates a mathematically rigorous treatment of the MOT problem, and different variants of RFS-based Bayesian filters have then been proposed. However, their effectiveness in the real ADAS and AD application is still an open problem. In this paper, it is demonstrated that the latest RFS-based Bayesian tracking framework could be superior to typical random vector-based Bayesian tracking framework via a systematic comparative study of both traditional random vector-based Bayesian filters with rule-based heuristic track maintenance and RFS-based Bayesian filters on the nuScenes validation dataset. An RFS-based tracker, namely Poisson multi-Bernoulli filter using the global nearest neighbor (GNN-PMB), is proposed to LiDAR-based MOT tasks. This GNN-PMB tracker is simple to use, and it achieves competitive results on the nuScenes dataset. Specifically, the proposed GNN-PMB tracker outperforms most state-of-the-art LiDAR-only trackers and LiDAR and camera fusion-based trackers, ranking the $3^{rd}$ among all LiDAR-only trackers on nuScenes 3D tracking challenge leader board at the time of submission.
CVJul 20, 2023
SMURF: Spatial Multi-Representation Fusion for 3D Object Detection with 4D Imaging RadarJianan Liu, Qiuchi Zhao, Weiyi Xiong et al.
The 4D Millimeter wave (mmWave) radar is a promising technology for vehicle sensing due to its cost-effectiveness and operability in adverse weather conditions. However, the adoption of this technology has been hindered by sparsity and noise issues in radar point cloud data. This paper introduces spatial multi-representation fusion (SMURF), a novel approach to 3D object detection using a single 4D imaging radar. SMURF leverages multiple representations of radar detection points, including pillarization and density features of a multi-dimensional Gaussian mixture distribution through kernel density estimation (KDE). KDE effectively mitigates measurement inaccuracy caused by limited angular resolution and multi-path propagation of radar signals. Additionally, KDE helps alleviate point cloud sparsity by capturing density features. Experimental evaluations on View-of-Delft (VoD) and TJ4DRadSet datasets demonstrate the effectiveness and generalization ability of SMURF, outperforming recently proposed 4D imaging radar-based single-representation models. Moreover, while using 4D imaging radar only, SMURF still achieves comparable performance to the state-of-the-art 4D imaging radar and camera fusion-based method, with an increase of 1.22% in the mean average precision on bird's-eye view of TJ4DRadSet dataset and 1.32% in the 3D mean average precision on the entire annotated area of VoD dataset. Our proposed method demonstrates impressive inference time and addresses the challenges of real-time detection, with the inference time no more than 0.05 seconds for most scans on both datasets. This research highlights the benefits of 4D mmWave radar and is a strong benchmark for subsequent works regarding 3D object detection with 4D imaging radar.
CVAug 19, 2023
LEGO: Learning and Graph-Optimized Modular Tracker for Online Multi-Object Tracking with Point CloudsZhenrong Zhang, Jianan Liu, Yuxuan Xia et al.
Online multi-object tracking (MOT) plays a pivotal role in autonomous systems. The state-of-the-art approaches usually employ a tracking-by-detection method, and data association plays a critical role. This paper proposes a learning and graph-optimized (LEGO) modular tracker to improve data association performance in the existing literature. The proposed LEGO tracker integrates graph optimization and self-attention mechanisms, which efficiently formulate the association score map, facilitating the accurate and efficient matching of objects across time frames. To further enhance the state update process, the Kalman filter is added to ensure consistent tracking by incorporating temporal coherence in the object states. Our proposed method utilizing LiDAR alone has shown exceptional performance compared to other online tracking approaches, including LiDAR-based and LiDAR-camera fusion-based methods. LEGO ranked 1st at the time of submitting results to KITTI object tracking evaluation ranking board and remains 2nd at the time of submitting this paper, among all online trackers in the KITTI MOT benchmark for cars1
CVNov 11, 2022
RaLiBEV: Radar and LiDAR BEV Fusion Learning for Anchor Box Free Object Detection SystemsYanlong Yang, Jianan Liu, Tao Huang et al.
In autonomous driving, LiDAR and radar are crucial for environmental perception. LiDAR offers precise 3D spatial sensing information but struggles in adverse weather like fog. Conversely, radar signals can penetrate rain or mist due to their specific wavelength but are prone to noise disturbances. Recent state-of-the-art works reveal that the fusion of radar and LiDAR can lead to robust detection in adverse weather. The existing works adopt convolutional neural network architecture to extract features from each sensor data, then align and aggregate the two branch features to predict object detection results. However, these methods have low accuracy of predicted bounding boxes due to a simple design of label assignment and fusion strategies. In this paper, we propose a bird's-eye view fusion learning-based anchor box-free object detection system, which fuses the feature derived from the radar range-azimuth heatmap and the LiDAR point cloud to estimate possible objects. Different label assignment strategies have been designed to facilitate the consistency between the classification of foreground or background anchor points and the corresponding bounding box regressions. Furthermore, the performance of the proposed object detector is further enhanced by employing a novel interactive transformer module. The superior performance of the methods proposed in this paper has been demonstrated using the recently published Oxford Radar RobotCar dataset. Our system's average precision significantly outperforms the state-of-the-art method by 13.1% and 19.0% at Intersection of Union (IoU) of 0.8 under 'Clear+Foggy' training conditions for 'Clear' and 'Foggy' testing, respectively.
AIApr 20
Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled FuturesZixiang Wang, Mengjia Gong, Qiyu Sun et al.
With the rapid advancement of artificial intelligence, multi-agent systems (MASs) are evolving from classical paradigms toward architectures built upon large foundation models (LFMs). This survey provides a systematic review and comparative analysis of classical MASs (CMASs) and LFM-based MASs (LMASs). First, within a closed-loop coordination framework, CMASs are reviewed across four fundamental dimensions: perception, communication, decision-making, and control. Beyond this framework, LMASs integrate LFMs to lift collaboration from low-level state exchanges to semantic-level reasoning, enabling more flexible coordination and improved adaptability across diverse scenarios. Then, a comparative analysis is conducted to contrast CMASs and LMASs across architecture, operating mechanism, adaptability, and application. Finally, future perspectives on MASs are presented, summarizing open challenges and potential research opportunities.
ROMay 17
Tactile-based Multimodal Fusion in Embodied Intelligence: A Survey of Vision, Language, and Contact-Driven ParadigmsZhixiang Cao, Di Tian, Runwei Guan et al.
Tactile sensing is a fundamental modality for embodied intelligence, offering unique and direct feedback on contact geometry, material properties, and interaction dynamics that remote sensors cannot replace. However, unimodal tactile perception is inherently limited by its sparse spatial coverage and lack of global semantic context. With the recent explosion in deep learning and large language models, integrating tactile with vision and language has become essential to bridge physical interaction with semantic reasoning, leading to the emergence of Multimodal Tactile Fusion. Despite rapid progress, the existing researches remain fragmented across disparate datasets, sensing modalities, and tasks, lacking a unified theoretical framework. To address this gap, this paper provides a comprehensive survey of multimodal tactile fusion research up to the first quarter of 2026. We propose a hierarchical taxonomy that organizes the field into two primary dimensions: multimodal datasets and multimodal methods. On the data side, we categorize resources ranging from Tactile-Vision datasets, Tactile-Language datasets, Tactile-Vision-Language datasets, and Tactile-Vision-Other datasets. On the method side, we structure prior work into three core pillars: (1) Multimodal Perception and Recognition, which focuses on object understanding and grasp prediction; (2) Cross-Modal Generation, focusing on bidirectional translation between tactile, vision, and text; and (3) Multimodal Interaction, emphasizing feedback control and language-guided manipulation. Furthermore, we summarize representative tactile sensing hardware, review commonly used evaluation metrics and benchmark settings, and discuss current challenges and promising future directions.
AIJul 14, 2025Code
DeepSeek: Paradigm Shifts and Technical Evolution in Large AI ModelsLuolin Xiong, Haofen Wang, Xi Chen et al.
DeepSeek, a Chinese Artificial Intelligence (AI) startup, has released their V3 and R1 series models, which attracted global attention due to their low cost, high performance, and open-source advantages. This paper begins by reviewing the evolution of large AI models focusing on paradigm shifts, the mainstream Large Language Model (LLM) paradigm, and the DeepSeek paradigm. Subsequently, the paper highlights novel algorithms introduced by DeepSeek, including Multi-head Latent Attention (MLA), Mixture-of-Experts (MoE), Multi-Token Prediction (MTP), and Group Relative Policy Optimization (GRPO). The paper then explores DeepSeek engineering breakthroughs in LLM scaling, training, inference, and system-level optimization architecture. Moreover, the impact of DeepSeek models on the competitive AI landscape is analyzed, comparing them to mainstream LLMs across various fields. Finally, the paper reflects on the insights gained from DeepSeek innovations and discusses future trends in the technical and engineering development of large AI models, particularly in data, training, and reasoning.
CVMay 3
AFFormer: Adaptive Feature Fusion Transformer for V2X Cooperative Perception under Channel ImpairmentsXi Zhou, Tao Huang, Qing-Long Han et al.
Accurate 3D object detection is essential for ensuring the safety of autonomous vehicles. Cooperative perception, which leverages vehicle-to-everything (V2X) communication to share perceptual data, enhances detection but is vulnerable to channel impairments, such as noise, fading, and interference. To strengthen the reliability of intelligent transportation systems, this work improves the robustness of V2X cooperative perception under communication conditions that reflect common channel impairments. This paper proposes an Adaptive Feature Fusion Transformer (AFFormer), a Transformer-based framework that mitigates the adverse effects of corrupted features by modeling temporal, inter-agent, and spatial correlations. AFFormer introduces three key modules: Multi-Agent and Temporal Aggregation for context-aware fusion across agents and over time, Dual Spatial Attention for efficient modeling of spatial dependencies, and Uncertainty-Guided Fusion for entropy-driven refinement of fused features. A teacher-student knowledge distillation strategy further enhances robustness by aligning fused features with reliable early-collaboration supervision. AFFormer is validated on the V2XSet and DAIR-V2X datasets, where it consistently outperforms existing methods under both ideal and impaired communication conditions, demonstrating improved robustness to communication-induced feature degradation while maintaining a competitive efficiency-accuracy trade-off.
CVApr 26, 2024
On the Federated Learning Framework for Cooperative PerceptionZhenrong Zhang, Jianan Liu, Xi Zhou et al.
Cooperative perception is essential to enhance the efficiency and safety of future transportation systems, requiring extensive data sharing among vehicles on the road, which raises significant privacy concerns. Federated learning offers a promising solution by enabling data privacy-preserving collaborative enhancements in perception, decision-making, and planning among connected and autonomous vehicles (CAVs). However, federated learning is impeded by significant challenges arising from data heterogeneity across diverse clients, potentially diminishing model accuracy and prolonging convergence periods. This study introduces a specialized federated learning framework for CP, termed the federated dynamic weighted aggregation (FedDWA) algorithm, facilitated by dynamic adjusting loss (DALoss) function. This framework employs dynamic client weighting to direct model convergence and integrates a novel loss function that utilizes Kullback-Leibler divergence (KLD) to counteract the detrimental effects of non-independently and identically distributed (Non-IID) and unbalanced data. Utilizing the BEV transformer as the primary model, our rigorous testing on the OpenV2V dataset, augmented with FedBEVT data, demonstrates significant improvements in the average intersection over union (IoU). These results highlight the substantial potential of our federated learning framework to address data heterogeneity challenges in CP, thereby enhancing the accuracy of environmental perception models and facilitating more robust and efficient collaborative learning solutions in the transportation sector.
RONov 26, 2025
AerialMind: Towards Referring Multi-Object Tracking in UAV ScenariosChenglizhao Chen, Shaofeng Liang, Runwei Guan et al.
Referring Multi-Object Tracking (RMOT) aims to achieve precise object detection and tracking through natural language instructions, representing a fundamental capability for intelligent robotic systems. However, current RMOT research remains mostly confined to ground-level scenarios, which constrains their ability to capture broad-scale scene contexts and perform comprehensive tracking and path planning. In contrast, Unmanned Aerial Vehicles (UAVs) leverage their expansive aerial perspectives and superior maneuverability to enable wide-area surveillance. Moreover, UAVs have emerged as critical platforms for Embodied Intelligence, which has given rise to an unprecedented demand for intelligent aerial systems capable of natural language interaction. To this end, we introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios, which aims to bridge this research gap. To facilitate its construction, we develop an innovative semi-automated collaborative agent-based labeling assistant (COALA) framework that significantly reduces labor costs while maintaining annotation quality. Furthermore, we propose HawkEyeTrack (HETrack), a novel method that collaboratively enhances vision-language representation learning and improves the perception of UAV scenarios. Comprehensive experiments validated the challenging nature of our dataset and the effectiveness of our method.
CVApr 22, 2025
MS-Occ: Multi-Stage LiDAR-Camera Fusion for 3D Semantic Occupancy PredictionZhiqiang Wei, Lianqing Zheng, Jianan Liu et al.
Accurate 3D semantic occupancy perception is essential for autonomous driving in complex environments with diverse and irregular objects. While vision-centric methods suffer from geometric inaccuracies, LiDAR-based approaches often lack rich semantic information. To address these limitations, MS-Occ, a novel multi-stage LiDAR-camera fusion framework which includes middle-stage fusion and late-stage fusion, is proposed, integrating LiDAR's geometric fidelity with camera-based semantic richness via hierarchical cross-modal fusion. The framework introduces innovations at two critical stages: (1) In the middle-stage feature fusion, the Gaussian-Geo module leverages Gaussian kernel rendering on sparse LiDAR depth maps to enhance 2D image features with dense geometric priors, and the Semantic-Aware module enriches LiDAR voxels with semantic context via deformable cross-attention; (2) In the late-stage voxel fusion, the Adaptive Fusion (AF) module dynamically balances voxel features across modalities, while the High Classification Confidence Voxel Fusion (HCCVF) module resolves semantic inconsistencies using self-attention-based refinement. Experiments on two large-scale benchmarks demonstrate state-of-the-art performance. On nuScenes-OpenOccupancy, MS-Occ achieves an Intersection over Union (IoU) of 32.1% and a mean IoU (mIoU) of 25.3%, surpassing the state-of-the-art by +0.7% IoU and +2.4% mIoU. Furthermore, on the SemanticKITTI benchmark, our method achieves a new state-of-the-art mIoU of 24.08%, robustly validating its generalization capabilities.Ablation studies further confirm the effectiveness of each individual module, highlighting substantial improvements in the perception of small objects and reinforcing the practical value of MS-Occ for safety-critical autonomous driving scenarios.
LGApr 5, 2021
Deep Learning-Based Autonomous Driving Systems: A Survey of Attacks and DefensesYao Deng, Tiehua Zhang, Guannan Lou et al.
The rapid development of artificial intelligence, especially deep learning technology, has advanced autonomous driving systems (ADSs) by providing precise control decisions to counterpart almost any driving event, spanning from anti-fatigue safe driving to intelligent route planning. However, ADSs are still plagued by increasing threats from different attacks, which could be categorized into physical attacks, cyberattacks and learning-based adversarial attacks. Inevitably, the safety and security of deep learning-based autonomous driving are severely challenged by these attacks, from which the countermeasures should be analyzed and studied comprehensively to mitigate all potential risks. This survey provides a thorough analysis of different attacks that may jeopardize ADSs, as well as the corresponding state-of-the-art defense mechanisms. The analysis is unrolled by taking an in-depth overview of each step in the ADS workflow, covering adversarial attacks for various deep learning models and attacks in both physical and cyber context. Furthermore, some promising research directions are suggested in order to improve deep learning-based autonomous driving safety, including model robustness training, model testing and verification, and anomaly detection based on cloud/edge servers.
CRFeb 16, 2021
Machine Learning Based Cyber Attacks Targeting on Controlled Information: A SurveyYuantian Miao, Chao Chen, Lei Pan et al.
Stealing attack against controlled information, along with the increasing number of information leakage incidents, has become an emerging cyber security threat in recent years. Due to the booming development and deployment of advanced analytics solutions, novel stealing attacks utilize machine learning (ML) algorithms to achieve high success rate and cause a lot of damage. Detecting and defending against such attacks is challenging and urgent so that governments, organizations, and individuals should attach great importance to the ML-based stealing attacks. This survey presents the recent advances in this new type of attack and corresponding countermeasures. The ML-based stealing attack is reviewed in perspectives of three categories of targeted controlled information, including controlled user activities, controlled ML model-related information, and controlled authentication information. Recent publications are summarized to generalize an overarching attack methodology and to derive the limitations and future directions of ML-based stealing attacks. Furthermore, countermeasures are proposed towards developing effective protections from three aspects -- detection, disruption, and isolation.
CVFeb 6, 2019
Daedalus: Breaking Non-Maximum Suppression in Object Detection via Adversarial ExamplesDerui Wang, Chaoran Li, Sheng Wen et al.
This paper demonstrates that Non-Maximum Suppression (NMS), which is commonly used in Object Detection (OD) tasks to filter redundant detection results, is no longer secure. Considering that NMS has been an integral part of OD systems, thwarting the functionality of NMS can result in unexpected or even lethal consequences for such systems. In this paper, an adversarial example attack which triggers malfunctioning of NMS in end-to-end OD models is proposed. The attack, namely \texttt{Daedalus}, compresses the dimensions of detection boxes to evade NMS. As a result, the final detection output contains extremely dense false positives. This can be fatal for many OD applications such as autonomous vehicles and surveillance systems. The attack can be generalised to different end-to-end OD models, such that the attack cripples various OD applications. Furthermore, a way to craft robust adversarial examples is developed by using an ensemble of popular detection models as the substitutes. Considering the pervasive nature of model reusing in real-world OD scenarios, Daedalus examples crafted based on an \textit{ensemble of substitutes} can launch attacks without knowing the parameters of the victim models. Experimental results demonstrate that the attack effectively stops NMS from filtering redundant bounding boxes. As the evaluation results suggest, Daedalus increases the false positive rate in detection results to $99.9\%$ and reduces the mean average precision scores to $0$, while maintaining a low cost of distortion on the original inputs. It is also demonstrated that the attack can be practically launched against real-world OD systems via printed posters.