h-index25
34papers
732citations
Novelty54%
AI Score58

34 Papers

CVJun 18, 2023
QCNeXt: A Next-Generation Framework For Joint Multi-Agent Trajectory Prediction

Zikang Zhou, Zihao Wen, Jianping Wang et al.

Estimating the joint distribution of on-road agents' future trajectories is essential for autonomous driving. In this technical report, we propose a next-generation framework for joint multi-agent trajectory prediction called QCNeXt. First, we adopt the query-centric encoding paradigm for the task of joint multi-agent trajectory prediction. Powered by this encoding scheme, our scene encoder is equipped with permutation equivariance on the set elements, roto-translation invariance in the space dimension, and translation invariance in the time dimension. These invariance properties not only enable accurate multi-agent forecasting fundamentally but also empower the encoder with the capability of streaming processing. Second, we propose a multi-agent DETR-like decoder, which facilitates joint multi-agent trajectory prediction by modeling agents' interactions at future time steps. For the first time, we show that a joint prediction model can outperform marginal prediction models even on the marginal metrics, which opens up new research opportunities in trajectory prediction. Our approach ranks 1st on the Argoverse 2 multi-agent motion forecasting benchmark, winning the championship of the Argoverse Challenge at the CVPR 2023 Workshop on Autonomous Driving.

LGDec 21, 2022
A Survey of Mix-based Data Augmentation: Taxonomy, Methods, Applications, and Explainability

Chengtai Cao, Fan Zhou, Yurou Dai et al.

Data augmentation (DA) is indispensable in modern machine learning and deep neural networks. The basic idea of DA is to construct new training data to improve the model's generalization by adding slightly disturbed versions of existing data or synthesizing new data. This survey comprehensively reviews a crucial subset of DA techniques, namely Mix-based Data Augmentation (MixDA), which generates novel samples by combining multiple examples. In contrast to traditional DA approaches that operate on single samples or entire datasets, MixDA stands out due to its effectiveness, simplicity, flexibility, computational efficiency, theoretical foundation, and broad applicability. We begin by introducing a novel taxonomy that categorizes MixDA into Mixup-based, Cutmix-based, and mixture approaches based on a hierarchical perspective of the data mixing operation. Subsequently, we provide an in-depth review of various MixDA techniques, focusing on their underlying motivations. Owing to its versatility, MixDA has penetrated a wide range of applications, which we also thoroughly investigate in this survey. Moreover, we delve into the underlying mechanisms of MixDA's effectiveness by examining its impact on model generalization and calibration while providing insights into the model's behavior by analyzing the inherent properties of MixDA. Finally, we recapitulate the critical findings and fundamental challenges of current MixDA studies while outlining the potential directions for future works. Different from previous related surveys that focus on DA approaches in specific domains (e.g., CV and NLP) or only review a limited subset of MixDA studies, we are the first to provide a systematical survey of MixDA, covering its taxonomy, methodology, application, and explainability. Furthermore, we provide promising directions for researchers interested in this exciting area.

72.3CVMay 8Code
One World, Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction

Yulong Chen, Xiaoyun Dong, Haoyu Zhang et al.

Reconstructing dynamic scenes from Vehicle-to-Infrastructure Cooperative Autonomous Driving (VICAD) data is fundamentally complicated by temporal asynchrony: vehicle and infrastructure cameras operate on independent clocks, capturing the same dynamic agent such as cars and pedestrians at different physical times. Existing Gaussian Scene Graph methods implicitly assume synchronized observations and assign a single pose per agent per frame, which is an assumption that breaks in cooperative settings, where the resulting gradient conflicts cause severe ghosting on dynamic agents. We identify this as a representation-level failure, not an optimization artifact: we prove that any single-timeline formulation incurs an irreducible photometric loss scaling quadratically with agent velocity and cross-source time offset. To resolve this, we propose Dust (DecoUpled Spatio-Temporal) Gaussian Scene Graph for 4D Cooperative Driving Reconstruction. DUST Gaussian Scene Graph shares a canonical Gaussian set per agent for appearance consistency, while maintaining decouple pose trajectories aligned to each source's true capture timestamps. We prove that this decoupling enables the pose-gradient kernel block-diagonal, eliminating cross-source interference entirely. To make Dust practical, we further introduce a static anchor-based pose correction pipeline that corrects spatio misalignment between vehicle and infrastructure annotations, and a pose-regularized joint optimization scheme that prevents trajectory jitter and drift during early training. On 26 sequences from V2X-Seq, DUST achieves state-of-the-art performance, improving dynamic-area PSNR by 3.2 dB over the strongest baseline and reducing Fréchet Video Distance by 37.7%, with keeping robustness under larger temporal asynchrony. Code is available at https://anonymous.4open.science/r/DUST-6A55.

CVJul 30, 2023
Uncertainty-Encoded Multi-Modal Fusion for Robust Object Detection in Autonomous Driving

Yang Lou, Qun Song, Qian Xu et al.

Multi-modal fusion has shown initial promising results for object detection of autonomous driving perception. However, many existing fusion schemes do not consider the quality of each fusion input and may suffer from adverse conditions on one or more sensors. While predictive uncertainty has been applied to characterize single-modal object detection performance at run time, incorporating uncertainties into the multi-modal fusion still lacks effective solutions due primarily to the uncertainty's cross-modal incomparability and distinct sensitivities to various adverse conditions. To fill this gap, this paper proposes Uncertainty-Encoded Mixture-of-Experts (UMoE) that explicitly incorporates single-modal uncertainties into LiDAR-camera fusion. UMoE uses individual expert network to process each sensor's detection result together with encoded uncertainty. Then, the expert networks' outputs are analyzed by a gating network to determine the fusion weights. The proposed UMoE module can be integrated into any proposal fusion pipeline. Evaluation shows that UMoE achieves a maximum of 10.67%, 3.17%, and 5.40% performance gain compared with the state-of-the-art proposal-level multi-modal object detectors under extreme weather, adversarial, and blinding attack scenarios.

38.4CVMay 21
Mode-as-Sequence: Translating Multimodal Motion Prediction into Unified Sequential Mode Modeling

Zikang Zhou, Haibo Hu, Xinhong Chen et al.

Multimodal motion forecasting is inherently under-supervised: each training scene provides only one realized future, yet multiple plausible futures exist. This sparse supervision often leads to mode collapse (redundant hypotheses and insufficient mode coverage) and unreliable confidence ranking when predicting a small set of trajectories. We propose Mode-as-Sequence, a unified decoding framework that translates an unordered mode set into an ordered mode sequence and explicitly models mode-to-mode dependency. Under this framework, we develop two complementary instantiations. ModeSeq performs recurrent mode decoding, where each mode is generated conditioned on the previously generated modes, encouraging diverse, non-redundant hypotheses with calibrated confidence ordering. To remove the mode-by-mode autoregressive bottleneck, we further propose Parallel ModeSeq, which preserves the same causal dependency using masked mode-to-mode self-attention while decoding all modes in a single forward pass, enabling efficient large-$K$ inference and scalable joint-scene prediction. To learn representative modes and calibrated confidence under sparse labels, we introduce Early-Match-Take-All (EMTA) and its joint-scene extension MA-EMTA, together with a lightweight ranking regularizer that reduces confidence inversions. Extensive experiments on large-scale benchmarks demonstrate consistent improvements in both ranking-oriented metrics and best-of-K accuracy across datasets, horizons, and object types. In the Waymo Open Dataset challenges, ModeSeq achieves 1st place in the 2024 LiDAR-free motion prediction track, and Parallel ModeSeq achieves 1st place in the 2025 Interaction Prediction Challenge, validating the effectiveness of Mode-as-Sequence for both accuracy and efficiency.

OPTICSOct 27, 2023
Deep Learning Enables Large Depth-of-Field Images for Sub-Diffraction-Limit Scanning Superlens Microscopy

Hui Sun, Hao Luo, Feifei Wang et al.

Scanning electron microscopy (SEM) is indispensable in diverse applications ranging from microelectronics to food processing because it provides large depth-of-field images with a resolution beyond the optical diffraction limit. However, the technology requires coating conductive films on insulator samples and a vacuum environment. We use deep learning to obtain the mapping relationship between optical super-resolution (OSR) images and SEM domain images, which enables the transformation of OSR images into SEM-like large depth-of-field images. Our custom-built scanning superlens microscopy (SSUM) system, which requires neither coating samples by conductive films nor a vacuum environment, is used to acquire the OSR images with features down to ~80 nm. The peak signal-to-noise ratio (PSNR) and structural similarity index measure values indicate that the deep learning method performs excellently in image-to-image translation, with a PSNR improvement of about 0.74 dB over the optical super-resolution images. The proposed method provides a high level of detail in the reconstructed results, indicating that it has broad applicability to chip-level defect detection, biological sample analysis, forensics, and various other fields.

CVFeb 12
Move What Matters: Parameter-Efficient Domain Adaptation via Optimal Transport Flow for Collaborative Perception

Zesheng Jia, Jin Wang, Siao Liu et al.

Fast domain adaptation remains a fundamental challenge for deploying multi-agent systems across diverse environments in Vehicle-to-Everything (V2X) collaborative perception. Despite the success of Parameter-Efficient Fine-Tuning (PEFT) in natural language processing and conventional vision tasks, directly applying PEFT to multi-agent settings leads to significant performance degradation and training instability. In this work, we conduct a detailed analysis and identify two key factors: (i) inter-frame redundancy in heterogeneous sensory streams, and (ii) erosion of fine-grained semantics in deep-layer representations under PEFT adaptation. To address these issues, we propose FlowAdapt, a parameter-efficient framework grounded in optimal transport theory, which minimizes information transport costs across both data distributions and network hierarchies. Specifically, we introduce a Wasserstein Greedy Sampling strategy to selectively filter redundant samples via a bounded covering radius. Furthermore, Progressive Knowledge Transfer module is designed to progressively inject compressed early-stage representations into later stages through learnable pathways, alleviating semantic degradation in late-stage adaptation. Extensive experiments on three benchmarks demonstrate that FlowAdapt achieves state-of-the-art performance with only 1% of trainable parameters, effectively bridging domain gaps with superior sample efficiency and generalization.

CVDec 12, 2025
SATMapTR: Satellite Image Enhanced Online HD Map Construction

Bingyuan Huang, Guanyi Zhao, Qian Xu et al.

High-definition (HD) maps are evolving from pre-annotated to real-time construction to better support autonomous driving in diverse scenarios. However, this process is hindered by low-quality input data caused by onboard sensors limited capability and frequent occlusions, leading to incomplete, noisy, or missing data, and thus reduced mapping accuracy and robustness. Recent efforts have introduced satellite images as auxiliary input, offering a stable, wide-area view to complement the limited ego perspective. However, satellite images in Bird's Eye View are often degraded by shadows and occlusions from vegetation and buildings. Prior methods using basic feature extraction and fusion remain ineffective. To address these challenges, we propose SATMapTR, a novel online map construction model that effectively fuses satellite image through two key components: (1) a gated feature refinement module that adaptively filters satellite image features by integrating high-level semantics with low-level structural cues to extract high signal-to-noise ratio map-relevant representations; and (2) a geometry-aware fusion module that consistently fuse satellite and BEV features at a grid-to-grid level, minimizing interference from irrelevant regions and low-quality inputs. Experimental results on the nuScenes dataset show that SATMapTR achieves the highest mean average precision (mAP) of 73.8, outperforming state-of-the-art satellite-enhanced models by up to 14.2 mAP. It also shows lower mAP degradation under adverse weather and sensor failures, and achieves nearly 3 times higher mAP at extended perception ranges.

CLNov 28, 2023
Recognizing Conditional Causal Relationships about Emotions and Their Corresponding Conditions

Xinhong Chen, Zongxi Li, Yaowei Wang et al.

The study of causal relationships between emotions and causes in texts has recently received much attention. Most works focus on extracting causally related clauses from documents. However, none of these works has considered that the causal relationships among the extracted emotion and cause clauses can only be valid under some specific context clauses. To highlight the context in such special causal relationships, we propose a new task to determine whether or not an input pair of emotion and cause has a valid causal relationship under different contexts and extract the specific context clauses that participate in the causal relationship. Since the task is new for which no existing dataset is available, we conduct manual annotation on a benchmark dataset to obtain the labels for our tasks and the annotations of each context clause's type that can also be used in some other applications. We adopt negative sampling to construct the final dataset to balance the number of documents with and without causal relationships. Based on the constructed dataset, we propose an end-to-end multi-task framework, where we design two novel and general modules to handle the two goals of our task. Specifically, we propose a context masking module to extract the context clauses participating in the causal relationships. We propose a prediction aggregation module to fine-tune the prediction results according to whether the input emotion and causes depend on specific context clauses. Results of extensive comparative experiments and ablation studies demonstrate the effectiveness and generality of our proposed framework.

CVSep 19, 2025Code
Global Regulation and Excitation via Attention Tuning for Stereo Matching

Jiahao Li, Xinhong Chen, Zhengmin Jiang et al.

Stereo matching achieves significant progress with iterative algorithms like RAFT-Stereo and IGEV-Stereo. However, these methods struggle in ill-posed regions with occlusions, textureless, or repetitive patterns, due to a lack of global context and geometric information for effective iterative refinement. To enable the existing iterative approaches to incorporate global context, we propose the Global Regulation and Excitation via Attention Tuning (GREAT) framework which encompasses three attention modules. Specifically, Spatial Attention (SA) captures the global context within the spatial dimension, Matching Attention (MA) extracts global context along epipolar lines, and Volume Attention (VA) works in conjunction with SA and MA to construct a more robust cost-volume excited by global context and geometric details. To verify the universality and effectiveness of this framework, we integrate it into several representative iterative stereo-matching methods and validate it through extensive experiments, collectively denoted as GREAT-Stereo. This framework demonstrates superior performance in challenging ill-posed regions. Applied to IGEV-Stereo, among all published methods, our GREAT-IGEV ranks first on the Scene Flow test set, KITTI 2015, and ETH3D leaderboards, and achieves second on the Middlebury benchmark. Code is available at https://github.com/JarvisLee0423/GREAT-Stereo.

AIJan 8
Orion-RAG: Path-Aligned Hybrid Retrieval for Graphless Data

Zhen Chen, Weihao Xie, Peilin Chen et al.

Retrieval-Augmented Generation (RAG) has proven effective for knowledge synthesis, yet it encounters significant challenges in practical scenarios where data is inherently discrete and fragmented. In most environments, information is distributed across isolated files like reports and logs that lack explicit links. Standard search engines process files independently, ignoring the connections between them. Furthermore, manually building Knowledge Graphs is impractical for such vast data. To bridge this gap, we present Orion-RAG. Our core insight is simple yet effective: we do not need heavy algorithms to organize this data. Instead, we use a low-complexity strategy to extract lightweight paths that naturally link related concepts. We demonstrate that this streamlined approach suffices to transform fragmented documents into semi-structured data, enabling the system to link information across different files effectively. Extensive experiments demonstrate that Orion-RAG consistently outperforms mainstream frameworks across diverse domains, supporting real-time updates and explicit Human-in-the-Loop verification with high cost-efficiency. Experiments on FinanceBench demonstrate superior precision with a 25.2% relative improvement over strong baselines.

CVFeb 12, 2025Code
CoDynTrust: Robust Asynchronous Collaborative Perception via Dynamic Feature Trust Modulus

Yunjiang Xu, Lingzhi Li, Jin Wang et al.

Collaborative perception, fusing information from multiple agents, can extend perception range so as to improve perception performance. However, temporal asynchrony in real-world environments, caused by communication delays, clock misalignment, or sampling configuration differences, can lead to information mismatches. If this is not well handled, then the collaborative performance is patchy, and what's worse safety accidents may occur. To tackle this challenge, we propose CoDynTrust, an uncertainty-encoded asynchronous fusion perception framework that is robust to the information mismatches caused by temporal asynchrony. CoDynTrust generates dynamic feature trust modulus (DFTM) for each region of interest by modeling aleatoric and epistemic uncertainty as well as selectively suppressing or retaining single-vehicle features, thereby mitigating information mismatches. We then design a multi-scale fusion module to handle multi-scale feature maps processed by DFTM. Compared to existing works that also consider asynchronous collaborative perception, CoDynTrust combats various low-quality information in temporally asynchronous scenarios and allows uncertainty to be propagated to downstream tasks such as planning and control. Experimental results demonstrate that CoDynTrust significantly reduces performance degradation caused by temporal asynchrony across multiple datasets, achieving state-of-the-art detection performance even with temporal asynchrony. The code is available at https://github.com/CrazyShout/CoDynTrust.

RODec 7, 2025
FedDSR: Federated Deep Supervision and Regularization Towards Autonomous Driving

Wei-Bin Kou, Guangxu Zhu, Bingyang Cheng et al.

Federated Learning (FL) enables collaborative training of autonomous driving (AD) models across distributed vehicles while preserving data privacy. However, FL encounters critical challenges such as poor generalization and slow convergence due to non-independent and identically distributed (non-IID) data from diverse driving environments. To overcome these obstacles, we introduce Federated Deep Supervision and Regularization (FedDSR), a paradigm that incorporates multi-access intermediate layer supervision and regularization within federated AD system. Specifically, FedDSR comprises following integral strategies: (I) to select multiple intermediate layers based on predefined architecture-agnostic standards. (II) to compute mutual information (MI) and negative entropy (NE) on those selected layers to serve as intermediate loss and regularizer. These terms are integrated into the output-layer loss to form a unified optimization objective, enabling comprehensive optimization across the network hierarchy. (III) to aggregate models from vehicles trained based on aforementioned rules of (I) and (II) to generate the global model on central server. By guiding and penalizing the learning of feature representations at intermediate stages, FedDSR enhances the model generalization and accelerates model convergence for federated AD. We then take the semantic segmentation task as an example to assess FedDSR and apply FedDSR to multiple model architectures and FL algorithms. Extensive experiments demonstrate that FedDSR achieves up to 8.93% improvement in mIoU and 28.57% reduction in training rounds, compared to other FL baselines, making it highly suitable for practical deployment in federated AD ecosystems.

CVJan 21, 2025Code
RALAD: Bridging the Real-to-Sim Domain Gap in Autonomous Driving with Retrieval-Augmented Learning

Jiacheng Zuo, Haibo Hu, Zikang Zhou et al.

In the pursuit of robust autonomous driving systems, models trained on real-world datasets often struggle to adapt to new environments, particularly when confronted with corner cases such as extreme weather conditions. Collecting these corner cases in the real world is non-trivial, which necessitates the use of simulators for validation. However,the high computational cost and the domain gap in data distribution have hindered the seamless transition between real and simulated driving scenarios. To tackle this challenge, we propose Retrieval-Augmented Learning for Autonomous Driving (RALAD), a novel framework designed to bridge the real-to-sim gap at a low cost. RALAD features three primary designs, including (1) domain adaptation via an enhanced Optimal Transport (OT) method that accounts for both individual and grouped image distances, (2) a simple and unified framework that can be applied to various models, and (3) efficient fine-tuning techniques that freeze the computationally expensive layers while maintaining robustness. Experimental results demonstrate that RALAD compensates for the performance degradation in simulated environments while maintaining accuracy in real-world scenarios across three different models. Taking Cross View as an example, the mIOU and mAP metrics in real-world scenarios remain stable before and after RALAD fine-tuning, while in simulated environments,the mIOU and mAP metrics are improved by 10.30% and 12.29%, respectively. Moreover, the re-training cost of our approach is reduced by approximately 88.1%. Our code is available at https://github.com/JiachengZuo/RALAD.git.

CVOct 22, 2025Code
Pragmatic Heterogeneous Collaborative Perception via Generative Communication Mechanism

Junfei Zhou, Penglin Dai, Quanmin Wei et al.

Multi-agent collaboration enhances the perception capabilities of individual agents through information sharing. However, in real-world applications, differences in sensors and models across heterogeneous agents inevitably lead to domain gaps during collaboration. Existing approaches based on adaptation and reconstruction fail to support pragmatic heterogeneous collaboration due to two key limitations: (1) Intrusive retraining of the encoder or core modules disrupts the established semantic consistency among agents; and (2) accommodating new agents incurs high computational costs, limiting scalability. To address these challenges, we present a novel Generative Communication mechanism (GenComm) that facilitates seamless perception across heterogeneous multi-agent systems through feature generation, without altering the original network, and employs lightweight numerical alignment of spatial information to efficiently integrate new agents at minimal cost. Specifically, a tailored Deformable Message Extractor is designed to extract spatial message for each collaborator, which is then transmitted in place of intermediate features. The Spatial-Aware Feature Generator, utilizing a conditional diffusion model, generates features aligned with the ego agent's semantic space while preserving the spatial information of the collaborators. These generated features are further refined by a Channel Enhancer before fusion. Experiments conducted on the OPV2V-H, DAIR-V2X and V2X-Real datasets demonstrate that GenComm outperforms existing state-of-the-art methods, achieving an 81% reduction in both computational cost and parameter count when incorporating new agents. Our code is available at https://github.com/jeffreychou777/GenComm.

ROApr 8, 2019Code
CoUAV: A Cooperative UAV Fleet Control and Monitoring Platform

Weiwei Wu, Ziyao Huang, Feng Shan et al.

In the past decade, unmanned aerial vehicles (UAVs) have been widely used in various civilian applications, most of which only require a single UAV. In the near future, it is expected that more and more applications will be enabled by the cooperation of multiple UAVs. To facilitate such applications, it is desirable to utilize a general control platform for cooperative UAVs. However, existing open-source control platforms cannot fulfill such a demand because (1) they only support the leader-follower mode, which limits the design options for fleet control, (2) existing platforms can support only certain UAVs and thus lack of compatibility, and (3) these platforms cannot accurately simulate a flight mission, which may cause a big gap between simulation and real flight. To address these issues, we propose a general control and monitoring platform for cooperative UAV fleet, namely, CoUAV, which provides a set of core cooperation services of UAVs, including synchronization, connectivity management, path planning, energy simulation, etc. To verify the applicability of CoUAV, we design and develop a prototype and we use the new system to perform an emergency search application that aims to complete a task with the minimum flying time. To achieve this goal, we design and implement a path planning service that takes both the UAV network connectivity and coverage into consideration so as to maximize the efficiency of a fleet. Experimental results by both simulation and field test demonstrate that the proposed system is viable.

LGNov 17, 2024
ModeSeq: Taming Sparse Multimodal Motion Prediction with Sequential Mode Modeling

Zikang Zhou, Hengjian Zhou, Haibo Hu et al.

Anticipating the multimodality of future events lays the foundation for safe autonomous driving. However, multimodal motion prediction for traffic agents has been clouded by the lack of multimodal ground truth. Existing works predominantly adopt the winner-take-all training strategy to tackle this challenge, yet still suffer from limited trajectory diversity and uncalibrated mode confidence. While some approaches address these limitations by generating excessive trajectory candidates, they necessitate a post-processing stage to identify the most representative modes, a process lacking universal principles and compromising trajectory accuracy. We are thus motivated to introduce ModeSeq, a new multimodal prediction paradigm that models modes as sequences. Unlike the common practice of decoding multiple plausible trajectories in one shot, ModeSeq requires motion decoders to infer the next mode step by step, thereby more explicitly capturing the correlation between modes and significantly enhancing the ability to reason about multimodality. Leveraging the inductive bias of sequential mode prediction, we also propose the Early-Match-Take-All (EMTA) training strategy to diversify the trajectories further. Without relying on dense mode prediction or heuristic post-processing, ModeSeq considerably improves the diversity of multimodal output while attaining satisfactory trajectory accuracy, resulting in balanced performance on motion prediction benchmarks. Moreover, ModeSeq naturally emerges with the capability of mode extrapolation, which supports forecasting more behavior modes when the future is highly uncertain.

AIMar 3, 2025
CoT-VLM4Tar: Chain-of-Thought Guided Vision-Language Models for Traffic Anomaly Resolution

Tianchi Ren, Haibo Hu, Jiacheng Zuo et al.

With the acceleration of urbanization, modern urban traffic systems are becoming increasingly complex, leading to frequent traffic anomalies. These anomalies encompass not only common traffic jams but also more challenging issues such as phantom traffic jams, intersection deadlocks, and accident liability analysis, which severely impact traffic flow, vehicular safety, and overall transportation efficiency. Currently, existing solutions primarily rely on manual intervention by traffic police or artificial intelligence-based detection systems. However, these methods often suffer from response delays and inconsistent management due to inadequate resources, while AI detection systems, despite enhancing efficiency to some extent, still struggle to handle complex traffic anomalies in a real-time and precise manner. To address these issues, we propose CoT-VLM4Tar: (Chain of Thought Visual-Language Model for Traffic Anomaly Resolution), this innovative approach introduces a new chain-of-thought to guide the VLM in analyzing, reasoning, and generating solutions for traffic anomalies with greater reasonable and effective solution, and to evaluate the performance and effectiveness of our method, we developed a closed-loop testing framework based on the CARLA simulator. Furthermore, to ensure seamless integration of the solutions generated by the VLM with the CARLA simulator, we implement an itegration module that converts these solutions into executable commands. Our results demonstrate the effectiveness of VLM in the resolution of real-time traffic anomalies, providing a proof-of-concept for its integration into autonomous traffic management systems.

32.8CVApr 10
Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching

Jiahao Li, Xinhong Chen, Zhengmin Jiang et al.

Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, sparse attention designs preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead, including Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attentions. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.

ROMar 29, 2025
VLM-C4L: Continual Core Dataset Learning with Corner Case Optimization via Vision-Language Models for Autonomous Driving

Haibo Hu, Jiacheng Zuo, Yang Lou et al.

With the widespread adoption and deployment of autonomous driving, handling complex environments has become an unavoidable challenge. Due to the scarcity and diversity of extreme scenario datasets, current autonomous driving models struggle to effectively manage corner cases. This limitation poses a significant safety risk, according to the National Highway Traffic Safety Administration (NHTSA), autonomous vehicle systems have been involved in hundreds of reported crashes annually in the United States, occurred in corner cases like sun glare and fog, which caused a few fatal accident. Furthermore, in order to consistently maintain a robust and reliable autonomous driving system, it is essential for models not only to perform well on routine scenarios but also to adapt to newly emerging scenarios, especially those corner cases that deviate from the norm. This requires a learning mechanism that incrementally integrates new knowledge without degrading previously acquired capabilities. However, to the best of our knowledge, no existing continual learning methods have been proposed to ensure consistent and scalable corner case learning in autonomous driving. To address these limitations, we propose VLM-C4L, a continual learning framework that introduces Vision-Language Models (VLMs) to dynamically optimize and enhance corner case datasets, and VLM-C4L combines VLM-guided high-quality data extraction with a core data replay strategy, enabling the model to incrementally learn from diverse corner cases while preserving performance on previously routine scenarios, thus ensuring long-term stability and adaptability in real-world autonomous driving. We evaluate VLM-C4L on large-scale real-world autonomous driving datasets, including Waymo and the corner case dataset CODA.

AIApr 19, 2025
A Knowledge-Informed Deep Learning Paradigm for Generalizable and Stability-Optimized Car-Following Models

Chengming Wang, Dongyao Jia, Wei Wang et al.

Car-following models (CFMs) are fundamental to traffic flow analysis and autonomous driving. Although calibrated physics-based and trained data-driven CFMs can replicate human driving behavior, their reliance on specific datasets limits generalization across diverse scenarios and reduces reliability in real-world deployment. Moreover, these models typically focus on behavioral fidelity and do not support the explicit optimization of local and string stability, which are increasingly important for the safe and efficient operation of autonomous vehicles (AVs). To address these limitations, we propose a Knowledge-Informed Deep Learning (KIDL) paradigm that distills the generalization capabilities of pre-trained Large Language Models (LLMs) into a lightweight and stability-aware neural architecture. LLMs are used to extract fundamental car-following knowledge beyond dataset-specific patterns, and this knowledge is transferred to a reliable, tractable, and computationally efficient model through knowledge distillation. KIDL also incorporates stability constraints directly into its training objective, ensuring that the resulting model not only emulates human-like behavior but also satisfies the local and string stability requirements essential for real-world AV deployment. We evaluate KIDL on the real-world NGSIM and HighD datasets, comparing its performance with representative physics-based, data-driven, and hybrid CFMs. Both empirical and theoretical results consistently demonstrate KIDL's superior behavioral generalization and traffic flow stability, offering a robust and scalable solution for next-generation traffic systems.

MAFeb 18, 2025
Communication Strategy on Macro-and-Micro Traffic State in Cooperative Deep Reinforcement Learning for Regional Traffic Signal Control

Hankang Gu, Shangbo Wang, Dongyao Jia et al.

Adaptive Traffic Signal Control (ATSC) has become a popular research topic in intelligent transportation systems. Regional Traffic Signal Control (RTSC) using the Multi-agent Deep Reinforcement Learning (MADRL) technique has become a promising approach for ATSC due to its ability to achieve the optimum trade-off between scalability and optimality. Most existing RTSC approaches partition a traffic network into several disjoint regions, followed by applying centralized reinforcement learning techniques to each region. However, the pursuit of cooperation among RTSC agents still remains an open issue and no communication strategy for RTSC agents has been investigated. In this paper, we propose communication strategies to capture the correlation of micro-traffic states among lanes and the correlation of macro-traffic states among intersections. We first justify the evolution equation of the RTSC process is Markovian via a system of store-and-forward queues. Next, based on the evolution equation, we propose two GAT-Aggregated (GA2) communication modules--GA2-Naive and GA2-Aug to extract both intra-region and inter-region correlations between macro and micro traffic states. While GA2-Naive only considers the movements at each intersection, GA2-Aug also considers the lane-changing behavior of vehicles. Two proposed communication modules are then aggregated into two existing novel RTSC frameworks--RegionLight and Regional-DRL. Experimental results demonstrate that both GA2-Naive and GA2-Aug effectively improve the performance of existing RTSC frameworks under both real and synthetic scenarios. Hyperparameter testing also reveals the robustness and potential of our communication modules in large-scale traffic networks.

85.2CRMar 28
Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses

Xiao Li, Xiang Zheng, Yifeng Gao et al.

Embodied Artificial Intelligence (Embodied AI) integrates perception, cognition, planning, and interaction into agents that operate in open-world, safety-critical environments. As these systems gain autonomy and enter domains such as transportation, healthcare, and industrial or assistive robotics, ensuring their safety becomes both technically challenging and socially indispensable. Unlike digital AI systems, embodied agents must act under uncertain sensing, incomplete knowledge, and dynamic human-robot interactions, where failures can directly lead to physical harm. This survey provides a comprehensive and structured review of safety research in embodied AI, examining attacks and defenses across the full embodied pipeline, from perception and cognition to planning, action and interaction, and agentic system. We introduce a multi-level taxonomy that unifies fragmented lines of work and connects embodied-specific safety findings with broader advances in vision, language, and multimodal foundation models. Our review synthesizes insights from over 400 papers spanning adversarial, backdoor, jailbreak, and hardware-level attacks; attack detection, safe training and robust inference; and risk-aware human-agent interaction. This analysis reveals several overlooked challenges, including the fragility of multimodal perception fusion, the instability of planning under jailbreak attacks, and the trustworthiness of human-agent interaction in open-ended scenarios. By organizing the field into a coherent framework and identifying critical research gaps, this survey provides a roadmap for building embodied agents that are not only capable and autonomous but also safe, robust, and reliable in real-world deployment.

CRJun 17, 2024
A First Physical-World Trajectory Prediction Attack via LiDAR-induced Deceptions in Autonomous Driving

Yang Lou, Yi Zhu, Qun Song et al.

Trajectory prediction forecasts nearby agents' moves based on their historical trajectories. Accurate trajectory prediction is crucial for autonomous vehicles. Existing attacks compromise the prediction model of a victim AV by directly manipulating the historical trajectory of an attacker AV, which has limited real-world applicability. This paper, for the first time, explores an indirect attack approach that induces prediction errors via attacks against the perception module of a victim AV. Although it has been shown that physically realizable attacks against LiDAR-based perception are possible by placing a few objects at strategic locations, it is still an open challenge to find an object location from the vast search space in order to launch effective attacks against prediction under varying victim AV velocities. Through analysis, we observe that a prediction model is prone to an attack focusing on a single point in the scene. Consequently, we propose a novel two-stage attack framework to realize the single-point attack. The first stage of prediction-side attack efficiently identifies, guided by the distribution of detection results under object-based attacks against perception, the state perturbations for the prediction model that are effective and velocity-insensitive. In the second stage of location matching, we match the feasible object locations with the found state perturbations. Our evaluation using a public autonomous driving dataset shows that our attack causes a collision rate of up to 63% and various hazardous responses of the victim AV. The effectiveness of our attack is also demonstrated on a real testbed car. To the best of our knowledge, this study is the first security analysis spanning from LiDAR-based perception to prediction in autonomous driving, leading to a realistic attack on prediction. To counteract the proposed attack, potential defenses are discussed.

ROMay 31, 2023
TOFG: A Unified and Fine-Grained Environment Representation in Autonomous Driving

Zihao Wen, Yifan Zhang, Xinhong Chen et al.

In autonomous driving, an accurate understanding of environment, e.g., the vehicle-to-vehicle and vehicle-to-lane interactions, plays a critical role in many driving tasks such as trajectory prediction and motion planning. Environment information comes from high-definition (HD) map and historical trajectories of vehicles. Due to the heterogeneity of the map data and trajectory data, many data-driven models for trajectory prediction and motion planning extract vehicle-to-vehicle and vehicle-to-lane interactions in a separate and sequential manner. However, such a manner may capture biased interpretation of interactions, causing lower prediction and planning accuracy. Moreover, separate extraction leads to a complicated model structure and hence the overall efficiency and scalability are sacrificed. To address the above issues, we propose an environment representation, Temporal Occupancy Flow Graph (TOFG). Specifically, the occupancy flow-based representation unifies the map information and vehicle trajectories into a homogeneous data format and enables a consistent prediction. The temporal dependencies among vehicles can help capture the change of occupancy flow timely to further promote model performance. To demonstrate that TOFG is capable of simplifying the model architecture, we incorporate TOFG with a simple graph attention (GAT) based neural network and propose TOFG-GAT, which can be used for both trajectory prediction and motion planning. Experiment results show that TOFG-GAT achieves better or competitive performance than all the SOTA baselines with less training time.

AIDec 10, 2021
A Generative Car-following Model Conditioned On Driving Styles

Yifan Zhang, Xinhong Chen, Jianping Wang et al.

Car-following (CF) modeling, an essential component in simulating human CF behaviors, has attracted increasing research interest in the past decades. This paper pushes the state of the art by proposing a novel generative hybrid CF model, which achieves high accuracy in characterizing dynamic human CF behaviors and is able to generate realistic human CF behaviors for any given observed or even unobserved driving style. Specifically, the ability of accurately capturing human CF behaviors is ensured by designing and calibrating an Intelligent Driver Model (IDM) with time-varying parameters. The reason behind is that such time-varying parameters can express both the inter-driver heterogeneity, i.e., diverse driving styles of different drivers, and the intra-driver heterogeneity, i.e., changing driving styles of the same driver. The ability of generating realistic human CF behaviors of any given observed driving style is achieved by applying a neural process (NP) based model. The ability of inferring CF behaviors of unobserved driving styles is supported by exploring the relationship between the calibrated time-varying IDM parameters and an intermediate variable of NP. To demonstrate the effectiveness of our proposed models, we conduct extensive experiments and comparisons, including CF model parameter calibration, CF behavior prediction, and trajectory simulation for different driving styles.

CVOct 20, 2021
Detecting and Identifying Optical Signal Attacks on Autonomous Driving Systems

Jindi Zhang, Yifan Zhang, Kejie Lu et al.

For autonomous driving, an essential task is to detect surrounding objects accurately. To this end, most existing systems use optical devices, including cameras and light detection and ranging (LiDAR) sensors, to collect environment data in real time. In recent years, many researchers have developed advanced machine learning models to detect surrounding objects. Nevertheless, the aforementioned optical devices are vulnerable to optical signal attacks, which could compromise the accuracy of object detection. To address this critical issue, we propose a framework to detect and identify sensors that are under attack. Specifically, we first develop a new technique to detect attacks on a system that consists of three sensors. Our main idea is to: 1) use data from three sensors to obtain two versions of depth maps (i.e., disparity) and 2) detect attacks by analyzing the distribution of disparity errors. In our study, we use real data sets and the state-of-the-art machine learning model to evaluate our attack detection scheme and the results confirm the effectiveness of our detection method. Based on the detection scheme, we further develop an identification model that is capable of identifying up to n-2 attacked sensors in a system with one LiDAR and n cameras. We prove the correctness of our identification scheme and conduct experiments to show the accuracy of our identification method. Finally, we investigate the overall sensitivity of our framework.

LGAug 7, 2021
Jointly Attacking Graph Neural Network and its Explanations

Wenqi Fan, Wei Jin, Xiaorui Liu et al.

Graph Neural Networks (GNNs) have boosted the performance for many graph-related tasks. Despite the great success, recent studies have shown that GNNs are highly vulnerable to adversarial attacks, where adversaries can mislead the GNNs' prediction by modifying graphs. On the other hand, the explanation of GNNs (GNNExplainer) provides a better understanding of a trained GNN model by generating a small subgraph and features that are most influential for its prediction. In this paper, we first perform empirical studies to validate that GNNExplainer can act as an inspection tool and have the potential to detect the adversarial perturbations for graphs. This finding motivates us to further initiate a new problem investigation: Whether a graph neural network and its explanations can be jointly attacked by modifying graphs with malicious desires? It is challenging to answer this question since the goals of adversarial attacks and bypassing the GNNExplainer essentially contradict each other. In this work, we give a confirmative answer to this question by proposing a novel attack framework (GEAttack), which can attack both a GNN model and its explanations by simultaneously exploiting their vulnerabilities. Extensive experiments on two explainers (GNNExplainer and PGExplainer) under various real-world datasets demonstrate the effectiveness of the proposed method.

CVAug 6, 2021
Evaluating Adversarial Attacks on Driving Safety in Vision-Based Autonomous Vehicles

Jindi Zhang, Yang Lou, Jianping Wang et al.

In recent years, many deep learning models have been adopted in autonomous driving. At the same time, these models introduce new vulnerabilities that may compromise the safety of autonomous vehicles. Specifically, recent studies have demonstrated that adversarial attacks can cause a significant decline in detection precision of deep learning-based 3D object detection models. Although driving safety is the ultimate concern for autonomous driving, there is no comprehensive study on the linkage between the performance of deep learning models and the driving safety of autonomous vehicles under adversarial attacks. In this paper, we investigate the impact of two primary types of adversarial attacks, perturbation attacks and patch attacks, on the driving safety of vision-based autonomous vehicles rather than the detection precision of deep learning models. In particular, we consider two state-of-the-art models in vision-based 3D object detection, Stereo R-CNN and DSGN. To evaluate driving safety, we propose an end-to-end evaluation framework with a set of driving safety performance metrics. By analyzing the results of our extensive evaluation experiments, we find that (1) the attack's impact on the driving safety of autonomous vehicles and the attack's impact on the precision of 3D object detectors are decoupled, and (2) the DSGN model demonstrates stronger robustness to adversarial attacks than the Stereo R-CNN model. In addition, we further investigate the causes behind the two findings with an ablation study. The findings of this paper provide a new perspective to evaluate adversarial attacks and guide the selection of deep learning models in autonomous driving.

ROOct 19, 2020
A Learning-based Discretionary Lane-Change Decision-Making Model with Driving Style Awareness

Yifan Zhang, Qian Xu, Jianping Wang et al.

Discretionary lane change (DLC) is a basic but complex maneuver in driving, which aims at reaching a faster speed or better driving conditions, e.g., further line of sight or better ride quality. Although many DLC decision-making models have been studied in traffic engineering and autonomous driving, the impact of human factors, which is an integral part of current and future traffic flow, is largely ignored in the existing literature. In autonomous driving, the ignorance of human factors of surrounding vehicles will lead to poor interaction between the ego vehicle and the surrounding vehicles, thus, a high risk of accidents. The human factors are also a crucial part to simulate a human-like traffic flow in the traffic engineering area. In this paper, we integrate the human factors that are represented by driving styles to design a new DLC decision-making model. Specifically, our proposed model takes not only the contextual traffic information but also the driving styles of surrounding vehicles into consideration and makes lane-change/keep decisions. Moreover, the model can imitate human drivers' decision-making maneuvers to the greatest extent by learning the driving style of the ego vehicle. Our evaluation results show that the proposed model almost follows the human decision-making maneuvers, which can achieve 98.66% prediction accuracy with respect to human drivers' decisions against the ground truth. Besides, the lane-change impact analysis results demonstrate that our model even performs better than human drivers in terms of improving the safety and speed of traffic.

CVSep 29, 2020
MetaMix: Improved Meta-Learning with Interpolation-based Consistency Regularization

Yangbin Chen, Yun Ma, Tom Ko et al.

Model-Agnostic Meta-Learning (MAML) and its variants are popular few-shot classification methods. They train an initializer across a variety of sampled learning tasks (also known as episodes) such that the initialized model can adapt quickly to new tasks. However, current MAML-based algorithms have limitations in forming generalizable decision boundaries. In this paper, we propose an approach called MetaMix. It generates virtual feature-target pairs within each episode to regularize the backbone models. MetaMix can be integrated with any of the MAML-based algorithms and learn the decision boundaries generalizing better to new tasks. Experiments on the mini-ImageNet, CUB, and FC100 datasets show that MetaMix improves the performance of MAML-based algorithms and achieves state-of-the-art result when integrated with Meta-Transfer Learning.

IRMay 17, 2020
Attacking Black-box Recommendations via Copying Cross-domain User Profiles

Wenqi Fan, Tyler Derr, Xiangyu Zhao et al.

Recently, recommender systems that aim to suggest personalized lists of items for users to interact with online have drawn a lot of attention. In fact, many of these state-of-the-art techniques have been deep learning based. Recent studies have shown that these deep learning models (in particular for recommendation systems) are vulnerable to attacks, such as data poisoning, which generates users to promote a selected set of items. However, more recently, defense strategies have been developed to detect these generated users with fake profiles. Thus, advanced injection attacks of creating more `realistic' user profiles to promote a set of items is still a key challenge in the domain of deep learning based recommender systems. In this work, we present our framework CopyAttack, which is a reinforcement learning based black-box attack method that harnesses real users from a source domain by copying their profiles into the target domain with the goal of promoting a subset of items. CopyAttack is constructed to both efficiently and effectively learn policy gradient networks that first select, and then further refine/craft, user profiles from the source domain to ultimately copy into the target domain. CopyAttack's goal is to maximize the hit ratio of the targeted items in the Top-$k$ recommendation list of the users in the target domain. We have conducted experiments on two real-world datasets and have empirically verified the effectiveness of our proposed framework and furthermore performed a thorough model analysis.

IRJul 16, 2019
Deep Social Collaborative Filtering

Wenqi Fan, Yao Ma, Dawei Yin et al.

Recommender systems are crucial to alleviate the information overload problem in online worlds. Most of the modern recommender systems capture users' preference towards items via their interactions based on collaborative filtering techniques. In addition to the user-item interactions, social networks can also provide useful information to understand users' preference as suggested by the social theories such as homophily and influence. Recently, deep neural networks have been utilized for social recommendations, which facilitate both the user-item interactions and the social network information. However, most of these models cannot take full advantage of the social network information. They only use information from direct neighbors, but distant neighbors can also provide helpful information. Meanwhile, most of these models treat neighbors' information equally without considering the specific recommendations. However, for a specific recommendation case, the information relevant to the specific item would be helpful. Besides, most of these models do not explicitly capture the neighbor's opinions to items for social recommendations, while different opinions could affect the user differently. In this paper, to address the aforementioned challenges, we propose DSCF, a Deep Social Collaborative Filtering framework, which can exploit the social relations with various aspects for recommender systems. Comprehensive experiments on two-real world datasets show the effectiveness of the proposed framework.

IRMay 30, 2019
Deep Adversarial Social Recommendation

Wenqi Fan, Tyler Derr, Yao Ma et al.

Recent years have witnessed rapid developments on social recommendation techniques for improving the performance of recommender systems due to the growing influence of social networks to our daily life. The majority of existing social recommendation methods unify user representation for the user-item interactions (item domain) and user-user connections (social domain). However, it may restrain user representation learning in each respective domain, since users behave and interact differently in the two domains, which makes their representations to be heterogeneous. In addition, most of traditional recommender systems can not efficiently optimize these objectives, since they utilize negative sampling technique which is unable to provide enough informative guidance towards the training during the optimization process. In this paper, to address the aforementioned challenges, we propose a novel deep adversarial social recommendation framework DASO. It adopts a bidirectional mapping method to transfer users' information between social domain and item domain using adversarial learning. Comprehensive experiments on two real-world datasets show the effectiveness of the proposed framework.