SY · Jul 25, 2023
Towards Integrated Traffic Control with Operating Decentralized Autonomous Organization
Shengyue Yao, Jingru Yu, Yi Yu et al.
With the growing complexity of intelligent traffic systems (ITS), an integrated control method that can account for a large number of heterogeneous intelligent agents is desired. However, existing control methods based on centralized or decentralized schemes have not demonstrated the ability to achieve optimality and scalability simultaneously. To address this issue, we propose an integrated control method based on the framework of the Decentralized Autonomous Organization (DAO). Through a consensus and incentive mechanism, the proposed method achieves a global consensus on energy consumption efficiency (ECE) while optimizing the local objectives of all involved intelligent agents. Furthermore, an operation algorithm is proposed to address the issue of structural rigidity in DAO. Specifically, the proposed operation approach identifies critical agents to execute the smart contract in DAO, which ultimately extends the capability of DAO-based control. In addition, a numerical experiment is designed to examine the performance of the proposed method. The experimental results indicate that, compared to existing decentralized control methods, the controlled agents reach a consensus on the global objective faster and with improved local objectives under the proposed method. In general, the proposed method shows great potential for developing an integrated control system in the ITS.
CV · Jan 14 · Code
CogRail: Benchmarking VLMs in Cognitive Intrusion Perception for Intelligent Railway Transportation Systems
Yonglin Tian, Qiyao Zhang, Wei Xu et al.
Accurate and early perception of potential intrusion targets is essential for ensuring the safety of railway transportation systems. However, most existing systems focus narrowly on object classification within fixed visual scopes and apply rule-based heuristics to determine intrusion status, often overlooking targets that pose latent intrusion risks. Anticipating such risks requires the cognition of spatial context and temporal dynamics for the object of interest (OOI), which presents challenges for conventional visual models. To facilitate deep intrusion perception, we introduce a novel benchmark, CogRail, which integrates curated open-source datasets with cognitively driven question-answer annotations to support spatio-temporal reasoning and prediction. Building upon this benchmark, we conduct a systematic evaluation of state-of-the-art visual-language models (VLMs) using multimodal prompts to identify their strengths and limitations in this domain. Furthermore, we fine-tune VLMs for better performance and propose a joint fine-tuning framework that integrates three core tasks, position perception, movement prediction, and threat analysis, facilitating effective adaptation of general-purpose foundation models into specialized models tailored for cognitive intrusion perception. Extensive experiments reveal that current large-scale multimodal models struggle with the complex spatial-temporal reasoning required by the cognitive intrusion perception task, underscoring the limitations of existing foundation models in this safety-critical domain. In contrast, our proposed joint fine-tuning framework significantly enhances model performance by enabling targeted adaptation to domain-specific reasoning demands, highlighting the advantages of structured multi-task learning in improving both accuracy and interpretability. Code will be available at https://github.com/Hub-Tian/CogRail.
RO · Jan 4, 2025 · Code
UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility
Yonglin Tian, Fei Lin, Yiduo Li et al.
Low-altitude mobility, exemplified by unmanned aerial vehicles (UAVs), has introduced transformative advancements across various domains, like transportation, logistics, and agriculture. Leveraging flexible perspectives and rapid maneuverability, UAVs extend traditional systems' perception and action capabilities, garnering widespread attention from academia and industry. However, current UAV operations primarily depend on human control, with only limited autonomy in simple scenarios, and lack the intelligence and adaptability needed for more complex environments and tasks. The emergence of large language models (LLMs) demonstrates remarkable problem-solving and generalization capabilities, offering a promising pathway for advancing UAV intelligence. This paper explores the integration of LLMs and UAVs, beginning with an overview of UAV systems' fundamental components and functionalities, followed by an overview of the state-of-the-art in LLM technology. Subsequently, it systematically highlights the multimodal data resources available for UAVs, which provide critical support for training and evaluation. Furthermore, it categorizes and analyzes key tasks and application scenarios where UAVs and LLMs converge. Finally, a reference roadmap towards agentic UAVs is proposed, aiming to enable UAVs to achieve agentic intelligence through autonomous perception, memory, reasoning, and tool utilization. Related resources are available at https://github.com/Hub-Tian/UAVs_Meet_LLMs.
LG · Aug 1, 2023
Counterfactual Graph Transformer for Traffic Flow Prediction
Ying Yang, Kai Du, Xingyuan Dai et al.
Traffic flow prediction (TFP) is a fundamental problem of the Intelligent Transportation System (ITS), as it models the latent spatial-temporal dependency of traffic flow for potential congestion prediction. Recent graph-based models with multiple kinds of attention mechanisms have achieved promising performance. However, existing methods for traffic flow prediction tend to inherit the bias pattern from the dataset and lack interpretability. To this end, we propose a Counterfactual Graph Transformer (CGT) model with an instance-level explainer (e.g., finding the important subgraphs) specifically designed for TFP. We design a perturbation mask generator over the input sensor features along the time dimension and over the graph structure in the graph transformer module to obtain spatial and temporal counterfactual explanations. By searching for the optimal perturbation masks on the input data features and graph structures, we can obtain the concise and dominant data or graph edge links for the subsequent TFP task. By re-training the graph transformer model after counterfactual perturbation, we can obtain improved and interpretable traffic flow prediction. Extensive results on three real-world public datasets show that CGT can produce reliable explanations and is promising for traffic flow prediction.
LG · Feb 27, 2024 · Code
RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences
Jie Cheng, Gang Xiong, Xingyuan Dai et al.
Preference-based Reinforcement Learning (PbRL) circumvents the need for reward engineering by harnessing human preferences as the reward signal. However, current PbRL methods excessively depend on high-quality feedback from domain experts, which results in a lack of robustness. In this paper, we present RIME, a robust PbRL algorithm for effective reward learning from noisy preferences. Our method utilizes a sample selection-based discriminator to dynamically filter out noise and ensure robust training. To counteract the cumulative error stemming from incorrect selection, we suggest a warm start for the reward model, which additionally bridges the performance gap during the transition from pre-training to online training in PbRL. Our experiments on robotic manipulation and locomotion tasks demonstrate that RIME significantly enhances the robustness of the state-of-the-art PbRL method. Code is available at https://github.com/CJReinforce/RIME_ICML2024.
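The idea of filtering noisy preference labels by sample selection can be illustrated with a minimal sketch. This is not RIME's actual discriminator: it assumes the Bradley-Terry preference model commonly used in PbRL, scores each labeled pair by its cross-entropy loss under the current reward estimates, and keeps only pairs whose loss falls below a fixed, hypothetical `threshold`. The function names and the example returns are illustrative assumptions.

```python
import math

def preference_prob(return_0, return_1):
    """Bradley-Terry probability that segment 1 is preferred over segment 0,
    given the summed predicted rewards of each segment."""
    return 1.0 / (1.0 + math.exp(return_0 - return_1))

def cross_entropy(prob_1, label):
    """Loss of the predicted preference against a binary label (1 = segment 1 preferred)."""
    eps = 1e-12  # guards against log(0)
    p = prob_1 if label == 1 else 1.0 - prob_1
    return -math.log(max(p, eps))

def select_trusted(samples, threshold):
    """Keep labeled pairs whose loss is at most `threshold` (hypothetical fixed bound)."""
    trusted = []
    for return_0, return_1, label in samples:
        loss = cross_entropy(preference_prob(return_0, return_1), label)
        if loss <= threshold:
            trusted.append((return_0, return_1, label))
    return trusted

# Each sample: (predicted return of segment 0, predicted return of segment 1, label).
samples = [
    (1.0, 3.0, 1),  # label agrees with the reward estimates
    (4.0, 0.5, 0),  # label agrees with the reward estimates
    (0.0, 5.0, 0),  # label strongly contradicts the reward estimates
]
trusted = select_trusted(samples, threshold=1.0)
print(trusted)
```

In this sketch the third pair is discarded because its loss is far above the bound. A fixed threshold is the simplest choice; the abstract describes the filtering as dynamic, so a real implementation would adapt the bound during training.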
CV · Sep 11, 2024
MiniDrive: More Efficient Vision-Language Models with Multi-Level 2D Features as Text Tokens for Autonomous Driving
Enming Zhang, Xingyuan Dai, Min Huang et al.
Vision-language models (VLMs) serve as general-purpose end-to-end models in autonomous driving, performing subtasks such as prediction, planning, and perception through question-and-answer interactions. However, most existing methods rely on computationally expensive visual encoders and large language models (LLMs), making them difficult to deploy in real-world scenarios and real-time applications. Meanwhile, most existing VLMs lack the ability to process multiple images, making it difficult to adapt them to multi-camera perception in autonomous driving. To address these issues, we propose a novel framework called MiniDrive, which incorporates our proposed Feature Engineering Mixture of Experts (FE-MoE) module and Dynamic Instruction Adapter (DI-Adapter). The FE-MoE effectively maps 2D features into visual token embeddings before they are input into the language model. The DI-Adapter enables the visual token embeddings to change dynamically with the instruction text embeddings, resolving the issue of static visual token embeddings for the same image in previous approaches. Compared to previous works, MiniDrive achieves state-of-the-art performance in terms of parameter size, floating point operations, and response efficiency, with the smallest version containing only 83M parameters.
CV · Mar 20, 2024 · Code
SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
Tongtian Yue, Jie Cheng, Longteng Guo et al.
Recent trends in Large Vision Language Models (LVLMs) research have been increasingly focusing on advancing beyond general image understanding towards more nuanced, object-level referential comprehension. In this paper, we present and delve into the self-consistency capability of LVLMs, a crucial aspect that reflects the models' ability to both generate informative captions for specific objects and subsequently utilize these captions to accurately re-identify the objects in a closed-loop process. This capability significantly mirrors the precision and reliability of fine-grained visual-language understanding. Our findings reveal that the self-consistency level of existing LVLMs falls short of expectations, posing limitations on their practical applicability and potential. To address this gap, we introduce a novel fine-tuning paradigm named Self-Consistency Tuning (SC-Tune). It features the synergistic learning of a cyclic describer-locator system. This paradigm is not only data-efficient but also exhibits generalizability across multiple LVLMs. Through extensive experiments, we demonstrate that SC-Tune significantly elevates performance across a spectrum of object-level vision-language benchmarks and maintains competitive or improved performance on image-level vision-language benchmarks. Both our model and code will be publicly available at https://github.com/ivattyue/SC-Tune.
AI · Jun 4, 2025 · Code
SUMO-MCP: Leveraging the Model Context Protocol for Autonomous Traffic Simulation and Optimization
Chenglong Ye, Gang Xiong, Junyou Shang et al.
Traffic simulation tools, such as SUMO, are essential for urban mobility research. However, such tools remain challenging for users due to complex manual workflows involving network download, demand generation, simulation setup, and result analysis. In this paper, we introduce SUMO-MCP, a novel platform that not only wraps SUMO's core utilities into a unified tool suite but also provides additional auxiliary utilities for common preprocessing and postprocessing tasks. Using SUMO-MCP, users can issue simple natural-language prompts to generate traffic scenarios from OpenStreetMap data, create demand from origin-destination matrices or random patterns, run batch simulations with multiple signal-control strategies, perform comparative analyses with automated reporting, and detect congestion for signal-timing optimization. Furthermore, the platform allows flexible custom workflows by dynamically combining exposed SUMO tools without additional coding. Experiments demonstrate that SUMO-MCP makes traffic simulation significantly more accessible and reliable for researchers. We will release code for SUMO-MCP at https://github.com/ycycycl/SUMO-MCP in the future.
RO · Apr 18, 2024
S4TP: Social-Suitable and Safety-Sensitive Trajectory Planning for Autonomous Vehicles
Xiao Wang, Ke Tang, Xingyuan Dai et al.
On public roads, autonomous vehicles (AVs) face the challenge of frequent interactions with human-driven vehicles (HDVs), which exhibit uncertain driving behavior due to varying social characteristics among humans. To effectively assess the risks prevailing in the vicinity of AVs in socially interactive traffic scenarios and achieve safe autonomous driving, this article proposes a social-suitable and safety-sensitive trajectory planning (S4TP) framework. Specifically, S4TP integrates the Social-Aware Trajectory Prediction (SATP) and Social-Aware Driving Risk Field (SADRF) modules. SATP utilizes Transformers to effectively encode the driving scene and incorporates an AV's planned trajectory during the prediction decoding process. SADRF assesses the expected surrounding risk degrees during AV-HDV interactions, each with different social characteristics, visualized as two-dimensional heat maps centered on the AV. SADRF models the driving intentions of the surrounding HDVs and predicts trajectories based on the representation of vehicular interactions. S4TP employs an optimization-based approach for motion planning, utilizing the predicted HDVs' trajectories as input. With the integration of SADRF, S4TP executes real-time online optimization of the AV's planned trajectory within low-risk regions, thus improving the safety and interpretability of the planned trajectory. We have conducted comprehensive tests of the proposed method using the SMARTS simulator. Experimental results in complex social scenarios, such as unprotected left-turn intersections, merging, cruising, and overtaking, validate the superiority of our proposed S4TP in terms of safety and rationality. S4TP achieves a pass rate of 100% across all scenarios, surpassing the current state-of-the-art methods Fanta (98.25%) and Predictive-Decision (94.75%).
CV · Mar 9, 2025
Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving
Enming Zhang, Peizhe Gong, Xingyuan Dai et al.
Ensuring the safety of vision-language models (VLMs) in autonomous driving systems is of paramount importance, yet existing research has largely focused on conventional benchmarks rather than safety-critical evaluation. In this work, we present SCD-Bench (Safety Cognition Driving Benchmark), a novel framework specifically designed to assess the safety cognition capabilities of VLMs within interactive driving scenarios. To address the scalability challenge of data annotation, we introduce ADA (Autonomous Driving Annotation), a semi-automated labeling system, further refined through expert review by professionals with domain-specific knowledge in autonomous driving. To facilitate scalable and consistent evaluation, we also propose an automated assessment pipeline leveraging large language models, which demonstrates over 98% agreement with human expert judgments. In addressing the broader challenge of aligning VLMs with safety cognition in driving environments, we construct SCD-Training, the first large-scale dataset tailored for this task, comprising 324.35K high-quality samples. Through extensive experiments, we show that models trained on SCD-Training exhibit marked improvements not only on SCD-Bench, but also on general and domain-specific benchmarks, offering a new perspective on enhancing safety-aware interactions in vision-language systems for autonomous driving.
CV · Jun 3, 2025
Hierarchical Self-Prompting SAM: A Prompt-Free Medical Image Segmentation Framework
Mengmeng Zhang, Xingyuan Dai, Yicheng Sun et al.
Although the Segment Anything Model (SAM) is highly effective in natural image segmentation, it depends on prompts, which limits its applicability to medical imaging where manual prompts are often unavailable. Existing efforts to fine-tune SAM for medical segmentation typically struggle to remove this dependency. We propose Hierarchical Self-Prompting SAM (HSP-SAM), a novel self-prompting framework that enables SAM to achieve strong performance in prompt-free medical image segmentation. Unlike previous self-prompting methods that remain limited to positional prompts similar to vanilla SAM, we are the first to introduce the learning of abstract prompts during the self-prompting process. This simple and intuitive self-prompting framework achieves superior performance on classic segmentation tasks such as polyp and skin lesion segmentation, while maintaining robustness across diverse medical imaging modalities. Furthermore, it exhibits strong generalization to unseen datasets, achieving improvements of up to 14.04% over previous state-of-the-art methods on some challenging benchmarks. These results suggest that abstract prompts encapsulate richer and higher-dimensional semantic information compared to positional prompts, thereby enhancing the model's robustness and generalization performance. All models and code will be released upon acceptance.
LG · Mar 26, 2025
Offline Reinforcement Learning with Discrete Diffusion Skills
RuiXi Qiao, Jie Cheng, Xingyuan Dai et al.
Skills have been introduced to offline reinforcement learning (RL) as temporal abstractions to tackle complex, long-horizon tasks, promoting consistent behavior and enabling meaningful exploration. While skills in offline RL are predominantly modeled within a continuous latent space, the potential of discrete skill spaces remains largely underexplored. In this paper, we propose a compact discrete skill space for offline RL tasks, supported by a state-of-the-art transformer-based encoder and a diffusion-based decoder. Coupled with a high-level policy trained via offline RL techniques, our method establishes a hierarchical RL framework in which the trained diffusion decoder plays a pivotal role. Empirical evaluations show that the proposed algorithm, Discrete Diffusion Skill (DDS), is a powerful offline RL method. DDS performs competitively on Locomotion and Kitchen tasks and excels on long-horizon tasks, achieving at least a 12 percent improvement on AntMaze-v2 benchmarks compared to existing offline RL approaches. Furthermore, DDS offers improved interpretability, training stability, and online exploration compared to previous skill-based methods.
AI · Aug 6, 2018
An Efficient Deep Reinforcement Learning Model for Urban Traffic Control
Yilun Lin, Xingyuan Dai, Li Li et al.
Urban Traffic Control (UTC) plays an essential role in the Intelligent Transportation System (ITS) but remains difficult. Since model-based UTC methods may not accurately describe the complex nature of traffic dynamics in all situations, model-free data-driven UTC methods, especially reinforcement learning (RL) based UTC methods, have received increasing interest in the last decade. However, existing DL approaches have not provided an efficient algorithm for solving complicated multi-intersection control problems whose state-action spaces are vast. To solve this problem, we propose a Deep Reinforcement Learning (DRL) algorithm that combines several tricks to master an appropriate control strategy within an acceptable time. This new algorithm relaxes the fixed traffic demand pattern assumption and reduces human intervention in parameter tuning. Simulation experiments have shown that our method outperforms traditional rule-based approaches and has the potential to handle more complex traffic problems in the real world.
LG · Jul 11, 2017
DeepTrend: A Deep Hierarchical Neural Network for Traffic Flow Prediction
Xingyuan Dai, Rui Fu, Yilun Lin et al.
In this paper, we consider the temporal pattern in traffic flow time series and implement a deep learning model for traffic flow prediction. Detrending-based methods decompose the original flow series into trend and residual series, in which the trend describes the fixed temporal pattern in traffic flow and the residual series is used for prediction. Inspired by the detrending method, we propose DeepTrend, a deep hierarchical neural network for traffic flow prediction that considers and extracts the time-variant trend. DeepTrend has two stacked layers: an extraction layer and a prediction layer. The extraction layer, a fully connected layer, extracts the time-variant trend in traffic flow from the original flow series concatenated with the corresponding simple average trend series. The prediction layer, an LSTM layer, makes the flow prediction from the trend output by the extraction layer and the calculated residual series. To make the model more effective, DeepTrend is first pre-trained layer by layer and then fine-tuned as an entire network. Experiments show that DeepTrend can noticeably boost prediction performance compared with some traditional prediction models and with LSTM combined with detrending-based methods.
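The classical detrending step that motivates DeepTrend can be sketched in a few lines. The example below is an illustration under stated assumptions, not the paper's implementation: it takes the "simple average trend" to be the mean flow at each position within a fixed period (for example, each time of day), and the residual to be the flow minus that trend. The names `simple_average_trend`, `detrend`, and `period`, as well as the toy data, are hypothetical. The series is assumed to cover at least one full period.

```python
def simple_average_trend(flow, period):
    """Mean flow at each position within the period (e.g. each time-of-day slot)."""
    sums = [0.0] * period
    counts = [0] * period
    for i, value in enumerate(flow):
        sums[i % period] += value
        counts[i % period] += 1
    return [s / c for s, c in zip(sums, counts)]

def detrend(flow, period):
    """Split a flow series into a periodic trend and a residual series."""
    profile = simple_average_trend(flow, period)
    trend = [profile[i % period] for i in range(len(flow))]
    residual = [x - t for x, t in zip(flow, trend)]
    return trend, residual

# Two "days" of three observations each (toy data).
flow = [10.0, 20.0, 30.0, 14.0, 24.0, 34.0]
trend, residual = detrend(flow, period=3)
print(trend)     # [12.0, 22.0, 32.0, 12.0, 22.0, 32.0]
print(residual)  # [-2.0, -2.0, -2.0, 2.0, 2.0, 2.0]
```

Here the trend is fixed across days, which is exactly the limitation the abstract points to: DeepTrend instead learns a time-variant trend, with the simple average trend supplied only as an additional input to its extraction layer.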