CVApr 2, 2024Code
WcDT: World-centric Diffusion Transformer for Traffic Scene GenerationChen Yang, Yangfan He, Aaron Xuxiang Tian et al.
In this paper, we introduce a novel approach for autonomous driving trajectory generation by harnessing the complementary strengths of diffusion probabilistic models (a.k.a., diffusion models) and transformers. Our proposed framework, termed the "World-Centric Diffusion Transformer"(WcDT), optimizes the entire trajectory generation process, from feature extraction to model inference. To enhance the scene diversity and stochasticity, the historical trajectory data is first preprocessed into "Agent Move Statement" and encoded into latent space using Denoising Diffusion Probabilistic Models (DDPM) enhanced with Diffusion with Transformer (DiT) blocks. Then, the latent features, historical trajectories, HD map features, and historical traffic signal information are fused with various transformer-based encoders that are used to enhance the interaction of agents with other elements in the traffic scene. The encoded traffic scenes are then decoded by a trajectory decoder to generate multimodal future trajectories. Comprehensive experimental results show that the proposed approach exhibits superior performance in generating both realistic and diverse trajectories, showing its potential for integration into automatic driving simulation systems. Our code is available at \url{https://github.com/yangchen1997/WcDT}.
CLFeb 3, 2025
PARA: Parameter-Efficient Fine-tuning with Prompt Aware Representation AdjustmentZequan Liu, Yi Zhao, Ming Tan et al.
In the realm of parameter-efficient fine-tuning (PEFT) methods, while options like LoRA are available, there is a persistent demand in the industry for a PEFT approach that excels in both efficiency and performance within the context of single-backbone multi-tenant applications. This paper introduces a new and straightforward PEFT technique, termed \underline{P}rompt \underline{A}ware \underline{R}epresentation \underline{A}djustment (PARA). The core of our proposal is to integrate a lightweight vector generator within each Transformer layer. This generator produces vectors that are responsive to input prompts, thereby adjusting the hidden representations accordingly. Our extensive experimentation across diverse tasks has yielded promising results. Firstly, the PARA method has been shown to surpass current PEFT benchmarks in terms of performance, despite having a similar number of adjustable parameters. Secondly, it has proven to be more efficient than LoRA in the single-backbone multi-tenant scenario, highlighting its significant potential for industrial adoption.
AISep 28, 2025
Beyond the Strongest LLM: Multi-Turn Multi-Agent Orchestration vs. Single LLMs on BenchmarksAaron Xuxiang Tian, Ruofan Zhang, Jiayao Tang et al.
We study multi-turn multi-agent orchestration, where multiple large language model (LLM) agents interact over multiple turns by iteratively proposing answers or casting votes until reaching consensus. Using four LLMs (Gemini 2.5 Pro, GPT-5, Grok 4, and Claude Sonnet 4) on GPQA-Diamond, IFEval, and MuSR, we conduct two experiments: (i) benchmarking orchestration against single-LLM baselines; and (ii) ablations on GPQA-Diamond that vary whether agents see who authored answers and whether they can observe ongoing votes. Orchestration matches or exceeds the strongest single model and consistently outperforms the others. Analysis of best-achievable orchestration performance shows potential for further gains. The ablations show that revealing authorship increases self-voting and ties, and that showing ongoing votes amplifies herding, which speeds convergence but can sometimes yield premature consensus.
CRJul 31, 2025
Measuring Harmfulness of Computer-Using AgentsAaron Xuxiang Tian, Ruofan Zhang, Janet Tang et al.
Computer-using agents (CUAs), which can autonomously control computers to perform multi-step actions, might pose significant safety risks if misused. However, existing benchmarks mainly evaluate LMs in chatbots or simple tool use. To more comprehensively evaluate CUAs' misuse risks, we introduce a new benchmark: CUAHarm. CUAHarm consists of 104 expert-written realistic misuse risks, such as disabling firewalls, leaking data, or installing backdoors. We provide a sandbox with rule-based verifiable rewards to measure CUAs' success rates in executing these tasks (e.g., whether the firewall is indeed disabled), beyond refusal rates. We evaluate frontier LMs including GPT-5, Claude 4 Sonnet, Gemini 2.5 Pro, Llama-3.3-70B, and Mistral Large 2. Even without jailbreaking prompts, these frontier LMs comply with executing these malicious tasks at a high success rate (e.g., 90\% for Gemini 2.5 Pro). Furthermore, while newer models are safer in previous safety benchmarks, their misuse risks as CUAs become even higher, e.g., Gemini 2.5 Pro is riskier than Gemini 1.5 Pro. Additionally, while these LMs are robust to common malicious prompts (e.g., creating a bomb) when acting as chatbots, they could still act unsafely as CUAs. We further evaluate a leading agentic framework (UI-TARS-1.5) and find that while it improves performance, it also amplifies misuse risks. To mitigate the misuse risks of CUAs, we explore using LMs to monitor CUAs' actions. We find monitoring unsafe computer-using actions is significantly harder than monitoring conventional unsafe chatbot responses. While monitoring chain-of-thoughts leads to modest gains, the average monitoring accuracy is only 77\%. A hierarchical summarization strategy improves performance by up to 13\%, a promising direction though monitoring remains unreliable. The benchmark will be released publicly to facilitate further research on mitigating these risks.