Xi Xiong

CL
h-index: 117
17 papers
9,191 citations
Novelty: 50%
AI Score: 54

17 Papers

CV · Jul 10, 2024
PaliGemma: A versatile 3B VLM for transfer

Lucas Beyer, Andreas Steiner, André Susano Pinto et al. · deepmind, oxford

PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.

CV · Oct 13, 2023
PaLI-3 Vision Language Models: Smaller, Faster, Stronger

Xi Chen, Xiao Wang, Lucas Beyer et al. · deepmind

This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters, and achieve a new state-of-the-art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.

AI · Jul 8, 2024 · Code
iLLM-TSC: Integration reinforcement learning and large language model for traffic signal control policy improvement

Aoyu Pang, Maonan Wang, Man-On Pun et al.

Urban congestion remains a critical challenge, with traffic signal control (TSC) emerging as a potent solution. TSC is often modeled as a Markov Decision Process problem and then solved using reinforcement learning (RL), which has proven effective. However, existing RL-based TSC systems often overlook imperfect observations caused by degraded communication, such as packet loss, delays, and noise, as well as rare real-life events not included in the reward function, such as unconsidered emergency vehicles. To address these limitations, we introduce a novel integration framework that combines a large language model (LLM) with RL. This framework is designed to manage overlooked elements in the reward function and gaps in state information, thereby enhancing the policies of RL agents. In our approach, RL initially makes decisions based on observed data. Subsequently, LLMs evaluate these decisions to verify their reasonableness. If a decision is found to be unreasonable, it is adjusted accordingly. Additionally, this approach can be seamlessly integrated with existing RL-based TSC systems without necessitating modifications. Extensive testing confirms that our approach reduces the average waiting time by 17.5% in degraded communication conditions as compared to traditional RL methods, underscoring its potential to advance practical RL applications in intelligent transportation systems. The related code can be found at https://github.com/Traffic-Alpha/iLLM-TSC.
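The decide-then-review loop described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `refine_action`, `rl_policy`, and `reviewer` are hypothetical names, and the reviewer is a plain callable standing in for an LLM call, so no real LLM API is assumed.

```python
from typing import Any, Callable, Optional, Tuple

# Hypothetical types: an observation can be anything; an action is a phase index.
Policy = Callable[[Any], int]
Reviewer = Callable[[Any, int], Tuple[bool, Optional[int]]]


def refine_action(obs: Any, rl_policy: Policy, reviewer: Reviewer) -> int:
    """Let the RL policy decide first, then let a reviewer veto the decision.

    The reviewer returns (is_reasonable, replacement). The RL action is kept
    unless the reviewer rejects it and supplies a replacement.
    """
    action = rl_policy(obs)
    is_reasonable, replacement = reviewer(obs, action)
    if is_reasonable or replacement is None:
        return action
    return replacement


# Stub policy and reviewer, for illustration only.
def stub_policy(obs: dict) -> int:
    # Serve the approach with the longest observed queue.
    queues = obs["queues"]
    return max(range(len(queues)), key=lambda i: queues[i])


def stub_reviewer(obs: dict, action: int) -> Tuple[bool, Optional[int]]:
    # Reject any action that does not serve a waiting emergency vehicle.
    emergency = obs.get("emergency_phase")
    if emergency is not None and emergency != action:
        return False, emergency
    return True, None
```

Because the review step wraps the policy rather than changing it, the RL agent itself stays untouched, which matches the abstract's claim that existing RL-based systems need no modification.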

71.4 · CV · May 31
3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

Yipeng Gao, Lei Shu, Genzhi Ye et al.

Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents for procedural 3D generation in 3D modeling software. Specifically, 3DCodeBench evaluates how effectively 12 advanced VLMs can serve as procedural 3D modelers by translating text and image references into procedural code for 3D modeling software. Recognizing that automated metrics may not fully capture the perceptual quality of 3D shapes, we build 3DCodeArena, a ranking platform based on pairwise human preferences over generated 3D outputs. From extensive evaluations and results, we observe that: (1) Failures mostly arise from API mismatches, while successful renders still suffer from disconnected or floating 3D geometric components. (2) Test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. Our findings highlight a critical need for high-quality procedural coding data to advance commercial VLMs. Furthermore, effective procedural 3D modeling requires a robust execution environment that provides high-fidelity feedback for iterative refinement. We release 3DCodeBench, including the curated large-scale dataset of multimodal (text/image) prompts, procedural code, 3D object triplets, evaluation protocol, and the public 3DCodeArena platform as a foundational toolkit for exploring VLM-based procedural 3D modelers.

SY · Oct 24, 2022
ADLight: A Universal Approach of Traffic Signal Control with Augmented Data Using Reinforcement Learning

Maonan Wang, Yutong Xu, Xi Xiong et al.

Traffic signal control has the potential to reduce congestion in dynamic networks. Recent studies show that traffic signal control with reinforcement learning (RL) methods can significantly reduce the average waiting time. However, a shortcoming of existing methods is that they require model retraining for new intersections with different structures. In this paper, we propose a novel reinforcement learning approach with augmented data (ADLight) to train a universal model for intersections with different structures. We propose a new agent design incorporating features on movements and actions with set current phase duration to allow the generalized model to have the same structure for different intersections. A new data augmentation method named movement shuffle is developed to improve the generalization performance. We also test the universal model with new intersections in Simulation of Urban MObility (SUMO). The results show that the performance of our approach is close to the models trained in a single environment directly (only a 5% loss of average waiting time), and we can reduce more than 80% of training time, which saves a lot of computational resources in scalable operations of traffic lights.
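The abstract does not spell out how movement shuffle works. One plausible reading, sketched here purely as an assumption, is that the observation holds one feature row per traffic movement and the augmentation permutes those rows so the model cannot rely on a fixed movement order. The function name and data layout are hypothetical.

```python
import random
from typing import List, Sequence


def movement_shuffle(obs: Sequence[Sequence[float]], rng: random.Random) -> List[List[float]]:
    """Return a copy of `obs` with its movement rows in random order.

    Assumption: obs[i] is the feature vector of movement i (for example
    occupancy, flow, and whether the movement is currently green).
    """
    order = list(range(len(obs)))
    rng.shuffle(order)  # random permutation of movement indices
    return [list(obs[i]) for i in order]


# Example with three movements and three features each (made-up values).
observation = [
    [0.2, 0.5, 1.0],
    [0.7, 0.1, 0.0],
    [0.4, 0.9, 0.0],
]
augmented = movement_shuffle(observation, random.Random(0))
```

Each row keeps its own features intact; only the row order changes, so the augmented sample describes the same intersection state.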

SY · Sep 27, 2019
Analysis of a Stochastic Model for Coordinated Platooning of Heavy-duty Vehicles

Xi Xiong, Erdong Xiao, Li Jin

Platooning of heavy-duty vehicles (HDVs) is a key component of smart and connected highways and is expected to bring remarkable fuel savings and emission reduction. In this paper, we study the coordination of HDV platooning on a highway section. We model the arrival of HDVs as a Poisson process. Multiple HDVs are merged into one platoon if their headways are below a given threshold. The merging is done by accelerating the following vehicles to catch up with the leading ones. We characterize the following random variables: (i) platoon size, (ii) headway between platoons, and (iii) travel time increment due to platoon formation. We formulate and solve an optimization problem to determine the headway threshold for platooning that leads to minimal cost (time plus fuel). We also compare our results with those from Simulation of Urban MObility (SUMO).
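The platoon-size variable can be illustrated with a small simulation. The sketch below assumes a simple chaining rule that the abstract does not confirm in detail: a vehicle joins the platoon ahead whenever its headway to the immediately preceding vehicle is below the threshold. Under that rule and Poisson arrivals with rate λ, headways are exponential, the platoon size is geometric, and its mean is exp(λτ) for threshold τ. All function names are illustrative.

```python
import math
import random
from typing import List, Sequence


def platoon_sizes(headways: Sequence[float], threshold: float) -> List[int]:
    """Group vehicles into platoons from the headways between successive vehicles."""
    sizes = []
    current = 1  # the first vehicle always starts a platoon
    for headway in headways:
        if headway < threshold:
            current += 1  # close enough: join the platoon ahead
        else:
            sizes.append(current)
            current = 1  # gap too large: start a new platoon
    sizes.append(current)
    return sizes


def simulate_mean_platoon_size(rate: float, threshold: float,
                               n_vehicles: int, seed: int = 0) -> float:
    """Estimate the mean platoon size for Poisson arrivals with the given rate."""
    rng = random.Random(seed)
    # Poisson arrivals imply independent exponential headways.
    headways = [rng.expovariate(rate) for _ in range(n_vehicles - 1)]
    sizes = platoon_sizes(headways, threshold)
    return sum(sizes) / len(sizes)


def expected_mean_platoon_size(rate: float, threshold: float) -> float:
    """Closed-form mean under the chaining rule assumed above."""
    return math.exp(rate * threshold)
```

Comparing `simulate_mean_platoon_size(0.5, 1.0, 200000)` with `expected_mean_platoon_size(0.5, 1.0)` gives a quick sanity check of the geometric-size argument before attempting the cost optimization over the threshold.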

SY · Dec 8, 2023 · Code
UniTSA: A Universal Reinforcement Learning Framework for V2X Traffic Signal Control

Maonan Wang, Xi Xiong, Yuheng Kan et al.

Traffic congestion is a persistent problem in urban areas, which calls for the development of effective traffic signal control (TSC) systems. While existing Reinforcement Learning (RL)-based methods have shown promising performance in optimizing TSC, it is challenging to generalize these methods across intersections of different structures. In this work, a universal RL-based TSC framework is proposed for Vehicle-to-Everything (V2X) environments. The proposed framework introduces a novel agent design that incorporates a junction matrix to characterize intersection states, making the proposed model applicable to diverse intersections. To equip the proposed RL-based framework with enhanced capability of handling various intersection structures, novel traffic state augmentation methods are tailor-made for signal light control systems. Finally, extensive experimental results derived from multiple intersection configurations confirm the effectiveness of the proposed framework. The source code in this work is available at https://github.com/wmn7/Universal_Light

CL · Mar 8, 2024
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Gemini Team, Petko Georgiev, Ving Ian Lei et al. · deepmind, mila

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

CL · Jul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

LG · May 3, 2024
Dyna-Style Learning with A Macroscopic Model for Vehicle Platooning in Mixed-Autonomy Traffic

Yichuan Zou, Li Jin, Xi Xiong

Platooning of connected and autonomous vehicles (CAVs) plays a vital role in modernizing highways, ushering in enhanced efficiency and safety. This paper explores the significance of platooning in smart highways, employing a coupled partial differential equation (PDE) and ordinary differential equation (ODE) model to elucidate the complex interaction between bulk traffic flow and CAV platoons. Our study focuses on developing a Dyna-style planning and learning framework tailored for platoon control, with a specific goal of reducing fuel consumption. By harnessing the coupled PDE-ODE model, we improve data efficiency in Dyna-style learning through virtual experiences. Simulation results validate the effectiveness of our macroscopic model in modeling platoons within mixed-autonomy settings, demonstrating a notable 10.11% reduction in vehicular fuel consumption compared to conventional approaches.

LG · May 9, 2025
A Large Language Model-Enhanced Q-learning for Capacitated Vehicle Routing Problem with Time Windows

Linjiang Cao, Maonan Wang, Xi Xiong

The Capacitated Vehicle Routing Problem with Time Windows (CVRPTW) is a classic NP-hard combinatorial optimization problem widely applied in logistics distribution and transportation management. Its complexity stems from the constraints of vehicle capacity and time windows, which pose significant challenges to traditional approaches. Advances in Large Language Models (LLMs) provide new possibilities for finding approximate solutions to CVRPTW. This paper proposes a novel LLM-enhanced Q-learning framework to address the CVRPTW with real-time emergency constraints. Our solution introduces an adaptive two-phase training mechanism that transitions from the LLM-guided exploration phase to the autonomous optimization phase of the Q-network. To ensure reliability, we design a three-tier self-correction mechanism based on the Chain-of-Thought (CoT) for LLMs: syntactic validation, semantic verification, and physical constraint enforcement. In addition, we prioritize replay of the experience generated by LLMs to amplify the regulatory role of LLMs in the architecture. Experimental results demonstrate that our framework achieves a 7.3% average reduction in cost compared to traditional Q-learning, with fewer training steps required for convergence.

RO · Mar 7
Kinematics-Aware Latent World Models for Data-Efficient Autonomous Driving

Jiazhuo Li, Linjiang Cao, Qi Liu et al.

Data-efficient learning remains a central challenge in autonomous driving due to the high cost and safety risks of large-scale real-world interaction. Although world-model-based reinforcement learning enables policy optimization through latent imagination, existing approaches often lack explicit mechanisms to encode spatial and kinematic structure essential for driving tasks. In this work, we build upon the Recurrent State-Space Model (RSSM) and propose a kinematics-aware latent world model framework for autonomous driving. Vehicle kinematic information is incorporated into the observation encoder to ground latent transitions in physically meaningful motion dynamics, while geometry-aware supervision regularizes the RSSM latent state to capture task-relevant spatial structure beyond pixel reconstruction. The resulting structured latent dynamics improve long-horizon imagination fidelity and stabilize policy optimization. Experiments in a driving simulation benchmark demonstrate consistent gains over both model-free and pixel-based world-model baselines in terms of sample efficiency and driving performance. Ablation studies further verify that the proposed design enhances spatial representation quality within the latent space. These results suggest that integrating kinematic grounding into RSSM-based world models provides a scalable and physically grounded paradigm for autonomous driving policy learning.

CL · Dec 19, 2023
Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud et al.

This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.

IR · Mar 24, 2021
From Semantic Retrieval to Pairwise Ranking: Applying Deep Learning in E-commerce Search

Rui Li, Yunjiang Jiang, Wenyun Yang et al.

We introduce deep learning models to the two most important stages in product search at JD.com, one of the largest e-commerce platforms in the world. Specifically, we outline the design of a deep learning system that retrieves semantically relevant items to a query within milliseconds, and a pairwise deep re-ranking system, which learns subtle user preferences. Compared to traditional search systems, the proposed approaches are better at semantic retrieval and personalized ranking, achieving significant improvements.

LG · May 1, 2019
Dynamic Origin-Destination Matrix Prediction with Line Graph Neural Networks and Kalman Filter

Xi Xiong, Kaan Ozbay, Li Jin et al.

Modern intelligent transportation systems provide data that allow real-time dynamic demand prediction, which is essential for planning and operations. The main challenge in predicting dynamic Origin-Destination (O-D) demand matrices is that demands cannot be directly measured by traffic sensors; instead, they have to be inferred from aggregate traffic flow data on traffic links. Specifically, spatial correlation, congestion and time-dependent factors need to be considered in general transportation networks. In this paper we propose a novel O-D prediction framework combining heterogeneous prediction in graph neural networks and a Kalman filter to recognize spatial and temporal patterns simultaneously. The underlying road network topology is converted into a corresponding line graph in the newly designed Fusion Line Graph Convolutional Networks (FL-GCNs), which provide a general framework for predicting spatial-temporal O-D flows from link information. Data from the New Jersey Turnpike network are used to evaluate the proposed model. The results show that our proposed approach yields the best performance under various prediction scenarios. In addition, the advantage of combining deep neural networks and a Kalman filter is demonstrated.
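The conversion of a road network into its line graph can be sketched as follows. This shows only the standard directed line-graph construction, in which each road link becomes a node and link (u, v) points to every link (v, w) that starts where it ends. Whether FL-GCN adds further structure is not stated in the abstract. The function name and the toy network are illustrative.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Link = Tuple[Hashable, Hashable]  # (tail node, head node)


def line_graph(links: Sequence[Link]) -> Dict[Link, List[Link]]:
    """Map each directed link to the links that can be entered right after it."""
    by_tail = defaultdict(list)
    for link in links:
        by_tail[link[0]].append(link)  # index links by their starting node
    # Successors of (u, v) are all links whose tail is v.
    return {link: list(by_tail[link[1]]) for link in links}


# Toy network: A -> B, B -> C, B -> D, C -> A.
toy_links = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "A")]
toy_line_graph = line_graph(toy_links)
```

Working on the line graph places link-level measurements, such as sensor flows, on nodes, which is the form graph convolution layers usually expect.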

RO · Dec 1, 2016
Combining Deep Reinforcement Learning and Safety Based Control for Autonomous Driving

Xi Xiong, Jianqiang Wang, Fang Zhang et al.

With the development of state-of-the-art deep reinforcement learning, we can efficiently tackle continuous control problems. However, deep reinforcement learning methods for continuous control are based on historical data and can therefore make unpredictable decisions in unfamiliar scenarios. Combining deep reinforcement learning and safety-based control can achieve good performance for self-driving and collision avoidance. In this paper, we use the Deep Deterministic Policy Gradient algorithm to implement autonomous driving when no other vehicles are around. The vehicle can learn the driving policy in a stable and familiar environment, which is efficient and reliable. We then use the artificial potential field to design a collision avoidance algorithm for when other vehicles are around. The path tracking method is also taken into consideration. The combination of deep reinforcement learning and safety-based control performs well in most scenarios.
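The artificial potential field idea can be sketched with the textbook formulation: a quadratic attractive potential toward the goal plus a repulsive potential that acts only within an influence distance of each obstacle. The abstract gives no formulas, so the gains `k_att`, `k_rep`, the influence distance `d0`, and the function name are assumptions made for illustration.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float]


def apf_force(pos: Point, goal: Point, obstacles: Iterable[Point],
              k_att: float = 1.0, k_rep: float = 1.0, d0: float = 5.0) -> Point:
    """Net force on the vehicle from a textbook artificial potential field.

    Attractive potential: 0.5 * k_att * |pos - goal|^2.
    Repulsive potential:  0.5 * k_rep * (1/d - 1/d0)^2 for 0 < d <= d0, else 0,
    where d is the distance to the obstacle.
    """
    # Attractive force pulls straight toward the goal.
    fx = k_att * (goal[0] - pos[0])
    fy = k_att * (goal[1] - pos[1])
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy)
        if 0.0 < d <= d0:
            # Negative gradient of the repulsive potential, pointing away
            # from the obstacle along the unit vector (dx / d, dy / d).
            magnitude = k_rep * (1.0 / d - 1.0 / d0) / d ** 2
            fx += magnitude * dx / d
            fy += magnitude * dy / d
    return fx, fy
```

In a combined controller of the kind the abstract describes, such a force could adjust or override the learned action when another vehicle comes within `d0`. How the paper blends the two is not stated here.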

OS · May 12, 2013
Practical Fine-grained Privilege Separation in Multithreaded Applications

Jun Wang, Xi Xiong, Peng Liu

An inherent security limitation with the classic multithreaded programming model is that all the threads share the same address space and, therefore, are implicitly assumed to be mutually trusted. This assumption, however, does not take into account the many modern multithreaded applications that involve multiple principals which do not fully trust each other. It remains challenging to retrofit the classic multithreaded programming model so that the security and privilege separation in multi-principal applications can be resolved. This paper proposes ARBITER, a run-time system and a set of security primitives, aimed at fine-grained and data-centric privilege separation in multithreaded applications. While enforcing effective isolation among principals, ARBITER still allows flexible sharing and communication between threads so that the multithreaded programming paradigm can be preserved. To realize controlled sharing in a fine-grained manner, we created a novel abstraction named ARBITER Secure Memory Segment (ASMS) and corresponding OS support. Programmers express security policies by labeling data and principals via ARBITER's API following a unified model. We ported a widely-used, in-memory database application (memcached) to the ARBITER system, changing only around 100 LOC. Experiments indicate that this security-enhanced version of the application incurs an average runtime overhead of only 5.6%.