Xiaoyu Mo

RO
h-index: 25
Papers: 12
Citations: 730
Novelty: 47%
AI Score: 44

12 Papers

LG · Aug 24, 2022
Augmenting Reinforcement Learning with Transformer-based Scene Representation Learning for Decision-making of Autonomous Driving

Haochen Liu, Zhiyu Huang, Xiaoyu Mo et al.

Decision-making for urban autonomous driving is challenging due to the stochastic nature of interactive traffic participants and the complexity of road structures. Although reinforcement learning (RL)-based decision-making schemes are promising for handling urban driving scenarios, they suffer from low sample efficiency and poor adaptability. In this paper, we propose Scene-Rep Transformer to improve RL decision-making capabilities with better scene representation encoding and sequential predictive latent distillation. Specifically, a multi-stage Transformer (MST) encoder is constructed to model not only the interaction awareness between the ego vehicle and its neighbors but also the intention awareness between the agents and their candidate routes. A sequential latent Transformer (SLT) with self-supervised learning objectives is employed to distill future predictive information into the latent scene representation, in order to reduce the exploration space and speed up training. The final decision-making module, based on soft actor-critic (SAC), takes as input the refined latent scene representation from the Scene-Rep Transformer and outputs driving actions. The framework is validated in five challenging simulated urban scenarios with dense traffic, and its advantages are demonstrated quantitatively by substantial improvements in data efficiency and in performance, measured by success rate, safety, and efficiency. The qualitative results reveal that our framework is able to extract the intentions of neighbor agents to help make decisions and deliver more diversified driving behaviors.

CV · Mar 13, 2023
A Generalized Multi-Modal Fusion Detection Framework

Leichao Cui, Xiuxian Li, Min Meng et al.

LiDAR point clouds have become the most common data source in autonomous driving. However, due to the sparsity of point clouds, accurate and reliable detection cannot be achieved in specific scenarios. Because of their complementarity with point clouds, images are receiving increasing attention. Despite some success, existing fusion methods either perform hard fusion or do not fuse in a direct manner. In this paper, we propose a generic 3D detection framework called MMFusion, using multi-modal features. The framework aims to achieve accurate fusion between LiDAR and images to improve 3D detection in complex scenes. Our framework consists of two separate streams, the LiDAR stream and the camera stream, which are compatible with any single-modal feature extraction network. The Voxel Local Perception Module in the LiDAR stream enhances local feature representation, and then the Multi-modal Feature Fusion Module selectively combines feature outputs from the different streams to achieve better fusion. Extensive experiments have shown that our framework not only outperforms existing benchmarks but also improves their detection, especially for detecting cyclists and pedestrians on KITTI benchmarks, with strong robustness and generalization capabilities. Hopefully, our work will stimulate more research into multi-modal fusion for autonomous driving tasks.

LG · Feb 18
Efficient Tail-Aware Generative Optimization via Flow Model Fine-Tuning

Zifan Wang, Riccardo De Santi, Xiaoyu Mo et al.

Fine-tuning pre-trained diffusion and flow models to optimize downstream utilities is central to real-world deployment. Existing entropy-regularized methods primarily maximize expected reward, providing no mechanism to shape tail behavior. However, tail control is often essential: the lower tail determines reliability by limiting low-reward failures, while the upper tail enables discovery by prioritizing rare, high-reward outcomes. In this work, we present Tail-aware Flow Fine-Tuning (TFFT), a principled and efficient distributional fine-tuning algorithm based on the Conditional Value-at-Risk (CVaR). We address two distinct tail-shaping goals: right-CVaR for seeking novel samples in the high-reward tail and left-CVaR for controlling worst-case samples in the low-reward tail. Unlike prior approaches that rely on non-linear optimization, we leverage the variational dual formulation of CVaR to decompose it into a decoupled two-stage procedure: a lightweight one-dimensional threshold optimization step, and a single entropy-regularized fine-tuning process via a specific pseudo-reward. This decomposition achieves CVaR fine-tuning efficiently with computational cost comparable to standard expected fine-tuning methods. We demonstrate the effectiveness of TFFT across illustrative experiments, high-dimensional text-to-image generation, and molecular design.
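The one-dimensional threshold step mentioned in this abstract can be illustrated with the standard variational (Rockafellar-Uryasev) form of CVaR. The sketch below is not the TFFT implementation: the function names and the toy reward samples are hypothetical, and it covers only the left tail. It checks on samples that the mean of the worst alpha-fraction of rewards equals the maximum over a threshold tau of tau - E[(tau - R)_+] / alpha.

```python
def left_cvar_sorted(rewards, alpha):
    """Mean of the worst alpha-fraction of rewards (assumes alpha * n is an integer)."""
    k = round(alpha * len(rewards))
    worst = sorted(rewards)[:k]
    return sum(worst) / k


def left_cvar_dual(rewards, alpha):
    """Variational form: max over tau of tau - E[(tau - R)_+] / alpha.

    The objective is concave and piecewise linear in tau, so for an empirical
    distribution it is enough to search over the sample points themselves.
    """
    n = len(rewards)

    def objective(tau):
        shortfall = sum(max(tau - r, 0.0) for r in rewards) / n
        return tau - shortfall / alpha

    best_tau = max(rewards, key=objective)
    return objective(best_tau), best_tau


# Hypothetical reward samples, for illustration only.
toy_rewards = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
direct_value = left_cvar_sorted(toy_rewards, 0.2)
dual_value, threshold = left_cvar_dual(toy_rewards, 0.2)
```

Here both routes give the same tail mean, and the maximizing threshold sits at the alpha-quantile of the samples. The right-tail case is symmetric: minimize tau + E[(R - tau)_+] / alpha instead.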

RO · Jul 30, 2024
Survey of Design Paradigms for Social Robots

Rita Frieske, Xiaoyu Mo, Yini Fang et al.

The demand for social robots in fields like healthcare, education, and entertainment is increasing due to their emotional adaptation features. These robots leverage multimodal communication, incorporating speech, facial expressions, and gestures to enhance user engagement and emotional support. Understanding the design paradigms of social robots is hindered by the complexity of the system and the need to tune it to a specific task. This article provides a structured review of social robot design paradigms, categorizing them into cognitive architectures, role design models, linguistic models, communication flow, activity system models, and integrated design models. By breaking down the articles on social robot design and application based on these paradigms, we highlight the strengths and areas for improvement in current approaches. We further propose our original integrated design model that combines the most important aspects of the design of social robots. Our approach shows the importance of integrating operational, communicational, and emotional dimensions to create more adaptive and empathetic interactions between robots and humans.

RO · Feb 4, 2024
Hybrid-Prediction Integrated Planning for Autonomous Driving

Haochen Liu, Zhiyu Huang, Wenhui Huang et al.

Autonomous driving systems require the ability to fully understand and predict the surrounding environment to make informed decisions in complex scenarios. Recent advancements in learning-based systems have highlighted the importance of integrating prediction and planning modules. However, this integration has brought forth three major challenges: inherent trade-offs by sole prediction, consistency between prediction patterns, and social coherence in prediction and planning. To address these challenges, we introduce a hybrid-prediction integrated planning (HPP) system, which comprises three newly designed modules. First, we introduce marginal-conditioned occupancy prediction to align joint occupancy with agent-wise perceptions. Our proposed MS-OccFormer module achieves multi-stage alignment per occupancy forecasting with consistent awareness from agent-wise motion predictions. Second, we propose a game-theoretic motion predictor, GTFormer, to model the interactive future among individual agents with their joint predictive awareness. Third, hybrid prediction patterns are concurrently integrated with Ego Planner and optimized by prediction guidance. HPP achieves state-of-the-art performance on the nuScenes dataset, demonstrating superior accuracy and consistency for end-to-end paradigms in prediction and planning. Moreover, we test the long-term open-loop and closed-loop performance of HPP on the Waymo Open Motion Dataset and CARLA benchmark, surpassing other integrated prediction and planning pipelines with enhanced accuracy and compatibility.

CL · Aug 13, 2025
Speed Always Wins: A Survey on Efficient Architectures for Large Language Models

Weigao Sun, Jiaxi Hu, Yucheng Zhou et al.

Large Language Models (LLMs) have delivered impressive results in language understanding, generation, and reasoning, and have pushed the capability boundary of multimodal models. Transformer models, as the foundation of modern LLMs, offer a strong baseline with excellent scaling properties. However, the traditional transformer architecture requires substantial computation and poses significant obstacles for large-scale training and practical deployment. In this survey, we offer a systematic examination of innovative LLM architectures that address the inherent limitations of transformers and boost efficiency. Starting from language modeling, this survey covers the background and technical details of linear and sparse sequence modeling methods, efficient full attention variants, sparse mixture-of-experts, hybrid model architectures incorporating the above techniques, and emerging diffusion LLMs. Additionally, we discuss applications of these techniques to other modalities and consider their wider implications for developing scalable, resource-aware foundation models. By grouping recent studies into the above categories, this survey presents a blueprint of modern efficient LLM architectures, and we hope this could help motivate future research toward more efficient, versatile AI systems.

RO · Sep 14, 2021
Multi-modal Motion Prediction with Transformer-based Neural Network for Autonomous Driving

Zhiyu Huang, Xiaoyu Mo, Chen Lv

Predicting the behaviors of other agents on the road is critical for autonomous driving to ensure safety and efficiency. However, the challenging part is how to represent the social interactions between agents and output different possible trajectories with interpretability. In this paper, we introduce a neural prediction framework based on the Transformer structure to model the relationship among the interacting agents and extract the attention of the target agent on the map waypoints. Specifically, we organize the interacting agents into a graph and utilize the multi-head attention Transformer encoder to extract the relations between them. To address the multi-modality of motion prediction, we propose a multi-modal attention Transformer encoder, which modifies the multi-head attention mechanism to multi-modal attention, and each predicted trajectory is conditioned on an independent attention mode. The proposed model is validated on the Argoverse motion forecasting dataset and shows state-of-the-art prediction accuracy while maintaining a small model size and a simple training process. We also demonstrate that the multi-modal attention module can automatically identify different modes of the target agent's attention on the map, which improves the interpretability of the model.
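As a rough illustration of the attention mechanism this abstract builds on, the following sketch implements plain single-head scaled dot-product attention over a handful of agent feature vectors. It is a generic building block, not the paper's multi-modal attention; the function names and the toy agent features are hypothetical, and learned projection matrices are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V, row by row."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Hypothetical features: one target agent attending over two neighbors.
target_query = [[1.0, 0.0]]
neighbor_keys = [[1.0, 0.0], [1.0, 0.0]]
neighbor_values = [[0.0, 0.0], [2.0, 4.0]]
attended, attn_weights = attention(target_query, neighbor_keys, neighbor_values)
```

Each row of `attn_weights` sums to one, so it can be read as how much the target agent attends to each neighbor. A multi-head variant would run several such maps in parallel on different learned projections.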

RO · Jul 8, 2021
Graph and Recurrent Neural Network-based Vehicle Trajectory Prediction For Highway Driving

Xiaoyu Mo, Yang Xing, Chen Lv

Integrating trajectory prediction into the decision-making and planning modules of modular autonomous driving systems is expected to improve the safety and efficiency of self-driving vehicles. However, predicting a vehicle's future trajectory is challenging because the trajectory is affected by the social interactive behaviors of neighboring vehicles, and the number of neighboring vehicles can vary in different situations. This work proposes a GNN-RNN based Encoder-Decoder network for interaction-aware trajectory prediction, where vehicles' dynamic features are extracted from their historical tracks using an RNN, and the inter-vehicular interaction is represented by a directed graph and encoded using a GNN. The parallelism of the GNN implies the proposed method's potential to predict multi-vehicular trajectories simultaneously. Evaluation on data extracted from the NGSIM US-101 dataset shows that the proposed model is able to predict a target vehicle's trajectory in situations with a variable number of surrounding vehicles.
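To make the directed-graph idea concrete, here is a minimal sketch of one round of message passing with mean aggregation over incoming edges. It is not the network from the paper: the aggregation rule, the function name, and the toy features are assumptions chosen only to show how a graph handles a varying number of neighbors.

```python
def aggregate_neighbors(features, edges):
    """One message-passing step on a directed graph.

    features: dict mapping node name -> list of floats (per-vehicle encoding)
    edges:    list of (source, destination) pairs; information flows source -> destination
    Returns a dict mapping each node to its own features concatenated with the
    mean of its in-neighbors' features (zeros if it has no in-neighbors).
    """
    incoming = {node: [] for node in features}
    for src, dst in edges:
        incoming[dst].append(src)

    updated = {}
    for node, own in features.items():
        senders = incoming[node]
        if senders:
            mean = [
                sum(features[s][i] for s in senders) / len(senders)
                for i in range(len(own))
            ]
        else:
            mean = [0.0] * len(own)
        updated[node] = own + mean  # list concatenation, not addition
    return updated


# Hypothetical one-dimensional encodings for a target vehicle and two neighbors.
toy_features = {"target": [1.0], "left": [2.0], "front": [4.0]}
toy_edges = [("left", "target"), ("front", "target")]
toy_updated = aggregate_neighbors(toy_features, toy_edges)
```

Because the mean is taken over however many in-neighbors exist, the same function works unchanged when vehicles enter or leave the scene, and every node is updated in the same pass.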

RO · Jun 14, 2021
Heterogeneous Edge-Enhanced Graph Attention Network For Multi-Agent Trajectory Prediction

Xiaoyu Mo, Yang Xing, Chen Lv

Simultaneous trajectory prediction for multiple heterogeneous traffic participants is essential for the safe and efficient operation of connected automated vehicles under complex driving situations in the real world. The multi-agent prediction task is challenging, as the motions of traffic participants are affected by many factors, including their individual dynamics, their interactions with surrounding agents, the traffic infrastructure, and the number and modalities of the target agents. To further advance trajectory prediction techniques, in this work we propose a three-channel framework together with a novel Heterogeneous Edge-enhanced graph ATtention network (HEAT), which is able to deal with the heterogeneity of the target agents and traffic participants involved. Specifically, the agents' dynamics are extracted from their historical states using type-specific encoders. The inter-agent interactions are represented with a directed edge-featured heterogeneous graph, and then interaction features are extracted using the proposed HEAT network. In addition, the map features are shared across all agents by introducing a selective gate mechanism. Finally, the trajectories of multiple agents are predicted simultaneously. Validations using both urban and highway driving datasets show that the proposed model can realize simultaneous trajectory predictions for multiple agents under complex traffic situations, and achieve state-of-the-art performance with respect to prediction accuracy, demonstrating its feasibility and effectiveness.

RO · Dec 9, 2020
ReCoG: A Deep Learning Framework with Heterogeneous Graph for Interaction-Aware Trajectory Prediction

Xiaoyu Mo, Yang Xing, Chen Lv

Predicting the future trajectory of surrounding vehicles is essential for the navigation of autonomous vehicles in complex real-world driving scenarios. It is challenging because a vehicle's motion is affected by many factors, including its surrounding infrastructure and vehicles. In this work, we develop ReCoG (Recurrent Convolutional and Graph Neural Networks), a general scheme that represents vehicle interactions with infrastructure information as a heterogeneous graph and applies graph neural networks (GNNs) to model the high-level interactions for trajectory prediction. Nodes in the graph contain corresponding features, where a vehicle node contains its sequential feature encoded using a Recurrent Neural Network (RNN), and an infrastructure node contains a spatial feature encoded using a Convolutional Neural Network (CNN). ReCoG then predicts the future trajectory of the target vehicle by jointly considering all of the features. Experiments are conducted using the INTERACTION dataset. Experimental results show that the proposed ReCoG outperforms other state-of-the-art methods in terms of different types of displacement error, validating the feasibility and effectiveness of the developed approach.

RO · May 25, 2020
Interaction-Aware Trajectory Prediction of Connected Vehicles using CNN-LSTM Networks

Xiaoyu Mo, Yang Xing, Chen Lv

Predicting the future trajectory of a surrounding vehicle in congested traffic is one of the basic abilities of an autonomous vehicle. In congestion, a vehicle's future movement is the result of its interaction with surrounding vehicles. A vehicle in congestion may have many neighbors within a relatively short distance, but only a small subset of them substantially affects its future trajectory. In this work, an interaction-aware method is proposed that predicts the future trajectory of an ego vehicle by considering its interaction with eight surrounding vehicles. The dynamics of the vehicles are encoded by LSTMs with shared weights, and the interaction is extracted with a simple CNN. The proposed model is trained and tested on trajectories extracted from the publicly accessible NGSIM US-101 dataset. Quantitative experimental results show that the proposed model outperforms previous models in terms of root-mean-square error (RMSE). Visualization of the results shows that the model is able to predict the future trajectory induced by a lane change before the vehicle makes any obvious lateral movement to initiate the lane change.
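The RMSE metric named in this abstract can be sketched as follows. This is a generic definition over 2D positions, assuming predicted and ground-truth trajectories are sampled at the same time steps; the function name and the sample coordinates are hypothetical, and the paper's exact evaluation protocol (for example, per-horizon reporting) is not reproduced here.

```python
import math


def trajectory_rmse(predicted, actual):
    """Root-mean-square error between two equally sampled 2D trajectories.

    predicted, actual: lists of (x, y) positions at matching time steps.
    """
    if not predicted or len(predicted) != len(actual):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        (px - ax) ** 2 + (py - ay) ** 2
        for (px, py), (ax, ay) in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical two-step trajectories, for illustration only.
toy_predicted = [(0.0, 0.0), (1.0, 0.0)]
toy_actual = [(0.0, 0.0), (1.0, 2.0)]
toy_error = trajectory_rmse(toy_predicted, toy_actual)
```

Because errors are squared before averaging, a single large miss late in the horizon (such as an unpredicted lane change) raises RMSE more than several small deviations would.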

CV · Jun 5, 2019
Learning to Compose and Reason with Language Tree Structures for Visual Grounding

Richang Hong, Daqing Liu, Xiaoyu Mo et al.

Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it requires comprehending the fine-grained and compositional language space. However, existing solutions rely on the association between holistic language features and visual features, while neglecting the compositional reasoning implied in the language. In this paper, we propose a natural language grounding model that can automatically compose a binary tree structure for parsing the language and then perform visual reasoning along the tree in a bottom-up fashion. We call our model RVG-TREE: Recursive Grounding Tree, which is inspired by the intuition that any language expression can be recursively decomposed into two constituent parts, and the grounding confidence score can be recursively accumulated from the grounding scores returned by the sub-trees. RVG-TREE can be trained end-to-end by using the Straight-Through Gumbel-Softmax estimator, which allows the gradients from the continuous score functions to pass through the discrete tree construction. Experiments on several benchmarks show that our model achieves state-of-the-art performance with more explainable reasoning.