Yue Guan

DC
h-index: 29
22 papers
1,840 citations
Novelty: 55%
AI Score: 60

22 Papers

CL · May 15, 2022
Transkimmer: Transformer Learns to Layer-wise Skim

Yue Guan, Zhengyi Li, Jingwen Leng et al. · meta-ai, mila

The Transformer architecture has become the de facto model for many machine learning tasks in natural language processing and computer vision. As such, improving its computational efficiency becomes paramount. One of the major computational inefficiencies of Transformer-based models is that they spend an identical amount of computation throughout all layers. Prior works have proposed to augment the Transformer model with the capability of skimming tokens to improve its computational efficiency. However, they lack effective, end-to-end optimization of the discrete skimming predictor. To address the above limitations, we propose the Transkimmer architecture, which learns to identify hidden state tokens that are not required by each layer. The skimmed tokens are then forwarded directly to the final output, thus reducing the computation of the successive layers. The key idea in Transkimmer is to add a parameterized predictor before each layer that learns to make the skimming decision. We also propose to adopt the reparameterization trick and add a skim loss for the end-to-end training of Transkimmer. Transkimmer achieves a 10.97x average speedup on the GLUE benchmark compared with the vanilla BERT-base baseline, with less than 1% accuracy degradation.
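
The reparameterization idea mentioned in the abstract can be illustrated with a Gumbel-softmax sample over a two-way skim/keep choice. The following is only a minimal pure-Python sketch under stated assumptions: the function names, the two-logit layout, and the form of the skim loss (mean fraction of kept tokens) are hypothetical and not taken from the paper, and no gradients are computed here.

```python
import math
import random


def gumbel_softmax_skim(logits, tau=1.0, rng=random):
    """Sample a hard keep/skim decision per token from 2-way logits.

    logits: list of (skim_logit, keep_logit) pairs, one per token (assumed layout).
    Returns (hard_mask, soft_keep_probs); hard_mask[i] == 1 means "keep".
    """
    hard, soft = [], []
    for skim_logit, keep_logit in logits:
        # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1).
        g = [-math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9))) for _ in range(2)]
        z_skim = (skim_logit + g[0]) / tau
        z_keep = (keep_logit + g[1]) / tau
        # Numerically stable two-way softmax.
        m = max(z_skim, z_keep)
        e_skim, e_keep = math.exp(z_skim - m), math.exp(z_keep - m)
        p_keep = e_keep / (e_skim + e_keep)
        soft.append(p_keep)
        hard.append(1 if p_keep >= 0.5 else 0)
    return hard, soft


def skim_loss(masks_per_layer):
    """Assumed form: mean fraction of tokens kept across layers."""
    ratios = [sum(mask) / len(mask) for mask in masks_per_layer]
    return sum(ratios) / len(ratios)
```

In an autograd framework, the hard mask would typically be used in the forward pass while gradients flow through the soft probabilities (a straight-through estimator); this sketch only shows the sampling step.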

CY · Nov 28, 2019
Cumulative Prospect Theory Based Dynamic Pricing for Shared Mobility on Demand Services

Yue Guan, Anuradha M. Annaswamy, H. Eric Tseng

Cumulative Prospect Theory (CPT) is a modeling tool widely used in behavioral economics and cognitive psychology that captures subjective decision making of individuals under risk or uncertainty. In this paper, we propose a dynamic pricing strategy for Shared Mobility on Demand Services (SMoDSs) using a passenger behavioral model based on CPT. This dynamic pricing strategy, together with dynamic routing via a constrained optimization algorithm that we have developed earlier, provides a complete solution customized for SMoDS of multi-passenger transportation. The basic principles of CPT and the derivation of the passenger behavioral model in the SMoDS context are described in detail. The implications of CPT on dynamic pricing of the SMoDS are delineated using computational experiments involving passenger preferences. These implications include interpretation of the classic fourfold pattern of risk attitudes, strong risk aversion over mixed prospects, and behavioral preferences of self-reference. Overall, it is argued that the use of the CPT framework corresponds to a crucial building block in designing socio-technical systems by allowing quantification of subjective decision making under risk or uncertainty that is perceived to be otherwise qualitative.
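
For readers unfamiliar with CPT, its two basic ingredients can be sketched with the standard Tversky-Kahneman (1992) functional forms and their commonly cited parameter estimates. This is an assumption made for illustration only: the abstract does not state which forms or parameters the passenger model uses, and the function names are hypothetical.

```python
def cpt_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Value function relative to a reference point at 0.

    Concave for gains, convex for losses, and steeper for losses (lam > 1).
    """
    if x >= 0:
        return x ** alpha
    return -lam * ((-x) ** beta)


def cpt_weight(p, gamma=0.61):
    """Inverse-S probability weighting: overweights small p, underweights large p."""
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    num = p ** gamma
    return num / ((num + (1.0 - p) ** gamma) ** (1.0 / gamma))


def cpt_binary_gain(x, p):
    """Subjective value of gaining x >= 0 with probability p, else nothing."""
    return cpt_weight(p) * cpt_value(x)
```

With these parameters, a 1% chance of gaining 100 is valued above a sure gain of 1 (risk seeking for unlikely gains), which is one quadrant of the fourfold pattern mentioned above.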

MA · Mar 24
Dynamic Adversarial Resource Allocation: the dDAB Game

Yue Guan, Daigo Shishika, Jason R. Marden et al. · gatech

This work introduces the dynamic Defender-Attacker Blotto (dDAB) game, extending the classical static Blotto game to a dynamic resource allocation setting over graphs. In the dDAB game, a defender is required to maintain numerical superiority against attacker resources across a set of key nodes in a connected graph. The engagement unfolds as a discrete-time game, where each player reallocates its resources in turn, with resources allowed to move at most one hop per time step. The primary goal is to determine the necessary and sufficient amount of defender resources required to guarantee sustained defense, along with the corresponding strategies. To address the central challenge arising from graph-constrained resource reallocation, we conduct a reachability analysis, starting with simplified settings where attacker resources act as a single cohesive group. We then extend the framework to allow attacker resources to split and merge arbitrarily, and construct defender strategies using superposition principles. A set-based dynamic programming algorithm is developed to compute the optimal strategies, as well as the minimum amount of defender resources to ensure successful defense. The effectiveness of our approach is demonstrated through numerical simulations and hardware experiments on the Georgia Tech Robotarium platform.

GT · Apr 27
Asymmetric-Information Resource Allocation Games: An LP Approach to Purposeful Deception

Longxu Pan, Yue Guan, Daigo Shishika et al.

In this work, we introduce the Deceptive Resource Allocation Game (DRAG), which studies purposeful deception within a Bayesian game framework. In DRAG, a Defender allocates resources across the true asset and several decoys to influence an Attacker's beliefs and actions, with the goal of diverting the Attacker away from the true asset. We seek to characterize purposeful deception, whereby the Defender deceives only when doing so improves its performance. To this end, we solve for the Perfect Bayesian Nash Equilibrium (PBNE) of the corresponding game. We show that, despite the coupled belief-policy interdependence, the problem admits an efficient, non-iterative linear programming formulation. Numerical results demonstrate that the resulting policies naturally balance effective allocation and belief manipulation, giving rise to purposeful and emergent deceptive behaviors.

SY · Apr 16
Nonlinear Stochastic Density Steering via Gaussian Mixture Schrodinger Bridges and Multiple Linearizations

Mattia Mosso, George Rapakoulias, Yue Guan et al. · gatech

The paper studies the optimal density steering problem for nonlinear continuous-time stochastic systems. To accurately capture nonlinear dynamics in high-uncertainty regions that deviate significantly from a nominal linearization point, we introduce the concept of Multiple Distribution-to-Distribution Linearization. The proposed approach first approximates the boundary distributions using Gaussian Mixture Models (GMMs), and decomposes the original nonlinear problem into a collection of Gaussian-to-Gaussian Optimal Covariance Steering (OCS) subproblems between pairs of mixture components. Each elementary OCS problem is solved via local linearization around the mean trajectory connecting the corresponding initial and terminal Gaussian components. The resulting elementary policies are then combined according to their associated conditional densities. We prove that the proposed multi-linearization approach yields tighter approximation error bounds than single-linearization for a broad class of problems. The effectiveness of the approach is demonstrated through numerical experiments on an Earth-to-Mars orbit transfer scenario.

AR · May 11 · Code
TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments

Yue Guan, Hongtao Yu, Peng Chen et al.

Modern GPUs increasingly rely on specialized hardware units and asynchronous coordination mechanisms, so performance depends on orchestrating data movement, tensor-core computation, and synchronization rather than exposing more thread-level parallelism. This creates a programming-model tension: if too much execution structure is hidden, the compiler must catch up to new hardware mechanisms; if too much is exposed, the burden of orchestration falls back onto the programmer. We present TLX (Triton Low-level Language Extensions), built around MIMW (Multi-Instruction, Multi-Warp), which expresses orchestration at warp-group granularity while preserving Triton's productive blocked programming model for regular computation. TLX realizes this idea as an embedded extension to Triton, exposing explicit interfaces for multi-warp execution, local-memory orchestration, asynchronous operations, and cluster-aware control. Our evaluation shows that TLX supports substantial customization with limited development effort while remaining competitive with state-of-the-art implementations. TLX-authored kernels have been deployed in large-scale training and inference production systems. Our code is open sourced at https://github.com/facebookexperimental/triton.

LG · Dec 29, 2025
Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding

Yue Guan, Changming Yu, Shihan Fang et al.

Speculative decoding improves LLM inference by generating and verifying multiple tokens in parallel, but existing systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to $3.98\times$ speedup over state-of-the-art baselines across multiple hardware setups.
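
The verification half of tree-based speculative decoding can be sketched as a walk down a draft tree that follows the target model's greedy choice for as long as the draft anticipated it. This is a generic illustration, not Yggdrasil's algorithm: the nested-dict tree encoding and the `target_next_token` callable are hypothetical stand-ins, and a real system would score all tree nodes in one batched forward pass rather than calling the target once per step.

```python
def verify_tree(tree, target_next_token, prefix):
    """Return the tokens produced by one verification step.

    tree: nested dict {token: subtree} of drafted continuations (assumed encoding).
    target_next_token: callable mapping a token list to the target model's
        greedy next token (hypothetical stand-in for the verifier).
    prefix: tokens already committed.
    """
    accepted = []
    context = list(prefix)
    node = tree
    while True:
        expected = target_next_token(context)
        # The target's own token is always emitted, so output quality is unchanged.
        accepted.append(expected)
        if expected not in node:
            # The draft did not anticipate this token; stop here.
            return accepted
        # The draft guessed correctly; descend and keep verifying.
        context.append(expected)
        node = node[expected]
```

The number of tokens returned per call is what the draft tree's shape influences, which is why draft selection is a natural place for a latency-aware objective.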

SY · May 15
Linear Programming Approach to Deceptive Path Planning Game with Goal Selection

Violetta Rostobaya, Yue Guan, James Berneburg et al.

In adversarial settings, a mobile agent may strategically plan its motion to influence an opponent's inference about its intended goal. We study deceptive path planning in a scenario where a mobile agent aims to reach a privately selected goal while an adversarial observer allocates limited defensive resources based on the observed trajectory. Unlike classical path-planning and goal-recognition approaches that model observers as passive inference processes, our game-theoretic formulation models them as strategic decision-makers. For the resulting dynamic asymmetric-information game, we develop an efficient solution method that combines a linear programming formulation with the Double Oracle algorithm. To evaluate performance, we introduce metrics that quantify both the risk and the effectiveness of deception and provide illustrative numerical examples.

AI · Jan 29
ScaleSim: Serving Large-Scale Multi-Agent Simulation with Invocation Distance-Based Memory Management

Zaifeng Pan, Yipeng Shen, Zhengding Hu et al.

LLM-based multi-agent simulations are increasingly adopted across application domains, but remain difficult to scale due to GPU memory pressure. Each agent maintains private GPU-resident states, including models, prefix caches, and adapters, which quickly exhaust device memory as the agent count grows. We identify two key properties of these workloads: sparse agent activation and an estimable agent invocation order. Based on an analysis of representative workload classes, we introduce invocation distance, a unified abstraction that estimates the relative order in which agents will issue future LLM requests. Leveraging this abstraction, we present ScaleSim, a memory-efficient LLM serving system for large-scale multi-agent simulations. ScaleSim enables proactive prefetching and priority-based eviction, supports diverse agent-specific memory through a modular interface, and achieves up to 1.74x speedup over SGLang on simulation benchmarks.
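
One way to read "priority-based eviction" and "proactive prefetching" driven by invocation distance is: evict the agents expected to run latest, and prefetch the agents expected to run soonest. The sketch below illustrates that reading only; the function names, dictionary layout, and greedy policies are assumptions and do not describe ScaleSim's actual interface.

```python
def choose_evictions(resident, invocation_distance, bytes_needed):
    """Pick resident agents to evict, farthest-from-next-invocation first.

    resident: dict agent_id -> bytes of GPU-resident state.
    invocation_distance: dict agent_id -> estimated distance to the agent's
        next LLM request (larger means it will be invoked later).
    """
    victims, freed = [], 0
    for agent in sorted(resident, key=lambda a: invocation_distance[a], reverse=True):
        if freed >= bytes_needed:
            break
        victims.append(agent)
        freed += resident[agent]
    return victims


def choose_prefetch(offloaded, invocation_distance, free_bytes):
    """Pick offloaded agents to bring back, soonest-invoked first, within budget."""
    picked = []
    for agent in sorted(offloaded, key=lambda a: invocation_distance[a]):
        size = offloaded[agent]
        if size <= free_bytes:
            picked.append(agent)
            free_bytes -= size
    return picked
```

Evicting the state that will be needed furthest in the future mirrors the classic Belady heuristic, which becomes usable here because the invocation order is estimable.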

DC · Oct 7, 2025 · Code
Orders in Chaos: Enhancing Large-Scale MoE LLM Serving with Data Movement Forecasting

Zhongkai Yu, Yue Guan, Zihao Yu et al.

Large Language Models (LLMs) with Mixture of Experts (MoE) architectures achieve remarkable performance improvements, but their random expert selection mechanism introduces significant data movement overhead that becomes the dominant bottleneck in multi-unit serving systems. To forecast the patterns underlying this data movement, we conduct comprehensive data-movement-centric profiling across three state-of-the-art large-scale MoE models (200B-671B) using over 24,000 requests spanning diverse workloads. With the resulting 150GB+ trace files, we perform systematic analysis from both temporal and spatial perspectives and distill six key insights to guide the design of diverse future serving systems. Taking wafer-scale GPUs as a case study, we demonstrate that minor architectural modifications leveraging our insights achieve substantial performance gains, delivering 6.3X and 4.0X average speedups on DeepSeek V3 and Qwen3, respectively. Our work provides the first comprehensive data-centric analysis of MoE models at scale. Our profiling traces and analysis results are publicly available at https://huggingface.co/datasets/core12345/MoE_expert_selection_trace. We will also release our simulation framework shortly to facilitate future research in this area.

CV · Sep 15, 2021 · Code
PointManifoldCut: Point-wise Augmentation in the Manifold for Point Clouds

Tianfang Zhu, Yue Guan, Anan Li

Mixed-based point cloud augmentation is a popular solution to the problem of limited availability of large-scale public datasets. However, the mismatch between mixed points and their corresponding semantic labels hinders further application in point-wise tasks such as part segmentation. This paper proposes a point cloud augmentation approach, PointManifoldCut (PMC), which replaces the points embedded by the neural network rather than their Euclidean space coordinates. This approach takes advantage of the fact that points at the higher levels of the neural network are already trained to embed their neighbors' relations, so mixing these representations will not confuse the relation between a point and its label. We set up a spatial transform module after the PointManifoldCut operation to align the new instances in the embedded space. The effects of different hidden layers and methods of replacing points are also discussed in this paper. The experiments show that our proposed approach can enhance the performance of point cloud classification as well as segmentation networks, and brings them additional robustness to attacks and geometric transformations. The code of this paper is available at: https://github.com/fun0515/PointManifoldCut.

LG · May 8
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

Zhengding Hu, Mingge Lu, Zhen Wang et al.

LLM-based evolution has emerged as a promising way to improve agents by refining non-parametric artifacts, but its wall-clock cost remains a major bottleneck. We identify that this cost comes from synchronized stage execution and imbalance inside each LLM-heavy stage. We present FlashEvolve, an efficient framework that replaces synchronized execution with asynchronous workers and queues, allowing different stages and steps to overlap. To handle data staleness introduced by asynchrony, FlashEvolve tracks artifact versions and applies different policies to update, discard, or patch stale artifacts. Unlike weight-space staleness in asynchronous RL, language-space staleness is inspectable and repairable: a stale artifact is not just delayed work, but readable evidence that the LLM can reflect on, revise, and turn into useful evolution signal. FlashEvolve further improves throughput and token efficiency with speculative stage completion and adaptive workflow control. On GEPA workloads, FlashEvolve improves proposal throughput by $3.5\times$ on local vLLM and $4.9\times$ on API serving over synchronous GEPA. The same design also applies to ACE and Meta-Harness.

DC · Mar 27
Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap

Xinwei Qiang, Yue Guan, Zhengding Hu et al.

Communication has become a first-order bottleneck in large-scale GPU workloads, and existing distributed compilers address it mainly by overlapping whole compute and communication kernels at the stream level. This coarse granularity incurs extra kernel launches, forces device-wide synchronizations at kernel boundaries, and leaves substantial slack when the slowest tile or kernel stretches the communication tail. We present Syncopate, a compiler and runtime that enables automatic fine-grained overlap inside a single fused kernel. Syncopate introduces a communication chunk abstraction that decouples communication granularity from kernel structure and backend mechanisms, allowing chunk-level plans to be ported from existing distributed compilers, written directly by users, or instantiated from reusable templates. Given a local Triton kernel and a chunk schedule, Syncopate performs transformations to align computation with chunk availability. Implemented as a source-to-source compiler on Triton, Syncopate delivers an average end-to-end speedup of 1.3$\times$ and up to 4.7$\times$ on multi-GPU workloads.

LG · Apr 26
JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

Zhengding Hu, Hehua Ouyang, Chang Chen et al.

We present JigsawRL, a cost-efficient framework that explores Pipeline Multiplexing as a new dimension of RL parallelism. JigsawRL decomposes each pipeline into a Sub-Stage Graph that exposes the intra-stage and inter-worker imbalance hidden by stage-level systems. On this abstraction, JigsawRL resolves multiplexing interference through dynamic resource allocation, eliminates fragmented utilization by migrating long-tail rollouts across workers, and formulates their coordination as a graph scheduling problem solved with a look-ahead heuristic. On 4-64 H100/A100 GPUs across different agentic RL pipelines and models, JigsawRL achieves up to 1.85x throughput over Verl on synchronous RL, 1.54x over StreamRL and AReaL on asynchronous RL, and supports heterogeneous pipelines with moderate latency trade-off.

DC · Mar 23, 2025
WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training

Zheng Wang, Anna Cai, Xinfeng Xie et al.

In this work, we present WLB-LLM, a workload-balanced 4D parallelism for large language model training. We first thoroughly analyze the workload imbalance issue in LLM training and identify two primary sources of imbalance at the pipeline parallelism and context parallelism levels. Then, to address the imbalance issue, at the pipeline parallelism level, WLB-LLM incorporates a workload-aware variable-length document packing method to balance the computation and communication workload across micro-batches. Additionally, at the context parallelism level, WLB-LLM introduces a novel fine-grained per-document sharding strategy, ensuring each worker within a context parallelism group has an identical workload. Comprehensive experiments under different model scales demonstrate that WLB-LLM significantly mitigates the workload imbalance during 4D parallelism LLM training and achieves an average speedup of 1.23x when applying WLB-LLM in our internal LLM training framework.
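
Workload-aware packing can be illustrated with a greedy heuristic that balances an estimated per-document cost rather than raw token counts. The sketch below is a generic longest-processing-time assignment, not WLB-LLM's method: the cost model (a linear term plus a quadratic attention term) and its coefficients are assumptions chosen only to make the example concrete.

```python
import heapq


def doc_cost(length, linear=1.0, quadratic=1e-3):
    """Assumed cost model: linear work plus quadratic attention work."""
    return linear * length + quadratic * length * length


def pack_documents(doc_lengths, num_micro_batches):
    """Assign each document to the currently least-loaded micro-batch.

    Documents are placed from most to least expensive, so micro-batches may
    end up with different token counts but similar estimated workloads.
    """
    heap = [(0.0, i) for i in range(num_micro_batches)]  # (load, batch index)
    heapq.heapify(heap)
    batches = [[] for _ in range(num_micro_batches)]
    for length in sorted(doc_lengths, reverse=True):
        load, idx = heapq.heappop(heap)
        batches[idx].append(length)
        heapq.heappush(heap, (load + doc_cost(length), idx))
    return batches
```

Because the cost grows faster than linearly in document length, two micro-batches with equal token counts can still differ in work, which is the imbalance that motivates packing by workload.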

CR · May 21, 2025
An Efficient Private GPT Never Autoregressively Decodes

Zhengyi Li, Yue Guan, Kang Yang et al.

The wide deployment of the generative pre-trained transformer (GPT) has raised privacy concerns for both clients and servers. While cryptographic primitives can be employed for secure GPT inference to protect the privacy of both parties, they introduce considerable performance overhead. To accelerate secure inference, this study proposes a public decoding and secure verification approach that utilizes public GPT models, motivated by the observation that securely decoding one token and securely decoding multiple tokens take similar latency. The client uses the public model to generate a set of tokens, which are then securely verified by the private model for acceptance. The efficiency of our approach depends on the acceptance ratio of tokens proposed by the public model, which we improve from two aspects: (1) a private sampling protocol optimized for cryptographic primitives and (2) model alignment using knowledge distillation. Our approach improves the efficiency of secure decoding while maintaining the same level of privacy and generation quality as standard secure decoding. Experiments demonstrate a $2.1\times \sim 6.0\times$ speedup compared to standard decoding across three pairs of public-private models and different network conditions.

CL · Dec 16, 2021
Block-Skim: Efficient Question Answering for Transformer

Yue Guan, Zhengyi Li, Jingwen Leng et al.

Transformer models have achieved promising results on natural language processing (NLP) tasks including extractive question answering (QA). Common Transformer encoders used in NLP tasks process the hidden states of all input tokens in the context paragraph throughout all layers. However, unlike other tasks such as sequence classification, answering the raised question does not necessarily need all the tokens in the context paragraph. Following this motivation, we propose Block-Skim, which learns to skim unnecessary context in higher hidden layers to improve and accelerate Transformer performance. The key idea of Block-Skim is to identify the context that must be further processed and the context that could be safely discarded early on during inference. Critically, we find that such information could be sufficiently derived from the self-attention weights inside the Transformer model. We further prune the hidden states corresponding to the unnecessary positions early in lower layers, achieving significant inference-time speedup. To our surprise, we observe that models pruned in this way outperform their full-size counterparts. Block-Skim improves QA models' accuracy on different datasets and achieves a 3x speedup on the BERT-base model.

IV · Sep 22, 2021
Joint Optical Neuroimaging Denoising with Semantic Tasks

Tianfang Zhu, Yue Guan, Anan Li

Optical neuroimaging is a vital tool for understanding brain structure and the connections between regions and nuclei. However, image noise introduced during sample preparation and by the imaging system hinders the extraction of knowledge from the dataset, so denoising is usually necessary for optical neuroimaging. Supervised denoising methods often outperform unsupervised ones, but training supervised denoising models requires corresponding clean labels, which are not always available due to the high labeling cost. On the other hand, semantic labels, such as located soma positions, reconstructed neuronal fibers, and nuclei segmentation results, are generally available and accumulate from everyday neuroscience research. This work connects a supervised denoising model and a semantic segmentation model to form an end-to-end model, which can make use of the semantic labels while still providing a denoised image as an intermediate product. We use both supervised and self-supervised models for the denoising and introduce a new cost term for the joint denoising and segmentation setup. We test the proposed approach on both synthetic and real-world data, including an optical neuroimaging dataset and an electron microscope dataset. The results show that joint denoising outperforms the denoising method alone and that the joint model benefits segmentation and other downstream tasks as well.

CL · Nov 2, 2020
How Far Does BERT Look At: Distance-based Clustering and Analysis of BERT's Attention

Yue Guan, Jingwen Leng, Chao Li et al.

Recent research on the multi-head attention mechanism, especially in pre-trained models such as BERT, has provided heuristics and clues for analyzing various aspects of the mechanism. As most of this research focuses on probing tasks or hidden states, previous works have found some primitive patterns of attention head behavior by heuristic analytical methods, but systematic analysis specific to the attention patterns remains primitive. In this work, we cluster the attention heatmaps into significantly different patterns through unsupervised clustering on top of a set of proposed features, which corroborates previous observations. We further study their corresponding functions through analytical study. In addition, our proposed features can be used to explain and calibrate different attention heads in Transformer models.

LG · Sep 1, 2020
Learning Nash Equilibria in Zero-Sum Stochastic Games via Entropy-Regularized Policy Approximation

Yue Guan, Qifan Zhang, Panagiotis Tsiotras

We explore the use of policy approximations to reduce the computational cost of learning Nash equilibria in zero-sum stochastic games. We propose a new Q-learning type algorithm that uses a sequence of entropy-regularized soft policies to approximate the Nash policy during the Q-function updates. We prove that under certain conditions, by updating the regularized Q-function, the algorithm converges to a Nash equilibrium. We also demonstrate the proposed algorithm's ability to transfer previous training experiences, enabling the agents to adapt quickly to new environments. We provide a dynamic hyper-parameter scheduling scheme to further expedite convergence. Empirical results applied to a number of stochastic games verify that the proposed algorithm converges to the Nash equilibrium, while exhibiting a major speed-up over existing algorithms.
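
The basic building block of an entropy-regularized soft policy is a softmax over Q-values, with a matching log-sum-exp soft value. The sketch below shows only this single-player ingredient under the assumption of an inverse-temperature parameter `beta`; it is not the two-player algorithm from the paper, and the function names are hypothetical.

```python
import math


def soft_policy(q_values, beta):
    """Softmax policy: pi(a) is proportional to exp(beta * Q(a))."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp(beta * (q - m)) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_value(q_values, beta):
    """Entropy-regularized value: (1 / beta) * log(sum_a exp(beta * Q(a)))."""
    m = max(q_values)
    return m + math.log(sum(math.exp(beta * (q - m)) for q in q_values)) / beta
```

As `beta` grows, the soft policy concentrates on the greedy action and the soft value approaches the maximum Q-value, while smaller `beta` keeps the policy closer to uniform.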

DC · Aug 29, 2020
Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity

Cong Guo, Bo Yang Hsueh, Jingwen Leng et al.

Network pruning can reduce the high computation cost of deep neural network (DNN) models. However, to maintain their accuracies, sparse models often carry randomly-distributed weights, leading to irregular computations. Consequently, sparse models cannot achieve meaningful speedup on commodity hardware (e.g., GPU) built for dense matrix computations. As such, prior works usually modify or design completely new sparsity-optimized architectures for exploiting sparsity. We propose an algorithm-software co-designed pruning method that achieves latency speedups on existing dense architectures. Our work builds upon the insight that the matrix multiplication generally breaks the large matrix into multiple smaller tiles for parallel execution. We propose a tiling-friendly "tile-wise" sparsity pattern, which maintains a regular pattern at the tile level for efficient execution but allows for irregular, arbitrary pruning at the global scale to maintain the high accuracy. We implement and evaluate the sparsity pattern on GPU tensor core, achieving a 1.95x speedup over the dense model.
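
One way to picture a pattern that is regular inside each tile but irregular globally is to prune whole column segments within row tiles, choosing which segments to drop by a global ranking. The sketch below follows that reading; the segment granularity, the squared-magnitude score, and the function name are assumptions and may differ from the paper's actual pattern.

```python
def tile_wise_prune(weights, tile_rows, sparsity):
    """Zero the weakest column segments, ranked globally across row tiles.

    weights: list of equal-length rows (a dense matrix).
    tile_rows: number of rows per tile.
    sparsity: fraction of column segments to prune, between 0 and 1.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    segments = []  # (score, tile_start_row, column)
    for start in range(0, n_rows, tile_rows):
        rows = range(start, min(start + tile_rows, n_rows))
        for c in range(n_cols):
            score = sum(weights[r][c] ** 2 for r in rows)
            segments.append((score, start, c))
    segments.sort()
    n_prune = int(len(segments) * sparsity)
    pruned = [row[:] for row in weights]  # leave the input untouched
    for _, start, c in segments[:n_prune]:
        for r in range(start, min(start + tile_rows, n_rows)):
            pruned[r][c] = 0.0
    return pruned
```

After pruning, each tile keeps its own subset of columns, so it can be compacted into a smaller dense block, while the number of surviving columns is free to vary from tile to tile.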

CV · Sep 2, 2019
Geometry Normalization Networks for Accurate Scene Text Detection

Youjiang Xu, Jiaqi Duan, Zhanghui Kuang et al.

Large geometry (e.g., orientation) variances are a key challenge in scene text detection. In this work, we first conduct experiments to investigate the capacity of networks for learning geometry variances when detecting scene texts, and find that networks can handle only limited text geometry variances. Then, we put forward a novel Geometry Normalization Module (GNM) with multiple branches, each of which is composed of one Scale Normalization Unit and one Orientation Normalization Unit, to normalize each text instance to one desired canonical geometry range through at least one branch. The GNM is general and can be readily plugged into existing convolutional neural network based text detectors to construct end-to-end Geometry Normalization Networks (GNNets). Moreover, we propose a geometry-aware training scheme to effectively train the GNNets by sampling and augmenting text instances from a uniform geometry variance distribution. Finally, experiments on the popular benchmarks ICDAR 2015 and ICDAR 2017 MLT validate that our method remarkably outperforms all the state-of-the-art approaches, obtaining one-forward test F-scores of 88.52 and 74.54, respectively.