51.0CRMay 6
Order Flow Exclusivity and Value Extraction Mechanisms: An Analysis of Ethereum Builder CentralizationAo Zhang, Yunwen Liu, Ren Zhang et al.
This study investigates the rapid centralization of the Ethereum builder market under the Proposer-Builder Separation (PBS) architecture. We argue that existing research, by focusing predominantly on influential order flows, lacks a comprehensive evaluation of order flow behavioral patterns and economic purposes. To address this gap, we analyze Ethereum transactions from September 2023 to August 2025 to characterize Exclusive Order Flows (EOFs) and non-atomic Maximal Extractable Value (MEV) -- the missing components corresponding to these behavioral and economic dimensions, respectively. We introduce a novel exclusivity metric based on Kullback-Leibler divergence and employ supervised learning to identify 75 EOFs and 322 non-atomic MEV flows, which account for 71\% and 23\% of trading-related builder revenue. A longitudinal analysis of builder strategies across these dimensions delineates the market's evolution into four distinct eras, revealing that while EOFs were instrumental in establishing early dominance, incumbents have since decoupled market share from immediate EOF dependency by leveraging entrenched network effects. Ultimately, we conclude that builder centralization is an emergent property of the PBS framework itself, as the architecture systematically violates the fundamental prerequisites of a competitive market.
94.6DCMar 21
TrEnv-X: Transparently Share Serverless Execution Environments Across Different Functions and NodesJialiang Huang, Teng Ma, Zheng Liu et al.
Serverless computing is renowned for its computation elasticity, yet its full potential is often constrained by the requirement for functions to operate within local and dedicated background environments, resulting in limited memory elasticity. To address this limitation, this paper introduces TrEnv-X, a co-designed integration of the serverless platform with the operating system and CXL/RDMA-based remote memory pools. TrEnv-X's core innovations are repurposable sandboxes, which can be shared across different functions to decrease the associated creation overhead, and OS-level memory templates, which enable rapid state restoration from CXL/RDMA-based remote memory pools. To further demonstrate TrEnv-X's versatility, we generalize its design from traditional containers for microVM-based agent workloads and introduce new optimizations, including browser sharing and a page cache bypassing mechanism. Our evaluation shows that TrEnv-X achieves up to 7x reduction in P99 latency and 48% memory savings for container-based functions. When applied to LLM agents, it reduces the P99 latency by up to 58% and memory usage by 61% compared to state-of-the-art systems like E2B.
92.9DCMay 11
Surviving Partial Rank Failures in Wide Expert-Parallel MoE InferenceXun Sun, Shaoyuan Chen, Pingchuan Ma et al.
Mixture-of-Experts (MoE) serving relies on wide expert parallelism (EP) to aggregate the memory capacity and bandwidth of many GPUs within one inference instance. This efficiency comes with a systems cost: every decoding step depends on token dispatch and combination across all active EP ranks, so even one rank failure can disrupt the entire service. Existing EP stacks handle such failures poorly because they treat membership as a fixed configuration established at initialization. The same rank set determines communicator state, expert placement, and the routing metadata baked into CUDA execution graphs, leaving the system with no way to shrink around a failure while keeping the instance valid. This paper argues that partial-failure tolerance should instead be formulated as a live EP validity problem. We present EEP, a communication and runtime substrate that represents membership as explicit, mutable runtime state. EEP repairs the specific state invalidated by a fault: it restores peer reachability without rebuilding the communication substrate, repairs lost expert coverage through a bandwidth-aware hierarchy, and reintegrates repaired ranks without forcing healthy ranks to recapture their CUDA graphs. We implement EEP in an EP serving stack integrated with SGLang and evaluate it under steady-state serving, failure recovery, and rank reintegration. The results show that explicit mutable membership preserves the steady-state fast path, staying within 4.4% of a fixed-membership DeepEP baseline under static serving, while turning a local rank fault from whole-instance downtime into two bounded interruptions. On a single-rank failure workload, EEP incurs an 11s recovery pause and an 8s reintegration pause, and restores throughput to within 95% of the pre-fault level within 52s, whereas a fixed-membership full-restart baseline remains unavailable until 348s.
LGMay 3, 2024
Efficient Heterogeneous Large Language Model Decoding with Model-Attention DisaggregationShaoyuan Chen, Wencong Xiao, Yutong Lin et al.
Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators. Although disaggregated serving architectures have been proposed to split different phases of LLM inference, the efficiency of decoding phase is still low. This is caused by the varying resource demands of different operators in the transformer-based LLMs. Specifically, the attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially for long context requests. To enhance the efficiency of LLM decoding, we introduce model-attention disaggregation. This approach leverages a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for other parts of the model. This heterogeneous setup ensures that each component is tailored to its specific workload, maximizing overall performance and cost efficiency. Our comprehensive analysis and experiments confirm the viability of splitting the attention computation over multiple devices. Also, the communication bandwidth required between heterogeneous devices proves to be manageable with prevalent networking technologies. To further validate our theory, we develop and deploy Lamina, an LLM inference system that incorporates model-attention disaggregation in a distributed heterogeneous cluster. Experimental results indicate that Lamina can provide 16.1 ~ 90.1% higher estimated throughput than existing solutions with similar costs.
DCNov 18, 2025
Seer: Online Context Learning for Fast Synchronous LLM Reinforcement LearningRuoyu Qin, Weiran He, Weixiao Huang et al.
Reinforcement Learning (RL) has become critical for advancing modern Large Language Models (LLMs), yet existing synchronous RL systems face severe performance bottlenecks. The rollout phase, which dominates end-to-end iteration time, suffers from substantial long-tail latency and poor resource utilization due to inherent workload imbalance. We present Seer, a novel online context learning system that addresses these challenges by exploiting previously overlooked similarities in output lengths and generation patterns among requests sharing the same prompt. Seer introduces three key techniques: divided rollout for dynamic load balancing, context-aware scheduling, and adaptive grouped speculative decoding. Together, these mechanisms substantially reduce long-tail latency and improve resource efficiency during rollout. Evaluations on production-grade RL workloads demonstrate that Seer improves end-to-end rollout throughput by 74% to 97% and reduces long-tail latency by 75% to 93% compared to state-of-the-art synchronous RL systems, significantly accelerating RL training iterations.