NIOct 25, 2022
Teal: Learning-Accelerated Optimization of WAN Traffic EngineeringZhiying Xu, Francis Y. Yan, Rachee Singh et al.
The rapid expansion of global cloud wide-area networks (WANs) has posed a challenge for commercial optimization engines to efficiently solve network traffic engineering (TE) problems at scale. Existing acceleration strategies decompose TE optimization into concurrent subproblems but realize limited parallelism due to an inherent tradeoff between run time and allocation performance. We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control. First, Teal designs a flow-centric graph neural network (GNN) to capture WAN connectivity and network flows, learning flow features as inputs to downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand while optimizing a central TE objective. Finally, Teal fine-tunes allocations with ADMM (Alternating Direction Method of Multipliers), a highly parallelizable optimization algorithm for reducing constraint violations such as overutilized links. We evaluate Teal using traffic matrices from Microsoft's WAN. On a large WAN topology with >1,700 nodes, Teal generates near-optimal flow allocations while running several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies 6--32% more traffic demand and yields 197--625x speedups.
LGDec 2, 2022
AGO: Boosting Mobile AI Inference Performance by Removing Constraints on Graph OptimizationZhiying Xu, Hongding Peng, Wei Wang
Traditional deep learning compilers rely on heuristics for subgraph generation, which impose extra constraints on graph optimization, e.g., each subgraph can only contain at most one complex operator. In this paper, we propose AGO, a framework for graph optimization with arbitrary structures to boost the inference performance of deep models by removing such constraints. To create new optimization opportunities for complicated subgraphs, we propose intensive operator fusion, which can effectively stitch multiple complex operators together for better performance. Further, we design a graph partitioning scheme that allows an arbitrary structure for each subgraph while guaranteeing the acyclic property among all generated subgraphs. Additionally, to enable efficient performance tuning on complicated subgraphs, we devise a novel divide-and-conquer tuning mechanism to orchestrate different system components. Through extensive experiments on various neural networks and mobile devices, we show that our system can improve the inference performance by up to 3.3x when compared with state-of-the-art deep compilers.
LGOct 22, 2022
ALT: Boosting Deep Learning Performance by Breaking the Wall between Graph and Operator Level OptimizationsZhiying Xu, Jiafan Xu, Hongding Peng et al.
Deep learning models rely on highly optimized tensor libraries for efficient inference on heterogeneous hardware. Current deep compilers typically predetermine layouts of tensors and then optimize loops of operators. However, such unidirectional and one-off workflow strictly separates graph-level optimization and operator-level optimization into different system layers, missing opportunities for unified tuning. This paper proposes ALT, a compiler that performs joint graph- and operator-level optimizations for deep models. ALT provides a generic transformation module to manipulate layouts and loops with easy-to-use primitive functions. ALT further integrates an auto-tuning module that jointly optimizes graph-level data layouts and operator-level loops while guaranteeing efficiency. Experimental results show that ALT significantly outperforms state-of-the-art compilers (e.g., Ansor) in terms of both single operator performance (e.g., 1.5x speedup on average) and end-to-end inference performance (e.g., 1.4x speedup on average).
DCJan 28
StreamFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUsJiacheng Yang, Jun Wu, Yaoyao Ding et al.
Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for network topologies on modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present StreamFusion, a topology-aware efficient DiT serving engine. StreamFusion incorporates three key innovations: (1) a topology-aware sequence parallelism technique that accounts for inter- and intra-machine bandwidth differences, (2) Torus Attention, a novel SP technique enabling overlapping of inter-machine all-to-all operations with computation, and (3) a one-sided communication implementation that minimizes GPU sender-receiver synchronization and computation overheads. Our experiments demonstrate that StreamFusion outperforms the state-of-the-art approach by an average of $1.35\times$ (up to $1.77\times$).
LGFeb 3
Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache QuantizationHaocheng Xi, Shuo Yang, Yilong Zhao et al.
Despite rapid progress in autoregressive video diffusion, an emerging system algorithm bottleneck limits both deployability and generation capability: KV cache memory. In autoregressive video generation models, the KV cache grows with generation history and quickly dominates GPU memory, often exceeding 30 GB, preventing deployment on widely available hardware. More critically, constrained KV cache budgets restrict the effective working memory, directly degrading long horizon consistency in identity, layout, and motion. To address this challenge, we present Quant VideoGen (QVG), a training free KV cache quantization framework for autoregressive video diffusion models. QVG leverages video spatiotemporal redundancy through Semantic Aware Smoothing, producing low magnitude, quantization friendly residuals. It further introduces Progressive Residual Quantization, a coarse to fine multi stage scheme that reduces quantization error while enabling a smooth quality memory trade off. Across LongCat Video, HY WorldPlay, and Self Forcing benchmarks, QVG establishes a new Pareto frontier between quality and memory efficiency, reducing KV cache memory by up to 7.0 times with less than 4% end to end latency overhead while consistently outperforming existing baselines in generation quality.
CLJan 26, 2025Code
Qwen2.5-1M Technical ReportAn Yang, Bowen Yu, Chengyuan Li et al.
We introduce Qwen2.5-1M, a series of models that extend the context length to 1 million tokens. Compared to the previous 128K version, the Qwen2.5-1M series have significantly enhanced long-context capabilities through long-context pre-training and post-training. Key techniques such as long data synthesis, progressive pre-training, and multi-stage supervised fine-tuning are employed to effectively enhance long-context performance while reducing training costs. To promote the use of long-context models among a broader user base, we present and open-source our inference framework. This framework includes a length extrapolation method that can expand the model context lengths by at least four times, or even more, without additional training. To reduce inference costs, we implement a sparse attention method along with chunked prefill optimization for deployment scenarios and a sparsity refinement method to improve precision. Additionally, we detail our optimizations in the inference engine, including kernel optimization, pipeline parallelism, and scheduling optimization, which significantly enhance overall inference performance. By leveraging our inference framework, the Qwen2.5-1M models achieve a remarkable 3x to 7x prefill speedup in scenarios with 1 million tokens of context. This framework provides an efficient and powerful solution for developing applications that require long-context processing using open-source models. The Qwen2.5-1M series currently includes the open-source models Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, as well as the API-accessed model Qwen2.5-Turbo. Evaluations show that Qwen2.5-1M models have been greatly improved in long-context tasks without compromising performance in short-context scenarios. Specifically, the Qwen2.5-14B-Instruct-1M model significantly outperforms GPT-4o-mini in long-context tasks and supports contexts eight times longer.
DCDec 22, 2025
UCCL-EP: Portable Expert-Parallel CommunicationZiming Mao, Yihan Zhang, Chihan Cui et al.
Mixture-of-Experts (MoE) workloads rely on expert parallelism (EP) to achieve high GPU efficiency. State-of-the-art EP communication systems such as DeepEP demonstrate strong performance but exhibit poor portability across heterogeneous GPU and NIC platforms. The poor portability is rooted in architecture: GPU-initiated token-level RDMA communication requires tight vertical integration between GPUs and NICs, e.g., GPU writes to NIC driver/MMIO interfaces. We present UCCL-EP, a portable EP communication system that delivers DeepEP-level performance across heterogeneous GPU and NIC hardware. UCCL-EP replaces GPU-initiated RDMA with a high-throughput GPU-CPU control channel: compact token-routing commands are transferred to multithreaded CPU proxies, which then issue GPUDirect RDMA operations on behalf of GPUs. UCCL-EP further emulates various ordering semantics required by specialized EP communication modes using RDMA immediate data, enabling correctness on NICs that lack such ordering, e.g., AWS EFA. We implement UCCL-EP on NVIDIA and AMD GPUs with EFA and Broadcom NICs. On EFA, it outperforms the best existing EP solution by up to $2.1\times$ for dispatch and combine throughput. On NVIDIA-only platform, UCCL-EP achieves comparable performance to the original DeepEP. UCCL-EP also improves token throughput on SGLang by up to 40% on the NVIDIA+EFA platform, and improves DeepSeek-V3 training throughput over the AMD Primus/Megatron-LM framework by up to 45% on a 16-node AMD+Broadcom platform.
83.4DCApr 19
CCCL: In-GPU Compression-Coupled Collective CommunicationChon Lam Lao, Zhiying Xu, Zhuang Wang et al.
Collective communication incurs significant overhead in LLM workloads. Although overlapping communication with computation in application-level is a common strategy, it often requires substantial code modifications and is impractical for many workloads (e.g., tensor and expert parallelism). We present CCCL, a built-in compression-based collective communication library that supports operations such as allreduce, alltoall, and send/recv without requiring any user-side changes, thereby enabling seamless adoption in existing applications. CCCL tightly fuses compression kernels to minimize memory accesses and integrates with NCCL to eliminate the data coalescing stage, making it fast enough (up to 3x NVLink bandwidth) to sustain communication. Our evaluation shows that CCCL improves end-to-end throughput in vLLM PD disaggregation workloads by up to 10.1% and microbenchmark throughput by up to 30%.
DBJun 9, 2025
LEANN: A Low-Storage Vector IndexYichuan Wang, Shu Liu, Zhifei Li et al.
Embedding-based search is widely used in applications such as recommendation and retrieval-augmented generation (RAG). Recently, there is a growing demand to support these capabilities over personal data stored locally on devices. However, maintaining the necessary data structure associated with the embedding-based search is often infeasible due to its high storage overhead. For example, indexing 100 GB of raw data requires 150 to 700 GB of storage, making local deployment impractical. Reducing this overhead while maintaining search quality and latency becomes a critical challenge. In this paper, we present LEANN, a storage-efficient approximate nearest neighbor (ANN) search index optimized for resource-constrained personal devices. LEANN combines a compact graph-based structure with an efficient on-the-fly recomputation strategy to enable fast and accurate retrieval with minimal storage overhead. Our evaluation shows that LEANN reduces index size to under 5% of the original raw data, achieving up to 50 times smaller storage than standard indexes, while maintaining 90% top-3 recall in under 2 seconds on real-world question answering benchmarks.
LGDec 30, 2024
NetFlowGen: Leveraging Generative Pre-training for Network Traffic DynamicsJiawei Zhou, Woojeong Kim, Zhiying Xu et al.
Understanding the traffic dynamics in networks is a core capability for automated systems to monitor and analyze networking behaviors, reducing expensive human efforts and economic risks through tasks such as traffic classification, congestion prediction, and attack detection. However, it is still challenging to accurately model network traffic with machine learning approaches in an efficient and broadly applicable manner. Task-specific models trained from scratch are used for different networking applications, which limits the efficiency of model development and generalization of model deployment. Furthermore, while networking data is abundant, high-quality task-specific labels are often insufficient for training individual models. Large-scale self-supervised learning on unlabeled data provides a natural pathway for tackling these challenges. We propose to pre-train a general-purpose machine learning model to capture traffic dynamics with only traffic data from NetFlow records, with the goal of fine-tuning for different downstream tasks with small amount of labels. Our presented NetFlowGen framework goes beyond a proof-of-concept for network traffic pre-training and addresses specific challenges such as unifying network feature representations, learning from large unlabeled traffic data volume, and testing on real downstream tasks in DDoS attack detection. Experiments demonstrate promising results of our pre-training framework on capturing traffic dynamics and adapting to different networking tasks.
CRMar 13, 2020
Automating Botnet Detection with Graph Neural NetworksJiawei Zhou, Zhiying Xu, Alexander M. Rush et al.
Botnets are now a major source for many network attacks, such as DDoS attacks and spam. However, most traditional detection methods heavily rely on heuristically designed multi-stage detection criteria. In this paper, we consider the neural network design challenges of using modern deep learning techniques to learn policies for botnet detection automatically. To generate training data, we synthesize botnet connections with different underlying communication patterns overlaid on large-scale real networks as datasets. To capture the important hierarchical structure of centralized botnets and the fast-mixing structure for decentralized botnets, we tailor graph neural networks (GNN) to detect the properties of these structures. Experimental results show that GNNs are better able to capture botnet structure than previous non-learning methods when trained with appropriate data, and that deeper GNNs are crucial for learning difficult botnet topologies. We believe our data and studies can be useful for both the network security and graph learning communities.
CRDec 19, 2019
An Adaptive and Fast Convergent Approach to Differentially Private Deep LearningZhiying Xu, Shuyu Shi, Alex X. Liu et al.
With the advent of the era of big data, deep learning has become a prevalent building block in a variety of machine learning or data mining tasks, such as signal processing, network modeling and traffic analysis, to name a few. The massive user data crowdsourced plays a crucial role in the success of deep learning models. However, it has been shown that user data may be inferred from trained neural models and thereby exposed to potential adversaries, which raises information security and privacy concerns. To address this issue, recent studies leverage the technique of differential privacy to design private-preserving deep learning algorithms. Albeit successful at privacy protection, differential privacy degrades the performance of neural models. In this paper, we develop ADADP, an adaptive and fast convergent learning algorithm with a provable privacy guarantee. ADADP significantly reduces the privacy cost by improving the convergence speed with an adaptive learning rate and mitigates the negative effect of differential privacy upon the model accuracy by introducing adaptive noise. The performance of ADADP is evaluated on real-world datasets. Experiment results show that it outperforms state-of-the-art differentially private approaches in terms of both privacy cost and model accuracy.
CRNov 27, 2019
Reviewing and Improving the Gaussian Mechanism for Differential PrivacyJun Zhao, Teng Wang, Tao Bai et al.
Differential privacy provides a rigorous framework to quantify data privacy, and has received considerable interest recently. A randomized mechanism satisfying $(ε, δ)$-differential privacy (DP) roughly means that, except with a small probability $δ$, altering a record in a dataset cannot change the probability that an output is seen by more than a multiplicative factor $e^ε $. A well-known solution to $(ε, δ)$-DP is the Gaussian mechanism initiated by Dwork et al. [1] in 2006 with an improvement by Dwork and Roth [2] in 2014, where a Gaussian noise amount $\sqrt{2\ln \frac{2}δ} \times \fracΔε$ of [1] or $\sqrt{2\ln \frac{1.25}δ} \times \fracΔε$ of [2] is added independently to each dimension of the query result, for a query with $\ell_2$-sensitivity $Δ$. Although both classical Gaussian mechanisms [1,2] assume $0 < ε\leq 1$, our review finds that many studies in the literature have used the classical Gaussian mechanisms under values of $ε$ and $δ$ where the added noise amounts of [1,2] do not achieve $(ε,δ)$-DP. We obtain such result by analyzing the optimal noise amount $σ_{DP-OPT}$ for $(ε,δ)$-DP and identifying $ε$ and $δ$ where the noise amounts of classical mechanisms are even less than $σ_{DP-OPT}$. Since $σ_{DP-OPT}$ has no closed-form expression and needs to be approximated in an iterative manner, we propose Gaussian mechanisms by deriving closed-form upper bounds for $σ_{DP-OPT}$. Our mechanisms achieve $(ε,δ)$-DP for any $ε$, while the classical mechanisms [1,2] do not achieve $(ε,δ)$-DP for large $ε$ given $δ$. Moreover, the utilities of our mechanisms improve those of [1,2] and are close to that of the optimal yet more computationally expensive Gaussian mechanism.
CRNov 6, 2019
A Survey of Blockchain Applications in Different DomainsWubing Chen, Zhiying Xu, Shuyu Shi et al.
Blockchains have received much attention recently since they provide decentralized approaches to the creation and management of value. Many banks, Internet companies, car manufacturers, and even governments worldwide have incorporated or started considering blockchains to improve the security, scalability, and efficiency of their services. In this paper, we survey blockchain applications in different areas. These areas include cryptocurrency, healthcare, advertising, insurance, copyright protection, energy, and societal applications. Our work provides a timely summary for individuals and organizations interested in blockchains. We envision our study to motivate more blockchain applications.
SIMar 27, 2017
De-anonymization of Social Networks with Communities: When Quantifications Meet AlgorithmsLuoyi Fu, Xinzhe Fu, Zhongzhao Hu et al.
A crucial privacy-driven issue nowadays is re-identifying anonymized social networks by mapping them to correlated cross-domain auxiliary networks. Prior works are typically based on modeling social networks as random graphs representing users and their relations, and subsequently quantify the quality of mappings through cost functions that are proposed without sufficient rationale. Also, it remains unknown how to algorithmically meet the demand of such quantifications, i.e., to find the minimizer of the cost functions. We address those concerns in a more realistic social network modeling parameterized by community structures that can be leveraged as side information for de-anonymization. By Maximum A Posteriori (MAP) estimation, our first contribution is new and well justified cost functions, which, when minimized, enjoy superiority to previous ones in finding the correct mapping with the highest probability. The feasibility of the cost functions is then for the first time algorithmically characterized. While proving the general multiplicative inapproximability, we are able to propose two algorithms, which, respectively, enjoy an ε-additive approximation and a conditional optimality in carrying out successful user re-identification. Our theoretical findings are empirically validated, with a notable dataset extracted from rare true cross-domain networks that reproduce genuine social network de-anonymization. Both theoretical and empirical observations also manifest the importance of community information in enhancing privacy inferencing.