Pengfei Su

CL
h-index4
7papers
52citations
Novelty52%
AI Score53

7 Papers

DCApr 21
CoCoDiff: Optimizing Collective Communications for Distributed Diffusion Transformer Inference Under Ulysses Sequence Parallelism

Bin Ma, Xingjian Ding, Tekin Bicer et al.

Diffusion Transformers (DiTs) are increasingly adopted in scientific computing, yet growing model sizes and resolutions make distributed multi-GPU inference essential. Ulysses sequence parallelism scales DiT inference but introduces frequent all-to-all collectives that dominate latency. Overlapping these with computation is difficult due to tight data dependencies, large message volumes, and asymmetric interconnect bandwidths. We introduce CoCoDiff, a distributed DiT inference engine exploiting two observations: (1) V requires only linear projection while Q/K need additional normalization and RoPE, creating opportunities to overlap V's communication with Q/K computation; (2) adjacent denoising steps produce similar tensors, yielding temporal redundancy. CoCoDiff introduces three mechanisms: Tile-Aware Parallel All-to-all (TAPA) decomposes collectives into topology-aligned phases; V-First scheduling hides V's communication behind Q/K computation; and V-Major selective communication transmits only active projections on slow interconnects. On the Aurora supercomputer with four DiT models across 1-8 nodes (up to 96 Intel GPU tiles), CoCoDiff achieves an average speedup of 3.6x, peaking at 8.4x.

OSApr 14
Hybrid Adaptive Tuning for Tiered Memory Systems

Xi Wang, Jie Liu, Shuangyan Yang et al.

Memory tiering provides a cost-effective solution to increase memory capacity, utilization, and even bandwidth. Memory tiering relies on system software for memory profiling, detection of frequently accessed pages, and page migration. Such a system software often comes with system parameters. The configurations of those parameters impact application performance. We comprehensively classify system parameters, and characterize the sensitivity of application performance to them using representative memory tiering solutions. Furthermore, we introduce a lightweight and user-friendly framework PTMT, which automates tuning of parameters at runtime for various memory tiering solutions. We identify major challenges for online tuning of memory tiering. PTMT uses a hybrid "offline + online" tuning method: while the offline phase builds a performance database for online queries and reduces runtime overhead, the online phase uses reinforcement learning (customized to memory tiering) to tune. PTMT improves performance by 30%, 26%, 21%, and 14%, on four memory tiering solutions (TPP, UPM, Colloid, and AutoNUMA), compared to using the default configurations. PTMT outperforms the state-of-the-art by 32% on average.

SYApr 23
A Convexified Eulerian Framework for Scalable Coordination of Massive DER Populations

Ge Chen, Yiwei Qiu, Shiyao Zhang et al.

This paper proposes a scalable coordination framework with aggregator-side privacy protection for storage-like distributed energy resources (DERs). The framework adopts a two-layer architecture. At the macroscopic layer, building upon an \emph{Eulerian} modeling perspective, the DER population is represented as a continuum whose density evolution is governed by a partial differential equation (PDE), such that the computational complexity is independent of the population size. To address the bilinear non-convexity in this PDE-constrained optimization problem, we develop a convexification method that combines finite-volume discretization with a flux-lifting technique, reformulating the macroscopic problem into a sparse linear program (LP). The LP solution yields a unified, state-dependent broadcast signal for population coordination. Furthermore, a Wasserstein-based relaxation is introduced to replace rigid cyclic constraints and provide additional operational flexibility for improved economic performance. At the microscopic layer, individual resources autonomously recover local setpoints from the broadcast signal and their local states, while an upstream data-mixing protocol aggregates individual states into a macroscopic density histogram without exposing raw individual states to the aggregator. Numerical studies validate the scalability, feasibility, and economic effectiveness of the proposed framework.

CLOct 29, 2025
AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache

Dinghong Song, Yuan Feng, Yiwei Wang et al.

Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. However, many realworld workloads such as classification, question answering, recommendation, and text embedding rely solely on the prefill stage of inference, where the model encodes input sequences without performing autoregressive decoding. In these prefill only scenarios, the self-attention computation becomes the primary performance bottleneck due to its quadratic complexity with respect to sequence length. In this paper, we observe that semantically different sentences often produce similar attention maps across layers and heads. Building on this insight, we propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. Based on an attention map memorization database, AttnCache employs efficient caching and similarity search techniques to identify and reuse pre-cached attention maps during inference, thereby reducing the computational overhead of self-attention. Experimental results show that AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.

CLOct 29, 2025
NeuronMM: High-Performance Matrix Multiplication for LLM Inference on AWS Trainium

Dinghong Song, Jierui Xu, Weichu Yang et al.

AI accelerators, customized to AI workloads, provide cost-effective and high-performance solutions for training and inference. Trainium, an AI accelerator recently developed by Amazon Web Services (AWS), provides an attractive option for LLM training and inference through its heterogeneous architecture. However, leveraging Trainium architecture for high performance can be challenging because of its systolic array architecture and special requirement on data layout. In this paper, we design high-performance matrix multiplication (matmul), a critical compute kernel, for LLM inference on Trainium. We introduce a series of techniques customized to Trainium based on kernel fusion and novel caching strategies to reduce data movement across the software-managed memory hierarchy, maximize SRAM bandwidth, and avoid expensive matrix transpose. Evaluating with nine datasets and four recent LLMs, we show that our system largely outperforms the state-of-the-art matmul implemented by AWS on Trainium: at the level of matmul kernel, it achieves an average 1.35x speedup (up to 2.22x), which translates to an average 1.66x speedup (up to 2.49x) for end-to-end LLM inference.

PFJun 28, 2019
Pinpointing Performance Inefficiencies in Java

Pengfei Su, Qingsen Wang, Milind Chabbi et al.

Many performance inefficiencies such as inappropriate choice of algorithms or data structures, developers' inattention to performance, and missed compiler optimizations show up as wasteful memory operations. Wasteful memory operations are those that produce/consume data to/from memory that may have been avoided. We present, JXPerf, a lightweight performance analysis tool for pinpointing wasteful memory operations in Java programs. Traditional byte-code instrumentation for such analysis (1) introduces prohibitive overheads and (2) misses inefficiencies in machine code generation. JXPerf overcomes both of these problems. JXPerf uses hardware performance monitoring units to sample memory locations accessed by a program and uses hardware debug registers to monitor subsequent accesses to the same memory. The result is a lightweight measurement at machine-code level with attribution of inefficiencies to their provenance: machine and source code within full calling contexts. JXPerf introduces only 7% runtime overhead and 7% memory overhead making it useful in production. Guided by JXPerf, we optimize several Java applications by improving code generation and choosing superior data structures and algorithms, which yield significant speedups.

PFFeb 14, 2019
Redundant Loads: A Software Inefficiency Indicator

Pengfei Su, Shasha Wen, Hailong Yang et al.

Modern software packages have become increasingly complex with millions of lines of code and references to many external libraries. Redundant operations are a common performance limiter in these code bases. Missed compiler optimization opportunities, inappropriate data structure and algorithm choices, and developers' inattention to performance are some common reasons for the existence of redundant operations. Developers mainly depend on compilers to eliminate redundant operations. However, compilers' static analysis often misses optimization opportunities due to ambiguities and limited analysis scope; automatic optimizations to algorithmic and data structural problems are out of scope. We develop LoadSpy, a whole-program profiler to pinpoint redundant memory load operations, which are often a symptom of many redundant operations. The strength of LoadSpy exists in identifying and quantifying redundant load operations in programs and associating the redundancies with program execution contexts and scopes to focus developers' attention on problematic code. LoadSpy works on fully optimized binaries, adopts various optimization techniques to reduce its overhead, and provides a rich graphic user interface, which make it a complete developer tool. Applying LoadSpy showed that a large fraction of redundant loads is common in modern software packages despite highest levels of automatic compiler optimizations. Guided by LoadSpy, we optimize several well-known benchmarks and real-world applications, yielding significant speedups.