Mengbai Xiao

LG
h-index10
4papers
85citations
Novelty61%
AI Score45

4 Papers

CGMar 26
GPU-Accelerated Computation of Vietoris-Rips Persistence Barcodes

Simon Zhang, Mengbai Xiao, Hao Wang

The computation of Vietoris-Rips persistence barcodes is both execution-intensive and memory-intensive. In this paper, we study its computational structure and identify several unique mathematical properties and algorithmic opportunities with connections to the GPU. Mathematically and empirically, we look into the properties of apparent pairs, which are independently identifiable persistence pairs comprising up to 99\% of persistence pairs. We prove tight upper and lower bounds of the apparent pair rate and some probabilistic lower bounds. We also design massively parallel algorithms to take advantage of the very large number of simplices that can be processed independently of each other. Having identified these opportunities, we develop a GPU-accelerated software for computing Vietoris-Rips persistence barcodes, called Ripser++. Under nice sampling conditions, we show that the expected work complexity of our algorithm is near linear in the number of simplices. The expected depth complexity is dependent only on the computation of the expected number of $p$-dimensional homological cycles. The software achieves up to 30x speedup over the total execution time of the original Ripser and also reduces CPU-memory usage by up to 2.0x. We believe our GPU-acceleration based efforts open a new chapter for the advancement of topological data analysis in the post-Moore's Law era.

AIApr 29, 2025
TAMO: Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data in Cloud-Native Systems

Xiao Zhang, Qi Wang, Mingyi Li et al.

Implementing large language models (LLMs)-driven root cause analysis (RCA) in cloud-native systems has become a key topic of modern software operations and maintenance. However, existing LLM-based approaches face three key challenges: multi-modality input constraint, context window limitation, and dynamic dependence graph. To address these issues, we propose a tool-assisted LLM agent with multi-modality observation data for fine-grained RCA, namely TAMO, including multimodality alignment tool, root cause localization tool, and fault types classification tool. In detail, TAMO unifies multi-modal observation data into time-aligned representations for cross-modal feature consistency. Based on the unified representations, TAMO then invokes its specialized root cause localization tool and fault types classification tool for further identifying root cause and fault type underlying system context. This approach overcomes the limitations of LLMs in processing real-time raw observational data and dynamic service dependencies, guiding the model to generate repair strategies that align with system context through structured prompt design. Experiments on two benchmark datasets demonstrate that TAMO outperforms state-of-the-art (SOTA) approaches with comparable performance.

LGApr 5, 2025
SpecPipe: Accelerating Pipeline Parallelism-based LLM Inference with Speculative Decoding

Haofei Yin, Mengbai Xiao, Tinghong Li et al.

The demand for large language model inference is rapidly increasing. Pipeline parallelism offers a cost-effective deployment strategy for distributed inference but suffers from high service latency. While incorporating speculative decoding to pipeline parallelism improves performance, it still faces challenges of low hardware utilization and narrow speculative window. Inspired by branch prediction in instruction pipelining, we introduce SpecPipe, which fills the pipeline with speculative tokens of a request step-by-step. By maximizing the hardware utilization, SpecPipe decodes one token per pipeline step ideally. Specifically, SpecPipe comprises a dynamic speculative token tree and a pipelined inference framework. The tree dynamically accepts tokens from a speculative token source and outputs the tokens to the inference pipeline. Since the speculative window relaxed in our framework, a high-accuracy draft model is integrated without fine-tuning. The pipeline inference framework follows node-wise computation, pruning propagation, and inter-node communication stages. We implement SpecPipe and a variant SpecPipe-DB with dynamic batching for single- and multi-request inference, respectively. On an 8-stage pipeline, SpecPipe improves time between tokens on diverse single-request workloads by $4.19\times$-$5.53\times$ over standard pipeline parallelism and by $2.08\times$-$2.38\times$ over prior tree-based speculative decoding methods. For multi-request workloads, SpecPipe-DB achieves $1.64\times$-$2.08\times$ higher throughput and $1.61\times$-$2.06\times$ lower time between tokens than vLLM.

LGJul 23, 2025
DistrAttention: An Efficient and Flexible Self-Attention Mechanism on Modern GPUs

Haolin Jin, Mengbai Xiao, Yuan Yuan et al.

The Transformer architecture has revolutionized deep learning, delivering the state-of-the-art performance in areas such as natural language processing, computer vision, and time series prediction. However, its core component, self-attention, has the quadratic time complexity relative to input sequence length, which hinders the scalability of Transformers. The exsiting approaches on optimizing self-attention either discard full-contextual information or lack of flexibility. In this work, we design DistrAttention, an effcient and flexible self-attention mechanism with the full context. DistrAttention achieves this by grouping data on the embedding dimensionality, usually referred to as $d$. We realize DistrAttention with a lightweight sampling and fusion method that exploits locality-sensitive hashing to group similar data. A block-wise grouping framework is further designed to limit the errors introduced by locality sensitive hashing. By optimizing the selection of block sizes, DistrAttention could be easily integrated with FlashAttention-2, gaining high-performance on modern GPUs. We evaluate DistrAttention with extensive experiments. The results show that our method is 37% faster than FlashAttention-2 on calculating self-attention. In ViT inference, DistrAttention is the fastest and the most accurate among approximate self-attention mechanisms. In Llama3-1B, DistrAttention still achieves the lowest inference time with only 1% accuray loss.