NIApr 1
EvalNet: A Practical Toolchain for Generation and Analysis of Extreme-Scale InterconnectsMaciej Besta, Patrick Iff, Marcel Schneider et al.
The diversity of communication paths in a network, especially non-minimal paths, is a key enabler of performance at extreme scales. We present EvalNet, a toolchain for scalable generation and analysis of over 25 important network topologies, such as Slim Fly, PolarFly, and Orthogonal Fat Trees, with a strong focus on path diversity metrics. EvalNet provides an extensive and fine-grained analysis of shortest and non-shortest paths, including their multiplicities, lengths, and interference. It supports exact measurement and visualization of bandwidth and throughput between every router pair, enabling unprecedented insight into routing potential. EvalNet also includes detailed models for construction cost and power consumption, and interfaces seamlessly with established simulators, which we tune to support large-scale evaluations on low-cost hardware. Using EvalNet, we deliver the widest and most comprehensive path diversity study to date, demonstrating how path diversity underpins throughput and scalability, and facilitating progress towards new frontiers in extreme-scale network design.
ARJun 17, 2025
Scaling Intelligence: Designing Data Centers for Next-Gen Language ModelsJesmin Jahan Tithi, Hanjiang Wu, Avishaii Abuhatzera et al.
The explosive growth of Large Language Models (LLMs), such as GPT-4 with 1.8 trillion parameters, demands a fundamental rethinking of data center architecture to ensure scalability, efficiency, and cost-effectiveness. Our work provides a comprehensive co-design framework that jointly explores FLOPS, HBM bandwidth and capacity, multiple network topologies (two-tier vs. FullFlat optical), the size of the scale-out domain, and popular parallelism/optimization strategies used in LLMs. We introduce and evaluate FullFlat network architectures, which provide uniform high-bandwidth, low-latency connectivity between all nodes, and demonstrate their transformative impact on performance and scalability. Through detailed sensitivity analyses, we quantify the benefits of overlapping compute and communication, leveraging hardware-accelerated collectives, widening the scale-out domain, and increasing memory capacity. Our study spans both sparse (mixture of experts) and dense transformer-based LLMs, revealing how system design choices affect Model FLOPS Utilization (MFU = Model FLOPS per token * Observed tokens per second / Peak FLOPS of the hardware) and overall throughput. For the co-design study, we utilized an analytical performance modeling tool capable of predicting LLM runtime within 10% of real-world measurements. Our findings offer actionable insights and a practical roadmap for designing AI data centers that can efficiently support trillion-parameter models, reduce optimization complexity, and sustain the rapid evolution of AI capabilities.
LGAug 29, 2025
ReLATE: Learning Efficient Sparse Encoding for High-Performance Tensor DecompositionAhmed E. Helal, Fabio Checconi, Jan Laukemann et al.
Tensor decomposition (TD) is essential for analyzing high-dimensional sparse data, yet its irregular computations and memory-access patterns pose major performance challenges on modern parallel processors. Prior works rely on expert-designed sparse tensor formats that fail to adapt to irregular tensor shapes and/or highly variable data distributions. We present the reinforcement-learned adaptive tensor encoding (ReLATE) framework, a novel learning-augmented method that automatically constructs efficient sparse tensor representations without labeled training samples. ReLATE employs an autonomous agent that discovers optimized tensor encodings through direct interaction with the TD environment, leveraging a hybrid model-free and model-based algorithm to learn from both real and imagined actions. Moreover, ReLATE introduces rule-driven action masking and dynamics-informed action filtering mechanisms that ensure functionally correct tensor encoding with bounded execution time, even during early learning stages. By automatically adapting to both irregular tensor shapes and data distributions, ReLATE generates sparse tensor representations that consistently outperform expert-designed formats across diverse sparse tensor data sets, achieving up to 2X speedup compared to the best sparse format, with a geometric-mean speedup of 1.4-1.46X.
AIJun 11, 2024
Efficient Parallel Multi-Hop Reasoning: A Scalable Approach for Knowledge Graph AnalysisJesmin Jahan Tithi, Fabio Checconi, Fabrizio Petrini
Multi-hop reasoning (MHR) is a process in artificial intelligence and natural language processing where a system needs to make multiple inferential steps to arrive at a conclusion or answer. In the context of knowledge graphs or databases, it involves traversing multiple linked entities and relationships to understand complex queries or perform tasks requiring a deeper understanding. Multi-hop reasoning is a critical function in various applications, including question answering, knowledge base completion, and link prediction. It has garnered significant interest in artificial intelligence, machine learning, and graph analytics. This paper focuses on optimizing MHR for time efficiency on large-scale graphs, diverging from the traditional emphasis on accuracy which is an orthogonal goal. We introduce a novel parallel algorithm that harnesses domain-specific learned embeddings to efficiently identify the top K paths between vertices in a knowledge graph to find the best answers to a three-hop query. Our contributions are: (1) We present a new parallel algorithm to enhance MHR performance, scalability and efficiency. (2) We demonstrate the algorithm's superior performance on leading-edge Intel and AMD architectures through empirical results. We showcase the algorithm's practicality through a case study on identifying academic affiliations of potential Turing Award laureates in Deep Learning, highlighting its capability to handle intricate entity relationships. This demonstrates the potential of our approach to enabling high-performance MHR, useful to navigate the growing complexity of modern knowledge graphs.
DCJul 14, 2021
A New Parallel Algorithm for Sinkhorn Word-Movers Distance and Its Performance on PIUMA and Xeon CPUJesmin Jahan Tithi, Fabrizio Petrini
The Word Movers Distance (WMD) measures the semantic dissimilarity between two text documents by computing the cost of optimally moving all words of a source/query document to the most similar words of a target document. Computing WMD between two documents is costly because it requires solving an $O(V^3log(V))$ optimization problem where $V$ is the number of unique words in the document. Fortunately, WMD can be framed as an Earth Mover's Distance (EMD) for which the algorithmic complexity can be reduced to $O(V^2)$ by adding an entropy penalty to the optimization problem and solving it using the Sinkhorn-Knopp algorithm. Additionally, the computation can be made highly parallel by adopting a batching approach, i.e., computing the WMD of a single query document against multiple target documents at once. Sinkhorn WMD is a key kernel used in many ML/NLP applications. and usually gets implemented in Python. However, a straightforward Python implementation may leave significant performance on the table even though it may internally call optimized C++ BLAS routines. We present a new sparse {P}arallel {A}lgorithm for {S}inkhorn-Knopp {W}ord-movers {D}istance to compute the semantic distance of one document to many other documents by adopting the $O(V^2)$ EMD algorithm. We algorithmically transform $O(V^2)$ dense compute-heavy EMD version into an equivalent sparse one using new fused SDDMM-SpMM (sparse selection of dense-dense matrix-, sparse-dense matrix-multiplication) kernels. We implemented and optimized this algorithm for two very different architectures -- the new Intel Programmable Integrated Unified Memory Architecture (PIUMA) and Intel Xeon CPUs. We show that we were able to reach close to peak performance on both platforms.
LGMay 14, 2020
An Efficient Shared-memory Parallel Sinkhorn-Knopp Algorithm to Compute the Word Mover's DistanceJesmin Jahan Tithi, Fabrizio Petrini
The Word Mover's Distance (WMD) is a metric that measures the semantic dissimilarity between two text documents by computing the cost of moving all words of a source/query document to the most similar words of a target document optimally. Computing WMD between two documents is costly because it requires solving an optimization problem that costs \(O(V^3log(V))\) where \(V\) is the number of unique words in the document. Fortunately, the WMD can be framed as the Earth Mover's Distance (EMD) (also known as the Optimal Transportation Distance) for which it has been shown that the algorithmic complexity can be reduced to \(O(V^2)\) by adding an entropy penalty to the optimization problem and a similar idea can be adapted to compute WMD efficiently. Additionally, the computation can be made highly parallel by computing WMD of a single query document against multiple target documents at once (e.g., finding whether a given tweet is similar to any other tweets happened in a day). In this paper, we present a shared-memory parallel Sinkhorn-Knopp Algorithm to compute the WMD of one document against many other documents by adopting the \(O(V^2)\) EMD algorithm. We used algorithmic transformations to change the original dense compute-heavy kernel to a sparse compute kernel and obtained \(67\times\) speedup using \(96\) cores on the state-of-the-art of Intel\textregistered{} 4-sockets Cascade Lake machine w.r.t. its sequential run. Our parallel algorithm is over \(700\times\) faster than the naive parallel python code that internally uses optimized matrix library calls.
DCMar 26, 2020
Online and Real-time Object Tracking Algorithm with Extremely Small MatricesJesmin Jahan Tithi, Sriram Aananthakrishnan, Fabrizio Petrini
Online and Real-time Object Tracking is an interesting workload that can be used to track objects (e.g., car, human, animal) in a series of video sequences in real-time. For simple object tracking on edge devices, the output of object tracking could be as simple as drawing a bounding box around a detected object and in some cases, the input matrices used in such computation are quite small (e.g., 4x7, 3x3, 5x5, etc). As a result, the amount of actual work is low. Therefore, a typical multi-threading based parallelization technique can not accelerate the tracking application; instead, a throughput based parallelization technique where each thread operates on independent video sequences is more rewarding. In this paper, we share our experience in parallelizing a Simple Online and Real-time Tracking (SORT) application on shared-memory multicores.