LGApr 10, 2023
RAPID: Enabling Fast Online Policy Learning in Dynamic Public Cloud EnvironmentsDrew Penney, Bin Li, Lizhong Chen et al.
Resource sharing between multiple workloads has become a prominent practice among cloud service providers, motivated by demand for improved resource utilization and reduced cost of ownership. Effective resource sharing, however, remains an open challenge due to the adverse effects that resource contention can have on high-priority, user-facing workloads with strict Quality of Service (QoS) requirements. Although recent approaches have demonstrated promising results, those works remain largely impractical in public cloud environments since workloads are not known in advance and may only run for a brief period, thus prohibiting offline learning and significantly hindering online learning. In this paper, we propose RAPID, a novel framework for fast, fully-online resource allocation policy learning in highly dynamic operating environments. RAPID leverages lightweight QoS predictions, enabled by domain-knowledge-inspired techniques for sample efficiency and bias reduction, to decouple control from conventional feedback sources and guide policy learning at a rate orders of magnitude faster than prior work. Evaluation on a real-world server platform with representative cloud workloads confirms that RAPID can learn stable resource allocation policies in minutes, as compared with hours in prior state-of-the-art, while improving QoS by 9.0x and increasing best-effort workload performance by 19-43%.
CLJul 3, 2023
Implicit Memory Transformer for Computationally Efficient Simultaneous Speech TranslationMatthew Raffel, Lizhong Chen
Simultaneous speech translation is an essential communication task difficult for humans whereby a translation is generated concurrently with oncoming speech inputs. For such a streaming task, transformers using block processing to break an input sequence into segments have achieved state-of-the-art performance at a reduced cost. Current methods to allow information to propagate across segments, including left context and memory banks, have faltered as they are both insufficient representations and unnecessarily expensive to compute. In this paper, we propose an Implicit Memory Transformer that implicitly retains memory through a new left context method, removing the need to explicitly represent memory with memory banks. We generate the left context from the attention output of the previous segment and include it in the keys and values of the current segment's attention calculation. Experiments on the MuST-C dataset show that the Implicit Memory Transformer provides a substantial speedup on the encoder forward pass with nearly identical translation quality when compared with the state-of-the-art approach that employs both left context and memory banks.
CLJul 3, 2023
Shiftable Context: Addressing Training-Inference Context Mismatch in Simultaneous Speech TranslationMatthew Raffel, Drew Penney, Lizhong Chen
Transformer models using segment-based processing have been an effective architecture for simultaneous speech translation. However, such models create a context mismatch between training and inference environments, hindering potential translation accuracy. We solve this issue by proposing Shiftable Context, a simple yet effective scheme to ensure that consistent segment and context sizes are maintained throughout training and inference, even with the presence of partially filled segments due to the streaming nature of simultaneous translation. Shiftable Context is also broadly applicable to segment-based transformers for streaming tasks. Our experiments on the English-German, English-French, and English-Spanish language pairs from the MUST-C dataset demonstrate that when applied to the Augmented Memory Transformer, a state-of-the-art model for simultaneous speech translation, the proposed scheme achieves an average increase of 2.09, 1.83, and 1.95 BLEU scores across each wait-k value for the three language pairs, respectively, with a minimal impact on computation-aware Average Lagging.
CLDec 7, 2023Code
Simul-LLM: A Framework for Exploring High-Quality Simultaneous Translation with Large Language ModelsVictor Agostinelli, Max Wild, Matthew Raffel et al.
Large language models (LLMs) with billions of parameters and pretrained on massive amounts of data are now capable of near or better than state-of-the-art performance in a variety of downstream natural language processing tasks. Neural machine translation (NMT) is one such task that LLMs have been applied to with great success. However, little research has focused on applying LLMs to the more difficult subset of NMT called simultaneous translation (SimulMT), where translation begins before the entire source context is available to the model. In this paper, we address key challenges facing LLMs fine-tuned for SimulMT, validate classical SimulMT concepts and practices in the context of LLMs, explore adapting LLMs that are fine-tuned for NMT to the task of SimulMT, and introduce Simul-LLM, the first open-source fine-tuning and evaluation pipeline development framework for LLMs focused on SimulMT.
18.6LGMay 16
PhysioSeq2Seq: A Hybrid Physiological Digital Twin and Sequence-to-Sequence LSTM for Long-Horizon Glucose Forecasting in Type 1 DiabetesPhat Tran, Neville Mehta, Clara Mosquera-Lopez et al.
Accurate long-horizon glucose forecasting is critical for automated insulin delivery systems, which help people with type 1 diabetes (T1D) manage their glucose and avoid dangerous hypoglycemia. However, standard recursive long short-term memory (LSTM) networks suffer from systematic negative bias at longer horizons due to error compounding, while purely mechanistic ordinary differential equation (ODE) models fail to generalize across individuals when parameterized at the population level. We propose PhysioSeq2Seq, a hybrid architecture that combines patient-specific physiological modeling with a sequence-to-sequence (Seq2Seq) LSTM. For each glucose segment, twin matching searches a population of 300 parameterized digital twins to identify the best-fitting physiological match from a 3-hour continuous glucose monitoring (CGM) history. The 10 internal ODE state variables of the matched twin are injected as exogenous covariates into both the encoder and decoder of the Seq2Seq LSTM. This simultaneous 48-step prediction strategy eliminates recursive error compounding, while the ODE features provide a physics-grounded constraint that bounds long-horizon drift within physiologically plausible ranges. PhysioSeq2Seq was trained on CGM and insulin data from 348 participants in the Type 1 Diabetes Exercise Initiative (T1DEXI) dataset and evaluated on 74 held-out participants. At the 240-minute horizon, PhysioSeq2Seq achieves a mean absolute error of 39.28 mg/dL and a mean error of -10.62 mg/dL, reducing bias by 13.89 mg/dL over the recursive LSTM and reducing mean absolute error by 28.62 mg/dL over the ODE-based digital twin. These results show that eliminating architectural feedback and injecting patient-matched physiological states is an effective and clinically meaningful strategy for long-horizon glucose forecasting in T1D.
CLMay 17, 2025Code
Towards Universal Semantics With Large Language ModelsRaymond Baartmans, Matthew Raffel, Rahul Vikram et al.
The Natural Semantic Metalanguage (NSM) is a linguistic theory based on a universal set of semantic primes: simple, primitive word-meanings that have been shown to exist in most, if not all, languages of the world. According to this framework, any word, regardless of complexity, can be paraphrased using these primes, revealing a clear and universally translatable meaning. These paraphrases, known as explications, can offer valuable applications for many natural language processing (NLP) tasks, but producing them has traditionally been a slow, manual process. In this work, we present the first study of using large language models (LLMs) to generate NSM explications. We introduce automatic evaluation methods, a tailored dataset for training and evaluation, and fine-tuned models for this task. Our 1B and 8B models outperform GPT-4o in producing accurate, cross-translatable explications, marking a significant step toward universal semantic representation with LLMs and opening up new possibilities for applications in semantic analysis, translation, and beyond. Our code is available at https://github.com/OSU-STARLAB/DeepNSM.
CLApr 17, 2023
Improving Autoregressive NLP Tasks via Modular Linearized AttentionVictor Agostinelli, Lizhong Chen
Various natural language processing (NLP) tasks necessitate models that are efficient and small based on their ultimate application at the edge or in other resource-constrained environments. While prior research has reduced the size of these models, increasing computational efficiency without considerable performance impacts remains difficult, especially for autoregressive tasks. This paper proposes modular linearized attention (MLA), which combines multiple efficient attention mechanisms, including cosFormer, to maximize inference quality while achieving notable speedups. We validate this approach on several autoregressive NLP tasks, including speech-to-text neural machine translation (S2T NMT), speech-to-text simultaneous translation (SimulST), and autoregressive text-to-spectrogram, noting efficiency gains on TTS and competitive performance for NMT and SimulST during training and inference.
LGJun 24, 2023
Partitioning-Guided K-Means: Extreme Empty Cluster Resolution for Extreme Model CompressionTianhong Huang, Victor Agostinelli, Lizhong Chen
Compactness in deep learning can be critical to a model's viability in low-resource applications, and a common approach to extreme model compression is quantization. We consider Iterative Product Quantization (iPQ) with Quant-Noise to be state-of-the-art in this area, but this quantization framework suffers from preventable inference quality degradation due to prevalent empty clusters. In this paper, we propose several novel enhancements aiming to improve the accuracy of iPQ with Quant-Noise by focusing on resolving empty clusters. Our contribution, which we call Partitioning-Guided k-means (PG k-means), is a heavily augmented k-means implementation composed of three main components. First, we propose a partitioning-based pre-assignment strategy that ensures no initial empty clusters and encourages an even weight-to-cluster distribution. Second, we propose an empirically superior empty cluster resolution heuristic executed via cautious partitioning of large clusters. Finally, we construct an optional optimization step that consolidates intuitively dense clusters of weights to ensure shared representation. The proposed approach consistently reduces the number of empty clusters in iPQ with Quant-Noise by 100x on average, uses 8x fewer iterations during empty cluster resolution, and improves overall model accuracy by up to 12%, when applied to RoBERTa on a variety of tasks in the GLUE benchmark.
CLMay 16, 2024
Simultaneous Masking, Not Prompting Optimization: A Paradigm Shift in Fine-tuning LLMs for Simultaneous TranslationMatthew Raffel, Victor Agostinelli, Lizhong Chen
Large language models (LLMs) have achieved state-of-the-art performance in various language processing tasks, motivating their adoption in simultaneous translation. Current fine-tuning methods to adapt LLMs for simultaneous translation focus on prompting optimization strategies using either data augmentation or prompt structure modifications. However, these methods suffer from several issues, such as unnecessarily expanded training sets, computational inefficiency from dumping the key and value cache, increased prompt sizes, or restriction to a single decision policy. To eliminate these issues, in this work, we propose SimulMask, a new paradigm for fine-tuning LLMs for simultaneous translation. It utilizes a novel attention mask approach that models simultaneous translation during fine-tuning by masking attention for a desired decision policy. Applying the proposed SimulMask on a Falcon LLM for the IWSLT 2017 dataset, we have observed a significant translation quality improvement compared to state-of-the-art prompting optimization strategies on five language pairs while reducing the computational cost.
LGFeb 11, 2025
MatrixKAN: Parallelized Kolmogorov-Arnold NetworkCale Coffman, Lizhong Chen
Kolmogorov-Arnold Networks (KAN) are a new class of neural network architecture representing a promising alternative to the Multilayer Perceptron (MLP), demonstrating improved expressiveness and interpretability. However, KANs suffer from slow training and inference speeds relative to MLPs due in part to the recursive nature of the underlying B-spline calculations. This issue is particularly apparent with respect to KANs utilizing high-degree B-splines, as the number of required non-parallelizable recursions is proportional to B-spline degree. We solve this issue by proposing MatrixKAN, a novel optimization that parallelizes B-spline calculations with matrix representation and operations, thus significantly improving effective computation time for models utilizing high-degree B-splines. In this paper, we demonstrate the superior scaling of MatrixKAN's computation time relative to B-spline degree. Further, our experiments demonstrate speedups of approximately 40x relative to KAN, with significant additional speedup potential for larger datasets or higher spline degrees.
CLMay 18, 2024
LeaPformer: Enabling Linear Transformers for Autoregressive and Simultaneous Tasks via Learned ProportionsVictor Agostinelli, Sanghyun Hong, Lizhong Chen
A promising approach to preserving model performance in linearized transformers is to employ position-based re-weighting functions. However, state-of-the-art re-weighting functions rely heavily on target sequence lengths, making it difficult or impossible to apply them to autoregressive and simultaneous tasks, where the target and sometimes even the input sequence length are unknown. To address this issue, we propose Learned Proportions (LeaP) and LeaPformers. Our contribution is built on two major components. First, we generalize the dependence on explicit positional representations and sequence lengths into dependence on sequence proportions for re-weighting. Second, we replace static positional representations with dynamic proportions derived via a compact module, enabling more flexible attention concentration patterns. We evaluate LeaPformer against eight representative efficient transformers on the Long-Range Arena benchmark, showing that LeaPformer achieves the best quality-throughput trade-off, as well as LeaPformer to Wikitext-103 autoregressive language modeling and simultaneous speech-to-text translation for two language pairs, achieving competitive results.
ARApr 25, 2024
FLAASH: Flexible Accelerator Architecture for Sparse High-Order Tensor ContractionGabriel Kulp, Andrew Ensinger, Lizhong Chen
Tensors play a vital role in machine learning (ML) and often exhibit properties best explored while maintaining high-order. Efficiently performing ML computations requires taking advantage of sparsity, but generalized hardware support is challenging. This paper introduces FLAASH, a flexible and modular accelerator design for sparse tensor contraction that achieves over 25x speedup for a deep learning workload. Our architecture performs sparse high-order tensor contraction by distributing sparse dot products, or portions thereof, to numerous Sparse Dot Product Engines (SDPEs). Memory structure and job distribution can be customized, and we demonstrate a simple approach as a proof of concept. We address the challenges associated with control flow to navigate data structures, high-order representation, and high-sparsity handling. The effectiveness of our approach is demonstrated through various evaluations, showcasing significant speedup as sparsity and order increase.
LGMay 20, 2025
FlashKAT: Understanding and Addressing Performance Bottlenecks in the Kolmogorov-Arnold TransformerMatthew Raffel, Lizhong Chen
The Kolmogorov-Arnold Network (KAN) has been gaining popularity as an alternative to the multi-layer perceptron (MLP) with its increased expressiveness and interpretability. Even so, the KAN suffers from being orders of magnitude slower due to its increased computational cost and training instability, limiting its applicability to larger-scale tasks. Recently, the Kolmogorov-Arnold Transformer (KAT) has been proposed, which can achieve FLOPs similar to the traditional Transformer with MLPs by leveraging Group-Rational KAN (GR-KAN). Unfortunately, despite the comparable FLOPs, our testing reveals that the KAT is still 123x slower in training speeds, indicating that there are other performance bottlenecks beyond FLOPs. In this paper, we conduct a series of experiments to understand the root cause of the slowdown in KAT. We uncover that the slowdown can be isolated to memory stalls, linked more specifically to inefficient gradient accumulations in the backward pass of GR-KAN. To address this memory bottleneck, we propose FlashKAT, which minimizes accesses to slow memory and the usage of atomic adds through a restructured kernel. Evaluations demonstrate that FlashKAT can achieve a training speedup of 86.5x compared with the state-of-the-art KAT, while reducing rounding errors in the computation of the gradients.
CLNov 1, 2024
LLM-Ref: Enhancing Reference Handling in Technical Writing with Large Language ModelsKazi Ahmed Asif Fuad, Lizhong Chen
Large Language Models (LLMs) excel in data synthesis but can be inaccurate in domain-specific tasks, which retrieval-augmented generation (RAG) systems address by leveraging user-provided data. However, RAGs require optimization in both retrieval and generation stages, which can affect output quality. In this paper, we present LLM-Ref, a writing assistant tool that aids researchers in writing articles from multiple source documents with enhanced reference synthesis and handling capabilities. Unlike traditional RAG systems that use chunking and indexing, our tool retrieves and generates content directly from text paragraphs. This method facilitates direct reference extraction from the generated outputs, a feature unique to our tool. Additionally, our tool employs iterative response generation, effectively managing lengthy contexts within the language model's constraints. Compared to baseline RAG-based systems, our approach achieves a $3.25\times$ to $6.26\times$ increase in Ragas score, a comprehensive metric that provides a holistic view of a RAG system's ability to produce accurate, relevant, and contextually appropriate responses. This improvement shows our method enhances the accuracy and contextual relevance of writing assistance tools.
LGNov 24, 2025
QuantKAN: A Unified Quantization Framework for Kolmogorov Arnold NetworksKazi Ahmed Asif Fuad, Lizhong Chen
Kolmogorov Arnold Networks (KANs) represent a new class of neural architectures that replace conventional linear transformations and node-based nonlinearities with spline-based function approximations distributed along network edges. Although KANs offer strong expressivity and interpretability, their heterogeneous spline and base branch parameters hinder efficient quantization, which remains unexamined compared to CNNs and Transformers. In this paper, we present QuantKAN, a unified framework for quantizing KANs across both quantization aware training (QAT) and post-training quantization (PTQ) regimes. QuantKAN extends modern quantization algorithms, such as LSQ, LSQ+, PACT, DoReFa, QIL, GPTQ, BRECQ, AdaRound, AWQ, and HAWQ-V2, to spline based layers with branch-specific quantizers for base, spline, and activation components. Through extensive experiments on MNIST, CIFAR 10, and CIFAR 100 across multiple KAN variants (EfficientKAN, FastKAN, PyKAN, and KAGN), we establish the first systematic benchmarks for low-bit spline networks. Our results show that KANs, particularly deeper KAGN variants, are compatible with low-bit quantization but exhibit strong method architecture interactions: LSQ, LSQ+, and PACT preserve near full precision accuracy at 4 bit for shallow KAN MLP and ConvNet models, while DoReFa provides the most stable behavior for deeper KAGN under aggressive low-bit settings. For PTQ, GPTQ and Uniform consistently deliver the strongest overall performance across datasets, with BRECQ highly competitive on simpler regimes such as MNIST. Our proposed QuantKAN framework thus unifies spline learning and quantization, and provides practical tools and guidelines for efficiently deploying KANs in real-world, resource-constrained environments.
LGOct 21, 2025
MetaCluster: Enabling Deep Compression of Kolmogorov-Arnold NetworkMatthew Raffel, Adwaith Renjith, Lizhong Chen
Kolmogorov-Arnold Networks (KANs) replace scalar weights with per-edge vectors of basis coefficients, thereby boosting expressivity and accuracy but at the same time resulting in a multiplicative increase in parameters and memory. We propose MetaCluster, a framework that makes KANs highly compressible without sacrificing accuracy. Specifically, a lightweight meta-learner, trained jointly with the KAN, is used to map low-dimensional embedding to coefficient vectors, shaping them to lie on a low-dimensional manifold that is amenable to clustering. We then run K-means in coefficient space and replace per-edge vectors with shared centroids. Afterwards, the meta-learner can be discarded, and a brief fine-tuning of the centroid codebook recovers any residual accuracy loss. The resulting model stores only a small codebook and per-edge indices, exploiting the vector nature of KAN parameters to amortize storage across multiple coefficients. On MNIST, CIFAR-10, and CIFAR-100, across standard KANs and ConvKANs using multiple basis functions, MetaCluster achieves a reduction of up to 80$\times$ in parameter storage, with no loss in accuracy. Code will be released upon publication.
CLMay 29, 2025
BeaverTalk: Oregon State University's IWSLT 2025 Simultaneous Speech Translation SystemMatthew Raffel, Victor Agostinelli, Lizhong Chen
This paper discusses the construction, fine-tuning, and deployment of BeaverTalk, a cascaded system for speech-to-text translation as part of the IWSLT 2025 simultaneous translation task. The system architecture employs a VAD segmenter for breaking a speech stream into segments, Whisper Large V2 for automatic speech recognition (ASR), and Gemma 3 12B for simultaneous translation. Regarding the simultaneous translation LLM, it is fine-tuned via low-rank adaptors (LoRAs) for a conversational prompting strategy that leverages a single prior-sentence memory bank from the source language as context. The cascaded system participated in the English$\rightarrow$German and English$\rightarrow$Chinese language directions for both the low and high latency regimes. In particular, on the English$\rightarrow$German task, the system achieves a BLEU of 24.64 and 27.83 at a StreamLAAL of 1837.86 and 3343.73, respectively. Then, on the English$\rightarrow$Chinese task, the system achieves a BLEU of 34.07 and 37.23 at a StreamLAAL of 2216.99 and 3521.35, respectively.
LGApr 11, 2025
ML For Hardware Design Interpretability: Challenges and OpportunitiesRaymond Baartmans, Andrew Ensinger, Victor Agostinelli et al.
The increasing size and complexity of machine learning (ML) models have driven the growing need for custom hardware accelerators capable of efficiently supporting ML workloads. However, the design of such accelerators remains a time-consuming process, heavily relying on engineers to manually ensure design interpretability through clear documentation and effective communication. Recent advances in large language models (LLMs) offer a promising opportunity to automate these design interpretability tasks, particularly the generation of natural language descriptions for register-transfer level (RTL) code, what we refer to as "RTL-to-NL tasks." In this paper, we examine how design interpretability, particularly in RTL-to-NL tasks, influences the efficiency of the hardware design process. We review existing work adapting LLMs for these tasks, highlight key challenges that remain unaddressed, including those related to data, computation, and model development, and identify opportunities to address them. By doing so, we aim to guide future research in leveraging ML to automate RTL-to-NL tasks and improve hardware design interpretability, thereby accelerating the hardware design process and meeting the increasing demand for custom hardware accelerators in machine learning and beyond.
CRApr 2, 2025
Hessian-aware Training for Enhancing DNNs Resilience to Parameter CorruptionsTahmid Hasan Prato, Seijoon Kim, Lizhong Chen et al.
Deep neural networks are not resilient to parameter corruptions: even a single-bitwise error in their parameters in memory can cause an accuracy drop of over 10%, and in the worst cases, up to 99%. This susceptibility poses great challenges in deploying models on computing platforms, where adversaries can induce bit-flips through software or bitwise corruptions may occur naturally. Most prior work addresses this issue with hardware or system-level approaches, such as integrating additional hardware components to verify a model's integrity at inference. However, these methods have not been widely deployed as they require infrastructure or platform-wide modifications. In this paper, we propose a new approach to addressing this issue: training models to be more resilient to bitwise corruptions to their parameters. Our approach, Hessian-aware training, promotes models with $flatter$ loss surfaces. We show that, while there have been training methods, designed to improve generalization through Hessian-based approaches, they do not enhance resilience to parameter corruptions. In contrast, models trained with our method demonstrate increased resilience to parameter corruptions, particularly with a 20$-$50% reduction in the number of bits whose individual flipping leads to a 90$-$100% accuracy drop. Moreover, we show the synergy between ours and existing hardware and system-level defenses.
COMP-PHNov 5, 2024
Neurons for Neutrons: A Transformer Model for Computation Load Estimation on Domain-Decomposed Neutron Transport ProblemsAlexander Mote, Todd Palmer, Lizhong Chen
Domain decomposition is a technique used to reduce memory overhead on large neutron transport problems. Currently, the optimal load-balanced processor allocation for these domains is typically determined through small-scale simulations of the problem, which can be time-consuming for researchers and must be repeated anytime a problem input is changed. We propose a Transformer model with a unique 3D input embedding, and input representations designed for domain-decomposed neutron transport problems, which can predict the subdomain computation loads generated by small-scale simulations. We demonstrate that such a model trained on domain-decomposed Small Modular Reactor (SMR) simulations achieves 98.2% accuracy while being able to skip the small-scale simulation step entirely. Tests of the model's robustness on variant fuel assemblies, other problem geometries, and changes in simulation parameters are also discussed.
LGJan 19, 2022
PROMPT: Learning Dynamic Resource Allocation Policies for Network ApplicationsDrew Penney, Bin Li, Jaroslaw Sydir et al.
A growing number of service providers are exploring methods to improve server utilization and reduce power consumption by co-scheduling high-priority latency-critical workloads with best-effort workloads. This practice requires strict resource allocation between workloads to reduce contention and maintain Quality-of-Service (QoS) guarantees. Prior work demonstrated promising opportunities to dynamically allocate resources based on workload demand, but may fail to meet QoS objectives in more stringent operating environments due to the presence of resource allocation cliffs, transient fluctuations in workload performance, and rapidly changing resource demand. We therefore propose PROMPT, a novel resource allocation framework using proactive QoS prediction to guide a reinforcement learning controller. PROMPT enables more precise resource optimization, more consistent handling of transient behaviors, and more robust generalization when co-scheduling new best-effort workloads not encountered during policy training. Evaluation shows that the proposed method incurs 4.2x fewer QoS violations, reduces severity of QoS violations by 12.7x, improves best-effort workload performance, and improves overall power efficiency over prior work.
ARSep 26, 2019
A Survey of Machine Learning Applied to Computer Architecture DesignDrew D. Penney, Lizhong Chen
Machine learning has enabled significant benefits in diverse fields, but, with a few exceptions, has had limited impact on computer architecture. Recent work, however, has explored broader applicability for design, optimization, and simulation. Notably, machine learning based strategies often surpass prior state-of-the-art analytical, heuristic, and human-expert approaches. This paper reviews machine learning applied system-wide to simulation and run-time optimization, and in many individual components, including memory systems, branch predictors, networks-on-chip, and GPUs. The paper further analyzes current practice to highlight useful design strategies and identify areas for future work, based on optimized implementation strategies, opportune extensions to existing work, and ambitious long term possibilities. Taken together, these strategies and techniques present a promising future for increasingly automated architectural design.
ARMay 11, 2019
Optimizing Routerless Network-on-Chip Designs: An Innovative Learning-Based FrameworkTing-Ru Lin, Drew Penney, Massoud Pedram et al.
Machine learning applied to architecture design presents a promising opportunity with broad applications. Recent deep reinforcement learning (DRL) techniques, in particular, enable efficient exploration in vast design spaces where conventional design strategies may be inadequate. This paper proposes a novel deep reinforcement framework, taking routerless networks-on-chip (NoC) as an evaluation case study. The new framework successfully resolves problems with prior design approaches being either unreliable due to random searches or inflexible due to severe design space restrictions. The framework learns (near-)optimal loop placement for routerless NoCs with various design constraints. A deep neural network is developed using parallel threads that efficiently explore the immense routerless NoC design space with a Monte Carlo search tree. Experimental results show that, compared with conventional mesh, the proposed deep reinforcement learning (DRL) routerless design achieves a 3.25x increase in throughput, 1.6x reduction in packet latency, and 5x reduction in power. Compared with the state-of-the-art routerless NoC, DRL achieves a 1.47x increase in throughput, 1.18x reduction in packet latency, and 1.14x reduction in average hop count albeit with slightly more power overhead.