LG, Dec 8, 2023
SparQ Attention: Bandwidth-Efficient LLM Inference
Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley et al.
The computational difficulties of large language model (LLM) inference remain a significant obstacle to their widespread deployment. The need for many applications to support long input sequences and process them in large batches typically causes token-generation to be bottlenecked by data transfer. For this reason, we introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within the attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show that SparQ Attention brings up to 8x savings in attention data transfers without substantial drops in accuracy, by evaluating Llama 2 and 3, Mistral, Gemma and Pythia models on a wide range of downstream tasks.
LG, Dec 5, 2024
Approximate Top-$k$ for Increased Parallelism
Oscar Key, Luka Ribar, Alberto Cattaneo et al.
We present an evaluation of bucketed approximate top-$k$ algorithms. Computing top-$k$ exactly suffers from limited parallelism, because the $k$ largest values must be aggregated along the vector, and is therefore not well suited to computation on highly-parallel machine learning accelerators. By relaxing the requirement that the top-$k$ is exact, bucketed algorithms can dramatically increase the parallelism available by independently computing many smaller top-$k$ operations. We explore the design choices of this class of algorithms using both theoretical analysis and empirical evaluation on downstream tasks. Our motivating examples are sparsity algorithms for language models, which often use top-$k$ to select the most important parameters or activations. We also release a fast bucketed top-$k$ implementation for PyTorch.
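The bucketed idea can be sketched in a few lines of pure Python. This is a minimal illustration, not the released PyTorch implementation: the names `bucketed_topk` and `recall`, the interleaved bucket layout, and the requirement that `k` divides evenly across buckets are assumptions made for the example.

```python
import heapq


def bucketed_topk(values, k, num_buckets):
    """Approximate top-k indices: an independent small top-k per bucket.

    Assumption: k is split evenly, so each bucket contributes
    k // num_buckets indices.
    """
    if k % num_buckets != 0:
        raise ValueError("k must be divisible by num_buckets")
    per_bucket = k // num_buckets
    selected = []
    for b in range(num_buckets):
        # Interleaved layout: bucket b holds indices b, b + num_buckets, ...
        # Each bucket is processed independently, so the loop iterations
        # could run in parallel on an accelerator.
        bucket = range(b, len(values), num_buckets)
        selected.extend(
            heapq.nlargest(per_bucket, bucket, key=values.__getitem__)
        )
    return sorted(selected)


def recall(approx_indices, values, k):
    """Fraction of the exact top-k indices that the approximation found."""
    exact = set(heapq.nlargest(k, range(len(values)), key=values.__getitem__))
    return len(exact & set(approx_indices)) / k
```

With a single bucket the result is the exact top-$k$. With more buckets, large values that happen to share a bucket can be missed, which is the accuracy cost paid for the extra parallelism.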
LG, May 19, 2025
Optimal Formats for Weight Quantisation
Douglas Orr, Luka Ribar, Carlo Luschi
Weight quantisation is an essential technique for enabling efficient training and deployment of modern deep learning models. However, the recipe book of quantisation formats is large and formats are often chosen empirically. In this paper, we propose a framework for systematic design and analysis of quantisation formats. By connecting the question of format design with classical quantisation theory, we show that the strong practical performance of popular formats comes from their ability to represent values using variable-length codes. We frame the problem as minimising the KL divergence between original and quantised model outputs under a model size constraint, which can be approximated by minimising the squared quantisation error, a well-studied problem where entropy-constrained quantisers with variable-length codes are optimal. We develop non-linear quantisation curves for block-scaled data across multiple distribution families and observe that these formats, along with sparse outlier formats, consistently outperform fixed-length formats, indicating that they also exploit variable-length encoding. Finally, by using the relationship between the Fisher information and KL divergence, we derive the optimal allocation of bit-widths to individual parameter tensors across the model's layers, saving up to 0.25 bits per parameter when applied to large language models.
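As a point of reference for the squared quantisation error objective, the following minimal sketch quantises one block with a plain uniform, absmax-scaled format and measures the error. It shows a generic fixed-length baseline, not one of the formats developed in the paper; the function names and the symmetric integer grid are assumptions for illustration.

```python
def quantise_block(block, bits):
    """Round a block to a symmetric uniform grid scaled by its absmax.

    Assumption: a signed grid with 2**(bits - 1) - 1 positive levels and
    one shared scale per block (a simple block-scaled format).
    """
    levels = 2 ** (bits - 1) - 1
    # Fall back to a scale of 1.0 for an all-zero block to avoid dividing by 0.
    scale = max(abs(x) for x in block) / levels or 1.0
    return [round(x / scale) * scale for x in block]


def mean_squared_error(original, quantised):
    """Mean squared quantisation error between two equal-length blocks."""
    return sum((a - b) ** 2 for a, b in zip(original, quantised)) / len(original)


if __name__ == "__main__":
    block = [0.1, -0.5, 0.25, 1.0]
    for bits in (3, 4, 8):
        err = mean_squared_error(block, quantise_block(block, bits))
        print(bits, "bits -> mean squared error", err)
```

Every element here costs the same number of bits regardless of how likely its value is, which is the fixed-length property that the abstract contrasts with variable-length codes.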
NC, Dec 28, 2021
Reliability of Event Timing in Silicon Neurons
Tai Miyazaki Kirby, Luka Ribar, Rodolphe Sepulchre
Analog, low-voltage electronics show great promise in producing silicon neurons (SiNs) with unprecedented levels of energy efficiency. Yet, their inherently high susceptibility to process, voltage and temperature (PVT) variations, and noise has long been recognised as a major bottleneck in developing effective neuromorphic solutions. Inspired by spike transmission studies in biophysical, neocortical neurons, we demonstrate that the inherent noise and variability can coexist with reliable spike transmission in analog SiNs, similarly to biological neurons. We illustrate this property on a recent neuromorphic model of a bursting neuron by showcasing three different relevant types of reliable event transmission: single spike transmission, burst transmission, and the on-off control of a half-centre oscillator (HCO) network.
SY, Nov 9, 2020
Neuromorphic Control
Luka Ribar, Rodolphe Sepulchre
Neuromorphic engineering is a rapidly developing field that aims to take inspiration from the biological organization of neural systems to develop novel technology for computing, sensing, and actuating. The unique properties of such systems call for new signal processing and control paradigms. The article introduces the mixed feedback organization of excitable neuronal systems, consisting of interlocked positive and negative feedback loops acting in distinct timescales. The principles of biological neuromodulation suggest a methodology for designing and controlling mixed-feedback systems neuromorphically. The proposed design consists of a parallel interconnection of elementary circuit elements that mirrors the organization of biological neurons and utilizes the hardware components of neuromorphic electronic circuits. The interconnection structure endows the neuromorphic systems with a simple control methodology that reframes the neuronal control as an input-output shaping problem. The potential of neuronal control is illustrated on elementary network examples that suggest the scalability of the mixed-feedback principles.
NC, May 15, 2018
Neuromodulation of Neuromorphic Circuits
Luka Ribar, Rodolphe Sepulchre
We present a novel methodology to enable control of a neuromorphic circuit in close analogy with the physiological neuromodulation of a single neuron. The methodology is general in that it only relies on a parallel interconnection of elementary voltage-controlled current sources. In contrast to controlling a nonlinear circuit through the parameter tuning of a state-space model, our approach is purely input-output. The circuit elements are controlled and interconnected to shape the current-voltage characteristics (I-V curves) of the circuit in prescribed timescales. In turn, shaping those I-V curves determines the excitability properties of the circuit. We show that this methodology enables both robust and accurate control of the circuit behavior and resembles the biophysical mechanisms of neuromodulation. As a proof of concept, we simulate a SPICE model composed of MOSFET transconductance amplifiers operating in the weak inversion regime.
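The idea of shaping an I-V curve by summing currents from parallel elements can be illustrated with a static toy model. This is a sketch under stated assumptions, not the circuit from the paper: the unit-conductance passive branch, the `tanh`-shaped current source, and the parameter name `alpha` are hypothetical, and the timescale separation described above is ignored.

```python
import math


def iv_curve(v, alpha):
    """Total current of two parallel branches at voltage v.

    Assumptions: a passive unit conductance (current v) in parallel with a
    saturating voltage-controlled current source supplying -alpha * tanh(v),
    which acts as a localised negative conductance.
    """
    return v - alpha * math.tanh(v)


def slope(v, alpha, h=1e-6):
    """Numerical slope dI/dV of the summed I-V curve (central difference)."""
    return (iv_curve(v + h, alpha) - iv_curve(v - h, alpha)) / (2 * h)


if __name__ == "__main__":
    # The analytic slope at v = 0 is 1 - alpha, so alpha > 1 creates a
    # region of negative slope (an N-shaped curve).
    for alpha in (0.5, 2.0):
        print("alpha =", alpha, "slope at 0:", slope(0.0, alpha))
```

Changing the single gain `alpha` moves the curve between monotonic and N-shaped without touching any internal state, which is the input-output flavour of control the abstract describes.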