Luka Ribar

h-index: 3

6 papers

78 citations

Novelty: 54%

AI Score: 41

Ranked #65,628 of 194,257 authors (top 34%); #14,771 in LG (top 37%)

6 Papers

15.3 LG Jul 9

A Practical Investigation of Training-free Relaxed Speculative Decoding

Guoxuan Xia, Luka Ribar, Paul Balanca

Speculative decoding accelerates sampling from an autoregressive LLM by using a faster auxiliary model to draft tokens, which are then verified in parallel by the LLM. Standard speculative decoding is lossless: its rejection and resampling steps exactly preserve the LLM's sampling distribution. Recent work argues that relaxing this strict guarantee can yield further speed-ups, controlled capability-speed trade-offs, or even capability gains. We practically investigate training-free relaxed speculative decoding techniques, unify existing approaches within a shared framework, benchmark them on contemporary settings, and distil takeaways and empirical findings for practitioners. Important takeaways include: relaxation, unlike lossless speculative decoding, can require considerable capability evaluation, and many relaxed approaches rely on a drafter that is a good language model, making them unsuited for lightweight dedicated multi-token-prediction drafters.
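The rejection and resampling steps mentioned in this abstract can be illustrated with a minimal sketch of the standard lossless verification rule for a single drafted token: accept with probability min(1, p/q), otherwise resample from the normalised residual max(0, p - q). The function name `verify_draft_token` and the dictionary-based distributions are assumptions made for illustration; this is not the authors' code and it does not implement any of the relaxed variants the paper studies.

```python
import random


def verify_draft_token(token, target_probs, draft_probs, rng):
    """Verify one drafted token (hypothetical helper, standard lossless rule).

    target_probs and draft_probs map each token to its probability under
    the target LLM (p) and the drafter (q). Returns (accepted, token).
    """
    p = target_probs.get(token, 0.0)
    q = draft_probs[token]  # positive, because the drafter sampled this token

    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return True, token

    # On rejection, resample from the residual distribution max(0, p - q).
    # random.choices normalises the weights internally.
    tokens = list(target_probs)
    weights = [max(0.0, target_probs[t] - draft_probs.get(t, 0.0)) for t in tokens]
    return False, rng.choices(tokens, weights=weights)[0]


rng = random.Random(0)
# Identical distributions: the drafted token is always accepted.
same = {"a": 0.5, "b": 0.5}
print(verify_draft_token("a", same, same, rng))
# The target never emits "b": the draft is rejected and "a" is resampled.
print(verify_draft_token("b", {"a": 1.0, "b": 0.0}, {"a": 0.0, "b": 1.0}, rng))
```

Relaxed schemes, as described in the abstract, trade this exactness for speed, for instance by accepting more drafted tokens than the rule above would.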

32.2 LG Dec 8, 2023 Code

SparQ Attention: Bandwidth-Efficient LLM Inference

Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley et al.

The computational difficulties of large language model (LLM) inference remain a significant obstacle to their widespread deployment. The need for many applications to support long input sequences and process them in large batches typically causes token-generation to be bottlenecked by data transfer. For this reason, we introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within the attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show that SparQ Attention brings up to 8x savings in attention data transfers without substantial drops in accuracy, by evaluating Llama 2 and 3, Mistral, Gemma and Pythia models on a wide range of downstream tasks.

10.4 LG Dec 5, 2024

Approximate Top-$k$ for Increased Parallelism

Oscar Key, Luka Ribar, Alberto Cattaneo et al.

We present an evaluation of bucketed approximate top-$k$ algorithms. Computing top-$k$ exactly suffers from limited parallelism, because the $k$ largest values must be aggregated along the vector, and is thus not well suited to computation on highly-parallel machine learning accelerators. By relaxing the requirement that the top-$k$ is exact, bucketed algorithms can dramatically increase the parallelism available by independently computing many smaller top-$k$ operations. We explore the design choices of this class of algorithms using both theoretical analysis and empirical evaluation on downstream tasks. Our motivating examples are sparsity algorithms for language models, which often use top-$k$ to select the most important parameters or activations. We also release a fast bucketed top-$k$ implementation for PyTorch.
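The bucketed idea described in this abstract can be sketched in a few lines of plain Python: split the vector into buckets, take a smaller top-$k$ from each bucket independently, and concatenate the results. The function name `bucketed_topk`, the use of contiguous equal-sized buckets, and the even split of $k$ across buckets are assumptions made for illustration; this is not the released PyTorch implementation, and the paper explores other design choices.

```python
import heapq


def bucketed_topk(values, k, num_buckets):
    """Approximate top-k via independent per-bucket top-(k / num_buckets).

    Hypothetical sketch: assumes len(values) and k are both divisible by
    num_buckets and that buckets are contiguous slices of the input.
    """
    if len(values) % num_buckets or k % num_buckets:
        raise ValueError("len(values) and k must be divisible by num_buckets")
    bucket_size = len(values) // num_buckets
    per_bucket = k // num_buckets

    selected = []
    # Each bucket is processed independently, so on an accelerator these
    # smaller top-k operations could run in parallel.
    for start in range(0, len(values), bucket_size):
        bucket = values[start:start + bucket_size]
        selected.extend(heapq.nlargest(per_bucket, bucket))
    return sorted(selected, reverse=True)


data = [9, 8, 7, 1, 2, 3, 0, 4]
print(bucketed_topk(data, k=2, num_buckets=1))  # exact top-2: [9, 8]
print(bucketed_topk(data, k=2, num_buckets=2))  # approximate: [9, 4]
```

With two buckets the result misses 8 because both of the largest values fall in the same bucket, which illustrates the accuracy cost paid for the extra parallelism.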

4.1 LG May 19, 2025 Code

Optimal Formats for Weight Quantisation

Douglas Orr, Luka Ribar, Carlo Luschi

Weight quantisation is an essential technique for enabling efficient training and deployment of modern deep learning models. However, the recipe book of quantisation formats is large and formats are often chosen empirically. In this paper, we propose a framework for systematic design and analysis of quantisation formats. By connecting the question of format design with the classical quantisation theory, we show that the strong practical performance of popular formats comes from their ability to represent values using variable-length codes. We frame the problem as minimising the KL divergence between original and quantised model outputs under a model size constraint, which can be approximated by minimising the squared quantisation error, a well-studied problem where entropy-constrained quantisers with variable-length codes are optimal. We develop non-linear quantisation curves for block-scaled data across multiple distribution families and observe that these formats, along with sparse outlier formats, consistently outperform fixed-length formats, indicating that they also exploit variable-length encoding. Finally, by using the relationship between the Fisher information and KL divergence, we derive the optimal allocation of bit-widths to individual parameter tensors across the model's layers, saving up to 0.25 bits per parameter when applied to large language models.

1.2 NC Dec 28, 2021

Reliability of Event Timing in Silicon Neurons

Tai Miyazaki Kirby, Luka Ribar, Rodolphe Sepulchre

Analog, low-voltage electronics show great promise in producing silicon neurons (SiNs) with unprecedented levels of energy efficiency. Yet, their inherently high susceptibility to process, voltage and temperature (PVT) variations, and noise has long been recognised as a major bottleneck in developing effective neuromorphic solutions. Inspired by spike transmission studies in biophysical, neocortical neurons, we demonstrate that the inherent noise and variability can coexist with reliable spike transmission in analog SiNs, similarly to biological neurons. We illustrate this property on a recent neuromorphic model of a bursting neuron by showcasing three different relevant types of reliable event transmission: single spike transmission, burst transmission, and the on-off control of a half-centre oscillator (HCO) network.

5.9 SY Nov 9, 2020 Code

Neuromorphic Control

Luka Ribar, Rodolphe Sepulchre

Neuromorphic engineering is a rapidly developing field that aims to take inspiration from the biological organization of neural systems to develop novel technology for computing, sensing, and actuating. The unique properties of such systems call for new signal processing and control paradigms. The article introduces the mixed feedback organization of excitable neuronal systems, consisting of interlocked positive and negative feedback loops acting in distinct timescales. The principles of biological neuromodulation suggest a methodology for designing and controlling mixed-feedback systems neuromorphically. The proposed design consists of a parallel interconnection of elementary circuit elements that mirrors the organization of biological neurons and utilizes the hardware components of neuromorphic electronic circuits. The interconnection structure endows the neuromorphic systems with a simple control methodology that reframes the neuronal control as an input-output shaping problem. The potential of neuronal control is illustrated on elementary network examples that suggest the scalability of the mixed-feedback principles.