Rathinakumar Appuswamy

LG
h-index: 42
9 papers
2,032 citations
Novelty: 54%
AI Score: 51

9 Papers

LG · Jan 30, 2023
Efficient and Effective Methods for Mixed Precision Neural Network Quantization for Faster, Energy-efficient Inference

Deepika Bablani, Jeffrey L. McKinstry, Steven K. Esser et al. · ibm-research

For efficient neural network inference, it is desirable to achieve state-of-the-art accuracy with the simplest networks requiring the least computation, memory, and power. Quantizing networks to lower precision is a powerful technique for simplifying networks. As each layer of a network may have different sensitivity to quantization, mixed precision quantization methods selectively tune the precision of individual layers to achieve a minimum drop in task performance (e.g., accuracy). To estimate the impact of layer precision choice on task performance, two methods are introduced: i) Entropy Approximation Guided Layer selection (EAGL) is fast and uses the entropy of the weight distribution, and ii) Accuracy-aware Layer Precision Selection (ALPS) is straightforward and relies on single epoch fine-tuning after layer precision reduction. Using EAGL and ALPS for layer precision selection, full-precision accuracy is recovered with a mix of 4-bit and 2-bit layers for ResNet-50, ResNet-101 and BERT-base transformer networks, demonstrating enhanced performance across the entire accuracy-throughput frontier. The techniques demonstrate better performance than existing techniques in several commensurate comparisons. Notably, this is accomplished with significantly less computational time required to reach a solution.
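As a rough illustration of the entropy idea behind EAGL, the sketch below estimates the Shannon entropy of a layer's weights after uniform quantization to a given bit width. The function name `quantized_weight_entropy`, the symmetric uniform quantizer, and the use of an empirical histogram are assumptions made for illustration; the abstract does not say how the entropy is computed or how it maps to a precision choice.

```python
import math
from collections import Counter

def quantized_weight_entropy(weights, bits):
    """Shannon entropy (in bits) of weights after symmetric uniform quantization.

    Hypothetical helper: the quantizer and the histogram-based estimate
    are assumptions, not the procedure stated in the paper.
    """
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    q_max = 2 ** (bits - 1) - 1      # largest positive integer level
    q_min = -(2 ** (bits - 1))       # most negative integer level
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return 0.0
    step = max_abs / q_max           # step size so the largest weight maps to q_max
    levels = [min(q_max, max(q_min, round(w / step))) for w in weights]
    counts = Counter(levels)
    n = len(levels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy layer whose weights split evenly between two quantization levels.
toy_layer = [1.0, -1.0, 1.0, -1.0]
toy_entropy = quantized_weight_entropy(toy_layer, bits=2)
```

One plausible use, again an assumption, is to rank layers by this value and treat those with low entropy as candidates for the lower precision.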

IT · Mar 22
Probability of super-regular matrices and MDS codes over finite fields

Rathinakumar Appuswamy, Marco Bazzani, Spencer Congero et al.

Let $C$ be an $[n, k]$ linear code chosen uniformly at random over a finite field $\mathbb{F}_q$ of size $q$. The following asymptotic probability of $C$ being maximum distance separable (MDS) as $q,n,k\to\infty$ is known: If $\frac{1}{q}\binom{n}{k} \to 0$, then $P(C\text{ is MDS}) \to 1$. We demonstrate that this growth rate is in fact a threshold by proving: If $\frac{1}{q}\binom{n}{k} \to \infty$, then $P(C\text{ is MDS}) \to 0$. A matrix is (\textit{contiguous}) \textit{super-regular} if all of its (contiguous) square submatrices are nonsingular. The above results imply that for any $k \times k$ matrix $A$ chosen uniformly at random over $\mathbb{F}_q$, the following hold: If $\frac{4^k/\sqrt{k}}{q} \to 0$, then $P(A \text{ is super-regular}) \to 1$. If $\frac{4^k/\sqrt{k}}{q} \to \infty$, then $P(A \text{ is super-regular}) \to 0$. We also obtain the following asymptotic probabilities for two variations of the above questions: If $\frac{1}{q}\binom{n}{k} \to \lambda \in (0,\infty)$ and $k/n \to 0$, then $P(C\text{ is MDS}) \to e^{-\lambda}$. If $\frac{k^3/3}{q} \to \lambda \in (0,\infty)$, then $P(A \text{ is contiguous super-regular}) \to e^{-\lambda}$. The number of contiguous super-regular $3 \times 3$ matrices is also a polynomial. Finally, for $4 \times 4$ matrices, we show that the number of super-regular matrices is not a polynomial, nor even a quasi-polynomial of period less than 7, whereas our experimental evidence suggests that the number of contiguous super-regular matrices is a polynomial.
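The definition of (contiguous) super-regularity can be checked by brute force for tiny cases. The sketch below does so over a prime field $\mathbb{F}_p$ only; the function names are hypothetical, and a general prime-power $q$ would need proper extension-field arithmetic, which is not attempted here. A $k \times k$ matrix has $\sum_s \binom{k}{s}^2 = \binom{2k}{k} - 1$ square submatrices, which grows like $4^k/\sqrt{k}$, the same scale as the threshold quoted above.

```python
from itertools import combinations, product

def det_mod_p(m, p):
    """Determinant of a square matrix modulo a prime p, by Gaussian elimination."""
    m = [row[:] for row in m]
    n = len(m)
    det = 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] % p), None)
        if pivot is None:
            return 0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det                      # a row swap flips the sign
        det = det * m[col][col] % p
        inv = pow(m[col][col], -1, p)       # modular inverse (Python 3.8+)
        for r in range(col + 1, n):
            factor = m[r][col] * inv % p
            for c in range(col, n):
                m[r][c] = (m[r][c] - factor * m[col][c]) % p
    return det % p

def is_super_regular(a, p, contiguous=False):
    """True if every (contiguous) square submatrix of a is nonsingular mod p."""
    k = len(a)
    for size in range(1, k + 1):
        if contiguous:
            index_sets = [tuple(range(s, s + size)) for s in range(k - size + 1)]
        else:
            index_sets = list(combinations(range(k), size))
        for rows in index_sets:
            for cols in index_sets:
                sub = [[a[r][c] for c in cols] for r in rows]
                if det_mod_p(sub, p) == 0:
                    return False
    return True

def count_super_regular(k, p, contiguous=False):
    """Exhaustively count k x k (contiguous) super-regular matrices over F_p."""
    count = 0
    for entries in product(range(p), repeat=k * k):
        a = [list(entries[i * k:(i + 1) * k]) for i in range(k)]
        if is_super_regular(a, p, contiguous):
            count += 1
    return count
```

For $k = 2$ every square submatrix is contiguous, so both counts agree and equal $(p-1)^3(p-2)$: all four entries must be nonzero and the last entry must avoid the single value that makes the determinant vanish. The enumeration costs $p^{k^2}$ matrices, so it is only practical for very small $k$ and $p$.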

DC · Nov 20, 2025 · Code
A Scalable NorthPole System with End-to-End Vertical Integration for Low-Latency and Energy-Efficient LLM Inference

Michael V. DeBole, Rathinakumar Appuswamy, Neil McGlohon et al. · ibm-research

A vertically integrated, end-to-end, research prototype system combines 288 NorthPole neural inference accelerator cards, offline training algorithms, a high-performance runtime stack, and a containerized inference pipeline to deliver a scalable and efficient cloud inference service. The system delivers 115 peta-ops at 4-bit integer precision and 3.7 PB/s of memory bandwidth across 18 2U servers, while consuming only 30 kW of power and weighing 730 kg in a 0.67 m^2 42U rack footprint. The system can run 3 simultaneous instances of the 8-billion-parameter open-source IBM Granite-3.3-8b-instruct model at 2,048 context length with 28 simultaneous users and a per-user inter-token latency of 2.8 ms. The system is scalable, modular, and reconfigurable, supporting various model sizes and context lengths, and is ideal for deploying agentic workflows for enterprise AI applications in existing data center (cloud, on-prem) environments. For example, the system can support 18 instances of a 3-billion-parameter model or a single instance of a 70-billion-parameter model.
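A few derived figures follow directly from the numbers quoted in the abstract. The short calculation below is only arithmetic on those numbers; it assumes the cards are spread evenly across servers and that the quoted power covers the whole system, neither of which the abstract states explicitly.

```python
# Figures quoted in the abstract.
cards = 288
servers = 18
peta_ops = 115                  # at 4-bit integer precision
power_kw = 30
inter_token_latency_ms = 2.8    # per user

# Derived values (assuming an even split of cards across servers).
cards_per_server = cards // servers
tera_ops_per_watt = peta_ops * 1e15 / (power_kw * 1e3) / 1e12
tokens_per_second_per_user = 1000 / inter_token_latency_ms

print(f"{cards_per_server} cards per server")
print(f"{tera_ops_per_watt:.2f} tera-ops per watt")
print(f"{tokens_per_second_per_user:.0f} tokens per second per user")
```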

LG · Jul 22, 2025
SiLQ: Simple Large Language Model Quantization-Aware Training

Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani et al. · ibm-research

Large language models can be quantized to reduce inference time latency, model size, and energy consumption, thereby delivering a better user experience at lower cost. A challenge exists to deliver quantized models with minimal loss of accuracy in reasonable time, and in particular to do so without requiring mechanisms incompatible with specialized inference accelerators. Here, we demonstrate a simple, end-to-end quantization-aware training approach that, with an increase in total model training budget of less than 0.1%, outperforms the leading published quantization methods by large margins on several modern benchmarks, with both base and instruct model variants. The approach easily generalizes across different model architectures, can be applied to activations, cache, and weights, and requires the introduction of no additional operations to the model other than the quantization itself.

LG · Feb 21, 2019
Learned Step Size Quantization

Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani et al.

Deep networks run with low precision operations at inference time offer power and space advantages over high precision alternatives, but need to overcome the challenge of maintaining high accuracy as precision decreases. Here, we present a method for training such networks, Learned Step Size Quantization, that achieves the highest accuracy to date on the ImageNet dataset when using models, from a variety of architectures, with weights and activations quantized to 2-, 3- or 4-bits of precision, and that can train 3-bit models that reach full precision baseline accuracy. Our approach builds upon existing methods for learning weights in quantized networks by improving how the quantizer itself is configured. Specifically, we introduce a novel means to estimate and scale the task loss gradient at each weight and activation layer's quantizer step size, such that it can be learned in conjunction with other network parameters. This approach works using different levels of precision as needed for a given system and requires only a simple modification of existing training code.
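The idea of learning the quantizer step size alongside the weights can be sketched for a single scalar value. The abstract gives no formulas, so everything below is an assumption about one common formulation: a clip-round-rescale quantizer, a straight-through estimate for rounding when differentiating with respect to the step size, and a gradient scale of $1/\sqrt{N \cdot Q_P}$. The function names are hypothetical.

```python
import math

def lsq_quantize(v, step, q_n, q_p):
    """Clip v/step to [-q_n, q_p], round to an integer level, then rescale."""
    v_bar = round(min(q_p, max(-q_n, v / step)))
    return v_bar * step

def lsq_step_gradient(v, step, q_n, q_p):
    """Derivative of the quantized value with respect to the step size.

    Rounding is treated as the identity in the backward pass
    (straight-through estimate), which is an assumption of this sketch.
    """
    ratio = v / step
    if ratio <= -q_n:
        return -q_n                  # clipped at the negative end
    if ratio >= q_p:
        return q_p                   # clipped at the positive end
    return round(ratio) - ratio      # inside the range: rounding error

def lsq_gradient_scale(num_elements, q_p):
    """Assumed scale applied to the step-size gradient: 1 / sqrt(N * Q_P)."""
    return 1.0 / math.sqrt(num_elements * q_p)

# Signed 3-bit example: integer levels run from -4 to 3.
Q_N, Q_P = 4, 3
quantized = lsq_quantize(0.26, step=0.1, q_n=Q_N, q_p=Q_P)
```

Note that Python's `round` rounds halves to the nearest even integer; a training framework may use a different tie-breaking rule.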

CV · Sep 11, 2018
Discovering Low-Precision Networks Close to Full-Precision Networks for Efficient Embedded Inference

Jeffrey L. McKinstry, Steven K. Esser, Rathinakumar Appuswamy et al.

To realize the promise of ubiquitous embedded deep network inference, it is essential to seek limits of energy and area efficiency. To this end, low-precision networks offer tremendous promise because both energy and area scale down quadratically with the reduction in precision. Here we demonstrate ResNet-18, -34, -50, -152, Inception-v3, DenseNet-161, and VGG-16bn networks on the ImageNet classification benchmark that, at 8-bit precision, exceed the accuracy of the full-precision baseline networks after one epoch of fine-tuning, thereby leveraging the availability of pretrained models. We also demonstrate ResNet-18, -34, -50, -152, DenseNet-161, and VGG-16bn 4-bit models that match the accuracy of the full-precision baseline networks -- the highest scores to date. Surprisingly, the weights of the low-precision networks are very close (in cosine similarity) to the weights of the corresponding baseline networks, making training from scratch unnecessary. We find that gradient noise due to quantization during training increases with reduced precision, and seek ways to overcome this noise. The number of iterations required by SGD to achieve a given training error is related to the square of (a) the distance of the initial solution from the final plus (b) the maximum variance of the gradient estimates. Therefore, we (a) reduce solution distance by starting with pretrained fp32 precision baseline networks and fine-tuning, and (b) combat gradient noise introduced by quantization by training longer and reducing learning rates. Sensitivity analysis indicates that these simple techniques, coupled with proper activation function range calibration to take full advantage of the limited precision, are sufficient to discover low-precision networks, if they exist, close to fp32 precision baseline networks. The results herein provide evidence that 4 bits suffice for classification.

NE · Jun 8, 2016
Structured Convolution Matrices for Energy-efficient Deep learning

Rathinakumar Appuswamy, Tapan Nayak, John Arthur et al.

We derive a relationship between network representation in energy-efficient neuromorphic architectures and block Toeplitz convolutional matrices. Inspired by this connection, we develop deep convolutional networks using a family of structured convolutional matrices and achieve state-of-the-art trade-off between energy efficiency and classification accuracy for well-known image recognition tasks. We also put forward a novel method to train binary convolutional networks by utilising an existing connection between noisy-rectified linear units and binary activations.
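The link between convolution and Toeplitz matrices can be seen in one dimension: a sliding-window filter is the same as multiplying by a matrix whose diagonals are constant. The sketch below shows only this generic fact with hypothetical helper names; it is not the structured matrix family developed in the paper, and the block Toeplitz case arises when the same construction is applied to 2D inputs.

```python
def toeplitz_from_kernel(kernel, input_len):
    """Build the matrix whose rows are shifted copies of the kernel.

    Entry (i, j) depends only on j - i, so every diagonal is constant.
    """
    k = len(kernel)
    out_len = input_len - k + 1
    return [[kernel[j - i] if 0 <= j - i < k else 0 for j in range(input_len)]
            for i in range(out_len)]

def matvec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def sliding_window(kernel, x):
    """Direct 'valid' sliding-window filter (cross-correlation) for comparison."""
    k = len(kernel)
    return [sum(kernel[t] * x[i + t] for t in range(k))
            for i in range(len(x) - k + 1)]

kernel = [1, 2, 3]
signal = [1, 0, 2, 1, 3]
toeplitz = toeplitz_from_kernel(kernel, len(signal))
via_matrix = matvec(toeplitz, signal)
via_window = sliding_window(kernel, signal)
```

Both routes give the same output, which is why constraints on a convolutional layer can be expressed as structure in the equivalent matrix.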

NE · Jun 7, 2016
Deep neural networks are robust to weight binarization and other non-linear distortions

Paul Merolla, Rathinakumar Appuswamy, John Arthur et al.

Recent results show that deep neural networks achieve excellent performance even when, during training, weights are quantized and projected to a binary representation. Here, we show that this is just the tip of the iceberg: these same networks, during testing, also exhibit a remarkable robustness to distortions beyond quantization, including additive and multiplicative noise, and a class of non-linear projections where binarization is just a special case. To quantify this robustness, we show that one such network achieves 11% test error on CIFAR-10 even with 0.68 effective bits per weight. Furthermore, we find that a common training heuristic--namely, projecting quantized weights during backpropagation--can be altered (or even removed) and networks still achieve a base level of robustness during testing. Specifically, training with weight projections other than quantization also works, as does simply clipping the weights, neither of which has been reported before. We confirm our results for CIFAR-10 and ImageNet datasets. Finally, drawing from these ideas, we propose a stochastic projection rule that leads to a new state-of-the-art network with 7.64% test error on CIFAR-10 using no data augmentation.

NE · Mar 28, 2016
Convolutional Networks for Fast, Energy-Efficient Neuromorphic Computing

Steven K. Esser, Paul A. Merolla, John V. Arthur et al.

Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on spiking neurons, low precision synapses, and a scalable communication network. Here, we demonstrate that neuromorphic computing, despite its novel architectural primitives, can implement deep convolution networks that i) approach state-of-the-art classification accuracy across 8 standard datasets, encompassing vision and speech, ii) perform inference while preserving the hardware's underlying energy-efficiency and high throughput, running on the aforementioned datasets at between 1200 and 2600 frames per second and using between 25 and 275 mW (effectively > 6000 frames / sec / W) and iii) can be specified and trained using backpropagation with the same ease-of-use as contemporary deep learning. For the first time, the algorithmic power of deep learning can be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer.