Puneet Gupta

15 papers

187 citations

Novelty 52%

AI Score 48

Ranked #52,538 of 201,326 authors (top 26%)

#11,959 in LG (top 28%)

15 Papers

AR Nov 10, 2022

PhotoFourier: A Photonic Joint Transform Correlator-Based Neural Network Accelerator

Shurui Li, Hangbo Yang, Chee Wei Wong et al.

The last few years have seen a lot of work to address the challenge of low-latency and high-throughput convolutional neural network inference. Integrated photonics has the potential to dramatically accelerate neural networks because of its low-latency nature. Combined with the concept of the Joint Transform Correlator (JTC), the computationally expensive convolution functions can be computed nearly instantaneously (in the time of flight of light) at almost no cost. This 'free' convolution computation provides the theoretical basis of the proposed PhotoFourier JTC-based CNN accelerator. PhotoFourier addresses a myriad of challenges posed by on-chip photonic computing in the Fourier domain, including 1D lenses and high-cost optoelectronic conversions. The proposed PhotoFourier accelerator achieves more than 28X better energy-delay product compared to state-of-the-art photonic neural network accelerators.
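The joint transform correlator idea mentioned in this abstract can be illustrated numerically. The sketch below is only a 1D software analogy, not the PhotoFourier architecture: it places a signal and a shifted kernel in one input, takes a Fourier transform, keeps the squared magnitude (standing in for a square-law photodetector), and transforms again, after which the cross-correlation appears at an offset set by the shift. The function names, the naive DFT, and the padding choices are assumptions made for illustration.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (software stand-in for a lens)."""
    n_pts = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n_pts):
        acc = sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / n_pts)
                  for n in range(n_pts))
        out.append(acc / n_pts if inverse else acc)
    return out

def jtc_correlate(signal, kernel):
    """Valid cross-correlation via a joint power spectrum (needs len(kernel) <= len(signal))."""
    ls, lk = len(signal), len(kernel)
    shift = 2 * ls               # hypothetical kernel offset in the joint input
    size = 2 * shift + 2 * lk    # padding so the correlation terms do not overlap
    joint = [0.0] * size
    for i, v in enumerate(signal):
        joint[i] += v
    for i, v in enumerate(kernel):
        joint[shift + i] += v
    spectrum = dft(joint)
    power = [abs(c) ** 2 for c in spectrum]   # square-law detection
    plane = dft(power, inverse=True)          # second transform
    # The cross-correlation at lag t sits at index (shift - t) in the output.
    return [plane[(shift - t) % size].real for t in range(ls - lk + 1)]

result = [round(v) for v in jtc_correlate([1, 2, 3, 4], [1, 2])]
print(result)
```

The output matches a direct sliding dot product of the kernel over the signal, which is the operation a CNN layer calls convolution.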

AR Jul 19, 2024

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Joyjit Kundu, Wenzhe Guo, Ali BanaGozar et al.

Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of distributed LLM training and inference through an analytical framework that accurately considers compute, memory sub-system, network, and various parallelization strategies (model parallel, data parallel, pipeline parallel, and sequence parallel). We validate our performance predictions with published data from literature and relevant industry vendors (e.g., NVIDIA). For distributed training, we investigate the memory footprint of LLMs for different activation re-computation methods, dissect the key factors behind the massive performance gain from A100 to B200 ($\sim$ 35x speed-up closely following NVIDIA's scaling trend), and further run a design space exploration at different technology nodes (12 nm to 1 nm) to study the impact of logic, memory, and network scaling on the performance. For inference, we analyze the compute versus memory boundedness of different operations at a matrix-multiply level for different GPU systems and further explore the impact of DRAM memory technology scaling on inference latency. Utilizing our modeling framework, we reveal the evolution of performance bottlenecks for both LLM training and inference with technology scaling, thus, providing insights to design future systems for LLM training and inference.

LG Apr 8, 2023

Training Neural Networks for Execution on Approximate Hardware

Tianmu Li, Shurui Li, Puneet Gupta

Approximate computing methods have shown great potential for deep learning. Due to the reduced hardware costs, these methods are especially suitable for inference tasks on battery-operated devices that are constrained by their power budget. However, approximate computing hasn't reached its full potential due to the lack of work on training methods. In this work, we discuss training methods for approximate hardware. We demonstrate how training needs to be specialized for approximate hardware, and propose methods to speed up the training process by up to 18X.

LG Oct 11, 2023

Cost-Driven Hardware-Software Co-Optimization of Machine Learning Pipelines

Ravit Sharma, Wojciech Romaszkan, Feiqian Zhu et al.

Researchers have long touted a vision of the future enabled by a proliferation of internet-of-things devices, including smart sensors, homes, and cities. Increasingly, embedding intelligence in such devices involves the use of deep neural networks. However, their storage and processing requirements make them prohibitive for cheap, off-the-shelf platforms. Overcoming those requirements is necessary for enabling widely-applicable smart devices. While many ways of making models smaller and more efficient have been developed, there is a lack of understanding of which ones are best suited for particular scenarios. More importantly for edge platforms, those choices cannot be analyzed in isolation from cost and user experience. In this work, we holistically explore how quantization, model scaling, and multi-modality interact with system components such as memory, sensors, and processors. We perform this hardware/software co-design from the cost, latency, and user-experience perspective, and develop a set of guidelines for optimal system design and model deployment for the most cost-constrained platforms. We demonstrate our approach using an end-to-end, on-device, biometric user authentication system using a $20 ESP-EYE board.

1.2 LG Apr 25

Efficient VQ-QAT and Mixed Vector/Linear quantized Neural Networks

Terry Gou, Puneet Gupta

In this work, we developed and tested 3 techniques for vector quantization (VQ) based model weight compression. To mitigate codebook collapse and enable end-to-end training, we adopted cosine similarity-based assignment. Building on ideas from attention-based formulations in Differentiable K-Means (DKM), we further improved this approach by using cosine similarity for assignment combined with top-1 sampling and a straight-through estimator, thereby eliminating the need for weighted-average reconstruction. Finally, we investigated the use of differentiable neural architecture search (NAS) to adaptively select layer-wise quantization configurations, further optimizing the compression process. Although our method does not consistently outperform existing approaches across all quantization levels, it provides useful insights into the design trade-offs and behaviors of VQ-based model compression methods.
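As a rough illustration of the cosine-similarity assignment with top-1 selection described in this abstract, the sketch below covers only the forward assignment step in plain Python. The names `cosine` and `assign_top1`, the toy codebook, and the epsilon guard are assumptions for illustration; this is not the authors' implementation, and the straight-through estimator is a backward-pass concept that appears here only as a comment.

```python
import math

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def assign_top1(sub_vectors, codebook):
    """Pick, for each weight sub-vector, the index of the most similar codeword."""
    return [
        max(range(len(codebook)), key=lambda j: cosine(w, codebook[j]))
        for w in sub_vectors
    ]

# Hypothetical toy data: two 2-D weight sub-vectors and a two-entry codebook.
weights = [[1.0, 0.1], [-0.2, 2.0]]
codebook = [[1.0, 0.0], [0.0, 1.0]]
indices = assign_top1(weights, codebook)
# Forward pass uses the selected codeword directly (no weighted average).
# In training, a straight-through estimator would treat this hard selection
# as the identity when propagating gradients.
quantized = [codebook[j] for j in indices]
print(indices, quantized)
```

Because cosine similarity ignores magnitude, an assignment like this selects by direction only, which is one of the design trade-offs such a scheme has to handle.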

4.0 CV Apr 28

Exploring Remote Photoplethysmography for Neonatal Pain Detection from Facial Videos

Ashutosh Dhamaniya, Anup Kumar Gupta, Trishna Saikia et al.

Unaddressed pain in neonates can lead to adverse effects, including delayed development and slower weight gain, emphasising the need for more objective and reliable pain assessment methods. Hence, automated methods using behavioural and physiological pain indicators have been developed to aid healthcare professionals in the Neonatal ICU. Traditional contact-based methods for physiological parameter estimation are unsuitable for long-term monitoring and increase the risk of spreading diseases like COVID-19. We introduce a novel approach using remote photoplethysmography (rPPG) to estimate pulse signals in a non-contact manner and employ them for neonatal pain detection. The temporal signals acquired from regions-of-interest (ROIs) affected by skin deformations may exhibit lower quality and provide erroneous rPPG signals. Therefore, we incorporated a quality parameter to select the temporal signals obtained from ROIs that are least affected by skin deformations. Further, we employed signal-to-noise ratio as a fitness parameter to extract the rPPG signal corresponding to the clip that is least affected by noise. Experimental findings demonstrate that the rPPG signals provide useful information for neonatal pain detection, and signals extracted from the blue colour channel outperform those extracted from other colour channels. We also show that combining rPPG and audio features provides better results than individual modalities.

25.0 AR Mar 12

Link Quality Aware Pathfinding for Chiplet Interconnects

Aaron Yen, Jooyeon Jeong, Puneet Gupta

As chiplet-based integration advances, designers must select among short-reach die-to-die interconnect technologies with widely varying shoreline and areal bandwidth density, energy per bit, reach, and raw bit error rate (BER). Meeting stringent delivered BER targets in chiplet systems requires error-correcting codes (ECC), but incurs energy, area, and throughput overheads. We develop a flow centered around RTL synthesis power and area estimations to support pathfinding of inter-chiplet links under a stringent $10^{-27}$ delivered BER target. We synthesize a parameterized Reed-Solomon code with CRC-64 and Go-Back-N retry logic to estimate the correction overhead for different transceiver bit error rates. Results show that ECC can materially change link comparisons under common figures of merit and that CRC+ARQ can reduce the required RS strength (and decoder overhead) at moderate BERs while still meeting stringent delivered-BER targets. We present a CP-SAT-based link assignment formulation that uses these ECC-corrected metrics under reach, delivered-bandwidth, and shoreline constraints in system-level optimization.

AR Jun 28, 2024

FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models

Saeed Rashidi, William Won, Sudarshan Srinivasan et al.

Distributed Deep Neural Network (DNN) training is a technique to reduce the training overhead by distributing the training tasks across multiple accelerators, according to a parallelization strategy. However, high-performance compute and interconnects are needed for maximum speed-up and linear scaling of the system. Wafer-scale systems are a promising technology that allows for tightly integrating high-end accelerators with high-speed wafer-scale interconnects, making them an attractive platform for distributed training. However, the wafer-scale interconnect should offer high performance and flexibility for various parallelization strategies to enable maximum optimizations for compute and memory usage. In this paper, we propose FRED, a wafer-scale interconnect that is tailored for the high-bandwidth requirements of wafer-scale networks and can efficiently execute communication patterns of different parallelization strategies. Furthermore, FRED supports in-switch collective communication execution that reduces the network traffic by approximately 2X. Our results show that FRED can improve the average end-to-end training time of ResNet-152, Transformer-17B, GPT-3, and Transformer-1T by 1.76X, 1.87X, 1.34X, and 1.4X, respectively, when compared to a baseline wafer-scale 2D-Mesh fabric.

LG Jan 25, 2022

Bit-serial Weight Pools: Compression and Arbitrary Precision Execution of Neural Networks on Resource Constrained Processors

Shurui Li, Puneet Gupta

Applications of neural networks on edge systems have proliferated in recent years, but ever-increasing model sizes prevent neural networks from being deployed efficiently on resource-constrained microcontrollers. We propose bit-serial weight pools, an end-to-end framework that includes network compression and acceleration of arbitrary sub-byte precision. The framework can achieve up to 8x compression compared to 8-bit networks by sharing a pool of weights across the entire network. We further propose a bit-serial, lookup-based software implementation that allows a runtime-bitwidth tradeoff and is able to achieve more than 2.8x speedup and 7.5x storage compression compared to 8-bit weight pool networks, with less than 1% accuracy drop.
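The runtime-bitwidth tradeoff of bit-serial execution can be sketched with a generic bit-serial dot product. This is not the weight-pool lookup scheme from the paper: it only shows that processing weights one bit plane at a time, most significant bit first, lets the loop stop early to trade precision for work. The function names and the unsigned-integer weight format are assumptions for illustration.

```python
def bit_planes(weights, bits):
    """Split unsigned integer weights into bit planes, most significant first."""
    return [[(w >> b) & 1 for w in weights] for b in range(bits - 1, -1, -1)]

def bit_serial_dot(activations, weights, bits, used_bits=None):
    """Dot product accumulated one weight bit plane at a time.

    used_bits limits how many of the most significant planes are processed,
    which lowers precision but also lowers the number of passes.
    """
    used = bits if used_bits is None else used_bits
    total = 0
    for i, plane in enumerate(bit_planes(weights, bits)[:used]):
        # Add up the activations whose weight has this bit set.
        partial = sum(a for a, bit in zip(activations, plane) if bit)
        total += partial << (bits - 1 - i)   # scale by the bit's place value
    return total

acts = [3, 1, 2]
wts = [5, 2, 7]                      # hypothetical 3-bit weights
print(bit_serial_dot(acts, wts, bits=3))               # full precision
print(bit_serial_dot(acts, wts, bits=3, used_bits=2))  # least significant bit dropped
```

With all three planes the result equals the ordinary dot product; with two planes it equals the dot product against weights truncated to their top two bits.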

LG Dec 23, 2021

High Throughput Multi-Channel Parallelized Diffraction Convolutional Neural Network Accelerator

Zibo Hu, Shurui Li, Russell L. T. Schwartz et al.

Convolutional neural networks are central to image and signal processing, including the associated classification and training tasks, and account for the majority of machine learning compute demand today. Because convolution operations are computationally intensive, next-generation hardware accelerators need to offer parallelization and algorithmic-hardware homomorphism. Diffractive display optics is capable of million-channel parallel data processing at low latency; however, it has so far demonstrated only slow operation, at tens of hertz, with a single image and kernel, significantly underdelivering on its performance potential. Here, we demonstrate an operation-parallelized, high-throughput Fourier optic convolutional neural network accelerator. For the first time, simultaneous processing of multiple kernels in the Fourier domain, enabled by optical diffraction, has been achieved alongside the input parallelism that is already conventional in the field. Additionally, we show a system speed-up of about one hundred times over existing optical diffraction-based processors, and this demonstration rivals the performance of modern electronic solutions. Therefore, this system is capable of processing large-scale matrices about ten times faster than state-of-the-art electronic systems.

LG Mar 1, 2021

SWIS -- Shared Weight bIt Sparsity for Efficient Neural Network Acceleration

Shurui Li, Wojciech Romaszkan, Alexander Graening et al.

Quantization is spearheading the increase in performance and efficiency of neural network computing systems making headway into commodity hardware. We present SWIS - Shared Weight bIt Sparsity, a quantization framework for efficient neural network inference acceleration delivering improved performance and storage compression through an offline weight decomposition and scheduling algorithm. SWIS can achieve up to 54.3% (19.8%) point accuracy improvement compared to weight truncation when quantizing MobileNet-v2 to 4 (2) bits post-training (with retraining), showing the strength of leveraging shared bit-sparsity in weights. The SWIS accelerator gives up to 6x speedup and 1.9x energy improvement over state-of-the-art bit-serial architectures.

CV May 10, 2020

MOMBAT: Heart Rate Monitoring from Face Video using Pulse Modeling and Bayesian Tracking

Puneet Gupta, Brojeshwar Bhowmick, Arpan Pal

A non-invasive yet inexpensive method for heart rate (HR) monitoring is of great importance in many real-world applications including healthcare, psychology understanding, affective computing and biometrics. Face videos are currently utilized for such HR monitoring, but unfortunately this can lead to errors due to the noise introduced by facial expressions, out-of-plane movements, camera parameters (like focus change) and environmental factors. We alleviate these issues by proposing a novel face video based HR monitoring method MOMBAT, that is, MOnitoring using Modeling and BAyesian Tracking. We utilize out-of-plane face movements to define a novel quality estimation mechanism. Subsequently, we introduce a Fourier basis based modeling to reconstruct the cardiovascular pulse signal at the locations containing the poor quality, that is, the locations affected by out-of-plane face movements. Furthermore, we design a Bayesian decision theory based HR tracking mechanism to rectify the spurious HR estimates. Experimental results reveal that our proposed method, MOMBAT outperforms state-of-the-art HR monitoring methods and performs HR monitoring with an average absolute error of 1.329 beats per minute and the Pearson correlation between estimated and actual heart rate is 0.9746. Moreover, it demonstrates that HR monitoring is significantly

LG Jul 30, 2019

Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

Saptadeep Pal, Eiman Ebrahimi, Arslan Zulfiqar et al.

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used parallelization strategy, but as the number of devices in data parallel training grows, so does the communication overhead between devices. Additionally, a larger aggregate batch size per step leads to statistical efficiency loss, i.e., a larger number of epochs are required to converge to a desired accuracy. These factors affect overall training time and beyond a certain number of devices, the speedup from leveraging DP begins to scale poorly. In addition to DP, each training step can be accelerated by exploiting model parallelism (MP). This work explores hybrid parallelization, where each data parallel worker is comprised of more than one device, across which the model dataflow graph (DFG) is split using MP. We show that at scale, hybrid training will be more effective at minimizing end-to-end training time than exploiting DP alone. We project that for Inception-V3, GNMT, and BigLSTM, the hybrid strategy provides an end-to-end training speedup of at least 26.5%, 8%, and 22% respectively compared to what DP alone can achieve at scale.

CR Aug 4, 2018

Implementation and Analysis of Stable PUFs Using Gate Oxide Breakdown

Wei-Che Wang, Yair Yona, Yizhang Wu et al.

We implement and analyze highly stable PUFs using two random gate oxide breakdown mechanisms: plasma induced breakdown and voltage stressed breakdown. These gate oxide breakdown PUFs can be easily implemented in commercial silicon processes, and they are highly stable. We fabricated bit generation units for the stable PUFs on 99 testchips with 65 nm CMOS bulk technology. Measurement results show that the plasma induced breakdown can generate completely stable responses. For the voltage stressed breakdown, the responses have a 0.12% error probability at a worst-case corner, which can be effectively accommodated by taking the majority vote from multiple measurements. Both PUFs show significant area reduction compared to SRAM PUF. We compare methods for evaluating the security level of PUFs such as min-entropy, mutual information and guesswork as well as inter- and intra-FHD, and the popular NIST test suite. We show that guesswork can be viewed as a generalization of min-entropy and mutual information. In addition, we analyze our testchip data and show through various statistical distance measures that the bits are independent. Finally, we propose guesswork as a new statistical measure for the level of statistical independence that also has an operational meaning in terms of security.
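The effect of majority voting on the 0.12% worst-case error probability quoted above can be worked out with a short binomial calculation. The sketch assumes that repeated measurements of a bit fail independently with the same probability, which the abstract does not state, and the function name is illustrative.

```python
from math import comb

def majority_error(p, n):
    """Probability that a majority vote over n reads is wrong.

    Assumes each read is wrong independently with probability p (an
    assumption of this sketch, not a claim from the abstract).
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of reads to avoid ties")
    need = n // 2 + 1   # number of wrong reads needed to flip the vote
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

p = 0.0012   # 0.12% per-read error probability from the abstract
for reads in (1, 3, 5):
    print(reads, majority_error(p, reads))
```

Under that independence assumption, three reads bring the per-bit error from about 1.2e-3 down to roughly 4.3e-6, since the vote fails only when at least two of the three reads are wrong.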

CR Jan 19, 2017

Design and Analysis of Stability-Guaranteed PUFs

Wei-Che Wang, Yair Yona, Suhas Diggavi et al.

The lack of stability is one of the major limitations that keeps PUFs from being put into widespread practical use. In this paper, we propose a weak PUF and a strong PUF that are both completely stable with 0% intra-distance. These PUFs are called Locally Enhanced Defectivity PUFs (LEDPUFs). The source of randomness of a LEDPUF is extracted from locally enhanced defectivity without affecting other parts of the chip. A LEDPUF is a purely functional PUF that does not require any kind of correction scheme, as conventional parametric PUFs do. A weak LEDPUF is constructed by forming arrays of Directed Self Assembly (DSA) random connections, and the strong LEDPUF is implemented by using the weak LEDPUF as the key of a keyed-hash message authentication code (HMAC). Our simulation and statistical results show that the entropy of the weak LEDPUF bits is close to ideal, and the inter-distances of both weak and strong LEDPUFs are about 50%, which means that these LEDPUFs are not only stable but also unique. We develop a new unified framework for evaluating the level of security of PUFs, based on password security, by using the information-theoretic tool of guesswork. The guesswork model makes it possible to quantitatively compare, with a single unified metric, PUFs with varying levels of stability, bias, and available side information. In addition, it generalizes other measures used to evaluate the security level, such as min-entropy and mutual information. We evaluate the guesswork-based security of some measured SRAM and Ring Oscillator PUFs as an example and compare them with LEDPUF to show that stability has a more severe impact on PUF security than biased responses. Furthermore, we find the guesswork of three new problems: guesswork under probability of attack failure, the guesswork of an idealized version of a message authentication code, and the guesswork of strong PUFs that are used for authentication.
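The construction of a strong PUF from a weak PUF key and an HMAC, together with the intra- and inter-distance metrics mentioned above, can be sketched with Python's standard library. The key bytes below are made-up placeholders for weak PUF readouts, and SHA-256 is an assumed hash choice; the abstract specifies only that an HMAC is used.

```python
import hashlib
import hmac

def strong_puf_response(weak_puf_key: bytes, challenge: bytes) -> bytes:
    """Response of the sketched strong PUF: HMAC keyed with the weak PUF bits."""
    # SHA-256 is an assumption of this sketch, not taken from the paper.
    return hmac.new(weak_puf_key, challenge, hashlib.sha256).digest()

def fractional_hamming_distance(a: bytes, b: bytes) -> float:
    """Fraction of differing bits between two equal-length responses."""
    if len(a) != len(b):
        raise ValueError("responses must have the same length")
    diff = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return diff / (8 * len(a))

# Hypothetical weak PUF readouts for two different chips.
chip_a_key = bytes.fromhex("3fa91c77d2045be8")
chip_b_key = bytes.fromhex("9b10e6a4487fd352")
challenge = b"example-challenge"

first = strong_puf_response(chip_a_key, challenge)
again = strong_puf_response(chip_a_key, challenge)
other = strong_puf_response(chip_b_key, challenge)

print("intra-distance:", fractional_hamming_distance(first, again))
print("inter-distance:", fractional_hamming_distance(first, other))
```

If the weak PUF key is perfectly stable, the same chip always returns the same response, so the intra-distance is 0; responses from different keys should differ in roughly half of their bits.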