LGJul 11, 2023
Memorization Through the Lens of Curvature of Loss Function Around SamplesIsha Garg, Deepak Ravikumar, Kaushik Roy
Deep neural networks are over-parameterized and easily overfit the datasets they train on. In the extreme case, it has been shown that these networks can memorize a training set with fully randomized labels. We propose using the curvature of loss function around each training sample, averaged over training epochs, as a measure of memorization of the sample. We use this metric to study the generalization versus memorization properties of different samples in popular image datasets and show that it captures memorization statistics well, both qualitatively and quantitatively. We first show that the high curvature samples visually correspond to long-tailed, mislabeled, or conflicting samples, those that are most likely to be memorized. This analysis helps us find, to the best of our knowledge, a novel failure mode on the CIFAR100 and ImageNet datasets: that of duplicated images with differing labels. Quantitatively, we corroborate the validity of our scores via two methods. First, we validate our scores against an independent and comprehensively calculated baseline, by showing high cosine similarity with the memorization scores released by Feldman and Zhang (2020). Second, we inject corrupted samples which are memorized by the network, and show that these are learned with high curvature. To this end, we synthetically mislabel a random subset of the dataset. We overfit a network to it and show that sorting by curvature yields high AUROC values for identifying the corrupted samples. An added advantage of our method is that it is scalable, as it requires training only a single network as opposed to the thousands trained by the baseline, while capturing the aforementioned failure mode that the baseline fails to identify.
CVJul 5, 2024
Revealing the Utilized Rank of Subspaces of Learning in Neural NetworksIsha Garg, Christian Koguchi, Eshan Verma et al.
In this work, we study how well the learned weights of a neural network utilize the space available to them. This notion is related to capacity, but additionally incorporates the interaction of the network architecture with the dataset. Most learned weights appear to be full rank, and are therefore not amenable to low rank decomposition. This deceptively implies that the weights are utilizing the entire space available to them. We propose a simple data-driven transformation that projects the weights onto the subspace where the data and the weight interact. This preserves the functional mapping of the layer and reveals its low rank structure. In our findings, we conclude that most models utilize a fraction of the available space. For instance, for ViTB-16 and ViTL-16 trained on ImageNet, the mean layer utilization is 35% and 20% respectively. Our transformation results in reducing the parameters to 50% and 25% respectively, while resulting in less than 0.2% accuracy drop after fine-tuning. We also show that self-supervised pre-training drives this utilization up to 70%, justifying its suitability for downstream tasks.
LGJul 17, 2025
Apple Intelligence Foundation Language Models: Tech Report 2025Ethan Li, Anders Boesen Lindbo Larsen, Chen Zhang et al. · apple-ml, cmu
We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: i a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and ii a scalable server model built on a novel Parallel-Track Mixture-of-Experts PT-MoE transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.
ETMar 19, 2024
Pruning for Improved ADC Efficiency in Crossbar-based Analog In-memory AcceleratorsTimur Ibrayev, Isha Garg, Indranil Chakraborty et al.
Deep learning has proved successful in many applications but suffers from high computational demands and requires custom accelerators for deployment. Crossbar-based analog in-memory architectures are attractive for acceleration of deep neural networks (DNN), due to their high data reuse and high efficiency enabled by combining storage and computation in memory. However, they require analog-to-digital converters (ADCs) to communicate crossbar outputs. ADCs consume a significant portion of energy and area of every crossbar processing unit, thus diminishing the potential efficiency benefits. Pruning is a well-studied technique to improve the efficiency of DNNs but requires modifications to be effective for crossbars. In this paper, we motivate crossbar-attuned pruning to target ADC-specific inefficiencies. This is achieved by identifying three key properties (dubbed D.U.B.) that induce sparsity that can be utilized to reduce ADC energy without sacrificing accuracy. The first property ensures that sparsity translates effectively to hardware efficiency by restricting sparsity levels to Discrete powers of 2. The other 2 properties encourage columns in the same crossbar to achieve both Unstructured and Balanced sparsity in order to amortize the accuracy drop. The desired D.U.B. sparsity is then achieved by regularizing the variance of $L_{0}$ norms of neighboring columns within the same crossbar. Our proposed implementation allows it to be directly used in end-to-end gradient-based training. We apply the proposed algorithm to convolutional layers of VGG11 and ResNet18 models, trained on CIFAR-10 and ImageNet datasets, and achieve up to 7.13x and 1.27x improvement, respectively, in ADC energy with less than 1% drop in accuracy.
LGJan 21, 2022
TOFU: Towards Obfuscated Federated Updates by Encoding Weight Updates into Gradients from Proxy DataIsha Garg, Manish Nagaraj, Kaushik Roy
Advances in Federated Learning and an abundance of user data have enabled rich collaborative learning between multiple clients, without sharing user data. This is done via a central server that aggregates learning in the form of weight updates. However, this comes at the cost of repeated expensive communication between the clients and the server, and concerns about compromised user privacy. The inversion of gradients into the data that generated them is termed data leakage. Encryption techniques can be used to counter this leakage, but at added expense. To address these challenges of communication efficiency and privacy, we propose TOFU, a novel algorithm which generates proxy data that encodes the weight updates for each client in its gradients. Instead of weight updates, this proxy data is now shared. Since input data is far lower in dimensional complexity than weights, this encoding allows us to send much lesser data per communication round. Additionally, the proxy data resembles noise, and even perfect reconstruction from data leakage attacks would invert the decoded gradients into unrecognizable noise, enhancing privacy. We show that TOFU enables learning with less than 1% and 7% accuracy drops on MNIST and on CIFAR-10 datasets, respectively. This drop can be recovered via a few rounds of expensive encrypted gradient exchange. This enables us to learn to near-full accuracy in a federated setup, while being 4x and 6.6x more communication efficient than the standard Federated Averaging algorithm on MNIST and CIFAR-10, respectively.
CVDec 20, 2021
Encoding Hierarchical Information in Neural Networks helps in Subpopulation ShiftAmitangshu Mukherjee, Isha Garg, Kaushik Roy
Over the past decade, deep neural networks have proven to be adept in image classification tasks, often surpassing humans in terms of accuracy. However, standard neural networks often fail to understand the concept of hierarchical structures and dependencies among different classes for vision related tasks. Humans on the other hand, seem to intuitively learn categories conceptually, progressively growing from understanding high-level concepts down to granular levels of categories. One of the issues arising from the inability of neural networks to encode such dependencies within its learned structure is that of subpopulation shift -- where models are queried with novel unseen classes taken from a shifted population of the training set categories. Since the neural network treats each class as independent from all others, it struggles to categorize shifting populations that are dependent at higher levels of the hierarchy. In this work, we study the aforementioned problems through the lens of a novel conditional supervised training framework. We tackle subpopulation shift by a structured learning procedure that incorporates hierarchical information conditionally through labels. Furthermore, we introduce a notion of graphical distance to model the catastrophic effect of mispredictions. We show that learning in this structured hierarchical manner results in networks that are more robust against subpopulation shifts, with an improvement up to 3\% in terms of accuracy and up to 11\% in terms of graphical distance over standard models on subpopulation shift benchmarks.
LGApr 26, 2021
Spatio-Temporal Pruning and Quantization for Low-latency Spiking Neural NetworksSayeed Shafayet Chowdhury, Isha Garg, Kaushik Roy
Spiking Neural Networks (SNNs) are a promising alternative to traditional deep learning methods since they perform event-driven information processing. However, a major drawback of SNNs is high inference latency. The efficiency of SNNs could be enhanced using compression methods such as pruning and quantization. Notably, SNNs, unlike their non-spiking counterparts, consist of a temporal dimension, the compression of which can lead to latency reduction. In this paper, we propose spatial and temporal pruning of SNNs. First, structured spatial pruning is performed by determining the layer-wise significant dimensions using principal component analysis of the average accumulated membrane potential of the neurons. This step leads to 10-14X model compression. Additionally, it enables inference with lower latency and decreases the spike count per inference. To further reduce latency, temporal pruning is performed by gradually reducing the timesteps while training. The networks are trained using surrogate gradient descent based backpropagation and we validate the results on CIFAR10 and CIFAR100, using VGG architectures. The spatiotemporally pruned SNNs achieve 89.04% and 66.4% accuracy on CIFAR10 and CIFAR100, respectively, while performing inference with 3-30X reduced latency compared to state-of-the-art SNNs. Moreover, they require 8-14X lesser compute energy compared to their unpruned standard deep learning counterparts. The energy numbers are obtained by multiplying the number of operations with energy per operation. These SNNs also provide 1-4% higher robustness against Gaussian noise corrupted inputs. Furthermore, we perform weight quantization and find that performance remains reasonably stable up to 5-bit quantization.
LGMar 17, 2021
Gradient Projection Memory for Continual LearningGobinda Saha, Isha Garg, Kaushik Roy
The ability to learn continually without forgetting the past tasks is a desired attribute for artificial learning systems. Existing approaches to enable such learning in artificial neural networks usually rely on network growth, importance based weight update or replay of old data from the memory. In contrast, we propose a novel approach where a neural network learns new tasks by taking gradient steps in the orthogonal direction to the gradient subspaces deemed important for the past tasks. We find the bases of these subspaces by analyzing network representations (activations) after learning each task with Singular Value Decomposition (SVD) in a single shot manner and store them in the memory as Gradient Projection Memory (GPM). With qualitative and quantitative analyses, we show that such orthogonal gradient descent induces minimum to no interference with the past tasks, thereby mitigates forgetting. We evaluate our algorithm on diverse image classification datasets with short and long sequences of tasks and report better or on-par performance compared to the state-of-the-art approaches.
LGDec 15, 2020
Exploring Vicinal Risk Minimization for Lightweight Out-of-Distribution DetectionDeepak Ravikumar, Sangamesh Kodge, Isha Garg et al.
Deep neural networks have found widespread adoption in solving complex tasks ranging from image recognition to natural language processing. However, these networks make confident mispredictions when presented with data that does not belong to the training distribution, i.e. out-of-distribution (OoD) samples. In this paper we explore whether the property of Vicinal Risk Minimization (VRM) to smoothly interpolate between different class boundaries helps to train better OoD detectors. We apply VRM to existing OoD detection techniques and show their improved performance. We observe that existing OoD detectors have significant memory and compute overhead, hence we leverage VRM to develop an OoD detector with minimal overheard. Our detection method introduces an auxiliary class for classifying OoD samples. We utilize mixup in two ways to implement Vicinal Risk Minimization. First, we perform mixup within the same class and second, we perform mixup with Gaussian noise when training the auxiliary class. Our method achieves near competitive performance with significantly less compute and memory overhead when compared to existing OoD detection techniques. This facilitates the deployment of OoD detection on edge devices and expands our understanding of Vicinal Risk Minimization for use in training OoD detectors.
LGOct 5, 2020
DCT-SNN: Using DCT to Distribute Spatial Information over Time for Learning Low-Latency Spiking Neural NetworksIsha Garg, Sayeed Shafayet Chowdhury, Kaushik Roy
Spiking Neural Networks (SNNs) offer a promising alternative to traditional deep learning frameworks, since they provide higher computational efficiency due to event-driven information processing. SNNs distribute the analog values of pixel intensities into binary spikes over time. However, the most widely used input coding schemes, such as Poisson based rate-coding, do not leverage the additional temporal learning capability of SNNs effectively. Moreover, these SNNs suffer from high inference latency which is a major bottleneck to their deployment. To overcome this, we propose a scalable time-based encoding scheme that utilizes the Discrete Cosine Transform (DCT) to reduce the number of timesteps required for inference. DCT decomposes an image into a weighted sum of sinusoidal basis images. At each time step, the Hadamard product of the DCT coefficients and a single frequency base, taken in order, is given to an accumulator that generates spikes upon crossing a threshold. We use the proposed scheme to learn DCT-SNN, a low-latency deep SNN with leaky-integrate-and-fire neurons, trained using surrogate gradient descent based backpropagation. We achieve top-1 accuracy of 89.94%, 68.3% and 52.43% on CIFAR-10, CIFAR-100 and TinyImageNet, respectively using VGG architectures. Notably, DCT-SNN performs inference with 2-14X reduced latency compared to other state-of-the-art SNNs, while achieving comparable accuracy to their standard deep learning counterparts. The dimension of the transform allows us to control the number of timesteps required for inference. Additionally, we can trade-off accuracy with latency in a principled manner by dropping the highest frequency components during inference.
LGAug 4, 2020
TREND: Transferability based Robust ENsemble DesignDeepak Ravikumar, Sangamesh Kodge, Isha Garg et al.
Deep Learning models hold state-of-the-art performance in many fields, but their vulnerability to adversarial examples poses threat to their ubiquitous deployment in practical settings. Additionally, adversarial inputs generated on one classifier have been shown to transfer to other classifiers trained on similar data, which makes the attacks possible even if model parameters are not revealed to the adversary. This property of transferability has not yet been systematically studied, leading to a gap in our understanding of robustness of neural networks to adversarial inputs. In this work, we study the effect of network architecture, initialization, optimizer, input, weight and activation quantization on transferability of adversarial samples. We also study the effect of different attacks on transferability. Our experiments reveal that transferability is significantly hampered by input quantization and architectural mismatch between source and target, is unaffected by initialization but the choice of optimizer turns out to be critical. We observe that transferability is architecture-dependent for both weight and activation quantized models. To quantify transferability, we use simple metric and demonstrate the utility of the metric in designing a methodology to build ensembles with improved adversarial robustness. When attacking ensembles we observe that "gradient domination" by a single ensemble member model hampers existing attacks. To combat this we propose a new state-of-the-art ensemble attack. We compare the proposed attack with existing attack techniques to show its effectiveness. Finally, we show that an ensemble consisting of carefully chosen diverse networks achieves better adversarial robustness than would otherwise be possible with a single network.
LGJan 23, 2020
SPACE: Structured Compression and Sharing of Representational Space for Continual LearningGobinda Saha, Isha Garg, Aayush Ankit et al.
Humans learn adaptively and efficiently throughout their lives. However, incrementally learning tasks causes artificial neural networks to overwrite relevant information learned about older tasks, resulting in 'Catastrophic Forgetting'. Efforts to overcome this phenomenon often utilize resources poorly, for instance, by growing the network architecture or needing to save parametric importance scores, or violate data privacy between tasks. To tackle this, we propose SPACE, an algorithm that enables a network to learn continually and efficiently by partitioning the learnt space into a Core space, that serves as the condensed knowledge base over previously learned tasks, and a Residual space, which is akin to a scratch space for learning the current task. After learning each task, the Residual is analyzed for redundancy, both within itself and with the learnt Core space. A minimal number of extra dimensions required to explain the current task are added to the Core space and the remaining Residual is freed up for learning the next task. We evaluate our algorithm on P-MNIST, CIFAR and a sequence of 8 different datasets, and achieve comparable accuracy to the state-of-the-art methods while overcoming catastrophic forgetting. Additionally, our algorithm is well suited for practical use. The partitioning algorithm analyzes all layers in one shot, ensuring scalability to deeper networks. Moreover, the analysis of dimensions translates to filter-level sparsity, and the structured nature of the resulting architecture gives us up to 5x improvement in energy efficiency during task inference over the current state-of-the-art.
LGJun 4, 2019
Constructing Energy-efficient Mixed-precision Neural Networks through Principal Component Analysis for Edge IntelligenceIndranil Chakraborty, Deboleena Roy, Isha Garg et al.
The `Internet of Things' has brought increased demand for AI-based edge computing in applications ranging from healthcare monitoring systems to autonomous vehicles. Quantization is a powerful tool to address the growing computational cost of such applications, and yields significant compression over full-precision networks. However, quantization can result in substantial loss of performance for complex image classification tasks. To address this, we propose a Principal Component Analysis (PCA) driven methodology to identify the important layers of a binary network, and design mixed-precision networks. The proposed Hybrid-Net achieves a more than 10% improvement in classification accuracy over binary networks such as XNOR-Net for ResNet and VGG architectures on CIFAR-100 and ImageNet datasets while still achieving up to 94% of the energy-efficiency of XNOR-Nets. This work furthers the feasibility of using highly compressed neural networks for energy-efficient neural computing in edge devices.
CVDec 15, 2018
A Low Effort Approach to Structured CNN Design Using PCAIsha Garg, Priyadarshini Panda, Kaushik Roy
Deep learning models hold state of the art performance in many fields, yet their design is still based on heuristics or grid search methods that often result in overparametrized networks. This work proposes a method to analyze a trained network and deduce an optimized, compressed architecture that preserves accuracy while keeping computational costs tractable. Model compression is an active field of research that targets the problem of realizing deep learning models in hardware. However, most pruning methodologies tend to be experimental, requiring large compute and time intensive iterations of retraining the entire network. We introduce structure into model design by proposing a single shot analysis of a trained network that serves as a first order, low effort approach to dimensionality reduction, by using PCA (Principal Component Analysis). The proposed method simultaneously analyzes the activations of each layer and considers the dimensionality of the space described by the filters generating these activations. It optimizes the architecture in terms of number of layers, and number of filters per layer without any iterative retraining procedures, making it a viable, low effort technique to design efficient networks. We demonstrate the proposed methodology on AlexNet and VGG style networks on the CIFAR-10, CIFAR-100 and ImageNet datasets, and successfully achieve an optimized architecture with a reduction of up to 3.8X and 9X in the number of operations and parameters respectively, while trading off less than 1% accuracy. We also apply the method to MobileNet, and achieve 1.7X and 3.9X reduction in the number of operations and parameters respectively, while improving accuracy by almost one percentage point.