LGJun 9, 2023
Understanding the Effect of the Long Tail on Neural Network CompressionHarvey Dam, Vinu Joseph, Aditya Bhaskara et al.
Network compression is now a mature sub-field of neural network research: over the last decade, significant progress has been made towards reducing the size of models and speeding up inference, while maintaining the classification accuracy. However, many works have observed that focusing on just the overall accuracy can be misguided. E.g., it has been shown that mismatches between the full and compressed models can be biased towards under-represented classes. This raises the important research question, can we achieve network compression while maintaining "semantic equivalence" with the original network? In this work, we study this question in the context of the "long tail" phenomenon in computer vision datasets observed by Feldman, et al. They argue that memorization of certain inputs (appropriately defined) is essential to achieving good generalization. As compression limits the capacity of a network (and hence also its ability to memorize), we study the question: are mismatches between the full and compressed models correlated with the memorized training data? We present positive evidence in this direction for image classification tasks, by considering different base architectures and compression schemes.
DCMar 27
Fast Topology-Aware Lossy Data Compression with Full Preservation of Critical Points and Local OrderAlex Fallin, Nathaniel Gorski, Tripti Agarwal et al.
Many scientific codes and instruments generate large amounts of floating-point data at high rates that must be compressed before they can be stored. Typically, only lossy compression algorithms deliver high-enough compression ratios. However, many of them provide only point-wise error bounds and do not preserve topological aspects of the data such as the relative magnitude of neighboring points. Even topology-preserving compressors tend to merely preserve some critical points and are generally slow. Our Local-Order-Preserving Compressor is the first to preserve the full local order (and thus all critical points), runs orders of magnitude faster than prior topology-preserving compressors, yields higher compression ratios than lossless compressors, and produces bit-for-bit the same output on CPUs and GPUs.
LGNov 6, 2019Code
A Programmable Approach to Neural Network CompressionVinu Joseph, Saurav Muralidharan, Animesh Garg et al.
Deep neural networks (DNNs) frequently contain far more weights, represented at a higher precision, than are required for the specific task which they are trained to perform. Consequently, they can often be compressed using techniques such as weight pruning and quantization that reduce both the model size and inference time without appreciable loss in accuracy. However, finding the best compression strategy and corresponding target sparsity for a given DNN, hardware platform, and optimization objective currently requires expensive, frequently manual, trial-and-error experimentation. In this paper, we introduce a programmable system for model compression called Condensa. Users programmatically compose simple operators, in Python, to build more complex and practically interesting compression strategies. Given a strategy and user-provided objective (such as minimization of running time), Condensa uses a novel Bayesian optimization-based algorithm to automatically infer desirable sparsities. Our experiments on four real-world DNNs demonstrate memory footprint and hardware runtime throughput improvements of 188x and 2.59x, respectively, using at most ten samples per search. We have released a reference implementation of Condensa at https://github.com/NVlabs/condensa.
SENov 14, 2018Code
Multi-level analysis of compiler induced variability and performance tradeoffsMichael Bentley, Ian Briggs, Ganesh Gopalakrishnan et al.
Successful HPC software applications are long-lived. When ported across machines and their compilers, these applications often produce different numerical results, many of which are unacceptable. Such variability is also a concern while optimizing the code more aggressively to gain performance. Efficient tools that help locate the program units (files and functions) within which most of the variability occurs are badly needed, both to plan for code ports and to root-cause errors due to variability when they happen in the field. In this work, we offer an enhanced version of the open-source testing framework FLiT to serve these roles. Key new features of FLiT include a suite of bisection algorithms that help locate the root causes of variability. Another added feature allows an analysis of the tradeoffs between performance and the degree of variability. Our new contributions also include a collection of case studies. Results on the MFEM finite-element library include variability/performance tradeoffs, and the identification of a (hitherto unknown) abnormal level of result-variability even under mild compiler optimizations. Results from studying the Laghos proxy application include identifying a significantly divergent floating-point result-variability and successful root-causing down to the problematic function over as little as 14 program executions. Finally, in an evaluation of 4,376 controlled injections of floating-point perturbations on the LULESH proxy application, we showed that the FLiT framework has 100 precision and recall in discovering the file and function locations of the injections all within an average of only 15 program executions.
CLMay 28, 2025
Derailing Non-Answers via Logit Suppression at Output Subspace Boundaries in RLHF-Aligned Language ModelsHarvey Dam, Jonas Knochelmann, Vinu Joseph et al.
We introduce a method to reduce refusal rates of large language models (LLMs) on sensitive content without modifying model weights or prompts. Motivated by the observation that refusals in certain models were often preceded by the specific token sequence of a token marking the beginning of the chain-of-thought (CoT) block (<think>) followed by a double newline token (\n\n), we investigate the impact of two simple formatting adjustments during generation: suppressing \n\n after <think> and suppressing the end-of-sequence token after the end of the CoT block (</think>). Our method requires no datasets, parameter changes, or training, relying solely on modifying token probabilities during generation. In our experiments with official DeepSeek-R1 distillations, these interventions increased the proportion of substantive answers to sensitive prompts without affecting performance on standard benchmarks. Our findings suggest that refusal behaviors can be circumvented by blocking refusal subspaces at specific points in the generation process.
LGApr 5, 2025
Directional Sign Loss: A Topology-Preserving Loss Function that Approximates the Sign of Finite DifferencesHarvey Dam, Tripti Agarwal, Ganesh Gopalakrishnan
Preserving topological features in learned latent spaces is a fundamental challenge in representation learning, particularly for topology-sensitive data. This paper introduces directional sign loss (DSL), an efficient, differentiable loss function that approximates the number of mismatches in the signs of finite differences between corresponding elements of two arrays. By penalizing discrepancies in critical points between input and reconstructed data, DSL encourages autoencoders and other learnable compressors to retain the topological features of the original data. We present the formulation and complexity analysis of DSL, comparing it to other non-differentiable topological measures. Experiments on multidimensional array data show that combining DSL with traditional loss functions preserves topological features more effectively than traditional losses alone. DSL serves as a differentiable, efficient proxy for common topology-based metrics, enabling topological feature preservation on previously impractical problem sizes and in a wider range of gradient-based optimization frameworks.
DCJun 17, 2024
What Operations can be Performed Directly on Compressed Arrays, and with What Error?Tripti Agarwal, Harvey Dam, Dorra Ben Khalifa et al.
In response to the rapidly escalating costs of computing with large matrices and tensors caused by data movement, several lossy compression methods have been developed to significantly reduce data volumes. Unfortunately, all these methods require the data to be decompressed before further computations are done. In this work, we develop a lossy compressor that allows a dozen fairly fundamental operations directly on compressed data while offering good compression ratios and modest errors. We implement a new compressor PyBlaz based on the familiar GPU-powered PyTorch framework, and evaluate it on three non-trivial applications, choosing different number systems for internal representation. Our results demonstrate that the compressed-domain operations achieve good scalability with problem sizes while incurring errors well within acceptable limits. To our best knowledge, this is the first such lossy compressor that supports compressed-domain operations while achieving acceptable performance as well as error.
DCOct 13, 2021
Efficient Linearizability Checking for Actor-based SystemsMohammed S. Al-Mahfoudh, Ryan Stutsman, Ganesh Gopalakrishnan
Recent demand for distributed software had led to a surge in popularity in actor-based frameworks. However, even with the stylized message passing model of actors, writing correct distributed software is still difficult. We present our work on linearizability checking in DS2, an integrated framework for specifying, synthesizing, and testing distributed actor systems. The key insight of our approach is that often subcomponents of distributed actor systems represent common algorithms or data structures (e.g.\ a distributed hash table or tree) that can be validated against a simple sequential model of the system. This makes it easy for developers to validate their concurrent actor systems without complex specifications. DS2 automatically explores the concurrent schedules that system could arrive at, and it compares observed output of the system to ensure it is equivalent to what the sequential implementation could have produced. We describe DS2's linearizability checking and test it on several concurrent replication algorithms from the literature. We explore in detail how different algorithms for enumerating the model schedule space fare in finding bugs in actor systems, and we present our own refinements on algorithms for exploring actor system schedules that we show are effective in finding bugs.
DCMay 24, 2021
Automated Dynamic Concurrency Analysis for GoSaeed Taheri, Ganesh Gopalakrishnan
The concurrency features of the Go language have proven versatile in the development of a number of concurrency systems. However, correctness methods to address challenges in Go concurrency debugging have not received much attention. In this work, we present an automatic dynamic tracing mechanism that efficiently captures and helps analyze the whole-program concurrency model. Using an enhancement to the built-in tracer package of Go and a framework that collects dynamic traces from application execution, we enable thorough post-mortem analysis for concurrency debugging. Preliminary results about the effectiveness and scalability (up to more than 2K goroutines) of our proposed dynamic tracing for concurrent debugging are presented. We discuss the future direction for exploiting dynamic tracing towards accelerating concurrent bug exposure.
CVDec 3, 2020
Going Beyond Classification Accuracy Metrics in Model CompressionVinu Joseph, Shoaib Ahmed Siddiqui, Aditya Bhaskara et al.
With the rise in edge-computing devices, there has been an increasing demand to deploy energy and resource-efficient models. A large body of research has been devoted to developing methods that can reduce the size of the model considerably without affecting the standard metrics such as top-1 accuracy. However, these pruning approaches tend to result in a significant mismatch in other metrics such as fairness across classes and explainability. To combat such misalignment, we propose a novel multi-part loss function inspired by the knowledge-distillation literature. Through extensive experiments, we demonstrate the effectiveness of our approach across different compression algorithms, architectures, tasks as well as datasets. In particular, we obtain up to $4.1\times$ reduction in the number of prediction mismatches between the compressed and reference models, and up to $5.7\times$ in cases where the reference model makes the correct prediction; all while making no changes to the compression algorithm, and minor modifications to the loss function. Furthermore, we demonstrate how inducing simple alignment between the predictions of the models naturally improves the alignment on other metrics including fairness and attributions. Our framework can thus serve as a simple plug-and-play component for compression algorithms in the future.
DCSep 24, 2019
Message Scheduling for Performant, Many-Core Belief PropagationMark Van der Merwe, Vinu Joseph, Ganesh Gopalakrishnan
Belief Propagation (BP) is a message-passing algorithm for approximate inference over Probabilistic Graphical Models (PGMs), finding many applications such as computer vision, error-correcting codes, and protein-folding. While general, the convergence and speed of the algorithm has limited its practical use on difficult inference problems. As an algorithm that is highly amenable to parallelization, many-core Graphical Processing Units (GPUs) could significantly improve BP performance. Improving BP through many-core systems is non-trivial: the scheduling of messages in the algorithm strongly affects performance. We present a study of message scheduling for BP on GPUs. We demonstrate that BP exhibits a tradeoff between speed and convergence based on parallelism and show that existing message schedulings are not able to utilize this tradeoff. To this end, we present a novel randomized message scheduling approach, Randomized BP (RnBP), which outperforms existing methods on the GPU.
SEJun 29, 2016
PRESAGE: Protecting Structured Address Generation against Soft ErrorsVishal Chandra Sharma, Ganesh Gopalakrishnan, Sriram Krishnamoorthy
Modern computer scaling trends in pursuit of larger component counts and power efficiency have, unfortunately, lead to less reliable hardware and consequently soft errors escaping into application data ("silent data corruptions"). Techniques to enhance system resilience hinge on the availability of efficient error detectors that have high detection rates, low false positive rates, and lower computational overhead. Unfortunately, efficient detectors to detect faults during address generation (to index large arrays) have not been widely researched. We present a novel lightweight compiler-driven technique called PRESAGE for detecting bit-flips affecting structured address computations. A key insight underlying PRESAGE is that any address computation scheme that flows an already incurred error is better than a scheme that corrupts one particular array access but otherwise (falsely) appears to compute perfectly. Enabling the flow of errors allows one to situate detectors at loop exit points, and helps turn silent corruptions into easily detectable error situations. Our experiments using PolyBench benchmark suite indicate that PRESAGE-based error detectors have a high error-detection rate while incurring low overheads.