Abhishek Ghose

LG
h-index4
6papers
31citations
Novelty33%
AI Score24

6 Papers

AIJun 24, 2023
Are Good Explainers Secretly Human-in-the-Loop Active Learners?

Emma Thuong Nguyen, Abhishek Ghose

Explainable AI (XAI) techniques have become popular for multiple use-cases in the past few years. Here we consider its use in studying model predictions to gather additional training data. We argue that this is equivalent to Active Learning, where the query strategy involves a human-in-the-loop. We provide a mathematical approximation for the role of the human, and present a general formalization of the end-to-end workflow. This enables us to rigorously compare this use with standard Active Learning algorithms, while allowing for extensions to the workflow. An added benefit is that their utility can be assessed via simulation instead of conducting expensive user-studies. We also present some initial promising results.

LGOct 8, 2022
Data Selection: A General Principle for Building Small Interpretable Models

Abhishek Ghose

We present convincing empirical evidence for an effective and general strategy for building accurate small models. Such models are attractive for interpretability and also find use in resource-constrained environments. The strategy is to learn the training distribution and sample accordingly from the provided training data. The distribution learning algorithm is not a contribution of this work; our contribution is a rigorous demonstration of the broad utility of this strategy in various practical settings. We apply it to the tasks of (1) building cluster explanation trees, (2) prototype-based classification, and (3) classification using Random Forests, and show that it improves the accuracy of decades-old weak traditional baselines to be competitive with specialized modern techniques. This strategy is also versatile wrt the notion of model size. In the first two tasks, model size is considered to be number of leaves in the tree and the number of prototypes respectively. In the final task involving Random Forests, the strategy is shown to be effective even when model size comprises of more than one factor: number of trees and their maximum depth. Positive results using multiple datasets are presented that are shown to be statistically significant.

LGMar 23, 2024
On the Fragility of Active Learners for Text Classification

Abhishek Ghose, Emma Thuong Nguyen

Active learning (AL) techniques optimally utilize a labeling budget by iteratively selecting instances that are most valuable for learning. However, they lack ``prerequisite checks'', i.e., there are no prescribed criteria to pick an AL algorithm best suited for a dataset. A practitioner must pick a technique they \emph{trust} would beat random sampling, based on prior reported results, and hope that it is resilient to the many variables in their environment: dataset, labeling budget and prediction pipelines. The important questions then are: how often on average, do we expect any AL technique to reliably beat the computationally cheap and easy-to-implement strategy of random sampling? Does it at least make sense to use AL in an ``Always ON'' mode in a prediction pipeline, so that while it might not always help, it never under-performs random sampling? How much of a role does the prediction pipeline play in AL's success? We examine these questions in detail for the task of text classification using pre-trained representations, which are ubiquitous today. Our primary contribution here is a rigorous evaluation of AL techniques, old and new, across setups that vary wrt datasets, text representations and classifiers. This unlocks multiple insights around warm-up times, i.e., number of labels before gains from AL are seen, viability of an ``Always ON'' mode and the relative significance of different factors. Additionally, we release a framework for rigorous benchmarking of AL techniques for text classification.

LGOct 20, 2019
Rational Kernels: A survey

Abhishek Ghose

Many kinds of data are naturally amenable to being treated as sequences. An example is text data, where a text may be seen as a sequence of words. Another example is clickstream data, where a data instance is a sequence of clicks made by a visitor to a website. This is also common for data originating in the domains of speech processing and computational biology. Using such data with statistical learning techniques can often prove to be cumbersome since most of them only allow fixed-length feature vectors as input. In casting the data to fixed-length feature vectors to suit these techniques, we lose the convenience, and possibly information, a good sequence-based representation can offer. The framework of rational kernels partly addresses this problem by providing an elegant representation for sequences, for algorithms that use kernel functions. In this report, we take a comprehensive look at this framework, its various extensions and applications. We start with an overview of the core ideas, where we look at the characterization of rational kernels, and then extend our discussion to extensions, applications and use at scale. Rational kernels represent a family of kernels, and thus, learning an appropriate rational kernel instead of picking one, suggests a convenient way to use them; we explore this idea in our concluding section. Rational kernels are not as popular as the many other learning techniques in use today; however, we hope that this summary effectively shows that not only is their theory well-developed, but also that various practical aspects have been carefully studied over time.

LGJun 17, 2019
Learning Interpretable Models Using Uncertainty Oracles

Abhishek Ghose, Balaraman Ravindran

A desirable property of interpretable models is small size, so that they are easily understandable by humans. This leads to the following challenges: (a) small sizes typically imply diminished accuracy, and (b) bespoke levers provided by model families to restrict size, e.g., L1 regularization, might be insufficient to reach the desired size-accuracy trade-off. We address these challenges here. Earlier work has shown that learning the training distribution creates accurate small models. Our contribution is a new technique that exploits this idea. The training distribution is encoded as a Dirichlet Process to allow for a flexible number of modes that is learnable from the data. Its parameters are learned using Bayesian Optimization; a design choice that makes the technique applicable to non-differentiable loss functions. To avoid the challenges with high dimensionality, the data is first projected down to one-dimension using uncertainty scores of a separate probabilistic model, that we refer to as the uncertainty oracle. We show that this technique addresses the above challenges: (a) it arrests the reduction in accuracy that comes from shrinking a model (in some cases we observe $\sim 100\%$ improvement over baselines), and also, (b) that this maybe applied with no change across model families with different notions of size; results are shown for Decision Trees, Linear Probability models and Gradient Boosted Models. Additionally, we show that (1) it is more accurate than its predecessor, (2) requires only one hyperparameter to be set in practice, (3) accommodates a multi-variate notion of model size, e.g., both maximum depth of a tree and number of trees in Gradient Boosted Models, and (4) works across different feature spaces between the uncertainty oracle and the interpretable model, e.g., a GRU might act as an oracle for a decision tree that ingests n-grams.

LGMay 4, 2019
Interpretability with Accurate Small Models

Abhishek Ghose, Balaraman Ravindran

Models often need to be constrained to a certain size for them to be considered interpretable. For example, a decision tree of depth 5 is much easier to understand than one of depth 50. Limiting model size, however, often reduces accuracy. We suggest a practical technique that minimizes this trade-off between interpretability and classification accuracy. This enables an arbitrary learning algorithm to produce highly accurate small-sized models. Our technique identifies the training data distribution to learn from that leads to the highest accuracy for a model of a given size. We represent the training distribution as a combination of sampling schemes. Each scheme is defined by a parameterized probability mass function applied to the segmentation produced by a decision tree. An Infinite Mixture Model with Beta components is used to represent a combination of such schemes. The mixture model parameters are learned using Bayesian Optimization. Under simplistic assumptions, we would need to optimize for $O(d)$ variables for a distribution over a $d$-dimensional input space, which is cumbersome for most real-world data. However, we show that our technique significantly reduces this number to a \emph{fixed set of eight variables} at the cost of relatively cheap preprocessing. The proposed technique is flexible: it is \emph{model-agnostic}, i.e., it may be applied to the learning algorithm for any model family, and it admits a general notion of model size. We demonstrate its effectiveness using multiple real-world datasets to construct decision trees, linear probability models and gradient boosted models with different sizes. We observe significant improvements in the F1-score in most instances, exceeding an improvement of $100\%$ in some cases.