LGJul 18, 2023
Conformal prediction under ambiguous ground truthDavid Stutz, Abhijit Guha Roy, Tatiana Matejovicova et al. · deepmind
Conformal Prediction (CP) allows to perform rigorous uncertainty quantification by constructing a prediction set $C(X)$ satisfying $\mathbb{P}(Y \in C(X))\geq 1-α$ for a user-chosen $α\in [0,1]$ by relying on calibration data $(X_1,Y_1),...,(X_n,Y_n)$ from $\mathbb{P}=\mathbb{P}^{X} \otimes \mathbb{P}^{Y|X}$. It is typically implicitly assumed that $\mathbb{P}^{Y|X}$ is the "true" posterior label distribution. However, in many real-world scenarios, the labels $Y_1,...,Y_n$ are obtained by aggregating expert opinions using a voting procedure, resulting in a one-hot distribution $\mathbb{P}_{vote}^{Y|X}$. For such ``voted'' labels, CP guarantees are thus w.r.t. $\mathbb{P}_{vote}=\mathbb{P}^X \otimes \mathbb{P}_{vote}^{Y|X}$ rather than the true distribution $\mathbb{P}$. In cases with unambiguous ground truth labels, the distinction between $\mathbb{P}_{vote}$ and $\mathbb{P}$ is irrelevant. However, when experts do not agree because of ambiguous labels, approximating $\mathbb{P}^{Y|X}$ with a one-hot distribution $\mathbb{P}_{vote}^{Y|X}$ ignores this uncertainty. In this paper, we propose to leverage expert opinions to approximate $\mathbb{P}^{Y|X}$ using a non-degenerate distribution $\mathbb{P}_{agg}^{Y|X}$. We develop Monte Carlo CP procedures which provide guarantees w.r.t. $\mathbb{P}_{agg}=\mathbb{P}^X \otimes \mathbb{P}_{agg}^{Y|X}$ by sampling multiple synthetic pseudo-labels from $\mathbb{P}_{agg}^{Y|X}$ for each calibration example $X_1,...,X_n$. In a case study of skin condition classification with significant disagreement among expert annotators, we show that applying CP w.r.t. $\mathbb{P}_{vote}$ under-covers expert annotations: calibrated for $72\%$ coverage, it falls short by on average $10\%$; our Monte Carlo CP closes this gap both empirically and theoretically.
LGJul 5, 2023
Evaluating AI systems under uncertain ground truth: a case study in dermatologyDavid Stutz, Ali Taylan Cemgil, Abhijit Guha Roy et al. · deepmind
For safety, medical AI systems undergo thorough evaluations before deployment, validating their predictions against a ground truth which is assumed to be fixed and certain. However, this ground truth is often curated in the form of differential diagnoses. While a single differential diagnosis reflects the uncertainty in one expert assessment, multiple experts introduce another layer of uncertainty through disagreement. Both forms of uncertainty are ignored in standard evaluation which aggregates these differential diagnoses to a single label. In this paper, we show that ignoring uncertainty leads to overly optimistic estimates of model performance, therefore underestimating risk associated with particular diagnostic decisions. To this end, we propose a statistical aggregation approach, where we infer a distribution on probabilities of underlying medical condition candidates themselves, based on observed annotations. This formulation naturally accounts for the potential disagreements between different experts, as well as uncertainty stemming from individual differential diagnoses, capturing the entire ground truth uncertainty. Our approach boils down to generating multiple samples of medical condition probabilities, then evaluating and averaging performance metrics based on these sampled probabilities. In skin condition classification, we find that a large portion of the dataset exhibits significant ground truth uncertainty and standard evaluation severely over-estimates performance without providing uncertainty estimates. In contrast, our framework provides uncertainty estimates on common metrics of interest such as top-k accuracy and average overlap, showing that performance can change multiple percentage points. We conclude that, while assuming a crisp ground truth can be acceptable for many AI applications, a more nuanced evaluation protocol should be utilized in medical diagnosis.
LGJan 31, 2023
Transformers Meet Directed GraphsSimon Geisler, Yujia Li, Daniel Mankowitz et al. · deepmind
Transformers were originally proposed as a sequence-to-sequence model for text but have become vital for a wide range of modalities, including images, audio, video, and undirected graphs. However, transformers for directed graphs are a surprisingly underexplored topic, despite their applicability to ubiquitous domains, including source code and logic circuits. In this work, we propose two direction- and structure-aware positional encodings for directed graphs: (1) the eigenvectors of the Magnetic Laplacian - a direction-aware generalization of the combinatorial Laplacian; (2) directional random walk encodings. Empirically, we show that the extra directionality information is useful in various downstream tasks, including correctness testing of sorting networks and source code understanding. Together with a data-flow-centric graph construction, our model outperforms the prior state of the art on the Open Graph Benchmark Code2 relatively by 14.7%.
LGApr 4, 2024
Mitigating LLM Hallucinations via Conformal AbstentionYasin Abbasi Yadkori, Ilja Kuzborskij, David Stutz et al. · deepmind
We develop a principled procedure for determining when a large language model (LLM) should abstain from responding (e.g., by saying "I don't know") in a general domain, instead of resorting to possibly "hallucinating" a non-sensical or incorrect answer. Building on earlier approaches that use self-consistency as a more reliable measure of model confidence, we propose using the LLM itself to self-evaluate the similarity between each of its sampled responses for a given query. We then further leverage conformal prediction techniques to develop an abstention procedure that benefits from rigorous theoretical guarantees on the hallucination rate (error rate). Experimentally, our resulting conformal abstention method reliably bounds the hallucination rate on various closed-book, open-domain generative question answering datasets, while also maintaining a significantly less conservative abstention rate on a dataset with long responses (Temporal Sequences) compared to baselines using log-probability scores to quantify uncertainty, while achieveing comparable performance on a dataset with short answers (TriviaQA). To evaluate the experiments automatically, one needs to determine if two responses are equivalent given a question. Following standard practice, we use a thresholded similarity function to determine if two responses match, but also provide a method for calibrating the threshold based on conformal prediction, with theoretical guarantees on the accuracy of the match prediction, which might be of independent interest.
LGOct 18, 2021
Learning Optimal Conformal ClassifiersDavid Stutz, Krishnamurthy, Dvijotham et al.
Modern deep learning based classifiers show very high accuracy on test data but this does not provide sufficient guarantees for safe deployment, especially in high-stake AI applications such as medical diagnosis. Usually, predictions are obtained without a reliable uncertainty estimate or a formal guarantee. Conformal prediction (CP) addresses these issues by using the classifier's predictions, e.g., its probability estimates, to predict confidence sets containing the true class with a user-specified probability. However, using CP as a separate processing step after training prevents the underlying model from adapting to the prediction of confidence sets. Thus, this paper explores strategies to differentiate through CP during training with the goal of training model with the conformal wrapper end-to-end. In our approach, conformal training (ConfTr), we specifically "simulate" conformalization on mini-batches during training. Compared to standard training, ConfTr reduces the average confidence set size (inefficiency) of state-of-the-art CP methods applied after training. Moreover, it allows to "shape" the confidence sets predicted at test time, which is difficult for standard CP. On experiments with several datasets, we show ConfTr can influence how inefficiency is distributed across classes, or guide the composition of confidence sets in terms of the included classes, while retaining the guarantees offered by CP.
IRDec 20, 2020
Towards Fair Personalization by Avoiding Feedback LoopsGökhan Çapan, Özge Bozal, İlker Gündoğdu et al.
Self-reinforcing feedback loops are both cause and effect of over and/or under-presentation of some content in interactive recommender systems. This leads to erroneous user preference estimates, namely, overestimation of over-presented content while violating the right to be presented of each alternative, contrary of which we define as a fair system. We consider two models that explicitly incorporate, or ignore the systematic and limited exposure to alternatives. By simulations, we demonstrate that ignoring the systematic presentations overestimates promoted options and underestimates censored alternatives. Simply conditioning on the limited exposure is a remedy for these biases.
LGOct 4, 2020
Intermittent Demand Forecasting with Renewal ProcessesAli Caner Turkmen, Tim Januschowski, Yuyang Wang et al.
Intermittency is a common and challenging problem in demand forecasting. We introduce a new, unified framework for building intermittent demand forecasting models, which incorporates and allows to generalize existing methods in several directions. Our framework is based on extensions of well-established model-based methods to discrete-time renewal processes, which can parsimoniously account for patterns such as aging, clustering and quasi-periodicity in demand arrivals. The connection to discrete-time renewal processes allows not only for a principled extension of Croston-type models, but also for an natural inclusion of neural network based models---by replacing exponential smoothing with a recurrent neural network. We also demonstrate that modeling continuous-time demand arrivals, i.e., with a temporal point process, is possible via a trivial extension of our framework. This leads to more flexible modeling in scenarios where data of individual purchase orders are directly available with granular timestamps. Complementing this theoretical advancement, we demonstrate the efficacy of our framework for forecasting practice via an extensive empirical study on standard intermittent demand data sets, in which we report predictive accuracy in a variety of scenarios that compares favorably to the state of the art.
MLAug 15, 2019
A Bayesian Choice Model for Eliminating Feedback LoopsGökhan Çapan, Ilker Gündoğdu, Ali Caner Türkmen et al.
Self-reinforcing feedback loops in personalization systems are typically caused by users choosing from a limited set of alternatives presented systematically based on previous choices. We propose a Bayesian choice model built on Luce axioms that explicitly accounts for users' limited exposure to alternatives. Our model is fair---it does not impose negative bias towards unpresented alternatives, and practical---preference estimates are accurately inferred upon observing a small number of interactions. It also allows efficient sampling, leading to a straightforward online presentation mechanism based on Thompson sampling. Our approach achieves low regret in learning to present upon exploration of only a small fraction of possible presentations. The proposed structure can be reused as a building block in interactive systems, e.g., recommender systems, free of feedback loops.
MLMar 11, 2019
Bayesian Allocation Model: Inference by Sequential Monte Carlo for Nonnegative Tensor Factorizations and Topic Models using Polya UrnsAli Taylan Cemgil, Mehmet Burak Kurutmaz, Sinan Yildirim et al.
We introduce a dynamic generative model, Bayesian allocation model (BAM), which establishes explicit connections between nonnegative tensor factorization (NTF), graphical models of discrete probability distributions and their Bayesian extensions, and the topic models such as the latent Dirichlet allocation. BAM is based on a Poisson process, whose events are marked by using a Bayesian network, where the conditional probability tables of this network are then integrated out analytically. We show that the resulting marginal process turns out to be a Polya urn, an integer valued self-reinforcing process. This urn processes, which we name a Polya-Bayes process, obey certain conditional independence properties that provide further insight about the nature of NTF. These insights also let us develop space efficient simulation algorithms that respect the potential sparsity of data: we propose a class of sequential importance sampling algorithms for computing NTF and approximating their marginal likelihood, which would be useful for model selection. The resulting methods can also be viewed as a model scoring method for topic models and discrete Bayesian networks with hidden variables. The new algorithms have favourable properties in the sparse data regime when contrasted with variational algorithms that become more accurate when the total sum of the elements of the observed tensor goes to infinity. We illustrate the performance on several examples and numerically study the behaviour of the algorithms for various data regimes.
SDOct 31, 2018
Audio Source Separation Using Variational Autoencoders and Weak Class SupervisionErtuğ Karamatlı, Ali Taylan Cemgil, Serap Kırbız
In this paper, we propose a source separation method that is trained by observing the mixtures and the class labels of the sources present in the mixture without any access to isolated sources. Since our method does not require source class labels for every time-frequency bin but only a single label for each source constituting the mixture signal, we call this scenario as weak class supervision. We associate a variational autoencoder (VAE) with each source class within a non-negative (compositional) model. Each VAE provides a prior model to identify the signal from its associated class in a sound mixture. After training the model on mixtures, we obtain a generative model for each source class and demonstrate our method on one-second mixtures of utterances of digits from 0 to 9. We show that the separation performance obtained by source class supervision is as good as the performance obtained by source signal supervision.
MLNov 30, 2017
Differentially Private Variational DropoutBeyza Ermis, Ali Taylan Cemgil
Deep neural networks with their large number of parameters are highly flexible learning systems. The high flexibility in such networks brings with some serious problems such as overfitting, and regularization is used to address this problem. A currently popular and effective regularization technique for controlling the overfitting is dropout. Often, large data collections required for neural networks contain sensitive information such as the medical histories of patients, and the privacy of the training data should be protected. In this paper, we modify the recently proposed variational dropout technique which provided an elegant Bayesian interpretation to dropout, and show that the intrinsic noise in the variational dropout can be exploited to obtain a degree of differential privacy. The iterative nature of training neural networks presents a challenge for privacy-preserving estimation since multiple iterations increase the amount of noise added. We overcome this by using a relaxed notion of differential privacy, called concentrated differential privacy, which provides tighter estimates on the overall privacy loss. We demonstrate the accuracy of our privacy-preserving variational dropout algorithm on benchmark datasets.
MLNov 30, 2017
Differentially Private DropoutBeyza Ermis, Ali Taylan Cemgil
Large data collections required for the training of neural networks often contain sensitive information such as the medical histories of patients, and the privacy of the training data must be preserved. In this paper, we introduce a dropout technique that provides an elegant Bayesian interpretation to dropout, and show that the intrinsic noise added, with the primary goal of regularization, can be exploited to obtain a degree of differential privacy. The iterative nature of training neural networks presents a challenge for privacy-preserving estimation since multiple iterations increase the amount of noise added. We overcome this by using a relaxed notion of differential privacy, called concentrated differential privacy, which provides tighter estimates on the overall privacy loss. We demonstrate the accuracy of our privacy-preserving dropout algorithm on benchmark datasets.
CLOct 24, 2014
Clustering Words by Projection EntropyIşık Barış Fidaner, Ali Taylan Cemgil
We apply entropy agglomeration (EA), a recently introduced algorithm, to cluster the words of a literary text. EA is a greedy agglomerative procedure that minimizes projection entropy (PE), a function that can quantify the segmentedness of an element set. To apply it, the text is reduced to a feature allocation, a combinatorial object to represent the word occurences in the text's paragraphs. The experiment results demonstrate that EA, despite its reduction and simplicity, is useful in capturing significant relationships among the words in the text. This procedure was implemented in Python and published as a free software: REBUS.
LGOct 1, 2013
Summary Statistics for Partitionings and Feature AllocationsIşık Barış Fidaner, Ali Taylan Cemgil
Infinite mixture models are commonly used for clustering. One can sample from the posterior of mixture assignments by Monte Carlo methods or find its maximum a posteriori solution by optimization. However, in some problems the posterior is diffuse and it is hard to interpret the sampled partitionings. In this paper, we introduce novel statistics based on block sizes for representing sample sets of partitionings and feature allocations. We develop an element-based definition of entropy to quantify segmentation among their elements. Then we propose a simple algorithm called entropy agglomeration (EA) to summarize and visualize this information. Experiments on various infinite mixture posteriors as well as a feature allocation dataset demonstrate that the proposed statistics are useful in practice.