Maxim Panov

ML
h-index47
54papers
1,350citations
Novelty53%
AI Score59

54 Papers

MLJun 8, 2023
Conformal Prediction for Federated Uncertainty Quantification Under Label Shift

Vincent Plassier, Mehdi Makni, Aleksandr Rubashevskii et al.

Federated Learning (FL) is a machine learning framework where many clients collaboratively train models while keeping the training data decentralized. Despite recent advances in FL, the uncertainty quantification topic (UQ) remains partially addressed. Among UQ methods, conformal prediction (CP) approaches provides distribution-free guarantees under minimal assumptions. We develop a new federated conformal prediction method based on quantile regression and take into account privacy constraints. This method takes advantage of importance weighting to effectively address the label shift between agents and provides theoretical guarantees for both valid coverage of the prediction sets and differential privacy. Extensive experimental studies demonstrate that this method outperforms current competitors.

CLNov 13, 2023
LM-Polygraph: Uncertainty Estimation for Language Models

Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun et al.

Recent advancements in the capabilities of large language models (LLMs) have paved the way for a myriad of groundbreaking applications in various fields. However, a significant challenge arises as these models often "hallucinate", i.e., fabricate facts without providing users an apparent means to discern the veracity of their statements. Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of LLMs. However, to date, research on UE methods for LLMs has been focused primarily on theoretical rather than engineering contributions. In this work, we tackle this issue by introducing LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python. Additionally, it introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores, empowering end-users to discern unreliable responses. LM-Polygraph is compatible with the most recent LLMs, including BLOOMz, LLaMA-2, ChatGPT, and GPT-4, and is designed to support future releases of similarly-styled LMs.

LGJun 21, 2022
Towards OOD Detection in Graph Classification from Uncertainty Estimation Perspective

Gleb Bazhenov, Sergei Ivanov, Maxim Panov et al.

The problem of out-of-distribution detection for graph classification is far from being solved. The existing models tend to be overconfident about OOD examples or completely ignore the detection task. In this work, we consider this problem from the uncertainty estimation perspective and perform the comparison of several recently proposed methods. In our experiment, we find that there is no universal approach for OOD detection, and it is important to consider both graph representations and predictive categorical distribution.

63.6CLApr 15
Adaptive Conformal Prediction for Improving Factuality of Generations by Large Language Models

Aleksandr Rubashevskii, Dzianis Piatrashyn, Preslav Nakov et al.

Large language models (LLMs) are prone to generating factually incorrect outputs. Recent work has applied conformal prediction to provide uncertainty estimates and statistical guarantees for the factuality of LLM generations. However, existing approaches are typically not prompt-adaptive, limiting their ability to capture input-dependent variability. As a result, they may filter out too few items (leading to over-coverage) or too many (under-coverage) for a given task or prompt. We propose an adaptive conformal prediction approach that extends conformal score transformation methods to LLMs, with applications to long-form generation and multiple-choice question answering. This enables prompt-dependent calibration, retaining marginal coverage guarantees while improving conditional coverage. In addition, the approach naturally supports selective prediction, allowing unreliable claims or answer choices to be filtered out in downstream applications. We evaluate our approach on multiple white-box models across diverse domains and show that it significantly outperforms existing baselines in terms of conditional coverage.

75.3CLApr 12
Why Don't You Know? Evaluating the Impact of Uncertainty Sources on Uncertainty Quantification in LLMs

Maiya Goloburda, Roman Vashurin, Fedor Chernogorsky et al.

As Large Language Models (LLMs) are increasingly deployed in real-world applications, reliable uncertainty quantification (UQ) becomes critical for safe and effective use. Most existing UQ approaches for language models aim to produce a single confidence score -- for example, estimating the probability that a model's answer is correct. However, uncertainty in natural language tasks arises from multiple distinct sources, including model knowledge gaps, output variability, and input ambiguity, which have different implications for system behavior and user interaction. In this work, we study how the source of uncertainty impacts the behavior and effectiveness of existing UQ methods. To enable controlled analysis, we introduce a new dataset that explicitly categorizes uncertainty sources, allowing systematic evaluation of UQ performance under each condition. Our experiments reveal that while many UQ methods perform well when uncertainty stems solely from model knowledge limitations, their performance degrades or becomes misleading when other sources are introduced. These findings highlight the need for uncertainty-aware methods that explicitly account for the source of uncertainty in large language models.

CLAug 20, 2024
Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models

Artem Vazhentsev, Ekaterina Fadeeva, Rui Xing et al.

Uncertainty quantification (UQ) has emerged as a promising approach for detecting hallucinations and low-quality output of Large Language Models (LLMs). However, obtaining proper uncertainty scores is complicated by the conditional dependency between the generation steps of an autoregressive LLM because it is hard to model it explicitly. Here, we propose to learn this dependency from attention-based features. In particular, we train a regression model that leverages LLM attention maps, probabilities on the current generation step, and recurrently computed uncertainty scores from previously generated tokens. To incorporate the recurrent features, we also suggest a two-staged training procedure. Our experimental evaluation on ten datasets and three LLMs shows that the proposed method is highly effective for selective generation, achieving substantial improvements over rivaling unsupervised and supervised approaches.

CLJan 9, 2023
Active Learning for Abstractive Text Summarization

Akim Tsvigun, Ivan Lysenko, Danila Sedashov et al.

Construction of human-curated annotated datasets for abstractive text summarization (ATS) is very time-consuming and expensive because creating each instance requires a human annotator to read a long document and compose a shorter summary that would preserve the key information relayed by the original document. Active Learning (AL) is a technique developed to reduce the amount of annotation required to achieve a certain level of machine learning model performance. In information extraction and text classification, AL can reduce the amount of labor up to multiple times. Despite its potential for aiding expensive annotation, as far as we know, there were no effective AL query strategies for ATS. This stems from the fact that many AL strategies rely on uncertainty estimation, while as we show in our work, uncertain instances are usually noisy, and selecting them can degrade the model performance compared to passive annotation. We address this problem by proposing the first effective query strategy for AL in ATS based on diversity principles. We show that given a certain annotation budget, using our strategy in AL annotation helps to improve the model performance in terms of ROUGE and consistency scores. Additionally, we analyze the effect of self-learning and show that it can further increase the performance of the model.

MLDec 10, 2025
Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search

Ekaterina Fadeeva, Maiya Goloburda, Aleksandr Rubashevskii et al.

Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.

CVJan 2, 2023
Learning Confident Classifiers in the Presence of Label Noise

Asma Ahmed Hashmi, Aigerim Zhumabayeva, Nikita Kotelevskii et al.

The success of Deep Neural Network (DNN) models significantly depends on the quality of provided annotations. In medical image segmentation, for example, having multiple expert annotations for each data point is common to minimize subjective annotation bias. Then, the goal of estimation is to filter out the label noise and recover the ground-truth masks, which are not explicitly given. This paper proposes a probabilistic model for noisy observations that allows us to build a confident classification and segmentation models. To accomplish it, we explicitly model label noise and introduce a new information-based regularization that pushes the network to recover the ground-truth labels. In addition, for segmentation task we adjust the loss function by prioritizing learning in high-confidence regions where all the annotators agree on labeling. We evaluate the proposed method on a series of classification tasks such as noisy versions of MNIST, CIFAR-10, Fashion-MNIST datasets as well as CIFAR-10N, which is real-world dataset with noisy human annotations. Additionally, for segmentation task, we consider several medical imaging datasets, such as, LIDC and RIGA that reflect real-world inter-variability among multiple annotators. Our experiments show that our algorithm outperforms state-of-the-art solutions for the considered classification and segmentation problems.

CVSep 5, 2022
ScaleFace: Uncertainty-aware Deep Metric Learning

Roman Kail, Kirill Fedyanin, Nikita Muravev et al.

The performance of modern deep learning-based systems dramatically depends on the quality of input objects. For example, face recognition quality would be lower for blurry or corrupted inputs. However, it is hard to predict the influence of input quality on the resulting accuracy in more complex scenarios. We propose an approach for deep metric learning that allows direct estimation of the uncertainty with almost no additional computational cost. The developed \textit{ScaleFace} algorithm uses trainable scale values that modify similarities in the space of embeddings. These input-dependent scale values represent a measure of confidence in the recognition result, thus allowing uncertainty estimation. We provide comprehensive experiments on face recognition tasks that show the superior performance of ScaleFace compared to other uncertainty-aware face recognition approaches. We also extend the results to the task of text-to-image retrieval showing that the proposed approach beats the competitors with significant margin.

CLAug 11, 2024
Reference-free Hallucination Detection for Large Vision-Language Models

Qing Li, Jiahui Geng, Chenyang Lyu et al.

Large vision-language models (LVLMs) have made significant progress in recent years. While LVLMs exhibit excellent ability in language understanding, question answering, and conversations of visual inputs, they are prone to producing hallucinations. While several methods are proposed to evaluate the hallucinations in LVLMs, most are reference-based and depend on external tools, which complicates their practical application. To assess the viability of alternative methods, it is critical to understand whether the reference-free approaches, which do not rely on any external tools, can efficiently detect hallucinations. Therefore, we initiate an exploratory study to demonstrate the effectiveness of different reference-free solutions in detecting hallucinations in LVLMs. In particular, we conduct an extensive study on three kinds of techniques: uncertainty-based, consistency-based, and supervised uncertainty quantification methods on four representative LVLMs across two different tasks. The empirical results show that the reference-free approaches are capable of effectively detecting non-factual responses in LVLMs, with the supervised uncertainty quantification method outperforming the others, achieving the best performance across different settings.

LGJan 13, 2023
Scalable Batch Acquisition for Deep Bayesian Active Learning

Aleksandr Rubashevskii, Daria Kotova, Maxim Panov

In deep active learning, it is especially important to choose multiple examples to markup at each step to work efficiently, especially on large datasets. At the same time, existing solutions to this problem in the Bayesian setup, such as BatchBALD, have significant limitations in selecting a large number of examples, associated with the exponential complexity of computing mutual information for joint random variables. We, therefore, present the Large BatchBALD algorithm, which gives a well-grounded approximation to the BatchBALD method that aims to achieve comparable quality while being more computationally efficient. We provide a complexity analysis of the algorithm, showing a reduction in computation time, especially for large batches. Furthermore, we present an extensive set of experimental results on image and text data, both on toy datasets and larger ones such as CIFAR-100.

MLJul 1, 2024
Probabilistic Conformal Prediction with Approximate Conditional Validity

Vincent Plassier, Alexander Fishkov, Mohsen Guizani et al.

We develop a new method for generating prediction sets that combines the flexibility of conformal methods with an estimate of the conditional distribution $P_{Y \mid X}$. Existing methods, such as conformalized quantile regression and probabilistic conformal prediction, usually provide only a marginal coverage guarantee. In contrast, our approach extends these frameworks to achieve approximately conditional coverage, which is crucial for many practical applications. Our prediction sets adapt to the behavior of the predictive distribution, making them effective even under high heteroscedasticity. While exact conditional guarantees are infeasible without assumptions on the underlying data distribution, we derive non-asymptotic bounds that depend on the total variation distance of the conditional distribution and its estimate. Using extensive simulations, we show that our method consistently outperforms existing approaches in terms of conditional coverage, leading to more reliable statistical inference in a variety of applications.

MLMay 6, 2022
Scalable computation of prediction intervals for neural networks via matrix sketching

Alexander Fishkov, Maxim Panov

Accounting for the uncertainty in the predictions of modern neural networks is a challenging and important task in many domains. Existing algorithms for uncertainty estimation require modifying the model architecture and training procedure (e.g., Bayesian neural networks) or dramatically increase the computational cost of predictions such as approaches based on ensembling. This work proposes a new algorithm that can be applied to a given trained neural network and produces approximate prediction intervals. The method is based on the classical delta method in statistics but achieves computational efficiency by using matrix sketching to approximate the Jacobian matrix. The resulting algorithm is competitive with state-of-the-art approaches for constructing predictive intervals on various regression datasets from the UCI repository.

MLSep 28, 2023
Selective Nonparametric Regression via Testing

Fedor Noskov, Alexander Fishkov, Maxim Panov

Prediction with the possibility of abstention (or selective prediction) is an important problem for error-critical machine learning applications. While well-studied in the classification setup, selective approaches to regression are much less developed. In this work, we consider the nonparametric heteroskedastic regression problem and develop an abstention procedure via testing the hypothesis on the value of the conditional variance at a given point. Unlike existing methods, the proposed one allows to account not only for the value of the variance itself but also for the uncertainty of the corresponding variance predictor. We prove non-asymptotic bounds on the risk of the resulting estimator and show the existence of several different convergence regimes. Theoretical analysis is illustrated with a series of experiments on simulated and real-world data.

MLJul 26, 2023
Optimal Noise Reduction in Dense Mixed-Membership Stochastic Block Models under Diverging Spiked Eigenvalues Condition

Fedor Noskov, Maxim Panov

Community detection is one of the most critical problems in modern network science. Its applications can be found in various fields, from protein modeling to social network analysis. Recently, many papers appeared studying the problem of overlapping community detection, where each node of a network may belong to several communities. In this work, we consider Mixed-Membership Stochastic Block Model (MMSB) first proposed by Airoldi et al. MMSB provides quite a general setting for modeling overlapping community structure in graphs. The central question of this paper is to reconstruct relations between communities given an observed network. We compare different approaches and establish the minimax lower bound on the estimation error. Then, we propose a new estimator that matches this lower bound. Theoretical results are proved under fairly general conditions on the considered model. Finally, we illustrate the theory in a series of experiments.

LGFeb 22
IDLM: Inverse-distilled Diffusion Language Models

David Li, Nikita Gushchin, Dmitry Abulkhanov et al.

Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow inference, limiting practical use. To address this, we extend Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting. Nonetheless, this extension introduces both theoretical and practical challenges. From a theoretical perspective, the inverse distillation objective lacks uniqueness guarantees, which may lead to suboptimal solutions. From a practical standpoint, backpropagation in the discrete space is non-trivial and often unstable. To overcome these challenges, we first provide a theoretical result demonstrating that our inverse formulation admits a unique solution, thereby ensuring valid optimization. We then introduce gradient-stable relaxations to support effective training. As a result, experiments on multiple DLMs show that our method, Inverse-distilled Diffusion Language Models (IDLM), reduces the number of inference steps by 4x-64x, while preserving the teacher model's entropy and generative perplexity.

92.5CLMay 14
Uncertainty Quantification for Large Language Diffusion Models

Artem Vazhentsev, Vladislav Smirnov, David Li et al.

Large Language Diffusion Models (LLDMs) are emerging as an alternative to autoregressive models, offering faster inference through higher parallelism. Similar to autoregressive LLMs, they remain prone to hallucinations, making reliable uncertainty quantification (UQ) crucial for safe deployment. However, existing UQ methods are fundamentally misaligned with this new paradigm: they assume autoregressive factorization or use expensive repeated sampling, negating the efficiency of LLDMs. In this work, we present the first systematic study of UQ for LLDMs and propose lightweight, zero-shot uncertainty signals derived from the iterative denoising process, leveraging intermediate generations, token remasking dynamics, and denoising complexity. We further adapt a state-of-the-art UQ method to LLDMs by combining masked diffusion likelihoods with trajectory-based semantic dissimilarity. We prove that expected trajectory dissimilarity lower bounds the masked diffusion training objective, which motivates its usage as an uncertainty score. Comprehensive experiments across three tasks, eight datasets, and two models show that our method achieves a great cost-performance trade-off: it approaches the strongest sampling-based baselines while incurring up to 100x lower computational overhead. Our work demonstrates that LLDMs can deliver both fast inference and reliable hallucination detection simultaneously.

MLSep 26, 2025Code
Multidimensional Uncertainty Quantification via Optimal Transport

Nikita Kotelevskii, Maiya Goloburda, Vladimir Kondratyev et al.

Most uncertainty quantification (UQ) approaches provide a single scalar value as a measure of model reliability. However, different uncertainty measures could provide complementary information on the prediction confidence. Even measures targeting the same type of uncertainty (e.g., ensemble-based and density-based measures of epistemic uncertainty) may capture different failure modes. We take a multidimensional view on UQ by stacking complementary UQ measures into a vector. Such vectors are assigned with Monge-Kantorovich ranks produced by an optimal-transport-based ordering method. The prediction is then deemed more uncertain than the other if it has a higher rank. The resulting VecUQ-OT algorithm uses entropy-regularized optimal transport. The transport map is learned on vectors of scores from in-distribution data and, by design, applies to unseen inputs, including out-of-distribution cases, without retraining. Our framework supports flexible non-additive uncertainty fusion (including aleatoric and epistemic components). It yields a robust ordering for downstream tasks such as selective prediction, misclassification detection, out-of-distribution detection, and selective generation. Across synthetic, image, and text data, VecUQ-OT shows high efficiency even when individual measures fail. The code for the method is available at: https://github.com/stat-ml/multidimensional_uncertainty.

CLJun 21, 2024Code
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph

Roman Vashurin, Ekaterina Fadeeva, Artem Vazhentsev et al.

The rapid proliferation of large language models (LLMs) has stimulated researchers to seek effective and efficient approaches to deal with LLM hallucinations and low-quality outputs. Uncertainty quantification (UQ) is a key element of machine learning applications in dealing with such challenges. However, research to date on UQ for LLMs has been fragmented in terms of techniques and evaluation methodologies. In this work, we address this issue by introducing a novel benchmark that implements a collection of state-of-the-art UQ baselines and offers an environment for controllable and consistent evaluation of novel UQ techniques over various text generation tasks. Our benchmark also supports the assessment of confidence normalization methods in terms of their ability to provide interpretable scores. Using our benchmark, we conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches. Code: https://github.com/IINemo/lm-polygraph Benchmark: https://huggingface.co/LM-Polygraph

CLMar 7, 2024
Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification

Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov et al.

Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factually correct, making it extremely hard for the users to spot them. Current services that leverage LLMs usually do not provide any means for detecting unreliable generations. Here, we aim to bridge this gap. In particular, we propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification. Uncertainty scores leverage information encapsulated in the output of a neural network or its layers to detect unreliable predictions, and we show that they can be used to fact-check the atomic claims in the LLM output. Moreover, we present a novel token-level uncertainty quantification method that removes the impact of uncertainty about what claim to generate on the current step and what surface form to use. Our method Claim Conditioned Probability (CCP) measures only the uncertainty of a particular claim value expressed by the model. Experiments on the task of biography generation demonstrate strong improvements for CCP compared to the baselines for seven LLMs and four languages. Human evaluation reveals that the fact-checking pipeline based on uncertainty quantification is competitive with a fact-checking tool that leverages external knowledge.

74.8AIMay 1
Position: agentic AI orchestration should be Bayes-consistent

Theodore Papamarkou, Pierre Alquier, Matthias Bauer et al.

LLMs excel at predictive tasks and complex reasoning tasks, but many high-value deployments rely on decisions under uncertainty, for example, which tool to call, which expert to consult, or how many resources to invest. While the usefulness and feasibility of Bayesian approaches remain unclear for LLM inference, this position paper argues that the control layer of an agentic AI system (that orchestrates LLMs and tools) is a clear case where Bayesian principles should shine. Bayesian decision theory provides a framework for agentic systems that can help to maintain beliefs over task-relevant latent quantities, to update these beliefs from observed agentic and human-AI interactions, and to choose actions. Making LLMs themselves explicitly Bayesian belief-updating engines remains computationally intensive and conceptually nontrivial as a general modeling target. In contrast, this paper argues that coherent decision-making requires Bayesian principles at the orchestration level of the agentic system, not necessarily the LLM agent parameters. This paper articulates practical properties for Bayesian control that fit modern agentic AI systems and human-AI collaboration, and provides concrete examples and design patterns to illustrate how calibrated beliefs and utility-aware policies can improve agentic AI orchestration.

MLFeb 16, 2024
From Risk to Uncertainty: Generating Predictive Uncertainty Measures via Bayesian Estimation

Nikita Kotelevskii, Vladimir Kondratyev, Martin Takáč et al.

There are various measures of predictive uncertainty in the literature, but their relationships to each other remain unclear. This paper uses a decomposition of statistical pointwise risk into components, associated with different sources of predictive uncertainty, namely aleatoric uncertainty (inherent data variability) and epistemic uncertainty (model-related uncertainty). Together with Bayesian methods, applied as an approximation, we build a framework that allows one to generate different predictive uncertainty measures. We validate our method on image datasets by evaluating its performance in detecting out-of-distribution and misclassified instances using the AUROC metric. The experimental results confirm that the measures derived from our framework are useful for the considered downstream tasks.

MLDec 18, 2023
Dirichlet-based Uncertainty Quantification for Personalized Federated Learning with Improved Posterior Networks

Nikita Kotelevskii, Samuel Horváth, Karthik Nandakumar et al.

In modern federated learning, one of the main challenges is to account for inherent heterogeneity and the diverse nature of data distributions for different clients. This problem is often addressed by introducing personalization of the models towards the data distribution of the particular client. However, a personalized model might be unreliable when applied to the data that is not typical for this client. Eventually, it may perform worse for these data than the non-personalized global model trained in a federated way on the data from all the clients. This paper presents a new approach to federated learning that allows selecting a model from global and personalized ones that would perform better for a particular input point. It is achieved through a careful modeling of predictive uncertainties that helps to detect local and global in- and out-of-distribution data and use this information to select the model that is confident in a prediction. The comprehensive experimental evaluation on the popular real-world image datasets shows the superior performance of the model in the presence of out-of-distribution data while performing on par with state-of-the-art personalized federated learning algorithms in the standard scenarios.

CLFeb 7, 2025
Uncertainty Quantification for LLMs through Minimum Bayes Risk: Bridging Confidence and Consistency

Roman Vashurin, Maiya Goloburda, Albina Ilina et al.

Uncertainty quantification (UQ) methods for Large Language Models (LLMs) encompass a variety of approaches, with two major types being particularly prominent: information-based, which focus on model confidence expressed as token probabilities, and consistency-based, which assess the semantic relationship between multiple outputs generated using repeated sampling. Several recent methods have combined these two approaches to boost UQ performance. However, they sometimes fail to outperform much simpler baseline methods. Our work discusses the fundamental approach to constructing uncertainty measures that directly links uncertainty with the minimum Bayes risks achieved by LLM decoding. Building on these findings, we propose a novel approach to integrating model confidence with output consistency, resulting in a family of efficient and robust UQ methods. Our investigation reveals distinctive characteristics of LLMs as probabilistic models, which help to explain why these UQ methods underperform in certain tasks. Based on these findings, we propose a new way of synthesizing model confidence and output consistency, leading to a family of efficient and robust UQ methods. We evaluate our approach across various tasks such as question answering, abstractive summarization, and machine translation, demonstrating sizable improvements over state-of-the-art UQ approaches.

CLFeb 20, 2025
Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models

Artem Vazhentsev, Lyudmila Rvanova, Ivan Lazichny et al.

Uncertainty quantification (UQ) is a prominent approach for eliciting truthful answers from large language models (LLMs). To date, information-based and consistency-based UQ have been the dominant UQ methods for text generation via LLMs. Density-based methods, despite being very effective for UQ in text classification with encoder-based models, have not been very successful with generative LLMs. In this work, we adapt Mahalanobis Distance (MD) - a well-established UQ technique in classification tasks - for text generation and introduce a new supervised UQ method. Our method extracts token embeddings from multiple layers of LLMs, computes MD scores for each token, and uses linear regression trained on these features to provide robust uncertainty scores. Through extensive experiments on eleven datasets, we demonstrate that our approach substantially improves over existing UQ methods, providing accurate and computationally efficient uncertainty scores for both sequence-level selective generation and claim-level fact-checking tasks. Our method also exhibits strong generalization to out-of-domain data, making it suitable for a wide range of LLM-based applications.

MLFeb 22, 2025
Rectifying Conformity Scores for Better Conditional Coverage

Vincent Plassier, Alexander Fishkov, Victor Dheur et al.

We present a new method for generating confidence sets within the split conformal prediction framework. Our method performs a trainable transformation of any given conformity score to improve conditional coverage while ensuring exact marginal coverage. The transformation is based on an estimate of the conditional quantile of conformity scores. The resulting method is particularly beneficial for constructing adaptive confidence sets in multi-output problems where standard conformal quantile regression approaches have limited applicability. We develop a theoretical bound that captures the influence of the accuracy of the quantile estimate on the approximate conditional validity, unlike classical bounds for conformal prediction methods that only offer marginal coverage. We experimentally show that our method is highly adaptive to the local data structure and outperforms existing methods in terms of conditional coverage, improving the reliability of statistical inference in various applications.

CLMay 27, 2025
Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation

Ekaterina Fadeeva, Aleksandr Rubashevskii, Roman Vashurin et al.

Large Language Models (LLMs) enhanced with external knowledge retrieval, an approach known as Retrieval-Augmented Generation (RAG), have shown strong performance in open-domain question answering. However, RAG systems remain susceptible to hallucinations: factually incorrect outputs that may arise either from inconsistencies in the model's internal knowledge or incorrect use of the retrieved context. Existing approaches often conflate factuality with faithfulness to the retrieved context, misclassifying factually correct statements as hallucinations if they are not directly supported by the retrieval. In this paper, we introduce FRANQ (Faithfulness-based Retrieval Augmented UNcertainty Quantification), a novel method for hallucination detection in RAG outputs. FRANQ applies different Uncertainty Quantification (UQ) techniques to estimate factuality based on whether a statement is faithful to the retrieved context or not. To evaluate FRANQ and other UQ techniques for RAG, we present a new long-form Question Answering (QA) dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging examples. Extensive experiments on long- and short-form QA across multiple datasets and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing methods.

MLDec 25, 2023
Efficient Conformal Prediction under Data Heterogeneity

Vincent Plassier, Nikita Kotelevskii, Aleksandr Rubashevskii et al.

Conformal Prediction (CP) stands out as a robust framework for uncertainty quantification, which is crucial for ensuring the reliability of predictions. However, common CP methods heavily rely on data exchangeability, a condition often violated in practice. Existing approaches for tackling non-exchangeability lead to methods that are not computable beyond the simplest examples. This work introduces a new efficient approach to CP that produces provably valid confidence sets for fairly general non-exchangeable data distributions. We illustrate the general theory with applications to the challenging setting of federated learning under data heterogeneity between agents. Our method allows constructing provably valid personalized prediction sets for agents in a fully federated way. The effectiveness of the proposed method is demonstrated in a series of experiments on real-world datasets.

82.0CLApr 8
ReDAct: Uncertainty-Aware Deferral for LLM Agents

Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov et al.

Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems. However, they inherit the tendency of LLMs to hallucinate, leading to incorrect decisions. In sequential settings, even a single mistake can irreversibly degrade the trajectory, making hallucinations an even bigger problem. Although larger LLMs hallucinate less, they incur a significantly higher per-token cost. In this paper, we address this tradeoff by proposing ReDAct (Reason-Defer-Act). In ReDAct, an agent is equipped with two LLMs: a small, cheap model used by default, and a large, more reliable but expensive model. When the predictive uncertainty of the small model exceeds a calibrated threshold, the decision is deferred to the large model. We evaluate our approach in text-based embodied environments such as ALFWorld and MiniGrid and show that deferring only about 15% of decisions to the large model can match the quality of using it exclusively, while significantly reducing inference costs.

LGSep 30, 2025
Uncertainty Quantification for Regression using Proper Scoring Rules

Alexander Fishkov, Kajetan Schweighofer, Mykyta Ielanskyi et al.

Quantifying uncertainty of machine learning model predictions is essential for reliable decision-making, especially in safety-critical applications. Recently, uncertainty quantification (UQ) theory has advanced significantly, building on a firm basis of learning with proper scoring rules. However, these advances were focused on classification, while extending these ideas to regression remains challenging. In this work, we introduce a unified UQ framework for regression based on proper scoring rules, such as CRPS, logarithmic, squared error, and quadratic scores. We derive closed-form expressions for the resulting uncertainty measures under practical parametric assumptions and show how to estimate them using ensembles of models. In particular, the derived uncertainty measures naturally decompose into aleatoric and epistemic components. The framework recovers popular regression UQ measures based on predictive variance and differential entropy. Our broad evaluation on synthetic and real-world regression datasets provides guidance for selecting reliable UQ measures.

MLMay 21, 2025
Adaptive Set-Mass Calibration with Conformal Prediction

Daniil Kazantsev, Mohsen Guizani, Eric Moulines et al.

Reliable probabilities are critical in high-risk applications, yet common calibration criteria (confidence, class-wise) are only necessary for full distributional calibration, and post-hoc methods often lack distribution-free guarantees. We propose a set-based notion of calibration, cumulative mass calibration, and a corresponding empirical error measure: the Cumulative Mass Calibration Error (CMCE). We develop a new calibration procedure that starts with conformal prediction to obtain a set of labels that gives the desired coverage. We then instantiate two simple post-hoc calibrators: a mass normalization and a temperature scaling-based rule, tuned to the conformal constraint. On multi-class image benchmarks, especially with a large number of classes, our methods consistently improve CMCE and standard metrics (ECE, cw-ECE, MCE) over baselines, delivering a practical, scalable framework with theoretical guarantees.

CLFeb 25, 2025
Uncertainty-aware abstention in medical diagnosis based on medical texts

Artem Vazhentsev, Ivan Sviridov, Alvard Barseghyan et al.

This study addresses the critical issue of reliability for AI-assisted medical diagnosis. We focus on the selection prediction approach that allows the diagnosis system to abstain from providing the decision if it is not confident in the diagnosis. Such selective prediction (or abstention) approaches are usually based on the modeling predictive uncertainty of machine learning models involved. This study explores uncertainty quantification in machine learning models for medical text analysis, addressing diverse tasks across multiple datasets. We focus on binary mortality prediction from textual data in MIMIC-III, multi-label medical code prediction using ICD-10 codes from MIMIC-IV, and multi-class classification with a private outpatient visits dataset. Additionally, we analyze mental health datasets targeting depression and anxiety detection, utilizing various text-based sources, such as essays, social media posts, and clinical descriptions. In addition to comparing uncertainty methods, we introduce HUQ-2, a new state-of-the-art method for enhancing reliability in selective prediction tasks. Our results provide a detailed comparison of uncertainty quantification methods. They demonstrate the effectiveness of HUQ-2 in capturing and evaluating uncertainty, paving the way for more reliable and interpretable applications in medical text analysis.

LGMar 18, 2024
Generalization error of spectral algorithms

Maksim Velikanov, Maxim Panov, Dmitry Yarotsky

The asymptotically precise estimation of the generalization of kernel methods has recently received attention due to the parallels between neural networks and their associated kernels. However, prior works derive such estimates for training by kernel ridge regression (KRR), whereas neural networks are typically trained with gradient descent (GD). In the present work, we consider the training of kernels with a family of $\textit{spectral algorithms}$ specified by profile $h(λ)$, and including KRR and GD as special cases. Then, we derive the generalization error as a functional of learning profile $h(λ)$ for two data models: high-dimensional Gaussian and low-dimensional translation-invariant model. Under power-law assumptions on the spectrum of the kernel and target, we use our framework to (i) give full loss asymptotics for both noisy and noiseless observations (ii) show that the loss localizes on certain spectral scales, giving a new perspective on the KRR saturation phenomenon (iii) conjecture, and demonstrate for the considered data models, the universality of the loss w.r.t. non-spectral details of the problem, but only in case of noisy observation.

MLSep 29, 2025
Neural Optimal Transport Meets Multivariate Conformal Prediction

Vladimir Kondratyev, Alexander Fishkov, Nikita Kotelevskii et al.

We propose a framework for conditional vector quantile regression (CVQR) that combines neural optimal transport with amortized optimization, and apply it to multivariate conformal prediction. Classical quantile regression does not extend naturally to multivariate responses, while existing approaches often ignore the geometry of joint distributions. Our method parametrizes the conditional vector quantile function as the gradient of a convex potential implemented by an input-convex neural network, ensuring monotonicity and uniform ranks. To reduce the cost of solving high-dimensional variational problems, we introduced amortized optimization of the dual potentials, yielding efficient training and faster inference. We then exploit the induced multivariate ranks for conformal prediction, constructing distribution-free predictive regions with finite-sample validity. Unlike coordinatewise methods, our approach adapts to the geometry of the conditional distribution, producing tighter and more informative regions. Experiments on benchmark datasets show improved coverage-efficiency trade-offs compared to baselines, highlighting the benefits of integrating neural optimal transport with conformal prediction.

LGSep 18, 2025
Who to Trust? Aggregating Client Knowledge in Logit-Based Federated Learning

Viktor Kovalchuk, Nikita Kotelevskii, Maxim Panov et al.

Federated learning (FL) usually shares model weights or gradients, which is costly for large models. Logit-based FL reduces this cost by sharing only logits computed on a public proxy dataset. However, aggregating information from heterogeneous clients is still challenging. This paper studies this problem, introduces and compares three logit aggregation methods: simple averaging, uncertainty-weighted averaging, and a learned meta-aggregator. Evaluated on MNIST and CIFAR-10, these methods reduce communication overhead, improve robustness under non-IID data, and achieve accuracy competitive with centralized training.

CLMay 25, 2025
UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models

Roman Vashurin, Maiya Goloburda, Preslav Nakov et al.

Large Language Models (LLMs) have become indispensable tools across various applications, making it more important than ever to ensure the quality and the trustworthiness of their outputs. This has led to growing interest in uncertainty quantification (UQ) methods for assessing the reliability of LLM outputs. Many existing UQ techniques rely on token probabilities, which inadvertently introduces a bias with respect to the length of the output. While some methods attempt to account for this, we demonstrate that such biases persist even in length-normalized approaches. To address the problem, here we propose UNCERTAINTY-LINE: (Length-INvariant Estimation), a simple debiasing procedure that regresses uncertainty scores on output length and uses the residuals as corrected, length-invariant estimates. Our method is post-hoc, model-agnostic, and applicable to a range of UQ measures. Through extensive evaluation on machine translation, summarization, and question-answering tasks, we demonstrate that UNCERTAINTY-LINE: consistently improves over even nominally length-normalized UQ methods uncertainty estimates across multiple metrics and models.

CVFeb 12, 2025
Screener: Self-supervised Pathology Segmentation in Medical CT Images

Mikhail Goncharov, Eugenia Soboleva, Mariia Donskova et al.

Accurate detection of all pathological findings in 3D medical images remains a significant challenge, as supervised models are limited to detecting only the few pathology classes annotated in existing datasets. To address this, we frame pathology detection as an unsupervised visual anomaly segmentation (UVAS) problem, leveraging the inherent rarity of pathological patterns compared to healthy ones. We enhance the existing density-based UVAS framework with two key innovations: (1) dense self-supervised learning for feature extraction, eliminating the need for supervised pretraining, and (2) learned, masking-invariant dense features as conditioning variables, replacing hand-crafted positional encodings. Trained on over 30,000 unlabeled 3D CT volumes, our fully self-supervised model, Screener, outperforms existing UVAS methods on four large-scale test datasets comprising 1,820 scans with diverse pathologies. Furthermore, in a supervised fine-tuning setting, Screener surpasses existing self-supervised pretraining methods, establishing it as a state-of-the-art foundation for pathology segmentation. The code and pretrained models will be made publicly available.

MLFeb 24, 2022
Embedded Ensembles: Infinite Width Limit and Operating Regimes

Maksim Velikanov, Roman Kail, Ivan Anokhin et al.

A memory efficient approach to ensembling neural networks is to share most weights among the ensembled models by means of a single reference network. We refer to this strategy as Embedded Ensembling (EE); its particular examples are BatchEnsembles and Monte-Carlo dropout ensembles. In this paper we perform a systematic theoretical and empirical analysis of embedded ensembles with different number of models. Theoretically, we use a Neural-Tangent-Kernel-based approach to derive the wide network limit of the gradient descent dynamics. In this limit, we identify two ensemble regimes - independent and collective - depending on the architecture and initialization strategy of ensemble models. We prove that in the independent regime the embedded ensemble behaves as an ensemble of independent models. We confirm our theoretical prediction with a wide range of experiments with finite networks, and further study empirically various effects such as transition between the two regimes, scaling of ensemble performance with the network width and number of models, and dependence of performance on a number of architecture and hyperparameter choices.

MLFeb 7, 2022
Nonparametric Uncertainty Quantification for Single Deterministic Neural Network

Nikita Kotelevskii, Aleksandr Artemenkov, Kirill Fedyanin et al.

This paper proposes a fast and scalable method for uncertainty quantification of machine learning models' predictions. First, we show the principled way to measure the uncertainty of predictions for a classifier based on Nadaraya-Watson's nonparametric estimate of the conditional label distribution. Importantly, the proposed approach allows to disentangle explicitly aleatoric and epistemic uncertainties. The resulting method works directly in the feature space. However, one can apply it to any neural network by considering an embedding of the data induced by the network. We demonstrate the strong performance of the method in uncertainty estimation tasks on text classification problems and a variety of real-world image datasets, such as MNIST, SVHN, CIFAR-100 and several versions of ImageNet.

MLJun 30, 2021
Monte Carlo Variational Auto-Encoders

Achille Thin, Nikita Kotelevskii, Arnaud Doucet et al.

Variational auto-encoders (VAE) are popular deep latent variable models which are trained by maximizing an Evidence Lower Bound (ELBO). To obtain tighter ELBO and hence better variational approximations, it has been proposed to use importance sampling to get a lower variance estimate of the evidence. However, importance sampling is known to perform poorly in high dimensions. While it has been suggested many times in the literature to use more sophisticated algorithms such as Annealed Importance Sampling (AIS) and its Sequential Importance Sampling (SIS) extensions, the potential benefits brought by these advanced techniques have never been realized for VAE: the AIS estimate cannot be easily differentiated, while SIS requires the specification of carefully chosen backward Markov kernels. In this paper, we address both issues and demonstrate the performance of the resulting Monte Carlo VAEs on a variety of applications.

CODec 31, 2020
Nonreversible MCMC from conditional invertible transforms: a complete recipe with convergence guarantees

Achille Thin, Nikita Kotelevskii, Christophe Andrieu et al.

Markov Chain Monte Carlo (MCMC) is a class of algorithms to sample complex and high-dimensional probability distributions. The Metropolis-Hastings (MH) algorithm, the workhorse of MCMC, provides a simple recipe to construct reversible Markov kernels. Reversibility is a tractable property that implies a less tractable but essential property here, invariance. Reversibility is however not necessarily desirable when considering performance. This has prompted recent interest in designing kernels breaking this property. At the same time, an active stream of research has focused on the design of novel versions of the MH kernel, some nonreversible, relying on the use of complex invertible deterministic transforms. While standard implementations of the MH kernel are well understood, the aforementioned developments have not received the same systematic treatment to ensure their validity. This paper fills the gap by developing general tools to ensure that a class of nonreversible Markov kernels, possibly relying on complex transforms, has the desired invariance property and leads to convergent algorithms. This leads to a set of simple and practically verifiable conditions.

MLSep 30, 2020
EWS-GCN: Edge Weight-Shared Graph Convolutional Network for Transactional Banking Data

Ivan Sukharev, Valentina Shumovskaia, Kirill Fedyanin et al.

In this paper, we discuss how modern deep learning approaches can be applied to the credit scoring of bank clients. We show that information about connections between clients based on money transfers between them allows us to significantly improve the quality of credit scoring compared to the approaches using information about the target client solely. As a final solution, we develop a new graph neural network model EWS-GCN that combines ideas of graph convolutional and recurrent neural networks via attention mechanism. The resulting model allows for robust training and efficient processing of large-scale data. We also demonstrate that our model outperforms the state-of-the-art graph neural networks achieving excellent results

LGMar 6, 2020
Dropout Strikes Back: Improved Uncertainty Estimation via Diversity Sampling

Kirill Fedyanin, Evgenii Tsymbalov, Maxim Panov

Uncertainty estimation for machine learning models is of high importance in many scenarios such as constructing the confidence intervals for model predictions and detection of out-of-distribution or adversarially generated points. In this work, we show that modifying the sampling distributions for dropout layers in neural networks improves the quality of uncertainty estimation. Our main idea consists of two main steps: computing data-driven correlations between neurons and generating samples, which include maximally diverse neurons. In a series of experiments on simulated and real-world data, we demonstrate that the diversification via determinantal point processes-based sampling achieves state-of-the-art results in uncertainty estimation for regression and classification tasks. An important feature of our approach is that it does not require any modification to the models or training procedures, allowing straightforward application to any deep learning model with dropout layers.

MLFeb 27, 2020
MetFlow: A New Efficient Method for Bridging the Gap between Markov Chain Monte Carlo and Variational Inference

Achille Thin, Nikita Kotelevskii, Jean-Stanislas Denain et al.

In this contribution, we propose a new computationally efficient method to combine Variational Inference (VI) with Markov Chain Monte Carlo (MCMC). This approach can be used with generic MCMC kernels, but is especially well suited to \textit{MetFlow}, a novel family of MCMC algorithms we introduce, in which proposals are obtained using Normalizing Flows. The marginal distribution produced by such MCMC algorithms is a mixture of flow-based distributions, thus drastically increasing the expressivity of the variational family. Unlike previous methods following this direction, our approach is amenable to the reparametrization trick and does not rely on computationally expensive reverse kernels. Extensive numerical experiments show clear computational and performance improvements over state-of-the-art methods.

MLJan 30, 2020
NCVis: Noise Contrastive Approach for Scalable Visualization

Aleksandr Artemenkov, Maxim Panov

Modern methods for data visualization via dimensionality reduction, such as t-SNE, usually have performance issues that prohibit their application to large amounts of high-dimensional data. In this work, we propose NCVis -- a high-performance dimensionality reduction method built on a sound statistical basis of noise contrastive estimation. We show that NCVis outperforms state-of-the-art techniques in terms of speed while preserving the representation quality of other methods. In particular, the proposed approach successfully proceeds a large dataset of more than 1 million news headlines in several minutes and presents the underlying structure in a human-readable way. Moreover, it provides results consistent with classical methods like t-SNE on more straightforward datasets like images of hand-written digits. We believe that the broader usage of such software can significantly simplify the large-scale data analysis and lower the entry barrier to this area.

MLJan 23, 2020
Linking Bank Clients using Graph Neural Networks Powered by Rich Transactional Data

Valentina Shumovskaia, Kirill Fedyanin, Ivan Sukharev et al.

Financial institutions obtain enormous amounts of data about user transactions and money transfers, which can be considered as a large graph dynamically changing in time. In this work, we focus on the task of predicting new interactions in the network of bank clients and treat it as a link prediction problem. We propose a new graph neural network model, which uses not only the topological structure of the network but rich time-series data available for the graph nodes and edges. We evaluate the developed method using the data provided by a large European bank for several years. The proposed model outperforms the existing approaches, including other neural network models, with a significant gap in ROC AUC score on link prediction problem and also allows to improve the quality of credit scoring.

MLApr 12, 2019
Geometry-Aware Maximum Likelihood Estimation of Intrinsic Dimension

Marina Gomtsyan, Nikita Mokrov, Maxim Panov et al.

The existing approaches to intrinsic dimension estimation usually are not reliable when the data are nonlinearly embedded in the high dimensional space. In this work, we show that the explicit accounting to geometric properties of unknown support leads to the polynomial correction to the standard maximum likelihood estimate of intrinsic dimension for flat manifolds. The proposed algorithm (GeoMLE) realizes the correction by regression of standard MLEs based on distances to nearest neighbors for different sizes of neighborhoods. Moreover, the proposed approach also efficiently handles the case of nonuniform sampling of the manifold. We perform numerous experiments on different synthetic and real-world datasets. The results show that our algorithm achieves state-of-the-art performance, while also being computationally efficient and robust to noise in the data.

LGFeb 27, 2019
Deeper Connections between Neural Networks and Gaussian Processes Speed-up Active Learning

Evgenii Tsymbalov, Sergei Makarychev, Alexander Shapeev et al.

Active learning methods for neural networks are usually based on greedy criteria which ultimately give a single new design point for the evaluation. Such an approach requires either some heuristics to sample a batch of design points at one active learning iteration, or retraining the neural network after adding each data point, which is computationally inefficient. Moreover, uncertainty estimates for neural networks sometimes are overconfident for the points lying far from the training sample. In this work we propose to approximate Bayesian neural networks (BNN) by Gaussian processes, which allows us to update the uncertainty estimates of predictions efficiently without retraining the neural network, while avoiding overconfident uncertainty prediction for out-of-sample points. In a series of experiments on real-world data including large-scale problems of chemical and physical modeling, we show superiority of the proposed approach over the state-of-the-art methods.

MLOct 6, 2018
Constructing Graph Node Embeddings via Discrimination of Similarity Distributions

Stanislav Tsepa, Maxim Panov

The problem of unsupervised learning node embeddings in graphs is one of the important directions in modern network science. In this work we propose a novel framework, which is aimed to find embeddings by \textit{discriminating distributions of similarities (DDoS)} between nodes in the graph. The general idea is implemented by maximizing the \textit{earth mover distance} between distributions of decoded similarities of similar and dissimilar nodes. The resulting algorithm generates embeddings which give a state-of-the-art performance in the problem of link prediction in real-world graphs.