LGOct 15, 2023
Theoretical Evaluation of Asymmetric Shapley Values for Root-Cause AnalysisDomokos M. Kelen, Mihály Petreczky, Péter Kersch et al.
In this work, we examine Asymmetric Shapley Values (ASV), a variant of the popular SHAP additive local explanation method. ASV proposes a way to improve model explanations incorporating known causal relations between variables, and is also considered as a way to test for unfair discrimination in model predictions. Unexplored in previous literature, relaxing symmetry in Shapley values can have counter-intuitive consequences for model explanation. To better understand the method, we first show how local contributions correspond to global contributions of variance reduction. Using variance, we demonstrate multiple cases where ASV yields counter-intuitive attributions, arguably producing incorrect results for root-cause analysis. Second, we identify generalized additive models (GAM) as a restricted class for which ASV exhibits desirable properties. We support our arguments by proving multiple theoretical results about the method. Finally, we demonstrate the use of asymmetric attributions on multiple real-world datasets, comparing the results with and without restricted model families using gradient boosting and deep learning models.
MLFeb 13
Nonparametric Distribution Regression Re-calibrationÁdám Jung, Domokos M. Kelen, András A. Benczúr
A key challenge in probabilistic regression is ensuring that predictive distributions accurately reflect true empirical uncertainty. Minimizing overall prediction error often encourages models to prioritize informativeness over calibration, producing narrow but overconfident predictions. However, in safety-critical settings, trustworthy uncertainty estimates are often more valuable than narrow intervals. Realizing the problem, several recent works have focused on post-hoc corrections; however, existing methods either rely on weak notions of calibration (such as PIT uniformity) or impose restrictive parametric assumptions on the nature of the error. To address these limitations, we propose a novel nonparametric re-calibration algorithm based on conditional kernel mean embeddings, capable of correcting calibration error without restrictive modeling assumptions. For efficient inference with real-valued targets, we introduce a novel characteristic kernel over distributions that can be evaluated in $\mathcal{O}(n \log n)$ time for empirical distributions of size $n$. We demonstrate that our method consistently outperforms prior re-calibration approaches across a diverse set of regression benchmarks and model classes.
GLFeb 24, 2025
Mesterséges Intelligencia Kutatások MagyarországonAndrás A. Benczúr, Tibor Gyimóthy, Balázs Szegedy
Artificial intelligence (AI) has undergone remarkable development since the mid-2000s, particularly in the fields of machine learning and deep learning, driven by the explosive growth of large databases and computational capacity. Hungarian researchers recognized the significance of AI early on, actively participating in international research and achieving significant results in both theoretical and practical domains. This article presents some key achievements in Hungarian AI research. It highlights the results from the period before the rise of deep learning (the early 2010s), then discusses major theoretical advancements in Hungary after 2010. Finally, it provides a brief overview of AI-related applied scientific achievements from 2010 onward.
SIOct 20, 2021
Vaccine skepticism detection by network embeddingFerenc Béres, Rita Csoma, Tamás Vilmos Michaletzky et al.
We demonstrate the applicability of network embedding to vaccine skepticism, a controversial topic of long-past history. With the Covid-19 pandemic outbreak at the end of 2019, the topic is more important than ever. Only a year after the first international cases were registered, multiple vaccines were developed and passed clinical testing. Besides the challenges of development, testing, and logistics, another factor that might play a significant role in the fight against the pandemic are people who are hesitant to get vaccinated, or even state that they will refuse any vaccine offered to them. Two groups of people commonly referred to as a) pro-vaxxer, those who support vaccinating people b) vax-skeptic, those who question vaccine efficacy or the need for general vaccination against Covid-19. It is very difficult to tell exactly how many people share each of these views. It is even more difficult to understand all the reasoning why vax-skeptic opinions are getting more popular. In this work, our intention was to develop techniques that are able to efficiently differentiate between pro-vaxxer and vax-skeptic content. After multiple data preprocessing steps, we analyzed the tweet text as well as the structure of user interactions on Twitter. We deployed several node embedding and community detection models that scale well for graphs with millions of edges.
CRMay 28, 2020
Blockchain is Watching You: Profiling and Deanonymizing Ethereum UsersFerenc Béres, István András Seres, András A. Benczúr et al.
Ethereum is the largest public blockchain by usage. It applies an account-based model, which is inferior to Bitcoin's unspent transaction output model from a privacy perspective. Due to its privacy shortcomings, recently several privacy-enhancing overlays have been deployed on Ethereum, such as non-custodial, trustless coin mixers and confidential transactions. In our privacy analysis of Ethereum's account-based model, we describe several patterns that characterize only a limited set of users and successfully apply these quasi-identifiers in address deanonymization tasks. Using Ethereum Name Service identifiers as ground truth information, we quantitatively compare algorithms in recent branch of machine learning, the so-called graph representation learning, as well as time-of-day activity and transaction fee based user profiling techniques. As an application, we rigorously assess the privacy guarantees of the Tornado Cash coin mixer by discovering strong heuristics to link the mixing parties. To the best of our knowledge, we are the first to propose and implement Ethereum user profiling techniques based on quasi-identifiers. Finally, we describe a malicious value-fingerprinting attack, a variant of the Danaan-gift attack, applicable for the confidential transaction overlays on Ethereum. By incorporating user activity statistics from our data set, we estimate the success probability of such an attack.
DCFeb 16, 2018
Online Machine Learning in Big Data StreamsAndrás A. Benczúr, Levente Kocsis, Róbert Pálovics
The area of online machine learning in big data streams covers algorithms that are (1) distributed and (2) work from data streams with only a limited possibility to store past data. The first requirement mostly concerns software architectures and efficient algorithms. The second one also imposes nontrivial theoretical restrictions on the modeling methods: In the data stream model, older data is no longer available to revise earlier suboptimal modeling decisions as the fresh data arrives. In this article, we provide an overview of distributed software architectures and libraries as well as machine learning models for online learning. We highlight the most important ideas for classification, regression, recommendation, and unsupervised modeling from streaming data, and we show how they are implemented in various distributed data stream processing systems. This article is a reference material and not a survey. We do not attempt to be comprehensive in describing all existing methods and solutions; rather, we give pointers to the most important resources in the field. All related sub-fields, online algorithms, online learning, and distributed data processing are hugely dominant in current research and development with conceptually new research results and software components emerging at the time of writing. In this article, we refer to several survey results, both for distributed data processing and for online machine learning. Compared to past surveys, our article is different because we discuss recommender systems in extended detail.