CVJun 10, 2022
Explaining Image Classifiers Using Contrastive Counterfactuals in Generative Latent SpacesKamran Alipour, Aditya Lahiri, Ehsan Adeli et al.
Despite their high accuracies, modern complex image classifiers cannot be trusted for sensitive tasks due to their unknown decision-making process and potential biases. Counterfactual explanations are very effective in providing transparency for these black-box algorithms. Nevertheless, generating counterfactuals that can have a consistent impact on classifier outputs and yet expose interpretable feature changes is a very challenging task. We introduce a novel method to generate causal and yet interpretable counterfactual explanations for image classifiers using pretrained generative models without any re-training or conditioning. The generative models in this technique are not bound to be trained on the same data as the target classifier. We use this framework to obtain contrastive and causal sufficiency and necessity scores as global explanations for black-box classifiers. On the task of face attribute classification, we show how different attributes influence the classifier output by providing both causal and contrastive feature attributions, and the corresponding counterfactual images.
LGDec 21, 2022
Consistent Range Approximation for Fair Predictive ModelingJiongli Zhu, Sainyam Galhotra, Nazanin Sabri et al.
This paper proposes a novel framework for certifying the fairness of predictive models trained on biased data. It draws from query answering for incomplete and inconsistent databases to formulate the problem of consistent range approximation (CRA) of fairness queries for a predictive model on a target population. The framework employs background knowledge of the data collection process and biased data, working with or without limited statistics about the target population, to compute a range of answers for fairness queries. Using CRA, the framework builds predictive models that are certifiably fair on the target population, regardless of the availability of external data during training. The framework's efficacy is demonstrated through evaluations on real data, showing substantial improvement over existing state-of-the-art methods.
79.4CLMay 20
Residual Skill Optimization for Text-to-SQL EnsemblesJiongli Zhu, Haoquan Guan, Parjanya Prajakta Prashant et al.
Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to +11.1 points on Snowflake and +8.3 on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining across dialects (Snowflake, BigQuery, SQLite) and to a different task formulation, such as BIRD-Critic (+2.6 pts). Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.
LGJun 14, 2022
Combining Counterfactuals With Shapley Values To Explain Image ModelsAditya Lahiri, Kamran Alipour, Ehsan Adeli et al.
With the widespread use of sophisticated machine learning models in sensitive applications, understanding their decision-making has become an essential task. Models trained on tabular data have witnessed significant progress in explanations of their underlying decision making processes by virtue of having a small number of discrete features. However, applying these methods to high-dimensional inputs such as images is not a trivial task. Images are composed of pixels at an atomic level and do not carry any interpretability by themselves. In this work, we seek to use annotated high-level interpretable features of images to provide explanations. We leverage the Shapley value framework from Game Theory, which has garnered wide acceptance in general XAI problems. By developing a pipeline to generate counterfactuals and subsequently using it to estimate Shapley values, we obtain contrastive and interpretable explanations with strong axiomatic guarantees.
63.6LGMay 19
Fine-Tuning Without Forgetting via Loss-Adaptive Learning RatesParjanya Prajakta Prashant, Jiongli Zhu, Aldan Creo et al.
Fine-tuning large language models on new data improves task performance but degrades capabilities learned during pretraining, a phenomenon known as catastrophic forgetting. Existing methods mitigate this by modifying the fine-tuning objective to suppress high-loss tokens or sequences, but these tokens are essential for learning new tasks, especially those with poor pretraining coverage. In such settings, hard tokens should still contribute to learning, so forgetting must be controlled without suppressing them. We identify a simple mechanism for doing so: per-step forgetting is bounded by the product of the learning rate and the square root of the current training loss. This suggests that high-loss batches are especially prone to inducing forgetting. Motivated by this observation, we introduce FINCH, a loss-adaptive learning-rate schedule that reduces the learning rate on high-loss batches and increases it as the model converges, while leaving the fine-tuning objective unchanged. Across knowledge acquisition, science, and low-resource language adaptation benchmarks, FINCH reduces forgetting by 93% on average while matching the task performance of standard fine-tuning. On Qwen3-4B knowledge acquisition, FINCH cuts TruthfulQA degradation by 5x and reverses HaluEval degradation, while better preserving confidence calibration. Overall, our results show that learning-rate schedules are an effective tool to shape model behavior during fine-tuning, beyond just target-task optimization.
DBDec 2, 2025
Stress-Testing Causal Claims via Cardinality RepairsYarden Gabbay, Haoquan Guan, Shaull Almagor et al.
Causal analyses derived from observational data underpin high-stakes decisions in domains such as healthcare, public policy, and economics. Yet such conclusions can be surprisingly fragile: even minor data errors - duplicate records, or entry mistakes - may drastically alter causal relationships. This raises a fundamental question: how robust is a causal claim to small, targeted modifications in the data? Addressing this question is essential for ensuring the reliability, interpretability, and reproducibility of empirical findings. We introduce SubCure, a framework for robustness auditing via cardinality repairs. Given a causal query and a user-specified target range for the estimated effect, SubCure identifies a small set of tuples or subpopulations whose removal shifts the estimate into the desired range. This process not only quantifies the sensitivity of causal conclusions but also pinpoints the specific regions of the data that drive those conclusions. We formalize this problem under both tuple- and pattern-level deletion settings and show both are NP-complete. To scale to large datasets, we develop efficient algorithms that incorporate machine unlearning techniques to incrementally update causal estimates without retraining from scratch. We evaluate SubCure across four real-world datasets covering diverse application domains. In each case, it uncovers compact, high-impact subsets whose removal significantly shifts the causal conclusions, revealing vulnerabilities that traditional methods fail to detect. Our results demonstrate that cardinality repair is a powerful and general-purpose tool for stress-testing causal analyses and guarding against misleading claims rooted in ordinary data imperfections.
LGMar 4, 2024
OTClean: Data Cleaning for Conditional Independence Violations using Optimal TransportAlireza Pirhadi, Mohammad Hossein Moslemi, Alexander Cloninger et al.
Ensuring Conditional Independence (CI) constraints is pivotal for the development of fair and trustworthy machine learning models. In this paper, we introduce \sys, a framework that harnesses optimal transport theory for data repair under CI constraints. Optimal transport theory provides a rigorous framework for measuring the discrepancy between probability distributions, thereby ensuring control over data utility. We formulate the data repair problem concerning CIs as a Quadratically Constrained Linear Program (QCLP) and propose an alternating method for its solution. However, this approach faces scalability issues due to the computational cost associated with computing optimal transport distances, such as the Wasserstein distance. To overcome these scalability challenges, we reframe our problem as a regularized optimization problem, enabling us to develop an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm, which efficiently addresses high-dimensional and large-scale data. Through extensive experiments, we demonstrate the efficacy and efficiency of our proposed methods, showcasing their practical utility in real-world data cleaning and preprocessing tasks. Furthermore, we provide comparisons with traditional approaches, highlighting the superiority of our techniques in terms of preserving data utility while ensuring adherence to the desired CI constraints.
CVApr 21, 2024
Enforcing Conditional Independence for Fair Representation Learning and Causal Image GenerationJensen Hwa, Qingyu Zhao, Aditya Lahiri et al.
Conditional independence (CI) constraints are critical for defining and evaluating fairness in machine learning, as well as for learning unconfounded or causal representations. Traditional methods for ensuring fairness either blindly learn invariant features with respect to a protected variable (e.g., race when classifying sex from face images) or enforce CI relative to the protected attribute only on the model output (e.g., the sex label). Neither of these methods are effective in enforcing CI in high-dimensional feature spaces. In this paper, we focus on a nascent approach characterizing the CI constraint in terms of two Jensen-Shannon divergence terms, and we extend it to high-dimensional feature spaces using a novel dynamic sampling strategy. In doing so, we introduce a new training paradigm that can be applied to any encoder architecture. We are able to enforce conditional independence of the diffusion autoencoder latent representation with respect to any protected attribute under the equalized odds constraint and show that this approach enables causal image generation with controllable latent spaces. Our experimental results demonstrate that our approach can achieve high accuracy on downstream tasks while upholding equality of odds.
LGMar 17, 2024
Graph Machine Learning based Doubly Robust Estimator for Network Causal EffectsSeyedeh Baharan Khatami, Harsh Parikh, Haowei Chen et al.
We address the challenge of inferring causal effects in social network data. This results in challenges due to interference -- where a unit's outcome is affected by neighbors' treatments -- and network-induced confounding factors. While there is extensive literature focusing on estimating causal effects in social network setups, a majority of them make prior assumptions about the form of network-induced confounding mechanisms. Such strong assumptions are rarely likely to hold especially in high-dimensional networks. We propose a novel methodology that combines graph machine learning approaches with the double machine learning framework to enable accurate and efficient estimation of direct and peer effects using a single observational social network. We demonstrate the semiparametric efficiency of our proposed estimator under mild regularity conditions, allowing for consistent uncertainty quantification. We demonstrate that our method is accurate, robust, and scalable via an extensive simulation study. We use our method to investigate the impact of Self-Help Group participation on financial risk tolerance.
LGApr 21, 2025
Causal DAG Summarization (Full Version)Anna Zeng, Michael Cafarella, Batya Kenig et al.
Causal inference aids researchers in discovering cause-and-effect relationships, leading to scientific insights. Accurate causal estimation requires identifying confounding variables to avoid false discoveries. Pearl's causal model uses causal DAGs to identify confounding variables, but incorrect DAGs can lead to unreliable causal conclusions. However, for high dimensional data, the causal DAGs are often complex beyond human verifiability. Graph summarization is a logical next step, but current methods for general-purpose graph summarization are inadequate for causal DAG summarization. This paper addresses these challenges by proposing a causal graph summarization objective that balances graph simplification for better understanding while retaining essential causal information for reliable inference. We develop an efficient greedy algorithm and show that summary causal DAGs can be directly used for inference and are more robust to misspecification of assumptions, enhancing robustness for causal inference. Experimenting with six real-life datasets, we compared our algorithm to three existing solutions, showing its effectiveness in handling high-dimensional data and its ability to generate summary DAGs that ensure both reliable causal inference and robustness against misspecifications.
LGNov 29, 2024
Scalable Out-of-distribution Robustness in the Presence of Unobserved ConfoundersParjanya Prashant, Seyedeh Baharan Khatami, Bruno Ribeiro et al.
We consider the task of out-of-distribution (OOD) generalization, where the distribution shift is due to an unobserved confounder ($Z$) affecting both the covariates ($X$) and the labels ($Y$). This confounding introduces heterogeneity in the predictor, i.e., $P(Y | X) = E_{P(Z | X)}[P(Y | X,Z)]$, making traditional covariate and label shift assumptions unsuitable. OOD generalization differs from traditional domain adaptation in that it does not assume access to the covariate distribution ($X^\text{te}$) of the test samples during training. These conditions create a challenging scenario for OOD robustness: (a) $Z^\text{tr}$ is an unobserved confounder during training, (b) $P^\text{te}(Z) \neq P^\text{tr}(Z)$, (c) $X^\text{te}$ is unavailable during training, and (d) the predictive distribution depends on $P^\text{te}(Z)$. While prior work has developed complex predictors requiring multiple additional variables for identifiability of the latent distribution, we explore a set of identifiability assumptions that yield a surprisingly simple predictor using only a single additional variable. Our approach demonstrates superior empirical performance on several benchmark tasks.
LGJun 30, 2025
KAIROS: Scalable Model-Agnostic Data ValuationJiongli Zhu, Parjanya Prajakta Prashant, Alex Cloninger et al.
Training data increasingly shapes not only model accuracy but also regulatory compliance and market valuation of AI assets. Yet existing valuation methods remain inadequate: model-based techniques depend on a single fitted model and inherit its biases, while algorithm-based approaches such as Data Shapley require costly retrainings at web scale. Recent Wasserstein-based model-agnostic methods rely on approximations that misrank examples relative to their true leave-one-out (LOO) utility. We introduce KAIROS, a scalable, model-agnostic valuation framework that assigns each example a distributional influence score: its contribution to the Maximum Mean Discrepancy (MMD) between the empirical training distribution and a clean reference set. Unlike Wasserstein surrogates, our MMD-based influence admits a closed-form solution that faithfully approximates the exact LOO ranking within $O(1/N^2)$ error, requires no retraining, and naturally extends to conditional kernels for unified label- and feature-error detection. Moreover, KAIROS supports efficient online updates: when a new batch of size m arrives, all scores can be updated in $O(mN)$ time, delivering up to 50x speedup without compromising ranking quality. Empirical evaluations on noise, mislabeling, and poisoning benchmarks show that KAIROS consistently outperforms state-of-the-art model-, Shapley-, and Wasserstein-based baselines in both accuracy and runtime. We provide rigorous theoretical guarantees, including symmetry for reproducible rankings and density-separation for interpretable thresholds.
LGJun 2, 2025
Stress-Testing ML Pipelines with Adversarial Data CorruptionJiongli Zhu, Geyang Xu, Felipe Lorenzi et al.
Structured data-quality issues, such as missing values correlated with demographics, culturally biased labels, or systemic selection biases, routinely degrade the reliability of machine-learning pipelines. Regulators now increasingly demand evidence that high-stakes systems can withstand these realistic, interdependent errors, yet current robustness evaluations typically use random or overly simplistic corruptions, leaving worst-case scenarios unexplored. We introduce SAVAGE, a causally inspired framework that (i) formally models realistic data-quality issues through dependency graphs and flexible corruption templates, and (ii) systematically discovers corruption patterns that maximally degrade a target performance metric. SAVAGE employs a bi-level optimization approach to efficiently identify vulnerable data subpopulations and fine-tune corruption severity, treating the full ML pipeline, including preprocessing and potentially non-differentiable models, as a black box. Extensive experiments across multiple datasets and ML tasks (data cleaning, fairness-aware learning, uncertainty quantification) demonstrate that even a small fraction (around 5 %) of structured corruptions identified by SAVAGE severely impacts model performance, far exceeding random or manually crafted errors, and invalidating core assumptions of existing techniques. Thus, SAVAGE provides a practical tool for rigorous pipeline stress-testing, a benchmark for evaluating robustness methods, and actionable guidance for designing more resilient data workflows.
LGFeb 7, 2025
A Lightweight Method to Disrupt Memorized Sequences in LLMParjanya Prajakta Prashant, Kaustubh Ponkshe, Babak Salimi
As language models scale, their performance improves dramatically across a wide range of tasks, but so does their tendency to memorize and regurgitate parts of their training data verbatim. This tradeoff poses serious legal, ethical, and safety concerns, especially in real-world deployments. Existing mitigation techniques, such as differential privacy or model unlearning, often require retraining or access to internal weights making them impractical for most users. In this work, we introduce TokenSwap, a lightweight, post-hoc defense designed for realistic settings where the user can only access token-level outputs. Our key insight is that while large models are necessary for high task performance, small models (e.g., DistilGPT-2) are often sufficient to assign fluent, grammatically plausible probabilities to common function words - and crucially, they memorize far less. By selectively swapping token probabilities between models, TokenSwap preserves the capabilities of large models while reducing their propensity for verbatim reproduction. Evaluations on Pythia-6.9B and Llama-3-8B show up to a 10$\times$ drop in exact memorization with negligible task degradation. Our method offers a practical, accessible solution for mitigating memorized generation in deployed LLMs.
LGDec 17, 2021
Interpretable Data-Based Explanations for Fairness DebuggingRomila Pradhan, Jiongli Zhu, Boris Glavic et al.
A wide variety of fairness metrics and eXplainable Artificial Intelligence (XAI) approaches have been proposed in the literature to identify bias in machine learning models that are used in critical real-life contexts. However, merely reporting on a model's bias, or generating explanations using existing XAI techniques is insufficient to locate and eventually mitigate sources of bias. We introduce Gopher, a system that produces compact, interpretable and causal explanations for bias or unexpected model behavior by identifying coherent subsets of the training data that are root-causes for this behavior. Specifically, we introduce the concept of causal responsibility that quantifies the extent to which intervening on training data by removing or updating subsets of it can resolve the bias. Building on this concept, we develop an efficient approach for generating the top-k patterns that explain model bias that utilizes techniques from the machine learning (ML) community to approximate causal responsibility and uses pruning rules to manage the large search space for patterns. Our experimental evaluation demonstrates the effectiveness of Gopher in generating interpretable explanations for identifying and debugging sources of bias.
AIMar 22, 2021
Explaining Black-Box Algorithms Using Probabilistic Contrastive CounterfactualsSainyam Galhotra, Romila Pradhan, Babak Salimi
There has been a recent resurgence of interest in explainable artificial intelligence (XAI) that aims to reduce the opaqueness of AI-based decision-making systems, allowing humans to scrutinize and trust them. Prior work in this context has focused on the attribution of responsibility for an algorithm's decisions to its inputs wherein responsibility is typically approached as a purely associational concept. In this paper, we propose a principled causality-based approach for explaining black-box decision-making systems that addresses limitations of existing methods in XAI. At the core of our framework lies probabilistic contrastive counterfactuals, a concept that can be traced back to philosophical, cognitive, and social foundations of theories on how humans generate and select explanations. We show how such counterfactuals can quantify the direct and indirect influences of a variable on decisions made by an algorithm, and provide actionable recourse for individuals negatively affected by the algorithm's decision. Unlike prior work, our system, LEWIS: (1)can compute provably effective explanations and recourse at local, global and contextual levels (2)is designed to work with users with varying levels of background knowledge of the underlying causal model and (3)makes no assumptions about the internals of an algorithmic system except for the availability of its input-output data. We empirically evaluate LEWIS on three real-world datasets and show that it generates human-understandable explanations that improve upon state-of-the-art approaches in XAI, including the popular LIME and SHAP. Experiments on synthetic data further demonstrate the correctness of LEWIS's explanations and the scalability of its recourse algorithm.
LGJan 18, 2021
Through the Data Management Lens: Experimental Analysis and Evaluation of Fair ClassificationMaliha Tashfia Islam, Anna Fariha, Alexandra Meliou et al.
Classification, a heavily-studied data-driven machine learning task, drives an increasing number of prediction systems involving critical human decisions such as loan approval and criminal risk assessment. However, classifiers often demonstrate discriminatory behavior, especially when presented with biased data. Consequently, fairness in classification has emerged as a high-priority research area. Data management research is showing an increasing presence and interest in topics related to data and algorithmic fairness, including the topic of fair classification. The interdisciplinary efforts in fair classification, with machine learning research having the largest presence, have resulted in a large number of fairness notions and a wide range of approaches that have not been systematically evaluated and compared. In this paper, we contribute a broad analysis of 13 fair classification approaches and additional variants, over their correctness, fairness, efficiency, scalability, robustness to data errors, sensitivity to underlying ML model, data efficiency, and stability using a variety of metrics and real-world datasets. Our analysis highlights novel insights on the impact of different metrics and high-level approach characteristics on different aspects of performance. We also discuss general principles for choosing approaches suitable for different practical settings, and identify areas where data-management-centric solutions are likely to have the most impact.
DBApr 7, 2020
Causal Relational LearningBabak Salimi, Harsh Parikh, Moe Kayali et al.
Causal inference is at the heart of empirical research in natural and social sciences and is critical for scientific discovery and informed decision making. The gold standard in causal inference is performing randomized controlled trials; unfortunately these are not always feasible due to ethical, legal, or cost constraints. As an alternative, methodologies for causal inference from observational data have been developed in statistical studies and social sciences. However, existing methods critically rely on restrictive assumptions such as the study population consisting of homogeneous elements that can be represented in a single flat table, where each row is referred to as a unit. In contrast, in many real-world settings, the study domain naturally consists of heterogeneous elements with complex relational structure, where the data is naturally represented in multiple related tables. In this paper, we present a formal framework for causal inference from such relational data. We propose a declarative language called CaRL for capturing causal background knowledge and assumptions and specifying causal queries using simple Datalog-like rules.CaRL provides a foundation for inferring causality and reasoning about the effect of complex interventions in relational domains. We present an extensive experimental evaluation on real relational data to illustrate the applicability of CaRL in social sciences and healthcare.
DBAug 20, 2019
Data Management for Causal Algorithmic FairnessBabak Salimi, Bill Howe, Dan Suciu
Fairness is increasingly recognized as a critical component of machine learning systems. However, it is the underlying data on which these systems are trained that often reflects discrimination, suggesting a data management problem. In this paper, we first make a distinction between associational and causal definitions of fairness in the literature and argue that the concept of fairness requires causal reasoning. We then review existing works and identify future opportunities for applying data management techniques to causal algorithmic fairness.
DBFeb 21, 2019
Capuchin: Causal Database Repair for Algorithmic FairnessBabak Salimi, Luke Rodriguez, Bill Howe et al.
Fairness is increasingly recognized as a critical component of machine learning systems. However, it is the underlying data on which these systems are trained that often reflect discrimination, suggesting a database repair problem. Existing treatments of fairness rely on statistical correlations that can be fooled by statistical anomalies, such as Simpson's paradox. Proposals for causality-based definitions of fairness can correctly model some of these situations, but they require specification of the underlying causal models. In this paper, we formalize the situation as a database repair problem, proving sufficient conditions for fair classifiers in terms of admissible variables as opposed to a complete causal model. We show that these conditions correctly capture subtle fairness violations. We then use these conditions as the basis for database repair algorithms that provide provable fairness guarantees about classifiers trained on their training labels. We evaluate our algorithms on real data, demonstrating improvement over the state of the art on multiple fairness metrics proposed in the literature while retaining high utility.
DBAug 8, 2017
A Framework for Inferring Causality from Multi-Relational Observational Data using Conditional IndependenceSudeepa Roy, Babak Salimi
The study of causality or causal inference - how much a given treatment causally affects a given outcome in a population - goes way beyond correlation or association analysis of variables, and is critical in making sound data driven decisions and policies in a multitude of applications. The gold standard in causal inference is performing "controlled experiments", which often is not possible due to logistical or ethical reasons. As an alternative, inferring causality on "observational data" based on the "Neyman-Rubin potential outcome model" has been extensively used in statistics, economics, and social sciences over several decades. In this paper, we present a formal framework for sound causal analysis on observational datasets that are given as multiple relations and where the population under study is obtained by joining these base relations. We study a crucial condition for inferring causality from observational data, called the "strong ignorability assumption" (the treatment and outcome variables should be independent in the joined relation given the observed covariates), using known conditional independences that hold in the base relations. We also discuss how the structure of the conditional independences in base relations given as graphical models help infer new conditional independences in the joined relation. The proposed framework combines concepts from databases, statistics, and graphical models, and aims to initiate new research directions spanning these fields to facilitate powerful data-driven decisions in today's big data world.
DBNov 6, 2016
Causes for Query Answers from Databases: Datalog Abduction, View-Updates, and Integrity ConstraintsLeopoldo Bertossi, Babak Salimi
Causality has been recently introduced in databases, to model, characterize, and possibly compute causes for query answers. Connections between QA-causality and consistency-based diagnosis and database repairs (wrt. integrity constraint violations) have already been established. In this work we establish precise connections between QA-causality and both abductive diagnosis and the view-update problem in databases, allowing us to obtain new algorithmic and complexity results for QA-causality. We also obtain new results on the complexity of view-conditioned causality, and investigate the notion of QA-causality in the presence of integrity constraints, obtaining complexity results from a connection with view-conditioned causality. The abduction connection under integrity constraints allows us to obtain algorithmic tools for QA-causality.
DBSep 12, 2016
ZaliQL: A SQL-Based Framework for Drawing Causal Inference from Big DataBabak Salimi, Dan Suciu
Causal inference from observational data is a subject of active research and development in statistics and computer science. Many toolkits have been developed for this purpose that depends on statistical software. However, these toolkits do not scale to large datasets. In this paper we describe a suite of techniques for expressing causal inference tasks from observational data in SQL. This suite supports the state-of-the-art methods for causal inference and run at scale within a database engine. In addition, we introduce several optimization techniques that significantly speedup causal inference, both in the online and offline setting. We evaluate the quality and performance of our techniques by experiments of real datasets.
DBFeb 20, 2016
Causes for Query Answers from Databases, Datalog Abduction and View-Updates: The Presence of Integrity ConstraintsBabak Salimi, Leopoldo Bertossi
Causality has been recently introduced in databases, to model, characterize and possibly compute causes for query results (answers). Connections between queryanswer causality, consistency-based diagnosis, database repairs (wrt. integrity constraint violations), abductive diagnosis and the view-update problem have been established. In this work we further investigate connections between query-answer causality and abductive diagnosis and the view-update problem. In this context, we also define and investigate the notion of query-answer causality in the presence of integrity constraints.
DBJul 1, 2015
From Causes for Database Queries to Repairs and Model-Based Diagnosis and BackLeopoldo Bertossi, Babak Salimi
In this work we establish and investigate connections between causes for query answers in databases, database repairs wrt. denial constraints, and consistency-based diagnosis. The first two are relatively new research areas in databases, and the third one is an established subject in knowledge representation. We show how to obtain database repairs from causes, and the other way around. Causality problems are formulated as diagnosis problems, and the diagnoses provide causes and their responsibilities. The vast body of research on database repairs can be applied to the newer problems of computing actual causes for query answers and their responsibilities. These connections, which are interesting per se, allow us, after a transition -inspired by consistency-based diagnosis- to computational problems on hitting sets and vertex covers in hypergraphs, to obtain several new algorithmic and complexity results for database causality.
DBJun 13, 2015
Query-Answer Causality in Databases: Abductive Diagnosis and View-UpdatesBabak Salimi, Leopoldo Bertossi
Causality has been recently introduced in databases, to model, characterize and possibly compute causes for query results (answers). Connections between query causality and consistency-based diagnosis and database repairs (wrt. integrity constrain violations) have been established in the literature. In this work we establish connections between query causality and abductive diagnosis and the view-update problem. The unveiled relationships allow us to obtain new complexity results for query causality -the main focus of our work- and also for the two other areas.