DBMar 22, 2023
Semi-Oblivious Chase Termination for Linear Existential Rules: An Experimental StudyMarco Calautti, Mostafa Milani, Andreas Pieris
The chase procedure is a fundamental algorithmic tool in databases that allows us to reason with constraints, such as existential rules, with a plethora of applications. It takes as input a database and a set of constraints, and iteratively completes the database as dictated by the constraints. A key challenge, though, is the fact that it may not terminate, which leads to the problem of checking whether it terminates given a database and a set of constraints. In this work, we focus on the semi-oblivious version of the chase, which is well-suited for practical implementations, and linear existential rules, a central class of constraints with several applications. In this setting, there is a mature body of theoretical work that provides syntactic characterizations of when the chase terminates, algorithms for checking chase termination, precise complexity results, and worst-case optimal bounds on the size of the result of the chase (whenever is finite). Our main objective is to experimentally evaluate the existing chase termination algorithms with the aim of understanding which input parameters affect their performance, clarifying whether they can be used in practice, and revealing their performance limitations.
LGSep 24, 2024
Evaluating Blocking Biases in Entity MatchingMohammad Hossein Moslemi, Harini Balamurugan, Mostafa Milani
Entity Matching (EM) is crucial for identifying equivalent data entities across different sources, a task that becomes increasingly challenging with the growth and heterogeneity of data. Blocking techniques, which reduce the computational complexity of EM, play a vital role in making this process scalable. Despite advancements in blocking methods, the issue of fairness; where blocking may inadvertently favor certain demographic groups; has been largely overlooked. This study extends traditional blocking metrics to incorporate fairness, providing a framework for assessing bias in blocking techniques. Through experimental analysis, we evaluate the effectiveness and fairness of various blocking methods, offering insights into their potential biases. Our findings highlight the importance of considering fairness in EM, particularly in the blocking phase, to ensure equitable outcomes in data integration tasks.
23.4DBMar 12
Structure Selection for Fairness-Constrained Differentially Private Data SynthesisNaeim Ghahramanpour, Mostafa Milani
Differential privacy (DP) enables safe data release, with synthetic data generation emerging as a common approach in recent years. Yet standard synthesizers preserve all dependencies in the data, including spurious correlations between sensitive attributes and outcomes. In fairness-critical settings, this reproduces unwanted bias. A principled remedy is to enforce conditional independence (CI) constraints, which encode domain knowledge or legal requirements that outcomes be independent of sensitive attributes once admissible factors are accounted for. DP synthesis typically proceeds in two phases: (i) a measure- ment step that privatizes selected marginals, often structured via minimum spanning trees (MSTs), and (ii) a reconstruction step that fits a probabilistic model consistent with the noisy marginals. We propose PrivCI, which enforces CI during the measurement step via a CI-aware greedy MST algorithm that integrates feasibility checks into Kruskal's construction under the exponential mechanism, improving accuracy over competing methods. Experiments on standard fairness benchmarks show that PrivCI achieves stronger fidelity and predictive accuracy than prior baselines while satisfying the specified CI constraints.
LGMar 4, 2024
OTClean: Data Cleaning for Conditional Independence Violations using Optimal TransportAlireza Pirhadi, Mohammad Hossein Moslemi, Alexander Cloninger et al.
Ensuring Conditional Independence (CI) constraints is pivotal for the development of fair and trustworthy machine learning models. In this paper, we introduce \sys, a framework that harnesses optimal transport theory for data repair under CI constraints. Optimal transport theory provides a rigorous framework for measuring the discrepancy between probability distributions, thereby ensuring control over data utility. We formulate the data repair problem concerning CIs as a Quadratically Constrained Linear Program (QCLP) and propose an alternating method for its solution. However, this approach faces scalability issues due to the computational cost associated with computing optimal transport distances, such as the Wasserstein distance. To overcome these scalability challenges, we reframe our problem as a regularized optimization problem, enabling us to develop an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm, which efficiently addresses high-dimensional and large-scale data. Through extensive experiments, we demonstrate the efficacy and efficiency of our proposed methods, showcasing their practical utility in real-world data cleaning and preprocessing tasks. Furthermore, we provide comparisons with traditional approaches, highlighting the superiority of our techniques in terms of preserving data utility while ensuring adherence to the desired CI constraints.
5.8DBMar 25
The Human Factor in Data Cleaning: Exploring Preferences and BiasesHazim AbdElazim, Shadman Islam, Mostafa Milani
Data cleaning is often framed as a technical preprocessing step, yet in practice it relies heavily on human judgment. We report results from a controlled survey study in which participants performed error detection, data repair and imputation, and entity matching tasks on census-inspired scenarios with known semantic validity. We find systematic evidence for several cognitive bias mechanisms in data cleaning. Framing effects arise when surface-level formatting differences (e.g., capitalization or numeric presentation) increase false-positive error flags despite unchanged semantics. Anchoring and adjustment bias appears when expert cues shift participant decisions beyond parity, consistent with salience and availability effects. We also observe the representativeness heuristic: atypical but valid attribute combinations are frequently flagged as erroneous, and in entity matching tasks, surface similarity produces a substantial false-positive rate with high confidence. In data repair, participants show a robust preference for leaving values missing rather than imputing plausible values, consistent with omission bias. In contrast, automation-aligned switching under strong contradiction does not exceed a conservative rare-error tolerance threshold at the population level, indicating that deference to automated recommendations is limited in this setting. Across scenarios, bias patterns persist among technically experienced participants and across diverse workflow practices, suggesting that bias in data cleaning reflects general cognitive tendencies rather than lack of expertise. These findings motivate human-in-the-loop cleaning systems that clearly separate representation from semantics, present expert or algorithmic recommendations non-prescriptively, and support reflective evaluation of atypical but valid cases.
DBNov 26, 2025
Beyond Accuracy: An Empirical Study of Uncertainty Estimation in ImputationZarin Tahia Hossain, Mostafa Milani
Handling missing data is a central challenge in data-driven analysis. Modern imputation methods not only aim for accurate reconstruction but also differ in how they represent and quantify uncertainty. Yet, the reliability and calibration of these uncertainty estimates remain poorly understood. This paper presents a systematic empirical study of uncertainty in imputation, comparing representative methods from three major families: statistical (MICE, SoftImpute), distribution alignment (OT-Impute), and deep generative (GAIN, MIWAE, TabCSDI). Experiments span multiple datasets, missingness mechanisms (MCAR, MAR, MNAR), and missingness rates. Uncertainty is estimated through three complementary routes: multi-run variability, conditional sampling, and predictive-distribution modeling, and evaluated using calibration curves and the Expected Calibration Error (ECE). Results show that accuracy and calibration are often misaligned: models with high reconstruction accuracy do not necessarily yield reliable uncertainty. We analyze method-specific trade-offs among accuracy, calibration, and runtime, identify stable configurations, and offer guidelines for selecting uncertainty-aware imputers in data cleaning and downstream machine learning pipelines.
LGNov 3, 2024
Reducing Biases in Record Matching Through Scores CalibrationMohammad Hossein Moslemi, Mostafa Milani
Record matching is the task of identifying records that refer to the same real-world entity across datasets. While most existing models optimize for accuracy, fairness has become an important concern due to the potential for unequal outcomes across demographic groups. Prior work typically focuses on binary outcomes evaluated at fixed decision thresholds. However, such evaluations can miss biases in matching scores--biases that persist across thresholds and affect downstream tasks. We propose a threshold-independent framework for measuring and reducing score bias, defined as disparities in the distribution of matching scores across groups. We show that several state-of-the-art matching methods exhibit substantial score bias, even when appearing fair under standard threshold-based metrics. To address this, we introduce two post-processing score calibration algorithms. The first, calib, aligns group-wise score distributions using the Wasserstein barycenter, targeting demographic parity. The second, ccalib, conditions on predicted labels to further reduce label-dependent biases, such as equal opportunity. Both methods are model-agnostic and require no access to model training data. calib also offers theoretical guarantees, ensuring reduced bias with minimal deviation from original scores. Experiments across real-world datasets and matching models confirm that calib and ccalib substantially reduce score bias while minimally impacting model accuracy.
DBAug 2, 2021
Extending Sticky-Datalog+/- via Finite-Position Selection Functions: Tractability, Algorithms, and OptimizationLeopoldo Bertossi, Mostafa Milani
Weakly-Sticky(WS) Datalog+/- is an expressive member of the family of Datalog+/- program classes that is defined on the basis of the conditions of stickiness and weak-acyclicity. Conjunctive query answering (QA) over the WS programs has been investigated, and its tractability in data complexity has been established. However, the design and implementation of practical QA algorithms and their optimizations have been open. In order to fill this gap, we first study Sticky and WS programs from the point of view of the behavior of the chase procedure. We extend the stickiness property of the chase to that of generalized stickiness of the chase (GSCh) modulo an oracle that selects (and provides) the predicate positions where finitely values appear during the chase. Stickiness modulo a selection function S that provides only a subset of those positions defines sch(S), a semantic subclass of GSCh. Program classes with selection functions include Sticky and WS, and another syntactic class that we introduce and characterize, namely JWS, of jointly-weakly-sticky programs, which contains WS. The selection functions for these last three classes are computable, and no external, possibly non-computable oracle is needed. We propose a bottom-up QA algorithm for programs in the class sch(S), for a general selection function S. As a particular case, we obtain a polynomial-time QA algorithm for JWS and weakly-sticky programs. Unlike WS, JWS turns out to be closed under magic-sets query optimization. As a consequence, both the generic polynomial-time QA algorithm and its magic-set optimization can be particularized and applied to WS.
DBApr 1, 2017
Ontological Multidimensional Data Models and Contextual Data QalityLeopoldo Bertossi, Mostafa Milani
Data quality assessment and data cleaning are context-dependent activities. Motivated by this observation, we propose the Ontological Multidimensional Data Model (OMD model), which can be used to model and represent contexts as logic-based ontologies. The data under assessment is mapped into the context, for additional analysis, processing, and quality data extraction. The resulting contexts allow for the representation of dimensions, and multidimensional data quality assessment becomes possible. At the core of a multidimensional context we include a generalized multidimensional data model and a Datalog+/- ontology with provably good properties in terms of query answering. These main components are used to represent dimension hierarchies, dimensional constraints, dimensional rules, and define predicates for quality data specification. Query answering relies upon and triggers navigation through dimension hierarchies, and becomes the basic tool for the extraction of quality data. The OMD model is interesting per se, beyond applications to data quality. It allows for a logic-based, and computationally tractable representation of multidimensional data, extending previous multidimensional data models with additional expressive power and functionalities.
DBMar 10, 2017
The Ontological Multidimensional Data ModelLeopoldo Bertossi, Mostafa Milani
In this extended abstract we describe, mainly by examples, the main elements of the Ontological Multidimensional Data Model, which considerably extends a relational reconstruction of the multidimensional data model proposed by Hurtado and Mendelzon by means of tuple-generating dependencies, equality-generating dependencies, and negative constraints as found in Datalog+-. We briefly mention some good computational properties of the model.
DBJul 10, 2016
Extending Weakly-Sticky Datalog+/-: Query-Answering Tractability and OptimizationsMostafa Milani, Leopoldo Bertossi
Weakly-sticky (WS) Datalog+/- is an expressive member of the family of Datalog+/- programs that is based on the syntactic notions of stickiness and weak-acyclicity. Query answering over the WS programs has been investigated, but there is still much work to do on the design and implementation of practical query answering (QA) algorithms and their optimizations. Here, we study sticky and WS programs from the point of view of the behavior of the chase procedure, extending the stickiness property of the chase to that of generalized stickiness of the chase (gsch-property). With this property we specify the semantic class of GSCh programs, which includes sticky and WS programs, and other syntactic subclasses that we identify. In particular, we introduce joint-weakly-sticky (JWS) programs, that include WS programs. We also propose a bottom-up QA algorithm for a range of subclasses of GSCh. The algorithm runs in polynomial time (in data) for JWS programs. Unlike the WS class, JWS is closed under a general magic-sets rewriting procedure for the optimization of programs with existential rules. We apply the magic-sets rewriting in combination with the proposed QA algorithm for the optimization of QA over JWS programs.
DBApr 22, 2016
A Hybrid Approach to Query Answering under Expressive Datalog+/-Mostafa Milani, Andrea Cali, Leopoldo Bertossi
Datalog+/- is a family of ontology languages that combine good computational properties with high expressive power. Datalog+/- languages are provably able to capture the most relevant Semantic Web languages. In this paper we consider the class of weakly-sticky (WS) Datalog+/- programs, which allow for certain useful forms of joins in rule bodies as well as extending the well-known class of weakly-acyclic TGDs. So far, only non-deterministic algorithms were known for answering queries on WS Datalog+/- programs. We present novel deterministic query answering algorithms under WS Datalog+/-. In particular, we propose: (1) a bottom-up grounding algorithm based on a query-driven chase, and (2) a hybrid approach based on transforming a WS program into a so-called sticky one, for which query rewriting techniques are known. We discuss how our algorithms can be optimized and effectively applied for query answering in real-world scenarios.
DBApr 13, 2015
Tractable Query Answering and Optimization for Extensions of Weakly-Sticky Datalog+-Mostafa Milani, Leopoldo Bertossi
We consider a semantic class, weakly-chase-sticky (WChS), and a syntactic subclass, jointly-weakly-sticky (JWS), of Datalog+- programs. Both extend that of weakly-sticky (WS) programs, which appear in our applications to data quality. For WChS programs we propose a practical, polynomial-time query answering algorithm (QAA). We establish that the two classes are closed under magic-sets rewritings. As a consequence, QAA can be applied to the optimized programs. QAA takes as inputs the program (including the query) and semantic information about the "finiteness" of predicate positions. For the syntactic subclasses JWS and WS of WChS, this additional information is computable.