David Heckerman

h-index: 97

60 papers

20,016 citations

Novelty: 35%

AI Score: 32

Ranked #127,882 of 194,257 authors (top 66%); #7,819 in AI (top 62%)

60 Papers

3.4CLAug 14, 2024

Fast Training Dataset Attribution via In-Context Learning

Milad Fotouhi, Mohammad Taha Bahadori, Oluwaseyi Feyisetan et al.

We investigate the use of in-context learning and prompt engineering to estimate the contributions of training data in the outputs of instruction-tuned large language models (LLMs). We propose two novel approaches: (1) a similarity-based approach that measures the difference between LLM outputs with and without provided context, and (2) a mixture distribution model approach that frames the problem of identifying contribution scores as a matrix factorization task. Our empirical comparison demonstrates that the mixture model approach is more robust to retrieval noise in in-context learning, providing a more reliable estimation of data contributions.

4.3MEApr 12, 2024Code

Multiply-Robust Causal Change Attribution

Victor Quintas-Martinez, Mohammad Taha Bahadori, Eduardo Santiago et al.

Comparing two samples of data, we observe a change in the distribution of an outcome variable. In the presence of multiple explanatory variables, how much of the change can be explained by each possible cause? We develop a new estimation strategy that, given a causal model, combines regression and re-weighting methods to quantify the contribution of each causal mechanism. Our proposed methodology is multiply robust, meaning that it still recovers the target parameter under partial misspecification. We prove that our estimator is consistent and asymptotically normal. Moreover, it can be incorporated into existing frameworks for causal attribution, such as Shapley values, which will inherit the consistency and large-sample distribution properties. Our method demonstrates excellent performance in Monte Carlo simulations, and we show its usefulness in an empirical application. Our method is implemented as part of the Python library DoWhy (arXiv:2011.04216, arXiv:2206.06821).

1.0CLDec 3, 2024

Removing Spurious Correlation from Neural Network Interpretations

Milad Fotouhi, Mohammad Taha Bahadori, Oluwaseyi Feyisetan et al.

Existing algorithms for identifying the neurons responsible for undesired and harmful behaviors do not consider the effects of confounders such as the topic of the conversation. In this work, we show that confounders can create spurious correlations and propose a new causal mediation approach that controls for the impact of the topic. In experiments with two large language models, we study the localization hypothesis and show that, after adjusting for the effect of conversation topic, toxicity becomes less localized.

2.1AIFeb 13, 2023

Heckerthoughts

David Heckerman

This manuscript is a technical memoir about my work at Stanford and Microsoft Research. Included are fundamental concepts central to machine learning and artificial intelligence, applications of these concepts, and stories behind their creation.

8.4LGJul 27, 2021

End-to-End Balancing for Causal Continuous Treatment-Effect Estimation

Mohammad Taha Bahadori, Eric Tchetgen Tchetgen, David E. Heckerman

We study the problem of observational causal inference with continuous treatments in the framework of inverse propensity-score weighting. To obtain stable weights, we design a new algorithm based on entropy balancing that learns weights to directly maximize causal inference accuracy using end-to-end optimization. In the process of optimization, these weights are automatically tuned to the specific dataset and causal inference algorithm being used. We provide a theoretical analysis demonstrating consistency of our approach. Using synthetic and real-world data, we show that our algorithm estimates causal effect more accurately than baseline entropy balancing.

6.5LGMay 13, 2021

Likelihoods and Parameter Priors for Bayesian Networks

David Heckerman, Dan Geiger

We develop simple methods for constructing likelihoods and parameter priors for learning about the parameters and structure of a Bayesian network. In particular, we introduce several assumptions that permit the construction of likelihoods and parameter priors for a large number of Bayesian-network structures from a small set of assessments. The most notable assumption is that of likelihood equivalence, which says that data cannot help to discriminate network structures that encode the same assertions of conditional independence. We describe the constructions that follow from these assumptions, and also present a method for directly computing the marginal likelihood of a random sample with no missing observations. Also, we show how these assumptions lead to a general framework for characterizing parameter priors of multivariate distributions.

13.2MLMay 5, 2021

Parameter Priors for Directed Acyclic Graphical Models and the Characterization of Several Probability Distributions

Dan Geiger, David Heckerman

We develop simple methods for constructing parameter priors for model choice among Directed Acyclic Graphical (DAG) models. In particular, we introduce several assumptions that permit the construction of parameter priors for a large number of DAG models from a small set of assessments. We then present a method for directly computing the marginal likelihood of every DAG model given a random sample with no missing observations. We apply this methodology to Gaussian DAG models which consist of a recursive set of linear regression models. We show that the only parameter prior for complete Gaussian DAG models that satisfies our assumptions is the normal-Wishart distribution. Our analysis is based on the following new characterization of the Wishart distribution: let $W$ be an $n \times n$, $n \ge 3$, positive-definite symmetric matrix of random variables and $f(W)$ be a pdf of $W$. Then, $f(W)$ is a Wishart distribution if and only if $W_{11} - W_{12} W_{22}^{-1} W'_{12}$ is independent of $\{W_{12},W_{22}\}$ for every block partitioning $W_{11},W_{12}, W'_{12}, W_{22}$ of $W$. Similar characterizations of the normal and normal-Wishart distributions are provided as well.

17.1LGJul 22, 2020

Debiasing Concept-based Explanations with Causal Analysis

Mohammad Taha Bahadori, David E. Heckerman

The concept-based explanation approach is a popular model interpretability tool because it expresses the reasons for a model's predictions in terms of concepts that are meaningful to domain experts. In this work, we study the problem of the concepts being correlated with confounding information in the features. We propose a new causal prior graph for modeling the impacts of unobserved variables and a method to remove the impact of confounding information and noise using a two-stage regression technique borrowed from the instrumental variable literature. We also model the completeness of the concept set and show that our debiasing method works when the concepts are not complete. Our synthetic and real-world experiments demonstrate the success of our method in removing biases and improving the ranking of the concepts in terms of their contribution to the explanation of the predictions.

34.4LGFeb 1, 2020

A Tutorial on Learning With Bayesian Networks

David Heckerman

A Bayesian network is a graphical model that encodes probabilistic relationships among variables of interest. When used in conjunction with statistical techniques, the graphical model has several advantages for data analysis. One, because the model encodes dependencies among all variables, it readily handles situations where some data entries are missing. Two, a Bayesian network can be used to learn causal relationships, and hence can be used to gain understanding about a problem domain and to predict the consequences of intervention. Three, because the model has both a causal and probabilistic semantics, it is an ideal representation for combining prior knowledge (which often comes in causal form) and data. Four, Bayesian statistical methods in conjunction with Bayesian networks offer an efficient and principled approach for avoiding the overfitting of data. In this paper, we discuss methods for constructing Bayesian networks from prior knowledge and summarize Bayesian statistical methods for using data to improve these models. With regard to the latter task, we describe methods for learning both the parameters and structure of a Bayesian network, including techniques for learning with incomplete data. In addition, we relate Bayesian-network methods for learning to techniques for supervised and unsupervised learning. We illustrate the graphical-modeling approach using a real-world case study.

26.1AINov 6, 2019

Probabilistic Similarity Networks

David Heckerman

Normative expert systems have not become commonplace because they have been difficult to build and use. Over the past decade, however, researchers have developed the influence diagram, a graphical representation of a decision maker's beliefs, alternatives, and preferences that serves as the knowledge base of a normative expert system. Most people who have seen the representation find it intuitive and easy to use. Consequently, the influence diagram has overcome significantly the barriers to constructing normative expert systems. Nevertheless, building influence diagrams is not practical for extremely large and complex domains. In this book, I address the difficulties associated with the construction of the probabilistic portion of an influence diagram, called a knowledge map, belief network, or Bayesian network. I introduce two representations that facilitate the generation of large knowledge maps. In particular, I introduce the similarity network, a tool for building the network structure of a knowledge map, and the partition, a tool for assessing the probabilities associated with a knowledge map. I then use these representations to build Pathfinder, a large normative expert system for the diagnosis of lymph-node diseases (the domain contains over 60 diseases and over 100 disease findings). In an early version of the system, I encoded the knowledge of the expert using an erroneous assumption that all disease findings were independent, given each disease. When the expert and I attempted to build a more accurate knowledge map for the domain that would capture the dependencies among the disease findings, we failed. Using a similarity network, however, we built the knowledge-map structure for the entire domain in approximately 40 hours. Furthermore, the partition representation reduced the number of probability assessments required by the expert from 75,000 to 14,000.

1.2MLOct 22, 2019

Embedded Bayesian Network Classifiers

David Heckerman, Chris Meek

Low-dimensional probability models for local distribution functions in a Bayesian network include decision trees, decision graphs, and causal independence models. We describe a new probability model for discrete Bayesian networks, which we call an embedded Bayesian network classifier or EBNC. The model for a node $Y$ given parents $\bf X$ is obtained from a (usually different) Bayesian network for $Y$ and $\bf X$ in which $\bf X$ need not be the parents of $Y$. We show that an EBNC is a special case of a softmax polynomial regression model. Also, we show how to identify a non-redundant set of parameters for an EBNC, and describe an asymptotic approximation for learning the structure of Bayesian networks that contain EBNCs. Unlike the decision tree, decision graph, and causal independence models, we are unaware of a semantic justification for the use of these models. Experiments are needed to determine whether the models presented in this paper are useful in practice.

12.7AIJan 2, 2018

Accounting for hidden common causes when inferring cause and effect from observational data

David Heckerman

Identifying causal relationships from observational data is difficult, in large part, due to the presence of hidden common causes. In some cases, where just the right patterns of conditional independence and dependence lie in the data (for example, Y-structures), it is possible to identify cause and effect. In other cases, the analyst deliberately makes an uncertain assumption that hidden common causes are absent, and infers putative causal relationships to be tested in a randomized trial. Here, we consider a third approach, where there are sufficient clues in the data such that hidden common causes can be inferred.

2.5AIOct 27, 2016

Dependence and Relevance: A probabilistic view

Dan Geiger, David Heckerman

We examine three probabilistic concepts related to the sentence "two variables have no bearing on each other". We explore the relationships between these three concepts and establish their relevance to the process of constructing similarity networks, a tool for acquiring probabilistic knowledge from human experts. We also establish a precise relationship between connectedness in Bayesian networks and relevance in probability.

3.0AIJul 27, 2014

Modular Belief Updates and Confusion about Measures of Certainty in Artificial Intelligence Research

Eric J. Horvitz, David Heckerman

Over the last decade, there has been growing interest in the use of measures of change in belief for reasoning with uncertainty in artificial intelligence research. An important characteristic of several methodologies that reason with changes in belief, or belief updates, is a property that we term modularity. We call updates that satisfy this property modular updates. Whereas probabilistic measures of belief update that satisfy the modularity property were first discovered in the nineteenth century, knowledge and discussion of these quantities remains obscure in artificial intelligence research. We define modular updates and discuss their inappropriate use in two influential expert systems.

3.2AIApr 13, 2013

Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence (1993)

David Heckerman, E. Mamdani

This is the Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence, which was held in Washington, DC, July 9-11, 1993.

23.5AIMar 27, 2013

Probabilistic Interpretations for MYCIN's Certainty Factors

David Heckerman

This paper examines the quantities used by MYCIN to reason with uncertainty, called certainty factors. It is shown that the original definition of certainty factors is inconsistent with the functions used in MYCIN to combine the quantities. This inconsistency is used to argue for a redefinition of certainty factors in terms of the intuitively appealing desiderata associated with the combining functions. It is shown that this redefinition accommodates an unlimited number of probabilistic interpretations. These interpretations are shown to be monotonic transformations of the likelihood ratio $p(E \mid H)/p(E \mid \neg H)$. The construction of these interpretations provides insight into the assumptions implicit in the certainty factor model. In particular, it is shown that if uncertainty is to be propagated through an inference network in accordance with the desiderata, evidence must be conditionally independent given the hypothesis and its negation and the inference network must have a tree structure. It is emphasized that assumptions implicit in the model are rarely true in practical applications. Methods for relaxing the assumptions are suggested.
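To make the link between certainty factors and likelihood ratios concrete, the sketch below uses one monotonic mapping, $CF = (\lambda - 1)/\lambda$ for $\lambda \ge 1$ and $CF = \lambda - 1$ for $\lambda < 1$, where $\lambda = p(E \mid H)/p(E \mid \neg H)$. The mapping, the function names, and the mixed-sign combination rule (taken here in the revised form commonly attributed to EMYCIN) are assumptions chosen for illustration; the abstract does not state them. Under this mapping, combining two certainty factors in parallel matches multiplying the two likelihood ratios.

```python
def cf_from_lr(lr):
    """Map a likelihood ratio (> 0) to a certainty factor in (-1, 1)."""
    return (lr - 1.0) / lr if lr >= 1.0 else lr - 1.0


def lr_from_cf(cf):
    """Inverse mapping; valid for -1 < cf < 1."""
    return 1.0 / (1.0 - cf) if cf >= 0.0 else 1.0 + cf


def combine_parallel(x, y):
    """Parallel combination of two certainty factors (assumed form)."""
    if x >= 0 and y >= 0:
        return x + y - x * y
    if x < 0 and y < 0:
        return x + y + x * y
    # Mixed signs: revised rule, undefined when both magnitudes equal 1.
    return (x + y) / (1.0 - min(abs(x), abs(y)))


if __name__ == "__main__":
    for l1, l2 in [(2.0, 3.0), (0.5, 0.25), (4.0, 0.5)]:
        combined = combine_parallel(cf_from_lr(l1), cf_from_lr(l2))
        print(l1, l2, combined, cf_from_lr(l1 * l2))
```

The two printed values on each line should agree up to floating-point error, which is the sense in which the combining function behaves like multiplication of likelihood ratios.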

5.6AIMar 27, 2013

A Backwards View for Assessment

Ross D. Shachter, David Heckerman

Much artificial intelligence research focuses on the problem of deducing the validity of unobservable propositions or hypotheses from observable evidence. Many of the knowledge representation techniques designed for this problem encode the relationship between evidence and hypothesis in a directed manner. Moreover, the direction in which evidence is stored is typically from evidence to hypothesis.

10.9AIMar 27, 2013

An Axiomatic Framework for Belief Updates

David Heckerman

In the 1940's, a physicist named Cox provided the first formal justification for the axioms of probability based on the subjective or Bayesian interpretation. He showed that if a measure of belief satisfies several fundamental properties, then the measure must be some monotonic transformation of a probability. In this paper, measures of change in belief or belief updates are examined. In the spirit of Cox, properties for a measure of change in belief are enumerated. It is shown that if a measure satisfies these properties, it must satisfy other restrictive conditions. For example, it is shown that belief updates in a probabilistic context must be equal to some monotonic transformation of a likelihood ratio. It is hoped that this formal explication of the belief update paradigm will facilitate critical discussion and useful extensions of the approach.

10.9AIMar 27, 2013

The Myth of Modularity in Rule-Based Systems

David Heckerman, Eric J. Horvitz

In this paper, we examine the concept of modularity, an often cited advantage of the rule-based representation methodology. We argue that the notion of modularity consists of two distinct concepts which we call syntactic modularity and semantic modularity. We argue that when reasoning under certainty, it is reasonable to regard the rule-based approach as both syntactically and semantically modular. However, we argue that in the case of plausible reasoning, rules are syntactically modular but are rarely semantically modular. To illustrate this point, we examine a particular approach for managing uncertainty in rule-based systems called the MYCIN certainty factor model. We formally define the concept of semantic modularity with respect to the certainty factor model and discuss logical consequences of the definition. We show that the assumption of semantic modularity imposes strong restrictions on rules in a knowledge base. We argue that such restrictions are rarely valid in practical applications. Finally, we suggest how the concept of semantic modularity can be relaxed in a manner that makes it appropriate for plausible reasoning.

3.2AIMar 27, 2013

The Role of Calculi in Uncertain Inference Systems

Michael P. Wellman, David Heckerman

Much of the controversy about methods for automated decision making has focused on specific calculi for combining beliefs or propagating uncertainty. We broaden the debate by (1) exploring the constellation of secondary tasks surrounding any primary decision problem, and (2) identifying knowledge engineering concerns that present additional representational tradeoffs. We argue on pragmatic grounds that the attempt to support all of these tasks within a single calculus is misguided. In the process, we note several uncertain reasoning objectives that conflict with the Bayesian ideal of complete specification of probabilities and utilities. In response, we advocate treating the uncertainty calculus as an object language for reasoning mechanisms that support the secondary tasks. Arguments against Bayesian decision theory are weakened when the calculus is relegated to this role. Architectures for uncertainty handling that take statements in the calculus as objects to be reasoned about offer the prospect of retaining normative status with respect to decision making while supporting the other tasks in uncertain reasoning.

9.4AIMar 27, 2013

A Perspective on Confidence and Its Use in Focusing Attention During Knowledge Acquisition

David Heckerman, Holly B. Jimison

We present a representation of partial confidence in belief and preference that is consistent with the tenets of decision-theory. The fundamental insight underlying the representation is that if a person is not completely confident in a probability or utility assessment, additional modeling of the assessment may improve decisions to which it is relevant. We show how a traditional decision-analytic approach can be used to balance the benefits of additional modeling with associated costs. The approach can be used during knowledge acquisition to focus the attention of a knowledge engineer or expert on parts of a decision model that deserve additional refinement.

10.9AIMar 27, 2013

An Empirical Comparison of Three Inference Methods

David Heckerman

In this paper, an empirical evaluation of three inference methods for uncertain reasoning is presented in the context of Pathfinder, a large expert system for the diagnosis of lymph-node pathology. The inference procedures evaluated are (1) Bayes' theorem, assuming evidence is conditionally independent given each hypothesis; (2) odds-likelihood updating, assuming evidence is conditionally independent given each hypothesis and given the negation of each hypothesis; and (3) an inference method related to the Dempster-Shafer theory of belief. Both expert-rating and decision-theoretic metrics are used to compare the diagnostic accuracy of the inference methods.
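The first two procedures can be sketched in a few lines. This is a minimal illustration under the stated independence assumptions, not the Pathfinder implementation; the function names and the toy numbers are hypothetical.

```python
from math import prod


def simple_bayes(priors, likelihoods):
    """Posterior over mutually exclusive hypotheses.

    priors[h] is p(h); likelihoods[h] lists p(e | h) for each observed
    piece of evidence e, assumed conditionally independent given h.
    """
    joint = {h: priors[h] * prod(likelihoods[h]) for h in priors}
    total = sum(joint.values())
    return {h: v / total for h, v in joint.items()}


def odds_likelihood(prior, likelihood_ratios):
    """Posterior of one hypothesis H from its prior and the ratios
    p(e | H) / p(e | not H), assuming independence given H and given not H."""
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Toy two-hypothesis case, where the two procedures coincide.
    priors = {"H": 0.2, "not_H": 0.8}
    likelihoods = {"H": [0.8, 0.3], "not_H": [0.2, 0.6]}
    print(simple_bayes(priors, likelihoods)["H"])
    print(odds_likelihood(0.2, [0.8 / 0.2, 0.3 / 0.6]))
```

With only a hypothesis and its negation the two procedures give the same posterior. They can differ once there are more than two hypotheses, because independence given the negation of each hypothesis is an additional assumption.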

28.5AIMar 27, 2013

A Tractable Inference Algorithm for Diagnosing Multiple Diseases

David Heckerman

We examine a probabilistic model for the diagnosis of multiple diseases. In the model, diseases and findings are represented as binary variables. Also, diseases are marginally independent, features are conditionally independent given disease instances, and diseases interact to produce findings via a noisy OR-gate. An algorithm for computing the posterior probability of each disease, given a set of observed findings, called quickscore, is presented. The time complexity of the algorithm is $O(n m^- 2^{m^+})$, where $n$ is the number of diseases, $m^+$ is the number of positive findings and $m^-$ is the number of negative findings. Although the time complexity of quickscore is exponential in the number of positive findings, the algorithm is useful in practice because the number of observed positive findings is usually far less than the number of diseases under consideration. Performance results for quickscore applied to a probabilistic version of Quick Medical Reference (QMR) are provided.
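The sketch below illustrates why such a model admits an algorithm that is exponential only in the number of positive findings. It expands the product over positive findings by inclusion-exclusion, so each term factorizes over diseases. It is an illustrative reconstruction under simplifying assumptions (no leak term, hypothetical names and numbers), not the paper's code, and it recomputes products rather than caching them, so it does not reach the stated complexity bound. A brute-force enumeration over all disease configurations is included as a cross-check.

```python
from itertools import combinations, product
from math import prod


def evidence(prior, q, pos, neg, clamp=None):
    """P(findings in pos present, findings in neg absent), optionally jointly
    with disease `clamp` being present. q[i][f] is the probability that
    disease i alone causes finding f (noisy OR, no leak)."""
    total = 0.0
    for r in range(len(pos) + 1):
        for subset in combinations(pos, r):
            absent = list(subset) + list(neg)
            term = 1.0
            for i, p in enumerate(prior):
                on = p * prod(1.0 - q[i][f] for f in absent)
                off = 0.0 if i == clamp else 1.0 - p
                term *= on + off
            total += (-1) ** r * term
    return total


def posterior(prior, q, pos, neg, k):
    """Posterior probability that disease k is present."""
    return evidence(prior, q, pos, neg, clamp=k) / evidence(prior, q, pos, neg)


def brute_force_posterior(prior, q, pos, neg, k):
    """Same quantity by summing over all 2**n disease configurations."""
    n = len(prior)
    num = den = 0.0
    for d in product([0, 1], repeat=n):
        weight = prod(prior[i] if d[i] else 1.0 - prior[i] for i in range(n))
        p_absent = {f: prod(1.0 - q[i][f] for i in range(n) if d[i])
                    for f in list(pos) + list(neg)}
        like = prod(1.0 - p_absent[f] for f in pos) * prod(p_absent[f] for f in neg)
        den += weight * like
        num += weight * like * d[k]
    return num / den


if __name__ == "__main__":
    prior = [0.1, 0.2, 0.05]
    q = [[0.8, 0.1, 0.3], [0.2, 0.7, 0.4], [0.5, 0.5, 0.9]]
    for k in range(3):
        print(posterior(prior, q, [0, 1], [2], k),
              brute_force_posterior(prior, q, [0, 1], [2], k))
```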

14.4AIMar 27, 2013

The Compilation of Decision Models

David Heckerman, John S. Breese, Eric J. Horvitz

We introduce and analyze the problem of the compilation of decision models from a decision-theoretic perspective. The techniques described allow us to evaluate various configurations of compiled knowledge given the nature of evidential relationships in a domain, the utilities associated with alternative actions, the costs of run-time delays, and the costs of memory. We describe procedures for selecting a subset of the total observations available to be incorporated into a compiled situation-action mapping, in the context of a binary decision with conditional independence of evidence. The methods allow us to incrementally select the best pieces of evidence to add to the set of compiled knowledge in an engineering setting. After presenting several approaches to compilation, we exercise one of the methods to provide insight into the relationship between the distribution over weights of evidence and the preferred degree of compilation.

12.2AIMar 27, 2013

Separable and transitive graphoids

Dan Geiger, David Heckerman

We examine three probabilistic formulations of the sentence "a and b are totally unrelated" with respect to a given set of variables U. First, two variables a and b are totally independent if they are independent given any value of any subset of the variables in U. Second, two variables are totally uncoupled if U can be partitioned into two marginally independent sets containing a and b respectively. Third, two variables are totally disconnected if the corresponding nodes are disconnected in every belief network representation. We explore the relationship between these three formulations of unrelatedness and explain their relevance to the process of acquiring probabilistic knowledge from human experts.

9.4AIMar 27, 2013

A Combination of Cutset Conditioning with Clique-Tree Propagation in the Pathfinder System

Jaap Suermondt, Gregory F. Cooper, David Heckerman

Cutset conditioning and clique-tree propagation are two popular methods for performing exact probabilistic inference in Bayesian belief networks. Cutset conditioning is based on decomposition of a subset of network nodes, whereas clique-tree propagation depends on aggregation of nodes. We describe a means to combine cutset conditioning and clique-tree propagation in an approach called aggregation after decomposition (AD). We discuss the application of the AD method in the Pathfinder system, a medical expert system that offers assistance with diagnosis in hematopathology.

12.2AIMar 27, 2013

Problem Formulation as the Reduction of a Decision Model

David Heckerman, Eric J. Horvitz

In this paper, we extend the QMRDT probabilistic model for the domain of internal medicine to include decisions about treatments. In addition, we describe how we can use the comprehensive decision model to construct a simpler decision model for a specific patient. In so doing, we transform the task of problem formulation to that of narrowing of a larger problem.

3.2AIMar 27, 2013

David Heckerman

A similarity network is a tool for constructing belief networks for the diagnosis of a single fault. In this paper, we examine modifications to the similarity-network representation that facilitate the construction of belief networks for the diagnosis of multiple coexisting faults.

20.5AIMar 20, 2013

An Approximate Nonmyopic Computation for Value of Information

David Heckerman, Eric J. Horvitz, Blackford Middleton

Value-of-information analyses provide a straightforward means for selecting the best next observation to make, and for determining whether it is better to gather additional information or to act immediately. Determining the next best test to perform, given a state of uncertainty about the world, requires a consideration of the value of making all possible sequences of observations. In practice, decision analysts and expert-system designers have avoided the intractability of exact computation of the value of information by relying on a myopic approximation. Myopic analyses are based on the assumption that only one additional test will be performed, even when there is an opportunity to make a large number of observations. We present a nonmyopic approximation for value of information that bypasses the traditional myopic analyses by exploiting the statistical properties of large samples.
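For contrast with the nonmyopic approximation, the myopic baseline described above can be sketched for a single binary test. This is a minimal illustration; the function name, the utilities, and the test characteristics are all hypothetical.

```python
def myopic_voi(p_h, utility, sensitivity, specificity):
    """Expected value of observing one binary test before acting.

    utility[a] = (u(a, H true), u(a, H false)) for each action a.
    """
    def best_eu(w_true, w_false):
        # Best expected utility given (possibly unnormalized) weights on H.
        return max(u_t * w_true + u_f * w_false for u_t, u_f in utility.values())

    act_now = best_eu(p_h, 1.0 - p_h)
    # Joint probabilities of each test outcome with H true / H false.
    pos = (p_h * sensitivity, (1.0 - p_h) * (1.0 - specificity))
    neg = (p_h * (1.0 - sensitivity), (1.0 - p_h) * specificity)
    # Weighting by the joint folds p(outcome) into the expectation.
    act_after_test = best_eu(*pos) + best_eu(*neg)
    return act_after_test - act_now


if __name__ == "__main__":
    utility = {"treat": (1.0, 0.5), "wait": (0.0, 1.0)}
    print(myopic_voi(0.3, utility, sensitivity=0.9, specificity=0.8))
```

The test is worth performing, in this myopic sense, when the returned value exceeds the cost of the test expressed on the same utility scale.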

21.0AIMar 20, 2013

Advances in Probabilistic Reasoning

Dan Geiger, David Heckerman

This paper discusses multiple Bayesian-network representation paradigms for encoding asymmetric independence assertions. We offer three contributions: (1) an inference mechanism that makes explicit use of asymmetric independence to speed up computations, (2) a simplified definition of similarity networks and extensions of their theory, and (3) a generalized representation scheme that encodes more types of asymmetric independence assertions than do similarity networks.

5.6AIMar 6, 2013

Inference Algorithms for Similarity Networks

Dan Geiger, David Heckerman

We examine two types of similarity networks each based on a distinct notion of relevance. For both types of similarity networks we present an efficient inference algorithm that works under the assumption that every event has a nonzero probability of occurrence. Another inference algorithm is developed for type 1 similarity networks that works under no restriction, albeit less efficiently.

21.0AIMar 6, 2013

Causal Independence for Knowledge Acquisition and Inference

David Heckerman

I introduce a temporal belief-network representation of causal independence that a knowledge engineer can use to elicit probabilistic models. Like the current, atemporal belief-network representation of causal independence, the new representation makes knowledge acquisition tractable. Unlike the atemporal representation, however, the temporal representation can simplify inference, and does not require the use of unobservable variables. The representation is less general than is the atemporal representation, but appears to be useful for many practical applications.

3.2AIMar 6, 2013

Diagnosis of Multiple Faults: A Sensitivity Analysis

David Heckerman, Michael Shwe

We compare the diagnostic accuracy of three diagnostic inference models: the simple Bayes model, the multimembership Bayes model, which is isomorphic to the parallel combination function in the certainty-factor model, and a model that incorporates the noisy OR-gate interaction. The comparison is done on 20 clinicopathological conference (CPC) cases from the American Journal of Medicine, challenging cases that describe actual patients, often with multiple disorders. We find that the distributions produced by the noisy OR model agree most closely with the gold-standard diagnoses, although substantial differences exist between the distributions and the diagnoses. In addition, we find that the multimembership Bayes model tends to significantly overestimate the posterior probabilities of diseases, whereas the simple Bayes model tends to significantly underestimate the posterior probabilities. Our results suggest that additional work to refine the noisy OR model for internal medicine will be worthwhile.

19.9AIFeb 27, 2013

A Decision-Based View of Causality

David Heckerman, Ross D. Shachter

Most traditional models of uncertainty have focused on the associational relationship among variables as captured by conditional dependence. In order to successfully manage intelligent systems for decision making, however, we must be able to predict the effects of actions. In this paper, we attempt to unite two branches of research that address such predictions: causal modeling and decision analysis. First, we provide a definition of causal dependence in decision-analytic terms, which we derive from consequences of causal dependence cited in the literature. Using this definition, we show how causal dependence can be represented within an influence diagram. In particular, we identify two inadequacies of an ordinary influence diagram as a representation for cause. We introduce a special class of influence diagrams, called causal influence diagrams, which corrects one of these problems, and identify situations where the other inadequacy can be eliminated. In addition, we describe the relationships between Howard Canonical Form and existing graphical representations of cause.

52.1AIFeb 27, 2013

Learning Bayesian Networks: The Combination of Knowledge and Statistical Data

David Heckerman, Dan Geiger, David Maxwell Chickering

We describe algorithms for learning Bayesian networks from a combination of user knowledge and statistical data. The algorithms have two components: a scoring metric and a search procedure. The scoring metric takes a network structure, statistical data, and a user's prior knowledge, and returns a score proportional to the posterior probability of the network structure given the data. The search procedure generates networks for evaluation by the scoring metric. Our contributions are threefold. First, we identify two important properties of metrics, which we call event equivalence and parameter modularity. These properties have been mostly ignored, but when combined, greatly simplify the encoding of a user's prior knowledge. In particular, a user can express her knowledge, for the most part, as a single prior Bayesian network for the domain. Second, we describe local search and annealing algorithms to be used in conjunction with scoring metrics. In the special case where each node has at most one parent, we show that heuristic search can be replaced with a polynomial algorithm to identify the networks with the highest score. Third, we describe a methodology for evaluating Bayesian-network learning algorithms. We apply this approach to a comparison of metrics and search procedures.
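A scoring metric of this kind can be sketched with a Bayesian Dirichlet score whose prior counts are spread uniformly over parent and child configurations (often called BDeu). The uniform prior, the equivalent sample size `ess`, the function name, and the toy data are assumptions made for illustration; the abstract instead derives priors from a user-supplied prior network. The example checks that two structures encoding the same independence assertions, X -> Y and Y -> X, receive the same score.

```python
from collections import Counter
from math import lgamma


def log_bdeu_family(data, child, parents, arity, ess=1.0):
    """Log marginal likelihood contribution of one node given its parents.

    data is a list of dicts mapping variable name to an integer state;
    arity maps each variable name to its number of states.
    """
    q = 1
    for p in parents:
        q *= arity[p]
    a_j = ess / q                    # prior count per parent configuration
    a_jk = a_j / arity[child]        # prior count per (configuration, state)
    n_j = Counter(tuple(row[p] for p in parents) for row in data)
    n_jk = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    # Unobserved configurations contribute zero, so only observed ones are summed.
    score = sum(lgamma(a_j) - lgamma(a_j + n) for n in n_j.values())
    score += sum(lgamma(a_jk + n) - lgamma(a_jk) for n in n_jk.values())
    return score


if __name__ == "__main__":
    rows = [(0, 0)] * 6 + [(0, 1)] * 2 + [(1, 0)] * 1 + [(1, 1)] * 5
    data = [{"X": x, "Y": y} for x, y in rows]
    arity = {"X": 2, "Y": 2}
    x_to_y = log_bdeu_family(data, "X", [], arity) + log_bdeu_family(data, "Y", ["X"], arity)
    y_to_x = log_bdeu_family(data, "Y", [], arity) + log_bdeu_family(data, "X", ["Y"], arity)
    print(x_to_y, y_to_x)
```

Because the score decomposes into one term per node family, a search procedure only needs to rescore the families that a local change touches.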

23.1AIFeb 27, 2013

A New Look at Causal Independence

David Heckerman, John S. Breese

Heckerman (1993) defined causal independence in terms of a set of temporal conditional independence statements. These statements formalized certain types of causal interaction where (1) the effect is independent of the order that causes are introduced and (2) the impact of a single cause on the effect does not depend on what other causes have previously been applied. In this paper, we introduce an equivalent atemporal characterization of causal independence based on a functional representation of the relationship between causes and the effect. In this representation, the interaction between causes and effect can be written as a nested decomposition of functions. Causal independence can be exploited by representing this decomposition in the belief network, resulting in representations that are more efficient for inference than general causal models. We present empirical results showing the benefits of a causal-independence representation for belief-network inference.

39.1AIFeb 27, 2013

Learning Gaussian Networks

Dan Geiger, David Heckerman

We describe algorithms for learning Bayesian networks from a combination of user knowledge and statistical data. The algorithms have two components: a scoring metric and a search procedure. The scoring metric takes a network structure, statistical data, and a user's prior knowledge, and returns a score proportional to the posterior probability of the network structure given the data. The search procedure generates networks for evaluation by the scoring metric. Previous work has concentrated on metrics for domains containing only discrete variables, under the assumption that data represents a multinomial sample. In this paper, we extend this work, developing scoring metrics for domains containing all continuous variables or a mixture of discrete and continuous variables, under the assumption that continuous data is sampled from a multivariate normal distribution. Our work extends traditional statistical approaches for identifying vanishing regression coefficients in that we identify two important assumptions, called event equivalence and parameter modularity, that when combined allow the construction of prior distributions for multivariate normal parameters from a single prior Bayesian network specified by a user.

26.6AIFeb 20, 2013

A Bayesian Approach to Learning Causal Networks

David Heckerman

Whereas acausal Bayesian networks represent probabilistic independence, causal Bayesian networks represent causal relationships. In this paper, we examine Bayesian methods for learning both types of networks. Bayesian methods for learning acausal networks are fairly well developed. These methods often employ assumptions to facilitate the construction of priors, including the assumptions of parameter independence, parameter modularity, and likelihood equivalence. We show that although these assumptions also can be appropriate for learning causal networks, we need additional assumptions in order to learn causal networks. We introduce two sufficient assumptions, called mechanism independence and component independence. We show that these new assumptions, when combined with parameter independence, parameter modularity, and likelihood equivalence, allow us to apply methods for learning acausal networks to learn causal networks.

23.1AIFeb 20, 2013

Learning Bayesian Networks: A Unification for Discrete and Gaussian Domains

David Heckerman, Dan Geiger

We examine Bayesian methods for learning Bayesian networks from a combination of prior knowledge and statistical data. In particular, we unify the approaches we presented at last year's conference for discrete and Gaussian domains. We derive a general Bayesian scoring metric, appropriate for both domains. We then use this metric in combination with well-known statistical facts about the Dirichlet and normal-Wishart distributions to derive our metrics for discrete and Gaussian domains.

10.9AIFeb 20, 2013

A Definition and Graphical Representation for Causality

David Heckerman, Ross D. Shachter

We present a precise definition of cause and effect in terms of a fundamental notion called unresponsiveness. Our definition is based on Savage's (1954) formulation of decision theory and departs from the traditional view of causation in that our causal assertions are made relative to a set of decisions. An important consequence of this departure is that we can reason about cause locally, not requiring a causal explanation for every dependency. Such local reasoning can be beneficial because it may not be necessary to determine whether a particular dependency is causal to make a decision. Also in this paper, we examine the graphical encoding of causal relationships. We show that influence diagrams in canonical form are an accurate and efficient representation of causal relationships. In addition, we establish a correspondence between canonical form and Pearl's causal theory.

12.2AIFeb 20, 2013

A Characterization of the Dirichlet Distribution with Application to Learning Bayesian Networks

Dan Geiger, David Heckerman

We provide a new characterization of the Dirichlet distribution. This characterization implies that under assumptions made by several previous authors for learning belief networks, a Dirichlet prior on the parameters is inevitable.

14.2LGFeb 13, 2013

Asymptotic Model Selection for Directed Networks with Hidden Variables

Dan Geiger, David Heckerman, Christopher Meek

We extend the Bayesian Information Criterion (BIC), an asymptotic approximation for the marginal likelihood, to Bayesian networks with hidden variables. This approximation can be used to select models given large samples of data. The standard BIC as well as our extension punishes the complexity of a model according to the dimension of its parameters. We argue that the dimension of a Bayesian network with hidden variables is the rank of the Jacobian matrix of the transformation between the parameters of the network and the parameters of the observable variables. We compute the dimensions of several networks including the naive Bayes model with a hidden root node.
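As a rough illustration of the standard BIC referred to above (not the hidden-variable extension itself), the score is the maximized log-likelihood minus a penalty of (d/2) log N, where d is the model dimension and N the sample size. The function name and the example numbers are hypothetical; for a network with hidden variables, the abstract argues that `dimension` should be the rank of the Jacobian it describes rather than a raw parameter count.

```python
import math

def bic_score(max_log_likelihood, dimension, n_samples):
    """Standard BIC approximation to the log marginal likelihood."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # Penalize model complexity by (d / 2) * log N.
    return max_log_likelihood - 0.5 * dimension * math.log(n_samples)

# Hypothetical comparison of two models fitted to the same 1000 cases.
simple = bic_score(-1050.0, dimension=5, n_samples=1000)
complex_ = bic_score(-1040.0, dimension=12, n_samples=1000)
preferred = "simple" if simple > complex_ else "complex"
```

In this made-up comparison the larger model fits slightly better, but its extra dimensions cost more than the gain in log-likelihood, so the smaller model receives the higher score.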

10.3LGFeb 13, 2013

Efficient Approximations for the Marginal Likelihood of Incomplete Data Given a Bayesian Network

David Maxwell Chickering, David Heckerman

We discuss Bayesian methods for learning Bayesian networks when data sets are incomplete. In particular, we examine asymptotic approximations for the marginal likelihood of incomplete data given a Bayesian network. We consider the Laplace approximation and the less accurate but more efficient BIC/MDL approximation. We also consider approximations proposed by Draper (1993) and Cheeseman and Stutz (1995). These approximations are as efficient as BIC/MDL, but their accuracy has not been studied in any depth. We compare the accuracy of these approximations under the assumption that the Laplace approximation is the most accurate. In experiments using synthetic data generated from discrete naive-Bayes models having a hidden root node, we find that the CS measure is the most accurate.

14.4AIFeb 13, 2013

Decision-Theoretic Troubleshooting: A Framework for Repair and Experiment

John S. Breese, David Heckerman

We develop and extend existing decision-theoretic methods for troubleshooting a nonfunctioning device. Traditionally, diagnosis with Bayesian networks has focused on belief updating, that is, determining the probabilities of various faults given current observations. In this paper, we extend this paradigm to include taking actions. In particular, we consider three classes of actions: (1) we can make observations regarding the behavior of a device and infer likely faults as in traditional diagnosis, (2) we can repair a component and then observe the behavior of the device to infer likely faults, and (3) we can change the configuration of the device, observe its new behavior, and infer the likelihood of faults. Analysis of the latter two classes of troubleshooting actions requires incorporating notions of persistence into the belief-network formalism used for probabilistic inference.

14.4AIFeb 6, 2013

Structure and Parameter Learning for Causal Independence and Causal Interaction Models

Christopher Meek, David Heckerman

This paper discusses causal independence models and a generalization of these models called causal interaction models. Causal interaction models are models that have independent mechanisms, where a mechanism can have several causes. In addition to introducing several particular types of causal interaction models, we show how to apply the Bayesian approach to learning causal interaction models, obtaining approximate posterior distributions for the models and MAP and ML estimates for the parameters. We illustrate the approach with a simulation study of learning model posteriors.

8.0LGFeb 6, 2013

Models and Selection Criteria for Regression and Classification

David Heckerman, Christopher Meek

When performing regression or classification, we are interested in the conditional probability distribution for an outcome or class variable Y given a set of explanatory or input variables X. We consider Bayesian models for this task. In particular, we examine a special class of models, which we call Bayesian regression/classification (BRC) models, that can be factored into independent conditional (y|x) and input (x) models. These models are convenient, because the conditional model (the portion of the full model that we care about) can be analyzed by itself. We examine the practice of transforming arbitrary Bayesian models to BRC models, and argue that this practice is often inappropriate because it ignores prior knowledge that may be important for learning. In addition, we examine Bayesian methods for learning models from data. We discuss two criteria for Bayesian model selection that are appropriate for regression/classification: one described by Spiegelhalter et al. (1993), and another by Buntine (1993). We contrast these two criteria using the prequential framework of Dawid (1984), and give sufficient conditions under which the criteria agree.

20.7LGFeb 6, 2013

A Bayesian Approach to Learning Bayesian Networks with Local Structure

David Maxwell Chickering, David Heckerman, Christopher Meek

Recently several researchers have investigated techniques for using data to learn Bayesian networks containing compact representations for the conditional probability distributions (CPDs) stored at each node. The majority of this work has concentrated on using decision-tree representations for the CPDs. In addition, researchers typically apply non-Bayesian (or asymptotically Bayesian) scoring functions such as MDL to evaluate the goodness-of-fit of networks to the data. In this paper we investigate a Bayesian approach to learning Bayesian networks that contain the more general decision-graph representations of the CPDs. First, we describe how to evaluate the posterior probability, that is, the Bayesian score, of such a network, given a database of observed cases. Second, we describe various search spaces that can be used, in conjunction with a scoring function and a search procedure, to identify one or more high-scoring networks. Finally, we present an experimental evaluation of the search spaces, using a greedy algorithm and a Bayesian scoring function.

13.2LGJan 30, 2013

Learning Mixtures of DAG Models

Bo Thiesson, Christopher Meek, David Maxwell Chickering et al.

We describe computationally efficient methods for learning mixtures in which each component is a directed acyclic graphical model (mixtures of DAGs or MDAGs). We argue that simple search-and-score algorithms are infeasible for a variety of problems, and introduce a feasible approach in which parameter and structure search is interleaved and expected data is treated as real data. Our approach can be viewed as a combination of (1) the Cheeseman-Stutz asymptotic approximation for model posterior probability and (2) the Expectation-Maximization algorithm. We evaluate our procedure for selecting among MDAGs on synthetic and real examples.

9.6LGJan 30, 2013

An Experimental Comparison of Several Clustering and Initialization Methods

Marina Meila, David Heckerman

We examine methods for clustering in high dimensions. In the first part of the paper, we perform an experimental comparison between three batch clustering algorithms: the Expectation-Maximization (EM) algorithm, a winner-take-all version of the EM algorithm reminiscent of the K-means algorithm, and model-based hierarchical agglomerative clustering. We learn naive-Bayes models with a hidden root node, using high-dimensional discrete-variable data sets (both real and synthetic). We find that the EM algorithm significantly outperforms the other methods, and proceed to investigate the effect of various initialization schemes on the final solution produced by the EM algorithm. The initializations that we consider are (1) parameters sampled from an uninformative prior, (2) random perturbations of the marginal distribution of the data, and (3) the output of hierarchical agglomerative clustering. Although the methods are substantially different, they lead to learned models that are strikingly similar in quality.
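The difference between standard EM and a winner-take-all variant lies in the E-step: standard EM assigns each case fractionally to every cluster, while the winner-take-all version assigns it wholly to the most probable cluster. The sketch below shows this for a single case under a naive-Bayes model with a hidden cluster variable and binary features. It is an assumed, minimal illustration rather than the paper's implementation, and all names are hypothetical.

```python
import math

def responsibilities(x, priors, theta, hard=False):
    """E-step for one binary case x under a naive-Bayes mixture.

    priors[k]   -- P(cluster k)
    theta[k][j] -- P(x_j = 1 | cluster k), assumed strictly between 0 and 1
    hard        -- if True, use the winner-take-all assignment
    """
    # Log joint probability of the case and each cluster.
    log_joint = []
    for pi_k, theta_k in zip(priors, theta):
        lp = math.log(pi_k)
        for x_j, t in zip(x, theta_k):
            lp += math.log(t if x_j else 1.0 - t)
        log_joint.append(lp)

    if hard:
        # Winner-take-all: all weight goes to the most probable cluster.
        best = max(range(len(log_joint)), key=log_joint.__getitem__)
        return [1.0 if k == best else 0.0 for k in range(len(log_joint))]

    # Standard EM: normalize in a numerically stable way.
    m = max(log_joint)
    weights = [math.exp(lp - m) for lp in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical two-cluster model over two binary features.
priors = [0.5, 0.5]
theta = [[0.9, 0.9], [0.1, 0.1]]
soft = responsibilities([1, 1], priors, theta)
hard = responsibilities([1, 1], priors, theta, hard=True)
```

The M-step would then re-estimate `priors` and `theta` from these weights in either variant; only the weights themselves differ.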

23.5AIJan 30, 2013

The Lumiere Project: Bayesian User Modeling for Inferring the Goals and Needs of Software Users

Eric J. Horvitz, John S. Breese, David Heckerman et al.

The Lumiere Project centers on harnessing probability and utility to provide assistance to computer software users. We review work on Bayesian user models that can be employed to infer a user's needs by considering a user's background, actions, and queries. Several problems were tackled in Lumiere research, including (1) the construction of Bayesian models for reasoning about the time-varying goals of computer users from their observed actions and queries, (2) gaining access to a stream of events from software applications, (3) developing a language for transforming system events into observational variables represented in Bayesian user models, (4) developing persistent profiles to capture changes in a user's expertise, and (5) the development of an overall architecture for an intelligent user interface. Lumiere prototypes served as the basis for the Office Assistant in the Microsoft Office '97 suite of productivity applications.