Introduction to Bayesian Networks & BayesiaLab
Artificial Intelligence for Research and Analytics

Stefan Conrady
Dr. Lionel Jouffe

DOI: 10.13140/2.1.4737.6965
Table of Contents

Introduction
1. The New Bayesian Network Paradigm in Context
    A Map of Analytic Modeling
    Quadrant 2: Predictive Modeling
    Quadrant 4: Explanatory Modeling
    Bayesian Networks: Theory and Data
    Bayesian Networks: Association and Causation
2. Bayesian Network Theory
    A Bayesian Network Example
    A Dynamic Bayesian Network Example
    Probabilistic Semantics
    Evidential Reasoning
    Causal Reasoning
    Learning Bayesian Networks
    Causal Discovery
3. Bayesian Networks and BayesiaLab in Practice
    BayesiaLab 5.3
    The BayesiaLab Workflow
    BayesiaLab in Context
    BayesiaLab’s Features and Functions in Practice
        Knowledge Modeling (Quadrant 4)
        Knowledge Discovery with Machine Learning (Quadrant 3)
        Reasoning Under Uncertainty
        Discrete, Nonlinear and Nonparametric Modeling
        Unsupervised Structural Learning (Quadrant 3)
        Supervised Learning (Quadrant 2)
        Clustering (Quadrant 2/3)
        Observational Inference (Quadrant 1/2)
        Causal Inference (Quadrant 3/4)
        Missing Values Processing
        Diagnosis, Prediction and Simulation
        Effects Analysis (Quadrants 3/4)
        Analyzing Observational Studies
        Optimization (Quadrant 4)
    Summary
References
Contact Information
    Bayesia USA
    Bayesia Singapore Pte. Ltd.
    Bayesia S.A.S.
Copyright
Introduction

With Professor Judea Pearl receiving the prestigious 2011 A.M. Turing Award, Bayesian networks have presumably received more public recognition than ever before. Judea Pearl’s achievement of establishing Bayesian networks as a new paradigm is fittingly summarized by Stuart Russell:

“[Judea Pearl] is credited with the invention of Bayesian networks, a mathematical formalism for defining complex probability models, as well as the principal algorithms used for inference in these models. This work not only revolutionized the field of artificial intelligence but also became an important tool for many other branches of engineering and the natural sciences. He later created a mathematical framework for causal inference that has had significant impact in the social sciences.”

While their theoretical properties made Bayesian networks immediately attractive for academic research, especially with regard to the study of causality, it was the arrival of practically feasible machine learning algorithms that allowed Bayesian networks to grow beyond their origins in the field of computer science. Since the first release of the BayesiaLab software package in 2001, Bayesian networks have finally become accessible to a wide range of scientists and analysts for use in many other disciplines.

In this introductory paper, we present Bayesian networks (the paradigm) and BayesiaLab (the software tool) from the perspective of the applied researcher. In Chapter 1, we begin with the role of Bayesian networks in today’s world of analytics, juxtaposing them with traditional statistics and more recent innovations in data mining. Once we establish how Bayesian networks fit into the proverbial big picture, we present in Chapter 2 the mathematical formalism that underpins this paradigm. While employing Bayesian networks for research has become remarkably easy with BayesiaLab, we need to emphasize the importance of their theory. Only a deep understanding of this theory will allow researchers to fully appreciate the wide-ranging benefits of Bayesian networks. Finally, in Chapter 3, we provide an overview of the BayesiaLab software platform, which leverages the Bayesian network paradigm to a far greater extent than any other tool that has ever been available in this field. We show how the theoretical properties of Bayesian networks translate into an extremely powerful and universal research tool for many fields of study, ranging from bioinformatics to marketing science and beyond.
1. The New Bayesian Network Paradigm in Context

As we introduce Bayesian networks as a new paradigm, we will first present them in the context of what can perhaps be described as the field of “analytic modeling.” Such context is particularly important given the attention that Big Data and related technologies receive these days. Their dominance in terms of publicity perhaps drowns out many other important methods of scientific inquiry. Equally important is positioning Bayesian networks vis-à-vis traditional parametric statistical methods, which have supported a myriad of scientific advances in the 20th century and continue to serve as valid and valuable tools for researchers today.
A Map of Analytic Modeling

Following the ideas of Breiman (2001) and Shmueli (2010), we create a map of “analytic modeling” that is defined by two axes (Figure 1). The x-axis reflects the Modeling Purpose, ranging from Association/Correlation to Causation. Tags on the x-axis furthermore indicate a conceptual progression that includes description, prediction, explanation, simulation, and optimization. The y-axis shows the Model Source, or, more precisely, the source of the model specification: on one end, we have Theory as the source; on the other end, we have Data. Theory is furthermore tagged with Parametric, the prevalent modeling approach, and Human Intelligence, indicating the origin of Theory. On the opposite end of the y-axis, Data is associated with Machine Learning and Artificial Intelligence. It is also tagged with Algorithmic, to highlight the contrast with the mostly parametric modeling generated from theory.
Figure 1. Conceptual map of analytic modeling: the x-axis (Modeling Purpose) runs from Association/Correlation to Causation, via description, prediction, explanation, simulation, and optimization; the y-axis (Model Source) runs from Theory (Parametric, Human Intelligence) to Data (Algorithmic, Machine Learning, Artificial Intelligence), defining quadrants Q1–Q4.

Needless to say, this is a highly simplified view of the world, and readers can rightfully point out the limitations of this presentation.1 Despite this caveat, we will now use the proposed map and its coordinate system to position different modeling approaches.
Quadrant 2: Predictive Modeling

Many of today’s predictive modeling techniques are algorithmic and would fall mostly into Quadrant 2. In Quadrant 2, a researcher would be mostly interested in the predictive performance, i.e.

Y = f(X), where Y is the quantity of interest.
Neural networks are a typical example of machine learning techniques in this context. Such models are often devoid of any theory; however, they can be excellent “statistical devices” for producing predictions.1

1 For instance, one could easily expand this overview by adding a third dimension, perhaps including the type of parameter estimation. With such an additional axis, one could differentiate “frequentist” and “Bayesian” estimation methods.
Quadrant 4: Explanatory Modeling

In Quadrant 4, the researcher is interested in identifying a model structure that best reflects the underlying “true” data-generating process, i.e. we are looking for an explanatory model. Thus, the function f is of greater importance than Y:

Y = f(X), where f is the quantity of interest.
Traditional statistical techniques, which have an explanatory purpose and are used in epidemiology and the social sciences, would mostly belong in Quadrant 4. Regressions are the best-known models in this context. Extending further in the causal direction, we would progress into the field of operations research, including simulation and optimization.

Despite the diverging objectives of Predictive Modeling and Explanatory Modeling, i.e. predicting Y versus learning f, the respective methods are not necessarily incompatible. In Figure 1, this is suggested by the blue boxes that gradually fade out as they cross the boundaries and extend beyond their “home” quadrant. However, the best-performing modeling approaches rarely serve predictive and explanatory purposes equally well. In many situations, the optimal fit-for-purpose models remain very distinct from each other. In fact, Shmueli (2010) has shown that structurally “less true” models can yield better predictive performance than the “true” explanatory model.

We should also point out that recent advances in machine learning and data mining have mostly occurred in Quadrant 2, and have thus disproportionately benefitted predictive modeling. Unfortunately, most machine-learned models are also remarkably difficult to interpret in terms of their structural meaning, so new theories are rarely generated this way. For instance, the well-known Netflix Prize competition generated phenomenally performing predictive models, but they yielded little explanatory insight into consumers’ choices. Conversely, in Quadrant 4, purposefully machine-learning explanatory models remains rather difficult. As opposed to Quadrant 2, the availability of ever-increasing amounts of data is not even a major advantage in the discovery of theory through machine learning.
Bayesian Networks: Theory and Data

With regard to the observed division (horizontal in our map) between Theory and Data as a model source, Bayesian networks play a special role. Bayesian networks can be built from human knowledge, i.e. from theory, or they can be machine-learned from data. They can thus cover the entire spectrum in terms of their model source. Also, due to their graphical structure, machine-learned Bayesian networks are intuitively interpretable, thus facilitating human learning and theory building. As emphasized by the bidirectional arc in Figure 2, Bayesian networks allow human learning and machine learning to interact efficiently. In this way, Bayesian networks can be developed from a combination of human and artificial intelligence.
Figure 2. Bayesian networks spanning theory and data.
Bayesian Networks: Association and Causation

Beyond transcending the boundary between theory and data (the y-axis of our map), Bayesian networks have unique properties when it comes to causality. Under certain conditions, and with specific theory-driven assumptions, Bayesian networks can perform causal inference. Bayesian network models can thus cover the entire range from association to causation, spanning the entire x-axis of our analytics map.
Figure 3. Bayesian networks spanning data to theory, and association to causation (quadrants Q1–Q4, with Causal Assumptions linking Theory to Causation).

As a result, Bayesian networks are a highly versatile modeling framework, which makes them suitable for many problem domains. The mathematical formalism underpinning the Bayesian network paradigm is presented in the next chapter, Bayesian Network Theory.
2. Bayesian Network Theory2

2 For the technical portion of this introduction, we defer to the words of Judea Pearl, who originally coined the term “Bayesian network.” We are grateful to him for allowing us to use and adapt large sections from one of his technical reports for our purposes (Pearl and Russell, 2000).

Probabilistic models based on directed acyclic graphs have a long and rich tradition, beginning with the work of geneticist Sewall Wright in the 1920s. Variants have appeared in many fields. Within statistics, such models are known as directed graphical models; within cognitive science and artificial intelligence (AI), they are known as Bayesian networks. The name honors the Rev. Thomas Bayes (1702–1761), whose rule for updating probabilities in the light of new evidence is the foundation of the approach. Rev. Bayes addressed both the case of discrete probability distributions of data and the more complicated case of continuous probability distributions. In the discrete case, Bayes’ theorem relates the conditional and marginal probabilities of events A and B, provided that the probability of B does not equal zero:

P(A∣B) = P(B∣A) P(A) / P(B)
In Bayes’ theorem, each probability has a conventional name:

• P(A) is the prior probability (or “unconditional” or “marginal” probability) of A. It is “prior” in the sense that it does not take into account any information about B; however, the event B need not occur after event A. In the nineteenth century, the unconditional probability P(A) in Bayes’s rule was called the “antecedent” probability; in deductive logic, the antecedent set of propositions and the inference rule imply consequences. The unconditional probability P(A) was called “a priori” by Ronald A. Fisher.
• P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is derived from, or depends upon, the specified value of B.
• P(B|A) is the conditional probability of B given A. It is also called the likelihood.
• P(B) is the prior or marginal probability of B, and acts as a normalizing constant.

Bayes’ theorem in this form gives a mathematical representation of how the conditional probability of event A given B is related to the converse conditional probability of B given A.
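To make these terms concrete, the following minimal sketch applies Bayes’ theorem to a diagnostic-test scenario. The numbers are hypothetical, chosen only for illustration:

```python
# A worked example of Bayes' theorem with hypothetical numbers:
# A = "patient has the disease", B = "diagnostic test is positive".
p_A = 0.01              # prior P(A): 1% of the population has the disease
p_B_given_A = 0.95      # likelihood P(B|A): test sensitivity
p_B_given_not_A = 0.05  # false-positive rate P(B|~A)

# Marginal P(B) via the law of total probability (the normalizing constant).
p_B = p_B_given_A * p_A + p_B_given_not_A * (1 - p_A)

# Posterior P(A|B) by Bayes' theorem.
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))  # 0.161
```

Despite the sensitive test, the low prior keeps the posterior at roughly 16%, which is precisely the kind of belief updating that Bayesian networks perform across many variables at once.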
The initial development of Bayesian networks in the late 1970s was motivated by the need to model the top-down (semantic) and bottom-up (perceptual) combination of evidence in reading. The capability for bidirectional inferences, combined with a rigorous probabilistic foundation, led to the rapid emergence of Bayesian networks as the method of choice for uncertain reasoning in AI and expert systems, replacing earlier, ad hoc rule-based schemes.

The nodes in a Bayesian network represent variables of interest (e.g. the temperature of a device, the gender of a patient, a feature of an object, the occurrence of an event) and the links represent statistical (informational)3 or causal dependencies among the variables. The dependencies are quantified by conditional probabilities for each node given its parents in the network. The network supports the computation of the posterior probabilities of any subset of variables given evidence about any other subset.
A Bayesian Network Example

Figure 4 shows a very simple Bayesian network consisting of only two nodes and one link, representing the joint probability distribution of the variables Eye Color and Hair Color in a given population. In this case, the conditional probabilities of Hair Color, given the values of its parent Eye Color, are provided in a table. It is important to point out that this Bayesian network does not contain any causal assumptions, i.e. we have no knowledge of the causal order between the variables, so the interpretation here should be merely statistical (informational).
Figure 4. A Bayesian network representing the statistical relationship between two variables.

Figure 5 illustrates another simple yet typical Bayesian network. In contrast to the statistical relationship in Figure 4, the diagram in Figure 5 describes the causal relationships among the season of the year (X1), whether it is raining (X2), whether the sprinkler is on (X3), whether the pavement is wet (X4), and whether the pavement is slippery (X5). Here, the absence of a direct link between X1 and X5, for example, captures our understanding that there is no direct influence of season on slipperiness — the influence is mediated by the wetness of the pavement (if freezing were a possibility, a direct link could be added).
3 “Informational” and “statistical” are treated here as equivalent concepts and can be used interchangeably.
Figure 5. A Bayesian network representing causal influences among five variables.

Perhaps the most important aspect of Bayesian networks is that they are direct representations of the world, not of reasoning processes. The arrows in the diagram represent real causal connections and not the flow of information during reasoning (as in rule-based systems and neural networks). Reasoning processes can operate on Bayesian networks by propagating information in any direction. For example, if the sprinkler is on, then the pavement is probably wet (deduction, prediction, simulation); if someone slips on the pavement, that also provides evidence that it is wet (abduction, reasoning to a probable cause, or diagnosis). On the other hand, if we see that the pavement is wet, that makes it more likely that the sprinkler is on or that it is raining (abduction); but if we then observe that the sprinkler is on, that reduces the likelihood that it is raining (explaining away). It is this last form of reasoning, explaining away, that is especially difficult to model in rule-based systems and neural networks in any natural way, because it seems to require the propagation of information in two directions.
A Dynamic Bayesian Network Example

Entities that live in a changing environment must keep track of variables whose values change over time. Dynamic Bayesian networks capture this process by representing multiple copies of the state variables, one for each time step. A set of variables Xt denotes the world state at time t, and a set of variables Et denotes the observations available at time t. The model P(Et|Xt) is encoded in the conditional probability distributions for the observable variables, given the state variables. The transition model P(Xt+1|Xt) relates the state at time t to the state at time t+1. Keeping track of the world means computing the current probability distribution over world states given all past observations, i.e., P(Xt|E1,…,Et). Dynamic Bayesian networks are strictly more expressive than other temporal probability models such as hidden Markov models and Kalman filters.
Figure 6. Dynamic Bayesian network, showing two consecutive time steps, t and t+1.
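As an illustrative sketch of this filtering computation, the snippet below tracks a two-state world (say, rain vs. no rain) from a sequence of noisy observations. The transition and observation models are hypothetical, not from the text:

```python
import numpy as np

# Two hidden states (index 0 = rain, 1 = no rain); two observation values
# (index 0 = umbrella seen, 1 = no umbrella). All numbers are hypothetical.
T = np.array([[0.7, 0.3],    # transition model P(X_{t+1} | X_t); rows index X_t
              [0.3, 0.7]])
E = np.array([[0.9, 0.1],    # observation model P(E_t | X_t); rows index X_t
              [0.2, 0.8]])

belief = np.array([0.5, 0.5])       # prior P(X_0)
for obs in [0, 0, 1]:               # a sequence of observed evidence values
    belief = T.T @ belief           # predict:  P(X_{t+1} | e_1..e_t)
    belief = belief * E[:, obs]     # weight by likelihood P(e_{t+1} | X_{t+1})
    belief = belief / belief.sum()  # normalize: P(X_{t+1} | e_1..e_{t+1})
    print(belief)
```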
Probabilistic Semantics

Any complete probabilistic model of a domain must, either explicitly or implicitly, represent the joint probability distribution — the probability of every possible event as defined by the combination of the values of all the variables. There are exponentially many such events, yet Bayesian networks achieve compactness by factoring the joint distribution into local, conditional distributions for each variable given its parents. If xi denotes some value of the variable Xi and pai denotes some set of values for the parents of Xi, then P(xi|pai) denotes this conditional distribution. For example, P(x4|x2,x3) is the probability of wetness given the values of sprinkler and rain. The global semantics of Bayesian networks specifies that the full joint distribution is given by the product

P(x1, …, xn) = ∏i P(xi ∣ pai)    (1)
In our example network, we have

P(x1, x2, x3, x4, x5) = P(x1) P(x2∣x1) P(x3∣x1) P(x4∣x2, x3) P(x5∣x4)    (2)
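To make the factorization concrete, the following sketch evaluates Equation 2 for a single configuration of the network in Figure 5. The structure comes from the text, but the CPT values are hypothetical, and the season is reduced to two states for brevity:

```python
# Hypothetical CPTs for the sprinkler network of Figure 5; only the structure
# comes from the text, and the season is collapsed to {dry, rainy} for brevity.
P_season    = {"dry": 0.6, "rainy": 0.4}                # P(x1)
P_rain      = {"dry": 0.1, "rainy": 0.8}                # P(x2=True | x1)
P_sprinkler = {"dry": 0.5, "rainy": 0.1}                # P(x3=True | x1)
P_wet       = {(False, False): 0.0, (False, True): 0.9,
               (True, False): 0.9, (True, True): 0.99}  # P(x4=True | x2, x3)
P_slippery  = {False: 0.0, True: 0.8}                   # P(x5=True | x4)

def bern(p_true, value):
    """P(variable = value) for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(x1, x2, x3, x4, x5):
    """Equation 2: P(x1..x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x4)."""
    return (P_season[x1]
            * bern(P_rain[x1], x2)
            * bern(P_sprinkler[x1], x3)
            * bern(P_wet[(x2, x3)], x4)
            * bern(P_slippery[x4], x5))

# One cell of the joint distribution: dry season, no rain, sprinkler on,
# pavement wet and slippery.
print(joint("dry", False, True, True, True))  # 0.6*0.9*0.5*0.9*0.8 = 0.1944
```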
It becomes clear that the number of parameters, i.e. the number of conditional probability distributions, grows linearly with the size of the network, i.e. the number of variables; however, the size of each conditional probability distribution grows exponentially with the number of parents of its node. Further savings can be achieved using compact parametric representations — such as noisy-OR models, decision trees, or neural networks — for the conditional distributions.

There is also an entirely equivalent local semantics, which asserts that each variable is independent of its non-descendants in the network given its parents. For example, the parents of X4 in Figure 7 are X2 and X3, and they render X4 independent of the remaining non-descendant, X1. That is,
P(x4 ∣ x1, x2, x3) = P(x4 ∣ x2, x3).
Figure 7. Variable X4 is independent of its non-descendants (in this case, X1) given its parents, X2 and X3.

The collection of independence assertions formed in this way suffices to derive the global assertion in Equation 1, and vice versa. The local semantics is most useful in constructing Bayesian networks, because selecting as parents all the direct causes (or direct relationships) of a given variable invariably satisfies the local conditional independence conditions. The global semantics leads directly to a variety of algorithms for reasoning.
Evidential Reasoning

From the product specification in Equation 1, one can express the probability of any desired proposition in terms of the conditional probabilities specified in the network. For example, the probability that the sprinkler is on, given that the pavement is slippery, is
P(X3 = on ∣ X5 = true)
= P(X3 = on, X5 = true) / P(X5 = true)
= Σx1,x2,x4 P(x1, x2, X3 = on, x4, X5 = true) / Σx1,x2,x3,x4 P(x1, x2, x3, x4, X5 = true)
= Σx1,x2,x4 P(x1) P(x2∣x1) P(X3 = on∣x1) P(x4∣x2, X3 = on) P(X5 = true∣x4) / Σx1,x2,x3,x4 P(x1) P(x2∣x1) P(x3∣x1) P(x4∣x2, x3) P(X5 = true∣x4)
The first algorithms proposed for probabilistic calculations in Bayesian networks used a local, distributed message-passing architecture, typical of many cognitive activities. Initially this approach was limited to tree-structured networks, but it was later extended to general networks in Lauritzen and Spiegelhalter’s (1988) method of junction-tree propagation. A number of other exact methods have been developed and can be found in recent textbooks. It is easy to show that reasoning in Bayesian networks subsumes the satisfiability problem in propositional logic and hence is NP-hard. Monte Carlo simulation methods can be used for approximate inference (Pearl, 1988), giving gradually improving estimates as sampling proceeds. These methods use local message propagation on the original network structure, unlike junction-tree methods. Alternatively, variational methods provide bounds on the true probability.
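The derivation above can be reproduced directly as inference by enumeration. The sketch below computes P(X3 = on | X5 = true) for the network of Figure 5, again with hypothetical CPT values; production engines typically rely on junction-tree propagation rather than this brute-force sum:

```python
from itertools import product

# Hypothetical CPTs (the same toy numbers as in the factorization sketch).
P_season    = {"dry": 0.6, "rainy": 0.4}
P_rain      = {"dry": 0.1, "rainy": 0.8}
P_sprinkler = {"dry": 0.5, "rainy": 0.1}
P_wet       = {(False, False): 0.0, (False, True): 0.9,
               (True, False): 0.9, (True, True): 0.99}
P_slippery  = {False: 0.0, True: 0.8}

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(x1, x2, x3, x4, x5):
    return (P_season[x1] * bern(P_rain[x1], x2) * bern(P_sprinkler[x1], x3)
            * bern(P_wet[(x2, x3)], x4) * bern(P_slippery[x4], x5))

B = [False, True]
# Numerator: marginalize x1, x2, x4 with X3 = on and X5 = true held fixed.
num = sum(joint(x1, x2, True, x4, True) for x1, x2, x4 in product(P_season, B, B))
# Denominator: marginalize x1, x2, x3, x4 with X5 = true held fixed.
den = sum(joint(x1, x2, x3, x4, True)
          for x1, x2, x3, x4 in product(P_season, B, B, B))
print(num / den)  # P(X3 = on | X5 = true)
```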
Causal Reasoning

Most probabilistic models, including general Bayesian networks, describe a distribution over possible observed events — as in Equation 1 — but say nothing about what will happen if a certain intervention occurs. For example, what if I turn the sprinkler on, instead of just observing that it is turned on? What effect does that have on the season, or on the connection between wetness and slipperiness? A causal network, intuitively speaking, is a Bayesian network with the added property that the parents of each node are its direct causes — as in Figure 5. In such a network, the result of an intervention is obvious: the sprinkler node is set to X3 = on, and the causal link between the season X1 and the sprinkler X3 is removed (see Figure 8). All other causal links and conditional probabilities remain intact, so the new model is
P(x1, x2, x4, x5) = P(x1) P(x2∣x1) P(x4∣x2, X3 = on) P(x5∣x4).

Notice that this differs from observing that X3 = on, which would result in a new model that included the term P(X3 = on∣x1). This mirrors the difference between seeing and doing: after observing that the sprinkler is on, we wish to infer that the season is dry, that it probably did not rain, and so on; an arbitrary decision to turn the sprinkler on should not result in any such beliefs.
Figure 8. A causal network reflecting the intervention X3 = on.

Causal networks are more properly defined, then, as Bayesian networks in which the correct probability model after intervening to fix any node’s value is given simply by deleting links from the node’s parents. For example, Fire → Smoke is a causal network, whereas Smoke → Fire is not, even though both networks are equally capable of representing any joint distribution of the two variables. Causal networks model the environment as a collection of stable component mechanisms. These mechanisms may be reconfigured locally by interventions, with correspondingly local changes in the model. This, in turn, allows causal networks to be used very naturally for prediction by an agent that is considering various courses of action.
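The difference between seeing and doing is easy to demonstrate numerically. The sketch below uses only the season, rain, and sprinkler fragment of the network, with hypothetical probabilities: conditioning on X3 = on updates our belief about rain through the season, whereas the intervention do(X3 = on) severs the season-to-sprinkler link and leaves the probability of rain at its prior value:

```python
# Hypothetical CPTs for the fragment X1 -> X2 (rain) and X1 -> X3 (sprinkler).
P_season    = {"dry": 0.6, "rainy": 0.4}   # P(x1)
P_rain      = {"dry": 0.1, "rainy": 0.8}   # P(x2 = rain | x1)
P_sprinkler = {"dry": 0.5, "rainy": 0.1}   # P(x3 = on | x1)

# Seeing: condition on X3 = on, which updates the belief about the season
# and therefore about rain.
num = sum(P_season[x1] * P_rain[x1] * P_sprinkler[x1] for x1 in P_season)
den = sum(P_season[x1] * P_sprinkler[x1] for x1 in P_season)
print("P(rain | X3 = on observed):", num / den)           # ~0.182

# Doing: do(X3 = on) deletes the X1 -> X3 link; the season keeps its prior,
# so the probability of rain stays at its prior marginal.
print("P(rain | do(X3 = on)):",
      sum(P_season[x1] * P_rain[x1] for x1 in P_season))  # 0.38
```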
Learning Bayesian Networks

In machine learning approaches, the conditional probabilities P(xi|pai) are typically estimated with the maximum-likelihood approach, i.e. from the observed frequencies in the dataset. In purely Bayesian approaches, models are designed from expertise and include hyperparameter nodes. Data (usually scarce) is used as pieces of evidence set in the network for incrementally updating the distributions of the hyperparameters (Bayesian updating).

It is also possible to machine-learn the structure of a Bayesian network, and two families of methods are available for that purpose. The first, constraint-based algorithms, is based on the probabilistic semantics of Bayesian networks: links are added or deleted according to the results of statistical tests, which identify marginal and conditional independencies. The second, score-based algorithms, is based on a metric measuring the quality of candidate networks with respect to the observed data. This metric trades off network complexity against degree of fit to the data, typically expressed as the likelihood of the data given the network. As a substrate for learning, Bayesian networks have the advantage that it is relatively easy to encode prior knowledge in network form, e.g. by fixing portions of the structure or by defining forbidden arcs.
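As a minimal sketch of the maximum-likelihood approach, the conditional probability table for a single node can be estimated by simple frequency counts; the toy dataset below is hypothetical:

```python
from collections import Counter

# A hypothetical dataset of (season, rain) observations.
data = [("rainy", True), ("rainy", True), ("rainy", False),
        ("dry", False), ("dry", False), ("dry", True)]

pair_counts   = Counter(data)                          # N(season, rain)
parent_counts = Counter(season for season, _ in data)  # N(season)

# Maximum-likelihood CPT: P(rain | season) = N(season, rain) / N(season).
cpt = {(season, rain): n / parent_counts[season]
       for (season, rain), n in pair_counts.items()}
print(cpt[("rainy", True)])  # 2/3: observed frequency of rain in rainy seasons
```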
Causal Discovery

One of the most exciting prospects in recent years has been the possibility of using Bayesian networks to discover causal structures in raw statistical data — a task previously considered impossible without controlled experiments. Consider, for example, the following intransitive pattern of dependencies among three events: A and B are dependent, and B and C are dependent, yet A and C are independent. If you ask a person to supply an example of three such events, the example would invariably portray A and C as two independent causes and B as their common effect, namely, A → B ← C. (For instance, A and C could be the outcomes of two fair coins, and B represents a bell that rings whenever either coin comes up heads.)
Figure 9. Causal model for variables A, C, and B, representing two fair coins and a bell, respectively.

Fitting this dependence pattern with a scenario in which B is the cause and A and C are the effects is mathematically feasible but very unnatural, because it must entail fine tuning of the probabilities involved; the desired dependence pattern will be destroyed as soon as the probabilities undergo a slight change. Such thought experiments tell us that certain patterns of dependency, which are totally void of temporal information, are conceptually characteristic of certain causal directionalities and not others. When put together systematically, such patterns can be used to infer causal structures from raw data and to guarantee that any alternative structure compatible with the data must be less stable than the one(s) inferred; namely, slight fluctuations in parameters will render that structure incompatible with the data.
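This intransitive pattern is easy to verify by simulation. The sketch below samples the two hypothetical coins, rings the bell whenever either shows heads, and confirms that A and C are marginally independent yet become dependent once the bell’s state is given:

```python
import random

random.seed(0)
# A and C are two fair coins; B (the bell) rings if either shows heads.
samples = [(random.random() < 0.5, random.random() < 0.5) for _ in range(100_000)]
rang = [(a, c) for a, c in samples if a or c]  # cases where the bell rang

# Marginally, A and C are independent: P(A) equals P(A | C) (up to noise).
p_a = sum(a for a, _ in samples) / len(samples)
p_a_given_c = sum(a for a, c in samples if c) / sum(c for _, c in samples)
print(round(p_a, 3), round(p_a_given_c, 3))    # both ~0.5

# Conditioned on the bell, they become dependent: given that the bell rang,
# learning C = tails makes A = heads certain (the collider is unblocked).
p_a_given_b = sum(a for a, _ in rang) / len(rang)
p_a_given_b_not_c = (sum(a for a, c in rang if not c)
                     / sum(1 for _, c in rang if not c))
print(round(p_a_given_b, 3), round(p_a_given_b_not_c, 3))  # ~0.667 vs 1.0
```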
3. Bayesian Networks and BayesiaLab in Practice

While the conceptual advantages described in the previous chapter have been known in the world of academia for some time, leveraging these properties for practical research applications was virtually impossible for non-computer scientists prior to BayesiaLab’s first release in 2001.
BayesiaLab 5.3

BayesiaLab is a powerful desktop application (Windows/Mac/Unix) with a highly sophisticated graphical user interface, which provides scientists with a comprehensive “lab” environment for machine learning, knowledge modeling, diagnosis, analysis, simulation, and optimization. With BayesiaLab, Bayesian networks have become a powerful and practical tool for gaining a deep understanding of high-dimensional domains. BayesiaLab leverages the inherently graphical structure of Bayesian networks for exploring and explaining complex problems.
Figure 10. BayesiaLab 5.3 Professional Edition.

BayesiaLab is the result of nearly twenty years of software development by Dr. Lionel Jouffe and Dr. Paul Munteanu. In 2001, their research efforts led to the formation of Bayesia S.A.S., headquartered in Laval in northwestern France. Today, the company has grown to become the leading supplier of Bayesian network-related technologies, serving hundreds of major corporations around the world.
The BayesiaLab Workflow

BayesiaLab is designed around the Bayesian network paradigm, as illustrated in Figure 11. It covers the entire research workflow, from model generation to analysis, simulation, and optimization. This workflow is fully contained in a single “lab” environment, which gives analysts the ultimate flexibility in moving back and forth between different elements of the research task.
Figure 11. BayesiaLab workflow with Bayesian networks at its core.
BayesiaLab in Context

In Chapter 1, we presented — at a conceptual level — that Bayesian networks can cover the entire map of analytics. Figure 12 shows what this means in practice for the researcher. BayesiaLab’s functions, represented as blue boxes, are positioned across this map, demonstrating the universal applicability of the Bayesian network paradigm.
Figure 12. BayesiaLab functions positioned on the analytics map (“Implementing Bayesian Networks with BayesiaLab”). The functions shown include Supervised Learning, Unsupervised Structural Learning, Data Clustering, Variable Clustering, Parameter Learning, Probabilistic Structural Equation Models, Total & Direct Effects Analysis, Target Optimization, Bayesian Updating, Influence Diagrams, and Knowledge Modeling.
BayesiaLab’s Features and Functions in Practice

Chapter 1 presented a very conceptual context for Bayesian networks; Chapter 2 provided a theoretical rationale. In this final section of Chapter 3, we switch to a less formal narrative that connects researchers’ problem-solving needs to specific BayesiaLab functions. We link features and functions to their corresponding quadrant on the analytics map, where applicable.
Knowledge Modeling (Quadrant 4)

In today’s business environment, which strives to be data-driven, old-fashioned expert knowledge is sometimes perceived as merely qualitative or seen as “soft” knowledge. With billions of “hard” data points being accumulated every second, what cannot be counted may not count for much these days. A lifetime of experience in any particular domain may appear insignificant in comparison to the huge quantities of newly generated data. This mindset has a critical flaw: causal relationships remain difficult to machine-learn from data. Rather, causal reasoning typically requires some form of assumptions, i.e. assumptions coming from human knowledge.

Experts often express causal paths in the form of graphs. This visual representation of causes and effects has a direct analogue in the network graph in BayesiaLab’s graph panel. Nodes (representing variables) can be added and positioned with a mouse click, and arcs (representing relationships) can be “drawn” between nodes. The causal direction is simply encoded in the direction of the arc. The quantitative nature of the dependencies, plus many other attributes, can be managed in the Node Editor, which is available by right-clicking any node. BayesiaLab thus facilitates intuitively encoding one’s own understanding of a domain with a minimum of effort. Simultaneously, it enforces internal consistency, so that no impossible conditions are accidentally encoded.

In addition to allowing users to directly encode their explicit knowledge by drawing a network in the graph panel, the Bayesia Expert Knowledge Elicitation Environment (BEKEE) is available as an extension to BayesiaLab. It allows analysts to systematically elicit both the explicit and the tacit knowledge of a group of experts during brainstorming sessions.
Knowledge Discovery with Machine Learning (Quadrant 3)

Despite our emphasis on the relevance of human expert knowledge, especially for identifying causal relations, there is no doubt that there is a lot to learn from data, regardless of whether the data is sparse or “big.” BayesiaLab features a very comprehensive array of highly optimized learning algorithms that can quickly uncover so-far-unknown structures in datasets. This proves particularly powerful regardless of whether one has a handful of variables or thousands of variables with millions of potentially relevant relationships.
Reasoning Under Uncertainty

Based on a Bayesian network, BayesiaLab can reliably carry out inference with multiple pieces of uncertain and even conflicting evidence. The inherent ability of Bayesian networks to facilitate computations under uncertainty makes them highly suitable for a wide range of real-world applications. Reasoning under uncertainty applies in two ways:

• Diagnosis (inference from effect to cause)
• Simulation (inference from cause to effect)

Maintaining uncertainty during inference automatically prevents presenting potentially misleading point estimates.
Discrete, Nonlinear and Nonparametric Modeling

BayesiaLab processes all data on a discretized basis. As part of BayesiaLab’s Data Import Wizard, a number of methods are available to discretize any continuous variables. In BayesiaLab, all “parameters” describing probabilistic relationships between variables are contained in conditional probability tables (or cubes/hypercubes when two dimensions are exceeded), which means that no functional forms are utilized. Given this nonparametric, discrete approach, BayesiaLab can implicitly handle highly nonlinear relationships between variables. All the optimization criteria of BayesiaLab’s learning algorithms are based on information theory (e.g. the Minimum Description Length). With that, no assumptions of linearity are made at any point.
Unsupervised Structural Learning (Quadrant 3)

In statistics, unsupervised learning is typically understood to be a classification or clustering task. To make a very clear distinction, we put the emphasis on “structural” in Unsupervised Structural Learning, which covers a number of important algorithms in BayesiaLab. Unsupervised Structural Learning means that BayesiaLab can discover probabilistic relationships between a large number of variables, without the need to define inputs or outputs. One might say that this is the quintessential form of knowledge discovery, as no assumptions whatsoever are required to perform these algorithms on unknown datasets.4
4 However, the analyst can still use any available domain knowledge to define structural constraints.

Supervised Learning (Quadrant 2)

Supervised Learning in BayesiaLab has the same objective as many traditional modeling techniques, i.e. to develop a model for predicting a target variable. Some other data mining packages also offer “Bayesian networks” as an option in their array of available techniques. However, in most cases, these packages are restricted in their capabilities to a very limited type of network, i.e. the Naïve Bayesian network.
Within BayesiaLab, a vastly greater number of algorithms is available to search for a Bayesian network that best describes the target variable, while taking into account the complexity of the resulting network. The Markov Blanket algorithm should be highlighted here, as its speed is particularly helpful whenever dealing with a larger number of variables. In this context, the Markov Blanket also serves as an exceptionally powerful variable selection algorithm. Finally, structural coefficient analysis, cross-validation, and data perturbation functions are available for thoroughly bootstrapping, testing, and validating the robustness of candidate networks, helping the analyst to make a trade-off between precision and parsimony. These validation methods are applicable to both Supervised and Unsupervised Learning.
Clustering (Quadrant 2/3)

Clustering in BayesiaLab covers both data clustering (e.g. by observations) and variable clustering, which, as the name implies, allows the grouping of variables according to the strength of their mutual relationships.
A third variation of this concept is of particular importance in BayesiaLab: the semi-automatic Multiple Clustering workflow can be described as a kind of nonlinear, nonparametric and nonorthogonal factor analysis.
In practice, Multiple Clustering often serves as the basis for developing Probabilistic Structural Equation Models (Quadrant 3/4) with BayesiaLab.
Observational Inference (Quadrant 1/2)

One of the basic properties of Bayesian networks is that they are “omnidirectional observational inference engines.” Given an observation on any of the network’s nodes (or on a subset of nodes), one can compute the posterior probabilities of all other nodes in the network. Both exact and approximate observational inference algorithms are implemented in BayesiaLab.
Causal Inference (Quadrant 3/4)

Besides observational inference, BayesiaLab also offers causal inference for computing the impact of intervening on a subset of variables, instead of merely observing their states. Both Pearl’s Do-Operator and Jouffe’s Likelihood Matching are available for this purpose.
Missing Values Processing

BayesiaLab offers a range of sophisticated methods for missing values processing, from which the analyst can choose. During network learning, BayesiaLab performs missing values processing automatically “behind the scenes.” More specifically, the Structural Expectation-Maximization algorithm and the Dynamic Completion algorithm are automatically applied after each modification of the network during learning, i.e. after every single arc addition, suppression, and inversion.
Diagnosis, Prediction and Simulation

In the Bayesian network framework, diagnosis, prediction, and simulation are identical computations: they all consist of inference conditional upon evidence. The distinction only exists from the perspective of the researcher, who presumably sees the symptom of a disease as an effect and the disease itself as the cause. Hence, carrying out inference based on observed symptoms is interpreted as “diagnosis.”
BayesiaLab offers a considerable number of functions relating to inference. For instance, inference can be performed by setting evidence, i.e. by clicking on any one of the Monitors, and results are returned instantly for all the other Monitors.

Batch Inference is available when inference needs to be computed for a large number of records. For instance, this can be used for applying a predictive score to all customers in a database.

The Adaptive Questionnaire function provides guidance in terms of the optimum sequence for seeking evidence. With every piece of evidence set, BayesiaLab determines the next best piece of evidence to obtain for maximum information gain with respect to the target variable. In a medical context, this makes it possible to optimally “escalate” diagnostic procedures, from “low-cost & small-gain” evidence (e.g. measuring the patient’s blood pressure) to “high-cost & large-gain” evidence (e.g. performing an MRI scan).
Effects Analysis (Quadrants 3/4)

Many research activities focus on estimating the size of an effect, for instance establishing the treatment effect of a new drug or determining the sales impact of a new advertising campaign. Other studies are about attribution, i.e. they attempt to decompose observed effects into their causes and thus allocate contributions.

All of the above questions could be answered directly if the domain were fully understood, which is a priori never the case. However, if we are able to build an adequate model of the domain that captures all of its dynamics, BayesiaLab will be able to extract the effects. BayesiaLab employs simulation to derive effects, as parameters per se do not exist in this nonparametric framework. As all the dynamics of the domain are encoded in discrete conditional probability tables, effect sizes only manifest themselves when different conditions are simulated. Total Effects Analysis, Target Mean Analysis, and many more of BayesiaLab’s functions offer the analyst ways to study effects, especially nonlinear and interaction effects.
Analyzing Observational Studies

This simulation approach also offers special opportunities for evaluating observational studies. More specifically, it can help overcome the problem of systematic differences between treatment and control groups. BayesiaLab’s Likelihood Matching performs on-the-fly matching of pretreatment covariates as part of the Direct Effects Analysis, thus yielding the “exclusive” effect of a particular variable on the target, everything else being equal. This also eliminates the need for separately performing matching techniques, such as propensity score matching.
Optimization (Quadrant 4)

The ability to perform inference across all possible states of all nodes of the network also facilitates searching for optimum values. BayesiaLab’s Target Dynamic Profile and Target Optimization provide the toolsets for this purpose. Using these functions in combination with Direct Effects is of particular interest when searching for the optimum combination of variables that have a nonlinear relationship with the target (and correlations among themselves). A typical example would be searching for the optimum mix of an array of marketing instruments. BayesiaLab’s Target Optimization with Direct Effects will search, within the constraints set by the analyst, for those scenarios that optimize the target criterion.
Summary

BayesiaLab systematically implements (and expands upon) the theory of Bayesian networks, which was first introduced by Judea Pearl in the 1980s. In a remarkably short time, BayesiaLab has translated cutting-edge theoretical research into highly relevant and practical tools for applied scientists. BayesiaLab has opened up entirely new avenues for exploring, understanding, and explaining complex problem domains.
References

Barber, David. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.

Barnard, G. A., and T. Bayes. “Studies in the History of Probability and Statistics: IX. Thomas Bayes’s Essay Towards Solving a Problem in the Doctrine of Chances.” Biometrika 45, no. 3 (1958): 293–315.

Breiman, Leo. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16, no. 3 (2001): 199–231.

Darwiche, Adnan. “Bayesian Networks.” Communications of the ACM 53, no. 12 (December 2010): 80. doi:10.1145/1859204.1859227.

Darwiche, Adnan. Modeling and Reasoning with Bayesian Networks. 1st ed. Cambridge University Press, 2009.

Koller, Daphne, and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. 1st ed. The MIT Press, 2009.

Neapolitan, Richard E. Learning Bayesian Networks. Prentice Hall, 2003.

Pearl, Judea. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press, 2009.

Pearl, Judea, and Stuart Russell. Bayesian Networks. UCLA Cognitive Systems Laboratory, November 2000. http://bayes.cs.ucla.edu/csl_papers.html.

Russell, Stuart. “Judea Pearl - A.M. Turing Award Winner.” Accessed August 31, 2013. http://amturing.acm.org/award_winners/pearl_2658896.cfm.

Shmueli, Galit. “To Explain or to Predict?” Statistical Science 25, no. 3 (August 2010): 289–310. doi:10.1214/10-STS330.

Spirtes, Peter, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. 2nd ed. The MIT Press, 2001.
Contact Information

Bayesia USA
312 Hamlet’s End Way
Franklin, TN 37067
USA
Phone: +1 888-386-8383
[email protected]
www.bayesia.us
Bayesia Singapore Pte. Ltd.
28 Maxwell Road
#03-05, Red Dot Traffic
Singapore 069120
Phone: +65 3158 2690
[email protected]
www.bayesia.sg
Bayesia S.A.S.
6, rue Léonard de Vinci
BP 119
53001 Laval Cedex
France
Phone: +33(0)2 43 49 75 69
[email protected]
www.bayesia.com
Copyright

© 2014 Bayesia USA, Bayesia S.A.S., and Bayesia Singapore. All rights reserved.