Bachelor thesis

Computational limits of probabilistic programming languages

Study course: Informatik
Editing time: Nov 2013 - Mar 2014

Carsten Brandt (325418)
[email protected]
TU Berlin, Fakultät IV - DIMA

Adviser: Larysa Visengeriyeva
1st examiner: Professor Markl

14th March 2014
Abstract

This work describes the possibilities and limits of Probabilistic Programming, a new approach to statistical modelling that uses programming languages to describe arbitrarily complex statistical models. Probabilistic Programming Systems aim to introduce an abstraction layer over inference algorithms, making them easier to use by hiding the complexity of statistical inference from the user. This work gives a brief background on the statistical concepts used in Probabilistic Programming and describes the state of the art of inference algorithms that can be used to build inference engines capable of working on arbitrary statistical models. It also gives a short overview of the computability of probability distributions, which describes the upper limit of what is possible with the Probabilistic Programming approach.
Zusammenfassung

In this work I examine the possibilities and limits of Probabilistic Programming, a new approach to building statistical models with the help of programming languages. Through the expressive power of programming languages, this approach makes it possible to create arbitrarily complex statistical models. The goal of Probabilistic Programming Systems is to provide an abstraction layer over inference algorithms that makes them easier to use by hiding as much of their complexity as possible beneath this layer. This work gives a short overview of the statistical concepts that are necessary for using Probabilistic Programming Languages and describes the current state of research on inference algorithms. This knowledge can be used to build inference systems that can operate on arbitrary statistical models. I also give a short insight into the research on the computability of probability distributions, which attempts to describe the upper limit of what is possible with the Probabilistic Programming approach.
Statutory Declaration

I affirm in lieu of an oath that I produced this work independently and by my own hand.
Contents

1 Introduction
  1.1 Scientific background
  1.2 The goal of this work

2 Statistical Inference
  2.1 Bayesian probability
  2.2 Random variables and probability distributions
  2.3 Doing inference on a model

3 Probabilistic Programming
  3.1 Representations of uncertainty
  3.2 Graphical models
  3.3 Stochastic processes
  3.4 Probabilistic Programs

4 Inference Algorithms
  4.1 Exact inference
    4.1.1 Maximum likelihood estimation
    4.1.2 Linear regression
    4.1.3 MAP
    4.1.4 The expectation maximization algorithm
    4.1.5 Variable elimination
  4.2 Approximate Inference (Sampling algorithms)
    4.2.1 MCMC
    4.2.2 The Monte Carlo Principle
    4.2.3 Using Markov Chains
    4.2.4 The Metropolis Hastings Algorithm
    4.2.5 Choosing the proposal distribution
    4.2.6 The “Burn in” period
    4.2.7 The Gibbs sampler
  4.3 Variational inference
  4.4 Model validation

5 Computational limits
  5.1 Computability in general
    5.1.1 Computable random variables
    5.1.2 Transformations of random variables
  5.2 Limits for calculating conditional distributions
    5.2.1 Discrete random variables
    5.2.2 Continuous random variables
  5.3 Making non-computable distributions computable

6 Summary

7 References

Appendices

Appendix A Probability Distributions
  A.1 Binomial distribution
  A.2 Gaussian (normal) distribution
  A.3 Properties of distributions
    A.3.1 Mean (µ)
    A.3.2 Variance (σ²)
1 Introduction

1.1 Scientific background
In many research fields, researchers have to deal with the analysis of large amounts of data. This includes, but is not limited to, Artificial Intelligence, Machine Learning, Data Mining, and statistics in general [Mur12, Roy11]. Making sense of data can become a quite complex task when the amount of data is huge and the models become more and more complex. Inferring a statistical model and its parameters given some data always involves many steps and requires great effort to get done properly [DAR13]. It mainly involves three steps: (1) finding an appropriate model, (2) finding an inference algorithm to use with this model, and (3) running this algorithm to infer the parameters of the model given data.

The theory of Probabilistic Programming as described in [DAR13] aims to simplify this process by introducing an abstraction layer over the complexity of statistics and inference algorithms. With Probabilistic Programming, researchers only have to do the first step, which requires domain knowledge of their research field: creating the model. Finding an inference algorithm and doing inference on the model requires no domain knowledge; it can be handled with statistical and numerical knowledge only, so in an ideal case the rest would be done by the Probabilistic Programming System. This is a great advantage compared to how this work has been done in the past. In [Roy11, p. 21] Dan Roy goes even further by saying: “The probabilistic programming approach has the potential to revolutionize not only statistical modeling, but also statistical computation.”

Probabilistic Programming makes it easy to write inference and statistical programs without deep knowledge of how to create complex inference engines [DAR13]. Like in traditional programming, where we write for example C code instead of assembler, or SQL instead of a query execution plan, with probabilistic programming we get an abstraction layer over the complexity of inference algorithms. It combines the expressive power of programming languages, which can be used for writing complex software, with the concepts of probability theory, creating powerful tools for statistical inference. Given the statistical model and the data, the runtime engine should choose a suitable algorithm for us and do the inference out of the box. Probabilistic Programming finds its main usage in Artificial Intelligence, but may also be applied to many other fields where statistical computation and modelling is used [DAR13, Wam13].
1.2 The goal of this work
A programming language on such an abstraction level needs efficient inference algorithms in the background that are able to solve arbitrary, real-world statistical problems in manageable time. Inference is an umbrella term for many algorithms [Was04]. The main goal of this work is to analyse the state of the art of inference algorithms that could be used to provide an efficient inference engine. There is a great number of inference problems and a much greater number of algorithms described in the literature to solve them [DAR13]. Moreover, not every algorithm fits every problem, so finding the right algorithm for an inference task is already a big part of the solving process. The choice of the right algorithms and their optimization is the main task of a Probabilistic Programming engine, as “the real power of a probabilistic programming language lies in its compiler or runtime environment [...]. In addition to its usual duties, the compiler or runtime needs to figure out how to perform inference on the program.” [Cro13].

Given this context, I am going to analyze the state of the art of inference algorithms according to their use cases and flexibility. In chapter 2 I will first give a short introduction to the background of statistics, probability theory and statistical inference. After that I will give a definition of Probabilistic Programming in chapter 3 and describe different ways of representing uncertainty in the form of statistical models. Based on this background and definition I will dive into the different inference algorithms in chapter 4, picking out a few representatives of different kinds of algorithms and describing their general idea and how they work. While analyzing the algorithms I will also provide an overview of the current state of scientific research on Probabilistic Programming and inference algorithms in general.

Another point that will be examined in chapter 5 is the complexity and computational limits of probability distributions and algorithms, to see which problems are computable and which may not be. This information is crucial for building Probabilistic Programming Systems that need to be able to choose algorithms automatically. To find an algorithm suited for a given problem, the compiler has to be able to figure out how complex the problem is. Based on this information the best algorithm can be chosen to allow computation in manageable time.
2 Statistical Inference
Statistical inference in general is about learning facts from given data. Inference is done by creating statistical models, which in general are mathematical or otherwise formal descriptions of a real-world situation that use probability theory to represent uncertain facts [Was04]. Having created a statistical model with unknown parameters, we want to infer the parameters that describe the data best. This can never be done in a perfect way: all real data contains some uncertainty and randomness, so we have to use probability theory to figure out the most likely fitting parameters instead of the truly perfect ones, which are hidden in the real world. I assume that the reader is familiar with the basics of probability theory and statistics in general, so I will only give a short introduction limited to the concepts used in later chapters. For more information and detailed explanations please refer to statistics books like [Was04] or [Mur12].
2.1 Bayesian probability
To quantify uncertainty we use a concept called probability. We can introduce this concept intuitively as a percentage chance that can be applied to an event and quantifies the chance of this event occurring. Among others, there are two different interpretations of probability: the Bayesian and the frequentist one. Even though the frequentist interpretation, which is based on counting the frequency of occurring events and deriving probability from that, is the one widely used in statistics, the Bayesian interpretation is the one that is relevant for Probabilistic Programming [Was04, p. 206] [DP13, Ch. 1]. Bayesian probability has the advantage that it assigns probability not as a general property or state of the world but as an individual's belief that an event will occur. This way it is possible to update a probability based on new observations or when the world or our knowledge changes [DP13, Ch. 1].

When making an experiment with an uncertain result, we define the sample space Ω as the set of all possible outcomes. For a statistical model this can be all the values that the model's variables can take [Was04]. For example, if we throw a die there are six possible outcomes:

Ω = {1, 2, 3, 4, 5, 6}

A single outcome is ω ∈ Ω and the probability of it occurring is denoted as P(ω), which can take values between 0 and 1. Values close to 0 represent very unlikely outcomes of the experiment, and outcomes with values close to 1 are very likely to occur.
The sum of the probabilities of all elements in Ω must be equal to 1.

We can define events as subsets of Ω. Let for example A ⊂ Ω be the event “the die shows 6” and B ⊂ Ω the event “the number shown is greater than 3”. We call P(A) = 1/6 the prior probability: the probability of A occurring without further knowledge available. We intuitively assume it to be 1/6 because we know a die has six sides, all of which occur with the same probability. Because Bayesian probability allows us to update our belief, we also have P(A|B), which we call the posterior probability of A given B [Was04, p. 31]. Knowing that B has occurred allows us to update our belief of A occurring: only three sides remain possible, and six is one of them. So we update our belief and assign the probability P(A|B) = 1/3.
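The dice example can be reproduced by enumerating the sample space directly. The following is a minimal sketch in Python; the event definitions simply mirror the example above.

from fractions import Fraction

omega = set(range(1, 7))                  # sample space of one die
A = {w for w in omega if w == 6}          # event "die shows 6"
B = {w for w in omega if w > 3}           # event "number greater than 3"

prior = Fraction(len(A), len(omega))      # P(A) = 1/6
posterior = Fraction(len(A & B), len(B))  # P(A|B) = 1/3
print(prior, posterior)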
2.2 Random variables and probability distributions
Having defined probability we can now introduce random variables, which are the key element of building statistical models. A random variable is a function that assigns a real number to each outcome ω ∈ Ω (cp. [Was04, p. 29]):

X : Ω → 𝒳 ⊂ R

As an example we can say that X is the sum of pips in an experiment of throwing two dice. In this example X can take discrete values between 2 and 12 and is therefore called a discrete random variable. We again use the function P to assign a probability to each value that X can take. We denote by P(X = x) the probability that the random variable X is realized to the value x ∈ 𝒳 = {2, 3, . . . , 12}. For discrete random variables we also call this the probability mass function, which assigns a value between 0 and 1 to each possible value of the random variable [Mur12, p. 28]. It is also possible to have random variables that take real numbers as their values. These can be used for modelling any variable representing continuous values. We call these continuous random variables and P the probability density function (PDF) or probability measure. As P assigns a probability to each assignment of a random variable and the sum of all probabilities has to be 1, we say that P induces a distribution (probability measure) of X over 𝒳. There are also two other notations: the function behind P(X = x) is denoted as f_X(x), and the function that describes P(X ≤ x) is denoted as F_X(x). F_X is called the cumulative distribution function (CDF) [Was04, p. 8].
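As a minimal sketch, the probability mass function of the two-dice example can be tabulated by enumerating all outcomes:

from collections import Counter
from fractions import Fraction

# X maps each outcome (d1, d2) of throwing two dice to the sum of pips.
outcomes = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]
counts = Counter(d1 + d2 for d1, d2 in outcomes)

# The probability mass function assigns P(X = x) to each x in 2..12.
pmf = {x: Fraction(n, len(outcomes)) for x, n in sorted(counts.items())}
print(pmf[7])  # P(X = 7) = 1/6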
There is an important difference between the distributions of discrete random variables and continuous random variables. For continuous variables it is not useful to ask for P(X = x), because the probability of X being equal to one exact number out of R is infinitesimally small, i.e. equal to zero. For continuous random variables we therefore ask for the probability of X lying in an interval between a and b, P(a ≤ X ≤ b), which can be calculated using the CDF [Was04]:

P(a ≤ X ≤ b) = F_X(b) − F_X(a)

For distribution functions we often use common existing functions that have certain properties, instead of inventing our own distribution function for each of the variables. For example, there is the normal distribution, which assigns equal probability to values right and left of a mean value. The probability of a value depends on its distance from this mean, so the farther the value is from the mean, the less probability is assigned to it. For X being normally distributed we write:

X ∼ N(µ, σ²)

This reads as “X is normally distributed with parameters µ and σ²”. µ and σ² refer to the expectation (the mean value) and the variance of the function respectively [Was04]. I have composed a list of the commonly used distribution functions that I use in this work, which you can find in appendix A.
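A minimal sketch of the interval probability via the CDF, here assuming X ∼ N(0, 1) as the example distribution:

from scipy.stats import norm

# P(a <= X <= b) = F_X(b) - F_X(a) for a continuous random variable.
a, b = -1.0, 1.0
prob = norm.cdf(b, loc=0, scale=1) - norm.cdf(a, loc=0, scale=1)
print(prob)  # ~0.6827: one standard deviation around the mean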
2.3 Doing inference on a model
While probability theory and general statistics are about studying the properties of data generated by some process, statistical inference is about what we can say about the process that generated the data [Was04]. We can define a model that describes a statistical process generating data in the real world, using statistical, i.e. mathematical, formalisms. The model is constructed incorporating as much domain knowledge as possible, defining random variables, the parameters of their distributions, and dependencies. This process should not depend on the inference method that will later be used for doing inference on the model [BEPW96, p. 9]. We can use this model in two ways: (1) to generate data samples from it (see figure 1, “simulation”), and (2) to take existing data and infer the best matching parameters of the model given that data (see figure 1, “inference/query”).
Figure 1: Using a statistical model. The “forward direction” (deduction, from general to specific) is to simulate data with the model and given parameters. The “backward direction” (induction, from specific to general) is to infer the model's parameters given observed data. Cp. [Was04, p. 14, fig. 1.1].

As a simple example of a statistical model we can take a model that has only one variable. This variable has a distribution for the values it can take; in this case let us assume it is normally distributed:

X ∼ N(µ, σ²)

This model has two parameters, which are the parameters of the distribution of its single variable. The parameters describe how our model behaves and what kind of data it will output. We can use this model for simulation and run it with different parameters to get some data (see figure 2). This direction does not require any special methodology: we just run the model, calculating the values using the formula of the normal distribution (see appendix A), and save or plot the result. The backward direction, which is to infer the model's parameters given some data, is much harder and requires sophisticated techniques depending on the model's complexity. Even for a simple model like the one above it takes some effort to infer the parameters µ and σ². The real values of µ and σ² are hidden in the real world, so we can only guess them. What we expect from statistical inference is not the real values but value distributions for these parameters, so we can see how likely particular values are. When doing inference we always have to check how good our estimation of the parameters is. We may also want to check how well our model fits the data and whether we have to adjust it for a better result [BEPW96, p. 31ff.].
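The forward direction is straightforward to sketch in Python; the parameter values below are arbitrary examples:

import numpy as np

# Forward direction (simulation): draw 50 data points from the
# one-variable model X ~ N(mu, sigma^2) for different parameter choices,
# as in figure 2.
rng = np.random.default_rng(42)
for mu, sigma in [(0, 1), (5, 0.5), (-3, 2)]:   # assumed example values
    data = rng.normal(mu, sigma, size=50)
    print(mu, sigma, data.mean(), data.std())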
Figure 2: Data generated by a model with one normally distributed variable, run three times with different parameters µ and σ². Each run generated 50 data points. This illustrates the “forward direction” (simulation) according to figure 1 and shows the influence of different parameters on the arrangement of the generated data points.

As an example to show how inference works, we now create an even simpler model, because inference on the model from above would be too much to explain in a simple way. This somewhat illustrates the difference in complexity between the two directions. Imagine we are given a data set for some variables X and Y (see figure 3a) and we have an idea of how Y may be related to X. We choose a simple linear regression model (Model 1) and a quadratic linear regression model (Model 2) (cp. [Was04, pp. 242ff.]) for doing inference:

Model 1: Y_i = a + bX_i + ε_i
Model 2: Y_i = a + bX_i² + ε_i

with expectation E(ε_i | X_i) = 0 and variance Var(ε_i | X_i) = σ².

We now choose an inference method for inferring the parameters of our models. An appropriate method for both models is linear regression using the ordinary least squares method (OLS), which is also described in chapter 4.1.2 [Was04, p. 244].
Figure 3: Doing inference on observed data points (X, Y) with two different models using the ordinary least squares method. (a) Data plot X vs. Y. (b) Linear model. (c) Quadratic model.

The ordinary least squares method minimizes the error terms (residuals), which are the differences between the observed data values Y_i and the estimated values of the model Ŷ_i. The error term ε̂_i is calculated for all data points as ε̂_i = Y_i − Ŷ_i. Putting a hat on a variable (ˆ) is a common notation to indicate that it is an estimated value. Ordinary least squares regression works by minimizing the residual sum of squares (RSS), which is defined as

RSS = Σ_{i=1}^{n} ε̂_i²
Running the regression for each model in a statistics tool will give us values for the parameters a and b, which allow us to draw the model's expectation E(Y). See figure 3b for Model 1 and figure 3c for Model 2. For each model we are also given a value for σ², which allows us to simulate data again. Doing this we should get data similar to what we observed, depending on how good our model is. The quality of a model can be assessed by many measures that we can calculate from parameters and data. Popular methods for this are hypothesis testing and p-values [Was04, p. 179], the R² and adjusted R² values [BEPW96], the Bayesian and Akaike information criteria [Cha80, pp. 232, 233] and many others [Mur12, p. 22], which all allow determining the model's quality with respect to different criteria and thus allow comparing different models. By using these methods it is possible to find the best model for given data.
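The whole procedure fits into a few lines of Python. This is a minimal sketch with synthetic data; the true parameters and the noise level are assumptions made for the example:

import numpy as np

# Generate data from an assumed quadratic law, then fit Model 1 (X) and
# Model 2 (X^2) by ordinary least squares and compare the RSS.
rng = np.random.default_rng(7)
x = rng.uniform(0, 5, 100)
y = 1.0 + 2.0 * x**2 + rng.normal(0, 2, 100)

for design in (x, x**2):                          # Model 1 uses X, Model 2 uses X^2
    A = np.column_stack([np.ones_like(design), design])
    (a, b), rss, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(f"a={a:.2f} b={b:.2f} RSS={rss[0]:.1f}")  # Model 2 yields the lower RSS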
3 Probabilistic Programming
In a statistical model with n parameters, the space of possible values for all the parameters is n-dimensional. Finding an algorithm to compute these models and find the parameters is a complex job, and a good understanding of numerical programming and statistics is needed to manage this work. Thus inference on models was something that could only be done by the few people who have this knowledge, and even for them it is very time consuming [DAR13]. Probabilistic Programming addresses this problem by radically simplifying the inference and algorithm part. With Probabilistic Programming one only needs to define the model, and in an ideal case the inference engine will figure out the right algorithm to use and do inference on the Probabilistic Program. This at least is the final goal of Probabilistic Programming, which is not yet reached, but the current state of the art of Probabilistic Programming Languages already comes close to it [Gor13].

Inference, which in the domain of Probabilistic Programming is also referred to as queries [Goo13], is the main task a Probabilistic Programming System has to accomplish: given the statistical model and some data, it tries to find the parameters for the model that describe the data best. In an optimal case the model will then be able to generate the same data we gave it to do inference on [Was04, p. 106ff.].

A Probabilistic Programming Language is a programming language that has random variables and statistical distributions as first-class citizens and thus lets us define probabilistic models in an efficient manner [Goo13, Cro13]. A Probabilistic Programming System provides a Probabilistic Programming Language for defining statistical models and a set of inference algorithms for doing inference on the model [Gor13, Cro13]. Traditional programming languages are made for deterministic problem solving; Probabilistic Programs are made for dealing with uncertainty. We could see Probabilistic Programming as a marriage of

1. probability theory, with its mathematical formalisms for representing uncertainty and incorporating new evidence – for modelling,

2. statistics – for doing inference, and

3. programming languages – making both first-class citizens of a programming language [Roy11, Cro13].

A Probabilistic Program describes a statistical model and thus, as described in chapter 2, can be run forwards and backwards. It can be used to generate data and also to infer parameters from observed data.
The backward direction is what is new in this idea. Running the forward direction is already possible with traditional programming languages by using a random source and applying formulas to it; most modern programming languages have a random function already implemented [Rei03, p. 219]. For running the backward direction, more sophisticated algorithms have to be invented that use the program in a different way [Cro13, Goo13].
3.1 Representations of uncertainty
The first thing needed to tackle uncertain information with computational algorithms is a representation of uncertainty in general. This representation has to be general enough to be applied to various and arbitrarily complex use cases [Roy11]. As we have already seen, a statistical model provides such a representation. In this section I will give a formal definition of a statistical model and point out several problems with different ways of representing uncertainty in statistical models.

A statistical model in general is a set of distributions or densities F [Was04, p. 105]. There are two types of models that differ significantly in how they are used and how inference can be performed on them. A parametric model is a set F that can be parametrized by a finite number of parameters. A non-parametric model is a set F that cannot be parametrized by a finite number of parameters, i.e. the number of parameters may become infinite [Was04, p. 106]. Non-parametric models are more intuitive as they do not make many assumptions about the data, which is more natural since we normally do not know much about our data before we analyse it. However, for non-parametric models the number of parameters increases with the amount of data, so for large amounts of data they likely become unmanageable [Mur12, p. 16]. Parametric models have the general form

F = {f(x; θ) : θ ∈ Θ}

where Θ ⊂ R^k is the parameter space and θ = (θ_1, . . . , θ_k) is the model's parameter vector [Was04, p. 145]. The function f in this case is a probability mass function or a probability density function (cp. chapter 2.2). Using parametric models may not be intuitive at first sight, because we cannot be sure how the underlying data generating process really behaves. However, it allows the incorporation of domain knowledge when building a model, allows better interpretation, and reduces the problem of inference to finding the parameter θ [Was04, p. 145].
The main problem we face when using this representation is the infinitely large space of possible outcomes even in simple models. To define a probabilistic model, a probability has to be assigned to every possible outcome (realisation) of the model's parameters. Even for n binary variables there are 2^n such configurations, ruling out an exhaustive tabular representation for all but the smallest models [Roy11, p. 17]. This problem can be solved by adding structure that helps to reduce the complexity. In the following I will show different ways of representing models in a structured way.
3.2 Graphical models
We can apply structure to our model by making assumptions about the internal relations and relevance of parameters and contained random variables. Without any assumptions about the model we treat all variables as dependent on each other. By introducing graphical models we make assumptions about the independence of variables in the model. We also say that a variable is not relevant for another if they do not influence each other [Pea88, p. 14]. These assumptions lead to a massive simplification of the model's structure, which can now be represented as a graph in which the edges represent direct influence between variables. A directed graphical model, also known as a Bayesian Network, can be composed given a finitely or countably indexed collection of random variables {X_v}_{v∈V}. It is a directed acyclic graph (V, E) where the vertices V represent the random variables and the edges E ⊂ V × V represent the conditional independence structure [Pea88, Roy11]. It is also possible to create undirected graphical models, in which two variables connected by an edge influence each other without a specified direction of influence. This representation is useful when modelling, for example, pixels in an image: we assume a pixel influences its neighbors but is also influenced by its neighbors itself. Undirected graphical models are also known as pairwise Markov graphs [Was04, p. 300] or Markov random fields [Mur12, p. 661].

The graph representation enables the factorization of complex models into simple ones, using knowledge about combinatorial analysis and graph algorithms to reduce the complexity of a given problem. It simplifies the computation of the distribution of the model [Roy11, p. 17]. The graph structure is also crucial for developing efficient inference. The local relevance information of a graphical model helps to exclude parameters from the calculation that are not of interest for the current query [Pea88].
Figure 4: Visualization of the independence structure of a directed graphical model with 5 variables {X_A, X_B, X_C, X_D, X_E}.

As an example of a directed graphical model we can take one with five random variables X_A, X_B, . . . , X_E:

F = {f_{X_v}(x_v; θ) : v ∈ {A, B, C, D, E}, θ ∈ Θ}

Figure 4 shows a visualization of the independence structure. The structure describes that f_{X_A} can be computed without knowledge of all other distributions, f_{X_B} and f_{X_C} each depend on the value of f_{X_A}, f_{X_D} depends on the values of f_{X_B} and f_{X_C}, and f_{X_E} needs f_{X_C} available to be computed. I now give the following three definitions of sets of vertices of the graph that will later be used in chapter 4.1.5 to explain the inference algorithm “variable elimination”, which can be used to do inference on graphical models:

• pa(A) are the parent nodes of A, i.e. the variables that influence X_A.

• ch(A) are the child nodes of A, i.e. those that are influenced by X_A.

• spo(A) are the parents of the children of A that are not children of A. The set also does not include A itself. Looking at the graph in figure 4, spo(C) is {B}. If B depended on C, spo(C) would be empty.

Having a graph representation alone is not enough, as this could still be an infinitely large set of vertices and edges; hence the representation of the graph model is important for handling complex graph structures. A simple list of all vertices and edges, or a drawing of nodes connected with lines, is obviously not suitable in this case. The introduction of different notations can enable the modelling of infinite graphs and other infinite structures. There are several notations that allow expressing infinite structures in a finite way, for example the Plate Notation [Bun94], which allows the replication of sub-graphs for building large graph
structures. Another example is Object-Oriented Bayesian Networks as described by Friedman, Koller and Pfeffer [KP97, FKP98], who introduce a notation based on the object-orientation pattern that allows describing Bayesian Networks as relations between classes. In this representation similar structures can share common definitions by using inheritance. Markov Logic Networks can also be used to describe graphical models, using a finite set of first-order logic constraints that describe the relations between graph nodes in a general way [Mur12, p. 674].
3.3 Stochastic processes
Another possible representation of uncertain information is a stochastic process. A stochastic process {X_t : t ∈ T} is a collection of random variables where T is called the index space and the X_t are random variables. The index space can either be discrete or continuous [Was04, p. 473]. A simple example of a stochastic process is an independent identically distributed (i.i.d.) set of random variables, known as a white noise process [Cha80]. More interesting are stochastic processes that have serial dependence, also known as autocorrelation, which means that the value or the distribution of X_t depends on some values of {X_{t−k} : k > 0}. Stochastic processes allow modelling changing behavior, for example time series [Cha80, p. 33] and other serially dependent behavior [Was04]. They can also be used for representing recursion in probabilistic programs [Roy11, p. 20]. Furthermore, a great advantage of stochastic processes is that they allow expressing an infinite number of random variables in a compact notation [Was04].
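A minimal sketch of a serially dependent process is an autoregressive time series; the coefficient below is an arbitrary example value:

import numpy as np

# AR(1) process: X_t = phi * X_{t-1} + eps_t with white noise eps_t,
# so each value depends on the previous one (serial dependence).
rng = np.random.default_rng(4)
phi, T = 0.9, 200
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.normal()
print(x[:5])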
3.4 Probabilistic Programs
While graphical models and stochastic processes represent a static dependency structure between the involved random variables, they cannot be applied to the wide variety of dynamic processes that need to be modelled in domains such as Machine Learning. Applications in such fields require a probabilistic model to support recursion and dynamically added dependencies [Roy11, p. 20]. Probabilistic Programs provide such a representation by allowing arbitrary models to be written for a given problem, using the formalism of probability theory and the power of programming languages to express complex things in a well structured way. Most of them are based on deterministic programming languages that provide concepts like loops, decision branches (if ... then, else) and recursion, and thus allow defining models in a dynamic way [Goo13].
This also allows the definition of infinite-dimensional, recursively-defined stochastic processes [Roy11]. With a Probabilistic Program it is possible to represent arbitrary models, using the runtime call graph as the conditional independence structure. From the runtime call graph, which basically shows which function or operation has been called after another, it is possible to derive a fine-grained conditional independence structure that can be used to implement various inference algorithms for Probabilistic Programming [Roy11, p. 20]. Being based on deterministic programming languages, Probabilistic Programming Languages benefit from the concepts and paradigms used for building complex software today when building complex probabilistic models [Roy11]. A sample probabilistic program is shown in listing 1. The call graph of the program will show a different dependency structure for the variable observation when the value of tau changes. This is what makes probabilistic programs more expressive than graphical models [DP13].
import pymc as pm

lambda1 = pm.Exponential("lambda1", 10)
lambda2 = pm.Exponential("lambda2", 20)
tau = pm.DiscreteUniform("tau", lower=0, upper=1)

@pm.deterministic
def lambda_(tau=tau, lambda1=lambda1, lambda2=lambda2):
    # "lambda" is a reserved word in Python, hence the trailing underscore
    if tau > 1:
        return lambda1
    else:
        return lambda2

observation = pm.Poisson("obs", lambda_)

model = pm.Model([observation, lambda1, lambda2, tau])

Code listing 1: An example Probabilistic Program written in Python using the PyMC Probabilistic Programming library. First it defines three random variables lambda1, lambda2 and tau with different distributions. Then another variable is defined using the function lambda_, which is a deterministic combination of these three variables. The distribution of observation is different depending on the value of tau. This example is inspired by the motivation example in chapter 1 of [DP13].
PyMC: https://github.com/pymc-devs/pymc
PyMC documentation: http://pymc-devs.github.io/pymc/
4 Inference Algorithms
There are many different ways of doing statistical inference, and thus a huge number of algorithms in this field that use different techniques to achieve their goal. These algorithms can be divided into two groups: algorithms for parametric models, i.e. parametric inference, and non-parametric inference. In this chapter I am going to focus on inference algorithms for parameter estimation. There are also methods for comparing models and checking the quality of a model [Was04, p. 108], but I will not cover them here as they do not apply to what is used in probabilistic programming. In general, parametric inference can be divided into three categories: exact inference, approximate inference and variational inference.
4.1 Exact inference
As exact inference we describe algorithms that use exact calculation to figure out the parameters of a model. These algorithms are based on deterministic functions and will generate the same result for each calculation when the same data and model are used.

4.1.1 Maximum likelihood estimation
The basic idea of the maximum likelihood method is to find the parameters that are the most likely ones for a model given observed data. This means the method finds the parameters that maximize the likelihood function for an i.i.d. set of random variables X_1, . . . , X_n with distribution f(x; θ), which is given by [Was04, p. 149]:

L_n(θ) = ∏_{i=1}^{n} f(X_i; θ)
The maximum likelihood estimator (MLE), denoted by θ̂, is then calculated as follows (cp. [Was04, Mur12]):

θ̂ ≜ arg max_θ log L_n(θ)
Calculating the maximum of a function can be done by taking its derivative and setting it to zero. With θ̂ we now have the estimated parameter for the distribution f. In other words, the likelihood that the model generated the given data is maximised for the parameters estimated by the maximum likelihood method.
An important property of the MLE regarding computation is that we can work with log L_n(θ) instead of L_n(θ), which is often easier to compute. Also, multiplying L_n(θ) by any positive constant c will not change the MLE [Mur12]. It can also be shown that the MLE is consistent, equivariant, asymptotically normal and asymptotically optimal (efficient) for simple models, which makes it the estimator of choice in these cases [Was04, pp. 153ff.]. While the MLE gives an estimation of the parameters θ̂, we can use a method called “the Delta Method” to get the estimated distribution: given τ = g(θ) where g is a smooth function, the Delta Method allows us to calculate the distribution of τ̂ = g(θ̂). See [Was04, p. 160] for more details.
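A minimal numerical sketch of the method: estimate the parameters of a normal model from assumed data by minimizing the negative log likelihood.

import numpy as np
from scipy.optimize import minimize

# Hypothetical observations; the true parameters (3.0, 2.0) are assumed.
rng = np.random.default_rng(8)
data = rng.normal(3.0, 2.0, size=500)

def neg_log_likelihood(theta):
    # -log L_n(theta) for the normal model with theta = (mu, sigma).
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (data - mu)**2 / (2 * sigma**2))

theta_hat = minimize(neg_log_likelihood, x0=[0.0, 1.0],
                     method="Nelder-Mead").x
print(theta_hat)  # close to the true (3.0, 2.0)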
4.1.2 Linear regression
A famous method that uses the maximum likelihood principle is linear regression with the ordinary least squares method, which we have already seen in the example in chapter 2.3. The model used for simple linear regression is of the following form:

Y_i = β_0 + β_1 X_i + ε_i

The expectation is E(ε_i | X_i) = 0 and the variance is Var(ε_i | X_i) = σ² [Was04, p. 245]. This model can be extended to multiple variables and also to non-linear cases. The extension to non-linear cases is done by transforming the data of the model, for example running the linear regression using the substitution Z_i = X_i². The most common transformations are polynomial and logarithmic [Was04, pp. 269ff.]. The method minimizes the sum of squares of the error terms ε_i, calculating the parameters of the model that minimize the prediction error and are thus the most likely ones [Was04, pp. 245ff.]. For the linear model we assume a Gaussian distribution for ε_i; however, this leads to a bad fit of the model when outliers are observed. The model can be improved in this case by replacing the Gaussian distribution with one that assigns more probability to outliers. This approach is called robust linear regression [Mur12, p. 223].
4.1.3 MAP
The maximum a posteriori algorithm (MAP) is the Bayesian variant of the maximum likelihood estimator [Mur12]. The idea of this method is to calculate the maximum of p(θ|D), which is the distribution of θ given all we know about the model, i.e. the data D:

θ̂ ≜ arg max_θ p(θ|D)
This can be calculated by computing a point estimate of p(θ|D), i.e. the posterior mean, median or mode [Mur12, p. 149]. MAP is a good basis for understanding the idea of other algorithms, such as the expectation maximization algorithm, which is also based on the Bayesian concept of posterior and prior probability of the parameters. For an application of MAP inference to undirected graphical models you may refer to [KS08].
4.1.4 The expectation maximization algorithm
When having observed data, the most appropriate methods for doing exact inference are MLE or the MAP algorithm. However, these algorithms have problems when dealing with incomplete data and more complex models, which is often the case. It is possible to work with modified versions of ML and MAP in these cases, but if we also have other constraints to apply to our model's parameters, for example that sums of weight parameters have to be 1, we need a more flexible algorithm that can handle this situation. The expectation maximization algorithm (EM) introduces a solution to this problem [Mur12, p. 348]. The basic idea is to maximize the log likelihood of θ given the data, similar to MAP or ML, but allowing the data to be incomplete. EM uses a formula for this that includes the observed data x_i but also hidden values z_i that are not observable. This function is called the “complete data log likelihood”:

l_c(θ) = Σ_{i=1}^{N} log p(x_i, z_i | θ)
This formula cannot be computed because we do not have the values of the z_i. Instead we compute the expected complete data log likelihood to approximate it [Mur12, p. 350].
The expected complete data log likelihood is given as:

Q(θ, θ^{t−1}) = E[l_c(θ) | D, θ^{t−1}]

where t is the current iteration of the algorithm and D is the data observed up to step t. Q is called the auxiliary function. The EM algorithm is an iterative algorithm that works in two steps: the E-step is to compute Q(θ, θ^{t−1}), and the M-step is then to optimize the parameters [Mur12, p. 350].

The E-step: In the E-step, Q(θ, θ^{t−1}) is calculated with respect to the old parameters θ^{t−1} and the newly observed data D.

The M-step: In the M-step the Q-function is optimized with respect to θ:

θ^t ≜ arg max_θ Q(θ, θ^{t−1})
The above formula describes the traditional M-step of the EM algorithm. For computing the MAP estimate we can adjust the M-step to optimize θ in a slightly different way:

θ^t ≜ arg max_θ Q(θ, θ^{t−1}) + log p(θ)
The EM algorithm monotonically increases the log likelihood of the observed data: after a new step the likelihood will never be lower than before, and the algorithm converges [Mur12, p. 350].
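As a minimal sketch of the two steps, the following fits a two-component Gaussian mixture to assumed one-dimensional data; the responsibilities in the E-step realize the expectation over the hidden assignments z_i:

import numpy as np

# Hypothetical data drawn from two overlapping Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 200)])

# Initial guesses for theta = (weights, means, variances).
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for step in range(100):
    # E-step: responsibilities p(z_i = k | x_i, theta^(t-1)).
    resp = w * normal_pdf(x[:, None], mu, var)      # shape (N, 2)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: maximize the expected complete data log likelihood.
    nk = resp.sum(axis=0)
    w = nk / len(x)                                 # weights sum to 1
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(w, mu, var)  # close to the assumed mixture parameters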
4.1.5 Variable elimination
The variable elimination algorithm can be used to do exact inference on Bayesian networks (graphical models) [Coz00]. In a more general form this algorithm is also known as bucket elimination [Dec98]. Its implementation focus lies on simplicity, and it aims to be understandable without much knowledge of graph manipulation algorithms [Dec98]. The goal is to calculate the posterior marginal distribution of a set of query variables X_q given observations E. E is a set of variables and their realizations to specific values. The algorithm then calculates p(X_q|E). The query X_q given observations E may only involve a subset of all variables X. If the density p(X_i | pa(X_i)) is necessary for answering the query, we call X_i a requisite variable. The set of requisite variables is denoted as X_R [Coz00].
We can now write down the formula needed for calculating the posterior distribution [Coz00]:

p(X_q | E) = Σ_{X_R \ {X_q, X_E}} ( ∏_{X_i ∈ X_R} p(X_i | pa(X_i)) )

Let now N be the number of variables that are not observed and are also not in X_q. We can write them down in an ordered set: {X_1, X_2, . . . , X_N}. [Coz00] shows that for each variable from this set the above density can be transformed into the following form (example transformation for X_1):

p(ch(X_1) | pa(X_1), spo(X_1)) = Σ_{X_1} ∏_{X_j ∈ {X_1} ∪ ch(X_1)} p(X_j | pa(X_j))
These densities for all variables in the ordered set are put into a pool that is then used for running the algorithm. The first step is the elimination step: we take all densities from the pool that contain X_1, calculate a new density p(ch(X_1) | pa(X_1), spo(X_1)), which may be unnormalized, and add this density to the pool. X_1 is now “eliminated” from the pool. We repeat this step for each variable in the ordered set. When done with the elimination steps we have obtained at least one density in the pool for X_q. If there is more than one, we need to multiply them and normalize the result. This way we get p(X_q|E). The order of the variables in {X_1, X_2, . . . , X_N} is not important for the algorithm to work, but it is relevant for computational efficiency [Coz00]. With variable elimination it is also possible to calculate the maximum a posteriori estimate by replacing the summations with maximizations [Dec98].
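A minimal sketch on a toy chain A → B → C with binary variables shows the elimination of one variable by summing it out; all factor values are assumptions made for the example:

import numpy as np

pA = np.array([0.6, 0.4])                  # p(A)
pB_A = np.array([[0.7, 0.3], [0.2, 0.8]])  # p(B|A), rows indexed by A
pC_B = np.array([[0.9, 0.1], [0.4, 0.6]])  # p(C|B), rows indexed by B

# Query p(C | A=0): restrict to the evidence, multiply the densities
# containing B, then eliminate B by summing it out.
factor = pB_A[0, :, None] * pC_B           # shape (B, C)
pC = factor.sum(axis=0)                    # B is now "eliminated"
pC /= pC.sum()                             # normalize the result
print(pC)                                  # posterior p(C | A=0)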
4.2 Approximate Inference (Sampling algorithms)
Approximate inference, or simulation methods, are especially useful in Bayesian inference. These are algorithms that calculate the posterior distribution by drawing samples from the model. In contrast to exact inference, the results will differ between multiple runs [Was04, p. 505].

4.2.1 MCMC
MCMC, or “Markov Chain Monte Carlo”, is a family of different algorithms. All these algorithms have in common that they use the “Monte Carlo” principle in combination with Markov Chains to optimize the result by spending most of the time in relevant areas and avoiding the uninteresting parts. MCMC is among the top 10 most important algorithms of the 20th century [Mur12], as it is used not only in statistical research but also in many other fields like mechanics, physics and computer graphics [ADFDJ03]. It is not limited to sampling probability distributions from statistical models, but can also be used for integrating or maximizing complex functions that are not tractable with conventional algorithms [Was04]. Compared to variational inference, MCMC is often slower for smaller problems, so it is the algorithm of choice for more complex models and large amounts of data [Mur12]. Advantages of MCMC compared to other algorithms are:

• Sampling with MCMC is easier to implement [Mur12, p. 837]. An example for this is the implementation in PyMC, which currently fits into 186 lines of Python code (https://github.com/pymc-devs/pymc/blob/master/pymc/step_methods/metropolis.py).

• It fits more models, e.g. models that vary in structure depending on variable values [Mur12, p. 837]. This holds for most of the Turing-complete Probabilistic Programming Languages as described in chapter 3.4.

• It can be faster with huge data sets or complex models [Mur12, p. 837].

• MCMC is a general purpose inference algorithm suitable for nearly every kind of model [DP13].

• It is possible to compute several chains in parallel, which is a good advantage for multiprocessor systems and can produce results faster [ADFDJ03, p. 17].
All MCMC algorithms work by creating samples from a target distribution and approximating the distribution using these samples. With a very high number of iterations the sampled probability distribution converges to the target distribution. By not trying to deterministically analyze a model's structure, MCMC is very flexible regarding the models it can do inference on [DP13, Mur12].

4.2.2 The Monte Carlo Principle
The main idea of all MCMC algorithms is to explore the space of the distribution, trying to stay inside the most interesting parts, i.e. the places with the most probability in a distribution, and avoiding spending time in regions of no interest: “The advantage of Monte Carlo integration over deterministic integration arises from the fact that the former positions the integration grid (samples) in regions of high probability.” – [ADFDJ03]

The idea of the Monte Carlo Principle is to draw an i.i.d. set of N samples {x^(i)}_{i=1}^N from a target density p(x), which is defined on a high-dimensional space X [ADFDJ03]. In a probabilistic programming context, X can be the combination of all possible values that all defined random variables of a model can take. p(x) is the distribution over X; given a specific x, it returns the probability of this exact combination of parameters occurring. The target density p(x) can be approximated by summing up the probability located at all the different positions in X. Mathematically this is denoted by δ_{x^(i)}(x), which represents the so-called delta-Dirac mass located at x^(i). Summing up all these point probabilities gives the approximated distribution

p_N(x) = (1/N) Σ_{i=1}^{N} δ_{x^(i)}(x)
It is also possible to obtain the maximum of p(x), that is, the mode of the distribution. This point marks the most likely combination of input parameters given the output, which is the key question in statistical inference when inferring the model's parameters (cp. chapter 2):

x̂ ≜ arg max_{x^(i); i=1...N} p(x^(i))
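A minimal sketch of the principle, assuming a standard normal target that we can sample directly: probabilities and expectations are approximated by averages over the samples.

import numpy as np

rng = np.random.default_rng(1)
N = 100_000
samples = rng.normal(0.0, 1.0, N)   # N i.i.d. samples from the target

# Monte Carlo estimates of P(0 <= X <= 1) and of E[X^2]:
print(((samples >= 0) & (samples <= 1)).mean())  # ~0.3413
print((samples ** 2).mean())                     # ~1.0 (the variance)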
Simply drawing samples using the Monte Carlo principle would be enough if p(x) were a Gaussian-like distribution [ADFDJ03]. Different and more complex distributions, however, need more sophisticated techniques, which are described in the following sections. There are several simple variants of Monte Carlo sampling such as “rejection sampling” and “importance sampling”, but they are not really practical [ADFDJ03, p. 11] and also do not work well in high-dimensional spaces [Mur12, p. 837], so I will not explain them here as they are not relevant for Probabilistic Programming.

4.2.3 Using Markov Chains
MCMC algorithms are Monte Carlo algorithms that use the Markov Chain mechanism to generate samples x^(i) from a state space X. Markov chains are a special type of stochastic process that is used in MCMC algorithms to describe transitions in the state space X. These Markov Chains are constructed in a way that makes the algorithm spend more time in the most important regions. The construction of the Markov chain is the crucial part of creating an MCMC algorithm [ADFDJ03]. We will see how it works by looking at a sample x^(i) ∈ X which can take s unique values X = {x_1, x_2, . . . , x_s}. The stochastic process x^(i) is called a Markov chain if

p(x^(i) | x^(i−1), . . . , x^(1)) = T(x^(i) | x^(i−1))

where T is the so-called transition matrix of the Markov Chain, which denotes the probability of getting from x^(i−1) to x^(i) and remains invariant. You can see in the formula that the next step x^(i) only depends on the current x^(i−1) and is independent of all earlier steps. This characteristic is called memorylessness, which is a great advantage as the algorithm can do many steps without having to save the state of all the steps it has taken before [ADFDJ03, p. 13]. As an example, let us take a Markov Chain with three states (s = 3) (see figure 5) and the transition matrix

T = [ 0    0    1   ]
    [ 0    0.1  0.9 ]
    [ 0.6  0.4  0   ]

The first row indicates the probabilities of getting from state s1 to s1, s2 and s3 respectively. The interpretation of all other rows is the same. We will later see how these matrices are created.
Figure 5: Markov Chain with 3 states s1, s2, s3; the edge labels are the transition probabilities from T.

Let us take a probability vector for the initial state by assigning random probabilities to every state:

µ(x^(1)) = (0.5, 0.2, 0.3)

Regardless of the starting point, an MCMC chain will converge to its target distribution for a large number of samples [ADFDJ03]. In our example it converges to p(x) ≈ (0.29, 0.22, 0.49), the invariant distribution of T. Running each step we multiply this vector with the transition matrix T (a numerical sketch of this iteration follows after the list below):

µ(x^(1)) T^1 = (0.18, 0.14, 0.68)
µ(x^(1)) T^2 = (0.41, 0.29, 0.31)
µ(x^(1)) T^3 = (0.18, 0.15, 0.67)
...
µ(x^(1)) T^N = (0.29, 0.22, 0.49)

The convergence of the chain depends on two characteristics that the transition matrix, i.e. the Markov chain, has to fulfill [ADFDJ03]:

1. Irreducibility: for any state there must be a positive probability of visiting all other states. In other words, the transition graph has to be connected.

2. Aperiodicity: the chain should not get trapped in cycles.
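A minimal sketch of the iteration above; the long-run vector is the chain's invariant distribution:

import numpy as np

# Transition matrix and initial distribution from the example above.
T = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.1, 0.9],
              [0.6, 0.4, 0.0]])
mu = np.array([0.5, 0.2, 0.3])

for n in (1, 2, 3, 100):
    # mu T^n: the distribution over states after n transitions.
    print(n, mu @ np.linalg.matrix_power(T, n))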
1. Initialize x^(0).
2. For i = 0 to N − 1:
     Sample u ∼ U[0,1].
     Sample x' ∼ q(x' | x^(i)).
     If u < A(x^(i), x') = min{1, [p(x') q(x^(i) | x')] / [p(x^(i)) q(x' | x^(i))]}:
         x^(i+1) = x'
     else:
         x^(i+1) = x^(i)

Code listing 2: MH algorithm pseudo code. U[0,1] is the uniform distribution on [0, 1].
4.2.4 The Metropolis Hastings Algorithm
The Metropolis Hastings algorithm (MH) is the most popular MCMC algorithm. It is the generalization of the most popular MCMC implementations such as Gibbs sampling [ADFDJ03]. The basic idea of MH is that moving from a state x to x' depends on some probability q(x'|x), which is called the proposal distribution (kernel). It is called the proposal distribution because it describes the probability of proposing the move from x to x'. The most frequently used q is a Gaussian distribution centered on x, q(x'|x) = N(x, Σ), which means that most of the probability lies in staying in the current state's neighbourhood, but it is still possible to jump elsewhere [Mur12]. In every step, the decision whether to accept the proposal is made based on an acceptance probability A, which is calculated from q and p. The acceptance probability A is calculated as follows:

A = min{1, p(x') / p(x)}

Listing 2 shows pseudo code of how MH works. It uses a slightly different formula for calculating A, which will be explained shortly. The idea of using the proposal distribution is that if x' is more probable than x, we definitely go there, but if not, it is still possible to take the step. We are not greedy for probability: it remains possible to visit less probable areas, so the whole space will be covered regardless of the initial state [Mur12]. If the proposal distribution is asymmetric, i.e. q(x'|x) ≠ q(x|x'),
we need the Hastings correction to calculate A, which is given by

A = min{1, [p(x') q(x^(i) | x')] / [p(x^(i)) q(x' | x^(i))]}

This ensures that the proposal distribution does not favor specific regions of the sample space, so we get the real target distribution. The MH algorithm itself is quite simple. The main difficulty in using it and getting good results lies in the choice of q(x), which we will discuss in the following. However, if you are only interested in the mode of the distribution, i.e. the most probable point in the state space, choosing q(x) is not that hard.
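A runnable minimal sketch of random walk Metropolis in Python; the unnormalized target density and the proposal scale are assumptions made for the example:

import numpy as np

# An unnormalized 1D target: a two-component Gaussian-like mixture.
def p_unnorm(x):
    return np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

rng = np.random.default_rng(2)
N, sigma = 50_000, 1.0          # number of samples, proposal std dev
samples = np.empty(N)
x = 0.0                         # arbitrary initial state

for i in range(N):
    x_prop = rng.normal(x, sigma)                 # symmetric Gaussian proposal
    A = min(1.0, p_unnorm(x_prop) / p_unnorm(x))  # acceptance probability
    if rng.uniform() < A:                         # accept the proposed move
        x = x_prop
    samples[i] = x                                # else stay at current state

# Discard a burn-in period before using the samples (see section 4.2.6).
posterior = samples[1000:]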
4.2.5 Choosing the proposal distribution
The most important part of all MH algorithms is the choice of the proposal distribution q(x). It has to be chosen so that all states that have non-zero probability in the target distribution also have non-zero probability under the proposal distribution. The Gaussian random walk proposal, which is to choose q(x'|x) = N(x, σ²), has non-zero probability in the whole space and is therefore a valid proposal distribution. This algorithm is also called the random walk Metropolis algorithm [Mur12, p. 848]. Another instance of MCMC is the independence sampler, which is to choose q(x'|x) = q(x'), so the new state is independent of the old state [ADFDJ03, p. 17], [Mur12, p. 848].

In practice it is important to choose the right variance so that the probability is spread to the right places. This matters because with a small variance the probability of exploring the whole space is quite small (figure 6a). It should also not be too wide, because the acceptance probability becomes lower for a wider variance, which results in a big number of moves being rejected; the chain is said to be “sticky” in this case (figure 6b). With the right variance we get a very good approximation of the target distribution (figure 6c) [Mur12]. However, this is not a problem if we are only interested in the mode of the target distribution p; we do not need to care too much when choosing q(x) in this case. We can use a technique called “simulated annealing”, which is to use an inhomogeneous Markov Chain whose invariant distribution at iteration i is not p(x) but p_i(x) ∝ p^{1/T_i}(x), where T_i is a decreasing cooling schedule [ADFDJ03, p. 17]. This algorithm will spend more time in the mode region and will not waste computation time in regions of no interest. A visualization is shown in figure 7.
Figure 6: MH algorithm with different variances in the proposal distribution. Figure from [ADFDJ03, p. 18].
Figure 7: Discovering the mode of the target distribution using simulated annealing. Figure from [ADFDJ03, p. 19].
4.2.6 The “Burn in” period
When running an MCMC algorithm it is useful to add a so-called “burn in” period that lets the chain warm up, i.e. move toward the target distribution. We can start MCMC sampling with any prior probability we like; however, the first samples will not give a good representation of the posterior distribution, so we just throw them away. All further samples after the burn in period are used for constructing the posterior distribution [Mur12, p. 856], [DP13].
4.2.7 The Gibbs sampler
The Gibbs sampler is also an MCMC algorithm, used for inference on graphical models [ADFDJ03, p. 21]. In Gibbs sampling we generate samples by sampling each component of a model in turn from its conditional distribution given the others. For this, the dependency structure of the variables has to be acyclic and we have to know the internals of the model [Mur12, p. 838]. Gibbs sampling can thus only be used on restricted forms of models like graphical models, which stands in contrast to general MCMC, which can be used for all probabilistic programs. Gibbs sampling is a distributed algorithm but it is not parallel, because samples have to be generated sequentially [Mur12]. This means that it is possible to run several parts of the execution on different machines or CPU cores, but the results have to be combined in a central place. Like general MCMC, the Gibbs sampler needs a burn in period to give proper results. There are several extensions to the Gibbs sampler, such as collapsed Gibbs sampling, which analytically integrates out unknown parameters so that they are not part of the sampling [Mur12, p. 841].
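A minimal sketch for a bivariate normal target with correlation ρ, where both conditionals are known in closed form (an assumption that makes Gibbs sampling applicable):

import numpy as np

# Standard bivariate normal: x | y ~ N(rho*y, 1-rho^2) and vice versa.
rng = np.random.default_rng(3)
rho, N = 0.8, 20_000
x = y = 0.0
samples = np.empty((N, 2))

for i in range(N):
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))  # sample p(x | y)
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))  # sample p(y | x)
    samples[i] = (x, y)

burned = samples[1000:]             # drop the burn in period
print(np.corrcoef(burned.T)[0, 1])  # should be close to rho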
4.3 Variational inference
Variational inference combines the benefits of exact and approximate inference: it achieves the speed benefits of MAP estimation while at the same time capturing the statistical benefits of the Bayesian approach. It can be used to approximate a given distribution when that distribution is not computable. The general idea of a variational algorithm is divided into the following steps (a minimal numerical sketch follows below):

• guess an approximation q(x) to the target distribution from some tractable family,
• make this approximation as close as possible to the true posterior p∗ = p(x|D).

This reduces inference to an optimization problem. Variational inference has the potential to increase the speed of inference at the expense of accuracy. This is done by relaxing the constraints and/or approximating the objective [Mur12, p. 731].
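The following is a minimal numerical sketch of this idea, assuming a one-dimensional unnormalized target and a Gaussian family for q; a real variational method would optimize the ELBO with gradients rather than the crude grid search used here to keep the sketch self-contained.

import numpy as np

# Unnormalized, slightly skewed target density (an illustrative stand-in
# for an intractable posterior p*).
def target(x):
    return np.exp(-0.5 * x ** 2) * (1.0 + 0.5 * np.tanh(x))

grid = np.linspace(-10.0, 10.0, 4001)
dx = grid[1] - grid[0]
p = target(grid)
p /= p.sum() * dx                            # normalize numerically on the grid

def kl_q_to_p(m, s):
    """KL(q || p) for q = N(m, s^2), evaluated numerically on the grid."""
    q = np.exp(-0.5 * ((grid - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    mask = q > 1e-12                         # avoid 0 * log 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask])) * dx

# Step 1: choose a tractable family (Gaussians).
# Step 2: make q as close to the target as possible by minimizing KL(q || p).
candidates = [(m, s) for m in np.linspace(-2.0, 2.0, 41)
                     for s in np.linspace(0.3, 3.0, 28)]
m_best, s_best = min(candidates, key=lambda ms: kl_q_to_p(*ms))
print("best approximation: N(%.2f, %.2f^2)" % (m_best, s_best))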
4.4 Model validation
When creating statistical models and doing inference it is always important to check the results. There are many ways of validating an inferred model, such as cross validation [Mur12, p. 22], a technique that holds back part of the observed data from inference and uses it to check how well the model predicts it. An important thing to be aware of is that a model should not merely match the observed data perfectly; it must also be able to predict unseen observations and new data. A model that fits the observed data perfectly will fail to match nearly all new and unseen values. This problem is called overfitting, which means that the model fits the data too closely [Mur12, p. 22]. As far as I could determine, model validation is not covered by Probabilistic Programming systems as of today. A small hold-out validation sketch is shown below.
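A minimal hold-out validation sketch in Python, with illustrative data and models: polynomials of increasing degree are fit to 30 noisy points and scored on 10 held-out points.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, x.size)   # noisy observations

idx = rng.permutation(x.size)
train, test = idx[:30], idx[30:]                           # hold-out split

for degree in (1, 3, 12):
    coeffs = np.polyfit(x[train], y[train], degree)        # fit on training part
    test_err = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    print(degree, test_err)

The degree-12 polynomial matches the training points almost perfectly but typically predicts the held-out points worst, which is exactly the overfitting effect described above.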
5 Computational limits
As in traditional programming, there are limits to what a program can compute and to what kinds of problems are expressible in a probabilistic programming language. The more obvious limits of computability are set by the hardware available to run the software on; these are quite easy to describe and understand. The theoretical limits, i.e. what would be possible if we had unlimited hardware, are much harder to capture. There are problems in the real world that are not expressible as programs or probabilistic programs. In the following I describe how these limits can be characterized and found.

For traditional programming these questions are already well covered by the study of computability theory over the last century, covering the computability of real numbers, continuous functions and higher types [Wei00]. The basic motivation behind computability theory is to find the limits of what is possible with the currently available knowledge and tools, in order to see at which points new techniques and ideas have to be invented. Alan Turing defined a number very pragmatically as computable “if its decimal can be written down by a machine” [Tur36]. This definition can be extended beyond numbers to tuples, sequences and even more complex structures [Roy11].

We can use computability theory as a basis for creating similar definitions in the field of probabilistic programming, i.e. for defining computable probability distributions. In particular this means investigating the class of probability distributions that can be represented by algorithms [Roy11, p. 25]. This amounts to discovering the limits of probabilistic programming, as probabilistic programming is about representing probabilistic models as programs. A model, i.e. an algorithm written in a probabilistic programming language, represents a probability distribution, and computing the conditional probability distribution out of two or more distributions is the main task performed when doing inference on a probabilistic program.
5.1 Computability in general
I will now give a few definitions from computability theory that are needed to describe the limits of computability with respect to probabilistic programming and probability distributions. Let's start with the following definition of computability in general, which is based on Turing machines (cp. [Tur36]):

“We say that a partial function f : N → N is partial computable when there is some Turing machine that, on input n ∈ dom(f), eventually outputs f(n) on its output tape and halts, and on input n ∉ dom(f), never halts. We say that a function f : N → N is total computable or simply computable when it is a partial computable function that is total, i.e., dom(f) = N.” – [Roy11, p. 26]

Derived from this definition, computability can be defined for other structures such as sequences, tuples and even more complex structures:

Computably enumerable set: A set is said to be computably enumerable when it is the domain of a partial computable function. A set is said to be co-computably enumerable when its complement is computably enumerable [Roy11, p. 27].

Computable metric space: In this work I only cover the intuitive idea of computability, so I leave the definition of a computable metric space open and refer to [Roy11, Def. II.3] for a detailed definition and explanation. Giving the definition here would require introducing all of the groundwork that [Roy11] is based on. You can think of it as roughly similar to a computably enumerable set, which is not strictly true but should be enough to follow the discussion below.

Computable probability space: A computable probability space is a pair (S, µ) where S is a computable metric space and µ is a computable probability measure on S [Roy11, p. 35].

Almost decidable set: Let S be a computable metric space and let µ ∈ M1(S) be a probability distribution on S. A (Borel) measurable subset A ⊆ S is said to be µ-almost decidable when there are two computably enumerable open sets U and V such that U ⊆ A, V ⊆ Aᶜ and µ(U) + µ(V) = 1 [Roy11, p. 35].
5.1.1 Computable random variables
As we have seen before, a random variable generates random output that is governed by a distribution; this distribution defines the probability of each output value. Based on this, a random variable can be seen as a mapping from some source of randomness to an output, inducing a distribution on the output space. To generate a value from a random variable, we take a random source and use it to draw a value according to the variable's distribution. For the definition of a computable random variable we can use a sequence of independent fair coin flips as the source of randomness. This is equivalent to more sophisticated sources, as shown by [Roy11, p. 32] and [Str05, p. 157ff.]. It follows that it should be possible to generate arbitrary distributions for random variables using a sequence of fair coin flips as input; a small sketch of this idea follows below. The basic probability space used in the following is ({0, 1}^N, P). We can then define a computable random variable as follows:

“Let S be a computable metric space. Then a random variable X in S is a computable random variable when X is computable on some P-measure-one subset of {0, 1}^N.” – [Roy11, Def. II.12]

A different view on this, i.e. an important property of a computable random variable, is that for every finite portion of output it has consumed only a finite number of input bits from the randomness space [Roy11, pp. 32-33]. This means that, assuming a Turing machine computes the values of the random variable, it reads from the randomness space, writes the output value, and stops once it has read enough information.
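A minimal sketch of this idea in Python; the function names are my own. A uniform value is built bit by bit from coin flips, and a Bernoulli(θ) variable is decided by comparing the flip stream against θ's binary expansion, reading only finitely many flips with probability one.

import random

def coin():
    """One independent fair coin flip, the basic source of randomness."""
    return random.getrandbits(1)

def uniform_from_coins(n_bits=53):
    """Approximate a Uniform[0, 1) draw from finitely many fair coin flips.
    Each additional bit halves the interval the value lies in, so any finite
    output precision consumes only a finite number of input bits."""
    return sum(coin() * 2.0 ** -i for i in range(1, n_bits + 1))

def bernoulli_from_coins(theta):
    """Sample Bernoulli(theta): compare the flip stream (the binary expansion
    of an implicit uniform U) with theta's binary expansion; the first
    differing bit decides whether U < theta."""
    while True:
        theta *= 2.0
        theta_bit = int(theta)               # next bit of theta's expansion
        theta -= theta_bit
        u_bit = coin()
        if u_bit != theta_bit:
            return int(u_bit < theta_bit)    # U < theta iff u_bit=0, theta_bit=1

print(uniform_from_coins())
print(sum(bernoulli_from_coins(0.3) for _ in range(100_000)) / 100_000)  # ≈ 0.3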
5.1.2 Transformations of random variables
An important operation in a probabilistic program is the transformation of random variables, i.e. of their probability distributions. Given a random variable X with PDF fX and CDF FX, such a transformation can be generalized as applying an arbitrary function to X. The result is a new random variable Y = r(X) which has its own distribution depending on the distribution of X. For an intuitive example we may take Y = X² or Y = e^X. For discrete random variables the calculation is given as [Was04, p. 60]:

fY(y) = P(Y = y) = P(r(X) = y) = P({x : r(x) = y}) = P(X ∈ r⁻¹(y))

As we can see from the above formula, we need the preimage of y under r. So for computing the derived distribution we need to find the inverse of r. This is not possible in general, but could be implemented for a known set of functions that may be used for creating a model.

For continuous random variables the calculation is a bit harder. According to [Was04, p. 61], finding fY takes three steps (a worked example follows below):

1. For each y, find the set Ay = {x : r(x) ≤ y}.
2. Find the CDF as FY(y) = P(Y ≤ y) = P(r(X) ≤ y) = P({x : r(x) ≤ y}) = ∫_{Ay} fX(x) dx.
3. Obtain the PDF by differentiation: fY(y) = F′Y(y).

These computations can be generalized to multiple variables as shown by [Was04, p. 63ff.], so the computation can be applied to models containing an arbitrary number of random variables. It should be noted that the calculation shown in this section is not what has to be done when sampling from a model, as in that case we are not interested in computing the complete distribution but only in a single realization: we can simply apply r to the value of X to get the value of Y. The difficulties described above only apply when we try to do exact inference for the model's distribution.
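As a worked instance of these three steps, take X to be standard normal with density f_X and set r(x) = x²; then all three steps can be carried out in closed form:

\[ A_y = \{x : x^2 \le y\} = [-\sqrt{y}, \sqrt{y}] \quad \text{for } y \ge 0, \]
\[ F_Y(y) = F_X(\sqrt{y}) - F_X(-\sqrt{y}), \]
\[ f_Y(y) = F_Y'(y) = \frac{f_X(\sqrt{y}) + f_X(-\sqrt{y})}{2\sqrt{y}} = \frac{1}{\sqrt{2\pi y}}\, e^{-y/2}, \]

which is the χ² distribution with one degree of freedom.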
5.2 Limits for calculating conditional distributions
In probabilistic programming, doing inference on a program given observed values of some random variables is the same as calculating the conditional distribution of the unobserved variable values U given the observed values O, denoted P(U|O) [Roy11]. Recall that the conditional distribution of an event B given another event A captures the likelihood of B occurring given that A has already occurred. A and B are not limited to events but can also be random variables, where P(B|A) represents the distribution of B given A. The formula for calculating the conditional distribution, given a measurable space S, a probability measure µ ∈ M1(S) on S and measurable sets A, B ⊆ S, is [Roy11, p. 48]:

µ(B|A) = µ(B ∩ A) / µ(A),   for µ(A) > 0
Given the definition above, µ( · |A) is a probability measure. However, it is only well-defined when µ(A) > 0, so we cannot use this definition when working with continuous random variables, since in that case individual observation events have probability zero [Roy11, p. 38]. This does not match the situation of most probabilistic programs, which need to calculate distributions on continuous and higher-order objects. To work around this problem when showing the limits of Probabilistic Programming, a more involved definition is needed, which is given by [Roy11, pp. 40ff.] and which allows the conclusions drawn in the next sections. A general idea of when a conditional distribution is considered to be computable is given by:

“If there is an algorithm for a joint distribution on a pair (X, Y) of random variables, is there an algorithm for the conditional distribution P[X|Y = y]?” – [Roy11]

It can be shown that there is no general solution for calculating the conditional probability distribution. [Roy11] however gives some circumstances under which conditioning is a computable operation.
5.2.1 Discrete random variables
Computability of conditional distributions in the discrete case is naturally less complex than in the continuous case, since the space of calculation is countable. It is therefore not surprising that the computability of conditional distributions is not restricted here. It is easy to show computability for discrete distributions such as the Bernoulli, geometric and uniform distributions [Roy11, pp. 52-53]. When µ is a computable probability measure and A is an almost decidable set, the conditional probability given A is computable [Roy11, Lemma III.13]. A minimal sketch of conditioning by enumeration follows below.
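In the discrete case, conditioning can be implemented directly by enumeration and renormalization; the toy joint distribution below is an illustrative assumption.

from fractions import Fraction

# Toy joint distribution P(X, Y) over a finite space (illustrative values).
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def condition_on_y(joint, y):
    """Compute P(X | Y = y) by renormalizing the matching entries."""
    rows = {x: p for (x, yy), p in joint.items() if yy == y}
    total = sum(rows.values())           # P(Y = y); positive in this setting
    return {x: p / total for x, p in rows.items()}

print(condition_on_y(joint, 1))          # {0: Fraction(2, 5), 1: Fraction(3, 5)}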
5.2.2 Continuous random variables
The conditional density function for the continuous case is the following:

pY|X(y|x) = pX|Y(x|y) / ∫ pX|Y(x|y) PY(dy)
According to [Roy11], the conditional distribution of continuous random variables is not computable in general: there is no general way of computing it. This is shown by [Roy11, p. 58] by drawing an analogy to the halting problem.
Let H = {n ∈ N : h(n) < ∞}, i.e., the indexes of the Turing machines that halt (on input 0). A classic result in computability theory shows that the halting set H is not computable [Tur36]. The conclusion from this (cp. [Roy11, Theorem III.34]) is that calculating the conditional probability of real-valued random variables is not computable, even when there exists a PX-almost continuous version of the conditional distribution. This theorem can be strengthened to the following (cp. [Roy11, Theorem III.40]): calculating the conditional probability of real-valued random variables is not computable even when there exists an everywhere continuous version of the conditional distribution. It is also shown that conditioning is at least as hard as the halting problem, but not harder [Roy11, p. 63]. Conditioning is computable in the setting of [Roy11, Lemma III.16], which is the following:

Let X and Y be computable random variables in computable metric spaces S and T, respectively. Assume that PX is concentrated on a computably discrete set D. Then the conditional distribution P[Y|X] is computable, uniformly in X, Y and D.
5.3 Making non-computable distributions computable
When confronted with a problem that is not computable, we have to look for heuristics or other approximations to handle it anyway. [Roy11, p. 50] shows that conditioning on a variable becomes computable when the observed variable has been corrupted by independent noise. He draws an interesting analogy to information theory: adding noise to some data can make conditioning on it computable even though it was not computable before. This still has limits: even though we corrupt the density with Gaussian noise with zero mean and a given σ², we have to choose a value of σ² that still allows the original distribution to be recovered, which means that the added noise must not disturb the original distribution so much that it no longer behaves as before. The value for σ² cannot be found in the general case. Further conditions are described by [Roy11, p. 50ff.]. A small sketch of conditioning on a noise-corrupted observation follows below.
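A sketch of this idea, with an illustrative prior and noise level of my own choosing: instead of conditioning on the exact (probability-zero) event X = z, we condition on a noise-corrupted observation Z = X + ε and weight prior samples by the noise likelihood.

import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5                              # noise level, chosen by hand

x = rng.standard_normal(100_000)         # samples from an illustrative prior N(0, 1)

# Condition on the noisy observation Z = X + eps, eps ~ N(0, sigma^2),
# via importance weights proportional to the Gaussian noise likelihood.
z_observed = 1.3
weights = np.exp(-0.5 * ((z_observed - x) / sigma) ** 2)
weights /= weights.sum()

posterior_mean = np.sum(weights * x)
print(posterior_mean)                    # analytically z / (1 + sigma^2) ≈ 1.04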
6 Summary
We have now seen most of the underlying concepts of Probabilistic Programming and the limits of computability of probability distributions. There is already a great number of options available for implementing Probabilistic Programming Languages that pursue the goal of making inference an easy job (cp. [Cro13]) for domain experts who have to deal with uncertain problems and need to reason about data. Current state-of-the-art Probabilistic Programming Systems such as Church and PyMC have already realized many of the concepts described in this work [Goo13]. However, the goal of a fully automatic system as described by [DAR13] has not been reached yet. Most systems use only one algorithm when doing automatic inference and let the user choose when allowing manual inference. What could be invented here is an algorithm that chooses the right inference method for a given problem, using the knowledge about inference algorithms described in this work. However, it is questionable whether this development goes in the right direction: even if the aim of probabilistic programming is to require less expert knowledge, a basic understanding of the underlying algorithm is still needed to interpret the result and, even more importantly, to judge the quality of the inference result.

Regarding computability, we have seen that working with discrete variables is easy, as they are already completely representable by computers. When working with complex models and continuous variables, more sophisticated algorithms must be used. The halting set is not computable, and neither are certain conditional probability distributions. This means there are probability distributions that are not expressible in a probabilistic programming language. All of this, however, is of a theoretical nature, so we cannot say whether and where we will hit these limits in practice. We have seen that it is possible, up to a point, to make things computable if needed; the vast majority of inference problems is computable, though.
7 References
[ADFDJ03] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.

[BEPW96] Klaus Backhaus, Bernd Erichson, Wulff Plinke, and Rolf Weiber. Multivariate Analysemethoden: Eine anwendungsorientierte Einführung (Auflage 8). Springer-Lehrbuch, 1996.

[Bun94] Wray L. Buntine. Operations for learning with graphical models. arXiv preprint cs/9412102, 1994.

[Cha80] C. Chatfield. The analysis of time series: an introduction. 1980.

[Coz00] Fabio Gagliardi Cozman. Generalizing variable elimination in Bayesian networks. In Workshop on Prob. Reasoning in Bayesian Networks at SBIA/Iberamia, pages 21–26, 2000.

[Cro13] B. Cronin. Why Probabilistic Programming Matters, 2013. Posting on Google Plus, retrieved on 4 Aug. 2013. https://plus.google.com/u/0/107971134877020469960/posts/KpeRdJKR6Z1

[DAR13] Probabilistic Programming for Advancing Machine Learning (PPAML). Broad Agency Announcement. DARPA-BAA-13-31, 2013. https://www.fbo.gov/?s=opportunity&mode=form&id=4a3a33415756f596fae229defed7deaa&tab=core&_cview=0

[Dec98] Rina Dechter. Bucket elimination: A unifying framework for probabilistic inference. In Learning in Graphical Models, pages 75–104. Springer, 1998.

[DP13] C. Davidson-Pilon. Probabilistic Programming and Bayesian Methods for Hackers, 2013. Unpublished. https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

[FKP98] Nir Friedman, Daphne Koller, and Avi Pfeffer. Structured representation of complex stochastic systems. In AAAI/IAAI, pages 157–164, 1998.

[Goo13] Noah D. Goodman. The principles and practice of probabilistic programming. In Proceedings of the 40th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 399–402. ACM, 2013.

[Gor13] A. D. Gordon. An Agenda for Probabilistic Programming: Usable, Portable, and Ubiquitous, 2013. http://research.microsoft.com/en-us/projects/fun/agendaforprobabilisticprogramming.pdf

[KP97] Daphne Koller and Avi Pfeffer. Object-oriented Bayesian networks. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, pages 302–313. Morgan Kaufmann Publishers Inc., 1997.

[KS08] Jörg Kappes and Christoph Schnörr. MAP-inference for highly-connected graphs with DC-programming. In Pattern Recognition, pages 1–10. Springer, 2008.

[Mur12] K. P. Murphy. Machine Learning: A Probabilistic Perspective, volume 98. CRC Press, 2012.

[Pea88] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

[Rei03] Edwin D. Reilly. Milestones in Computer Science and Information Technology. Greenwood Publishing Group, 2003.

[Roy11] Daniel M. Roy. Computability, Inference and Modeling in Probabilistic Programming. PhD thesis, Massachusetts Institute of Technology, 2011. MIT/EECS George M. Sprowls Doctoral Dissertation Award.

[Str05] Daniel W. Stroock. An Introduction to Markov Processes, volume 230. Springer, 2005.

[Tur36] Alan Mathison Turing. On computable numbers, with an application to the Entscheidungsproblem. J. of Math, 58:345–363, 1936.

[Wam13] D. Wampler. Programming Trends to Watch: Logic and Probabilistic Programming, 2013. Retrieved on 18 Aug. 2013 from http://thinkbiganalytics.com/programming-trends-to-watch-logic-and-probabilistic-programming/

[Was04] L. Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer Publishing Company, Incorporated, 2004. http://research.rmutp.ac.th/research/AConciseCourseinStatisticalInference.pdf

[Wei00] Klaus Weihrauch. Computable Analysis: An Introduction. Springer, 2000.
A Probability Distributions

A.1 Binomial distribution
The binomial distribution describes an experiment with two possible outcomes which is repeated n times; the results of the single repetitions are independent. When the discrete random variable X counts the successes of n trials with success probability θ, it has a binomial distribution and we write:

X ∼ Bin(n, θ)

The probability mass function is defined as follows:

Bin(x|n, θ) = C(n, x) θ^x (1 − θ)^(n−x),   with C(n, x) ≜ n! / ((n − x)! x!)

µ = nθ,   σ² = nθ(1 − θ)
[Was04]
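A quick numerical sanity check of these formulas (a sketch using Python's standard library; math.comb gives the binomial coefficient):

from math import comb

def binomial_pmf(x, n, theta):
    """Bin(x | n, theta) = C(n, x) * theta^x * (1 - theta)^(n - x)."""
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)

n, theta = 10, 0.3
print(sum(binomial_pmf(x, n, theta) for x in range(n + 1)))      # 1.0: PMF sums to one
print(sum(x * binomial_pmf(x, n, theta) for x in range(n + 1)))  # 3.0 = n * theta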
A.2 Gaussian (normal) distribution
When the continuous random variable X has a Gaussian distribution with mean µ and variance σ², we write:

X ∼ N(µ, σ²)

The probability density function is defined as follows:

N(x|µ, σ²) = (1 / √(2πσ²)) exp(−(x − µ)² / (2σ²))
For µ = 0 and σ = 1 we speak of a standard normal distribution. [Was04]
A.3 Properties of distributions
The most important properties of probability distributions are the mean µ and the variance σ².
A.3.1 Mean (µ)
The mean, or expected value, is defined as follows:

for discrete X:    E[X] ≜ Σ_{x∈X} x p(x)

for continuous X:  E[X] ≜ ∫_X x p(x) dx
[Mur12, p. 33]
A.3.2 Variance (σ²)
The variance is a measure of the “spread” of a distribution:

for discrete X:    var[X] ≜ E[(X − µ)²]

for continuous X:  var[X] ≜ ∫ (x − µ)² p(x) dx

[Mur12, p. 34]