A Course in Bayesian Graphical Modeling for Cognitive Science

Michael D. Lee
University of California, Irvine
[email protected]

Eric-Jan Wagenmakers
University of Amsterdam
[email protected]

March 28, 2008

[Cover figure: a graphical model with nodes κ, ξ, γ, α, β, ψ, πa, πb, πc, πd, and observed data D.]

Contents

1 Preliminaries
  1.1 Probability Distributions and Sampling
  1.2 WinBUGS and Matbugs

2 Some Examples With Binomials
  2.1 Inferring a Rate
  2.2 The Difference Between Two Rates
  2.3 Inferring a Common Rate
  2.4 Prior and Posterior Prediction

3 Some Examples With Gaussians
  3.1 Inferring Means and Standard Deviations
  3.2 The Seven Scientists
  3.3 Repeated Measurement of IQ

4 Basic Data Analysis
  4.1 Pearson Correlation
  4.2 The Kappa Coefficient of Agreement
  4.3 Change Detection in Time Series Data

5 Exams and Quizzes
  5.1 Exam Scores
  5.2 Exam Scores With Individual Differences
  5.3 Twenty Questions
  5.4 The Two Country Quiz

6 Memory Retention
  6.1 No Individual Differences
  6.2 Full Individual Differences
  6.3 Structured Individual Differences

7 Signal Detection Theory
  7.1 Standard Signal Detection Theory
  7.2 Hierarchical Signal Detection Theory

8 Multidimensional Scaling
  8.1 City-Block MDS
  8.2 Euclidean MDS With Individual Differences

9 Take The Best
  9.1 TTB With Fixed Search Order
  9.2 Inferring the TTB Search Order

10 Number Concepts
  10.1 Knower Level Model
  10.2 Analog Representation Model

11 SIMPLE
  11.1 Standard SIMPLE
  11.2 A Hierarchical Extension of SIMPLE

References

Chapter 1

PRELIMINARIES

1.1 Probability Distributions and Sampling

Statistics exists to handle uncertainty, and to allow inferences to be made from incomplete and noisy information. In Bayesian statistics, every variable we are interested in is represented by a probability distribution. These distributions capture everything we do and do not know about the relative likelihood of different values of those variables, given the information available at every stage in an analysis. In general, probability distributions can take any form. They just need to assign some non-negative likelihood to every value the variable can take. If the variable is discrete, so that you can count the number of values it can take, these likelihoods need to sum to one. If the variable is continuous, the area under the likelihood curve needs to integrate to one (this means the relative likelihoods themselves can exceed one, which seems to cause great consternation in some psychology circles; a short sketch after Figure 1.1 demonstrates the point). Check that you understand what probability distributions mean by interpreting the two shown in Figure 1.1.

[Figure 1.1 appears here: panel A plots Probability against discrete values Y = 1 to 8; panel B plots Probability Density against continuous values Z = 50 to 160.]

Figure 1.1: Two probability distributions for interpretation.
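To see that a continuous density can exceed one while still integrating to one, here is a minimal Matlab sketch, assuming the Statistics Toolbox is available; the Beta(10, 1) density is used purely as an illustration, and is not one of the distributions in Figure 1.1.

% A continuous density can exceed one pointwise and still integrate to
% one; the Beta(10,1) density is 10*x^9, which reaches 10 at x = 1
x = 0:0.001:1;
f = betapdf(x, 10, 1);

% The maximum density value is well above one
maxdensity = max(f);

% Numerical integration over [0,1] still gives (approximately) one
area = trapz(x, f);
fprintf('Maximum density = %.1f, total area = %.4f\n', maxdensity, area);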

Although variables can, in principle, take any form, in practice there is a relatively small set of special forms that cover a wide range of useful modeling possibilities. These special forms correspond to established statistical families, like the Binomial and the Gaussian, that have natural interpretations and wide applicability. Each of these established distributions has one or more parameters, and a specific set of parameter values corresponds to a specific distribution within the family. Think of a statistical family like the Binomial as a machine, and the parameters as tuning knobs on that machine. At any one knob setting, a single distribution is produced by the machine. As the knobs are twirled, the whole range of representations made possible by the machine is swept out. Statistical inference essentially amounts to seeing whether a machine can capture the right shape to fit data, and, if it can, seeing what values of the parameters are needed to make the match.

As a concrete example, consider the Binomial family. This corresponds to a process in which a total of n binary trials are considered, each independently having probability of success θ. The outcome is the total number of successes, X. We write that "X is distributed as a Binomial process with success rate θ and number of trials n" as X ∼ Binomial(θ, n). Figure 1.2 shows four specific probability distributions, all corresponding to a Binomial process. These distributions show the relative likelihood of X taking each of the possible values, under the given θ and n parameter settings.

[Figure 1.2 appears here: four panels (A to D), each plotting Probability against X for a different Binomial distribution.]

Figure 1.2: Four Binomial probability distributions, corresponding to (A) X ∼ Binomial(0.5, 10), (B) X ∼ Binomial(0.2, 10), (C) X ∼ Binomial(0.9, 20), (D) X ∼ Binomial(0.5, 20).

Sampling from a probability distribution involves selecting or "drawing" an X value according to these relative likelihoods. As more and more of these samples are drawn, and a list of them is built up, the histogram of that list starts to resemble the probability distribution from which the samples are drawn.
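A minimal Matlab sketch of this idea, again assuming the Statistics Toolbox, draws samples from the X ∼ Binomial(0.5, 10) distribution of panel A and plots their histogram next to the exact probabilities.

% Draw samples from Binomial(0.5, 10) and compare the normalized
% histogram of the samples to the exact Binomial probabilities
theta = 0.5; n = 10; nsamples = 10000;
x = binornd(n, theta, nsamples, 1);

% Empirical proportion of samples taking each value 0, 1, ..., n
sampled = histc(x, 0:n) / nsamples;

% Exact Binomial probabilities for the same values
exact = binopdf(0:n, n, theta)';

% Plot the sampled and exact distributions side by side
bar(0:n, [sampled exact]);
legend('Sampled', 'Exact');
xlabel('X'); ylabel('Probability');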

1.2 WinBUGS and Matbugs

We are using the Matbugs function to call the WinBUGS software from within Matlab, and to return the results of the WinBUGS sampling to a Matlab variable for further analysis. The code fragment we are using to do this is below.

% Use WinBUGS to Sample
[samples, stats, structarray] = matbugs(datastruct, ...
    fullfile(pwd, 'Rate_1.txt'), ...
    'init', init0, ...
    'nChains', nchains, ...
    'view', 1, 'nburnin', nburnin, 'nsamples', nsamples, ...
    'thin', 1, 'DICstatus', 0, 'refreshrate', 100, ...
    'monitorParams', {'theta'}, ...
    'Bugdir', 'C:/Program Files/WinBUGS14');

Some of these options control software input and output:

• datastruct contains the data (i.e., the observed variables) you are passing from Matlab variables to your graphical model

• fullfile gives the name of the text file containing the WinBUGS scripting of your graphical model

• view, when set to 0, means WinBUGS terminates automatically at the end of sampling and returns to Matlab; when set to 1, it allows WinBUGS to be used for preliminary inspection before exiting WinBUGS and returning to Matlab

• refreshrate gives the number of samples between refreshes of the WinBUGS displays, which makes it possible to monitor long sampling runs in real time

• monitorParams gives the list of variables that will be monitored and returned to Matlab in the samples structure

• Bugdir gives the location of the WinBUGS software

The other options define the values for computational sampling parameters:

• init gives the initial values for the variables you want to make inferences about (i.e., the unobserved variables); if you do not specify a starting value, WinBUGS will try sampling one from the prior, which may or may not lead to numerical crashes, so it is safer to give a starting value to everything (especially nodes that have no parents)

• nChains gives the number of chains to include in the sampling run; multiple chains amount to multiple independent runs of the same model with the same data (although you can vary the starting point per chain), and so provide a key test of convergence

• nburnin gives the number of 'burn-in' samples, which are consecutive samples drawn without being recorded at the beginning of a sampling run

• nsamples gives the number of recorded samples to be drawn from the posterior

• thin gives the number of drawn samples between those that are recorded, so a value of 2 means only every second drawn sample is included; this is important when successive samples are not independent, as shown by autocorrelation

• DICstatus gives the option to calculate the Deviance Information Criterion statistic that the authors of WinBUGS like (because they invented it); it is intended for model selection, but is open to challenge
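For reference, here is a minimal sketch of how the inputs assumed by the matbugs call above might be defined; the data values k = 5 and n = 10 match the example in Chapter 2, and the starting value of 0.5 for θ is illustrative rather than prescribed.

% Illustrative setup for the matbugs call; all values are examples
nchains  = 1;      % number of independent chains
nburnin  = 1000;   % unrecorded samples at the start of each chain
nsamples = 5000;   % recorded posterior samples per chain

% Observed data passed into the graphical model
datastruct = struct('k', 5, 'n', 10);

% One struct of starting values per chain, covering the unobserved variables
for i = 1:nchains
    init0(i) = struct('theta', 0.5);
end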

Chapter 2

SOME EXAMPLES WITH BINOMIALS

2.1 Inferring a Rate

A binary process is any process with exactly two possible outcomes. It might be that something either happens or does not happen, or that something either succeeds or fails, or takes one value rather than another. An inference that is often important for these sorts of processes concerns the underlying rate at which the process takes one value rather than the other. Inferences about the rate can be made by observing how many times the process takes each value over a number of trials. Suppose that one of the values (e.g., a success) is observed k times out of n trials. These are known, or observed, data. The unknown variable of interest is the rate θ at which the values are produced. Assuming the trials are statistically independent (i.e., that what happened on one trial does not influence the others), the number of successes k follows a Binomial distribution, k ∼ Binomial(θ, n). This relationship means that by observing k successes out of n trials, it is possible to update our knowledge about the rate θ.

The basic idea of Bayesian analysis is that what we know, and do not know, about the variables of interest is always represented by probability distributions. Data like k and n allow us to update prior distributions for the unknown variables into posterior distributions that incorporate the new information. We will start with the prior assumption for the rate θ that all possible rates between 0 and 1 are equally likely. This is the uniform prior θ ∼ Uniform(0, 1).

[Figure 2.1 appears here: a graphical model with nodes θ, n, and k, annotated θ ∼ Uniform(0, 1) and k ∼ Binomial(θ, n).]

Figure 2.1: Graphical model for inferring the rate of a binary process.

The graphical model representation for this problem is shown in Figure 2.1. The nodes represent variables of interest, and the graph structure is used to indicate dependencies between the variables, with children depending on their parents. We use the conventions of representing unobserved variables without shading and observed variables with shading, and continuous variables with circular nodes and discrete variables with square nodes. Thus, the observed discrete counts of the numbers of successes k and trials n are represented by shaded and square nodes, and the unknown continuous rate θ is represented by an unshaded and circular node. Because the number of successes k depends on the number of trials n and the rate of success θ, the nodes representing n and θ are directed towards the node representing k.

The advantage of using the language of graphical models is that it gives a complete and interpretable representation of a Bayesian probabilistic model. Also, using the WinBUGS software, it is easy to implement a graphical model, and the various computational algorithms built into the software are then able to do all of the inference automatically. The following code implements the graphical model in WinBUGS.

# Inferring A Rate
model {
   # Prior on Rate
   theta ~ dunif(0,1)
   # Observed Counts
   k ~ dbin(theta,n)
}

The Matlab script Rate_1.m makes up some data (i.e., sets values for k and n), and then calls WinBUGS to sample from the graphical model. The Matlab code also draws the posterior distribution of the rate θ. It should look something like Figure 2.2.


Figure 2.2: Posterior distribution of rate θ for k = 5 successes out of n = 10 trials, based on 5,000 posterior samples.
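As an aside, because the uniform prior is conjugate to the Binomial likelihood, this particular posterior is also available analytically: it is a Beta(k + 1, n − k + 1) distribution. A minimal Matlab sketch of a sanity check, assuming the monitored draws are returned in a field samples.theta:

% Compare the WinBUGS samples of theta to the exact Beta(k+1, n-k+1)
% posterior implied by the uniform prior; assumes samples.theta exists
k = 5; n = 10;
draws = samples.theta(:);

% Normalized histogram of the posterior samples
[counts, centers] = hist(draws, 50);
binwidth = centers(2) - centers(1);
bar(centers, counts / (numel(draws) * binwidth));

% Overlay the exact conjugate posterior density
hold on;
xs = 0:0.01:1;
plot(xs, betapdf(xs, k + 1, n - k + 1), 'k', 'LineWidth', 2);
hold off;
xlabel('Rate'); ylabel('Posterior Density');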


Exercises

Once the code is working, here is a list of exercises to complete. Try to think of the message each exercise has for making inferences about psychological variables from data.

1. Alter the data to k = 50 and n = 100, and compare the posterior for the rate θ to the original with k = 5 and n = 10.

2. Alter the data to k = 99 and n = 100, and comment on the shape of the posterior for the rate θ.

3. Alter the data to k = 0 and n = 1, and comment on what this demonstrates about the Bayesian approach.

4. Alter the data to anything else you think is interesting.

5. Alter the number of samples drawn from the posterior to a larger number, like 10^5. What does this achieve?

6. Alter the number of chains to 2. What does this achieve?

7. Alter the prior distribution on the rate θ to theta ~ dbeta(5,2), corresponding to the prior assumption that 5 successes and 2 failures have already been observed. How does this change in prior affect the posterior distributions of θ for k = 5 and n = 10 versus k = 50 and n = 100?

2.2 The Difference Between Two Rates

Now we have two different processes, producing k1 and k2 successes out of n1 and n2 trials, respectively. First, we will make the assumption that the underlying rates are different, so that they correspond to different latent variables θ1 and θ2. Our interest is in the values of these rates, as estimated from the data, and in the difference δ = θ1 − θ2 between the rates. The graphical model representation for this problem is shown in Figure 2.3. The only new notation is that the deterministic variable δ is shown by a double-bordered node. The following code implements the graphical model in WinBUGS.

# Difference Between Two Rates
model {
   # Prior on Rates
   theta1 ~ dbeta(1,1)
   theta2 ~ dbeta(1,1)
   # Observed Counts
   k1 ~ dbin(theta1,n1)
   k2 ~ dbin(theta2,n2)
   # Difference between Rates
   delta <- theta1 - theta2
}
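Once WinBUGS returns its samples, the posterior for δ can be summarized directly from the monitored draws. A minimal Matlab sketch, assuming delta is included in monitorParams and returned as a field samples.delta:

% Summarize the posterior of delta = theta1 - theta2 from the samples;
% assumes samples.delta holds the monitored draws
d = samples.delta(:);

% Posterior mean and a central 95% credible interval
postmean = mean(d);
ci = [prctile(d, 2.5), prctile(d, 97.5)];

% Posterior probability that the first rate exceeds the second
pgreater = mean(d > 0);
fprintf('Mean = %.3f, 95%% CI = [%.3f, %.3f], P(delta > 0) = %.3f\n', ...
    postmean, ci(1), ci(2), pgreater);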
