Psychology Science, Volume 47, 2005 (3/4), p. 391-400
An algorithm and tool for computing exact conditional probabilities of configuration frequencies

MANFRED BEIER¹

Abstract
Traditionally, the exact conditional probability of a configuration frequency is calculated with methods based on Fisher's well-known formula for two-by-two contingency tables or its extensions for tables of higher dimensions. I present here a different, combinatorial approach that shows much better scaling behavior for an increasing number of variables and is in principle independent of the number of categories.

Key words: Exact conditional probability, multidimensional contingency table, configural frequency analysis (CFA)
1 Dipl.-Ing. Manfred Beier, Institut für Humangenetik und Anthropologie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, D-40225 Düsseldorf, Germany; E-mail: [email protected]
1. Algorithm

Please note that in the following the term “matrix” does not refer to a contingency table but to the raw data matrix from which a contingency table can be constructed.

Let $A = (a_{ij})$ be a matrix with $m$ rows (observations, sample size) and $n$ columns (variables, dimension of a contingency table). What is the probability of finding $k$ or more identical row vectors $W = (w_j)$ (configurations, patterns) in the matrix if the components (attributes) of each column vector are ordered at random? Let $F = (f_j)$ be a vector holding, for each column $j = 1, \dots, n$, the frequency of $w_j$ (the corresponding marginal sum of a contingency table). For the first component of the first row, $a_{11}$, the probability of matching $w_1$ is

$$P(a_{11} = w_1; m, f_1) = \frac{f_1}{m}, \quad \text{for } a_{12}\colon\; P(a_{12} = w_2; m, f_2) = \frac{f_2}{m}, \quad \text{etc.} \tag{1}$$
Assuming independence of the columns, the probability that all cells of the first row equal $W = (w_j)$ is

$$P\bigl((a_{11}, \dots, a_{1n}) = W; m, F\bigr) = \prod_{j=1}^{n} \frac{f_j}{m}. \tag{2}$$
The probability that at least the first $k$ rows are filled with $W$ is

$$P\bigl((a_{11}, \dots, a_{1n}) = \dots = (a_{k1}, \dots, a_{kn}) = W; m, F\bigr) = \prod_{j=1}^{n} \frac{f_j}{m} \cdot \frac{f_j - 1}{m - 1} \cdots \frac{f_j - k + 1}{m - k + 1} = \prod_{j=1}^{n} \frac{\binom{f_j}{k}}{\binom{m}{k}} = P_0. \tag{3}$$
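The two forms of $P_0$ in equation (3) can be checked against each other numerically. The following is a minimal Python sketch, not part of the original tool; the function names and the example values for $m$, $F$ and $k$ are made up for illustration. It multiplies the per-row ratios directly and compares the result with the product of binomial coefficient ratios, using exact rational arithmetic.

```python
from fractions import Fraction
from math import comb


def p0_sequential(k, m, freqs):
    """Multiply the ratios (f_j - i) / (m - i), i = 0..k-1, over all columns."""
    p = Fraction(1)
    for f in freqs:
        for i in range(k):
            p *= Fraction(f - i, m - i)
    return p


def p0_binomial(k, m, freqs):
    """Product over all columns of C(f_j, k) / C(m, k)."""
    p = Fraction(1)
    for f in freqs:
        p *= Fraction(comb(f, k), comb(m, k))
    return p


# Hypothetical example: 10 observations, 3 variables, first 3 rows equal to W.
m, freqs, k = 10, (4, 6, 5), 3
print(p0_sequential(k, m, freqs))  # 1/2160
print(p0_binomial(k, m, freqs))    # 1/2160
```

Both functions return the same fraction because $f(f-1)\cdots(f-k+1) \,/\, m(m-1)\cdots(m-k+1) = \binom{f}{k} / \binom{m}{k}$ for each column.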
The maximum possible number of $W$ is limited by $f_{\min}$, i.e. the smallest component of $F$.² Consequently, for $k = f_{\min}$ the formula above gives us the probability that exactly the first $k$ rows contain $W$, with no further occurrence in the remaining matrix. Therefore, the probability for a matrix $A_k$ with $k$ vectors $W$ occurring in arbitrary rows can be computed by summing up the probabilities of all possible ways to choose $k$ rows out of $m$:

$$P(A_k; k = f_{\min}, m, F) = \binom{m}{k} \cdot P_0. \tag{4}$$

2 The minimum possible number is given by $\max\{0, \sum_j f_j - m(n-1)\}$.

In the case of $n = 2$ with $f_1 = f_{\min} = k$:

$$\binom{m}{k} \cdot \prod_{j=1}^{2} \frac{\binom{f_j}{f_{\min}}}{\binom{m}{f_{\min}}} = \binom{m}{f_{\min}} \cdot \frac{\binom{f_{\min}}{f_{\min}}}{\binom{m}{f_{\min}}} \cdot \frac{\binom{f_2}{f_{\min}}}{\binom{m}{f_{\min}}} = \frac{\binom{f_2}{f_{\min}}}{\binom{m}{f_{\min}}}, \tag{5}$$
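this is equivalent to the p-value given by the hypergeometric distribution of a 2 by 2 contingency table, here shown with Fisher's formula for $R_1 = f_1 = f_{\min} = k$ and $C_1 = f_2$:

$$
\begin{array}{cc|c}
O_{11} = k = f_{\min} & O_{12} = 0 & R_1 = f_1 = f_{\min} \\
O_{21} = f_2 - f_{\min} & O_{22} = C_2 & R_2 = m - f_{\min} \\ \hline
C_1 = f_2 & C_2 & m
\end{array}
$$

$$\frac{R_1!\,R_2!\,C_1!\,C_2!}{m!\,O_{11}!\,O_{12}!\,O_{21}!\,O_{22}!} = \frac{f_{\min}!\,(m - f_{\min})!\,f_2!\,C_2!}{m!\,f_{\min}!\,0!\,(f_2 - f_{\min})!\,C_2!} = \frac{f_2! \,/\, (f_2 - f_{\min})!}{m! \,/\, (m - f_{\min})!} = \frac{\binom{f_2}{f_{\min}}}{\binom{m}{f_{\min}}}. \tag{6}$$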
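The equivalence with Fisher's formula can be confirmed for concrete numbers. The sketch below is an illustration only; the function name `fisher_point_probability` and the values of $m$, $f_{\min}$ and $f_2$ are assumptions chosen for the example. It evaluates the factorial form for a 2 by 2 table with $O_{12} = 0$ and compares it with the ratio of binomial coefficients.

```python
from fractions import Fraction
from math import comb, factorial


def fisher_point_probability(o11, o12, o21, o22):
    """Point probability of a 2 by 2 table with fixed margins (Fisher's formula)."""
    r1, r2 = o11 + o12, o21 + o22      # row sums
    c1, c2 = o11 + o21, o12 + o22      # column sums
    total = r1 + r2
    numerator = factorial(r1) * factorial(r2) * factorial(c1) * factorial(c2)
    denominator = (factorial(total) * factorial(o11) * factorial(o12)
                   * factorial(o21) * factorial(o22))
    return Fraction(numerator, denominator)


# Hypothetical example: m = 10 observations, f1 = f_min = k = 3, f2 = 6.
m, f_min, f2 = 10, 3, 6
fisher = fisher_point_probability(f_min, 0, f2 - f_min, m - f2)
combinatorial = Fraction(comb(f2, f_min), comb(m, f_min))
print(fisher, combinatorial)  # 1/6 1/6
```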
For $k < f_{\min}$, the probability of exactly $k$ row vectors matching $W$ corresponds to the probability of getting at least $k$ rows and no further row:

$$P(A_k; k < f_{\min}, m, F) = \binom{m}{k} \cdot P_0 \cdot \Bigl(1 - P\bigl(\{A_1, \dots, A_{f_{\min}-k}\}; m - k, (f_1 - k, \dots, f_n - k)\bigr)\Bigr). \tag{7}$$
This leads to the following recursive formula for getting $k$ or more row vectors $W$:

$$P\bigl(\{A_k, \dots, A_{f_{\min}}\}; m, F\bigr) = \frac{\prod_{j=1}^{n} \binom{f_j}{k}}{\binom{m}{k}^{n-1}} \cdot \Bigl(1 - P\bigl(\{A_1, \dots, A_{f_{\min}-k}\}; m - k, (f_1 - k, \dots, f_n - k)\bigr)\Bigr) + P\bigl(\{A_{k+1}, \dots, A_{f_{\min}}\}; m, F\bigr). \tag{8}$$

At first sight this formula seems to be computationally infeasible: the number of p-values that have to be computed grows exponentially with the distance between $k$ and $f_{\min}$. But by storing and reusing the partial results $1 - P(\{A_1, \dots, A_{f_{\min}-i}\}; m - i, F - i)$ for all $i = k, \dots, f_{\min} - 1$, the number of interim values is reduced to $\sum_{1 \le i \le f_{\min}-k+1} i$, i.e. a quadratic growth rate.

An algorithm is given below in the form of an implementation in the programming language R (www.R-project.org). In addition to the necessary “data cache” just mentioned (p1array), a second array (p0) allows the first part of the formula to be calculated using the recurrence relation
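$$\frac{\prod_{j=1}^{n} \binom{f_j}{k}}{\binom{m}{k}^{n-1}} = \frac{\prod_{j=1}^{n} \binom{f_j}{k-1}}{\binom{m}{k-1}^{n-1}} \cdot \frac{\prod_{j=1}^{n} (f_j - k + 1)}{k \, (m - k + 1)^{n-1}}. \tag{9}$$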
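To illustrate how caching turns recursion (8) into a computation with quadratically many terms, here is a minimal Python sketch. It is not the author's R implementation: the names `tail_probability`, `term` and `at_least_one` are hypothetical, the binomial coefficients are computed directly with `math.comb` rather than through the recurrence relation above, and the input is assumed to satisfy $0 \le k$ and $f_j \le m$. The list `at_least_one` plays the role of the cache of partial results for the reduced problems $(m - s, F - s)$.

```python
from fractions import Fraction
from math import comb


def tail_probability(k, m, freqs):
    """P(at least k rows equal W) for m rows and marginal frequencies freqs."""
    n = len(freqs)
    f_min = min(freqs)
    if k > f_min:
        return Fraction(0)

    def term(i, shift):
        # prod_j C(f_j - shift, i) / C(m - shift, i)^(n - 1)
        numerator = 1
        for f in freqs:
            numerator *= comb(f - shift, i)
        return Fraction(numerator, comb(m - shift, i) ** (n - 1))

    # at_least_one[s] = P(one or more rows equal W) in the reduced problem
    # with m - s rows and frequencies f_j - s; it is 0 for s = f_min.
    at_least_one = [Fraction(0)] * (f_min + 1)
    for s in range(f_min - 1, k - 1, -1):
        at_least_one[s] = sum(
            (term(i, s) * (1 - at_least_one[s + i])
             for i in range(1, f_min - s + 1)),
            Fraction(0),
        )

    # Sum the probabilities of exactly i matching rows for i = k..f_min.
    return sum(
        (term(i, 0) * (1 - at_least_one[i]) for i in range(k, f_min + 1)),
        Fraction(0),
    )


# Hypothetical example: 3 observations, 3 variables, each w_j occurring twice.
print(tail_probability(1, 3, (2, 2, 2)))  # 7/9
print(tail_probability(2, 3, (2, 2, 2)))  # 1/9
```

Each cached entry is filled once, from the largest shift downwards, so every later lookup is a list access instead of a new recursive descent. Exact fractions are used here only to make the results easy to check by hand; floating-point arithmetic would serve for larger inputs.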
exact.p