
Psychology Science, Volume 47, 2005 (3/4), p. 391-400

An algorithm and tool for computing exact conditional probabilities of configuration frequencies

MANFRED BEIER1

Abstract

Traditionally the exact conditional probability of a configuration frequency is calculated with methods based on Fisher's well-known formula for two-by-two contingency tables or its extensions for tables of higher dimensions. I present here a different, combinatorial approach that shows a much better scaling behavior for an increasing number of variables, and is in principle independent of the number of categories.

Key words: Exact conditional probability, multidimensional contingency table, configural frequency analysis (CFA)

1

Dipl.-Ing. Manfred Beier, Institut für Humangenetik und Anthropologie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, D-40225 Düsseldorf, Germany; E-mail: [email protected]


1. Algorithm

Please note that in the following the term “matrix” does not refer to a contingency table but to the raw data matrix from which a contingency table can be constructed. Let $A = (a_{ij})$ be a matrix with $m$ rows (observations, sample size) and $n$ columns (variables, dimension of a contingency table). What is the probability of finding $k$ or more identical row vectors $W = (w_j)$ (configurations, patterns) in the matrix if the components (attributes) of each column vector are ordered at random? Let $F = (f_j)$ be a vector holding the frequency of $w_j$ in column $j$ (the corresponding marginal sum of a contingency table) for $j = 1, \dots, n$. For the first component of the first row, $a_{11}$, the probability of matching $w_1$ is

$$P(a_{11} = w_1;\, m, f_1) = \frac{f_1}{m}, \tag{1}$$

and likewise for $a_{12}$: $P(a_{12} = w_2;\, m, f_2) = \frac{f_2}{m}$, etc.

Assuming independence of the columns, the probability that all cells of the first row equal $W = (w_j)$ is

$$P\big((a_{11}, \dots, a_{1n}) = W;\, m, F\big) = \prod_{j=1}^{n} \frac{f_j}{m}. \tag{2}$$

The probability that at least the first $k$ rows are filled with $W$ is

$$P\big((a_{11}, \dots, a_{1n}) = \dots = (a_{k1}, \dots, a_{kn}) = W;\, m, F\big) = \prod_{j=1}^{n} \frac{f_j}{m} \cdot \frac{f_j - 1}{m - 1} \cdots \frac{f_j - k + 1}{m - k + 1} = \prod_{j=1}^{n} \frac{\binom{f_j}{k}}{\binom{m}{k}} = P_0. \tag{3}$$
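The equality of the two product forms in (3) can be checked numerically. The following is a minimal Python sketch, not part of the original R tool; the function names and the example values ($m = 10$, $F = (4, 6, 5)$, $k = 2$) are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import comb, prod


def p0_sequential(k, m, freqs):
    """Multiply the row-by-row ratios f_j/m, (f_j-1)/(m-1), ..., (f_j-k+1)/(m-k+1)."""
    result = Fraction(1)
    for f in freqs:
        for i in range(k):
            result *= Fraction(f - i, m - i)
    return result


def p0_binomial(k, m, freqs):
    """Closed form of (3): product over all columns of C(f_j, k) / C(m, k)."""
    return prod(Fraction(comb(f, k), comb(m, k)) for f in freqs)


# Hypothetical example: 10 observations, 3 variables, marginal frequencies (4, 6, 5).
print(p0_sequential(2, 10, (4, 6, 5)))  # 4/405
print(p0_binomial(2, 10, (4, 6, 5)))    # 4/405
```

Exact rational arithmetic is used here only so that the two forms can be compared without rounding error.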

The maximum possible number of $W$ is limited by $f_{\min}$, i.e. the smallest component of $F$. (The minimum possible number is given by $\max\{0, \sum_j f_j - m(n-1)\}$.) Consequently, for $k = f_{\min}$ the formula above gives the probability that exactly the first $k$ rows contain $W$, with no further occurrence in the remaining matrix. Therefore, the probability for a matrix $A_k$ with $k$ vectors $W$ occurring in arbitrary rows can be computed by summing the probabilities of all possible ways to choose $k$ rows out of $m$:

$$P(A_k;\, k = f_{\min}, m, F) = \binom{m}{k} \cdot P_0. \tag{4}$$

In the case of $n = 2$ with $f_1 = f_{\min} = k$:

$$\binom{m}{k} \cdot \prod_{j=1}^{2} \frac{\binom{f_j}{k}}{\binom{m}{k}} = \binom{m}{f_{\min}} \cdot \frac{\binom{f_{\min}}{f_{\min}}}{\binom{m}{f_{\min}}} \cdot \frac{\binom{f_2}{f_{\min}}}{\binom{m}{f_{\min}}} = \frac{\binom{f_2}{f_{\min}}}{\binom{m}{f_{\min}}}, \tag{5}$$


this is equivalent to the p-value given by the hypergeometric distribution of a 2 by 2 contingency table, here shown with Fisher's formula for $R_1 = f_1 = f_{\min} = k$ and $C_1 = f_2$:

O11 = k = fmin      O12 = 0      R1 = f1 = fmin
O21 = f2 - fmin     O22 = C2     R2 = m - fmin
C1 = f2             C2           m

$$\frac{R_1!\,R_2!\,C_1!\,C_2!}{m!\,O_{11}!\,O_{12}!\,O_{21}!\,O_{22}!} = \frac{f_{\min}!\,(m - f_{\min})!\,f_2!\,C_2!}{m!\,f_{\min}!\,0!\,(f_2 - f_{\min})!\,C_2!} = \frac{\dfrac{f_2!}{(f_2 - f_{\min})!}}{\dfrac{m!}{(m - f_{\min})!}} = \frac{\binom{f_2}{f_{\min}}}{\binom{m}{f_{\min}}}. \tag{6}$$
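As a worked check with assumed illustrative values $m = 10$, $f_1 = f_{\min} = 4$ and $f_2 = 6$ (so that $R_2 = 6$, $C_2 = 4$, $O_{21} = 2$ and $O_{22} = 4$), the binomial ratio and Fisher's formula give the same value:

$$\frac{\binom{6}{4}}{\binom{10}{4}} = \frac{15}{210} = \frac{1}{14}, \qquad \frac{4!\,6!\,6!\,4!}{10!\,4!\,0!\,2!\,4!} = \frac{6!\,6!}{10!\,2!} = \frac{518400}{7257600} = \frac{1}{14}.$$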

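For $k < f_{\min}$, the probability of exactly $k$ row vectors matching $W$ corresponds to the probability of getting at least $k$ rows and no further row:

$$P(A_k;\, k < f_{\min}, m, F) = \binom{m}{k} \cdot P_0 \cdot \Big(1 - P\big(\{A_1, \dots, A_{f_{\min}-k}\};\, m - k, (f_1 - k, \dots, f_n - k)\big)\Big). \tag{7}$$

Here the last factor is the probability that none of the remaining $m - k$ rows matches $W$, evaluated for the reduced matrix with marginal frequencies $f_j - k$.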

This leads to the following recursive formula for the probability of getting $k$ or more row vectors $W$:

$$P\big(\{A_k, \dots, A_{f_{\min}}\};\, m, F\big) = \frac{\prod_{j=1}^{n} \binom{f_j}{k}}{\binom{m}{k}^{n-1}} \cdot \Big(1 - P\big(\{A_1, \dots, A_{f_{\min}-k}\};\, m - k, (f_1 - k, \dots, f_n - k)\big)\Big) + P\big(\{A_{k+1}, \dots, A_{f_{\min}}\};\, m, F\big). \tag{8}$$

At first sight this formula seems to be computationally infeasible: the number of p-values that have to be computed grows exponentially with the distance between $k$ and $f_{\min}$. But by storing and reusing the partial results $1 - P(\{A_1, \dots, A_{f_{\min}-i}\};\, m - i, F - i)$ for all $i = k, \dots, f_{\min} - 1$, where $F - i$ denotes $(f_1 - i, \dots, f_n - i)$, the number of interim values is reduced to $\sum_{1 \le i \le f_{\min}-k+1} i = (f_{\min} - k + 1)(f_{\min} - k + 2)/2$, i.e. a quadratic growth rate. An algorithm is given below in the form of an implementation in the programming language R (www.R-project.org). In addition to the necessary “data cache” just mentioned (p1array), a second array (p0) allows the first part of the formula to be calculated using the recurrence relation

$$\frac{\prod_{j=1}^{n} \binom{f_j}{k}}{\binom{m}{k}^{n-1}} = \frac{\prod_{j=1}^{n} \binom{f_j}{k-1}}{\binom{m}{k-1}^{n-1}} \cdot \frac{\prod_{j=1}^{n} (f_j - k + 1)}{k\,(m - k + 1)^{n-1}}. \tag{9}$$
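Independently of the R implementation, the recursion (8) with cached sub-results and the recurrence (9) can be sketched in a few lines of Python. This is an illustrative cross-check under stated assumptions, not the author's code: the function names are hypothetical, `functools.lru_cache` stands in for the data cache, and exact fractions are used instead of floating-point numbers so that results can be compared exactly.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb, prod


def leading_term(k, m, freqs):
    """First factor of formula (8): prod_j C(f_j, k) / C(m, k)^(n-1)."""
    n = len(freqs)
    return Fraction(prod(comb(f, k) for f in freqs), comb(m, k) ** (n - 1))


@lru_cache(maxsize=None)
def p_at_least(k, m, freqs):
    """P(k or more rows equal W); freqs must be a tuple so it can be cached."""
    if k > min(freqs):
        return Fraction(0)  # more than f_min matching rows are impossible
    reduced = tuple(f - k for f in freqs)
    # Probability that none of the remaining m - k rows matches W.
    no_further = 1 - p_at_least(1, m - k, reduced)
    return leading_term(k, m, freqs) * no_further + p_at_least(k + 1, m, freqs)


def leading_terms_by_recurrence(k_max, m, freqs):
    """Leading terms for k = 0..k_max via recurrence (9), starting from 1 at k = 0."""
    n = len(freqs)
    terms = [Fraction(1)]
    for k in range(1, k_max + 1):
        step = Fraction(prod(f - k + 1 for f in freqs), k * (m - k + 1) ** (n - 1))
        terms.append(terms[-1] * step)
    return terms


# Hypothetical example: m = 10, F = (4, 6); agrees with the hypergeometric tail.
print(p_at_least(3, 10, (4, 6)))  # 19/42
```

For $n = 2$ the result can be compared with the hypergeometric tail probability; for the example above, $(80 + 15)/210 = 19/42$.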


exact.p