On Specifying Correlation Matrices for Binary Data

Markus Orasch, Friedrich Leisch, Andreas Weingessel

Working Paper No. 53, August 1999

SFB ‘Adaptive Information Systems and Modelling in Economics and Management Science’
Vienna University of Economics and Business Administration
Augasse 2–6, 1090 Wien, Austria
in cooperation with University of Vienna and Vienna University of Technology
http://www.wu-wien.ac.at/am

This piece of research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 (‘Adaptive Information Systems and Modelling in Economics and Management Science’).

Institut für Statistik, Wahrscheinlichkeitstheorie und Versicherungsmathematik
Technische Universität Wien
Wiedner Hauptstraße 8–10/1071
A-1040 Wien, Austria
http://www.ci.tuwien.ac.at

Abstract

We are interested in generating binary data by specifying only the marginals and the correlation matrix. However, from a practical point of view it is not obvious how to construct such a matrix, since it has to be positive (semi-) definite and satisfy some special conditions as well. Hence, using R, a free implementation of the S statistical language, we give a graphical interface to input specific marginals and correlations and to change a given ‘not working’ correlation matrix into a possibly ‘working’ one.

1 Introduction

Binary (or Bernoulli) variables are used in various applications. In particular, we are interested in the segmentation of marketing data, where the data are from customer questionnaires with ‘yes/no’ questions. With the help of artificial data we can model situations from the ‘real world’ (cf., for example, Dolnicar, Leisch, Weingessel, Buchta and Dimitriadou (1998)). A parametric model for generating correlated binary outcomes was suggested by Bahadur (1961, cf. Section 3 in this paper). However, for practical purposes the distribution given by Bahadur is not suitable for generating high-dimensional multivariate binary outcomes. A more suitable algorithm for generating these high-dimensional multivariate binary outcomes via specifying all marginals and correlations is given by Emrich and Piedmonte (1991, cf. Section 4 in this paper). As mentioned by Prentice (1988), these correlations have to satisfy special conditions depending on the given marginals. In addition, the corresponding correlation matrix has to be positive (semi-) definite. While some examples of possibly ‘working’ and ‘not working’ correlation matrices are given in Section 5, an algorithm to construct a possible correlation matrix is given in Section 6.

2 Multivariate Distributions

Along the lines of Krzanowski (1988, Chapter 7) we give some basic concepts of multivariate distributions. Let $X_1, X_2, \ldots, X_n$ be random variables and define the $n$-dimensional random vector $\mathbf{X}$ such that $\mathbf{X}^T = (X_1, X_2, \ldots, X_n)$, where $\mathbf{X}^T$ denotes the transpose of $\mathbf{X}$. Throughout this paper we will use bold letters for both vectors and matrices. The distribution of $\mathbf{X}$ is defined as $F(x_1, \ldots, x_n) = \mathbb{P}(X_1 \le x_1, \ldots, X_n \le x_n)$.

Marginal distributions are distributions of specified elements of $\mathbf{X}$ ignoring the other elements. Thus, marginal distributions are, for example, $F_{X_i}(x_i) = \mathbb{P}(X_i \le x_i)$ or $F_{X_i, X_j}(x_i, x_j) = \mathbb{P}(X_i \le x_i, X_j \le x_j)$.

Furthermore, let $\boldsymbol{\mu}$ denote the expected value of $\mathbf{X}$ and $\boldsymbol{\Sigma}$ the $n \times n$ covariance matrix with elements $\sigma_{ij} := \mathrm{Cov}(X_i, X_j) = \mathbb{E} X_i X_j - \mathbb{E} X_i \, \mathbb{E} X_j$. Obviously, when $i = j$ then $\sigma_{ii} = \mathrm{Var}\, X_i$. In vector notation as above, we can write $\boldsymbol{\Sigma} = \mathbb{E}[(\mathbf{X} - \mathbb{E}\mathbf{X})(\mathbf{X} - \mathbb{E}\mathbf{X})^T]$. When computing $\mathbf{A} \boldsymbol{\Sigma} \mathbf{A}^T$, where $\mathbf{A}$ is a diagonal matrix with elements $\sigma_{ii}^{-1/2}$ in the $i$-th row and $i$-th column, we get the so-called correlation matrix $\boldsymbol{\Delta}$ of $\mathbf{X}$, namely

\[
\boldsymbol{\Delta} =
\begin{pmatrix}
1 & \delta_{12} & \cdots & \delta_{1n} \\
\delta_{12} & 1 & \cdots & \delta_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\delta_{1n} & \delta_{2n} & \cdots & 1
\end{pmatrix},
\tag{2.1}
\]

where $\delta_{ij} := \sigma_{ij} / \sqrt{\sigma_{ii} \sigma_{jj}} \in [-1, +1]$. Such a matrix is always symmetric and positive (semi-) definite. As a consequence (cf., for example, Horn and Johnson (1985, p. 397)) it can be shown that each eigenvalue is a positive (non-negative) real number. Moreover, the trace, the determinant, and all principal minors are positive (non-negative). Furthermore, a necessary and sufficient condition for $\boldsymbol{\Delta}$ to be positive (semi-) definite is that the determinants of all the principal submatrices are positive (non-negative). A principal submatrix is any submatrix of $\boldsymbol{\Delta}$ whose diagonal is part of the principal diagonal of $\boldsymbol{\Delta}$ (cf., for example, Kraus (1987, p. 120)).
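The rescaling of a covariance matrix into a correlation matrix and the principal-minor criterion for positive (semi-) definiteness can be illustrated with a small sketch. It is written in plain Python rather than R, and all function names (`correlation_from_covariance`, `principal_minors`, and so on) as well as the tolerance `tol` are assumptions introduced for this illustration only. Since the number of principal submatrices grows as $2^n - 1$, the check is meant for small $n$.

```python
from itertools import combinations
from math import sqrt


def correlation_from_covariance(cov):
    """Rescale a covariance matrix by A = diag(cov[i][i] ** -0.5) on both sides."""
    n = len(cov)
    scale = [1.0 / sqrt(cov[i][i]) for i in range(n)]
    return [[scale[i] * cov[i][j] * scale[j] for j in range(n)] for i in range(n)]


def determinant(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def principal_minors(m):
    """Yield the determinant of every principal submatrix of m."""
    n = len(m)
    for size in range(1, n + 1):
        for idx in combinations(range(n), size):
            yield determinant([[m[i][j] for j in idx] for i in idx])


def is_positive_definite(m, tol=1e-10):
    """All principal minors strictly positive (up to the assumed tolerance)."""
    return all(d > tol for d in principal_minors(m))


def is_positive_semidefinite(m, tol=1e-10):
    """All principal minors non-negative (up to the assumed tolerance)."""
    return all(d >= -tol for d in principal_minors(m))


# Illustrative matrices (made up for this sketch, not taken from the paper).
working = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
not_working = [[1.0, 0.9, -0.9], [0.9, 1.0, 0.9], [-0.9, 0.9, 1.0]]

if __name__ == "__main__":
    print(correlation_from_covariance([[4.0, 2.0], [2.0, 9.0]]))
    print(is_positive_definite(working))
    print(is_positive_semidefinite(not_working))
```

In the second example matrix every pairwise correlation lies in $[-1, +1]$ and all $2 \times 2$ principal minors are positive, yet the full determinant is negative, so the matrix is not a valid correlation matrix.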

3 Multivariate Binary Distributions

Consider the $n$ random variables $X_1, \ldots, X_n$ with $p_i := \mathbb{P}(X_i = 1) > 0$ and $q_i := (1 - p_i) = \mathbb{P}(X_i = 0) > 0$, $1 \le i \le n$. It is well known that $\mathbb{E} X_i = p_i$ and $\mathrm{Var}\, X_i = p_i q_i$. Furthermore, let $p_{ij} := \mathbb{P}(X_i = 1, X_j = 1)$, $1 \le i, j \le n$, and moreover, $p_{12 \ldots n} := \mathbb{P}(X_1 = 1, X_2 = 1, \ldots, X_n = 1)$.

Suppose we are interested in generating multivariate binary outcomes by specifying the marginals and (2nd order) correlations of $X_1, \ldots, X_n$ only. Bahadur (1961) showed that all higher order correlations can be set equal to zero whenever the smallest characteristic root $\lambda_{\min}$ of the given correlation matrix $\boldsymbol{\Delta}$ is such that

\[
\lambda_{\min} \ge 1 - \frac{2}{\sum_{i=1}^{n} \max\left( \frac{p_i}{q_i}, \frac{q_i}{p_i} \right)}.
\tag{3.1}
\]
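The quantities in this section can be evaluated numerically with a short sketch. It assumes the relation $p_{ij} = p_i p_j + \delta_{ij} \sqrt{p_i q_i p_j q_j}$, which follows from $\sigma_{ij} = \mathbb{E} X_i X_j - \mathbb{E} X_i \, \mathbb{E} X_j$ and $\mathrm{Var}\, X_i = p_i q_i$, and it takes the smallest characteristic root as an input instead of computing it. The function names are hypothetical and chosen only for this illustration.

```python
from math import sqrt


def pair_probability(p_i, p_j, delta_ij):
    """P(X_i = 1, X_j = 1) implied by the marginals and the correlation delta_ij."""
    q_i, q_j = 1.0 - p_i, 1.0 - p_j
    return p_i * p_j + delta_ij * sqrt(p_i * q_i * p_j * q_j)


def bahadur_lower_bound(p):
    """Right-hand side of (3.1): 1 - 2 / sum_i max(p_i / q_i, q_i / p_i)."""
    total = sum(max(p_i / (1.0 - p_i), (1.0 - p_i) / p_i) for p_i in p)
    return 1.0 - 2.0 / total


def satisfies_condition(lambda_min, p):
    """Check condition (3.1) for a given smallest characteristic root."""
    return lambda_min >= bahadur_lower_bound(p)


if __name__ == "__main__":
    marginals = [0.5, 0.5, 0.5, 0.5]  # made-up marginals for illustration
    print(bahadur_lower_bound(marginals))
    print(pair_probability(0.5, 0.5, 0.2))
    print(satisfies_condition(0.6, marginals))
```

For four variables with all marginals equal to 0.5 the bound is 0.5. An equicorrelated matrix with common correlation $\rho \ge 0$ has smallest characteristic root $1 - \rho$, so in that special case the condition holds only for $\rho \le 0.5$.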

Then the joint mass function can be expressed as $f(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_i^{x_i} q_i^{1 - x_i} \Bigl[ 1 + \sum \delta_{ik} (x_i - p_i)(x_k