Logistic Biplots


José L. Vicente-Villardón, M. Purificación Galindo-Villardón, Antonio Blázquez-Zaballos


1. - INTRODUCTION

The biplot method (Gabriel, 1971) is a simultaneous graphical representation of the rows and the columns of a given data matrix. Its main uses are exploratory, although it has also been used as a graphical representation for more formal models (Gabriel, 1998). In practice, biplots are fitted either by computing the singular value decomposition (SVD) of the data matrix or by an alternating regressions procedure (Gabriel & Zamir, 1979). Jongman et al. (1987) fit the biplot by alternating a regression and a calibration step, essentially equivalent to the alternating regressions. Gower & Hand (1996) use the term interpolation rather than calibration.

When the data are binary, classical linear biplots are not suitable because the response along the dimensions is linear, just as linear regression is not suitable when the response is binary. Multiple correspondence analysis (MCA) is commonly used for categorical data and can be considered a particular form of biplot for binary (or indicator) matrices. Gower & Hand (1996) and Gabriel (1995) develop the complete theory for MCA in relation to biplots. Gower & Hand (1996) propose what they call "prediction regions" as an extension of the usual linear projections. The prediction regions are based on distances from the individual points to the category points; the representation space is divided into regions that predict each category or combination of categories.

In this paper we propose a linear biplot for binary data in which the response along the dimensions is logistic. Each individual is represented as a point and each character as a direction through the origin. The projection of an individual point onto a character direction predicts the probability of presence of that character. The method is related to logistic regression in the same way that biplot analysis is related to linear regression; thus we refer to the method as the logistic biplot. We take here an exploratory point of view, as opposed to the modelling approach in the papers by Gabriel (1998) or Falguerolles (1998): the main aim is to analyze a data matrix (individuals by variables) rather than to model a two-way (contingency) table using a bilinear model. A preliminary version of the logistic biplot was proposed by Vicente-Villardón (2001). Schein et al. (2003) propose a generalized linear model for principal components of binary data, without the biplot point of view. Our proposal is closely related to MCA and to some psychometric latent variable procedures such as item response theory or latent traits. The main theoretical results are applied to a molecular classification of cancer by gene expression monitoring.

2. - CLASSICAL BIPLOTS

Let X be a data matrix with I rows and J columns containing the measures of J variables (usually continuous) on I individuals. An S-dimensional biplot is a graphical representation of the data matrix X by means of markers $a_1, \dots, a_I$ for its rows and markers $b_1, \dots, b_J$ for its columns, in such a way that the product $a_i^T b_j$ approximates $x_{ij}$ as closely as possible. Arranging the markers as row vectors in two matrices A and B, the approximation of X can be written as $X \approx AB^T$.


Although the classical biplot is well known, we include here a description in terms of alternating regressions that relates to our proposal. We also describe the geometry of regression biplots.

2.1 Linear biplot based on alternating regressions/interpolations

If we consider the row markers A as fixed, the column markers can be computed by regression:

$$B^T = (A^T A)^{-1} A^T X \qquad (1)$$

In the same way, fixing B, A can be obtained as

$$A^T = (B^T B)^{-1} B^T X^T \qquad (2)$$

Alternating steps (1) and (2), the product $AB^T$ converges to the SVD solution. The algorithm can then be completed with an orthogonalization step to ensure the uniqueness of the solution. The regressions in (1) and (2) can be separated for each row and column of the data matrix. This symmetrical process is commonly used to fit bilinear (or biadditive) models with symmetrical roles for rows and columns. For a data matrix of individuals by variables the roles of rows and columns are not symmetrical; nevertheless, the algorithm is still valid and is interpreted as a two-step process that alternates a regression step and an interpolation/calibration step. The regression step fits a separate linear regression for each column (variable), and the interpolation step places each individual using the column markers as the reference. The geometry of the interpolation step is described in Gower & Hand (1996). A sketch of the alternating fit is given below.
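As an illustration, the following is a minimal sketch of the alternating regressions fit of steps (1) and (2), written in Python with NumPy; the function name, starting values and convergence tolerance are our own choices, not part of the original algorithm description.

```python
import numpy as np

def biplot_alternating(X, S=2, n_iter=200, tol=1e-10):
    """Fit an S-dimensional linear biplot X ~ A @ B.T by alternating regressions."""
    I, J = X.shape
    rng = np.random.default_rng(0)
    A = rng.standard_normal((I, S))              # arbitrary starting row markers
    prev = np.inf
    for _ in range(n_iter):
        B = np.linalg.lstsq(A, X, rcond=None)[0].T    # step (1): B' = (A'A)^-1 A'X
        A = np.linalg.lstsq(B, X.T, rcond=None)[0].T  # step (2): A' = (B'B)^-1 B'X'
        loss = np.sum((X - A @ B.T) ** 2)
        if prev - loss < tol:
            break
        prev = loss
    return A, B

# The product A @ B.T matches the rank-S approximation given by the SVD:
X = np.arange(12.0).reshape(4, 3) + np.random.default_rng(1).normal(size=(4, 3))
A, B = biplot_alternating(X)
U, d, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(A @ B.T, (U[:, :2] * d[:2]) @ Vt[:2], atol=1e-6)
```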

2.2 Geometry of regression biplots

The geometry of biplots for linear subspaces is described in Gower & Hand (1996), but the geometry of regression biplots is not. Let us suppose that the biplot is in two-dimensional space (a plane). Then we want to find the direction $\beta_j$, in the space L spanned by the two columns of A, such that the projections of the markers in A onto that direction predict the values of variable j as closely as possible. That is, for the j-th column $x_{(j)}$ of X:

$$x_{(j)} \approx A \beta_j \qquad (3)$$

As shown in (1), this direction is given by the markers of the j-th column. Without loss of generality we can assume that the variables, and thus the row markers, are mean centred. We add a third dimension for the j-th variable and fit the usual regression plane; let us call it H (see Figure 1).


(Figure 1 about here)

The set of points in H predicting a fixed value is given by the straight line that is the intersection of the regression plane H with the plane through the fixed value parallel to L. Different fixed values lead to different parallel straight lines in H. Let $\xi_j$ denote the line in H normal to all those straight lines (the level curves) and intersecting the third axis (at the point 0 for centred data). The line $\xi_j$ can be used as a reference for prediction, as shown in Gower & Hand (1996). The points in L predicting different values of the variable are also on parallel straight lines; the projection of $\xi_j$ onto L is perpendicular to all these lines and is called the biplot axis, with direction vector $\beta_j$ (see Figure 1). The projection of the row markers onto the biplot axis $\beta_j = (b_{j1}, b_{j2})$ gives the predictions in L. The biplot axis can be completed with scales. To find the marker on the biplot axis $\beta_j$ that predicts a fixed value $\mu$ of the observed variable, we look for the point $(x, y)$ that verifies

$$y = \frac{b_{j2}}{b_{j1}}\, x \qquad (4)$$

and

$$\mu = b_{j0} + b_{j1} x + b_{j2} y$$

Since the data are centred, $b_{j0} = 0$; solving for x and y, we obtain

$$x = \mu \frac{b_{j1}}{b_{j1}^2 + b_{j2}^2}, \qquad y = \mu \frac{b_{j2}}{b_{j1}^2 + b_{j2}^2} \qquad (5)$$

That is,

$$(x, y) = \mu \frac{b_j}{b_j^T b_j} \qquad (6)$$

Therefore, the unit marker for the j-th variable is computed by dividing the coordinates of its corresponding marker by its squared length. The goodness of fit is measured by the squared correlation coefficients $R_j^2$ of the regressions; they are interpreted as measures of the "quality of the representation", in the manner commonly used in correspondence analysis.

The interpolation of an individual with an observed vector $x_i = (x_{i1}, \dots, x_{ip})^T$, for a fixed set of column markers B, is computed as the linear combination in (2). When the columns of B are orthonormal, the combination is $a_i = B^T x_i$, i.e. the sum of vectors

$$a_i = \sum_{j=1}^{J} x_{ij} b_j \qquad (7)$$

This is the geometry of the interpolation described in Gower & Hand (1996) (see Figure 2).


(Figure 2 about here)

The unit markers for interpolation are the markers $b_j$; i.e. the interpolant of a general point x is given by the vector sum of the unit points weighted by $x_{i1}, \dots, x_{ip}$. Figure 2 illustrates a simple method for interpolating a point by summing the vectors for its three markers on the biplot axes $\beta_1$, $\beta_2$ and $\beta_3$. We illustrate the interpolation of the point (2, -3, 4): we first select the value on each biplot axis using the graduations and then sum the resulting vectors. G is the centroid of the three points, and the interpolated point is at three (the number of biplot axes) times the vector $\overrightarrow{OG}$.

Observe that the directions for interpolation are the same as for prediction, but the unit markers are different. When the columns of B are not orthonormal, we still have a linear combination, but the expressions for the unit markers are more complicated. The constraint of orthonormal B leads to the row metric preserving biplot; the same constraint applied to A leads to the column metric preserving biplot (Gabriel, 1990). The sketch below illustrates the two kinds of unit markers.
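The following sketch (Python/NumPy; all names and toy values are ours) illustrates the two kinds of unit markers for one variable: the prediction marker of equation (6), obtained by dividing $b_j$ by its squared length, and interpolation by the vector sum of equation (7).

```python
import numpy as np

b = np.array([[1.0, 0.5],      # column markers b_1, b_2, b_3 (toy values)
              [-0.3, 1.2],
              [0.8, -0.7]])

# Prediction unit marker for variable j (equation (6) with mu = 1):
j = 0
pred_unit = b[j] / (b[j] @ b[j])   # divide the marker by its squared length

# Projecting a row marker a_i onto the biplot axis predicts variable j:
a_i = np.array([2.0, -1.0])
mu_hat = a_i @ b[j]                # predicted (centred) value of variable j

# Interpolation (equation (7)): the row marker of an individual with
# observed (centred) values x = (2, -3, 4) is the weighted vector sum.
x = np.array([2.0, -3.0, 4.0])
a_new = x @ b                      # equals sum_j x_j * b_j
print(pred_unit, mu_hat, a_new)
```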

3. - LOGISTIC BIPLOT

3.1. - Formulation

Let X be a data matrix in which the rows correspond to I individuals and the columns to J binary characters. Let $\pi_{ij} = E(x_{ij})$ be the expected probability that character j is present in individual i, and $x_{ij}$ the observed value, usually either 0 or 1, resulting in a binary data matrix. The logistic biplot is formulated as

$$\pi_{ij} = \frac{e^{\,b_{j0} + \sum_s b_{js} a_{is}}}{1 + e^{\,b_{j0} + \sum_s b_{js} a_{is}}} \qquad (8)$$

where $a_{is}$ and $b_{js}$ ($i = 1, \dots, I$; $j = 1, \dots, J$; $s = 1, \dots, S$) are the model parameters used as row and column markers respectively. The model in (8) is a generalized (bi)linear model having the logit as link function:

$$\mathrm{logit}(\pi_{ij}) = \log\!\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) = b_{j0} + \sum_{s=1}^{S} b_{js} a_{is} = b_{j0} + a_i^T b_j \qquad (9)$$

where $a_i = (a_{i1}, \dots, a_{iS})^T$ and $b_j = (b_{j1}, \dots, b_{jS})^T$. In matrix form,


$$\mathrm{logit}(\Pi) = \mathbf{1} b_0^T + A B^T \qquad (10)$$

where $\Pi$ is the matrix of expected probabilities, $\mathbf{1}$ is a vector of ones, $b_0$ is the vector containing the constants, and A and B are the matrices containing the markers for the rows and columns of X. This is a matrix generalization of the logit function (Schein et al., 2003). The constants $b_{j0}$ have been added because it is not possible to centre the data matrix in the same way as in linear biplots. The constant allows the probability $\pi_{j0}$ at the point (0, 0) to be calculated; it plays the same role as the displacement of the centre of gravity (the trivial axis) in correspondence analysis, and it does not affect the prediction or the final representation. The model is closely similar to models used in the social sciences, such as latent traits, item response theory or Rasch models; in fact, a biplot is implicit in many of these models. The main difference is that here the model is descriptive, and the main issue is either the ordination of the individuals on a latent factor or dimension reduction in order to gain better insight into the interpretation of a complex problem, whereas in psychometric models the aim is to estimate the parameters of a factor model that explains the correlations between variables. Latent trait models can be found in Bartholomew and Knott (1999) and item response models in Baker (1992).
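A minimal sketch of the matrix logit model (10) in Python with NumPy (the names are ours): given markers A, B and constants b0, the expected probabilities Π are obtained by applying the inverse logit elementwise.

```python
import numpy as np

def expected_probabilities(A, B, b0):
    """Pi = inverse-logit(1 b0' + A B'), equation (10), applied elementwise."""
    eta = b0[None, :] + A @ B.T        # I x J matrix of logits
    return 1.0 / (1.0 + np.exp(-eta))  # expected probabilities pi_ij

# Toy example: 4 individuals, 3 binary characters, 2 dimensions.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))
B = rng.standard_normal((3, 2))
b0 = np.zeros(3)
Pi = expected_probabilities(A, B, b0)
```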

3.2. - Parameter Estimation

The model in (10) is similar to latent trait or item response theory models, in that the ordination axes are considered as latent variables that explain the association between the observed variables. In this framework we suppose that individuals respond independently to variables, and that the variables are independent for given values of the latent traits. With these assumptions the likelihood function is

$$\mathrm{Prob}(X \mid b_0, A, B) = \prod_{i=1}^{I} \prod_{j=1}^{J} \pi_{ij}^{x_{ij}} (1 - \pi_{ij})^{1 - x_{ij}}$$

Taking the logarithm of the likelihood function yields

$$L = \log \mathrm{Prob}(X \mid b_0, A, B) = \sum_{i=1}^{I} \sum_{j=1}^{J} \left[ x_{ij} \log(\pi_{ij}) + (1 - x_{ij}) \log(1 - \pi_{ij}) \right] \qquad (11)$$

To obtain the estimates it is necessary to take the derivatives of L with respect to all the parameters, equate them to zero and solve the resulting 3J + 2I simultaneous equations (for a two-dimensional solution). The Newton-Raphson method can be used to solve the system, but if the number of individuals or variables is large the computation becomes prohibitive. Gabriel (1998) proposed what he called generalized bilinear models, together with an estimation procedure based on segmented models; this procedure uses an alternating least squares (criss-cross) algorithm adjusting the rows and the columns separately. The procedure is efficient when the number of rows and columns is small, and it is useful for modelling a contingency table; nevertheless, when the data matrix is big the procedure is inefficient because of the size of the matrices involved.
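Before describing our algorithm, here is a direct sketch of the log-likelihood (11) in Python/NumPy (the function name is ours; a small clipping constant, our own precaution, avoids log(0)):

```python
import numpy as np

def log_likelihood(X, Pi, eps=1e-12):
    """Bernoulli log-likelihood L of equation (11) for data X and probabilities Pi."""
    Pi = np.clip(Pi, eps, 1.0 - eps)   # guard against log(0)
    return np.sum(X * np.log(Pi) + (1.0 - X) * np.log(1.0 - Pi))
```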

-7-

Our method for fitting the parameters of the logistic biplot is an iterative scheme that alternates between updates of A and B: one set of parameters is updated while the other is held fixed, and the procedure is repeated until the likelihood converges to a desired degree of precision. At each step the log-likelihood function in (11) can be separated into a part for each row and/or each column of the data matrix; we maximize each part separately, obtaining non-decreasing values of the likelihood. This sequence is expected to converge at least to a local maximum. The scheme can be considered a heuristic generalization of the regression/interpolation procedure for the classical biplot; moreover, if the data are normally distributed and we use the identity (rather than the logit) as link function, the procedure converges to the solution of the classical biplot. For A fixed, (11) can be separated into J parts, one for each variable:

$$L = \sum_{j=1}^{J} L_j = \sum_{j=1}^{J} \left( \sum_{i=1}^{I} \left[ x_{ij} \log(\pi_{ij}) + (1 - x_{ij}) \log(1 - \pi_{ij}) \right] \right)$$

Maximizing each $L_j$ is equivalent to performing a logistic regression using the j-th column of X as the response and the columns of A as regressors; this is the regression step. In the same way, the log-likelihood can be separated into several parts, one for each row of the data matrix:

$$L = \sum_{i=1}^{I} L_i = \sum_{i=1}^{I} \left( \sum_{j=1}^{J} \left[ x_{ij} \log(\pi_{ij}) + (1 - x_{ij}) \log(1 - \pi_{ij}) \right] \right)$$

To maximize each part we use the Newton-Raphson method. The partial derivatives with respect to $a_{is}$ $(s = 1, \dots, S)$ are

$$\frac{\partial L_i}{\partial a_{is}} = \sum_{j=1}^{J} x_{ij} \frac{1}{\pi_{ij}} \frac{\partial \pi_{ij}}{\partial a_{is}} + \sum_{j=1}^{J} (1 - x_{ij}) \frac{1}{1 - \pi_{ij}} \frac{\partial (1 - \pi_{ij})}{\partial a_{is}}$$

with

$$\frac{\partial \pi_{ij}}{\partial a_{is}} = b_{js}\, \pi_{ij} (1 - \pi_{ij}), \qquad \frac{\partial (1 - \pi_{ij})}{\partial a_{is}} = -b_{js}\, \pi_{ij} (1 - \pi_{ij})$$

Then

$$\frac{\partial L_i}{\partial a_{is}} = \sum_{j=1}^{J} b_{js} (x_{ij} - \pi_{ij})$$

The second derivatives are

$$\frac{\partial^2 L_i}{\partial a_{is}^2} = -\sum_{j=1}^{J} b_{js}^2\, \pi_{ij} (1 - \pi_{ij})$$


$$\frac{\partial^2 L_i}{\partial a_{is} \partial a_{is'}} = -\sum_{j=1}^{J} b_{js} b_{js'}\, \pi_{ij} (1 - \pi_{ij})$$

The iterative Newton-Raphson method is as follows:

1.- Set initial values $[a_{i1}, \dots, a_{iS}]^T_0$, usually the row markers from the previous biplot step, and set t = 0.

2.- Update $[a_{i1}, \dots, a_{iS}]^T_{t+1}$ with

$$\begin{pmatrix} a_{i1} \\ \vdots \\ a_{iS} \end{pmatrix}_{t+1} = \begin{pmatrix} a_{i1} \\ \vdots \\ a_{iS} \end{pmatrix}_{t} + \begin{pmatrix} \sum_{j=1}^{J} b_{j1}^2 \hat\pi_{ij}(1-\hat\pi_{ij}) & \cdots & \sum_{j=1}^{J} b_{j1} b_{jS} \hat\pi_{ij}(1-\hat\pi_{ij}) \\ \vdots & & \vdots \\ \sum_{j=1}^{J} b_{jS} b_{j1} \hat\pi_{ij}(1-\hat\pi_{ij}) & \cdots & \sum_{j=1}^{J} b_{jS}^2 \hat\pi_{ij}(1-\hat\pi_{ij}) \end{pmatrix}_{t}^{-1} \begin{pmatrix} \sum_{j=1}^{J} b_{j1} (x_{ij} - \hat\pi_{ij}) \\ \vdots \\ \sum_{j=1}^{J} b_{jS} (x_{ij} - \hat\pi_{ij}) \end{pmatrix}_{t} \qquad (12)$$

where $\hat\pi_{ij}$ is the estimated probability using the parameter values at update t.

3.- Increment the counter: t = t + 1.

4.- If the changes in $[a_{i1}, \dots, a_{iS}]^T$ are small, finish; if not, go to step 2.

Some problems with this procedure have been encountered when the response vectors are sparse or contain only 0's or 1's. The problem can be solved using slight corrections for expected probabilities equal to 0 or 1; nevertheless, the method without any correction has proven to work in most cases. A sketch of the update (12) is given below.
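The following is a minimal Python/NumPy sketch of the Newton-Raphson update (12) for the marker of a single row, holding B and the constants fixed, with the optional clipping correction mentioned above (function and variable names are ours):

```python
import numpy as np

def interpolate_row(x_i, B, b0, a_init, n_iter=25, tol=1e-8, eps=1e-6):
    """Newton-Raphson update (12) for one row marker a_i, holding B and b0 fixed."""
    a = a_init.copy()
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(b0 + B @ a)))   # pi_ij for this row
        pi = np.clip(pi, eps, 1.0 - eps)           # correction for 0/1 probabilities
        w = pi * (1.0 - pi)
        grad = B.T @ (x_i - pi)                    # sum_j b_js (x_ij - pi_ij)
        hess = B.T @ (B * w[:, None])              # sum_j b_js b_js' pi_ij (1 - pi_ij)
        step = np.linalg.solve(hess, grad)
        a = a + step
        if np.max(np.abs(step)) < tol:
            break
    return a
```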

To summarize, the general algorithm for the logistic biplot is as follows:

Step 1. - Choose initial values for the parameters A; for example, take A from the principal component analysis of X.

Step 2. - Orthonormalize A, to avoid indeterminacies (optional).

Step 3. - Regression step: calculate $b_{j0}, b_{j1}, \dots, b_{jS}$ using separate standard logistic regressions for each column $x_{(j)}$ of X.


Step 4. - Interpolation step: interpolate each individual separately (calculate $a_{i1}, \dots, a_{iS}$) using the Newton-Raphson method described above.

Step 5. - If the changes in the log-likelihood are small, finish; if not, go to step 2.

The orthonormalization (step 2) provides a tool for the uniqueness of the parameter estimates, in the same way as unit-length vectors are taken in principal components. The constraint can also be placed on B. The step is optional, and the orthonormalization can be done a posteriori by taking the SVD of the expected values in $AB^T$. In any case we would obtain the same space, with different rotations of the solution. Note that steps 3 and 4 (the regression and interpolation steps) can be used to project supplementary (or illustrative) variables or individuals onto the graph, or even to produce a logistic biplot when a set of coordinates is fixed and has been obtained from another technique. A sketch of the complete algorithm is given below.
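A compact sketch of the whole alternating algorithm, reusing the `interpolate_row` and `log_likelihood` helpers sketched earlier. The regression step is performed here with scikit-learn's LogisticRegression using a very large C to approximate unpenalized maximum likelihood; this choice, like all names below, is ours and is only one possible implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logistic_biplot(X, S=2, n_iter=50, tol=1e-6):
    """Alternating regression/interpolation fit of the logistic biplot (Section 3.2)."""
    I, J = X.shape
    # Step 1: initial row markers from the principal components of X.
    Xc = X - X.mean(axis=0)
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = U[:, :S] * d[:S]
    b0, B = np.zeros(J), np.zeros((J, S))
    prev = -np.inf
    for _ in range(n_iter):
        # Step 2: orthonormalize A (optional).
        A, _ = np.linalg.qr(A)
        # Step 3: regression step, one logistic regression per column
        # (columns containing both 0's and 1's are assumed; see the remark above).
        for j in range(J):
            fit = LogisticRegression(C=1e6).fit(A, X[:, j])
            b0[j], B[j] = fit.intercept_[0], fit.coef_[0]
        # Step 4: interpolation step, one Newton-Raphson update per row.
        for i in range(I):
            A[i] = interpolate_row(X[i], B, b0, A[i])
        # Step 5: stop when the log-likelihood stabilizes.
        Pi = 1.0 / (1.0 + np.exp(-(b0[None, :] + A @ B.T)))
        ll = log_likelihood(X, Pi)
        if ll - prev < tol:
            break
        prev = ll
    return A, B, b0
```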

3.3. - Geometry of logistic biplots

We describe the logistic biplot geometry for a two-dimensional solution. If we fix the markers in A and adjust the model in (8), we obtain a logistic response surface H as in Figure 3; in this case the third axis shows a scale for the expected probabilities. Although the response surface is non-linear, the intersections of the planes normal to the probability axis with H are straight lines on H. Similarly to the linear case, the lines for different probabilities are parallel. Suppose we select, on the response surface, a curve intersecting the third axis at the point $(0, 0, \pi_{j0})$ and whose tangent is perpendicular, at any point, to the corresponding prediction line. That curve is a non-linear prediction axis $\xi_j$. The projection of $\xi_j$ onto the representation space is a straight line, the biplot axis $\beta_j$; the direction of $\beta_j$ is given by $(b_{j1}, b_{j2})$, i.e. the parameters in (8).

(Figure 3 about here)

The points in L predicting different probabilities are also on parallel straight lines; this means that predictions on the logistic biplot are made in the same way as on linear biplots, i.e., projecting a row marker $a_i = (a_{i1}, a_{i2})$ onto a column marker $b_j = (b_{j1}, b_{j2})$. The biplot axis $\beta_j$ is completed with marks for the points predicting probabilities by projection; the main difference from the linear biplot is that equally spaced marks do not correspond to equally spaced probabilities. To simplify the graphical representation we propose to add marks for fixed values of the predictions, for example .25, .50 and .75; this will look like a symmetrical box plot and no labels are necessary. Other forms are possible; in the application we take just the points predicting 0.5 and 0.75 (showing the cut-off for presence and the direction of the vector), in order to simplify the graphical representation. The length of that vector can be interpreted as a measure of the discriminatory power of the character, in the sense that shorter vectors correspond to variables that better differentiate individuals. To find the scale marker for a fixed probability p, we look for the point (x, y) that predicts p and is on the biplot axis, i.e. on the line joining the points (0, 0) and $(b_{j1}, b_{j2})$:

$$y = \frac{b_{j2}}{b_{j1}}\, x \qquad (13)$$

The prediction verifies

$$\mathrm{logit}(p) = b_{j0} + b_{j1} x + b_{j2} y \qquad (14)$$

Using (13) in (14) we obtain

$$x = \frac{(\mathrm{logit}(p) - b_{j0})\, b_{j1}}{b_{j1}^2 + b_{j2}^2}, \qquad y = \frac{(\mathrm{logit}(p) - b_{j0})\, b_{j2}}{b_{j1}^2 + b_{j2}^2} \qquad (15)$$

For example, the point on axis $\beta_j$ predicting 0.5 ($\mathrm{logit}(0.5) = 0$) is

$$x = \frac{-b_{j0} b_{j1}}{b_{j1}^2 + b_{j2}^2}, \qquad y = \frac{-b_{j0} b_{j2}}{b_{j1}^2 + b_{j2}^2}$$

The final representation is a linear biplot interpreted by projection, even though the response surface is not linear. The result is not surprising because we are dealing with generalized linear models: the biplot is linear in the logit scale. The direct representation in the logit scale is difficult to interpret; the probability scale is simpler and easier to understand, especially for untrained users.
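A small sketch of the scale markers in (15), in Python/NumPy (names and toy values are ours), computing the points on a biplot axis that predict given probabilities; the distance between the 0.5 and 0.75 marks is the vector length used as a measure of discriminatory power in the application below.

```python
import numpy as np

def scale_marker(b0j, bj, p):
    """Point on the biplot axis of variable j predicting probability p, equation (15)."""
    logit_p = np.log(p / (1.0 - p))
    return (logit_p - b0j) * bj / (bj @ bj)

# Marks for the cut-off (0.5) and the direction point (0.75) of one variable:
b0j, bj = -0.4, np.array([1.2, 0.8])
start, end = scale_marker(b0j, bj, 0.5), scale_marker(b0j, bj, 0.75)
arrow_length = np.linalg.norm(end - start)   # shorter arrows = higher discrimination
```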

4.- APPLICATION: MICROARRAY GENE EXPRESSION DATA

The proposed method is useful for any binary data. We have chosen here an example taken from Golub et al. (1999), related to the classification of two different kinds of leukemia; the logistic biplot is used as a dimension reduction technique prior to classification.

Classification of patient samples is a crucial aspect of cancer diagnosis and treatment. Although cancer classification has improved over the past years, there has been no general approach for identifying new cancer classes or for assigning tumors to known classes. Recent research uses gene expression monitoring by DNA microarrays as a tool for classification. For this example we use 38 bone marrow samples divided into two groups: Acute Lymphoblastic Leukemia (ALL), with 27 members, and Acute Myeloid Leukemia (AML), with 11 members. An additional set of 34 samples (20 ALL, 14 AML) was used to validate the classification. A more detailed description of the data can be found in Golub et al. (1999). Distinguishing ALL from AML is critical for successful treatment. Remissions can be achieved using ALL therapy for AML (and vice versa), but cure rates are markedly diminished and unwarranted toxicities are encountered. Although the distinction between ALL and AML is well established, no single test is currently sufficient to establish the diagnosis; rather, current clinical practice involves several analyses, each performed in a separate, highly specialized laboratory. Golub et al. (1999) develop a systematic approach to cancer classification based on the simultaneous expression monitoring of thousands of genes using DNA microarrays.

Although the blueprint encoding all human genes is present in each cell, only a fraction of the proteins it can produce is active in any particular cell. The process of transcribing a gene's DNA sequence into RNA (which serves as a template for protein production) is known as "gene expression". A gene's expression level indicates the approximate number of copies of that gene's RNA produced in a cell, and this is thought to correlate with the amount of the corresponding protein made. A signal value is calculated that assigns a relative measure of the abundance of the transcript, and a "detection p-value" is evaluated to determine the "detection call", which indicates whether a transcript is reliably detected (present) or not detected (absent). The binary values are noisy indicators of the presence or absence of mRNA. The detection call is provided by the Affymetrix GeneChip software (details may be found at http://www.affymetrix.com). Expression chips (biological chips), manufactured using technologies derived from computer-chip production, can now measure the expression of thousands of genes simultaneously.

For our analysis we use the binary matrix with the presence/absence of 6817 genes. The main difficulty with this kind of data is the enormous dimension of the problem; the logistic biplot is used here as a tool for dimension reduction and class identification. The most important genes for the separation of the classes are identified in the biplot, and the logistic biplot is also a tool for summarizing the information of many correlated gene expressions. The first issue is to explore whether there were genes whose expression pattern was strongly correlated with the class distinction to be predicted; a sketch of this screening step is given below. The 6817 genes were sorted by their degree of correlation, and the 15 genes most correlated with the groups were selected. The number selected is somewhat arbitrary; we use a small number for purposes of illustration. A list of the selected genes is shown in Table 1.
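A minimal sketch of the gene-screening step. The implementation choice here is ours: we rank the binary detection calls by the absolute point-biserial correlation with the ALL/AML labels; Golub et al. (1999) use a related signal-to-noise statistic, so this is an illustration rather than a reproduction of their procedure.

```python
import numpy as np

def top_correlated_genes(X, y, k=15):
    """Rank binary gene columns of X by |correlation| with class labels y, keep top k."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    corr = (Xc * yc[:, None]).sum(axis=0) / denom   # point-biserial correlation
    return np.argsort(-np.abs(corr))[:k]            # indices of the k top genes

# Usage: X is the 38 x 6817 presence/absence matrix, y codes ALL = 0, AML = 1.
```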

The logistic biplot was calculated using the alternating procedure proposed above. We used the standardized principal components of the raw data matrix as the starting point for the algorithm, which converged in 10 iterations. A rough measure of the goodness of fit is the percentage of individuals correctly classified by the projections onto the variables, i.e. an expected probability greater than 0.5 predicts a presence. For these data the percentage of correct classifications in two dimensions was 96.32, so the two-dimensional biplot is an adequate summary of the binary data matrix containing 15 columns. The procedure has been applied using different numbers of genes, and in all cases a two-dimensional representation summarizes the data adequately.

(Figure 4 about here)


Figure 4 shows the graphical representation. The genes are represented by arrows pointing in the direction of increasing predicted probabilities. The start of each arrow is the point predicting 0.5 and the end the point predicting 0.75; the marks have been placed using equation (15) with the appropriate probabilities. The length of the segment is related to the discriminatory power of each character, i.e. to the capacity to predict the presence or absence of the gene: short arrows correspond to genes with greater discriminatory power. This concept is also used in item response theory (see, for example, Baker, 1992) and is related to the slope of the tangent to the logistic curve at the point predicting 0.5; greater slopes are associated with greater discriminatory power and also with shorter vectors in the graph. Long vectors predict almost constant probabilities, so they are not useful for interpretation. In this case all the variables have an acceptable quality of representation as measured by the discriminatory power. Two genes pointing in the same direction are highly correlated, two genes pointing in opposite directions are negatively correlated, and two genes forming an angle near 90º are not correlated. Table 1 shows some measures of the goodness of fit of each variable: the deviance for the comparison of the whole model with the reduced model only with a constant (p
