Classification of Microarray Data with Penalized Logistic Regression

Paul H. C. Eilers*a, Judith M. Boer b, Gert-Jan van Ommen b, Hans C. van Houwelingen a

a Department of Medical Statistics, Leiden University Medical Centre
b Department of Human Genetics, Leiden University Medical Centre

Keywords: AIC, genetic expression, cross-validation, generalized linear models, multicollinearity, multivariate calibration, ridge regression, singular value decomposition.
ABSTRACT
Classification of microarray data needs a firm statistical basis. In principle, logistic regression can provide it, modeling the probability of membership of a class with (transforms of) linear combinations of explanatory variables. However, classical logistic regression does not work for microarrays, because generally there will be far more variables than observations. One problem is multicollinearity: the estimating equations become singular and have no unique and stable solution. A second problem is over-fitting: a model may fit well to a data set, but perform badly when used to classify new data. We propose penalized likelihood as a solution to both problems. The values of the regression coefficients are constrained in a similar way as in ridge regression. All variables play an equal role; there is no ad hoc selection of "most relevant" or "most expressed" genes. The dimension of the resulting system of equations is equal to the number of variables, and generally will be too large for most computers, but it can be reduced dramatically with the singular value decomposition of some matrices. The penalty is optimized with AIC (Akaike's Information Criterion), which essentially is a measure of prediction performance. We find that penalized logistic regression performs well on a public data set (the MIT ALL/AML data).

1. INTRODUCTION
Microarrays generate a vast amount of data. One would expect statisticians to be happy about that, but they are not. The problem is called multiplicity. In standard statistical problems there is an independent variable, like blood pressure, presence or absence of a disease, or survival time, and there are explanatory variables, like age, sex and weight. If their number is relatively small and we have many observations (patients), then the machinery of classical statistics runs smoothly and provides a basis for valid statements about the significance and reliability of results. As a rule of thumb, the number of observations should be five or more times the number of explanatory variables. Array data present a totally different situation: a relatively small number of observations, generally less than 100, and a mass of explanatory variables (thousands). Practically speaking, there are infinitely many ways to choose a set of explanatory variables that give a perfect fit of a (regression) model to the data. A perfectly fitting model for a given set of data does not guarantee good predictions for new data. Several groups have approached the problem by seeking methods to select moderately sized subsets of the explanatory variables which give a good balance between fit to the data and predictive performance on separate test sets.1,2

In this paper we investigate another approach, inspired by the chemometric literature: penalized likelihood estimation.3-6 All explanatory variables are allowed into the regression model. From the log-likelihood a so-called ridge penalty is subtracted, which discourages regression coefficients from becoming large, unless they really contribute to the predictive performance of the model. One might say that the coefficients have to fight for their place under the sun. This approach has been successful in the so-called multivariate calibration problem, where the explanatory variables are the values of optical spectra at hundreds or even thousands of wavelengths. It also appears to work well with microarray data. A nice property of the penalty is that it both stabilizes the statistical problem and removes numerical degeneracy.
* Correspondence: P.O. Box 9604, 2300 RC Leiden, The Netherlands; e-mail: [email protected]; WWW: http://www.medstat.medfac.leidenuniv.nl/ms/PE/
This paper is mainly methodological. We explain penalized logistic regression and the statistical estimation procedure it leads to. A very large system of equations results, but we describe techniques to reduce its size by several orders of magnitude. We apply our approach to a published data set,1 but we do not try to give a genetic interpretation of the results.
2. LOGISTIC REGRESSION FOR CLASSIFICATION
We consider the following situation. A number of biological samples (of blood or tissue) have been collected, preprocessed and hybridized to microarrays. Each sample can be in one of two classes, e.g. acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML) for the data set of Golub et al.1 we will work with in Section 5. We try to find a procedure that uses the expression profile measured on an array to compute the probability that a sample belongs to one of the two classes. To derive the rule, a training set of arrays with known classes is available. To test the rule, a test set of arrays might be available as well. After deriving (and testing) the rule, it is to be used to classify any new array that comes along.

Our prediction rule will be based on logistic regression. Let a variable y indicate the class of a microarray: y = 0 means that an array was measured for an ALL case, while y = 1 means that it was for an AML case. Let y_i indicate the ALL/AML status of array i, i = 1, ..., m, and let x_{ij}, j = 1, ..., n, be the gene expressions for that array. Assume that we have selected the expression of one gene as a promising candidate for discrimination between ALL and AML. We would like to find a formula that gives us the probability p that an array with measured expression x (of the chosen gene) represents an AML case. As only two possibilities are considered, the probability of ALL consequently is 1 - p. A simple regression model would be p = α + βx, and α and β could be estimated with a standard statistical package, by linear regression of y on x. However, this would not be a very good idea, as there is no guarantee that 0 < p < 1, which has to be true for any decent probability. The solution is to transform p to η:

\eta = \log \frac{p}{1 - p} = \alpha + \beta x. \qquad (1)

The curve that computes p from η,

p = \frac{1}{1 + e^{-\eta}}, \qquad (2)

is called the logistic curve, hence the name logistic regression. Fast and stable algorithms exist to estimate the parameters in this model, which is a special case of the generalized linear model (GLM).7 We will not go into more detail here, as it is a special case of the penalized likelihood algorithm that will be presented in Section 3.

Figure 1 shows an example, taken from the ALL/AML data set. There are 27 zeros (ALL) and 11 ones (AML). (In the figure they have been moved vertically by a small random amount to avoid complete overlapping of the symbols.) The curve shows the best fitting (maximum likelihood) logistic curve. Good prediction of the type of leukemia is not possible from only the expression of this one gene. This can be seen from the raw data: over a range of x both y = 0 and y = 1 have been observed. It can also be seen from the rather flat slope of the curve; a perfect curve would be very near zero over the left half of the domain of x and then rise steeply.

In principle it is straightforward to extend the model with more gene expressions, introducing explanatory variables x_1, x_2, x_3, and so on:

\eta = \log \frac{p}{1 - p} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots \qquad (3)

The maximum likelihood estimation algorithm can also be straightforwardly extended to this multi-dimensional situation, so why not use all expressions on an array to build a logistic regression model for class prediction?
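For concreteness, a minimal sketch (in NumPy; the function and variable names are illustrative, not part of the original analysis) of the maximum likelihood fit of the one-gene model (1)-(2), assuming a vector x of preprocessed log-expressions and a 0/1 class vector y:

    import numpy as np

    def logistic_fit_one_gene(x, y, iterations=25):
        """Fit p = 1 / (1 + exp(-(alpha + beta * x))) by Newton's method (IRLS)."""
        Z = np.column_stack([np.ones_like(x), x])     # design matrix [1, x]
        theta = np.zeros(2)                           # start at alpha = beta = 0
        for _ in range(iterations):
            p = 1.0 / (1.0 + np.exp(-(Z @ theta)))
            w = p * (1.0 - p)                         # weights p(1 - p)
            # Newton step: solve (Z' W Z) step = Z' (y - p)
            theta += np.linalg.solve(Z.T @ (w[:, None] * Z), Z.T @ (y - p))
        return theta                                  # (alpha, beta)

    # hypothetical usage: alpha, beta = logistic_fit_one_gene(x, y)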
3. DEGENERACY AND A SOLUTION
A rule of thumb for logistic regression (and many other classification algorithms) is that the number of observations, m, should be at least five times the number of explanatory variables, n.8 This is not a guarantee for success; it only means that one is very likely to fail with fewer observations. Also, this rule applies to moderate numbers of explanatory variables, say less than ten. If their number increases, the number of observations has to rise more than proportionally.
Figure 1. An example of logistic regression with one gene expression (horizontal axis: log(expression); vertical axis: probability). The dots are the data (with a slight random vertical translation for better visibility). The (estimated) curve gives the probability of y = 1 (i.e. a case of AML) for a given value of x, the preprocessed expression.
One problem is easy to imagine: if n > m, there are more unknowns (the regression coefficients α and β) than equations; infinitely many solutions are then possible. A second problem is a perfect fit to the data. This seems paradoxical, as in most engineering disciplines a very good fit of a formula to data is something to be proud of. This is true if the equation is simple, i.e. if it has only a few parameters. But an equation with many parameters, one for each observation, is something to distrust. Intuitively one feels that such an equation would not give good predictions (p) from new data (x's), because the noise in the first data set has not been filtered out.

A third problem is multicollinearity, which often occurs long before n approaches m. A thorough analysis requires the singular value decomposition (SVD), which will be introduced below, but a more intuitive explanation is the following. As more and more explanatory variables are introduced into a regression problem, the chance grows that some of these vectors display nearly the same pattern of high and low values. If a newly added x has exactly the same pattern as one of the already present x's, it supplies no new information and we call the two vectors collinear. If several (groups of) x's show identical patterns, we have multicollinearity. Things can go wrong in even more ways: if one of the x's can be constructed as a linear combination of other x's, we are also in trouble. In the context of microarrays multicollinearity is very well conceivable: most probably there will be genes with nearly identical expression patterns. This we might call fundamental multicollinearity. We may also encounter accidental multicollinearity: the expressions are measurements with limited precision. There are many sources of error in the process that leads from a colour picture of the fluorescence pattern on a microarray to the numbers that we use for statistical modeling. But whatever the source, multicollinearity wreaks havoc on classical logistic regression.

So, logistic regression with microarrays seems doomed to failure: far too few observations and hence an infinity of possibilities for perfect but meaningless fits. But even if we had collected thousands of arrays, multicollinearity would crash the whole enterprise. Luckily, a solution is available that has proved its value in chemometrics. Classical chemical analysis is an accurate but time-consuming way to measure concentrations of substances. On the other hand, optical instruments can measure many types of spectra under different conditions quickly and cheaply. This presents the multivariate calibration (MVC) problem: build a linear regression model that estimates an unknown chemical concentration from an optical spectrum. One collects m samples with concentrations y_i, as measured by a classical method, and m optical spectra with the instrument of choice. A typical spectrum consists of 400 to 1500 numbers at different optical wavelengths, and the typical number of samples is 100 or less, so multicollinearity problems are sure to occur. A very readable and non-technical account of multivariate calibration is the paper by Thomas.9 A more technical overview paper, with an interesting discussion, is the one by Frank and Friedman.10
The multicollinearity problem is well known to statisticians and many solutions have been proposed: variable selection, principal component regression, partial least squares and penalized estimation. We will only consider the latter. It was proposed under the name "ridge regression" by Hoerl and Kennard11 in a general context. It gained popularity in the chemometric community. This generated comments by Fearn,12 who advised against it, but he was rebutted by Hoerl, Kennard and Hoerl.4 For our purpose ridge regression is very attractive, as will be seen.

In linear regression the dependent variable y is modeled as a linear combination of the explanatory variables, the columns of X:

\mu_i = E(y_i) = \sum_{j=1}^{n} x_{ij} \beta_j. \qquad (4)

The regression coefficients can be estimated by minimizing the sum of squares

S = \sum_{i=1}^{m} \Big( y_i - \sum_{j=1}^{n} x_{ij} \beta_j \Big)^2, \qquad (5)

leading to the estimates

\hat\beta = (X'X)^{-1} X'y. \qquad (6)
When multicollinearity is present this system has no unique solution and some or all of the elements of the coefficient vector will be very large. Hoerl and Kennard's11 remedy is to add the sum of the squares of the regression coefficients, weighted by a parameter λ, to S:

S^* = \sum_{i=1}^{m} \Big( y_i - \sum_{j=1}^{n} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{n} \beta_j^2. \qquad (7)
The second term is called a penalty, as it discourages high values of the elements of β. The estimating equations are

\hat\beta = (X'X + \lambda I)^{-1} X'y, \qquad (8)
where I is the identity matrix. It can be shown that a unique solution is obtained if λ > 0.
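As an illustration, the ridge estimate (8) is a single linear solve in any matrix language; a minimal NumPy sketch, assuming X, y and a chosen penalty value lam are given:

    import numpy as np

    def ridge_estimate(X, y, lam):
        """Ridge regression estimate, cf. equation (8): (X'X + lambda I)^(-1) X'y."""
        n = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)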
We now describe penalized likelihood for logistic regression and the estimating equations that follow from it. Let the outcomes be y_i, i = 1, ..., m, and the explanatory variables x_{ij}, j = 1, ..., n. The model is

\eta_i = \log \frac{p_i}{1 - p_i} = \alpha + \sum_{j=1}^{n} \beta_j x_{ij}, \qquad (9)

where p_i is the probability of observing y_i = 1. In the jargon of generalized linear models7 η is called the linear predictor, as it is a linear combination of the explanatory variables. It is connected to p by a non-linear, so-called link function. The inverse relationship is p = 1/(1 + e^{-η}), the logistic curve. The log-likelihood is
l = \sum_{i=1}^{m} y_i \log p_i + \sum_{i=1}^{m} (1 - y_i) \log(1 - p_i). \qquad (10)

The penalized log-likelihood is

l^* = l - \lambda \sum_{j=1}^{n} \beta_j^2 / 2. \qquad (11)
The second term is the ridge penalty; note that the offset α does not occur in it: only the regression coefficients are penalized. The parameter λ regulates the penalty: the larger λ, the stronger its influence and the smaller the elements of β are forced to be. The division by 2 is only for convenience: it drops out after differentiation. From ∂l*/∂α = 0 and ∂l*/∂β_j = 0 follows the system of penalized likelihood equations
u'(y - p) = 0, \qquad (12)

X'(y - p) = \lambda \beta, \qquad (13)
where u is an m-vector of ones. The equations are non-linear, because of the non-linear relationship between p and α and β. A first order Taylor expansion gives

p_i \approx \tilde p_i + \frac{\partial p_i}{\partial \alpha}(\alpha - \tilde\alpha) + \sum_{j=1}^{n} \frac{\partial p_i}{\partial \beta_j}(\beta_j - \tilde\beta_j), \qquad (14)

where a tilde, as in \tilde\alpha, indicates an approximate solution to the penalized likelihood equations. Now

\frac{\partial p_i}{\partial \alpha} = p_i (1 - p_i), \qquad (15)

\frac{\partial p_i}{\partial \beta_j} = p_i (1 - p_i) x_{ij}. \qquad (16)

Using this, and introducing w_i = p_i(1 - p_i) and W = diag(w), we arrive at

u' \tilde W u \, \alpha + u' \tilde W X \beta = u'(y - \tilde p + \tilde W \tilde\eta), \qquad (17)

X' \tilde W u \, \alpha + (X' \tilde W X + \lambda I) \beta = X'(y - \tilde p + \tilde W \tilde\eta). \qquad (18)
We now have a linearized system, and iterating with it generally leads to a solution quickly. In most cases 10 iterations are enough. Suitable starting values are \tilde\alpha = \log[\bar y / (1 - \bar y)], with \bar y = \sum_{i=1}^{m} y_i / m, and \tilde\beta = 0. If we introduce \theta = [\alpha | \beta] and Z = [u | X], we can write the equations as

(Z' \tilde W Z + R)\theta = Z'(y - \tilde p + \tilde W Z \tilde\theta), \qquad (19)
where R is λ times the (n + 1) by (n + 1) identity matrix, with r_{11} set to zero to reflect that there is no penalty on α.

The parameter α is strongly related to the fraction of 1's in y. From ∂l*/∂α = 0 followed (12), which we can write as

\sum_{i=1}^{m} y_i = \sum_{i=1}^{m} p_i \quad \text{or} \quad \sum_{i=1}^{m} y_i / m = \bar y = \sum_{i=1}^{m} p_i / m = \bar p. \qquad (20)
Thus the mean of p has to be equal to the mean of y, i.e. the fraction of 1's in y.

In our application the ridge parameter λ has a crucial influence on the solution and we need a procedure to estimate an "optimal" value from the data. Such a procedure should be based on the predictive performance of the model. After all, the fit to a training set is not the interesting issue. What really matters is using the set of expressions measured on a new microarray to classify whether it is a case of ALL or AML. One possible approach is to set apart some of the data, fit the model to the rest and see how well it predicts. This is called cross-validation, and several schemes can be conceived. One is to set apart, say, one third of the data. More complicated is "leave-one-out" cross-validation: each of the m observations is set apart in turn and the model is fitted to the m - 1 remaining ones. This is rather expensive, as the amount of work is proportional to m(m - 1). To measure performance in cross-validation, several measures can be used13:
- The fraction of misclassifications.

- The strength of the prediction: the Brier score \sum_i (y_i - \hat p_{-i})^2 or the prediction log-likelihood \sum_i [y_i \log \hat p_{-i} + (1 - y_i) \log(1 - \hat p_{-i})]. Here the subscript -i in \hat p_{-i} indicates that observation i was left out and that the probability was estimated using a model fitted to the remaining ones.
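For concreteness, a small sketch of these summaries (assuming a vector p_loo of leave-one-out probabilities and the 0/1 class vector y; the misclassification count uses a 0.5 cut-off here):

    import numpy as np

    def cv_summaries(y, p_loo):
        """Misclassification fraction, Brier score and prediction log-likelihood
        computed from leave-one-out probabilities p_loo."""
        misclass = np.mean((p_loo > 0.5) != (y == 1))
        brier = np.sum((y - p_loo) ** 2)
        pred_loglik = np.sum(y * np.log(p_loo) + (1 - y) * np.log(1 - p_loo))
        return misclass, brier, pred_loglik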
A different approach is to use Akaike's Information Criterion (AIC). Akaike showed that the predictive likelihood of a model can be estimated by subtracting the number of parameters of the model from the maximized log-likelihood. AIC strikes the right balance between complexity and fidelity to the data. Burnham and Anderson14 are a good source on AIC, both on an introductory and a technical level. AIC is defined as
\mathrm{AIC} = \mathrm{Dev}(y | \hat p) + 2\,\mathrm{Dim}, \qquad (21)

where Dev(·) is the deviance (equal to -2l) and Dim the effective dimension of the model. In penalized estimation, Dim is not equal to the length of the parameter vector. Hastie and Tibshirani15 discussed this. They gave good arguments to estimate it as

\mathrm{Dim} = \mathrm{trace}[Z (Z' W Z + R)^{-1} Z' W]. \qquad (22)

This is the expression we will use.
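To fix ideas, a minimal NumPy sketch of the estimating equations (19) together with the effective dimension (22) and AIC (21) might look as follows; the function name penalized_logistic is illustrative, and this direct form builds the full (n + 1) by (n + 1) system, whose reduction is the subject of the next section.

    import numpy as np

    def penalized_logistic(X, y, lam, iterations=15):
        """Ridge-penalized logistic regression via the linearized system (19).
        Returns the intercept alpha, the coefficients beta, and AIC (21)."""
        m, n = X.shape
        Z = np.column_stack([np.ones(m), X])      # Z = [u | X]
        R = lam * np.eye(n + 1)
        R[0, 0] = 0.0                             # no penalty on the intercept alpha
        ybar = y.mean()
        theta = np.zeros(n + 1)
        theta[0] = np.log(ybar / (1.0 - ybar))    # starting values: alpha only
        for _ in range(iterations):
            eta = Z @ theta
            p = 1.0 / (1.0 + np.exp(-eta))
            ZtW = Z.T * (p * (1.0 - p))           # Z'W with W = diag(p(1 - p))
            A = ZtW @ Z + R                       # Z'WZ + R
            theta = np.linalg.solve(A, Z.T @ (y - p) + ZtW @ eta)
        # effective dimension (22) and AIC (21) at the final estimate
        eta = Z @ theta
        p = 1.0 / (1.0 + np.exp(-eta))
        ZtW = Z.T * (p * (1.0 - p))
        A = ZtW @ Z + R
        dim = np.trace(Z @ np.linalg.solve(A, ZtW))
        deviance = -2.0 * np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
        return theta[0], theta[1:], deviance + 2.0 * dim

Scanning a grid of λ values and keeping the one with the smallest returned AIC reproduces, in outline, the procedure used in Section 5.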
4. COMPUTATIONAL ASPECTS
The system of equations (17, 18) is large: thousands of equations, with an equal number of unknowns. Even for modern computers this can be problematic: merely storing the equations in memory, assuming 8-byte floating point numbers, already takes over 100 Mb! Solving the system takes a fair amount of time. With larger microarrays, which might easily contain 10^4 expressions, the memory size has to be over 1 Gb and computation time gets out of control. Fortunately there are several ways to reduce the number of equations enormously.

The first option is to use the singular value decomposition (SVD)16 of X: X = USV', with U and S of size m by m, V of size n by m, U'U = I_m, V'V = I_m and S diagonal. In modern systems for numerical or statistical computation, like Matlab or S-Plus, the SVD is a standard function. Assume that β = Vγ. Then we can write (17, 18) as
u' \tilde W u \, \alpha + u' \tilde W U S V' V \gamma = u'(y - \tilde p + \tilde W \tilde\eta), \qquad (23)

V S U' \tilde W u \, \alpha + (V S U' \tilde W U S V' V + \lambda V)\gamma = V S U'(y - \tilde p + \tilde W \tilde\eta). \qquad (24)

Multiplying the second system by V' we get

u' \tilde W u \, \alpha + u' \tilde W U S \gamma = u'(y - \tilde p + \tilde W \tilde\eta), \qquad (25)

(U S)' \tilde W u \, \alpha + [(U S)' \tilde W U S + \lambda I]\gamma = (U S)'(y - \tilde p + \tilde W \tilde\eta). \qquad (26)
The length of γ is m, the number of microarrays. This is also the size of the system in (26). Only a small amount of memory is needed and the equations can be solved quickly. Effectively the matrix X, with m rows and n columns, is replaced by the matrix US, with m rows and m columns.

A second approach is to start with (13): it says that β = X'(y - p)/λ = X'δ. Introducing C = XX' and rewriting (9) as

\eta_i = \log \frac{p_i}{1 - p_i} = \alpha + \sum_{j=1}^{m} c_{ij} \delta_j, \qquad (27)
and going through derivations analogous to the ones that led from (12) and (13) to (17, 18), we get

u' \tilde W u \, \alpha + u' \tilde W C \delta = u'(y - \tilde p + \tilde W \tilde\eta), \qquad (28)

C \tilde W u \, \alpha + (C \tilde W C + \lambda C)\delta = C(y - \tilde p + \tilde W \tilde\eta). \qquad (29)
Like γ, δ has length m and we have a small system of equations again.

A possible advantage of the SVD is that it allows data reduction. If the diagonal elements of S, the singular values, are ordered from large to small (and the corresponding columns of U and V accordingly), then

\hat x_{ij} = \sum_{k=1}^{t} s_k u_{ik} v_{jk} \qquad (30)
is the optimal t-dimensional approximation to X. For the application to be presented below, such a reduction is not necessary, so we use the full SVD of X. But in future studies the number of arrays might grow into the hundreds, and then the approximation might be very useful to keep the size of the systems of equations limited. Of course, only analysis of real data can determine how large t has to be. A good approximation to X itself does not necessarily mean good performance as explanatory variables.
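A sketch of the reduction of (25, 26): replace X by US, fit in the m-dimensional space, and map γ back to β = Vγ. The sketch below reuses the illustrative penalized_logistic helper from the end of Section 3.

    import numpy as np

    def penalized_logistic_svd(X, y, lam, iterations=15):
        """Fit the ridge-penalized logistic model after projecting X (m x n)
        onto its m singular vectors; map gamma back to beta = V gamma."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)     # X = U diag(s) V'
        US = U * s                                           # m x m matrix U S
        alpha, gamma, aic = penalized_logistic(US, y, lam, iterations)
        beta = Vt.T @ gamma                                  # beta = V gamma
        return alpha, beta, aic

Because V'V = I, the penalty λ Σ γ_k^2 on the reduced coefficients equals λ Σ β_j^2 on the original ones, so the reduced fit is equivalent to the direct one.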
5. RESULTS

To test our approach we used the MIT data set (see www.genome.wi.mit.edu/MPR) on acute leukemia. See the paper by Golub et al.1 for details. The website and the paper are not clear about preprocessing, but in a report by Dudoit et al.17 we found a description, attributed to personal communication with Tamayo, one of the authors of the paper mentioned. The recipe is:

- Threshold the expressions in X with a floor of 100 and a ceiling of 16000.

- Eliminate columns of X with max/min ≤ 5 and max - min ≤ 500.

- Take logarithms to base 10.

Figure 2. Akaike's Information Criterion (AIC, left panel) and the effective dimension of the model (right panel) as a function of the penalty parameter λ (horizontal axes: log λ).
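A sketch of this recipe (the thresholds follow the list above; X_raw is assumed to be a NumPy array with arrays in rows and genes in columns, and the elimination rule is coded as worded, i.e. a gene is dropped when both conditions hold):

    import numpy as np

    def preprocess(X_raw):
        """Threshold, filter and log-transform the raw expression matrix."""
        X = np.clip(X_raw, 100, 16000)               # floor 100, ceiling 16000
        mx, mn = X.max(axis=0), X.min(axis=0)        # per-gene max and min (mn >= 100)
        drop = (mx / mn <= 5) & (mx - mn <= 500)     # elimination rule as worded above
        return np.log10(X[:, ~drop]), ~drop          # log10 expressions, kept genes

Depending on how the elimination rule is read (both conditions versus either condition), the number of surviving genes may differ slightly from the count reported below.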
After applying this procedure we are left with 3571 genes, exactly the number Dudoit et al.17 report (the raw data contain 7129 genes). The training data set consists of 27 ALL and 11 AML cases, to which we applied our model. To find an optimal value of λ, it was varied in steps over a large range (1 to 10^5), using 25 linearly spaced values of log λ. The left panel of Figure 2 shows how AIC changes with λ. The optimal value turned out to be 400. As the curve of AIC is rather flat near the minimum, there is no need to improve on the grid search with a more sophisticated algorithm. The right panel of Figure 2 shows how the effective dimension of the model decreases at higher values of λ. When λ = 400, Dim = 4.5.

The left panel of Figure 3 gives an impression of the estimated coefficients. The right panel shows a graph of σ̂_j β̂_j, where σ̂_j is the estimated standard deviation of column j of X. As the contribution to the linear predictor η_i is x_{ij} β̂_j, σ̂_j β̂_j is a good indicator of the influence of column j. Figure 4 shows histograms of these results. They have a more or less normal shape, but the tails appear to be heavier.

The left panel of Figure 5 shows the observed y versus the estimated probabilities (of AML). A small random vertical shift has been used to avoid complete overlapping of symbols. A dotted vertical line has been drawn at the position p = ȳ. Without any knowledge of the expressions, but knowing that a fraction ȳ of the arrays is from AML cases, one would classify an array as AML with probability ȳ. Thus a very simple decision rule would be to compute p̂ and see whether it is lower or higher than ȳ. The figure shows that with the optimal value of AIC a model results that separates the ALL and AML cases with this simple decision rule. The verdict is not strong: ideally one would see probabilities that are very near to either zero or one.

Of course, the real test is the classification of new data that were not used for the estimation of the model. This is shown in the right panel of Figure 5. The test set consists of 20 cases of ALL and 14 of AML; it was also obtained from the MIT website. The dotted line again represents a threshold that could be used for a simple decision rule, as described in the preceding paragraph. We see that this rule would put three arrays in the wrong class.

Figure 6 shows how the model probabilities (left panel) and the prediction probabilities (middle panel) change with the penalty parameter λ. The latter graph is intriguing. AIC has a reputation of undersmoothing, i.e. choosing models with generally too large an effective dimension. Various procedures have been developed to counteract this behaviour.14 Apparently undersmoothing is not the case here: the effective dimension (4.5) is rather low.
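The weighted coefficients σ̂_j β̂_j and the simple p̂ versus ȳ decision rule are equally easy to compute; the sketch below assumes the fitted alpha and beta from the earlier sketches, the preprocessed training data X and y, and a matrix X_new of identically preprocessed new arrays.

    import numpy as np

    def rank_and_classify(X, y, X_new, alpha, beta):
        """Gene ranking by weighted coefficients and the p-hat > y-bar rule."""
        weighted = X.std(axis=0) * beta                   # sigma_j * beta_j per gene
        order = np.argsort(weighted)                      # most negative ... most positive
        p_new = 1.0 / (1.0 + np.exp(-(alpha + X_new @ beta)))
        return order, p_new > y.mean()                    # ranked genes, AML calls

The first and last 50 entries of order then correspond, roughly, to the genes displayed in Figure 7.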
Figure 3. Estimated regression coefficients β̂ (left panel) and weighted regression coefficients σ̂β̂ (right panel) for each of the 3571 expressions (horizontal axis: gene number).
Figure 4. Histograms of the estimated regression coefficients β̂ (left) and the weighted regression coefficients σ̂β̂ (right).
Figure 5. Actual class versus probability of AML for the training set (left panel) and the test set (right panel); 0 = ALL, 1 = AML (with a random vertical shift for visibility).
Figure 6. Classification probabilities as a function of the penalty parameter λ: left panel for the training set, middle panel for the test set. ALL cases are marked with 'o', AML cases with '+'. The right panel shows the log-likelihood of the predictions for the test set (horizontal axes: log λ).
Figure 6 gives a strong impression that better results are possible with a smaller penalty. If we had used the test set for cross-validation, a smaller λ would certainly have been indicated. This will be a subject of further research.

It is interesting to use the model output to display the data in a special way. We sort σ̂β̂ and order the corresponding preprocessed expressions accordingly. Figure 7 presents the 50 rows with the most negative (left panel) and the most positive (right panel) weighted coefficients. A light spot indicates strong expression. (Note that here we follow the convention of having rows represent genes and columns arrays; in the statistical part it was the other way around.) The pictures clearly indicate the differences between the two groups. Arrays numbered 1 to 27 are ALL and 28 to 38 are AML. It is also clear that there are no expressions that uniformly show high expression in one group.
Figure 7. Preprocessed expression levels, selected and sorted according to the most negative (left panel) and most positive (right panel) weighted coefficients (horizontal axis: array number; vertical axis: position in the sorted list).
6. DISCUSSION
We have shown that a ridge penalty can make logistic regression with array data an effective tool for classification. Projection on singular vectors makes it practical, reducing the size of the system of estimating equations enormously. Akaike's Information Criterion, which is easily calculated, is a reasonable indicator of predictive performance, as was borne out by testing the classification of a second data set.

The use of penalties in logistic regression is not new. le Cessie and van Houwelingen18 applied it in the context of a survival problem, and Marx and Eilers6 presented a general approach to generalized linear modeling under multicollinearity. In these applications, however, the explanatory variables are ordered: histograms and optical or acoustic spectra. This offers additional possibilities to strengthen the solution, by exploiting penalties that enforce smoothness: regression coefficients for neighboring spectral channels or histogram bins are forced to have nearly identical values. Microarray data have no order: statistically speaking, each gene stands on its own.

In a Bayesian context a penalty can be interpreted as the logarithm of the prior density of the parameters. The parameter λ plays the role of the inverse of the variance of the prior. Effectively we used AIC to estimate this unknown variance from the data. West et al.19 take a fully Bayesian approach, using Markov chain Monte Carlo simulation to estimate the parameters of the model. This is computationally much more intensive than our approach, but it has the advantage that estimation uncertainties are correctly quantified within the Bayesian framework. Standard errors can also be estimated in the penalized likelihood case, but as far as we know there exists no good theory to account for the uncertainty in the optimal λ as determined by AIC.

Our penalty is based on the sum of squares of the coefficients. It has nice mathematical properties: only an extra diagonal matrix λI is introduced in the estimation algorithm. A disadvantage is that all coefficients shrink towards zero, but they never disappear. A zero coefficient indicates that a gene has no influence on the predicted probability. A large number of zero coefficients is attractive, as this means a compact model. Of course, one hopes that the statistically irrelevant genes are also biologically irrelevant. A penalty based on the sum of the absolute values of the coefficients20 achieves the desired elimination of many coefficients, but it leads to a complicated optimization problem.

Like we do, West et al.19 use the singular value decomposition to reduce the size of the system of equations and to speed up the computations. Principal component regression implicitly exploits these advantages, but the combination with a penalty is rare, although the singular value decomposition is a standard tool to explain the effect of a ridge penalty in linear estimation.21 We use singular vectors only as a computational tool. Alter et al.22 propose a deeper meaning for "eigengenes" and "eigenarrays". We do not pursue this topic here.

The regression framework is more attractive than (penalized) classification methods, because it can be extended straightforwardly to other types of response. In the present application it is a simple yes (AML) or no (ALL), but in other situations it might be a continuous measure, a count or a (censored) survival time. For all these types of variables well-established regression frameworks exist. By adding a penalty they can be adapted to microarray data.

The ALL/AML data represent absolute expression levels. In the raw data a number is always reported, although it might be too low to be meaningful. The preprocessing steps eliminate nearly half of the data, so that only genes with relatively large and relatively variable expressions survive. An interesting subject for further study is relaxing the preprocessing criteria. If a penalty does its work well, it gives "junk" expressions no chance, and one would expect preprocessing not to be very critical.

Array data that represent ratios of two fluorescence levels (of red and green) generally show a large proportion of missing data. It is inefficient, when estimating the logistic model, to leave out a whole gene just because its expression could not be measured in only a few arrays. Another type of missing value problem can occur after fitting the model (with genes that were all completely measured): in a new array some of the expressions might be missing. It is not an attractive option to re-estimate the model with only the genes that are complete over all arrays. So, an important goal of further research will be the development of systematic ways of handling missing data.

The output of our algorithm is a vector of thousands of regression coefficients. For a statistician the job might end there, but for the geneticist this result is hardly useful. A whole array of additional tools, for display, sorting, selection and labeling with gene names, is necessary. A large additional software engineering effort is needed, for which statisticians are not optimally equipped. There is a need for general frameworks that offer the interface and the tools for interaction of the geneticist with the output of large-scale statistical models.
Acknowledgments
We thank Bart Mertens for careful reading of and commenting on the manuscript in various stages.

REFERENCES
1. T. R. Golub, D. K. Slonim, and P. T. et al., "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring," Science 286, pp. 531-537, 1999.
2. T. R. Hastie, R. J. Tibshirani, M. B. Eisen, A. Alizadeh, R. Levy, L. Staudt, W. C. Chan, D. Botstein, and P. O. Brown, "'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns," Genome Biology 1, pp. 3-21, 2000.
3. H. Martens and T. Naes, Multivariate Calibration, Wiley, Chichester, 1989.
4. A. E. Hoerl, R. W. Kennard, and R. W. Hoerl, "Practical use of ridge regression: a challenge met," Applied Statistics 34, pp. 114-120, 1985.
5. B. D. Marx and P. H. C. Eilers, "Generalized linear regression on sampled signals with penalized likelihood," in Proceedings of the 11th International Workshop on Statistical Modelling, A. F. et al., ed., pp. 259-266, Graphos, Citta di Castello, 1996.
6. B. D. Marx and P. H. C. Eilers, "Generalized linear regression on sampled signals and curves: A P-spline approach," Technometrics 41, pp. 1-13, 1999.
7. P. McCullagh and J. A. Nelder, Generalized Linear Models, 2nd edition, Chapman and Hall, London, 1989.
8. P. Peduzzi, C. Concato, and E. K. et al., "A simulation study of the number of events per variable in logistic regression analysis," Journal of Clinical Epidemiology 49, pp. 1373-1379, 1996.
9. E. V. Thomas, "A primer on multivariate calibration," Analytical Chemistry 66, pp. 795A-804A, 1994.
10. I. E. Frank and J. H. Friedman, "A statistical view of some chemometric regression tools," Technometrics 35, pp. 109-148, 1993.
11. A. E. Hoerl and R. W. Kennard, "Ridge regression: Applications to nonorthogonal problems," Technometrics 17, pp. 69-82, 1970.
12. T. Fearn, "A misuse of ridge regression in the calibration of a near infrared reflectance instrument," Applied Statistics 32, pp. 73-79, 1983.
13. J. C. van Houwelingen and S. le Cessie, "Predictive value of statistical models," Statistics in Medicine 9, pp. 1303-1325, 1990.
14. K. P. Burnham and D. R. Anderson, Model Selection and Inference, Springer, New York, 1998.
15. T. J. Hastie and R. J. Tibshirani, Generalized Additive Models, Chapman and Hall, London, 1990.
16. D. S. Watkins, Fundamentals of Matrix Computations, Wiley, New York, 1991.
17. S. Dudoit, J. Fridlyand, and T. P. Speed, "Comparison of discrimination methods for the classification of tumors using gene expression data," Tech. Rep. 576, Department of Statistics, University of California, Berkeley, June 2000.
18. S. le Cessie and J. C. van Houwelingen, "Ridge estimators in logistic regression," Applied Statistics 41, pp. 191-201, 1990.
19. M. West, J. R. Nevins, and J. R. M. et al., "DNA microarray data analysis and regression modeling for genetic expression profiling," Tech. Rep. 00-15, Institute of Statistics and Decision Sciences, Duke University, 2000.
20. R. J. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B 58, pp. 267-288, 1996.
21. R. H. Myers, Classical and Modern Regression with Applications, Duxbury Press, Boston, 1986.
22. O. Alter, P. O. Brown, and D. Botstein, "Singular value decomposition for genome-wide expression data processing and modeling," Proceedings of the National Academy of Sciences 97, pp. 10101-10106, 2000.