Assessment of reliability of microarray data and ... - Semantic Scholar

5 downloads 0 Views 485KB Size Report
Apr 27, 2004 - The bivariate analysis approach was found to be more accurate than the univariate approach; both approaches were superior to a conventional.
Published online April 27, 2004 Nucleic Acids Research, 2004, Vol. 32, No. 8 2323±2335 DOI: 10.1093/nar/gkh544

Assessment of reliability of microarray data and estimation of signal thresholds using mixture modeling Musa H. Asyali*, Mohamed M. Shoukri, Omer Demirkaya and Khalid S. A. Khabar1 Department of Biostatistics, Epidemiology, and Scienti®c Computing and 1Department of Biological and Medical Research, King Faisal Specialist Hospital and Research Center, PO Box 3354, MBC-03, Riyadh, 11211, Saudi Arabia Received December 28, 2003; Revised March 15, 2004; Accepted March 24, 2004

ABSTRACT DNA microarray is an important tool for the study of gene activities but the resultant data consisting of thousands of points are error-prone. A serious limitation in microarray analysis is the unreliability of the data generated from low signal intensities. Such data may produce erroneous gene expression ratios and cause unnecessary validation or postanalysis follow-up tasks. In this study, we describe an approach based on normal mixture modeling for determining optimal signal intensity thresholds to identify reliable measurements of the microarray elements and subsequently eliminate false expression ratios. We used univariate and bivariate mixture modeling to segregate the microarray data into two classes, low signal intensity and reliable signal intensity populations, and applied Bayesian decision theory to ®nd the optimal signal thresholds. The bivariate analysis approach was found to be more accurate than the univariate approach; both approaches were superior to a conventional method when validated against a reference set of biological data that consisted of true and false gene expression data. Elimination of unreliable signal intensities in microarray data should contribute to the quality of microarray data including reproducibility and reliability of gene expression ratios. INTRODUCTION DNA microarray technology provides a powerful and ef®cient means for measuring the relative abundance of expressed genes in a variety of applications (reviewed in 1±3). In any microarray experiment, only a small fraction of the genes become expressed as a result of the investigated conditions. Thus, a large portion of the microarray data is comprised of low signal intensities that cause variability or impair reproducibility of the measured ratios between control and experimental samples. There are also other situations that give rise to low signal values such as the deposition of sub-optimal

amounts of the probes, quality of probes, or incorrect segmentation of the spots. The aim of the approach described in this paper is to ®nd the optimal signal intensity thresholds to identify and eliminate the unreliable low signal intensities so that only reliable measurements are acquired from microarray data and, consequently, accurate ratios of gene expression are obtained. Here, we describe a statistical approach based on normal mixture modeling (4) to segregate the microarray data into two classes, namely classes of reliable and unreliable data. Normal mixture models are used to model data that contain more than one population conforming to normal distribution and ®nd use in many biological applications. We used an expectation maximization (EM) algorithm to estimate the weighted normal mixture components, i.e. class posterior probabilities (4,5). Subsequently, we applied Bayesian decision theory (6,7) to ®nd the optimal signal thresholds that discriminate between the reliable and the unreliable (low) signal intensity populations, based on the estimated posterior probabilities. Since microarray data are often generated from two dye channels, we used both univariate and bivariate (8,9) mixture modeling to deal with two-channel data separately or together, respectively. We applied the proposed techniques on data from three different microarray experiments and compiled a reference set of genes known to be expressed under the same experimental conditions, to validate our methods. A statistically designed approach for estimating signal thresholds, such as the described method, should eliminate unreliable signal intensity observations in microarray data, and yield more informative and accurate gene expression ratios. MATERIALS AND METHODS Microarray preparation We used complementary DNA (cDNA) microarray, which contained ~2000 cDNA distinct probes and a total of ~4000 elements (10,11). The cDNA probes were generated via PCR ampli®cation of the clone-containing glycerol bacterial culture stocks using M13 primers and spotted (100 mm diameter) at least in duplicate on poly-L-lysine glass slides using an OmniGrid robot (GeneMachines, San Carlos, CA). Total RNA samples (20 mg, 5 mg/ml) from a THP-1 monocytic

*To whom correspondence should be addressed. Tel: +966 1 464 7272, ext. 39211; Fax: +966 1 442 7854; Email: [email protected]

Nucleic Acids Research, Vol. 32 No. 8 ã Oxford University Press 2004; all rights reserved

2324

Nucleic Acids Research, 2004, Vol. 32, No. 8

cell line that was stimulated with 10 mg/ml endotoxin (Sigma, St Louis, MO) were reverse transcribed using SuperScript II (Gibco) and a primer provided by the Genisphere kit (Genisphere, Inc., Hat®eld, PA). The cDNAs from the two samples (i.e. control and treatment) were combined and ethanol precipitation was performed. The hybridization buffer supplied by the Genisphere kit was heated to 55°C for 10 min and mixed with two blocking solutions, oligo(dT) and LNA dT for non-speci®c hybridization of labeled cDNA to elements containing poly(dA) sequences and for poly A containing elements including spotted poly(dA) sequences, respectively. The cDNA pellets were incubated in the hybridization buffer with the blockers as `hybridization mixes' of 40% formamide, 43 SSC, 1% SDS and 23 Denhardt's solution, and were then applied to the microarray. The microarrays were covered with coverslips for 18 h at 55°C in a sealed, humidi®ed chamber. After hybridization, the slides were washed brie¯y by a series of washes to remove unbound cDNA. Subsequently, the microarrays were subjected to dendrimer hybridization using Genisphere DNA dendrimer capture reagent labeled with Cy3 (control) or Cy5 (experiment). The hybridization mixes were ®rst incubated at 75±80°C for 10 min, and then at 50°C for 20 min, and then added to pre-warmed microarray slides. The microarrays were covered with coverslips and incubated for 3 h in a dark humidi®ed chamber at 45±50°C. After hybridization the slides were washed brie¯y several times to remove unbound dendrimer molecules and the microarrays were dried in a centrifuge. Image acquisition and intensity extraction The cDNA microarrays were scanned at 10 mm resolution using ScanArray 5000 (GSI lumonics, Moorpark, CA) scanner that excites the Cy3 (green) and Cy5 (red) ¯uorophores at optimal wavelengths of 532 and 635 nm, respectively. The median intensities of Cy3 and Cy5 ¯uorescent signals from each cDNA spot on the microarray image were extracted using the adaptive circle algorithm (QuantArray software). The spot intensities with large local background were excluded if the median intensity of the spot is 0 is the `weight' or `mixing coef®cient' of hi(x; qi), and {x = xj, j = 1, 2, ¼, n} is the set of n d 3 1 data points or observations. In the case of normal mixtures, the density component hi(x; qi) is a multivariate normal probability density, i.e. hi(x; qi) = N(x; mi, Si) = (2p)±d/2|Si|±1/2exp{±(x ± mi)TSi±1(x ± mi) / 2}, where mi is the d 3 1 mean vector and Si is the d 3 d covariance matrix, and qi consists of the distinct elements of mi and Si. The corresponding length of qi is m = d + (1/2)d(d + 1), because of the symmetry of the covariance matrix. In addition, Ski=1 pi = 1 so that f(x) is a valid probability density. Therefore, the total number of parameters estimated in equation 1 was M = km + k ± 1. In order to estimate these parameters with acceptable variance/accuracy, one must have M ! n. Here, we assumed there were two probability density components, i.e. k = 2, embedded in the probability density of our data, corresponding to the classes of unreliable and reliable signal intensity observations. Once the mixture model parameters qis were estimated, we calculated the posterior probability of any data point x, belonging to ith class as f(x; x Î Class i) = pihi(x; qi) / f(x) and compared these probabilities to decide which class the data point x belongs to (7). We used the EM algorithm for estimating the mixture parameters; EM is suitable for optimizing the likelihood function in situations where there is some missing information such as the class memberships of the data points (18±20). We started the algorithm with an initial estimate of the mixture parameters qis obtained by the k-means algorithm (4,7,21) and iterated the expectation and maximization steps until the changes in all of the estimated parameters were smaller than a small preset tolerance or a certain number of iterations were reached. We selected the parameter tolerance and the maximum iteration count as 0.0001 and 300, respectively. We performed all the computations using in-house software

Nucleic Acids Research, 2004, Vol. 32, No. 8 developed under MatlabÔ (The MathWorks Inc., Natick, MA) environment. Bayesian classi®cation In a two-class classi®cation problem, if x denoted an observation and i = 1, 2 was the class index, and the class prior probabilities, Pi = P(x Î Class i), and the class conditional densities, f (x: x v Class i) were known, then according to the Bayes's theorem, the posterior probabilities, f (x Î Class i; x), were: f …x Î Class i; x† ˆ

f …x; x Î Class i†Pi f …x†

where f (x) is a normalizing factor that ensures the posterior probabilities add up to 1.0. Bayesian decision rule simply assigns an observation to the class that has the highest posterior probability, i.e. decides that x Î Class 1 if f (x Î Class 1; x) is larger than f (x Î Class 2; x) and vice versa. (This method is also known as maximum a posteriori classi®cation.) Hence, for any given value of x probability of error, P(e; x) was minimized, because  f …x Î Class 1; x† if x Î Class 2 is decided P…e; x† ˆ f …x Î Class 2; x† if x Î Class 1 is decided ˆ min ff …x Î Class 1; x†; f …x Î Class 2; x† Detailed aspects of Bayesian classi®cation techniques are described in Duda et al. (7). The optimal decision boundary was found by equating the posterior probabilities. In the case of multivariate normal densities, the optimal decision boundary is a hyper-quadratic, whose form and position are determined by the prior probabilities, the mean vectors and the covariance matrices of the distributions, i.e. mixture components. Assuming that the class conditional densities and prior probabilities were estimated through mixture modeling utilizing the EM algorithm, this optimal classi®cation technique can be applied to minimize the probability of error in the microarray signal intensity classi®cation. RESULTS We performed two independent experiments of microarray gene expression using the same cell system in order to

2325

demonstrate the present approach. In addition, we used a publicly available data set (13). Background-subtracted and normalized data (¯uorescence signal intensities) were ®rst natural log-transformed to obtain better approximation to normal distribution. Microarray data generally consist of signal intensities generated from two different channels, Cy3 (green) and Cy5 (red). Table 1 shows summary statistics for the two channel data in the three data sets including mean, median, standard deviation (SD) and the correlation between the two channels. The table also includes the median and the SD of the raw (not log-transformed) background intensities; these values were needed to implement Fielden's thresholding technique (22) for comparison with our approach. We then applied normal mixture modeling, typically ®tted by maximum likelihood using the EM algorithm, to segregate microarray data into two populations of unreliable and reliable components. We employed this approach with either univariate or bivariate analysis, in which Cy3 (green) and Cy5 (red) channel data were either independently or dependently analyzed, respectively. Subsequently, a biologically validated reference set of genes was used to con®rm the methods. Univariate normal mixture modeling In the univariate approach, we processed the log-transformed data from each channel (Cy3 and Cy5) separately and assumed that their probability density functions consisted of two normal components. That is fg(x) = pgN(x; m1g, s21g) + (1 ± pg)N(x; m2g, s22g)

2a

fr(x) = prN(x; m1r, s21r) + (1 ± pr)N(x; m2r, s22r)

2b

where f(x) is the probability density, x is the log-transformed signal intensity, p is the weight, N(x; mi, si), i =1, 2 are the normal density components, and the subscripts g and r denote Cy3 and Cy5 channels, respectively. The weighted normal density components of each channel, i.e. pN(x; m1, s1) and (1 ± p) N(x; m2, s2), represent the posterior probabilities of potentially unreliable and reliable intensity classes of data, respectively. Figure 1 shows the histogram of each channel from the three data sets representing the probability density functions fg(x) and fr(x) of the entire population of signal intensities. In addition, the ®gure shows the plots of the two probability density functions representing the components that were estimated for each channel data by normal mixture modeling. Thus, the normal mixture approach resulted in two

Table 1. Summary statistics for microarray gene expression data Data set 1 Cy3 n (number of data points) Mean Median SD Correlation coef®cient (rCy3,Cy5) Median SD

3027 6.28 5.99 1.08 0.93 93.83 138.92

Cy5 6.16 5.81 1.16 54.08 67.33

Data set 2 Cy3 3040 6.21 5.95 1.05 0.92 90.58 153.13

Cy5 6.08 5.77 1.15 60.12 107.36

Data set 3a Cy3 39 618 6.53 6.43 0.93 0.90 NA NA

Cy5 6.57 6.41 1.01

Background and normalized signal intensities of the spots, generated using default settings of the QuantArray image processing program, were analyzed. Background signal intensities were generated from the local background readings of the entire microarray data set. NA, not applicable. set 3 is the publicly available data set obtained from the study of Chang et al. (13).

aData

2326

Nucleic Acids Research, 2004, Vol. 32, No. 8

normal density components, a component with a lower mean (Component 1) and a component with a higher mean (Component 2), corresponding to populations of data with

low (potentially unreliable) and higher signal intensity. To verify that the normal mixture model ®ts the underlying distribution of the data well, we constructed quantile±quantile

Nucleic Acids Research, 2004, Vol. 32, No. 8

2327

Figure 1. Univariate normal mixture histograms and threshold plots. Histograms in 60 bins from three different data sets (A±C) represent estimated normal density components for Cy3 green channel (top) and Cy5 red channel (bottom) channels. The estimated weighted normal density components for each channel, i.e. components 1 and 2 or pN(x, m1, s1) and (1 ± p)N(x, m2, s2) are shown by dashed lines and dotted lines, respectively. The statistical parameters for each component are shown in Table 2. The sum of the weighted densities, fg(x) and fr(x) (solid lines in top and bottom panels) closely mimics the density or histogram for the Cy3 and Cy5 channels, respectively. The vertical dashed line shows the optimal signal intensity threshold for each channel in the log-domain as explained in the text. The optimal thresholds can be converted to the raw-data domain by taking their exponentials.

(qq) plots. Indeed, the qq plots showed that the normal mixture model for each channel data ®ts well the underlying distribution of the actual data (Fig. 2). This was consistent in all the three microarray data sets that were analyzed (Fig. 2A±C). An important feature observed here is that the majority of the signal intensity data (~75%), regardless of the channel, belong to Component 1 (p ~ 0.75; Table 2). This is consistent with the fact that most of the microarray data fall into low signal intensities representing genes that are not expressed in any given experimental condition. In contrast to Component 2, it appears that Component 1 has lesser coef®cient of variation (~8%). The relatively larger variability (coef®cient of variation ~16%) of Component 2 (reliable signal population) may re¯ect the heterogeneity of the expression levels of the different genes. In the larger data set (Table 2, third data set) of approximately 40 000 data points and more than 3000 differentially expressed genes, the variability of Component 2 (reliable signal population) was ~12% (coef®cient of variation), which was also larger than the variability of the low signal component. Bivariate normal mixture modeling We applied bivariate analysis in the normal mixture modeling by considering Cy3 and Cy5 channels together. In general, microarray channel readings are highly correlated, as clearly seen in the scatter plot of the two channel data points (Fig. 3).

Indeed, the correlation coef®cient was >0.9 in all three cases (Table 1). We have exploited this correlation feature and modeled the probability density of the microarray data as follows: f(x) = pN(x; m1, S1) + (1 ± p)N(x; m2, S2)

3

Again, the weighted bivariate normal density components, i.e. pN(x; m1, S1) and (1 ± p)N(x; m2, S2) correspond to the posterior probabilities of the low (unreliable) and high potentially reliable intensity classes of data, respectively. The bivariate mixture modeling yielded two bivariate normal components embedded in two-dimensional probability density of the data depicted as ellipses overlaid in the scatter plot (Fig. 3). The larger the variance of the bivariate density in each direction, the larger axes length of the ellipsoids in the corresponding direction. The results of bivariate mixture estimation, i.e. mean vector m, covariance matrix S, and the weight p of the two bivariate density components for each data set are shown in Table 2 (bivariate analysis). As in the case of univariate analysis, the bivariate Component 1 representing low signal intensities, i.e. had lower signal intensity means (mean vector m) of both channels and included most of the signal intensity data (p ~ 0.75) in contrast to Component 2 (Table 2). This is also consistent with the fact that not all the genes are expressed in any given biological condition. Also, the variance of Component 1 (potentially unreliable) is

2328

Nucleic Acids Research, 2004, Vol. 32, No. 8

Figure 2. Analysis of the goodness of the ®t between the Cy3 and Cy5 data channels and their two-component normal mixture model, using qq plots. The qq plots from three different datasets (A±C) show quantiles of the actual Cy3 green channel (left) or Cy5 red channel (right) data against the mixture model of each. The dashed line in each plot corresponds to the case of perfect agreement between the compared samples. If the `+' marks, indicating the location of the matching quantiles, roughly lie on the straight line, then the distributions of the samples have the same shape except for a possible shift and rescaling.

smaller than the variance in Component 2 (potentially reliable signal population) indicating that heterogeneity of the level of gene expression (Table 2, covariance matrix S). The scatter plot shows also the ratios of gene expression; in any typical experiment, a small proportion of genes have signi®cant

expression ratios (for example Cy5 signal intensity/Cy3 signal intensity) that are different from unity (Cy3 = Cy5 readings). Thus, ratios that are dependent on the signal intensities located in the ellipse of the potentially reliable Component 2, are more likely to represent true expression ratios.

Nucleic Acids Research, 2004, Vol. 32, No. 8

2329

Table 2. The results of univariate and bivariate mixture modeling Data set

1

Univariate analysis Cy3 Component 1 Mean (m or m)

m

Var. (s2 or S)

s2

Weight (p) 2

Component 2

  5:826 5:656   0:229 0:17 0:172 0:25

  7:530 7:554   1:6431:518 1:5181:697

0.735   5:797 5:608   0:235 0:19 0:197 0:30

0.265   7:494 7:541   1:634 1:502 1:502 1:671

0.755   6:060 6:045   0:418 0:39 0:390 0:39

0.245   7:300 7:434   0:665 0:538 0:538 0:840

0.622

0.378

Component 2

Component 2

5.846

7.649

5.636

7.450

m

0.249

1.628

0.239

1.718

S

0.761

0.239

0.712

0.288

Mean (m or m)

m

5.807

7.567

5.588

7.422

m

Var. (s2 or S)

s2

0.246

1.615

0.285

1.717

S

0.769

0.231

0.731

0.269

Weight (p) 3a

Bivariate analysis Component 1

Cy5 Component 1

Mean (m or m)

m

6.119

7.312

5.993

7.317

m

Var. (s2 or S)

s2

0.446

0.755

0.342

0.898

S

0.656

0.344

0.564

0.436

Weight (p)

Two normal mixture density components were generated by normal mixture modeling. The EM algorithm was used to estimate the statistical parameters of the components using either univariate or bivariate analysis. aData set 3 is the publicly available data set obtained from the study of Chang et al. (13).

Optimal classi®cation for signal intensity reliability Because there is an overlap region between the two normal components resulting from both univariate and bivariate normal mixture analyses, one needs to ®nd an optimal decision boundary for classi®cation of signal intensities as reliable or unreliable. In univariate analysis, the optimal classi®cation threshold for each channel was found by equating the class posterior probabilities. From equation 2a, the g-threshold (Cy3 signal intensity threshold) was the solution of: pgN(x; m1g, s21g) = (1 ± pg)N(x; m2g, s22g) and similarly from equation 2b, the r-threshold (Cy5 signal intensity threshold) was the solution of: prN(x; m1r, s21r) = (1 ± pr)N(x; m2r, s22r) As seen in Figure 1, for each channel, the optimal threshold was exactly at the intersection of the two weighted normal density components. For example, the threshold for data set 1 (Fig. 1A) was 1001 and 761 for the Cy3 and Cy5 channels, respectively. It is clear that each data set, as a result of the different hybridization conditions and the level of global background, have their unique thresholds. For example, the Cy3 and Cy5 thresholds for data set 2, were 963 and 789 (Fig. 1B). Thus, an observation x = [xg xr]T should be ¯agged as reliable, if either of the channels is larger than its respective threshold, i.e. if xg > g-threshold or xr > r-threshold, then x is identi®ed as belonging to the class of reliable signal intensities. In bivariate analysis, an optimal classi®cation decision boundary was generated instead of the optimal ®xed signal thresholds for each channel in the univariate analysis. This

boundary was determined by equating the two weighted mixture components or posterior probabilities that are given in equation 3: pN(x; m1, S1) = (1 ± p)N(x; m2, S2) The solution of this equation de®nes a quadratic curve and consequently the decision regions in the two-dimensional observation space (Fig. 3). On one side of this decision curve, observations belong to one class (e.g. the population of low or unreliable signal intensities), whereas on the other side is the class of reliable signal intensities. In other words, the region for the class of unreliable data is the interior of the decision ellipseÐdenoted by the thick ellipsoid line in Figure 3Ðwhile those signal intensities belonging to the reliable class fall outside the decision ellipse. Comparison of classi®cation performance against the reference validation sets We constructed and used reference validation sets from our experimental data to assess and compare how well different classi®cation methods, including a common thresholding technique (22), are performing. We used human monocytes (monocytic leukemia cell line, THP-1) induced by the endotoxin, LPS. This is a common model in in¯ammation research (23±25). Thus, prior knowledge about the expression of many genes would facilitate the discrimination between true and false positives in microarray data. For each data set we used to test the proposed methods, a reference validation set of ~50 genes, was compiled. The validation sets contained both true signals and false signals based on prior knowledge of their expression in literature, and the results agreed in large-

2330

Nucleic Acids Research, 2004, Vol. 32, No. 8

scale expression analysis in both SAGE and microarray data (10,12). The true signals are those in which their microarray elements agree with prior knowledge on their expression, while false signals are those in which their microarray elements (due to different problems such as failure of ampli®cations, other low probe concentration or under-printing, etc.) are low in signal intensity and are not consistent with the expected expression or inducibility of these elements. An example is interleukin-8 gene, which is highly and consistently inducible in endotoxin-stimulated monocytes (24), the same cellular model from which the microarray data derived from. In the arrays, IL-8 probe was generated from two different sources of cDNAcontaining cultures, which resulted in optimal and suboptimal probe concentrations on the arrays. Another example is the eukaryotic elongation factor 2 gene that is ubiquitously expressed, and thus, acts as a housekeeping gene; their elements were present both as true signals or

false signals due to optimal and sub-optimal probe concentrations, respectively. To quantitatively assess how well our method classi®es the signal intensities in the reference/validation gene set into the two components of reliable and unreliable signal intensities, we computed the kappa coef®cient of agreement (26,27), along with the sensitivity, speci®city and observed agreement rates (Table 3). Both univariate and bivariate methods resulted in the ability to classify almost all of the true positives (sensitivity is 96±100%) and true negatives (speci®city, 93± 100%, depending on the data set) in the reference gene set as reliable and unreliable signal intensities, respectively (Table 3). The accuracy of the method can also be statistically evaluated using the kappa (k) coef®cient of agreement in which a value of 1 indicates an excellent (e.g. 100%) agreement. As the kappa values indicated, the best agreement with the reference classi®cation information available in the validation sets was obtained using our classi®cation technique

Nucleic Acids Research, 2004, Vol. 32, No. 8

based on bivariate modeling, with 100% observed agreement rate compared to 98 and 94% of univariate modeling for data sets 1 and 2, respectively (Table 3). Using the validation sets, we were also able to compare our approach with Fielden's method, a widely used method for exclusion of unreliable signal intensities. In this method, the threshold for each channel is obtained as: MedianBackG + 3SDBackG where MedianBackG and SDBackG refer to the median and SD of the signal intensity estimated from the spot backgrounds. Both the univariate and bivariate normal mixture modeling resulted in superior results when compared to Fielden's method. Unlike Fielden's method, which behaved particularly poorly in data set 1 (agreement rate of only 57%), our method performed superiorly and equally well in both data sets (Table 1). Table 4 shows examples of utility of our bivariate analysisbased classi®cation method using four expressed genes (of the ~50 genes in the reference set) that are well known to be

2331

induced in human monocytes by endotoxin (LPS), in addition to data on one housekeeping gene. DISCUSSION Microarray gene expression data usually bear a substantial number of false positives due to different possible sources of error, but, mostly due to the large population of low signal intensities resulting from low or lack of expression of the majority of the genes being investigated. This is due to the fact that, though cells harbor the same ~50 000 genes, only a fraction of genes become expressed at any time point or any condition. In addition, those low signal intensities that are near to the background levels increase variability or impair reproducibility of the measured ratios between control and experimental samples. In this study, we developed an approach that was derived from normal mixture modeling (4,5) using univariate and bivariate analysis, which segregated the microarray data into two components. With the application of the optimum classi®cation algorithm, two classes, namely,

2332

Nucleic Acids Research, 2004, Vol. 32, No. 8

Figure 3. Bivariate normal mixture scatter plots, classi®cation plots and validation plots. Cy3 (green channel) and Cy5 (red channel) scatter plots for the three data sets (A±C) are shown. The gray `+' marks denote the individual signal intensity data points, whereas the up and down triangles denote true positives and true negatives (as provided in the reference/validation sets), respectively. The large black dots and ellipses indicate the centers (means) and the variances of the two bivariate normal mixture component ellipses 1 and 2 as indicated by dashed and dotted lines, respectively (Table 2). The optimal thresholds for Cy3 and Cy5 channels obtained using univariate analysis are shown by the vertical and horizontal solid lines, respectively. The dashed vertical and horizontal lines show the optimal thresholds obtained using Fielden's method for Cy3 and Cy5, respectively. The region of unreliable signal observations in univariate analysis is the lower-left rectangular corner of the two-dimensional space, bounded by the thresholds. The thick ellipse shows the optimal decision boundary obtained by using the bivariate mixture modeling approach. The region of the unreliable observations is the interior of the decision ellipse.

classes of reliable and unreliable (low) signal intensity data, were identi®ed, and the proposed method was validated. We employed normal mixture modeling to microarray data because this procedure is well ®tted for clustering the observations together into groups for classi®cation and also provides a convenient and ¯exible class of models for estimating distributions (4,15,28). Both univariate and bivariate normal mixture modeling were used; in univariate analysis, data from the two microarray channels were considered, and optimal signal thresholds were established for each channel separately. Whereas in bivariate analysis, the two-dimensional nature of the data was exploited and data were directly classi®ed using posterior probabilities computed

from the bivariate mixture components. The latter approach, which takes into account the correlation of the two channels, leads to further accurate classi®cation of the data to reliable and unreliable components, as demonstrated with the validation reference sets. The validity of our approach in identi®cation of reliable signal intensities and in thresholding/classi®cation procedures was asserted through several statistical assessments and observations. We estimated the mixtures components and their weights using the EM algorithm (18±21) which is an iterative method for optimizing the likelihood function in situations where data is incomplete; in our case, the missing information was the class memberships of the data points as

Nucleic Acids Research, 2004, Vol. 32, No. 8

2333

Table 3. Classi®cation results and validation with the reference set Data set 1

2

Univariate analysis Reliable True positive 26 True negative 1 Total 27 Sensitivity (%) 100 Speci®city (%) 96 Observed agreement (%) 98 k (95% CI) 0.962 (0.889±1.036) Cy3 Ch. threshold 1001 Cy5 Ch. threshold 761 True positive 26 True negative 2 Total 28 Sensitivity (%) 96 Speci®city (%) 93 Observed agreement (%) 94 k (95% CI) 0.889 (0.767±1.011) Cy3 Ch. threshold 963 Cy5 Ch. threshold 789

Bivariate analysis Unreliable Total Reliable 0 26 26

26 27 53

1 25 26

27 27 54

Fielden's method Unreliable Total Reliable

26 0 0 27 26 27 100 100 100 1.000 (1.000±1.000) NA NA 27 0 0 27 27 27 100 100 100 1.000 (1.000±1.000) NA NA

26 27 53

27 27 54

Unreliable Total

26 0 23 4 49 4 100 15 57 0.146 (±0.117±0.408) 511 256 27 0 7 20 34 20 93 96 94 0.741 (0.562±0.920) 550 382

26 27 53

27 27 54

Classi®cation of the microarray signal intensities into reliable and unreliable populations using optimum ®xed threshold (univariate analysis; Fig. 1) and optimum classi®cation ellipse (Fig. 3), was performed. True positives and true negatives were determined using reference sets of expressed genes with a well known status of their expression from the literature. Subsequently, sensitivity, speci®city and agreement rates were determined. NA, not applicable; Ch, channel.

Table 4. Examples of validation of the algorithm using a reference list of genes known to be expressed in endotoxin-stimulated human monocytes ID

Gene

Cy3

Cy5

Cy5/Cy3

Expression statusa

Algorithm validationb

Hs.624

Interleukin-8 (LPS inducible gene)

Hs.1722

Interleukin 1a (LPS inducible gene)

Hs.211600

TNF-a induced protein 3 (LPS inducible gene)

Hs.196384

Cyclooxygenase 2 (LPS inducible gene)

Hs. 75309

Eukaryotic translation elongation factor 2 (housekeeping gene)

1501 2084 338 492 1805 2220 348 291 1989 1554 2585 2652 441 641 751 662 5086 751 320 3748 478 320 9808 13254 623 358

9252 10287 605 415 3013 3500 459 364 8891 6107 8281 7929 819 829 677 617 7166 677 469 5448 541 469 11188 13733 313 104

6.16 4.93 1.79 0.84 1.7 1.6 1.3 1.2 4.47 3.93 3.20 2.99 1.85 1.29 0.92 0.93 1.41 0.90 1.47 1.45 1.13 1.47 1.1 1.0 0.5 0.3

True True False False True True False False True True True True False False False False True False True True False True True True False False

Reliable Reliable Unreliable Unreliable Reliable Reliable Unreliable Unreliable Reliable Reliable Reliable Reliable Unreliable Unreliable Unreliable Unreliable Reliable Unreliable Reliable Reliable Unreliable Reliable Reliable Reliable Unreliable Unreliable

aGene expression status of known endotoxin (LPS) inducible genes: if the gene is known to be upregulated (>1.4) in LPS-stimulated human monocytes then, the status is true. bResults of the utility of our described algorithm, using bivariate normal mixture modeling, in discriminating reliable from unreliable signals and subsequently producing accurate gene expression ratios.

being reliable and unreliable. We used the `k-means' algorithm to initialize the EM to ensure that it will ®nd a `good' local maximum (4); this is often suf®cient in practical applications. Though there are other ways to ®t or estimate mixture models, the EM is relatively easy to implement, does

not require any gradient computation, and is numerically stable. This statistically designed approach leading to segregation of the data into two components resulted in a situation that mimics two important features of microarray data. First, the

2334

Nucleic Acids Research, 2004, Vol. 32, No. 8

majority of the signal intensities microarray data are not applicable for any analysis because the majority of the genes are not expressed at all or at very low basal levels of expression. We found that our approach succeeded in segregating the microarray data into one large component (75%) and a smaller component (25%) containing reliable data but with a larger mean value. This is consistent with the fact that most of the microarray data fall into low signal intensities representing genes that are not expressed in any given experimental condition. Also, low signals can arise from non-biological causes such as low quality of the hybridization probes on oligonucleotide microarrays or sub-optimal ampli®cation/printing of cDNA probes in cDNA microarray procedures. It should be noted that even among the reliable data, not all the genes are differentially expressed, as assessed by ratios of signal intensities. The majority of the microarray elements fall in the ratios of 1.0 (slope of 1.0 at microarray scatter plots of the two channels). Thus, our approach will yield a more accurate assessment of gene expressions ratios by eliminating the false positives ratios due to unreliable signal intensities. Since most of the error occurs when working with ratios based on low (unreliable) signal intensities, it is expected that the ratios are more accurate when based on `reliable signals' that our methods succeed to discriminate. Another feature that validates our procedures is that variation of the class of reliable signal intensity as statistically assessed re¯ects the nature of the microarray data and the true heterogeneity of the levels of expression of the different genes. An important component of this study is the use of a validation reference set of expressed genes to assess the accuracy of the classi®cation of reliable and unreliable data. We observed that both techniques, when applied to validation sets, have produced excellent results in terms of sensitivity, speci®city and kappa coef®cient of agreement with the reference classi®cation information that was available in the validation sets. In particular, the classi®cation based on bivariate mixture modeling, fully exploiting the bivariate nature of the dataÐthere is always excellent correlation between Cy3 and Cy5 channels of dataÐyielded even superior results with 100% sensitivity and speci®city. We also compared the results of our classi®cation method with those of a currently utilized simple thresholding method (22) and observed that our classi®cation algorithm performed better as assessed with the sensitivity, speci®city and agreement statistics (kappa coef®cient). Our method overcomes many of the limitations of the existing methods. A major advantage of the proposed approach here is that the normal mixture model parameters that determine the optimal thresholds are estimated from the background-corrected signal intensities and not from the background levels alone. Thus, the proposed approach takes into consideration both the distribution of the entire signal intensity data and the global background status. Moreover, our method deals with the spot intensities themselves while several other methods (29,30) deal with gene expression ratios, which may have already been derived from unreliable signal intensities to start with. A clear advantage of our method is that it assigns reliability on the spots themselves, regardless of whether the genes are induced or repressed; unlike those, for example in the Hughes et al. study (31), which only assign con®dence levels on ratios that signi®cantly

change. In addition, these methods require repeated measurements, which may not be always practical. Nevertheless, few studies have addressed the identi®cation of reliable signal intensities, but still with limitations. Many of the commercial microarray processing softwares utilize a ®xed signal threshold, which is arbitrary and independent neither on the nature of the distribution of the entire data nor on the different background status of each array experiment. A commonly used method suggested the elimination of spot intensities whose medians were below a threshold that was de®ned as the median background intensity plus a constant (suggested default value was 3) times the SD of the spot background (22). This method is subjective; it involves an arbitrary constant and does not take into account the distribution of the actual data, and more importantly, this method was not validated with a reference gene data set, as was ours. In another approach, Bilban et al. introduced a `receiver operating characteristic'-based analysis of a set of positive and negative controls to establish the signal thresholds (32). However, this method may not be as accurate due to the reliance on the signals generated from these controls and needs to be overly repeated for meaningful signi®cance. Tran et al. (33) proposed the use of correlation between the mean and median spot intensities to eliminate inaccurate signals. Although this method is well suited to discriminate spots with good morphology, this may not be the case for low signals that still have good morphology; besides the aforementioned limitations, this approach suffers from an arbitrarily de®ned threshold on the correlation. The method should prove useful in any microarray experiment with one or two channels and is applicable to cDNA microarrays, as demonstrated by the use of independent inhouse and publicly available data sets, or oligonucleotidespotted microarray data (M. H. Asyali and K. S. A. Khabar, unpublished observations). The assessment of signal intensity reliability and estimation of threshold is adaptable to any data set due to the different hybridization conditions and level of global background. Thus, each set of microarray data will have its own unique mixture model and corresponding optimal classi®cation thresholds. The method proves superior to many of the threshold techniques and has been validated with a true and false reference list of expressed genes. The use of the proposed classi®cation approach will signi®cantly improve the reliability and reproducibility of microarray data, and eliminate erroneous ratios of gene expression.

ALGORITHM AVAILABILITY The approach described in this paper is available as a standalone executable WindowsÔ program with an easy to use graphical user interface. It is available for academic users upon signing a standard academic use agreement. Commercial users need to obtain a license. For further details please see http://rc.kfshrc.edu.sa/bssc/staff/MusaAsyali/Downloads.asp.

ACKNOWLEDGEMENTS The authors like to thank Mr Mohammad Naemmuddin and Mr Mohammed Dhalla for their excellent technical assistance.

Nucleic Acids Research, 2004, Vol. 32, No. 8

REFERENCES 1. Nguyen,D.V., Arpat,A.B., Wang,N. and Carroll,R.J. (2002) DNA microarray experiments: biological and technological aspects. Biometrics, 58, 701±717. 2. Ramaswamy,S. and Golub,T.R. (2002) DNA microarrays in clinical oncology. J. Clin. Oncol., 20, 1932±1941. 3. Golub,T.R., Slonim,D.K., Tamayo,P., Huard,C., Gaasenbeek,M., Mesirov,J.P., Coller,H., Loh,M.L., Downing,J.R., Caligiuri,M.A. et al. (1999) Molecular classi®cation of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531±537. 4. McLachlan,G.J. and Basford,K.E. (1988) Mixture Models, Inference and Applications to Clustering. Marcel Dekker, New York. 5. McLachlan,G.J. (1982) In Krishnaiah,P.R. and Kanal,L.N. (eds), The classi®cation and mixture maximum likelihood approaches to cluster analysis. Handbook of Statistics. North-Holland, Amsterdam, Vol. 2, pp. 199±208. 6. Schwartz,G. (1978) Estimating dimensions of a model. Ann. Stat., 6, 461±464. 7. Duda,R., Hart,P. and Stork,D. (2000) Pattern Classi®cation, 2nd Edn. Wiley, New York. 8. Symons,M. (1981) Clustering criteria and multivariate normal mixtures. Biometrics, 37, 35±43. 9. Wolfe,J. (1970) Pattern clustering by multivariate mixture analysis. Multivariate Behav. Res., 5, 329±350. 10. Frevel,M.A., Bakheet,T., Silva,A.M., Hissong,J.G., Khabar,K.S. and Williams,B.R. (2003) p38 Mitogen-activated protein kinase-dependent and -independent signaling of mRNA stability of AU-rich elementcontaining transcripts. Mol. Cell. Biol., 23, 425±436. 11. Bakheet,T., Williams,B.R. and Khabar,K.S. (2003) ARED 2.0: an update of AU-rich element mRNA database. Nucleic Acids Res., 31, 421±423. 12. Suzuki,T., Hashimoto,S., Toyoda,N., Nagai,S., Yamazaki,N., Dong,H.Y., Sakai,J., Yamashita,T., Nukiwa,T. and Matsushima,K. (2000) Comprehensive gene expression pro®le of LPS-stimulated human monocytes by SAGE. Blood, 96, 2584±2591 13. Chang,H.Y., Sneddon,J.B., Alizadeh,A.A., Sood,R., West,R.B., Montgomery,K., Chi,J.T., Rijn Mv,M., Botstein,D. and Brown,P.O. (2004) Gene expression signature of ®broblast serum response predicts human cancer progression: similarities between tumors and wounds. PLoS Biol., 2, E7. 14. McManus,I.C. (1983) Bimodality of blood pressure levels. Stat. Med., 2, 253±258. 15. McLachlan,G.J. and Gordon,R.D. (1989) Mixture models for partially unclassi®ed data: a case study of renal venous renin in hypertension. Stat. Med., 8, 1291±1300. 16. McLachlan,G.J., Bean,R.W. and Peel,D. (2002) A mixture model-based approach to the clustering of microarray expression data. Bioinformatics, 18, 413±422. 17. Shoukri,M.M. and McLachlan,G.J. (1994) Parametric estimation in a genetic mixture model with application to nuclear family data. Biometrics, 50, 128±139.

2335

18. Dempster,A., Laird,N. and Rubin,D. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc., B39, 1±38. 19. Redner,R. and Walker,H. (1984) Mixture densities, maximum likelihood and the EM algorithm. Siam Rev., 26, 195±202. 20. Moon,T.K. (1996) The expectation-maximization algorithm. IEEE Signal Process. Mag., 13, 47±60. 21. Martinez,W.L. and Martinez,A.R. (2001) Computational Statistics Handbook with MATLAB. CRC Press, Boca Raton. 22. Fielden,M.R., Halgren,R.G., Dere,E. and Zacharewski,T.R. (2002) GP3: GenePix post-processing program for automated analysis of raw microarray data. Bioinformatics, 18, 771±773. 23. Murayama,T., Ohara,Y., Obuchi,M., Khabar,K.S., Higashi,H., Mukaida,N. and Matsushima,K. (1997) Human cytomegalovirus induces interleukin-8 production by a human monocytic cell line, THP-1, through acting concurrently on AP-1- and NF-kappaB-binding sites of the interleukin-8 gene. J. Virol., 71, 5692±5695. 24. Al-Humidan,A., Edwards,C.K., Al-So®,A., Dzimiri,M., Al-Sedairy,S.T. and Khabar,K.S. (1998) A carbocyclic nucleoside analogue is a TNF-alpha inhibitor with immunosuppressive action: role of prostaglandin E2 and protein kinase C and comparison with pentoxifylline. Cell. Immunol., 188, 12±18. 25. Kastelic,T., Schnyder,J., Leutwiler,A., Traber,R., Streit,B., Niggli,H., MacKenzie,A. and Cheneval,D. (1996) Induction of rapid IL-1 beta mRNA degradation in THP-1 cells mediated through the AU-rich region in the 3¢UTR by a radicicol analogue. Cytokine, 8, 751±761. 26. Cohen,J. (1960) A coef®cient of agreement for nominal scales. Educ. Psychol. Measurement, 20, 37±46. 27. Landis,J. and Koch,G. (1977) The measurement of observer agreement for categorical data. Biometrics, 33, 159±174. 28. Ridol®,A. and Idier,J. (2002) Penalized Maximum Likelihood Estimation for Normal Mixture Distributions. School of Computer and Information Sciences Technical Report 200285. EÂcole Polytechnique FeÂdeÂrale de Lausanne, EPFL, School of Computer and Information Sciences; Lausanne, Switzerland. 29. Brody,J.P., Williams,B.A., Wold,B.J. and Quake,S.R. (2002) Signi®cance and statistical errors in the analysis of DNA microarray data. Proc. Natl Acad. Sci. USA, 99, 12975±12978. 30. Tusher,V.G., Tibshirani,R. and Chu,G. (2001) Signi®cance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116±5121. 31. Hughes,T.R., Marton,M.J., Jones,A.R., Roberts,C.J., Stoughton,R., Armour,C.D., Bennett,H.A., Coffey,E., Dai,H., He,Y.D. et al. (2000) Functional discovery via a compendium of expression pro®les. Cell, 102, 109±126. 32. Bilban,M., Buehler,L., Head,S., Desoye,G. and Quaranta,V. (2002) De®ning signal thresholds in DNA microarrays: exemplary application for invasive cancer. BMC Genomics, 3, 19. 33. Tran,P.H., Peiffer,D.A., Shin,Y., Meek,L.M., Brody,J.P. and Cho,K.W. (2002) Microarray optimizations: increasing spot accuracy and automated identi®cation of true microarray signals. Nucleic Acids Res., 30, e54.

Suggest Documents