Reference alignment of SNP microarray signals for ... - Oxford Journals

BIOINFORMATICS

ORIGINAL PAPER

Vol. 25 no. 3 2009, pages 315–321 doi:10.1093/bioinformatics/btn624

Gene expression

Reference alignment of SNP microarray signals for copy number analysis of tumors

Stan Pounds1,∗, Cheng Cheng1, Charles Mullighan2, Susana C. Raimondi2, Sheila Shurtleff2 and James R. Downing2

1Department of Biostatistics and 2Department of Pathology, St. Jude Children’s Research Hospital, 262 Danny Thomas Place, Memphis, TN 38105, USA

Received on September 16, 2008; revised on November 14, 2008; accepted on November 29, 2008
Advance Access publication December 3, 2008
Associate Editor: Trey Ideker

ABSTRACT

A new procedure to align single nucleotide polymorphism (SNP) microarray signals for copy number analysis is proposed. For each individual array, this reference alignment procedure (RAP) uses a set of selected markers as internal references to direct the signal alignment. RAP aligns the signals so that each array has a similar signal distribution among its reference markers. An accompanying reference selection algorithm (RSA) uses genotype calls and initial signal intensities to choose two-copy markers as the internal references for each array. After RSA and RAP are applied, each array has a similar distribution of signals of two-copy markers so that across-array signal comparisons are biologically meaningful. An upper bound for a statistical metric of signal misalignment is derived and provides a theoretical basis to choose RSA-RAP over other alignment procedures for copy number analysis of cancers. In our study of acute lymphoblastic leukemia, RSA-RAP gives copy number analysis results that show substantially better concordance with cytogenetics than do two other alignment procedures.

Availability: Documented R code is freely available from www.stjuderesearch.org/depts/biostats/refnorm.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Normalization is a critical component of the analysis of microarray data. Numerous methods have been proposed for normalization. Steinhoff and Vingron (2006) reviewed several of these methods. Broadly speaking, normalization consists of two major steps: (i) signal summarization and (ii) signal alignment. Signal summarization reduces the microarray image data into a set of signals representing the image intensity for each microarray probe set or marker. Signal alignment transforms the summary signals of each array so that comparisons of signals across arrays are biologically meaningful. Signal alignment procedures transform the summary signals so that all arrays have a similar value for some characteristic of the signal distribution. For example, some methods transform the summary signals so that each array has the same mean, trimmed mean or median signal across the entire genome. Quantile normalization ensures that the quantiles of the signal distribution are matched across arrays. Such global-alignment methods transform the data so that each array has a similar distribution of signals among all probe sets or markers.

Global-alignment methods have been widely used with much success for gene expression microarray studies. However, we observed that global-alignment methods were unsuitable for our study of copy number alterations in acute lymphoblastic leukemia (ALL; Mullighan et al., 2007). Cytogenetics data indicate that the proportion of the genome that is amplified, deleted or unaltered varies markedly from tumor to tumor (Raimondi et al., 1998). Some tumors have multiple full-chromosome amplifications or deletions, while others have abnormalities involving only one or two cytobands. For the control arrays (i.e. from non-cancerous tissue specimens), most markers are presumably in the typical two-copy state. It did not seem sensible to align the signal distribution among all markers across arrays because the copy number distribution among all markers varies substantially between tumors and controls and from tumor to tumor. Therefore, we developed a reference alignment procedure (RAP) to align the signals so that each array has a similar signal distribution among the markers in the two-copy state. In the course of our study, we observed that RAP greatly improved the accuracy of our copy number analysis (Supplementary Materials, Fig. S1).

Here, we describe RAP in detail and the statistical principles underlying its success. In Section 2, we describe two widely used alignment procedures, RAP, and a reference selection algorithm (RSA) that accompanies RAP. In Section 3, we propose statistical metrics of misalignment, derive an upper bound for one of those metrics, and describe how the bound provides a theoretical basis for the success of RAP in our application. Section 4 compares the performance of RAP to that of other alignment procedures in the context of our application. Finally, the discussion is given in Section 5.

∗To whom correspondence should be addressed.

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

2 ALIGNMENT PROCEDURES

2.1 Quantile alignment

For i = 1,...,n arrays and j = 1,...,m markers, let $y_{ij}$ represent the unaligned signal value for array i and marker j. Also, for each i, let (j) index the markers so that the signal values for array i are arranged in increasing order of the unaligned signal values, i.e. $y_{i(1)} \le y_{i(2)} \le \cdots \le y_{i(m)}$. The quantile alignment (QA) signals are given by

$$\tilde{y}_{i(j)} = \frac{1}{n}\sum_{i=1}^{n} y_{i(j)}. \tag{1}$$

The QA signals satisfy the property $\tilde{y}_{i(j)} = \tilde{y}_{i'(j)}$ for all $i$, $i'$ and (j). The QA method has been used in gene expression studies with much success (Bolstad et al., 2003). Subsequently, QA has been implemented in several software packages to prepare single nucleotide polymorphism (SNP) microarray data for copy number analysis.
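As an illustration of Equation (1), the following minimal sketch computes QA signals for a small, made-up set of arrays. The function name `quantile_align` and the toy data are assumptions for illustration only, and ties are broken by marker order, which the text does not specify.

```python
def quantile_align(signals):
    """Quantile alignment as in Equation (1).

    signals[i][j] is the unaligned signal of array i at marker j.
    Returns the aligned signals in the original marker order.
    """
    n = len(signals)
    m = len(signals[0])
    # order[i][k] = marker holding the (k+1)-th smallest signal on array i
    # (ties broken by marker index: an assumption of this sketch)
    order = [sorted(range(m), key=lambda j: row[j]) for row in signals]
    # Target value for each rank: mean over arrays of the signal at that rank
    target = [sum(signals[i][order[i][k]] for i in range(n)) / n for k in range(m)]
    aligned = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for k, j in enumerate(order[i]):
            aligned[i][j] = target[k]
    return aligned


# Two hypothetical arrays with three markers each
signals = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
aligned = quantile_align(signals)
```

After alignment every array carries the same set of values; only the assignment of values to markers differs between arrays.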

2.2 Alignment of the rank-invariant set

Li and Wong (2001) proposed ‘invariant-set normalization’ as a procedure to align the probe set signals. In this article, the procedure is called ‘invariant set alignment’ (ISA) to clarify that this is a signal alignment procedure and not a signal summarization procedure. For each array i, define

$$q_{i(j)} = \frac{1}{m}\sum_{l=1}^{m} I(y_{il} \le y_{i(j)}) - \frac{1}{2m}, \tag{2}$$

where I(·) is the indicator function that equals 1 if the enclosed statement is true and equals 0 otherwise. By definition, $q_{i(j)}$ is the (adjusted) proportion of all signals $y_{i(1)}, y_{i(2)}, \ldots, y_{i(m)}$ less than or equal to $y_{i(j)}$ for each i. Let $q_{ij}$ correspond to the values of (2) in the original ordering. Let $i_0$ be the index of a selected baseline array. For each other array i, let

$$d_{ij} = |q_{ij} - q_{i_0 j}|. \tag{3}$$

For array i, the rank-invariant set is given by the set $J_i$ of j such that $d_{ij}$ is less than or equal to a chosen threshold τ. The ISA signals are given by

$$\acute{y}_{ij} = \acute{T}_i(y_{ij}), \tag{4}$$

where $\acute{T}_i(\cdot)$ is obtained by fitting a local regression curve to the pairs $(y_{ij}, y_{i_0 j})$ for all j in $J_i$.
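The selection of the rank-invariant set in Equations (2) and (3) can be sketched as below. This is a hedged illustration, not the dChip implementation: the function names, the toy signals and the threshold value `tau=0.1` are hypothetical, and the local regression fit of Equation (4) is deliberately left out.

```python
from bisect import bisect_right


def adjusted_proportions(y):
    """Equation (2): adjusted proportion of signals <= y[j], in marker order."""
    m = len(y)
    y_sorted = sorted(y)
    return [bisect_right(y_sorted, v) / m - 1.0 / (2 * m) for v in y]


def rank_invariant_set(y, y_baseline, tau):
    """Markers j with d_j = |q_j - q_baseline_j| <= tau, as in Equation (3)."""
    q = adjusted_proportions(y)
    q0 = adjusted_proportions(y_baseline)
    return [j for j in range(len(y)) if abs(q[j] - q0[j]) <= tau]


# Hypothetical baseline array and one other array (four markers)
baseline = [1.0, 2.0, 3.0, 4.0]
array = [1.1, 3.9, 2.8, 2.0]
invariant = rank_invariant_set(array, baseline, tau=0.1)
```

In this toy case markers 0 and 2 keep their rank relative to the baseline and form the set on which the local regression curve would be fitted.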

2.3 Reference alignment procedure

For purposes of copy number analysis, it is clearly advantageous to align the signal distributions of the two-copy markers across arrays. If the two-copy markers’ signal distributions are quite dissimilar, then the across-array comparisons will not be biologically meaningful. RAP is specifically designed to achieve the objective of aligning the signal distribution of two-copy markers across arrays. RAP requires that a set of markers be specified to serve as internal references for each array. Each array may have a distinct set of reference markers or all arrays may have the same set of reference markers. Given the specified sets of reference markers, RAP aligns the signals so that each array has a similar signal distribution among its selected reference markers. Section 2.4 describes strategies to select reference markers that represent the two-copy state. RAP ensures that each array with a well-chosen set of reference markers has a similar distribution of aligned signals among two-copy markers. Section 3 discusses this desirable property of RAP in greater detail. For i = 1,...,n and j = 1,...,m, let $r_{ij} = 1$ if marker j is selected as an internal reference marker for array i and let $r_{ij} = 0$ otherwise. Also, for each i, let $R_i$ be the set of j such that $r_{ij} = 1$. Additionally, for each i, define $r_{i\cdot} = \sum_{j=1}^{m} r_{ij}$, which is the number of reference markers for array i. Now, define

$$u_{i(j)} = \frac{1}{r_{i\cdot}}\sum_{l=1}^{m} r_{il}\, I(y_{il} \le y_{i(j)}) - \frac{1}{2r_{i\cdot}}. \tag{5}$$
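The two adjusted proportions in Equations (2) and (5) can be computed side by side for a single array, as in the sketch below. The function names and the six-marker example are hypothetical; the resulting (q, u) pairs for the reference markers are the points used by the interpolation described in the text that follows.

```python
from bisect import bisect_right


def adjusted_proportions(y):
    """Equation (2): adjusted proportion of all signals <= y[j]."""
    m = len(y)
    all_sorted = sorted(y)
    return [bisect_right(all_sorted, v) / m - 1.0 / (2 * m) for v in y]


def adjusted_reference_proportions(y, r):
    """Equation (5): adjusted proportion of reference signals <= y[j].

    r[j] is 1 if marker j is a reference marker for this array, else 0.
    """
    n_ref = sum(r)  # r_i. in the text
    ref_sorted = sorted(v for v, flag in zip(y, r) if flag)
    return [bisect_right(ref_sorted, v) / n_ref - 1.0 / (2 * n_ref) for v in y]


# Hypothetical array: the three lowest signals are the reference markers
y = [2.0, 6.0, 1.0, 5.0, 3.0, 4.0]
r = [1, 0, 1, 0, 1, 0]
q = adjusted_proportions(y)
u = adjusted_reference_proportions(y, r)
# (q, u) pairs for the reference markers only, ordered by q
pairs = sorted((q[j], u[j]) for j in range(len(y)) if r[j])
```

Here the reference markers occupy the lower half of the array's overall distribution (q up to about 0.42) but span their own distribution from 1/6 to 5/6 in u.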

For each array i, $u_{i(j)}$ is the (adjusted) proportion of reference markers with signal value less than or equal to $y_{i(j)}$. For each (j) in $R_i$, there is a pair $(q_{i(j)}, u_{i(j)})$ where $q_{i(j)}$ is the proportion of all markers with signal value less than or equal to $y_{i(j)}$ as defined in (2). Also, for each i, let $\hat{T}_i(x)$ be defined by the linear interpolation of the series of points (0,0), $(q_{i(j)}, u_{i(j)})$ for (j) in $R_i$, and (1,1). Note that $\hat{T}_i(x)$ maps the adjusted proportion of all markers with unaligned signal less than or equal to x to a value approximately equal to the adjusted proportion of reference markers with unaligned signal less than or equal to x. The adjustment of $u_{i(j)}$ by $1/(2r_{i\cdot})$ in (5) and the adjustment of $q_{i(j)}$ by $1/(2m)$ in (2) improve the performance of transformation $\hat{T}_i(x)$ in the tails of the distribution. The adjustments ensure that $0 < u_{i(j)} < 1$, $0 < q_{i(j)} < 1$, and $\hat{T}_i(x)$ satisfies the properties 0 0 by definition.

By the definition of RAP, G(·) is approximately equal to the EDF of the reference markers’ RAP signals. By definition (8), $\hat{F}_i(\cdot)$ is the EDF of the two-copy markers’ signals. By the probability integral transform, $\hat{\delta}_i = 0$ if and only if $\hat{F}_i(x) = G(x)$ for all x. In the limiting case that the two distributions have no overlap, i.e. there exists some x such that $\hat{F}_i(x) = 1$ and $G(x) = 0$ (or vice-versa), $\hat{\delta}_i = 1$. For each pair of arrays $i_1$ and $i_2$, define the pairwise misalignment of RAP signals as

$$\hat{\Delta}(i_1,i_2) = \int_0^1 \left|\hat{F}_{i_1}\!\left(G^{-1}(u)\right) - \hat{F}_{i_2}\!\left(G^{-1}(u)\right)\right| du. \tag{9}$$

The pairwise misalignment $\hat{\Delta}(i_1,i_2)$ measures how much the signal distribution of the two-copy markers of array $i_1$ differs from that of the two-copy markers of array $i_2$. Note that $\hat{\Delta}(i_1,i_2) = 0$ if and only if $\hat{F}_{i_1}(x) = \hat{F}_{i_2}(x)$ for all x because the supports of $\hat{F}_{i_1}(\cdot)$ and $\hat{F}_{i_2}(\cdot)$ are each a subset of the support of G(·). The condition regarding the supports holds because all RAP signals are mapped by $G^{-1}(\cdot)$ in Equation (6). Additionally, $\hat{\Delta}(i_1,i_2) = 1$ only in the limiting case that the distributions $\hat{F}_{i_1}$ and $\hat{F}_{i_2}$ have no overlap. Thus, a large value of the pairwise misalignment indicates that comparisons of the signals across the two arrays are not biologically meaningful because the two-copy markers’ signals are not on the same scale. Importantly, for any pair of arrays $i_1$ and $i_2$, the pairwise misalignment is bounded above by the average of the two arrays’ internal misalignments (IMAs). By the triangle inequality, (7) and (9) imply that

$$\hat{\Delta}(i_1,i_2) \le \frac{\hat{\delta}_{i_1} + \hat{\delta}_{i_2}}{2}. \tag{10}$$

Therefore, in practice, selecting references for arrays $i_1$ and $i_2$ in such a way that each array’s internal misalignment is very small ensures that the pairwise misalignment is also small. This observation motivates the use of array-specific reference selection methods (such as RSA) in practice. Improving the reference selection for each array (in the sense that the IMA is reduced) also reduces the upper bound for each pairwise misalignment. The upper bound of (10) is applicable to QA and ISA signals as well because IMA and pairwise misalignment can be analogously defined. The IMA measures the difference between the distributions of the signals of the two-copy markers and the markers used to determine the alignment transformation. The pairwise misalignment measures the difference between two arrays’ distributions of two-copy markers’ signals. Thus, the implicit reference selection strategy of an alignment procedure is quite relevant in practice.
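The bound in (10) can be checked numerically with a short sketch. Two caveats apply. First, Equation (7) is not legible in this copy, so the IMA below uses an assumed form, twice the integral of |F(G⁻¹(u)) − u| over (0, 1), chosen only because it matches the properties stated in the text (zero when F equals G, one when there is no overlap, and the factor 1/2 in the bound). Second, the integrals are approximated by a midpoint rule, the target G is taken to be the standard normal, and the signal values are invented.

```python
from bisect import bisect_right
from statistics import NormalDist


def edf(values):
    """Return the empirical distribution function of `values`."""
    s = sorted(values)
    return lambda x: bisect_right(s, x) / len(s)


def _grid(target, k):
    """Midpoints u of k equal sub-intervals of (0, 1) and x = G^{-1}(u)."""
    us = [(i + 0.5) / k for i in range(k)]
    return us, [target.inv_cdf(u) for u in us]


def pairwise_misalignment(a, b, target=NormalDist(), k=2000):
    """Midpoint-rule approximation of the integral in Equation (9)."""
    _, xs = _grid(target, k)
    fa, fb = edf(a), edf(b)
    return sum(abs(fa(x) - fb(x)) for x in xs) / k


def internal_misalignment(a, target=NormalDist(), k=2000):
    """ASSUMED form of the IMA (not taken from the paper's Equation (7))."""
    us, xs = _grid(target, k)
    fa = edf(a)
    return 2.0 * sum(abs(fa(x) - u) for u, x in zip(us, xs)) / k


# Hypothetical aligned signals of two-copy markers on two arrays
array_1 = [-1.3, -0.6, -0.1, 0.2, 0.7, 1.4]
array_2 = [-0.4, 0.1, 0.6, 0.9, 1.5, 2.1]
delta_12 = pairwise_misalignment(array_1, array_2)
bound = (internal_misalignment(array_1) + internal_misalignment(array_2)) / 2.0
```

Under the assumed definition, the inequality holds term by term in the sum, since |F₁ − F₂| ≤ |F₁ − u| + |F₂ − u| at every grid point, which is the triangle-inequality argument the text refers to.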

4 AN EXAMPLE

Mullighan et al. (2007) used Affymetrix SNP microarrays to study copy number abnormalities in the tumors of 242 ALL subjects. For comparative controls, SNP array data were collected from bone marrow samples obtained during remission from 61 subjects with acute myeloid leukemia. This example uses the Xba genotyping array data collected in this study. Unaligned summary signals were computed using dChip SNP software (www.dchip.org). Genotype calls were generated using GTYPE software, version 4.0 (Affymetrix, Santa Clara, CA, USA). The example dataset is freely available from www.stjuderesearch.org/depts/biostats/refnorm. Four different alignment procedures were applied to the data. A set of QA signals was obtained by computing Equation (1) with the log-transformed unaligned signal values. To obtain ISA signals, dChip software was used with default settings. A set of ‘cyto-RAP’ signals was obtained by using cytogenetics data to select reference markers as described in Section 2.4 and then applying RAP. A set of RSA-RAP signals was obtained by using RSA (with η = 0.15) to select reference markers and then applying RAP. The normal (0,1) distribution was selected as the target distribution for cyto-RAP and RSA-RAP signals. This resulted in four sets of aligned signals. For each set of signals, a profile of standardized differences was computed for each tumor by comparing the signals of the tumor to the signals of the control arrays (Supplementary Materials, Section A). After ordering markers by chromosome and position, the algorithm of Thomas (2003, 2005) was used to segment the standardized difference profile of each chromosome for each tumor. Each resulting segment was inferred as a region of deletion, no copy number change, or amplification on the basis of the proportion of markers with a positive standardized difference (PPSD). Notably, PPSD equals 0.5 if and only if the median standardized difference of a segment equals zero. 
Therefore, given a threshold 0 ≤ γ ≤ 0.5, a segment was declared a region of deletion if PPSD < 0.5−γ , a region of no copy number change if 0.5−γ ≤ PPSD ≤ 0.5+γ , or a region of amplification if PPSD > 0.5+γ . The value of γ defines the interval [0.5−γ ,0.5+γ ] for which a segment’s PPSD is considered to indicate the two-copy state. Increasing γ widens this interval and thus increases the number of segments that are inferred to be in the two-copy state. Setting γ = 0 declares all segments as gain or loss because no segment has a PPSD exactly equal to 0.5. Setting γ = 0.5 declares all segments as being in the two-copy state because 0 ≤ PPSD ≤ 1 by definition. Except where explicitly stated otherwise, γ = 0.1 is used. Figure 1 shows the results of QA for one hyperdiploid tumor array (Hyperdip > 50-SNP#12). The QA signals of the control arrays are stochastically greater than the QA signals of markers

[Figure 1. Panels A and B: Tumor Quantile versus Control Quantile. Panels C and D: standardized difference d versus Marker No.]

Fig. 1. Alignment and segmentation results for a hyperdiploid tumor. (A) The quantiles of QA signals for markers on the amplified chromosomes (black dashed line) and chromosomes with no cytogenetic abnormality (solid black line) of a hyperdiploid tumor’s array against the corresponding quantiles of QA signals for all markers of a female control tissue array. A diagonal gray line along y = x is included for reference. (C) The results of segmenting the standardized differences (Supplementary Materials, Section A) computed from QA signals. The x-axis represents markers (ordered by chromosome and position), and the y-axis represents the standardized differences. Each gray point represents the standardized difference for one marker, and the thick horizontal lines represent the median of the standardized differences in the determined segments. (B, D) panels show analogous results for the same tumor using RSA-RAP signals. RSA-RAP signals and QA signals are on different scales (A, B) because RSA-RAP maps signals to the targeted normal (0,1) distribution and QA maps to the empirically defined distribution of Equation (1).

on chromosomes with no cytogenetically detected abnormality in the tumor (mostly two-copy state) and stochastically less than QA signals of markers on amplified chromosomes (Fig. 1A). Consequently, the segmentation results obtained using QA signals are very difficult to interpret without the benefit of external cytogenetics data (Fig. 1C). The alternating pattern of the signals indicates that the underlying copy number also alternates. However, none of the segments have signal differences that are centered near zero. Thus, without the benefit of cytogenetics data, it is difficult to tell which (if any) set of segments is in the two-copy state. ISA gives qualitatively similar results (data not shown). For this tumor, copy number inferences based on QA signals are concordant with cytogenetics for only 51% of the markers. Virtually the entire genome is inferred to be a region of either amplification or deletion. However, cytogenetics indicates that each chromosome is amplified or in the two-copy state. The chromosomes inferred as deletions have two copies according to cytogenetics. The problem cannot be resolved by choosing a different threshold γ . Choosing γ < 0.1 gives the same inference described above; choosing γ to be somewhat greater than 0.10 erroneously infers


amplified chromosomes to be in the two-copy state; and an even larger γ erroneously infers the entire genome to be in the two-copy state. The QA and ISA signals give qualitatively similar patterns for most hyperdiploid tumors in our study (data not shown). These results are directly attributable to the poor alignment of the signals of two-copy markers in the tumor with the signals of the two-copy markers in the control samples (Fig. 1A). The poor alignment arises from the poor reference selection strategies of QA and ISA (Supplementary Materials, Sections B and C). Figure 1 also shows the results of RSA-RAP for this tumor. The markers on chromosomes with no cytogenetically detected abnormality have a distribution of RSA-RAP signals similar to that of the control arrays (Fig. 1B). Consequently, the standardized differences for the two-copy segments are centered very near zero and thus the results are easily interpreted (Fig. 1D). Cyto-RAP gives qualitatively similar results (data not shown). RSA-RAP and cyto-RAP give copy number inferences that are concordant with cytogenetics for >99.9% of the markers. Moreover, the highly concordant inference is robust in terms of the selection of γ. The copy number inference is exactly the same for 0.086 ≤ γ ≤ 0.344.
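The PPSD-based calling rule used throughout this section, and the role of the threshold γ, can be illustrated with a small sketch. The function names and the segment values are hypothetical; the decision rule itself follows the thresholds 0.5 − γ and 0.5 + γ given earlier in this section.

```python
def ppsd(standardized_differences):
    """Proportion of markers in a segment with a positive standardized difference."""
    positives = sum(1 for d in standardized_differences if d > 0)
    return positives / len(standardized_differences)


def call_segment(standardized_differences, gamma=0.1):
    """Classify one segment as deletion, no change or amplification."""
    if not 0.0 <= gamma <= 0.5:
        raise ValueError("gamma must lie in [0, 0.5]")
    p = ppsd(standardized_differences)
    if p < 0.5 - gamma:
        return "deletion"
    if p > 0.5 + gamma:
        return "amplification"
    return "no change"


# Hypothetical segments of standardized differences
mostly_negative = [-1.2, -0.5, -0.8, 0.1, -0.3]   # PPSD = 0.2
mostly_positive = [0.5, 1.0, -0.2, 0.7]           # PPSD = 0.75
calls = [call_segment(mostly_negative), call_segment(mostly_positive)]
```

With γ = 0.5 every segment falls inside the two-copy interval, which is the degenerate behaviour described in the text; a robust alignment is one for which the calls do not change over a wide range of γ.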

[Figure 2. Four panels titled Cytogenetically Detected Gains, No Cytogenetic Abnormality, Cytogenetically Detected Loss and All Evaluable Markers. Each panel plots Concordance (0.0 to 1.0) against Call Threshold (0.0 to 0.5).]

Fig. 2. ROC-type curves. (A–D) the concordance of each alignment procedure’s results with cytogenetics across all tumor arrays as a function of the threshold γ among markers on amplified chromosomes, deleted chromosomes, chromosomes with no cytogenetically detected abnormality and all markers on chromosomes in one of these three categories. The solid black, dashed black, solid gray and dashed gray lines give the results for RSA-RAP, cyto-RAP, QA and ISA, respectively.

(This range is greater than one-half of the possible range for γ.) These accurate and robust results are obtained because RSA-RAP closely aligns the signal distributions of two-copy markers for this tumor to the two-copy signal distribution of the control arrays (Fig. 1B). Also, for each tumor, (7) was used to compute approximate IMA values by setting kij = 2 for the markers on chromosomes with no cytogenetically detectable abnormalities. The IMA values were computed for QA and RSA-RAP signals. For QA signals, all markers are reference markers, as described previously. For RSA-RAP signals, the reference markers were identified with RSA. The IMA values were not computed for ISA signals because dChip software did not provide the list of markers included in the rank-invariant set. The IMA values for cyto-RAP signals were not computed because the reference markers and the two-copy markers were defined in the same way using cytogenetics data. Therefore, the IMA values for cyto-RAP signals equal zero by definition and are thus not meaningful to consider. The proportion of markers with a copy number inference (using γ = 0.1) concordant with cytogenetics was also computed for each tumor. For the hyperdiploid tumor of Figure 1, the IMA of RSA-RAP signals is 0.01 and the IMA of QA signals is 0.17 (Fig. S3, Supplementary Materials). Therefore, inequality (10) implies that the pairwise misalignment of this tumor’s RSA-RAP signals with the

RSA-RAP signals of any control array must be small. Hence, the accurate copy number results obtained with RSA-RAP signals (Fig. 1D) are to be expected according to inequality (10). Also, inequality (10) does not provide any assurance that the pairwise misalignment of the tumor array’s QA signals with any control array is small. Thus, the poor results obtained with QA signals (Fig. 1C) are not surprising either. Across the dataset, an IMA less than 0.05 strongly correlates with a concordance greater than 0.8 for QA and RSA-RAP signals (Fig. S4, Supplementary Materials). Additionally, the number of tumors with poor concordance and poor IMA with QA signals is substantially greater than that with RSA-RAP signals (Fig. S4). For QA signals, a poor IMA has a strong negative correlation with the proportion of markers located on a chromosome with no cytogenetic abnormality (Fig. S5, left panel). The IMA of RSA-RAP signals shows much less correlation with the proportion of markers on chromosomes with no cytogenetic abnormality (Fig. S5, right panel), indicating that RSA-RAP is more robust against extensive copy number abnormalities than QA. Finally, RSA-RAP and cyto-RAP gave copy number results with markedly better concordance with cytogenetics across the study as a whole than did QA or ISA (Fig. 2). The proportion of markers across all tumors with copy number inferences that are concordant with cytogenetics using RSA-RAP or cyto-RAP signals is greater than or equal to that using ISA or QA signals for any


fixed value of γ (Fig. 2D). Furthermore, RSA-RAP and cyto-RAP give higher concordance among markers on chromosomes with no cytogenetic abnormality (Fig. 2A) and among markers on amplified chromosomes (Fig. 2B) than QA or ISA for any fixed value of γ . For very small γ , ISA gives a slightly higher concordance among markers on deleted chromosomes than the other methods (Fig. 2C) but apparently at the expense of very poor performance among markers on chromosomes with no cytogenetically apparent abnormality (Fig. 2B). This result is readily explained by the misalignment bias of QA and ISA signals (Supplementary Materials, Sections B and C). Additionally, the overall results of cyto-RAP and RSA-RAP are robust over a wide range of values for γ , as observed for the hyperdiploid tumor discussed above. Most interestingly, the results of RSA-RAP are very similar to those of cyto-RAP. This indicates that RSA selects reference markers well enough that little improvement in performance is achieved by collecting auxiliary copy number data, such as cytogenetics.

5 DISCUSSION

RAP is proposed as a procedure to prepare SNP microarray signals for copy number analysis of tumors. Unlike other alignment procedures, RAP is specifically designed to align the signal distribution of two-copy markers across arrays. Across-array signal comparisons can be biologically meaningful only if the arrays have a similar distribution for signals of markers in the two-copy state. Also, poorly aligned signals are quite prone to giving inaccurate copy number results (Figs 1 and S4). Other alignment procedures are not designed with the explicit objective of aligning the signals of two-copy markers. Subsequently, RAP gives more accurate copy number analysis results than ISA or QA in our application (Fig. 2). RSA is proposed as a strategy to select internal reference markers for RAP. The biological rationale of RSA is that two is the smallest copy number that can result in heterozygosity. Intuitively, this is a better rationale for selecting reference markers to align signals for copy number analysis of tumors than the rationale for using all markers as references for all arrays (as in QA) or the rank-invariant set (as in ISA). RSA performed very well in our application, selecting a chromosome with no cytogenetically detectable abnormality for more than 90% of our tumor samples. Consequently, copy number analyses based on RSA-RAP showed better accuracy than copy number analyses based on ISA or QA (Fig. 2). The internal and pairwise misalignment are defined as metrics of signal misalignment. A large pairwise misalignment indicates that across-array comparisons will not be biologically meaningful. For each pair of arrays, the pairwise misalignment is bounded above by the average of the individual arrays’ IMAs as shown in (10). 
Thus, a reference selection strategy that allows for a different set of reference markers to be selected for each array, such as RSA, is warranted from the standpoint of attempting to minimize the IMA for each array and thus minimize the upper bound for the pairwise misalignment of each pair of arrays. It may be possible to modify RSA or develop another RSA that can select reference markers to further reduce the IMA. In our application, the most common source of contamination of the reference set was an amplified or deleted cytoband on the reference chromosome selected by RSA. RAP does not require that full chromosomes be used to define the set of reference markers. Therefore, other reference selection procedures may be able to avoid


the problem of single band amplification or deletion by allowing selection of subsets of chromosomes as the reference markers. Additionally, algorithms that allow inclusion of two-copy markers from multiple chromosomes may give smaller IMAs than RSA in some cases. Unlike QA and ISA, RAP provides a straightforward way to use validation data on copy number abnormalities to rectify poor results. If validation experiments reveal serious errors in the reference selection or copy number results for a particular tumor, the validation data can be used to reselect reference markers and realign the signal values for that array without impacting the alignment and copy number results for other tumors. RAP performs the alignment for each array separately without using the signal data of other arrays (unless the target distribution is empirically defined using other arrays’ data as in QA). As such, RAP can be used in conjunction with the results of laboratory validation to iteratively refine the reference marker selection and copy number results. In their current forms, neither QA nor ISA provides such a straightforward solution to this problem because the reference selection is ‘hard-coded’ into the algorithm. With QA and ISA, the only ‘remedy’ for this problem is to simply report the problem or exclude the problematic tumor from the study. If the latter ‘solution’ were pursued in our study, then two entire disease subtypes, hyperdiploid and hypodiploid ALL, would be excluded simply because QA and ISA are not well-suited alignment strategies for these tumors. RAP, RSA and the concept of internal and pairwise misalignment may be generalized so that they are applicable in other settings as well. For instance, the problems with QA and ISA observed in our study using SNP arrays to examine copy number are likely to arise in applications that use array comparative genomic hybridization to study copy number abnormalities of tumors. 
However, the problem of reference selection is more difficult in the latter setting because genotype data are not available and thus reference selection remains a challenge. Also, it may be possible to improve reference selection strategies for the signal alignment of gene expression microarray signals. For example, it may be desirable to exclude Y chromosome genes from the reference set in studies that include males and females. Also, it may be possible to develop strategies that use the present–absent P-values (Pounds and Cheng, 2005) to select ‘absent’ genes as the reference genes to align the signal values of Affymetrix gene expression microarrays. In this context, RAP would align the signals of ‘absent’ probe sets across arrays. (The rationale of aligning ‘absent’ probe sets is to match the distribution of cross-hybridization signals across arrays.) The practical merit of these and other extensions of RAP should be explored in future research. The three alignment procedures (QA, ISA and RAP) have several interesting similarities and differences. All three alignment procedures can be characterized in terms of a reference selection strategy, selection of a target distribution and a technique to define a transformation of the unaligned signals to the target distribution. For reference selection, QA sets $r_{ij} = 1$ for all i and j, ISA sets $r_{ij} = 1$ for markers in the rank-invariant set and $r_{ij} = 0$ for other markers, and RAP sets the values of $r_{ij}$ as determined by the user. For the target distribution, QA uses (1) to empirically define the target distribution $\tilde{G}(\tilde{y}_{i(j)}) = q_{i(j)}$, ISA uses the selected baseline array to define the target distribution, and RAP allows the user to select the target distribution. To define the transformation, QA uses (1), ISA uses local regression as described in Section 2.2, and RAP uses linear interpolation as described in Section 2.3 and the quantile function


$G^{-1}$ of the user-selected target distribution. Thus, to select the best alignment procedure for a specific application in practice, one should consider which method(s) have the best reference selection strategy, target distribution and technique of transformation for the specific setting of the application. From this perspective, it is clear that QA is a special form of RAP. QA sets $r_{ij} = 1$ for all i and j. Thus, by (5), $u_{ij} = q_{ij}$ for all i and j in this special case. Using (1) to define the target distribution yields $\tilde{G}^{-1}(u_{ij}) = \tilde{y}_{ij}$, thus deriving QA within the framework of RAP. This observation brings up the question of whether using all markers as internal references for all arrays is the optimal reference selection strategy for a specific application. In our application, it is not intuitively reasonable to set all $r_{ij} = 1$ because we already know that many tumors have many markers that are not in the two-copy state. By using all markers as references, QA signals are prone to have large IMAs for tumors with fewer than 80% of markers on a two-copy chromosome (Fig. S5, left panel). RSA is an intuitively more reasonable strategy for reference selection and thus RSA-RAP is more robust in terms of IMA (Fig. S5, right panel). The choice of target distribution for RAP remains an open research question. We chose the normal distribution in our study due to its desirable statistical properties. Bolstad et al. (2003) argue for the use of empirically defined target distributions. However, in our application, it is difficult to meaningfully interpret the empirically defined distribution of Equation (1) because of the radical variation of the true copy number status across arrays. For example, the 75th percentile for a hyperdiploid tumor is likely to represent an amplified copy state while the 75th percentile for a control should represent the two-copy state. 
In terms of Equation (1), the average across i is computed across a cohort that is not homogeneous in terms of copy number for many values of j. Nevertheless, it may prove worthwhile to develop criteria to guide selection of the target distribution or develop techniques to empirically define the target distribution when each array may have a different set of reference markers. RAP and QA operate on the marginal distribution of unaligned signal values of each array. These methods do not explicitly consider

the correlation among markers. It may be possible to further improve signal alignment by developing methods that explicitly consider correlation among marker signal values.
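To make the three ingredients discussed above (reference selection, target distribution, transformation) concrete, the sketch below applies RAP to a single array. It is a hedged illustration rather than the authors' R code: Equation (6) is not legible in this copy, so the sketch assumes the RAP signal is G⁻¹(T(q)), with T the linear interpolation of Section 2.3 and G the standard normal; all function names and data are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist


def adjusted_ecdf(values, points):
    """Adjusted proportion of `values` that are <= each point, as in (2) and (5)."""
    k = len(values)
    s = sorted(values)
    return [bisect_right(s, x) / k - 1.0 / (2 * k) for x in points]


def interpolate(knots, x):
    """Piecewise-linear interpolation through knots sorted by their first coordinate."""
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x must lie in [0, 1]")


def rap_align(y, r, target=NormalDist(0.0, 1.0)):
    """Assumed RAP transform for one array: G^{-1}(T(q_j)) for every marker j."""
    ref = [v for v, flag in zip(y, r) if flag]
    q_all = adjusted_ecdf(y, y)        # Equation (2) for every marker
    q_ref = adjusted_ecdf(y, ref)      # Equation (2) at the reference markers
    u_ref = adjusted_ecdf(ref, ref)    # Equation (5) at the reference markers
    knots = [(0.0, 0.0)] + sorted(set(zip(q_ref, u_ref))) + [(1.0, 1.0)]
    return [target.inv_cdf(interpolate(knots, q)) for q in q_all]


# Special case: every marker is a reference, so T is the identity at the knots
qa_like = rap_align([3.0, 1.0, 2.0, 4.0], [1, 1, 1, 1])

# Hypothetical tumor array: only the four lowest markers are two-copy references
tumor = rap_align([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], [1, 1, 1, 1, 0, 0, 0, 0])
```

In the second example the reference markers end up centered on zero under the normal target, while the non-reference markers are pushed into the upper tail, which is the behaviour the text argues is needed when a large part of the genome is amplified.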

ACKNOWLEDGEMENTS

We thank Dr Fridjtof Thomas for providing R code to implement his change-point algorithm and Dr Xueyuan Cao for summarizing the HapMap data and providing helpful editorial suggestions. We also appreciate the editorial advice of Dr Wei Liu and Mr David Galloway.

Funding: American Lebanese Syrian Associated Charities (ALSAC); the National Institutes of Health Cancer Center Support Grant (CA21765); the NIH/NIGMS Pharmacogenetics Research Network and Database (U01GM61393, U01GM61374, http://pharmgkb.org/); and National Institutes of Health (R01CA115422-01A1, R01CA129541-01, and R01CA132946-01).

Conflict of Interest: none declared.

REFERENCES

Bolstad,B.M. et al. (2003) A comparison of normalization methods for high density oligonucleotide array based on variance and bias. Bioinformatics, 19, 185–193.
Li,C. and Wong,W.H. (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol., 2, 8.
Mullighan,C.G. et al. (2007) Genes regulating B cell development are mutated in acute lymphoid leukaemia. Nature, 446, 758–764.
Pounds,S. and Cheng,C. (2005) Statistical development and evaluation of gene expression data filters. J. Comput. Biol., 12, 482–495.
Raimondi,S.C. et al. (1998) Cytogenetics as a diagnostic aid for childhood hematologic disorders: conventional cytogenetic techniques, fluorescence in situ hybridization, and comparative genomic hybridization. In Hanausek,M. and Walaszek,Z. (eds) Tumor Marker Protocols. Methods Molecular Medicine. Humana Press, Totowa, NJ, 1998. pp. 209–227.
Steinhoff,C. and Vingron,M. (2006) Normalization and quantification of differential expression in gene expression microarrays. Brief. Bioinform., 7, 166–177.
Thomas,F. (2003) Statistical approach to road segmentation. J. Transp. Eng., 129, 300–308.
Thomas,F. (2005) Automated road segmentation using a bayesian algorithm. J. Transp. Eng., 131, 591–598.