SINGLE FEATURE POLYMORPHISMS: DETECTION AND INFLUENCE ON SHORT-OLIGONUCLEOTIDE MICROARRAYS

A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy with a Major in Bioinformatics and Computational Biology in the College of Graduate Studies, University of Idaho

by Matthew Lee Settles
July 2011
Major Professor: Terence Soule, Ph.D.
Co-major Professor: Barrie Robison, Ph.D.
AUTHORIZATION TO SUBMIT DISSERTATION

This Dissertation of Matthew Lee Settles, submitted for the degree of Doctor of Philosophy with a major in Bioinformatics and Computational Biology and titled "Single Feature Polymorphisms: Detection and influence on short-oligonucleotide microarrays," has been reviewed in final form. Permission, as indicated by the signatures and dates given below, is now granted to submit final copies to the College of Graduate Studies for approval.

Co-Major Professor: Dr. Terence Soule
Co-Major Professor: Dr. Barrie Robison
Committee Member: Dr. Chris Williams
Committee Member: Dr. Robert Heckendorn
Program Administrator: Dr. Paul Joyce
Discipline’s College Dean: Dr. Scott Wood
Final Approval and Acceptance by the College of Graduate Studies: Dr. Jie Chen
ABSTRACT

Gene expression microarrays are a tool used by ecologists to study the genome-wide transcriptional variation that underlies complex interactions among and between organisms and their environment. Microarrays allow for the simultaneous measurement of thousands of gene products across many samples. These measurements rely on sequence similarity for efficient hybridization of sample mRNA targets to the microarray probes. Genetic polymorphisms located within a microarray’s probe affect this hybridization efficiency. These differences in probe-level hybridization efficiency are known as single feature polymorphisms (SFPs). SFPs are both a source of error, because they negatively impact the estimate of transcript abundance, and a source of information, because they are genetic markers. The Affymetrix® 3’ IVT microarray platform uses 11 to 16 short-oligonucleotide probes, covering different regions of a gene, to measure transcript abundance. The redundancy of probes within the same gene makes it possible to detect candidate SFPs from the data itself.

This dissertation presents three new results pertaining to SFPs in Affymetrix® gene expression microarray data. The first result is a novel, improved algorithm for the accurate detection of SFPs. The second result evaluates the contribution of SFPs to false positives in both differential expression analysis and the detection of SFPs themselves, and offers a solution to reduce the false positives induced by SFPs. Finally, the third result demonstrates how SFPs can be used to provide a new means for evaluating the accuracy of critical microarray preprocessing steps. These results will provide ecologists with new tools and techniques when using Affymetrix® 3’ IVT microarrays in ecological microarray studies.
ACKNOWLEDGMENTS

I would like to thank the following people and groups: My supervisors, Dr. Terence Soule and Dr. Barrie Robison, for many valuable discussions, suggestions, needed guidance, and mentoring. My committee members, Dr. Williams and Dr. Heckendorn, for their comments, guidance and patience in completing this document. My wife, Karin, son Carter and family for their support of me and my work. Dr. Larry Forney (Director of IBEST) and Drs. Holly Wichman, James Foster, and Paul Joyce (IBEST Research Oversight Team) for mentorship and support in all things, especially while completing this document. All of the IBEST faculty, staff and students for valuable discussions, mentoring and providing a great and stimulating environment in which I work. My fellow BCB students for being a fantastic group to be a part of.
DEDICATION I would like to dedicate this dissertation to my wife and son for helping me be the man I am today.
TABLE OF CONTENTS

Titlepage
Authorization to Submit Dissertation
Abstract
Acknowledgments
Dedication
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
    1.1 General Introduction and Motivation
    1.2 Objectives of this Research
    1.3 Outline
Chapter 2: Scientific Background
    2.1 The Affymetrix® GeneChip® Microarray Platform
    2.2 Signal Calls
    2.3 Preprocessing Affymetrix Arrays
    2.4 Summary
Chapter 3: Introduction to the Research Chapters
Chapter 4: An improved algorithm for the detection of genomic variation using short oligonucleotide expression microarrays
    4.1 Abstract
    4.2 Introduction
    4.3 Materials and Methods
    4.4 Results
    4.5 Discussion
Chapter 5: Effect of Single Feature Polymorphisms on Computed Expression Levels
    5.1 Abstract
    5.2 Introduction
    5.3 Methods
    5.4 Results
    5.5 Discussion
Chapter 6: Evaluation of microarray preprocessing routines using single feature polymorphisms
    6.1 Abstract
    6.2 Introduction
    6.3 Methods
    6.4 Results
    6.5 Discussion
Chapter 7: Discussion and Future Work
    7.1 Summary
    7.2 Applications
    7.3 Future Work
References
Appendix A: Experimental Results
LIST OF FIGURES

2.1 Central dogma of molecular biology.
2.2 Example of three probe set images, with PM and MM rows, containing 20, 16 and 16 probe pairs respectively.
4.1 Single Feature Polymorphisms - Example
4.2 Position dependent sensitivity of each algorithm to detection of SFPs
4.3 Venn diagram of the overlap of SFP calls
4.4 Comparison of differential expression and SFP calling in the BB3 dataset.
4.5 Comparison of differential expression and SFP calling in the E-TABM-113 dataset.
4.6 Single Feature Polymorphism detection using the RMA method when multiple probes are called as containing a SFP in the BB3 dataset.
5.1 Single Feature Polymorphisms - Example
5.2 Single Feature Polymorphism detection in a gene with a high number of called SFPs (10 of 11).
6.1 SFP detection using the RMA preprocessing pipeline in the E-TABM-113 dataset
6.2 Clustering of preprocessing routines with performance characteristics - BB3 dataset
6.3 Clustering of preprocessing routines with performance characteristics - BB3 dataset
6.4 Boxplots of sensitivity, specificity and FDR results for the BB3 dataset
6.5 Boxplots of sensitivity, specificity and FDR results for the E-TABM-113 dataset
LIST OF TABLES

2.1 Breakdown of the preprocessing steps for the MAS5, Plier, RMA, GCRMA, and dChip preprocessing pipelines.
4.1 Sensitivity, Specificity and FDR results.
4.2 SFP frequency in the BB3 and E-TABM-113 datasets.
5.1 Comparison of differential expression analysis and SFP prediction with and without the SFP mask.
5.2 Contingency table comparing the results between masked and unmasked data for differential expression.
5.3 Contingency table comparing the results between masked and unmasked data for SFP prediction.
6.1 Breakdown of the preprocessing steps for the MAS5, Plier, RMA, GCRMA, and dChip preprocessing pipelines.
6.2 Range in performance characteristics of each preprocessing subroutine.
A.1 SFP evaluation results for 89 preprocessing pipelines on the BB3 dataset.
A.2 SFP evaluation results for 97 preprocessing pipelines on the E-TABM-113 dataset.
Chapter 1 INTRODUCTION

1.1 General Introduction and Motivation
Gene expression studies in ecology and conservation seek to increase our understanding of natural and managed ecosystems, their constituent organisms and resources. These studies are performed on different scales, from individual organisms to communities and populations. DNA microarrays are a tool used to study gene expression in an ecological context on a genome-wide level [36, 27, 6]. A DNA microarray is a collection of DNA fragments (termed probes) laid out in a two-dimensional field on a substrate (such as a glass slide or silicon chip), which can bind to complementary DNA fragments in a sample. Microarrays are commonly used to measure the relative abundances of gene expression between samples [7, 40, 58].

The Affymetrix® microarray platform offers some unique advantages for ecological studies as compared to other microarray platforms (http://www.affymetrix.com). First, the platform uses many oligonucleotide probes (commonly 11 to 20, called a probe set) that cover different regions within the same gene, to more accurately measure gene expression. Second, the oligonucleotide probes are short, 25bp, which is both an advantage and a weakness. Signal measurements from short oligonucleotide probes are more susceptible to adjustments from genetic polymorphism than those from longer probes. These adjustments are a source of error, because they do not reflect a comparable measurement of gene expression. However, these adjustments can also be used as genetic markers and can add another dimension of data to the experiment.

Affymetrix® gene expression microarrays have recently been used as a high-throughput platform both for profiling genome-wide gene expression differences and for the discovery of candidate genetic polymorphisms. There are over 20,000 scientific publications that utilize Affymetrix® microarrays, most of which are gene expression studies. Where Affymetrix® gene expression microarrays were once only used in differential expression studies, today they are also being used for the discovery of genetic polymorphisms that can be used as genetic markers [14, 18, 70, 21, 35, 20, 30, 75, 60, 34, 28, 66, 53, 17]. Genetic polymorphisms in microarray data are typically described as single feature polymorphisms (SFPs), where a feature is a probe on the microarray. Therefore, the entire microarray probe acts as the genetic marker. Algorithms have been proposed for the detection of SFPs in standard gene expression microarray experiments [74, 68, 15, 56, 19, 14]. Gene expression itself, and in particular expression differences, poses the most significant issue to be addressed for accurate and sensitive detection of SFPs. Conversely, the presence of SFPs also presents a problem for the precise estimation of gene expression. Gene expression estimates are modeled using the combined information from all probes within a probe set. SFPs present themselves as a type of error that can negatively impact expression estimates and induce false positive results in differential expression analysis.

The study of single feature polymorphisms in standard expression microarray datasets is the focus of this dissertation. Results are expected to provide ecologists with new methods and information for using Affymetrix® microarrays in ecological microarray studies.
1.2 Objectives of this Research

The objective of this dissertation is to provide researchers in ecology and conservation with new tools and results for enhanced analysis of gene expression microarray studies across genetically different populations. Specifically, this dissertation has the following three aims pertaining to single feature polymorphisms in Affymetrix® gene expression microarray datasets:

Aim #1: To provide a new and improved algorithm for the accurate and sensitive detection of single feature polymorphisms.
Aim #2a: To quantify the effect gene polymorphisms have on the false positive (and negative) rates in both differential expression estimates and the detection of single feature polymorphisms.
Aim #2b: To reduce the false positive rates in differential expression and detection of single feature polymorphisms by correcting for the presence of single feature polymorphisms.
Aim #3: To determine how single feature polymorphisms can be used to provide a new means of evaluating the effectiveness of critical microarray processing steps.

Completion of these aims provides new algorithms for SFP detection and SFP correction in gene expression microarray studies and a new method to evaluate the performance characteristics of Affymetrix® microarray preprocessing steps.
1.3 Outline

The next chapter (Chapter 2) covers the scientific background of this dissertation. Chapter 3 introduces the three research chapters. Chapter 4 addresses aim #1 of the objectives, Chapter 5 addresses aims #2a and #2b, and Chapter 6 addresses aim #3. Finally, Chapter 7 concludes this dissertation with a discussion and future work. The remaining parts of the dissertation are the references and an appendix. Appendix A provides supplemental information pertaining to Chapter 6.
Chapter 2 SCIENTIFIC BACKGROUND

Microarrays provide a means of comparing the relative expression of genes between experimental conditions. The central dogma of molecular biology states that the transfer of sequential information begins with DNA, which is transcribed to messenger RNA (mRNA) and then translated into a protein (see figure 2.1). This process is collectively referred to as gene expression. The "state" of a sample at any given time is governed by which of its genes are expressed and by their relative abundances at that time. Microarrays aim to capture the "state" of a sample by measuring the relative quantities of mRNA in a sample at a particular moment in time. In a typical microarray expression experiment, the objective is to compare the relative abundances of genes across conditions. This allows researchers to make inferences regarding the differences between the "states" of experimental conditions. To make these comparisons the samples must be processed and the resulting data analyzed.

Briefly, biological samples are processed by generating cDNA from RNA using reverse transcription. The cDNA is labelled with a fluorescent dye and hybridized to a microarray. Fundamental to microarrays is the process of hybridization, where complementary nucleic acid sequences specifically pair with each other by forming hydrogen bonds between complementary nucleotide base pairs. The greater the number of complementary base pairs, the tighter the resulting bond between two sequences. Microarrays are manufactured by attaching a large number (hundreds to millions) of expected complementary nucleic acid sequences (termed probes) to a substrate, each of which is bound in high quantity (picomoles). Hybridization binds labeled nucleic acid sequences from a sample to complementary nucleic acid sequences on a microarray. After hybridization, the microarray is washed to remove any non-specific binding nucleic acids,
so that only the strongly bonded strands, which contain a significant number of complementary nucleotide base pair matches, remain hybridized. The microarray is then scanned using a laser to excite the fluorescently labeled sequences still bound. A signal is generated for each probe, the intensity of which is primarily dependent on the quantity of target sample bound to the probe and the strength of the hybridization bond.

Microarrays compare the relative expression of genes between conditions by comparing the average signal intensity of a probe set from one condition to the average intensity of the same probe set from a different condition. For this to occur, the microarray signals must first be quality checked and processed, and subsequent data analysis performed. Microarray signals are checked for quality and processed using one of many algorithms that produce estimates of gene expression. Quality is most often determined by comparing the microarray-wide signal intensity profiles across all microarrays in an experiment. A fundamental assumption in microarray gene expression experiments is that, on average, there is no change in gene expression across conditions. Poor quality microarrays will have signal intensity profiles that deviate from the average signal intensity profile for the experiment and can be removed from the experiment.

Figure 2.1: Central dogma of molecular biology (Overview of the central dogma of molecular biology. Original work by Mike Jones for wikipedia [http://upload.wikimedia.org/wikipedia/en/thumb/e/ee/Cdmb.svg/800px-Cdmb.svg.png]).

Gene expression estimates are then produced from signal intensity values using a preprocessing algorithm that typically performs background correction, normalization and multi-probe summarization. Evidence for gene expression differences between conditions can then be tested by statistical means. Genes are placed into one of three categories (no change, up-regulated, or down-regulated across conditions) and biological interpretations are made. Altogether, this process is referred to as differential expression analysis; however, differential expression alone does not utilize the full potential of microarray gene expression data to extract biologically meaningful information. Post differential expression analyses, or specialized experimental designs, are common and can enhance biological interpretation from microarray gene expression experiments. Examples of post differential expression analyses include: eQTL analysis [37], gene ontology analysis [57], pathway analysis [69], gene set enrichment analysis [63], gene co-expression network construction [23, 76] and single feature polymorphism (SFP) analysis [18]. SFP analysis is the focus of this dissertation; the other topics are beyond its scope and are not discussed further.

The remaining sections of the chapter present the relevant background information required for SFP analysis in gene expression microarray experiments. The next section describes the microarray platform used in this dissertation, followed by a section that describes an Affymetrix® algorithm to call a gene "expressed" or not. The remaining sections provide an overview of the most common preprocessing procedures used to produce gene expression estimates from microarray signal intensity values.

2.1 The Affymetrix® GeneChip® Microarray Platform
Affymetrix® GeneChip® microarrays consist of small DNA fragments, termed probes, chemically synthesized at specific locations on a coated quartz surface [44]. Each probe is an oligonucleotide sequence of length 25bp. The mRNA molecule of interest is typically represented by a probe set comprising 11 to 20 probe pairs. Each probe pair is composed of one perfect match (PM) probe and one paired mismatch (MM) probe (see figure 2.2). The mismatch probe is created by complementing the middle (13th) base of the PM probe’s sequence. The MM probes were originally intended to provide a means to estimate non-specific binding and background. However, studies have shown that the MM probes detect not only non-specific binding and background, but specific signal as well [32, 73]. Labelled mRNA samples are hybridized to the microarray and scanned, and image analysis is performed to provide intensity values for each of the PM and MM probes. The PM and MM intensity values provide the raw data for microarray gene expression analysis.
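The construction of a mismatch probe from its perfect match partner can be illustrated with a short sketch. The function name and the example sequence below are hypothetical and chosen only for illustration; the only rule taken from the text is that the 13th base of the 25bp PM sequence is complemented.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(pm_sequence: str) -> str:
    """Return the MM sequence for a 25bp PM sequence (hypothetical helper)."""
    if len(pm_sequence) != 25:
        raise ValueError("expected a 25bp probe sequence")
    middle = 12  # zero-based index of the 13th base
    swapped = COMPLEMENT[pm_sequence[middle]]
    return pm_sequence[:middle] + swapped + pm_sequence[middle + 1:]


# Made-up PM sequence, not taken from any real array design.
pm = "ACGTACGTACGTACGTACGTACGTA"
mm = mismatch_probe(pm)
```

The PM and MM sequences therefore differ at exactly one position, which is why a target that binds the PM probe can still produce a specific signal on the MM probe.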
Figure 2.2: Example of three probe set images, with PM and MM rows, containing 20, 16 and 16 probe pairs respectively.
2.2 Signal Calls

Affymetrix provides an algorithm in their MicroArray Suite 5 software, called Present/Marginal/Absent calls (PMA calls), that can be used to determine whether a probe set is reliably detected (Present), not detected (Absent), or marginally detected (Marginal) as being expressed. Discrimination scores are calculated for each probe pair in the probe set using the equation (PM − MM)/(PM + MM). A signal detection p-value is calculated for a probe set from a one-sided Wilcoxon signed rank test of the probe pair discrimination scores compared to a predefined constant τ (default τ = 0.015) [71]. A p-value less than α1 is assigned a present call, a p-value greater than α1 and less than α2 is assigned a marginal call, and a p-value greater than α2 is assigned an absent call (defaults α1 = 0.04 and α2 = 0.06). Probe sets with many probe pairs having discrimination scores close to 1.0 will result in a low p-value, and those having negative discrimination scores (i.e. MM > PM) will result in p-values close to 1. The signal call can be used to filter a dataset to only those genes which are considered to be expressed.
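A minimal sketch of the detection call is given below. It is not the Affymetrix implementation: the function names are hypothetical, the p-value is an exact signed rank p-value computed by enumerating sign assignments (with mid-ranks for ties and zero differences dropped), and the treatment of p-values that fall exactly on α1 or α2 is an assumption, since the text does not specify it.

```python
from typing import List, Sequence


def discrimination_scores(pm: Sequence[float], mm: Sequence[float]) -> List[float]:
    """Per probe pair score (PM - MM) / (PM + MM)."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def signed_rank_pvalue(scores: Sequence[float], tau: float) -> float:
    """Exact one-sided p-value for the alternative 'scores tend to exceed tau'."""
    diffs = [s - tau for s in scores if s != tau]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    doubled_rank = [0] * n  # twice the mid-rank, so tied ranks stay integers
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        for k in range(start, end + 1):
            doubled_rank[order[k]] = start + end + 2
        start = end + 1
    observed = sum(r for r, d in zip(doubled_rank, diffs) if d > 0)
    total = sum(doubled_rank)
    # ways[s] = number of sign assignments whose positive doubled ranks sum to s
    ways = [1] + [0] * total
    for r in doubled_rank:
        for s in range(total, r - 1, -1):
            ways[s] += ways[s - r]
    return sum(ways[observed:]) / 2 ** n


def detection_call(pm: Sequence[float], mm: Sequence[float], tau: float = 0.015,
                   alpha1: float = 0.04, alpha2: float = 0.06) -> str:
    """Return 'P' (present), 'M' (marginal) or 'A' (absent) for one probe set."""
    p_value = signed_rank_pvalue(discrimination_scores(pm, mm), tau)
    if p_value < alpha1:
        return "P"
    if p_value < alpha2:
        return "M"
    return "A"
```

With 11 probe pairs that all have PM well above MM, every sign is positive and the p-value is 1/2^11, which gives a present call; reversing PM and MM gives a p-value of 1 and an absent call.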
2.3 Preprocessing Affymetrix Arrays
The goals of preprocessing Affymetrix microarray data are threefold: 1) to remove variation due to technical sources, while preserving variation from biological sources; 2) to normalize a set of microarrays in order to make them comparable; and 3) to produce summarized expression values for each probe set. Preprocessing includes the following stages: background correction, probe level normalization, PM correction, probe set summarization and sometimes a second, probe set level normalization. Not all preprocessing pipelines use all stages, but all pipelines are meant to provide the user with data ready for statistical analysis. The five most commonly used preprocessing pipelines are: MAS5 [1], Plier [3], RMA [32, 31, 12], GCRMA [73] and dChip [43, 42]. Table 2.1 details the specific algorithms employed at each preprocessing stage for these five pipelines. Other preprocessing pipelines exist; however, these five are the most commonly used in practice and are the only preprocessing pipelines used in this dissertation. The specific algorithms employed at each stage are described in greater detail below.
Table 2.1: Breakdown of the preprocessing steps for the MAS5, Plier, RMA, GCRMA, and dChip preprocessing pipelines.

Stage                      MAS5              Plier                   RMA                     GCRMA                   dChip
Background Correction      weighted average  none                    RMA (global model)      GCRMA (model based)     none
Probe Level Normalization  none              quantile normalization  quantile normalization  quantile normalization  invariant set
PM Correction              ideal mismatch    none                    none                    none                    subtract MM or none
Probe Set Summarization    Tukey biweight    Plier                   median polish           median polish           MBEI
Probe Set Normalization    mean scaled       none                    none                    none                    none

2.3.1 Background Correction Algorithms
There are many sources of error introduced by the processes by which gene expression is measured. The data are fluorescence intensities read by a scanner, which are only a proxy for the actual expression level of the gene. Deviations from actual expression levels are introduced by many sources, including non-specific binding (hybridization), RNA quality, reagents, etc., all of which can be considered background noise. Several algorithms have been developed to perform background correction.

RMA (global model)
RMA is the background adjustment implementation for the RMA preprocessing pipeline [32, 31, 12]. The PM probe intensities are corrected by using a global model for the distribution of probe intensities. Specifically, the observed PM probe intensities are modeled as the sum of a normal noise component (normal with mean µ and variance σ²) and an exponential signal component (exponential with rate α, that is, mean 1/α). To avoid any possibility of negatives, the normal is truncated at zero. The expected value of the background adjusted probe intensity values, s, can then be computed using the following equation.
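$$E(s \mid O = o) = a + b \, \frac{\phi\left(\frac{a}{b}\right) - \phi\left(\frac{o - a}{b}\right)}{\Phi\left(\frac{a}{b}\right) + \Phi\left(\frac{o - a}{b}\right) - 1} \tag{2.1}$$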
Given O as the observed intensity values, a = o − µ − σ²α and b = σ. Note that φ and Φ are the standard normal density and distribution functions, respectively.

Weighted average
This is the routine for background correction performed in the MAS5 pipeline [1]. The chip is broken into a grid of 16 rectangular regions. Within each region, the lowest 2% of probe intensities are used to compute a background value for that region. Each probe is then adjusted based upon a weighted average of the backgrounds for each of the regions. The weights are based on the distances between the location of the probe and the centroids of the 16 regions.

GCRMA (model based)
This is the routine for background correction performed in the GCRMA pipeline [73]. GCRMA adjusts for optical noise and non-specific binding to produce background corrected intensity values. The algorithm uses probe sequence information to estimate probe affinity for non-specific binding. Sequence information is summarized by the base types (A, T, G, or C) at each position (1 through 25) along the probe to determine the affinity of each probe. A non-specific binding experiment was conducted as part of the algorithmic design in order to estimate the parameters of the position-specific base contributions to the probe affinity. The experiment was conducted such that only non-specific binding, and no gene-specific binding, was expected to occur. With the probe affinities available, GCRMA estimates the function for non-specific binding given the probe sequence affinities. Specifically, GCRMA estimates non-specific binding by fitting a loess curve through the relationship between MM probe intensities and predefined MM probe affinities. The background adjusted intensity values are then computed as the posterior mean of specific binding given the observed intensities and the probe sequences.
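As an illustration of the RMA global model described earlier in this section, Equation 2.1 can be evaluated directly once µ, σ and α are known. The sketch below assumes that those three parameters have already been estimated elsewhere (the estimation step is not shown), treats α as the rate of the exponential signal component so that σ²α has units of intensity, and uses a hypothetical function name.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density, phi in Equation 2.1."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal distribution function, Phi in Equation 2.1."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rma_background_adjust(o: float, mu: float, sigma: float, alpha: float) -> float:
    """Expected signal E(s | O = o) for one observed PM intensity o."""
    a = o - mu - sigma ** 2 * alpha
    b = sigma
    numerator = normal_pdf(a / b) - normal_pdf((o - a) / b)
    denominator = normal_cdf(a / b) + normal_cdf((o - a) / b) - 1.0
    return a + b * numerator / denominator


# Made-up parameter values, used only to exercise the formula.
bright = rma_background_adjust(1000.0, mu=100.0, sigma=20.0, alpha=0.01)
dim = rma_background_adjust(90.0, mu=100.0, sigma=20.0, alpha=0.01)
```

For an intensity far above the background the adjusted value is close to a = o − µ − σ²α, while for an intensity below the background mean the result remains positive, which is the practical reason for using the conditional expectation rather than simply subtracting µ.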
2.3.2 Probe Level Normalization Methods
The purpose of normalization is to produce comparable data across multiple samples. Intensities on different arrays may have systematic bias due to diverse technical effects such as efficiency of RNA extraction, reverse transcription, scanning, spot detection, etc. Further, there may also be systematic effects due to characteristics of the microarray, such as the effect of different probes, spotting effects, spatial effects, etc. The role of normalization is to compensate for the technical effects, while preserving the effects due to the biology.

Quantile normalization
The normalization method used in the RMA, GCRMA, and Plier pipelines is the quantile normalization procedure first introduced in [12]. The goal is to give each chip the same empirical distribution. The algorithm for quantile normalization is:
1. given n microarrays of length p (p being the number of probe intensities), form the matrix X of dimension p × n, where each microarray is a column of X and each probe is a row of X;
2. sort each column of X to get X_sort, keeping note of the original ordering;
3. take the mean across each row of X_sort and assign the mean to each element in the row to get X'_sort;
4. get the normalized matrix X' by reordering each column of X'_sort to the original ordering of X.
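The four steps above can be sketched as follows. This is a minimal illustration with a hypothetical function name; it stores X as a list of rows (probes) and breaks ties by original position, which is a simplification that production implementations may handle differently.

```python
from typing import List


def quantile_normalize(x: List[List[float]]) -> List[List[float]]:
    """Quantile normalize a p x n matrix (rows are probes, columns are arrays)."""
    p, n = len(x), len(x[0])
    # Step 1: view each microarray as a column of X.
    columns = [[x[i][j] for i in range(p)] for j in range(n)]
    # Step 2: sort each column, remembering which row each sorted value came from.
    orders = [sorted(range(p), key=lambda i, col=col: col[i]) for col in columns]
    # Step 3: mean across each row of the sorted matrix.
    row_means = [sum(columns[j][orders[j][r]] for j in range(n)) / n for r in range(p)]
    # Step 4: put the row means back in the original ordering of each column.
    normalized = [[0.0] * n for _ in range(p)]
    for j in range(n):
        for rank, i in enumerate(orders[j]):
            normalized[i][j] = row_means[rank]
    return normalized


# Made-up intensities for four probes on three arrays.
x = [[5.0, 4.0, 3.0],
     [2.0, 1.0, 4.0],
     [3.0, 5.0, 6.0],
     [4.0, 2.0, 8.0]]
x_norm = quantile_normalize(x)
```

After normalization every column contains the same set of values (here 2, 3, 14/3 and 6), so all arrays share one empirical distribution while each probe keeps its rank within its own array.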
Invariant set
Invariant set normalization is the normalization routine performed in the dChip pipeline [43, 42]. The procedure selects an ad hoc set of genes to normalize by, instead of a set of housekeeping genes. The procedure assumes that there exists a subset of genes between two microarrays that have the same rank order, an invariant set. The microarray whose intensities are closest to the median overall intensity across all microarrays is chosen as the baseline microarray. Microarrays are then normalized by selecting invariant sets of probes (as compared to the baseline microarray); then, using only those probes, each array is normalized by a non-linear relationship fitted between the microarray and the baseline microarray.

2.3.3 PM Correction
The MM probes were originally designed to provide a measure of the error due to non-specific binding. PM correction routines are designed to account for this type of error specifically by use of the MM probes.

Subtract MM
Subtract MM was the algorithm used in the original Affymetrix® MicroArray Suite 4 software and is one of the two approaches used in dChip (the other is to perform no PM correction). This procedure simply subtracts the MM probe intensity value from the PM probe intensity value for each probe pair.

Ideal mismatch
Subtracting MM probe intensity values from PM intensity values can lead to negative numbers, which are nonsensical. In the MicroArray Suite 5 (MAS5) pipeline, non-specific binding is corrected for by subtracting an ideal MM value from the PM value [1]. The ideal MM value is equal to MM when MM < PM, or to PM minus a small correction value when MM ≥ PM. The correction value is a function of the difference between PM and MM. The ideal MM value allows MM to be subtracted when possible (i.e. MM is less than PM), or a substitute value when it is not. The ideal mismatch will always be less than the corresponding PM and thus does not produce negative values.
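The ideal mismatch rule can be sketched as below. The text does not give the exact form of the correction, so this sketch follows the rule as it is described in the Affymetrix Statistical Algorithms Description Document, to the best of my recollection: the correction is applied on the log2 scale (PM is divided by a power of two), it is driven by a probe set level robust average of log2(PM/MM) computed with a one-step Tukey biweight, and the constants (contrast τ = 0.03, scale τ = 10, c = 5, ε = 0.0001) are the defaults from that description. All function names are hypothetical, and the details should be treated as assumptions rather than as the method used in this dissertation.

```python
import math
from statistics import median
from typing import List, Sequence


def tukey_biweight(values: Sequence[float], c: float = 5.0, epsilon: float = 0.0001) -> float:
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    weights = []
    for v in values:
        u = (v - center) / (c * mad + epsilon)
        weights.append((1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def ideal_mismatch(pm: Sequence[float], mm: Sequence[float],
                   contrast_tau: float = 0.03, scale_tau: float = 10.0) -> List[float]:
    """Ideal mismatch values for the probe pairs of one probe set."""
    # Probe set level "specific background": robust mean of log2(PM / MM).
    sb = tukey_biweight([math.log2(p / m) for p, m in zip(pm, mm)])
    ideal = []
    for p, m in zip(pm, mm):
        if m < p:
            ideal.append(m)            # MM is usable as it is
        elif sb > contrast_tau:
            ideal.append(p / 2 ** sb)  # shrink PM by the probe set level ratio
        else:
            # Weak or negative specific background: shrink PM only slightly.
            ideal.append(p / 2 ** (contrast_tau / (1.0 + (contrast_tau - sb) / scale_tau)))
    return ideal


# Made-up intensities: the third probe pair has MM > PM.
pm = [200.0, 150.0, 100.0]
mm = [50.0, 60.0, 120.0]
im = ideal_mismatch(pm, mm)
```

In every branch the ideal mismatch is strictly below PM, so PM minus the ideal mismatch stays positive and can be log transformed before summarization.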
2.3.4 Probe Set Summarization
The purpose of probe set summarization is to produce a single value for a probe set that is the estimated expression level for the probe set (i.e. gene). More than one probe (typically 16 to 20) is designed for each gene on a microarray in order to produce a robust measure of gene expression in the presence of probe-specific binding differences, sequence variation and spatial artifacts that may occur on a microarray.

Tukey biweight
The summarization method employed in MAS5 performs a robust averaging across all the probe intensity values within a probe set. The probe intensity values are first log2 transformed and then the 1-step Tukey biweight procedure is used to produce the expression level estimate for a probe set [50]. This procedure does not use information across multiple arrays and can therefore be used on a single microarray.

Model Based Expression Index (MBEI)
MBEI is the method implemented in dChip [43, 42]. Li and Wong made the observation that the variance of a probe across multiple microarrays was less than the variance across probes within a probe set on a single microarray. In order to produce probe set expression level estimates, the following multi-chip model is fitted to each probe set:

\[ y_{ij} = \phi_i \theta_j + \varepsilon_{ij} \tag{2.2} \]

where $y_{ij}$ is the probe intensity value for probe $i$ within the probe set and microarray $j$, and $\varepsilon_{ij}$ is the error term. The $\phi_i$ parameter is a probe response parameter (a measure of probe-specific binding) and $\theta_j$ is the probe set expression level estimate for array $j$.
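As an illustration of how a multiplicative model of the form in Equation 2.2 can be fitted, the following minimal Python sketch uses alternating least squares on a probes-by-arrays matrix. It is not the dChip implementation: the function name `fit_mbei`, the fixed iteration count, and the identifiability constraint (the squared probe responses summing to the number of probes) are assumptions made for this example, and any outlier handling that dChip may perform is omitted.

```python
def fit_mbei(y, n_iter=50):
    """Fit y[i][j] ~ phi[i] * theta[j] by alternating least squares.

    y holds one row per probe i and one column per array j.
    Assumed constraint for this sketch: sum(phi_i^2) == number of probes.
    """
    n_probes, n_arrays = len(y), len(y[0])
    phi = [1.0] * n_probes          # start from equal probe responses
    theta = [0.0] * n_arrays
    for _ in range(n_iter):
        # Given phi, each theta_j is a least-squares regression through the origin.
        ss_phi = sum(p * p for p in phi)
        theta = [sum(phi[i] * y[i][j] for i in range(n_probes)) / ss_phi
                 for j in range(n_arrays)]
        # Given theta, each phi_i is the same regression taken along the row.
        ss_theta = sum(t * t for t in theta)
        phi = [sum(theta[j] * y[i][j] for j in range(n_arrays)) / ss_theta
               for i in range(n_probes)]
        # Rescale so the constraint holds; the products phi_i * theta_j are unchanged.
        scale = (sum(p * p for p in phi) / n_probes) ** 0.5
        phi = [p / scale for p in phi]
        theta = [t * scale for t in theta]
    return phi, theta


# Hypothetical noise-free probe set: 3 probes measured on 3 arrays.
y = [[a * b for b in (100.0, 200.0, 400.0)] for a in (0.5, 1.0, 1.5)]
phi, theta = fit_mbei(y)
```

On noise-free data such as this the fitted products reproduce the input matrix; on real data the residuals estimate the error term.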
Median polish
This is the summarization procedure used in the RMA and GCRMA pipelines and was first described in [32]. To produce probe set expression level estimates, the following multichip linear model is fit to data from each probe set:

\[ \log_2(y_{ij}) = \alpha_i + \beta_j + \varepsilon_{ij} \tag{2.3} \]

where $y_{ij}$ is the probe intensity value for probe $i$ within the probe set and microarray $j$, $\alpha_i$ is a probe effect, $\beta_j$ is the log2 probe set expression level estimate for array $j$, and $\varepsilon_{ij}$ is the error term. Median polish is an algorithm (see [64]) for fitting this model robustly.

Probe Logarithmic Intensity ERror (PLIER)
The PLIER algorithm for probe set summarization was designed by Affymetrix as an update to the 1-step Tukey biweight procedure in MAS5. PLIER accounts for the difference in signal between probes within the same probe set by means of a parameter termed probe affinity (probe-specific binding estimates). Probe affinities represent the strength of signal produced at a specific concentration for a given probe. PLIER seeks to produce a more accurate probe set expression level estimate by utilizing these inherent probe affinities and empirical probe performance, and by handling error across low and high concentrations. Probe affinities were calculated using experimental data across multiple microarrays. PLIER also utilizes an error model that assumes error is proportional to the probe, rather than the signal. This ensures that the error model can adjust appropriately for low and high concentrations (see [3] for a more thorough description of the algorithm).

2.3.5 Probe Set Normalization
Mean scaled
The MAS5 pipeline performs normalization only after probe set summarization rather than at the probe level like most other pipelines. Mean scaled normalization applies a linear scaling method to the summarized probe sets, so that all microarrays within an experiment have the same mean expression level. A scale value is predefined for the experiment (300 is a typical scale value). A scaling factor $\beta_j$ for each microarray $j$ is then computed by dividing the predefined scale value by the trimmed mean for microarray $j$. The normalized expression level estimate $x'_{ij}$ for probe set $i$ in microarray $j$ is determined by multiplying each of the original probe set expression level estimates $x_{ij}$ by the computed scaling factor $\beta_j$ (Equation 2.4). The resulting normalized microarrays each have the same trimmed mean.

\[ \beta_j = \frac{\text{scale}}{\mu_j}, \qquad x'_{ij} = \beta_j x_{ij} \tag{2.4} \]

where $\mu_j$ is the trimmed mean of microarray $j$, with the highest and lowest 2% of probe set expression level estimates removed.

2.4 Summary
This chapter presented the relevant background information needed to understand the SFP analyses presented within this Dissertation. We described the platform used, Affymetrix, a signal calling algorithm for determining whether a gene is expressed, or not, and a number of preprocessing algorithms used to produce expression estimates from probe intensity signal values. While there are other microarray platforms and algorithms that can be used for SFP analysis, those presented in this chapter are the most common and the only platform and algorithms used in this dissertation.
Chapter 3
INTRODUCTION TO THE RESEARCH CHAPTERS

The research chapters (4, 5 and 6) address the objective of this dissertation, which is to provide new tools and techniques for enhanced analysis of microarray expression experiments comparing two genetically different populations. Chapter 4 addresses Aim #1 and presents a new and improved algorithm for the identification of SFPs in Affymetrix® expression microarrays. Chapter 5 quantifies the effect of genetic polymorphisms on the false positive rates of differential expression and SFP detection (Aim #2) and presents a procedure to reduce false positives by masking candidate SFPs (Aim #2b). Finally, Chapter 6 addresses Aim #3 and shows how a database of known genetic polymorphisms can be used as a means for evaluating existing microarray preprocessing routines.

In Chapter 4 we present a new and improved algorithm for the detection of SFPs, and evaluate and compare the algorithm to two previously published SFP detection algorithms using two publicly available Affymetrix® Barley (Hordeum vulgare) microarray datasets. Algorithmic performance is measured by sensitivity, specificity, and false discovery rate (FDR), determined using a reference database of known genetic polymorphisms. The results show that our algorithm provides a more consistent and sensitive calling of SFPs with a lower FDR.

Chapter 5 quantifies the effect of genetic polymorphisms on differential expression analysis and SFP detection. We quantify the false positive rates and false negative rates of differentially expressed genes and SFP detection by masking out possible genetic polymorphisms. The procedure described does not rely on having a reference database of known genetic polymorphisms and can be applied to any microarray dataset where comparisons are being made between populations. The results show that the false positive rate is significantly increased by the presence of genetic polymorphisms for both differential expression analysis and SFP detection (14% and 5%, respectively). Further, we provide a procedure for masking out possible genetic polymorphisms prior to differential expression and SFP prediction, which is expected to reduce the false positives in gene expression studies.

In the final research chapter (6), we explore the potential of using a reference database of known genetic polymorphisms to evaluate existing microarray preprocessing pipelines. Previous research has shown that the choice of microarray preprocessing pipeline has a significant influence on experimental results. We evaluate different preprocessing pipelines by comparing their sensitivity, specificity, and FDR in predicting known SFPs. Detection of SFPs relies on an accurate estimate of gene expression, and therefore the pipeline that most often predicts known SFPs with a lower FDR implies a better expression estimate. We tested five major preprocessing pipelines (RMA, GCRMA, dChip, MAS5, and PLIER) and all possible combinations of their background correction, normalization, mismatch probe correction, and summarization routines for their ability to detect single feature polymorphisms in two Barley microarray datasets. The results produced the following important findings: GCRMA performs poorly when used in small experiments, MAS5 performed poorly in general, and the RMA preprocessing pipeline had the most consistent overall results for sensitivity, specificity and FDR across both datasets.
Chapter 4
AN IMPROVED ALGORITHM FOR THE DETECTION OF GENOMIC VARIATION USING SHORT OLIGONUCLEOTIDE EXPRESSION MICROARRAYS

4.1 Abstract

High-throughput experiments, such as microarrays, often generate far more biological information than the original experimental questions require. With many microarray analyses finished after differential expression, additional analyses are routinely not performed and untapped biological information is left undiscovered. This is especially true if the microarray experiment is from an ecological study containing multiple populations. Comparisons across populations may also contain genetic differences important to the original experimental questions; a subset of these differences may be identified using new techniques for detection of single feature polymorphisms (SFPs). SFPs are differences at the microarray probe level due to genetic polymorphisms, such as SNPs and INDELs, and not expression differences. In this study we provide a new algorithm for the detection of SFPs, evaluate the algorithm using existing data from two publicly available Affymetrix® Barley (Hordeum vulgare) microarray datasets, and compare it to two previously published SFP detection algorithms. Results show that our algorithm provides a more consistent, sensitive calling of SFPs with a lower false discovery rate. Analysis of SFPs alongside gene expression differences is a low cost method for enhanced analysis and additional biological inferences.
4.2 Introduction

Ecological population studies using transcriptome data are increasingly common and are leading to new biological insights [51, 8, 36]. A common experimental design is to compare gene expression differences across populations in order to gain insights into the genetic response to environmental conditions. These experiments, however, are typically limited to the analysis of differential gene expression and do not incorporate underlying genetic variation. Currently, high-throughput RNA sequencing methods (RNA-seq) can be used to assess genetic variation and gene expression simultaneously; however, this technique is still costly, which severely limits the sample size for the experiment.

Short-oligonucleotide microarrays have been used to predict candidate locations of genomic and transcriptional polymorphisms between populations, using both genomic DNA (gDNA) [14, 41, 38, 72] and messenger RNA (mRNA) [19, 56, 45, 55, 70]. The advantage of using mRNA over gDNA is that no additional experiments need be performed, as both expression and genetic variability can be assessed simultaneously. However, mRNA-based polymorphism detection algorithms are generally more error prone, since they must also consider gene expression. An algorithm to reliably predict candidate locations of genetic polymorphisms from microarray gene expression experiments would provide a high-throughput technique for studying both the genetic and transcriptional basis of phenotypic variation between populations from routine microarray gene expression experiments.

Short-oligonucleotide microarrays are commonly used to assess gene expression differences between populations, but have also been used to successfully predict candidate locations of single-feature polymorphisms (SFPs) between two populations. When a short-oligonucleotide probe is designed over a position with a genomic or transcriptional polymorphism, the hybridization efficiency of the sample is reduced. SFPs are statistical differences in the probe level hybridization efficiency between two populations caused by an underlying genetic or transcriptional polymorphism. They are detected by comparing microarray probe level intensity signals, a proxy value for hybridization efficiency, between two populations. When hybridizing gDNA, SFPs are induced by single-nucleotide polymorphisms (SNPs) and small insertions/deletions (INDELs) [14]. When hybridizing mRNA, SFPs can also be induced by splicing variation and polyadenylation differences [56]. Probe length also matters: the shorter a probe is, the greater the likelihood of non-specific binding. Conversely, as a probe becomes longer, the impact that small localized polymorphisms (i.e. SNPs) have on hybridization efficiency is reduced, making SFP detection increasingly difficult. The Affymetrix® GeneChip microarray platform offers a whole genome solution with short-oligonucleotide probes (25-mers) that has been used to detect both expression and SFPs [18, 38, 11, 15].

Affymetrix® GeneChip microarrays consist of small DNA fragments, termed probes, where each probe is an oligonucleotide sequence of length 25 bp. The mRNA molecule of interest is measured by multiple probe pairs, usually 11 to 20, assembled into a probe set. Each probe pair is composed of one perfect match (PM) probe and one mismatch (MM) probe; the PM probe matches the targeted mRNA sequence exactly, while the MM probe is generated by complementing the middle (13th bp) nucleotide of the PM probe. The MM probe is intended to provide an estimate of non-specific binding and background. Labeled gDNA or cDNA samples are hybridized to a microarray and scanned, and image analysis is performed to provide intensity values for each of the PM and MM probes. When an SFP exists between two populations, the relative hybridization efficiency of the two populations is expected to be different, and therefore the probe's relative intensity values are expected to be different (see Figure 4.1A). When hybridizing gDNA, prediction of SFPs becomes relatively straightforward: any significant probe intensity difference can be assumed to be due to hybridization efficiency differences and therefore an SFP. When predicting SFPs from mRNA, any prediction technique must also consider the expression of the mRNA molecule and be able to distinguish probe level differences due to underlying genetic variation from those due to expression differences (see Figure 4.1B). A probe's hybridization efficiency can be calculated by computing the difference between the probe's observed intensity (raw intensity value) and the expected intensity value (intensity due to expression) for each probe [56, 55]. The resulting measure has the signal due to expression removed and can be analyzed in a similar manner to gDNA hybridization data.

Here we provide two variations on a new algorithm for the detection of SFPs in standard expression microarray experiments. The two variations on our SFP prediction algorithm use different probe set summarization methods, RMA and MAS, to compute the expected intensity values. The algorithm is evaluated using data from two publicly available Affymetrix® Barley (Hordeum vulgare) microarray datasets and is compared to two previously published SFP detection algorithms [56, 55]. The two Barley datasets use the same two cultivars (Morex and Golden Promise), but differ in the number of microarrays and sample tissue types. A previously reported DNA sequence database of known polymorphisms between Morex and Golden Promise is used to determine the sensitivity, specificity and false discovery rate of each method. We assess the overlap of called SFPs across the four detection methods as well as between the two barley datasets. We further explore the sensitivity of each algorithm to predict known polymorphisms by their position within the probe. Finally, we compare the overlap between genes with called SFPs and their differential expression calls across each algorithm and both datasets. Results show that our algorithm using the RMA summarization method better estimates and removes signal from expression than the other methods. The result is a more consistent, sensitive calling of SFPs with a lower false discovery rate.

4.3 Materials and Methods

4.3.1 Experiment Datasets
The Affymetrix® Barley Genome Array contains 22,840 probe sets, each with 11 probe pairs (PM and MM probes), developed by an international collaboration of Barley researchers [16]. Raw data from two publicly available barley datasets were obtained from plexdb (http://www.plexdb.org, Experiment ID: BB3) and ArrayExpress (http://www.ebi.ac.uk/arrayexpress, Experiment ID: E-TABM-110). The first barley microarray dataset (BB3) was generated to provide a reference gene expression dataset across 15 tissues, 6 of which (stem, seminal root, vegetative shoot, seedling leaf, coleoptile and hypocotyl) contained samples from both Morex and Golden Promise (GP) cultivars [24]. The dataset consists of 3 biological replicates per tissue and cultivar for a total of 36 arrays, and is the same dataset used to detect SFPs in [56]. As determined in Rostoks et al., and verified here, one array of cultivar GP from the tissue type seminal root consistently clusters with the 3 replicates from the Morex cultivar (data not shown). This tissue type was removed from this experiment, leaving 5 tissue types and 30 microarrays. The second dataset (E-TABM-110) is also from seedling leaves and contains 3 replicates from the same two cultivars, Morex and GP, for a total of 6 microarrays. This dataset is used here to determine the sensitivity to call SFPs using a number of microarrays more typical of a small gene expression experiment, and to assess consistency of SFP calls across experiments.

Figure 4.1: Single feature polymorphism (SFP) detection using the RMA preprocessing pipeline with a low expression variability gene (A) and one with high expression variability (B) in the BB3 dataset. Each pane shows the log2 intensity values (y axis) of each array for each probe (x axis), with the Morex cultivar samples shown in red and Golden Promise samples in black. The left panes show unadjusted raw log2 intensity values with the average RMA computed expression values for each genotype (horizontal lines). The right panes show log2 intensity values after RMA expression adjustment; each probe can then be tested for a genotype effect. The bottom axis tick marks show the called SFPs (red for Morex, black for Golden Promise), while the top axis tick marks show probes with known SFPs (red/black) and known SFP-free (green) probes, according to the sequence confirmation dataset.

4.3.2 Sequence Confirmation Dataset and Algorithmic Performance Evaluation
The sequence confirmation dataset used in this study is the same as that of [56] and can be found at their website (http://naturalsystems.uchicago.edu/naturalvariation/barley/SNPtable.csv). The sequences were collected from 3 barley sequence sources totaling 2,699 sequences. Of those, 30 were duplicated in 2 of the 3 sources and 1 was duplicated in all 3. Sixty-six probes contained polymorphisms in both Morex and GP genotypes as compared to the reference sequence on the microarray. After removing duplicates and sequences with polymorphisms in both genotypes, the sequence confirmation dataset consisted of 2,601 sequences, of which 223 were polymorphic for GP and 178 for Morex; the remaining 2,200 sequences did not contain a sequence polymorphism for either cultivar.

Results of each SFP prediction algorithm were evaluated by direct comparison to the sequence confirmation dataset for calculations of sensitivity, specificity and false discovery rate. An algorithm's sensitivity is the proportion of known polymorphisms in the sequence confirmation dataset correctly called as an SFP by the algorithm. The specificity is the proportion of known negatives (no SNP) correctly identified as such. False discovery rate is the proportion of SFPs called by the algorithm that are incorrectly identified as a polymorphism, that is, where the sequence confirmation dataset indicates that no SNP is present.

4.3.3 Data Preprocessing
All analyses were conducted within the R statistical computing language using publicly available packages from CRAN and Bioconductor (R version 2.11.0, http://www.r-project.org; http://cran.fhcrc.org; http://www.bioconductor.org) [54, 26]. Raw CEL files were read into R using the Bioconductor package affy [25] and checked for quality using pseudo-chip images and residual error visualizations. Quality assurance of microarray data was further checked using the affyQAReport function from the Bioconductor package affyQCReport [52]. Hybridization and housekeeping controls, RNA degradation, sample clustering, NUSE plots, LPE plots, and RLE plots all showed high quality data (results not shown) and no additional microarrays were removed.

MicroArray Suite (MAS) PMA (Present/Marginal/Absent) calls were determined for each probe set within each microarray. In this study a marginal call was also considered to be absent. Arrays were then grouped by common attributes (i.e. strain and tissue). Final present/absent calls were determined for each probe set within each group; a probe set was called as present, for the group, if at least 5 of 6 samples within the group were called as present. A probe set was retained for further analysis if at least one group was called as present. This procedure is expected to remove probe sets that are unexpressed across the entire experiment, or expressed in only one cultivar.

A second probe set filter that is commonly performed is to filter out probe sets with a low interquartile range (IQR) across all arrays. This filter is intended to remove probe sets that are not changing and are therefore potentially biologically uninteresting. This filter is useful for differential expression but not for detection of SFPs, as a probe set can include an SFP but have low expression variability. Therefore, to keep the analyzed probe sets consistent between differential expression analysis and SFP calling, this filter was not applied in this study.

4.3.4 Analysis of Differential Expression
Differential expression analysis was conducted on each dataset in the following manner. First, each dataset was normalized using the Robust Multichip Averaging (RMA) preprocessing routine [12, 32, 31]. Differential expression was determined using the linear models for microarray data approach from the limma package [62] with empirical Bayes adjustment to the variance, followed by Benjamini and Hochberg (BH) correction for multiple testing [61, 9]. Differential expression was only determined for those probe sets which passed the PMA filter as described in preprocessing. A gene was considered to be differentially expressed if it had both a BH adjusted p-value less than 0.05 and a log fold change greater than 0.5. Using both p-value and fold change criteria to determine differential expression is recommended by the MicroArray Quality Control (MAQC) project [46, 59].

4.3.5 Single Feature Polymorphism Detection
Single Feature Polymorphism detection was conducted in the following manner. First, hybridization efficiencies for each probe were calculated using one of the four proposed models (described below). Each probe was then fitted for a genotype effect using the limma approach with empirical Bayes adjustment to the variance, followed by BH correction for multiple testing. SFP detection was only performed for those probes within probe sets that passed the PMA filter as described in preprocessing. A probe was considered to contain an SFP if it had a BH adjusted p-value of less than 0.05 and a log fold change greater than 0.5. The genotype with the reduced signal was determined to contain the SFP.
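The calling rule in the paragraph above can be sketched in Python as follows. This is an illustration, not the analysis code used in this study: the per-probe p-values are taken as given (in the study they come from limma's moderated tests), the function names are hypothetical, the fold change threshold is applied to the absolute value, and the log fold change is assumed to be computed as Morex minus GP.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, pvalues[k] * m / rank)
        adjusted[k] = running_min
    return adjusted


def call_sfps(pvalues, log_fc, alpha=0.05, min_lfc=0.5):
    """Return 'Morex', 'GP' or None for each probe.

    log_fc[k] is assumed to be mean(Morex) - mean(GP) of the probe's
    hybridization efficiency; the genotype with the reduced signal is
    reported as carrying the SFP.
    """
    calls = []
    for adj_p, lfc in zip(bh_adjust(pvalues), log_fc):
        if adj_p < alpha and abs(lfc) > min_lfc:
            calls.append("Morex" if lfc < 0 else "GP")
        else:
            calls.append(None)
    return calls


# Hypothetical p-values and log fold changes for four probes.
calls = call_sfps([0.001, 0.02, 0.03, 0.8], [-1.2, 0.9, 0.3, -2.0])
```

In this toy input the third probe passes the adjusted p-value threshold but not the fold change threshold, so it is not called.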
Model # 1: Linear Model (LM)
This model of probe level hybridization efficiencies is the same as the model used in [56]. In this model, the relative probe hybridization efficiencies are modeled as the residuals from fitting the following linear model to each probe set on the array:

\[ \log_2(I_{pgtr}) = \mu + \text{probe}_p + \text{genotype}_g + \text{tissue}_t + \text{tissue}_t \times \text{genotype}_g + \varepsilon_{pgtr} \tag{4.1} \]

where $I_{pgtr}$ is the background corrected and normalized intensity of probe $p$, genotype $g$, tissue type $t$, and replicate $r$ in a probe set. The residuals from the model are extracted and fitted for a genotype effect using the procedure described above.

Model # 2: Intensity to RMA Expression Ratio (RATIO)
The Intensity to RMA Expression Ratio model (RATIO) is similar to the method used in [55]. The difference is that we used values from the RMA summarization procedure to compute the expected expression values instead of the probe dependent nearest neighbor (PDNN) model [77]. This was done in order to make a more accurate comparison with the other models which use RMA, and because of the complicated nature of preparing the energy parameterization files for the barley microarray needed for PDNN. Further, Ronald et al. reported that the RMA summary method exhibited similar and only slightly weaker performance than the PDNN model. In this model the relative probe hybridization efficiencies are modeled as:

\[ \frac{I_{pa}}{\hat{I}_a} \tag{4.2} \]

where $I_{pa}$ is the background adjusted and normalized intensity value of probe $p$ in array $a$ and $\hat{I}_a$ is the expected expression value of array $a$. The expected value of the ratio is 1 for probes which do not contain an SFP and significantly less than 1 for probes which do contain an SFP. It should be noted that the RMA model (described below) is equivalent to a log2 transformation of $I/\hat{I}$.

Model # 3: RMA subtraction (RMA)
The RMA model is the first variant of our new SFP calling algorithm, where the RMA preprocessing procedure is used to estimate the expected expression value of the probe set. From the LM model above we can interpret the sum of the genotype, tissue and any interaction terms as an estimate of group level expression from the mean, with its own corresponding error (replicate deviations) that is nested inside the overall error term. This procedure adds unnecessary variance to the analysis of SFPs. We can remove this group specific error from the overall error and rewrite the linear model as:

\[ \log_2(I_{pa}) = \text{probe}_p + \log_2(\hat{I}_a) + \varepsilon_{pa} \tag{4.3} \]
\[ \mu + \left(\log_2(I_{pa}) - \log_2(\hat{I}_a)\right) = \mu + \text{probe}_p + \varepsilon_{pa} \tag{4.4} \]

where $\log_2(I_{pa})$ is the background adjusted and normalized log2 intensity value of array $a$ and probe $p$ in the probe set and $\log_2(\hat{I}_a)$ is the log2 expected expression value of array $a$ for the probe set. The $\text{probe}_p$ term in the model represents a scalar adjustment to each probe that accounts for the general hybridization differences between the probes and can be ignored in this context. Scaling the relative probe hybridization efficiencies to the mean across all probes and arrays has the additional effect, when partnered with the empirical Bayes adjustment to the variance from the limma package, of giving decreased weight to those probe sets with low overall expression. The probe hybridization efficiencies are now modeled as the log2 difference of the probe's intensity value from the expected expression value, adjusted by the mean. In the RMA model the RMA summarization procedure (median polish) is used to compute the expected intensity values for each probe set of each array [12].
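To make the RMA subtraction concrete, the following Python sketch applies Tukey's median polish to one probe set and then forms the left-hand side of Equation 4.4. It is a simplified illustration rather than the code used in this study: the input is assumed to be already background adjusted, normalized and log2 transformed, the function names are hypothetical, and the mean term is taken here to be the mean log2 intensity of the probe set unless the caller supplies another value.

```python
from statistics import median


def median_polish(x, n_iter=10):
    """Median polish of a probes-by-arrays matrix of log2 intensities.

    Returns (overall, probe_effects, array_effects, residuals).
    """
    resid = [row[:] for row in x]
    n_rows, n_cols = len(resid), len(resid[0])
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep out row (probe) medians.
        for i in range(n_rows):
            m = median(resid[i])
            resid[i] = [v - m for v in resid[i]]
            row_eff[i] += m
        m = median(col_eff)
        col_eff = [c - m for c in col_eff]
        overall += m
        # Sweep out column (array) medians.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            for i in range(n_rows):
                resid[i][j] -= m
            col_eff[j] += m
        m = median(row_eff)
        row_eff = [r - m for r in row_eff]
        overall += m
    return overall, row_eff, col_eff, resid


def hybridization_efficiency(log_intensity, mu=None):
    """mu + (log2 I_pa - log2 expected expression of array a), per Equation 4.4."""
    overall, _, col_eff, _ = median_polish(log_intensity)
    expected = [overall + c for c in col_eff]  # log2 expected expression per array
    if mu is None:
        # Assumption for this sketch: mean over all probes and arrays in the probe set.
        values = [v for row in log_intensity for v in row]
        mu = sum(values) / len(values)
    return [[mu + (v - expected[a]) for a, v in enumerate(row)]
            for row in log_intensity]


# Hypothetical probe set: 3 probes, arrays 0-2 from one genotype and 3-5 from
# the other. Expression differs between genotypes, and the third probe loses
# signal in arrays 3-5, mimicking an SFP.
x = [[7.0, 7.0, 7.0, 9.0, 9.0, 9.0],
     [8.0, 8.0, 8.0, 10.0, 10.0, 10.0],
     [9.0, 9.0, 9.0, 9.0, 9.0, 9.0]]
efficiency = hybridization_efficiency(x)
```

After the subtraction, the first two probes show no difference between the two groups of arrays, while the third probe retains its reduced signal in arrays 3 to 5, which is the quantity subsequently tested for a genotype effect.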
Model # 4: MAS subtraction (MAS)
The MAS model is the second variation of our new SFP calling algorithm, where the MAS5 microarray preprocessing procedure is used to estimate the expected expression value of a probe set instead of the RMA procedure. The probe level hybridization efficiencies are calculated in the same manner as the RMA model, with the exception that the Affymetrix® Microarray Suite (MAS5) summarization procedure is used to compute the expected intensity values for each probe set of each array [1]. MAS5 is the default preprocessing procedure used in Affymetrix's MicroArray Suite (MAS) software for their 3' IVT microarrays and, after the RMA procedure, is a commonly used preprocessing procedure for experiments which use Affymetrix® microarrays.

4.4 Results

4.4.1 Sensitivity, Specificity and False Discovery Rate
Microarray data from two publicly available Barley microarray datasets (E-TABM-113 and BB3) were preprocessed according to the procedures described in Materials and Methods. Each microarray experiment was then tested for both differentially expressed genes and single feature polymorphisms (SFPs), both with probe sets filtered by presence/absence calls (PMA calls) and without any filter applied. Filtering the BB3 dataset by PMA calls reduced the number of probe sets from 22,840 to 17,457 (251,240 to 192,027 probes), and filtering the E-TABM-113 dataset reduced the number of probe sets from 22,840 to 14,232 (251,240 to 156,552 probes). Table 4.1 lists the sensitivity, specificity and false discovery rate for each of the four models of hybridization efficiency when applied to filtered and unfiltered BB3 (4.1A) and E-TABM-113 (4.1B) datasets. Filtering the data by PMA calls to only those probe sets considered expressed resulted in an increase in the sensitivity to detect known SFPs (average increase of 8.5% in E-TABM-113; 2.0% in BB3) and only a slight decrease in the specificity (average decrease of 0.6% in E-TABM-113; 0.3% in BB3), while no consistent change in the expected false discovery rate was observed.
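The PMA filter behind these counts (described in Data Preprocessing) can be sketched in Python as below. This is an illustrative sketch only: the data layout, function names and probe set identifiers are hypothetical, and the threshold of 5 present calls per group is the value stated in the text.

```python
def group_present(calls, min_present=5):
    """True when a group of samples is called present for one probe set.

    calls is a list of 'P', 'M' or 'A'; a marginal call ('M') counts as absent.
    """
    return sum(1 for c in calls if c == "P") >= min_present


def retain_probe_sets(calls_by_probe_set, min_present=5):
    """Keep probe sets that are called present in at least one group.

    calls_by_probe_set maps a probe set id to {group name: list of PMA calls}.
    """
    return [probe_set
            for probe_set, groups in calls_by_probe_set.items()
            if any(group_present(c, min_present) for c in groups.values())]


# Hypothetical probe set identifiers and calls for two groups of six samples.
example = {
    "probe_set_1": {"group_a": list("PPPPPA"), "group_b": list("AAAAAA")},
    "probe_set_2": {"group_a": list("PPPPMM"), "group_b": list("PPPAAA")},
}
kept = retain_probe_sets(example)
```

Here the first probe set is kept because one group has 5 present calls, while the second is dropped because marginal calls do not count toward the threshold.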
Analysis and comparisons were also performed on all probes (unfiltered dataset), because the comparison studies did not perform any probe set filtering [56, 55]. If a gene is unexpressed, any probe containing an SFP would not be detectable, nor would the gene be differentially expressed. It therefore makes little sense to include these probes in any detection analysis. The trends observed across the four models of hybridization efficiency were the same for both filtered and unfiltered data; therefore from this point on we discuss results based only on the filtered dataset.
Table 4.1: Sensitivity, specificity and false discovery rate (FDR) of each of the four SFP calling algorithms in the BB3 (A) and E-TABM-113 (B) datasets as compared to the barley sequence confirmation dataset. Category best values are marked with an asterisk. In 8 of 12 possible categories the RMA procedure outperforms the other 3 algorithms, with the RATIO method performing the best in the remaining 4 categories. The RMA procedure outperformed the MAS and LM procedures in all categories.

A. BB3
              RMA                  MAS                  RATIO                LM
              Filter   No Filter   Filter   No Filter   Filter   No Filter   Filter   No Filter
Sensitivity   72.5%    70.6%       62%      59.9%       76.7%*   74.1%*      68.8%    67.3%
Specificity   97.1%*   97.3%*      96.8%    97.0%       95.0%    95.5%       94.0%    94.4%
FDR           21.1%*   20.5%*      28.4%    27.9%       29.9%    29.1%       36.0%    35.6%

B. E-TABM-113
              RMA                  MAS                  RATIO                LM
              Filter   No Filter   Filter   No Filter   Filter   No Filter   Filter   No Filter
Sensitivity   56.9%*   50.6%*      53.3%    43.9%       50.4%    38.2%       51.6%    44.6%
Specificity   98.7%*   98.9%       95.2%    96.5%       98.6%    99.2%*      98%      98.2%
FDR           11.5%*   11.7%       37.3%    35.5%       14%      11.6%*      19.5%    20.4%
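The three measures reported in Table 4.1 follow the definitions given in the Sequence Confirmation Dataset section. A minimal Python sketch of those definitions is shown below; the function name and the boolean input layout are assumptions for illustration, and only probes present in the sequence confirmation dataset are assumed to be passed in.

```python
def evaluate_calls(called, known):
    """Sensitivity, specificity and false discovery rate of SFP calls.

    called[k] is True when probe k is called as an SFP by the algorithm;
    known[k] is True when the confirmation data show a polymorphism in probe k.
    Assumes at least one known positive, one known negative and one call.
    """
    tp = sum(1 for c, k in zip(called, known) if c and k)
    fp = sum(1 for c, k in zip(called, known) if c and not k)
    fn = sum(1 for c, k in zip(called, known) if not c and k)
    tn = sum(1 for c, k in zip(called, known) if not c and not k)
    return {
        "sensitivity": tp / (tp + fn),   # known polymorphisms that were called
        "specificity": tn / (tn + fp),   # known negatives that were not called
        "fdr": fp / (tp + fp),           # calls with no known polymorphism
    }


# Hypothetical example: 4 known polymorphic probes and 6 known negatives.
known = [True] * 4 + [False] * 6
called = [True, True, True, False] + [True] + [False] * 5
metrics = evaluate_calls(called, known)
```

In this toy example three of the four known polymorphisms are called and one of the four calls falls on a known negative.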
Within the BB3 dataset, the sensitivity to detect known SFPs varied widely, from 62% using MAS to 76.7% using the RATIO model. The sensitivity reflects the ability of a model to detect known SFPs according to the barley sequence confirmation dataset. Specificity was similar for each of the four models, ranging from 94.0% using the LM model to 97.1% for RMA; specificity reflects the ability of a model to call a probe a non-SFP when it is known that no SFP exists. A more dramatic difference was observed in the false discovery rate, ranging from 21.1% for the RMA model to 36% for the LM model. Within the E-TABM-113 dataset, the sensitivity to detect known SFPs also varied, from 50.4% using the RATIO model to 56.9% using the RMA model. Specificity was again similar for each of the four models, ranging from 95.2% using the MAS model to 98.7% for RMA. Finally, the false discovery rate ranged from 11.5% using the RMA algorithm to 37.3% for the MAS algorithm.

Overall, RMA outperformed the MAS and LM models in every category and outperformed the RATIO algorithm in five of the six evaluated criteria for filtered data. The RATIO method outperformed the LM model in all cases except sensitivity in the E-TABM-113 dataset, where RATIO performed the worst when compared to all other models. The MAS procedure performed comparably to the LM procedure for sensitivity and specificity. For false discovery rate, it performed significantly worse than all other models in the smaller E-TABM-113 dataset, but better than both RATIO and LM in the larger BB3 dataset. In general, the RMA method performed consistently well across both datasets.

Figure 4.2 shows the sensitivity of each model to call known SFPs by the SNP position within the probe. All models show an increased sensitivity to detecting known SFPs as the SNP's position moves towards the center of the probe or if multiple SNPs occur within the same probe. A sharp increase in sensitivity was observed when the SNP did not occur in the outside 3 bases of the probe (i.e. it occurred between bases 4 and 22 inclusive).

4.4.2 Comparison and Overlap of called SFPs
Table 4.2 shows the frequency in the number of called SFPs per gene, across all four models for the BB3 (4.2A) and E-TABM-113 (4.2B) datasets. Most genes contained a single called SFP and the number of SFPs per gene decreased steeply thereafter. Genes which contain many called SFPs are more likely to contain false positives and are a
[Figure 4.2: two panels, BB3 and E-TABM-113. x-axis: SFP Position (1 to 13, M); y-axis: Percent Sensitivity (0 to 100); series: RMA, MAS, RAT, LM.]
Figure 4.2: Position-dependent sensitivity of each algorithm for the detection of SFPs. The y-axis is the percent sensitivity of detection of known SNPs according to the SNP position in the probe, shown on the x-axis. Position 1 is the furthest position from the middle base (bp positions 1 and 25). The 13th position is the middle base in the probe (bp position 13) and M is when multiple SNPs are known to exist within the probe.
warning sign that an accurate estimate of gene expression could not be obtained. As the number of true SFP-containing probes in a probe set increases, the ability to accurately estimate gene expression decreases, because the probes reflecting expression only are outnumbered by the probes reflecting both expression and genetic differences. Both the MAS and LM models contained a significant number of genes with greater than six called SFPs relative to the RMA and RATIO models. Further, the number of SFPs is evenly split between the Morex and GP genotypes in the RMA, RATIO, and LM models across both datasets, a trend also observed in differential expression. However, the split was significantly skewed toward GP in the MAS method in the smaller E-TABM-113 dataset and towards Morex in the larger BB3 dataset. The RATIO model also had approximately 50% more called SFPs than the RMA model (12,766 to 8,873) in the BB3 dataset, and the MAS model had approximately 100% more called SFPs than all other models in the E-TABM-113 dataset.

Table 4.2: Frequency of SFPs in the BB3 (A) and E-TABM-113 (B) datasets (number of genes containing 1 SFP, 2 SFPs, etc.), total SFPs found for the Morex (MX) and Golden Promise (GP) genotypes, total SFPs found across both genotypes and total number of genes containing an SFP.

A. BB3
Model   1     2     3    4    5    6    7    8    9    10   11   MX SFP  GP SFP  Total SFPs  Genes
RMA     1921  738   330  172  131  103  98   91   56   53   7    4620    4253    8873        3700
MAS     1556  520   201  148  133  121  123  106  106  73   41   5419    3607    9026        3128
RATIO   2804  1097  513  273  241  191  128  114  68   33   3    6565    6201    12766       5466
LM      1826  570   250  190  120  127  133  123  99   116  68   5273    5279    10552       3622

B. E-TABM-113
Model   1     2     3    4    5    6    7    8    9    10   11   MX SFP  GP SFP  Total SFPs  Genes
RMA     1726  597   254  141  84   44   19   11   3    0    0    2609    2569    5178        2879
MAS     1976  648   301  174  160  143  166  130  112  92   50   4146    7063    11209       3952
RATIO   1949  636   233  85   40   18   3    1    1    0    0    2370    2249    4619        2969
LM      1516  447   200  116  100  73   48   41   26   10   2    2794    2638    5432        2579
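The layout of Table 4.2 can be illustrated with a minimal Python sketch that tabulates called SFPs per gene. The analyses in this work were carried out in R; the function name, the tuple layout of a call and the toy gene identifiers below are assumptions made for illustration only and are not taken from the original analysis.

```python
from collections import Counter


def sfp_frequency(calls):
    """Summarise SFP calls in the layout of Table 4.2.

    `calls` holds one (gene_id, probe_index, genotype) tuple per called SFP,
    where genotype is "MX" or "GP". This input layout is hypothetical.
    """
    per_gene = Counter(gene for gene, _probe, _genotype in calls)
    # Number of genes that carry exactly n called SFPs.
    freq = Counter(per_gene.values())
    by_genotype = Counter(genotype for _gene, _probe, genotype in calls)
    return {
        "freq": dict(sorted(freq.items())),
        "MX": by_genotype.get("MX", 0),
        "GP": by_genotype.get("GP", 0),
        "total_sfps": len(calls),
        "genes": len(per_gene),
    }


# Toy calls: two genes with one SFP, one with two, one with three.
calls = [
    ("geneA", 3, "MX"),
    ("geneB", 1, "GP"), ("geneB", 7, "GP"),
    ("geneC", 2, "MX"), ("geneC", 5, "GP"), ("geneC", 9, "MX"),
    ("geneD", 4, "GP"),
]
summary = sfp_frequency(calls)
print(summary)
```

In this layout the frequency columns sum to the number of genes, and the MX and GP counts sum to the total number of called SFPs.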
An important evaluation is the proportion of SFP calls shared across the four methods and, conversely, the proportion of unique SFPs (an SFP called in that model only). Figure 4.3 shows a Venn diagram of the overlap of all called SFPs for the two datasets across all four models. A large core of called SFPs exists across both datasets and across all models, with relatively few unique SFPs. The exceptions for unique SFPs are the MAS and RATIO methods. The MAS method called a large number of unique SFPs in both datasets: 60% and 16% of all called SFPs were unique to the MAS method in the E-TABM-113 and BB3 datasets, respectively. The RATIO method called a large number (22%) of unique SFPs in the E-TABM-113 dataset. By comparison, the RMA model had only 4% and 1% unique SFPs in the E-TABM-113 and BB3 experiments, respectively. The RMA model has the greatest overall overlap, while the LM method was intermediate between the RMA method and the RATIO and MAS methods with regard to overlap. In the larger BB3 dataset there existed a significant core of SFPs representing a large portion of all called SFPs (64% RMA, 63% MAS, 44% RATIO, 54% LM). When the MAS method is not considered, the remaining three methods showed an even larger core of called SFPs (89% RMA, 62% RATIO, 75% LM). In the smaller E-TABM-113 dataset, the overlap between the four methods was smaller (50% RMA, 23% MAS, 56% RATIO, 48% LM). Again, when the MAS method is not considered, the overlap between the three remaining methods increases substantially (70% RMA, 78% RATIO, 66% LM). Interestingly, the RATIO method has considerable overlap with the RMA and LM methods in the smaller E-TABM-113 dataset, but has a lower overlap and a large number of unique SFP calls in the larger BB3 dataset. Overall, the RMA method produced results that overlapped the most and had the fewest unique SFP calls compared to the other three models across both datasets.
4.4.3 SFP call agreement between BB3 and E-TABM-113
Both BB3 and E-TABM-113 datasets use the same two cultivars (Morex and Golden Promise) and should therefore have similar called SFPs within the common set of tested genes. A total of 154,110 probes (14,010 genes) were tested for SFP across both datasets.
[Figure 4.3: four-set Venn diagrams of called SFPs (RMA, MAS, RAT, LM) for E-TABM-113 (A) and BB3 (B).]
Figure 4.3: Venn diagram of the overlap of SFP calls between the four algorithms and across the two datasets E-TABM-113 (A) and BB3 (B). The value inside a cell represents the number of called SFPs in common between the algorithms represented by the overlap between the ovals. The value outside all ovals is the number of probes without a called SFP in any of the four algorithms.
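The core and unique percentages summarised from the Venn diagram can be derived with plain set operations. The sketch below is a minimal Python illustration, assuming each method's calls are available as a set of probe identifiers; the function name and the toy probe identifiers are hypothetical and do not reproduce the counts in Figure 4.3.

```python
def overlap_summary(calls_by_method):
    """Return the shared core and per-method overlap percentages.

    `calls_by_method` maps a method name to the set of probe ids that the
    method called as SFPs (a hypothetical input layout).
    """
    core = set.intersection(*calls_by_method.values())
    summary = {}
    for method, calls in calls_by_method.items():
        # Probes called by any of the other methods.
        others = set().union(
            *(c for m, c in calls_by_method.items() if m != method)
        )
        unique = calls - others
        n = len(calls)
        summary[method] = {
            "called": n,
            "pct_core": 100.0 * len(core) / n if n else 0.0,
            "pct_unique": 100.0 * len(unique) / n if n else 0.0,
        }
    return core, summary


# Toy probe ids, not taken from either dataset.
calls_by_method = {
    "RMA": {1, 2, 3, 4},
    "MAS": {1, 2, 5, 6, 7, 8, 9, 10},
    "RATIO": {1, 2, 3, 11},
    "LM": {1, 2, 3, 4, 12},
}
core, summary = overlap_summary(calls_by_method)
for method, stats in summary.items():
    print(method, stats)
```

Dropping one method from the dictionary before calling the function gives the three-way comparison, which is how an overlap "without MAS" could be obtained under the same assumptions.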
Of these, the percentage of probes with a called SFP in at least one of the two datasets was 8.9% for MAS, 7.3% for RATIO, 6.5% for LM, and 5.6% for the RMA procedure. The agreement between the two datasets also varied greatly. Considering only probes that contained a called SFP in at least one of the two datasets, the agreement for the four models was 28.9% for RATIO, 31.9% for MAS, 27.5% for LM, and 39.5% for the RMA method. The majority of disagreements were a called SFP in one dataset and no call in the other; disagreements in the called genotype (i.e., a Morex SFP in one dataset and a GP SFP in the other) were relatively few across all four methods (15 RMA, 22 LM, 30 RATIO, and 89 MAS probes). The relatively low concordance for all four methods can be explained by the differences between the two datasets: the number of samples (BB3 has five times more microarrays) and the number of tissues (BB3 contains five tissues, whereas E-TABM-113 contains only one). One would therefore expect the BB3 dataset to have increased power and subsequently a greater ability to detect SFPs. However, the BB3 dataset also has five different tissues and will therefore have five different expression profiles. If, for instance, a gene containing an SFP was not expressed in one (or more than one, but not all) of the tissues, it would be difficult for any SFP calling algorithm to account for both the differences in hybridization efficiency in expressed tissue due to the SFP and the lack of a signal in the unexpressed tissues.
4.4.4 Comparison of SFP Calls to Differential Expression
In addition to SFP calls, we also performed differential expression analysis for both datasets. Differential expression analysis resulted in 549 genes up-regulated and 680 down-regulated in Morex relative to Golden Promise in the BB3 experiment (1,229 total differentially expressed genes). For the BB3 dataset, tissue was included as a blocking factor in the linear model. The smaller E-TABM-113 experiment resulted in 760 genes up-regulated and 1,043 genes down-regulated in Morex relative to Golden Promise (1,803 total differentially expressed genes).
Of the 1,229 genes differentially expressed in the BB3 dataset, 74.3% (913 genes) contained at least one SFP when computing SFP calls using the RMA model. Conversely, of the 3,700 genes containing at least one SFP, 24.7% were also differentially expressed. Within the E-TABM-113 dataset this effect was less pronounced: of the 1,803 total genes differentially expressed, 53.6% (967 genes) contained at least one SFP when computing SFP calls using the RMA model, and of the 2,879 genes containing at least one SFP, 33.5% were also differentially expressed. In general, for the RMA model, as the number of probes within a gene called as an SFP increased, the likelihood that the gene was also labeled as differentially expressed increased. Similar patterns were seen in both the MAS and LM models, where the percentage of differentially expressed genes increased as the number of called SFPs within the gene increased (see Figures 4.4 and 4.5). In the RATIO model, however, differential expression and SFP calls did not appear to be associated with each other and no relationship was observed.
4.5 Discussion
We have described a new algorithm, with two variants using different preprocessing methodologies (RMA and MAS5), for the prediction of SFPs from standard Affymetrix® microarray gene expression experiments. We compared our two variants to two previously published methods for the prediction of SFPs between two barley cultivars (Morex and Golden Promise) across two existing barley microarray gene expression datasets. The two datasets differ in that one (BB3) contains more samples (30 microarrays) across five tissue types, while the other (E-TABM-113) contains only one tissue type and a smaller number of microarrays (6 microarrays). The E-TABM-113 dataset, however, represents a common experimental design: a direct comparison of two genotypes across a single factor using a small number of microarrays. Any SFP detection technique should also be evaluated on this type of experimental design. Our results show that the RMA model produced the best overall results across both datasets for sensitivity, specificity and false discovery rate against a database of known SNP differences between the
[Figure 4.4: four panels (A to D) for the RMA, LM, MAS5 and RAT models. x-axis: Number of SFPs per Gene (0 to 11); y-axis: Percentage of Genes Differentially Expressed (0.0 to 1.0); series: MX, GP, None.]
Figure 4.4: Comparison of differential expression and SFP calling in the BB3 dataset. The y-axis gives the percentage of differentially expressed genes as the number of called SFPs within the probe set increases (x-axis).
[Figure 4.5: four panels (A to D) for the RMA, LM, MAS5 and RAT models. x-axis: Number of SFPs per Gene (0 to 11); y-axis: Percentage of Genes Differentially Expressed (0.0 to 1.0); series: MX, GP, None.]
Figure 4.5: Comparison of differential expression and SFP calling in the E-TABM-113 dataset. The y-axis gives the percentage of differentially expressed genes as the number of called SFPs within the probe set increases (x-axis).
two genotypes. Further, the RMA model for calling SFPs was the most conservative and consistent across all evaluated statistics and both datasets. In addition, relatively few genes containing at least one SFP (24.7% BB3, 33.5% E-TABM-113) using the RMA method were also called as differentially expressed. This would imply that most SFPs are not cis-acting and are not associated with the gene's expression. However, when a gene is differentially expressed, a majority (74.3% BB3, 53.6% E-TABM-113) were also found to contain at least one called SFP. These SFPs are then candidates for cis-acting regulators that impact gene expression and may be important ecological markers. Comparing the results of the LM and RATIO models to their respective original papers, we were able to maintain the sensitivity (67.3% in this study vs 67% in the original studies) while decreasing the false positive rate from 40% to 35.5% for the LM model. The slight sensitivity gain and lower false discovery rate are likely due to the use of the limma procedure for evaluating significance in differential hybridization rather than the significance analysis of microarrays (SAM) procedure [65] used in the original paper. The limma procedure further allows a more standard cutoff value (BH ≤ 0.05) for significance to be used, rather than the SAM empirical p-value cutoff of ≤ 0.001 used in the original paper, with roughly the same number of called SFPs (10,552 limma, 10,504 SAM). The RATIO method's original paper did not use the same dataset, so a direct comparison of the results is not possible. As the results and the model algorithms indicate, the ability of an algorithm to successfully predict SFPs is largely dependent on two factors: where the SFP occurs within the probe and the accuracy of the expression estimate. Results show that if the polymorphism occurs in the outside three bases (positions 1-3 and 23-25) of the probe, the likelihood of detection is reduced approximately threefold.
Expression estimates depend on the assumption that every probe within a probe set has a single target, and that this target is shared by the other probes in the same probe set (i.e., cross-hybridization and non-specific hybridization are rare). Therefore, before SFP
prediction, care should be taken to update the probe-to-transcript mapping, ensuring that each probe belongs within the probe set to which it is assigned and that it has a unique target. In addition, results show that as the number of probes containing a polymorphism increases within a probe set, the likelihood grows that the corresponding estimate of gene expression will not represent the true level of gene expression. Poor estimation of gene expression will lead to an increase in the false positive rate of both differential expression analysis and SFP prediction. Within the BB3 dataset, as the number of SFPs within a gene increased, the relative frequency of the majority genotype to the minority genotype moved towards one (i.e., more even); this likely indicates that an entire genotype's expression profile has been shifted (for example, see Figure 4.6). However, these genes are candidates for splice variation, multi-probe INDELs and/or polyadenylation differences; these probes can then be mapped to exons for possible discovery of these types of polymorphisms. Finally, SFP prediction should be limited to only those probe
Figure 4.6: Single Feature Polymorphism detection using the RMA method when multiple probes are called as containing a SFP in the BB3 dataset. Here the expression profile of the Morex genotype is artificially decreased due to the presence of 5 likely SFP (probes 4 to 8). This reduction in expression leads to an over adjustment of expression and likely false SFP calls in probes 1,2 and 9 to 11.
sets where both genotypes are expressed (called present); many of the observed errors can be attributed to a likely unexpressed transcript in one genotype (data not shown). An intriguing possible use of SFP prediction in experiments where a confirmation sequence dataset exists is the evaluation of microarray preprocessing procedures. Calling SFPs requires an accurate determination of expression; intuitively, the preprocessing procedure that most accurately predicts SFPs, relative to the confirmation dataset, can be assumed to estimate expression the best as well. In addition, we may also use known SFPs to evaluate the overall effect that SFPs have on expression estimation. For example, we could estimate how many SFPs within a probe set will induce a false expression estimate. Probes containing candidate SFPs could then be removed from the probe set and expression recalculated; this process could be performed iteratively until the expression value stabilizes or until too few probes remain to estimate expression reliably. Expression values calculated using only probes without an SFP will provide a more accurate estimation of expression and lower false positive rates for both differential expression and SFP calls. Such a study would not only give insights into the effects of SFPs on gene expression, but also into how to correct for genetic variation. We provide an algorithm (RMA) for the prediction of SFPs from standard expression microarray datasets. The algorithm is simple and quick to implement and produces results that are superior to those of the other comparison algorithms. The algorithm can be applied to small datasets and still be expected to perform well with a low FDR. Results indicate that the RMA algorithm is an effective technique for studying both gene expression differences and genetic polymorphisms in ecological microarray studies with two populations.
Acknowledgments
The authors would like to acknowledge Ms. Maia Benner for her comments and edits in preparation of this manuscript.
Chapter 5
EFFECT OF SINGLE FEATURE POLYMORPHISMS ON COMPUTED EXPRESSION LEVELS
5.1 Abstract
Microarrays are often used in ecological and conservation experiments to study the differences in gene expression between populations. They have also been used to detect genetic variation between these populations, a task known as single feature polymorphism (SFP) detection. Yet the degree to which underlying genetic variation influences differential gene expression or the detection of SFPs is still largely unknown. In this chapter we quantify the false positive and false negative rates of differentially expressed genes in the presence of SFPs compared to when SFPs have been masked out. Results show the false positive rates to be significantly increased by SFPs for both differential expression analysis and SFP detection (14% and 5%, respectively). The false negative rates, however, were only marginally impacted by the presence of SFPs (0.06% and 0.1%, respectively). We also provide a procedure for masking out candidate SFPs prior to differential expression analysis and final SFP prediction. Estimating gene expression with SFPs masked out will produce more accurate gene expression estimates and reduce the false positive rates in both differential expression analysis and SFP prediction. The procedure described does not rely on having a reference database of known genetic polymorphisms and can be applied to any microarray dataset where comparisons are being made between populations.
5.2 Introduction
Microarrays remain an important tool in ecology for the study of gene expression differences between populations, but can also be used to detect genetic polymorphisms. Microarrays rely on sequence similarity of target mRNA molecules to oligonucleotide probes for efficient hybridization. The greater the sequence divergence, the less efficient the hybridization and the greater the effect on the microarray probe's signal. Genetic variation can then be detected by a statistical comparison of microarray probe-level signal differences between samples and populations (see Chapter 4). However, the impact that genetic polymorphisms have on the accuracy of gene expression estimates, and therefore on differential gene expression analysis, is still largely unknown. It is reasonable to expect that microarray studies comparing natural populations, or even different species, will have polymorphic sites within microarray probes. These polymorphic sites, however, may result in inaccurate estimates of gene expression and false positive expression differences, and may also interfere with their own detection when the SFP detection algorithm uses gene expression in its calculations. In this chapter we show the conditions under which genetic variation impacts experimental outcomes. We also detail a procedure for identifying and masking out probes containing candidate polymorphic sites. We show that the procedure leads to a lower false positive rate for differential expression and for the prediction of genetic polymorphisms. Further, the procedure described does not rely on having a reference database of known genetic polymorphisms and can be applied to any microarray dataset where comparisons are being made between populations. In short-oligonucleotide microarrays, like the Affymetrix platform, gene expression estimates are computed by summarizing probe signal intensity values from multiple regions within the same gene.
When a genetic polymorphism exists within one of these regions, not only can the probe signal intensity value be affected, but also the resulting gene expression estimate (Figure 5.1). If the polymorphism is conserved in only one population, this can lead to a false determination of differential expression. The types of genetic polymorphisms that can affect probe hybridization are SNPs, insertions/deletions
(INDELs), and mRNA transcriptional differences such as splice variation and polyadenylation differences. When applied to microarrays these are all expressed as single feature polymorphisms (SFPs). An SFP, where a feature is a short oligonucleotide probe on a microarray, is a statistical difference in the probe-level hybridization efficiency between two groups. Previous work (see Chapter 4) provides an algorithm that is both consistent and sensitive, with a low false discovery rate when compared to a validation dataset of known SNPs. We showed that the location of the SNP in the probe was a determinant of whether the hybridization efficiency would be impacted and the resulting probe signal affected. SNPs that occur in the inner portion of a 25-mer probe (bases 4 to 22) have a greater chance of impacting the resulting probe signal. We have also shown that the number of probes containing an SFP within a probe set is associated with differential expression. The more SFPs in a probe set, the more likely the gene was also called as differentially expressed. This is a strong indication that these genes may be false positives and that SFP-containing probes should be masked out prior to expression calculation and differential expression analysis. Previous studies have reported that SNPs affect differential expression. Walter et al. report that known polymorphisms in the mouse Affymetrix® 3' IVT microarrays may be responsible for as much as 36% false positive and 13% false negative differentially expressed genes when preprocessing with MAS5, and 22% false positives and 12% false negatives when using the RMA preprocessing pipeline [67]. In addition, Benovoy et al. show that known polymorphic sites affect the determination of both exon expression (probe set) and gene expression (meta-probe set) in Affymetrix Human Exon arrays [10]. Also by masking, they show the false positive rates to be 86.6% and 8.1% at the probe set and meta-probe set levels, respectively, and the false negative rates to be negligible at 0.3% and 0.05%. Both of these studies, however, relied on the existence of a comprehensive sequence polymorphism database to mask out, a priori, probes believed to contain polymorphisms between the samples used. Here we present a detailed analysis of the effect of SFPs on differential expression
[Figure 5.1: panels A and B.]
Figure 5.1: Single Feature Polymorphism detection using the RMA method with a low expression variability gene (A) and one with high expression variability (B) in the barley BB3 dataset. Each pane contains the log2 intensity values (y axis) of each array and each probe (x axis), with the Morex cultivar shown in red and Golden Promise in black. The left pane shows raw intensity values and average probes set expression values for each genotype (horizontal lines). The right pane shows log2 intensity values after RMA expression adjustment, each probe is then tested for a genotype difference. The bottom axis tick marks show the called SFP (red for Morex, black for Golden Promise), while the top axis tick marks show probes with known SFPs (red/black) and known SFP free (green) probes, according to the sequence confirmation dataset.
analysis in an Affymetrix® 3' IVT Barley microarray study. To improve the reliability of differential expression analysis and SFP calls, we propose a probe masking strategy based on a preliminary determination of candidate SFPs from the dataset. Finally, we report how the masking procedure improves overall results by reducing the number of false positives for SFP detection and differential gene expression analysis.
5.3 Methods
5.3.1 Experiment Datasets
The Affymetrix® Barley Genome Array contains 22,840 probe sets, each with 11 probe pairs (PM and MM probes), and was developed by an international collaboration of barley researchers [16]. Raw data from a publicly available barley dataset were obtained from plexdb (http://www.plexdb.org, Experiment ID: BB3). We used a subset of 30 microarrays from the original experiment that contained good quality data from 5 tissues (stem, vegetative shoot, seedling leaf, coleoptile and hypocotyl), each with 3 replicates from two barley cultivars, Morex and Golden Promise (GP) [24]. The sequence confirmation dataset used in this study is the same as that of [56] and can be found at their website (http://naturalsystems.uchicago.edu/naturalvariation/barley/SNPtable.csv).
5.3.2 Data Preprocessing
All analyses were conducted within the R statistical computing language using publicly available packages from CRAN and Bioconductor (R version 2.11.0, http://www.r-project.org; http://cran.fhcrc.org; http://www.bioconductor.org) [54, 26]. Raw Affymetrix® CEL files were read into R using the Bioconductor package affy [25] and checked for quality using pseudo-chip images and residual error visualizations. The quality of the microarray data was further checked using the affyQAReport function from the Bioconductor package affyQCReport [52]. Hybridization and housekeeping controls, RNA degradation, sample clustering, NUSE plots, LPE plots, and RLE plots all showed high quality data (results not shown) and no additional microarrays
were removed. MicroArray Suite (MAS) PMA (Present/Marginal/Absent) calls were determined for each probe set within each array. In this study, a marginal call was also considered to be absent. A probe set was retained for further analysis only if all samples were called as present. This procedure is expected to keep only probe sets that are unambiguously "expressed" across the entire experiment. This is important because if a gene is unexpressed in a sample, any probe containing an SFP would not be detectable, nor could the gene be differentially expressed. Filtering the BB3 dataset in this manner reduced the number of probe sets from 22,840 to 9,222 (251,240 to 101,499 probes). A second common filter removes probe sets with a low interquartile range (IQR) across all arrays; this filter is intended to remove probe sets that are "not changing" and therefore potentially biologically uninteresting. This filter is useful for differential expression analysis, but not for detection of SFPs, as a probe set can include an SFP but have low expression variability. Therefore, to maintain consistency of the analyzed probe sets between differential expression analysis and SFP calling, this filter was not applied in this study.
5.3.3 Analysis of Differential Expression
Differential expression was determined only for those probe sets that passed the PMA filter described in Data Preprocessing. The dataset was first preprocessed using the RMA procedure [32, 31, 12]. Differentially expressed genes were then identified using linear models for microarray data from the limma package [62] with empirical Bayes adjustment to the variance [61]. We used a linear model that included both genotype and tissue as main effects. The resulting p-values were then adjusted for multiple testing using the Benjamini and Hochberg (BH) false discovery rate correction procedure [9]. A gene was considered to be differentially expressed if it had an adjusted p-value less than 0.05 and a log fold change (logFC) greater than 0.5. Using both p-value and fold change criteria to determine differential expression is recommended by the MicroArray Quality Control (MAQC) project [46, 59].
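The two-part significance rule can be illustrated with a minimal Python sketch of the BH adjustment followed by the p-value and logFC thresholds. It is not the limma implementation: the moderated statistics from the empirical Bayes step are not reproduced, the p-values and fold changes are made-up inputs, and taking the absolute value of logFC is an assumption about how the fold change criterion is applied.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_significant(pvalues, logfc, alpha=0.05, min_logfc=0.5):
    """Flag tests with BH-adjusted p < alpha and |logFC| > min_logfc.

    Using |logFC| is an assumption made for this sketch.
    """
    adjusted = bh_adjust(pvalues)
    return [a < alpha and abs(fc) > min_logfc
            for a, fc in zip(adjusted, logfc)]


# Made-up raw p-values and log fold changes for five genes.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.60]
logfc = [1.2, -0.3, 0.9, -0.8, 2.0]
adjusted = bh_adjust(pvalues)
calls = call_significant(pvalues, logfc)
print(adjusted)
print(calls)
```

In this toy input the second gene passes the adjusted p-value threshold but fails the fold change criterion, and the third and fourth genes pass the raw 0.05 threshold but not the adjusted one, so only the first gene is called.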
5.3.4 Detection of Single Feature Polymorphisms (SFPs)
The model of hybridization efficiency used in this study is based on principles from the RMA preprocessing algorithm [32, 31, 12], and was described in detail in Chapter 4. Briefly, the relative hybridization efficiency of a probe was calculated as the difference between the log2 raw probe intensity value and the calculated log2 probe set expression value for the array, adjusted by the mean expression value across all arrays. The resulting values are log2 probe signal values adjusted for the mRNA expression differences between samples.

\[ H_{pa} = \left( \log_2(I_{pa}) - \log_2(\hat{I}_{a}) \right) + \mu \tag{5.1} \]

where $H_{pa}$ is the calculated hybridization efficiency of probe $p$ on array $a$, and $\log_2(I_{pa})$ is the log2 raw intensity value of probe $p$ on array $a$ in the probe set, after RMA background correction and normalization. $\log_2(\hat{I}_{a})$ is the expected intensity value (expression) of array $a$ for the entire probe set. The difference is then adjusted by the mean log expression value across all arrays, $\mu$, in order to scale each probe value by the overall intensity values. This has the additional effect, when partnered with the empirical Bayes adjustment to the variance in the limma package, of giving decreased weight to those probe sets with low overall expression. The RMA summarization procedure (median polish) is used to compute the expected intensity values for each probe set of each array [12]. SFP detection is performed only for those probes within probe sets that passed the PMA filter as described above, and is conducted in the following manner. First, hybridization efficiencies for each probe are calculated using the procedure described above. Each probe is then fitted for a genotype effect using the limma approach with empirical Bayes adjustment to the variance, followed by BH correction for multiple testing [61, 9]. A probe was considered to contain an SFP if it had a BH adjusted p-value of less than 0.05 and a log fold change (logFC) greater than 0.5. The genotype with the reduced signal of hybridization was determined to contain the SFP.
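Equation 5.1 can be illustrated with a minimal Python sketch for a single probe set. The median polish below is a plain Tukey median polish with a fixed number of sweeps, written for illustration; it stands in for the RMA summarization step and is not the affy implementation. The function names and the toy matrix (rows are probes, columns are arrays, values are log2 intensities assumed to be background corrected and normalized already) are hypothetical.

```python
from statistics import median


def median_polish(matrix, n_iter=10):
    """Tukey median polish on a probes x arrays matrix of log2 intensities.

    Returns one expression estimate per array (overall + array effect).
    A fixed number of sweeps is used instead of a convergence test.
    """
    rows, cols = len(matrix), len(matrix[0])
    resid = [row[:] for row in matrix]
    overall = 0.0
    row_eff = [0.0] * rows
    col_eff = [0.0] * cols
    for _ in range(n_iter):
        # Sweep out probe (row) medians.
        for r in range(rows):
            med = median(resid[r])
            row_eff[r] += med
            resid[r] = [v - med for v in resid[r]]
        med = median(col_eff)
        overall += med
        col_eff = [c - med for c in col_eff]
        # Sweep out array (column) medians.
        for c in range(cols):
            med = median(resid[r][c] for r in range(rows))
            col_eff[c] += med
            for r in range(rows):
                resid[r][c] -= med
        med = median(row_eff)
        overall += med
        row_eff = [e - med for e in row_eff]
    return [overall + c for c in col_eff]


def hybridization_efficiency(log2_intensity):
    """H_pa = (log2 I_pa - log2 I-hat_a) + mu for every probe and array."""
    expr = median_polish(log2_intensity)   # log2 I-hat_a, one per array
    mu = sum(expr) / len(expr)             # mean expression across arrays
    return [[y - e + mu for y, e in zip(row, expr)]
            for row in log2_intensity]


# Toy probe set: 3 probes x 4 arrays, purely additive (no SFP present).
log2_intensity = [
    [8.0, 9.0, 10.0, 9.0],
    [9.0, 10.0, 11.0, 10.0],
    [7.0, 8.0, 9.0, 8.0],
]
H = hybridization_efficiency(log2_intensity)
for row in H:
    print(row)
```

Because the toy data contain only probe and array effects, each probe's adjusted values are constant across arrays. A probe carrying an SFP would instead show a shift between the arrays of the two genotypes, which is the difference the subsequent per-probe genotype test is meant to detect.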
5.3.5 SFP Masking Procedure and Evaluation
In order to mitigate the effect of SFPs on expression estimates, we first predict and then mask out SFPs in the dataset. The procedure described above is used to perform the initial prediction. However, instead of using a stringent multiple-testing-corrected BH p-value < 0.05, we chose to use a raw p-value < 0.05 as the criterion for selecting probes to mask out. This allows borderline SFP-containing probes to also be removed from the analysis. Probe sets that contained more than seven SFPs were removed from further analysis, as those with four or fewer probes remaining are not likely to generate reliable expression estimates. Figure 5.2 shows an example of a probe set with 10 of 11 probes called as containing an SFP. A significant number of Morex probes (probes 5 to 8) show a decrease in hybridization efficiency (possibly due to a splice variant). This results in an underestimate of expression for Morex, and the SFP prediction algorithm calls an additional five probes as containing an SFP in the Golden Promise cultivar, which are clearly false positives. Once the probes and probe sets to mask are identified, they are removed from the dataset and the RMA median polish procedure is performed again to produce new estimates of gene expression. Differential expression analysis and SFP prediction are then performed using the newly generated masked expression values. To evaluate how masking candidate SFPs affected both differential expression analysis and SFP detection, we estimated the proportions of false positives and false negatives by comparing the results obtained with and without masking. We treated the differential expression and SFP prediction results from the masked dataset as a reference (truth). These results were then compared to the results from the unmasked dataset in the manner described in Table 5.1.
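The selection step of the masking procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the raw p-values are made-up inputs keyed by hypothetical probe set names, only the raw p-value criterion is modelled (whether a fold change criterion is also applied at this stage is not assumed here), and the "more than seven" cutoff is taken directly from the text.

```python
def select_mask(raw_pvalues, alpha=0.05, max_sfps=7):
    """Choose probes to mask and probe sets to drop entirely.

    `raw_pvalues` maps a probe set name to {probe_index: raw p-value} from
    the preliminary SFP test (a hypothetical input layout).
    """
    masked_probes = {}
    dropped_sets = []
    for probe_set, pvals in raw_pvalues.items():
        # Candidate SFPs: probes with an unadjusted p-value below alpha.
        flagged = sorted(p for p, pv in pvals.items() if pv < alpha)
        if len(flagged) > max_sfps:
            # Too few probes would remain for a reliable estimate.
            dropped_sets.append(probe_set)
        elif flagged:
            masked_probes[probe_set] = flagged
    return masked_probes, dropped_sets


# Made-up p-values for three probe sets.
raw_pvalues = {
    "set1": {1: 0.20, 2: 0.01, 3: 0.60, 4: 0.04, 5: 0.33},
    "set2": {**{i: 0.001 for i in range(1, 11)}, 11: 0.50},
    "set3": {1: 0.70, 2: 0.45, 3: 0.09},
}
masked_probes, dropped_sets = select_mask(raw_pvalues)
print(masked_probes)
print(dropped_sets)
```

Expression would then be re-estimated from the probes that remain in each retained probe set before the final differential expression analysis and SFP prediction.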
Genes that were significant for differential expression, or SFP (p-value and logFC below threshold), and non-significant (p-value or logFC above threshold) in both unmasked and SFP masked datasets were considered true positive (TP) and true negatives (TN), respectively. The result was considered a false positive (FP) when significance was found in the unmasked dataset, but becomes non-significant in the
[Figure 5.2 plot. Labels: MX, GP. x-axis: probe (1 to 11); y-axis: log intensity. Probe set rbaal30a10_s_at; SFPs called at probe(s): 1, 2, 4, 5, 6, 7, 8, 9, 10, 11]
Figure 5.2: Single Feature Polymorphism detection in a gene with a high number of called SFPs (10 of 11). Each pane contains the log2 intensity values (y-axis) of each array (30 microarrays) and each probe (x-axis), with the Morex cultivar shown in red and Golden Promise in black. The left pane shows raw intensity values and average probe set expression values for each genotype (horizontal lines). The right pane shows log2 intensity values after expression adjustment; each probe is then tested for a genotype effect. The bottom axis tick marks show the called SFPs (red for Morex, black for Golden Promise).
Table 5.1: Comparison of differential expression analysis and SFP prediction with and without the SFP mask. Rows: unmasked SFPs; columns: masked SFPs. "Positive" and "Negative" mean positive or negative for differential expression, or SFP.

Unmasked SFPs    Masked SFPs: Positive    Masked SFPs: Negative
Positive         TRUE POSITIVE            FALSE POSITIVE
Negative         FALSE NEGATIVE           TRUE NEGATIVE
masked dataset. Conversely, a false negative (FN) occurred when significance was found in the masked dataset but not in the unmasked dataset. The rates are then computed as: false positive rate (FPR) = FP/(FP + TP) and false negative rate (FNR) = FN/(FN + TN).
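The four scenarios of Table 5.1 and the two rates can be expressed compactly. The sketch below is illustrative Python, not the analysis code of this study; it assumes each gene (or probe) is summarized as a pair of booleans (significant in the unmasked data, significant in the masked data), with the masked result treated as the reference. The names `classify` and `induced_rates` and the toy input are hypothetical.

```python
from collections import Counter


def classify(unmasked_sig, masked_sig):
    """Label one result using the masked dataset as the reference (truth)."""
    if unmasked_sig and masked_sig:
        return "TP"
    if unmasked_sig:
        return "FP"  # significant only before masking
    if masked_sig:
        return "FN"  # significant only after masking
    return "TN"


def induced_rates(pairs):
    """Return (counts, FPR, FNR) with FPR = FP/(FP+TP) and FNR = FN/(FN+TN)."""
    counts = Counter(classify(u, m) for u, m in pairs)
    called = counts["FP"] + counts["TP"]      # significant in unmasked data
    not_called = counts["FN"] + counts["TN"]  # non-significant in unmasked data
    fpr = counts["FP"] / called if called else float("nan")
    fnr = counts["FN"] / not_called if not_called else float("nan")
    return counts, fpr, fnr


# Hypothetical toy input: 10 genes as (unmasked, masked) significance flags.
pairs = [(True, True)] * 3 + [(True, False)] + [(False, True)] + [(False, False)] * 5
counts, fpr, fnr = induced_rates(pairs)
print(dict(counts), fpr, round(fnr, 3))
```

Note that both denominators are defined by the unmasked calls, so the rates describe how often an unmasked result would change once candidate SFPs are masked.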
5.4 Results
We preprocessed the dataset as described in "Materials and Methods", leaving 9,222 probe sets (101,499 probes) for analysis of differential expression and SFP detection. We chose to use only those probe sets that were unambiguously expressed across all samples. We felt this was the best set of genes for evaluating the impact of genetic variation on gene expression experiments, without needing to consider whether a gene was expressed or not. We first determined results for SFP detection and differential expression without any adjustments for genetic variation. This resulted in 3,203 called SFPs (1,782 Morex SFPs and 1,421 GP SFPs, 3.2% of all probes), which occurred in 1,665 genes, or 18.1% of all genes tested. Differential expression analysis resulted in 335 genes (151 up-regulated in Morex and 184 up-regulated in GP, 3.6% of all probe sets). These are the results a typical analysis will produce when genetic variation is not considered in the estimate of gene expression. Next, we determined a candidate set of probes to mask out of the dataset. Relaxing the criterion to an unadjusted p-value < 0.05 resulted in 3,414 candidate SFPs to mask (3.4%
of all probes). Interestingly, this is only 211 more candidate SFPs than the original BH adjusted set of 3,203 SFPs. This is at least partially due to the additional restriction of requiring a logFC > 0.5 to be called an SFP. Not including a minimum logFC would have resulted in 4,952 called SFPs after BH adjustment and 10,917 called SFPs with no BH adjustment, a twofold increase. This shows that most of the would-be "called" SFPs have a small logFC difference (< 0.5). Morey et al. found that including an expression fold change requirement increased the validation rate for differential expression by qPCR, and the MAQC Consortium further found an increase in the agreement of differentially expressed genes across microarray experiments [46, 49]. We see no reason to believe the same would not be true for SFP detection. In addition to the total number of candidate SFPs, we also determined the number of SFP-containing probes within each probe set. A total of 1,749 probe sets contained at least one candidate SFP to mask, a 5% increase in the number of probe sets relative to the original BH adjusted set of 1,665. Most probe sets contained only a single called SFP (988 probe sets, 59%), and 55 probe sets contained eight or more called SFPs (488 SFPs in total). Since masking removes probes from a probe set for the purpose of expression estimation, a minimum number of probes must remain from which to estimate gene expression. For this experiment we chose four probes as the minimum, but others may choose a larger (or smaller) number of probes depending on the experiment. While the 55 probe sets that fall below this minimum are removed from further consideration for differential expression and SFP calling, they should not be discarded from the study. This set of genes represents the best candidates for possible structural variation and/or INDELs between the populations and requires further scrutiny (see Figure 5.2).
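The minimum-probe rule described above can be illustrated as follows. This is a hedged Python sketch rather than the study's implementation: it assumes 11 probes per probe set (as on the barley array), a minimum of four remaining probes, and a mapping from probe set name to the set of candidate probe positions. The function name `plan_mask` and the probe set identifiers are hypothetical.

```python
def plan_mask(candidates, probes_per_set=11, min_probes=4):
    """Split probe sets into those to mask and those to set aside entirely.

    candidates maps a probe set name to the set of probe positions flagged
    as candidate SFPs (raw p-value < 0.05 and logFC > 0.5 in the text).
    """
    to_mask = {}
    removed_sets = []
    for probe_set, flagged in candidates.items():
        remaining = probes_per_set - len(flagged)
        if remaining < min_probes:
            # Too few probes left for a reliable expression estimate.
            removed_sets.append(probe_set)
        else:
            to_mask[probe_set] = sorted(flagged)
    return to_mask, removed_sets


# Hypothetical candidate calls for three probe sets.
candidates = {
    "set_A": {5},
    "set_B": {1, 2, 3, 4, 5, 6, 7},
    "set_C": {1, 2, 4, 5, 6, 7, 8, 9, 10, 11},
}
to_mask, removed = plan_mask(candidates)
print(to_mask, removed)
```

With these defaults a probe set with seven flagged probes keeps exactly four probes and is retained, while one with eight or more is set aside for separate scrutiny.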
It is also worth noting that 50 of these 55 genes were also called as differentially expressed (22 up-regulated in Morex and 28 up-regulated in GP); these are likely to be false positives. These 55 genes were removed from the next stage of the analysis (leaving 9,167 probe sets and 2,926 probes to mask), which was to mask the dataset for SFPs and redetermine SFP calls and differential expression. We masked the dataset by removing the 2,926 probes prior to the median polish
procedure from the RMA preprocessing routine. This produces new expression estimates for each gene and sample based on a subset of probes not likely to contain genetic variation between the populations. The new estimates of gene expression are then used for calculation of differential expression and in the calculations for SFP calls. While the 2,926 probes are removed for expression calling, they are not removed from the experiment and are re-evaluated for the presence of an SFP using the masked expression estimates. After removing the 55 genes as described above, there remained 100,894 probes to test for the presence of an SFP. After masking, 2,678 probes were called as containing an SFP (1,152 Morex SFPs and 1,526 GP SFPs, 2.7% of all probes), occurring in 1,605 genes, which is 17.5% of the genes tested. This is a total drop of 525 called SFPs in 60 fewer genes. This is partially due to having removed the 55 genes that contained eight or more called SFPs. These genes accounted for 479 of the 525 probes and 49 of the 60 genes that differ between the masked and unmasked results. This does not mean that 479 of the 525 are false positive SFP calls, since these probe sets were removed for containing too many probable SFPs. However, many of these may still be false positives due to inaccurate expression estimates and should be investigated separately. Not considering the 55 removed genes, the masked results contain 136 probes previously called as SFPs that were no longer called, and 90 new probes called as containing an SFP that were not called in the unmasked analysis, for a net reduction of 46 called SFPs. After masking the dataset, differential expression analysis resulted in 251 differentially expressed genes (117 up-regulated in Morex and 134 up-regulated in GP), a reduction of 84 genes from the unmasked analysis (335 differentially expressed genes in the unmasked results).
However, 50 of these 84 genes were from probe sets that contained eight or more probable SFPs and were removed from the masked analysis. These 50 genes still represent probable false positive differential expression calls, but differential expression cannot be reliably tested due to the small number of available probes after masking. Not considering the 55 genes removed for the masked analysis, differential expression analysis of the masked dataset resulted in 40 genes no longer considered differentially expressed that were labeled as
Table 5.2: Contingency table comparing the results between masked and unmasked data for differential expression. Rows: unmasked data; columns: masked data.

Unmasked Data     Masked: Morex    Masked: Neither    Masked: Golden Promise
Morex             113              16                 0
Neither           4                8,876              2
Golden Promise    0                24                 132
such in the unmasked dataset, and six genes that were now labeled as differentially expressed in the masked dataset but were not in the unmasked dataset, for a net reduction of 34 differentially expressed genes. Next, we assessed whether the masking procedure improved results for differential expression, SFP calling, or both. We calculated the FPR and FNR for differential expression and SFP calls, using the four scenarios described in Table 5.1, where the masked dataset was considered to be the reference dataset (i.e. the truth) (Tables 5.2 and 5.3). It should be noted that the reference dataset may not be free of false positives or false negatives, as there may be sources of error other than SFPs, but the procedure still allows for the determination of the rates that are induced by SFPs. We found the SFP-induced false positive rate to be 14% for differential expression and 5.0% for SFP calls. This demonstrates that removing SFPs prior to differential expression analysis and performing a reanalysis of SFPs will reduce the rate of false positives in the study. SFPs, however, did not affect the false negative rate (FNR) as significantly: it was 0.07% for differential expression and 0.09% for SFP calls. In addition, these rates do not include the 55 genes removed for having eight or more masked probes, which should be considered separately but would not have been removed in a typical analysis with no probe masking.
Table 5.3: Contingency table comparing the results between masked and unmasked data for SFP prediction. Rows: unmasked data; columns: masked data.

Unmasked Data     Masked: Morex    Masked: Neither    Masked: Golden Promise
Morex             1,109            71                 0
Neither           43               98,080             47
Golden Promise    0                65                 1,479
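As a check on the reported rates, the counts in Tables 5.2 and 5.3 can be collapsed into the four scenarios of Table 5.1. The Python sketch below is illustrative only; the table values are copied from the tables above (rows are unmasked calls, columns are masked calls), while the function names `collapse` and `rates_percent` are hypothetical. Cells where the two analyses call opposite genotypes are zero in both tables, so the sketch simply refuses to handle a non-zero count there.

```python
LABELS = ("Morex", "Neither", "Golden Promise")

# Rows: unmasked call; columns: masked call (order given by LABELS).
DE_TABLE = [[113, 16, 0], [4, 8876, 2], [0, 24, 132]]
SFP_TABLE = [[1109, 71, 0], [43, 98080, 47], [0, 65, 1479]]


def collapse(table):
    """Collapse a 3x3 genotype table into (TP, FP, FN, TN) counts."""
    neither = LABELS.index("Neither")
    tp = fp = fn = tn = 0
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            if i == neither and j == neither:
                tn += n
            elif i == neither:
                fn += n  # called only after masking
            elif j == neither:
                fp += n  # called only before masking
            elif i == j:
                tp += n
            elif n:
                raise ValueError("genotype switch between analyses")
    return tp, fp, fn, tn


def rates_percent(table):
    """FPR = FP/(FP+TP) and FNR = FN/(FN+TN), both as percentages."""
    tp, fp, fn, tn = collapse(table)
    return 100 * fp / (fp + tp), 100 * fn / (fn + tn)


de_fpr, de_fnr = rates_percent(DE_TABLE)
sfp_fpr, sfp_fnr = rates_percent(SFP_TABLE)
print(round(de_fpr, 1), round(de_fnr, 2))    # 14.0 0.07
print(round(sfp_fpr, 1), round(sfp_fnr, 2))  # 5.0 0.09
```

The rounded values agree with the 14% and 5.0% false positive rates and the 0.07% and 0.09% false negative rates given in the text.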
Finally, we compared the masked SFP results to the unmasked SFP results for sensitivity, specificity and false discovery rate, using a set of known SNPs found in the sequence confirmation dataset. The sequence confirmation dataset is a collection of SNPs from three barley sequence sources totaling 2,699 sequences. After removing sequences from probe sets that were filtered out (as described in section "Data Preprocessing" above) and those from probe sets that contained eight or more masked SFPs (55 probe sets), the sequence confirmation dataset consisted of 1,391 sequences, of which 131 were polymorphic in Golden Promise and 101 in Morex; the remaining 1,159 sequences did not contain a SNP and serve as true negatives. The sensitivity is the proportion of known polymorphisms in the sequence confirmation dataset called correctly as an SFP by the algorithm. Masking increased the sensitivity of detection by 0.9 percentage points, from 76.3% to 77.2%. The specificity is the proportion of known negatives (no SNP) correctly identified as such. Masking improved the specificity slightly, from 99.4% to 99.5%. Finally, the false discovery rate (FDR) is the proportion of SFPs called by the algorithm that are incorrectly identified as a polymorphism, when the sequence confirmation dataset indicates that no SNP is present. The lower the FDR the better, and masking probes lowered the FDR by 0.56 percentage points, from 5.35% to 4.79%.
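The three validation metrics defined above follow directly from a comparison of SFP calls against the sequence confirmation dataset. The sketch below is a minimal Python illustration under stated assumptions: the counts are hypothetical (the underlying confusion counts are not given in the text), and the function name `confirmation_metrics` is invented for this example.

```python
def confirmation_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and FDR from sequence-confirmed counts.

    tp: known polymorphic probes called as an SFP
    fn: known polymorphic probes not called
    tn: known SNP-free probes not called
    fp: known SNP-free probes called as an SFP
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fdr = fp / (fp + tp)
    return sensitivity, specificity, fdr


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
sens, spec, fdr = confirmation_metrics(tp=80, fn=20, tn=990, fp=10)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} FDR={fdr:.3f}")
```

Sensitivity and specificity are conditioned on the known sequence status, whereas the FDR is conditioned on the calls, which is why a highly specific method can still show a noticeable FDR when true negatives greatly outnumber true positives.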
5.5 Discussion
Our analysis suggests that the presence of genetic variation within an Affymetrix microarray study will generate false positives and false negatives in both differential expression analysis and the detection of SFPs. We have described a procedure to mask out candidate SFPs and recalculate expression estimates from the data, prior to differential expression analysis and final SFP detection. Masking the dataset in this way was shown to have the potential to reduce the false positive rate (FPR) by 14% for differential expression and 5.0% for SFP calls in this experiment, while the false negative rate (FNR) did not appear to be significantly affected by SFPs. The FPR values determined here are on the same order as those found in previous studies that used sequence databases of known polymorphic sites. The FNR found in this study was similar to that found in the study with Affymetrix® Exon arrays [10], but was significantly less than that of the comparable study using Affymetrix® Mouse 3' IVT microarrays (12% compared to 0.06%) [67]. This is at least partially due to the difference in the number of probes masked (> 13,000 compared to 3,500 probes).
Finally, we have detailed a procedure for identifying and masking out probes that leads to a lower false positive rate in both differential expression and prediction of genetic polymorphisms. The method described does not rely on any pre-existing knowledge of sequence variation, or require a reference database of known genetic polymorphisms. The procedure can be applied to any microarray dataset where comparisons are being made between populations and is well suited for studies in ecology and conservation.
Chapter 6
EVALUATION OF MICROARRAY PREPROCESSING ROUTINES USING SINGLE FEATURE POLYMORPHISMS
6.1 Abstract
The choice of microarray preprocessing pipeline has a significant impact on experimental results. Many previously published reports have evaluated preprocessing pipelines, but all have compared expected gene expression levels, measured by spike-in or qRT-PCR, to the gene expression levels calculated from the microarray. This is potentially problematic, because expected gene expression levels determined by spike-in or qRT-PCR are available for only a relatively small number of genes, which are then expected to represent the much larger set measured by a microarray. Further, these methods also contain their own sources of technical error. Here we present a new method for evaluating preprocessing pipelines by their ability to detect genetic polymorphisms in expression microarray data. Detection of genetic polymorphisms in expression microarray data relies most heavily on an accurate estimation of gene expression; therefore, the pipeline that best predicts genetic polymorphisms is implied to produce better expression estimates. We compare five commonly used preprocessing pipelines (RMA, GCRMA, dChip, MAS, and PLIER) for their ability to detect single feature polymorphisms in two barley microarray datasets. Further, we compared individual subroutine performance using 96 new preprocessing pipelines generated from all possible combinations of the five standard pipelines' background correction, normalization, PM probe correction and summarization routines. Results show that, of the five standard preprocessing routines, the RMA preprocessing pipeline has the overall best results for sensitivity, specificity and FDR across both
datasets. Use of MM probes for PM probe correction reduces specificity in particular and performs poorly in general. Model-based summarization methods (PLIER, median polish, MBEI) significantly outperform the method that is not model-based (MAS), particularly with larger sample sizes. GCRMA background correction performed poorly with a small number of samples, and in general non-model-based background correction outperformed model-based background correction in both datasets. The results shown here can help inform the analyst's choice of preprocessing algorithm for their datasets.
6.2 Introduction
DNA microarrays are a significant tool for the study of genome-wide gene expression differences. With millions of probes on a DNA microarray, the user's choice of preprocessing method will have a significant impact on the biological inferences made from the experiment. However, it is difficult to know which of the many preprocessing methods should be used and which will produce erroneous results. Previous methods for evaluating preprocessing routines have been based on comparisons of microarray-computed expression levels to expected levels of expression determined using other molecular techniques (e.g., qRT-PCR). These comparisons, however, are typically based on only a small number of genes, and the techniques used to determine expression have their own sources of technical error. A method for evaluating preprocessing routines that does not rely on knowledge of expression levels, which are difficult to quantify absolutely, but instead uses a feature that is more absolute, would provide new insights into the performance characteristics of microarray preprocessing. Here we present a new method for evaluating preprocessing routines by their effectiveness in detecting microarray probe-level differences caused by genetic polymorphisms, or single feature polymorphisms (SFPs). Affymetrix® microarrays have been used for the discovery of SFPs in yeast [55], Arabidopsis thaliana [14], barley [19, 56], wheat [18] and others [41, 39, 21, 75]. SFPs, where a feature is a short-oligonucleotide probe on a microarray, are differences in probe
level hybridization efficiencies between samples caused by underlying genetic or transcriptional variation (see Chapter 4). A probe's hybridization efficiency can be determined by computing the difference between the probe's observed intensity (background corrected and normalized intensity value) and the expected intensity value (intensity due to expression) for each probe [56, 55] (see Figure 6.1). The resulting measure has any signal from gene expression removed and can then be analyzed for differences between sets of samples. Therefore, successful detection of SFPs relies on an accurate estimate of gene expression. A sequence database of known polymorphisms can be used to validate SFP detection and, by extension, evaluate a preprocessing routine's accuracy in estimating expression. The preprocessing procedure that best predicts known SFPs will also be the procedure that best estimates expression.
Affymetrix® GeneChip® microarrays are the most common brand of microarray in use today. They consist of small DNA fragments, termed probes, chemically synthesized at specific locations on a coated quartz surface [44]. Each probe is an oligonucleotide sequence 25 bp in length. A probe set comprising 11 to 20 probe pairs typically represents the mRNA molecule of interest. Each probe pair is composed of one perfect match (PM) and one mismatch (MM) probe, where the mismatch probe contains the same DNA sequence as the PM probe, but with the middle (13th) base complemented. The MM probes were originally meant to provide an estimate of non-specific binding and background, but have since been shown to contain a significant amount of foreground signal as well [32, 73]. Labeled cDNA samples are hybridized to the microarray and scanned, and image analysis is performed to provide intensity values for each of the PM and MM probes. To determine the original expression levels of the cDNA sample on a microarray, a microarray preprocessing procedure is performed. The objectives of preprocessing microarray data are twofold: 1) to remove variation due to technical sources while preserving variation from biological sources, and 2) to produce summarized expression values for each gene (probe set). A typical workflow for preprocessing microarray data from raw probe level signal intensities is: background correction, across sample normalization, MM
[Figure 6.1 plots, panels A and B. Labels: Morex, Golden Promise. x-axis: probe (1 to 11); y-axis: log intensity. Probe sets shown: Contig742_at (SFPs called at probe(s): 10, 11; sequence confirmed at probe(s): 10, 11) and Contig4511_at (SFPs called at probe(s): 5; sequence confirmed at probe(s): 5)]
Figure 6.1: Single Feature Polymorphism detection using the RMA preprocessing pipeline with a low expression variability gene (A) and one with high expression variability (B) in the E-TABM-113 dataset. Each pane contains the log2 intensity values (y-axis) of each array (six microarrays) and each probe (x-axis), with the Morex cultivar shown in red and the Golden Promise cultivar in black. The left pane shows raw intensity values and average probe set expression values for each genotype (horizontal lines). The right pane shows log2 intensity values after expression adjustment; each probe is then tested for a genotype difference. The bottom axis tick marks show the called SFPs (red for Morex, black for Golden Promise), while the top axis tick marks show probes with known SFPs (red/black) and known SFP-free probes (green), according to the sequence confirmation dataset.
correction (if MM probes exist) and finally probe set summarization. Further, these four stages are usually packaged together into a single preprocessing pipeline (e.g., MAS5 [2], RMA [32, 31, 12], GCRMA [35], PLIER [3], and dChip [43, 42]), and the user chooses a preprocessing pipeline to produce expression values for their experiment. The procedure that best approximates the technical variation within the experiment will produce a more accurate estimate of sample expression, be more sensitive, and produce fewer false positives in differential expression analysis. Evaluating preprocessing routines is difficult, since we typically cannot absolutely quantify genome-wide expression levels prior to performing a microarray experiment. However, due to its importance, several methods have been proposed for evaluating preprocessing routines. Two of the most common are concordance with qRT-PCR expression analysis and the use of spike-in, or dilution, datasets. Evaluation of preprocessing routines by concordance of qRT-PCR expression results with microarray expression results usually covers only a small subset of the genes on a microarray [29, 48]. However, concordance in only a small number of genes (typically tens) may not translate to concordance on a genome-wide level, with tens of thousands of genes on a microarray. A second method has been to use spike-in datasets, where known quantities of mRNA from specific genes are added to the mixture, and dilution datasets, where multiple samples are pooled together in known proportions [33, 5]. Spike-in datasets have similar issues to qRT-PCR, in that typically only a few genes are spiked into the dataset. The dilution experiment, however, is a whole-array approach to evaluation. In a dilution experiment, the relative gene abundances are expected to change in a predictable manner as the dilution changes, assuming there is no error in the dilution process itself.
The affycomp online tool (http://affycomp.jhsph.edu/), which uses spike-in and dilution experiments, provides a way for researchers to evaluate and compare their preprocessing routines with other preprocessing routines. However, no single preprocessing routine has emerged as a clear "winner", and the authors instead leave it up to individual users to determine which routine is "best" for their experiment. Because of the ambiguity of these
previous methods, we introduce a new method to evaluate preprocessing routines using SFPs. SFPs offer two significant advantages. The first is that genetic variation is absolute and easily measurable. The second is that the number of SFPs and known non-SFPs can be significantly larger (thousands), two orders of magnitude greater than in the typical qRT-PCR or spike-in experiment. In this paper, we present an analysis of the performance of preprocessing algorithms by their ability to correctly detect known sequence polymorphisms (SFPs). We present a comparison of five commonly used preprocessing pipelines (RMA, GCRMA, dChip, MAS, and PLIER) and all possible combinations of their four background correction routines, three normalization routines, two MM probe adjustments and four probe set summarization routines (96 preprocessing routines). Comparisons are made by each pipeline's ability to correctly detect known sequence polymorphisms in two publicly available Affymetrix® barley (Hordeum vulgare) microarray datasets. The two barley datasets use the same two cultivars (Morex and Golden Promise), but differ in their number of microarrays and tissue types. A previously reported DNA sequence database of known polymorphisms between Morex and Golden Promise is used to evaluate and compare the sensitivity, specificity and false feature polymorphism rate of each pipeline. Because identification of SFPs relies on accurate determination of gene expression, this analysis can inform the researcher of which preprocessing routine is most likely to produce better gene expression estimates. A more accurate estimate of gene expression will reduce the false positive rate in gene expression analysis and provide for more meaningful biological conclusions. Finally, we suggest how these results may be used in choosing a preprocessing routine for differential expression analysis in the general case.
6.3 Methods
6.3.1 Experiment Datasets
The Affymetrix® Barley Genome Array contains 22,840 probe sets, each with 11 probe pairs (PM and MM probes), developed by an international collaboration of barley
researchers [16]. Raw data from two publicly available barley datasets were obtained from plexdb (Experiment ID: BB3) and ArrayExpress (http://www.ebi.ac.uk/arrayexpress, Experiment ID: E-TABM-110). The first barley genome microarray dataset (BB3) is a subset of 30 microarrays from the original experiment that contained good quality data from five tissues (stem, vegetative shoot, seedling leaf, coleoptile and hypocotyl), each with three replicates from two barley cultivars, Morex and Golden Promise (GP) [24]. The second dataset (E-TABM-110) is also from barley seedling leaves and contains three samples from each of the Morex and Golden Promise cultivars, for a total of six microarrays (unpublished study). This dataset is used to evaluate preprocessing routines using a number of microarrays more typical of a smaller study, and to assess consistency of preprocessing outcomes across experiments. The sequence confirmation dataset used in this study is the same as that of [56] and can be found at their website (http://naturalsystems.uchicago.edu/naturalvariation/barley/SNPtable.csv). The sequences were collected from three barley sequence sources and total 2,699 sequences. Of those, 30 were duplicated in two of the three sources and one was duplicated in all three. Sixty-six probes contained polymorphisms for both Morex and GP genotypes as compared to the sequence on the microarray. After removing duplicates and sequences with polymorphisms in both genotypes, the sequence confirmation dataset consisted of 2,601 sequences, of which 223 were polymorphic in GP and 178 in Morex; the remaining 2,200 probes did not have a sequence polymorphism between the two cultivars and the microarray. Results of each preprocessing routine were evaluated by direct comparison to the sequence confirmation dataset with calculations of sensitivity, specificity and false discovery rate (FDR).
A pipeline's sensitivity is the proportion of known polymorphisms in the sequence confirmation dataset called as an SFP by the algorithm and pipeline. The specificity is the proportion of known negatives (no SFP) correctly identified as such. The false discovery rate is the proportion of SFPs called by the algorithm that are incorrectly identified
as a polymorphism, when the sequence confirmation dataset indicates that no SNP is present.
6.3.2 Data Preprocessing
All analyses were conducted within the R statistical computing language using publicly available packages from CRAN and Bioconductor (R version 2.11.0, http://www.r-project.org; http://cran.fhcrc.org; http://www.bioconductor.org) [54, 26]. Raw CEL files were read into R using the Bioconductor package affy [25] and checked for quality using pseudo-chip images and residual error visualizations. Quality assurance of microarray data was further checked using the affyQAReport function from the Bioconductor package affyQCReport. Hybridization and housekeeping controls, RNA degradation, sample clustering, NUSE plots, LPE plots, and RLE plots all showed high quality data (results not shown) and no additional microarrays were removed. Microarray Suite (MAS) PMA (Present, Marginal, Absent) calls were determined for each probe set within each array; for this study a marginal call was also considered to be absent. A probe set was retained for further analysis only if all samples were called as present. This filter is expected to keep only probe sets that are unambiguously "expressed" across the entire experiment. If a gene is unexpressed in a sample, any probe containing an SFP would not be detectable, nor could the gene be differentially expressed. A second commonly performed probe set filter is to remove probe sets with a low interquartile range (IQR) across all arrays. This filter is intended to remove probe sets that are "not changing" and therefore potentially biologically uninteresting. However, this filter is only useful for differential expression and not for detection of SFPs, as a probe set can include an SFP but have low expression variability; it was therefore not applied in this study. The most common preprocessing pipelines in use today are MAS5 (Affymetrix), PLIER (Affymetrix), RMA, GCRMA, and dChip. Table 6.1 shows the specific background correction, normalization, PM correction and summarization subroutines used by each.
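The PMA filter described above amounts to a simple rule per probe set. The following Python sketch is illustrative only and is not the R code used in the study; it assumes the detection calls are available as one of "P", "M" or "A" per array, and the function names and probe set identifiers are hypothetical.

```python
def passes_pma_filter(calls):
    """Keep a probe set only if every array is called present ("P").

    Marginal ("M") calls are treated as absent, as in the text.
    """
    return all(call == "P" for call in calls)


def filter_probe_sets(pma_calls):
    """Return the names of probe sets retained for further analysis."""
    return [name for name, calls in pma_calls.items() if passes_pma_filter(calls)]


# Hypothetical detection calls for three probe sets on six arrays.
pma_calls = {
    "set_A": ["P", "P", "P", "P", "P", "P"],
    "set_B": ["P", "P", "M", "P", "P", "P"],  # one marginal call
    "set_C": ["A", "P", "P", "P", "A", "P"],  # absent on two arrays
}
kept = filter_probe_sets(pma_calls)
print(kept)
```

Because a single marginal or absent call removes the probe set, the filter becomes stricter as the number of arrays grows, which is one reason the retained set can differ between a six-array and a 30-array experiment.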
We also wanted to explore how each subroutine affected the expression estimate
and the ability to reliably detect SFPs. Therefore, new preprocessing pipelines were generated comprising all possible combinations of four background routines, three normalization routines, two mismatch probe corrections (the subtract MM procedure from dChip was not included) and four probe set summarization routines (96 total preprocessing pipelines). The four background routines were: no background correction (none), MAS5 background correction [2], RMA background correction [31] and GCRMA background correction [73]. Normalization used three algorithms: no normalization (none), invariant set normalization [43, 42] and quantile normalization (RMA) [13]. PM correction was done using the idealized mismatch routine from Affymetrix (this method prevents negative values when MM > PM) [2] and pmonly, which uses the perfect match probes only, with no PM correction. The subtract mismatch routine from dChip was not used in this study because it will produce negative values when MM > PM, which prevents the log calculation needed for many of the probe set summarization routines. However, we did produce metrics for the standard dChip preprocessing pipeline: no background correction, invariant set normalization, and subtraction of mismatch probes, followed by MBEI summarization. Probe set summarization was performed using MAS5 [2], PLIER [3], median polish [32] and the model based expression method (MBEI) [43, 42]. For a complete description of each of the preprocessing algorithms and routines see [32, 43, 42, 35, 2, 3].
6.3.3 Model of Hybridization Efficiency, Detection of SFPs and Differential Expression
The model of hybridization efficiency used in this study is based on principles from the RMA preprocessing algorithm[32, 31, 12], and was described in detail in Chapter 4. Briefly, the relative hybridization efficiency of a probe was calculated as the difference between the log2 raw probe intensity value and the calculated log2 probe set expression value for the array, adjusted by the mean expression value across all arrays. Resulting values are log2 probe signal values adjusted for the mRNA expression differences between samples.
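To make the calculation concrete, the sketch below estimates probe set expression with a plain Tukey median polish and then computes the hybridization efficiency as the probe value minus the expected array expression, plus the mean expression across arrays. It is an illustrative Python version under stated assumptions, not the RMA implementation used in this study: the input is a small probe-by-array matrix of log2 intensities that are already background corrected and normalized, the array expression is taken as the overall effect plus the array (column) effect, and the function names and toy values are hypothetical.

```python
from statistics import mean, median


def median_polish(x, max_iter=10, tol=1e-6):
    """Tukey median polish of a probes-by-arrays matrix of log2 values.

    Returns (overall, probe_effects, array_effects).
    """
    n_cols = len(x[0])
    resid = [list(row) for row in x]
    overall = 0.0
    row_eff = [0.0] * len(x)
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep out probe (row) medians.
        row_delta = [median(row) for row in resid]
        resid = [[v - d for v in row] for row, d in zip(resid, row_delta)]
        row_eff = [r + d for r, d in zip(row_eff, row_delta)]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep out array (column) medians.
        col_delta = [median(row[j] for row in resid) for j in range(n_cols)]
        resid = [[v - d for v, d in zip(row, col_delta)] for row in resid]
        col_eff = [c + d for c, d in zip(col_eff, col_delta)]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if sum(map(abs, row_delta)) + sum(map(abs, col_delta)) < tol:
            break
    return overall, row_eff, col_eff


def hybridization_efficiency(x):
    """H_pa = (log2 I_pa - log2 I^_a) + mu for every probe p and array a."""
    overall, _, col_eff = median_polish(x)
    expected = [overall + c for c in col_eff]  # expression of each array
    mu = mean(expected)                        # mean expression over arrays
    return [[v - e + mu for v, e in zip(row, expected)] for row in x]


# Toy probe set: 3 probes (rows) on 3 arrays (columns), purely additive.
log2_intensity = [
    [6.5, 7.0, 7.5],
    [7.5, 8.0, 8.5],
    [8.5, 9.0, 9.5],
]
H = hybridization_efficiency(log2_intensity)
print(H)
```

In this additive toy example the array-to-array expression differences are removed completely, so each probe has the same efficiency on every array; a genotype-specific drop in one probe's values would remain visible in H and is what the per-probe test looks for.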
Table 6.1: Breakdown of the preprocessing steps for the MAS5, Plier, RMA, GCRMA, and dChip preprocessing pipelines.

Step                         MAS5                   Plier      RMA                      GCRMA                    dChip
Background Correction        weighted average       none       RMA (global model)       GCRMA (model based)      none
Probe Level Normalization    none                   quantile   quantile normalization   quantile normalization   invariant set
PM Correction                ideal mismatch         none       none                     none                     subtract MM
Probe Set Summarization      Tukey biweight mean    Plier      median polish            median polish            MBEI
Probe Set Normalization      scaled normalization   none       none                     none                     none
\[
H_{pa} = \left( \log_2(I_{pa}) - \log_2(\hat{I}_{a}) \right) + \mu \tag{6.1}
\]
where Hpa is the calculated hybridization efficiency of probe p on array a, log2(Ipa) is the log2 intensity value of probe p on array a in the probe set after background correction, normalization and PM correction, and log2(Îa) is the expected intensity value (expression) of array a for the entire probe set. The difference is then adjusted by the mean log expression value across all arrays, µ, in order to scale each probe value by the overall intensity values. When partnered with the empirical Bayes adjustment to the variance in the limma package, this has the additional effect of giving decreased weight to those probe sets with low overall expression. The pipeline’s summarization procedure is used to compute the expected intensity values for each probe set of each array [12].
SFP detection and differential expression were determined only for those probes within probe sets that passed the PMA filter described above, and were conducted in the following manner. First, hybridization efficiencies for each probe were calculated as described above, using the preprocessing subroutines from the pipeline being evaluated. Each probe was then tested for a genotype effect using the limma approach with empirical Bayes adjustment to the variance, followed by BH correction for multiple testing [61, 9]. A
probe was considered to contain an SFP if it had a BH adjusted p-value of less than 0.05 and a log fold change (logFC) greater than 0.5. The genotype with the reduced hybridization signal was determined to contain the SFP.
Differentially expressed genes were determined using the linear modeling approach from the limma package [62] with empirical Bayes adjustment to the variance [61]. We used a linear model that included both genotype and tissue as main effects for the BB3 dataset and genotype only for the E-TABM-113 dataset. The resulting p-values were then adjusted for multiple testing using the Benjamini and Hochberg (BH) false discovery rate correction procedure [9]. A gene was considered to be differentially expressed if it had an adjusted p-value less than 0.05 and a log fold change (logFC) greater than 0.5. Using both p-value and fold change criteria to determine differential expression is recommended by the MicroArray Quality Control (MAQC) project [46, 59].
6.4
Results
Data from two barley microarray datasets (E-TABM-113 and BB3) were checked for quality and then processed to determine which probe sets to filter for presence/absence of expression. Filtering the BB3 dataset by PMA calls reduced the number of probe sets from 22,840 to 9,222 (251,240 to 101,499 probes), and filtering the E-TABM-113 dataset reduced the number of probe sets from 22,840 to 12,238 (251,240 to 134,785 probes). We then computed expression estimates and hybridization efficiencies for each of the preprocessing routines, tested each probe for the presence of an SFP and tested each gene for differential expression. The sensitivity, specificity and false discovery rate, as compared to the confirmation dataset, as well as the number of SFPs called, the number of genes containing at least one SFP and the number of differentially expressed genes for each preprocessing routine are shown in Figures 6.2 and 6.3 (detailed tables are in supplemental tables A.1 and A.2). Within the BB3 dataset, however, GCRMA background correction followed by invariant set normalization left too few "invariant" probes to normalize on, so those pipelines failed and only 89 preprocessing routines could be evaluated for that dataset.
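The routine counts reported for the two datasets can be reproduced by enumerating the subroutine combinations. The sketch below is illustrative only: the dot-separated labels follow the background.normalization.PMcorrection.summarization naming seen in labels such as gcrma.quantiles.pmonly.liwong, while the function names and the `failed_prefixes` parameter are hypothetical.

```python
from itertools import product

BACKGROUND = ["none", "mas", "rma", "gcrma"]
NORMALIZATION = ["none", "invariantset", "quantiles"]
PM_CORRECTION = ["mas", "pmonly"]
SUMMARIZATION = ["mas", "plier", "medianpolish", "liwong"]

# Standard dChip pipeline, evaluated in addition to the generated ones.
DCHIP_STANDARD = "none.invariantset.subtractmm.liwong"


def generated_pipelines():
    """All combinations of the four subroutine choices, as dotted labels."""
    combos = product(BACKGROUND, NORMALIZATION, PM_CORRECTION, SUMMARIZATION)
    return [".".join(combo) for combo in combos]


def evaluated_pipelines(failed_prefixes=()):
    """Generated pipelines minus those that failed, plus standard dChip."""
    kept = [p for p in generated_pipelines()
            if not p.startswith(tuple(failed_prefixes))]
    return kept + [DCHIP_STANDARD]


all_generated = generated_pipelines()
etabm113 = evaluated_pipelines()
# In BB3, GCRMA background followed by invariant set normalization failed.
bb3 = evaluated_pipelines(failed_prefixes=["gcrma.invariantset."])
print(len(all_generated), len(etabm113), len(bb3))  # 96 97 89
```

The eight failing BB3 combinations (two PM corrections times four summarization routines) account for the difference between the 97 routines evaluated for E-TABM-113 and the 89 evaluated for BB3.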
[Figure 6.2. Panels: A. Complete Linkage Cluster of SFP Calls, with leaves labeled background.normalization.PMcorrection.summarization (e.g. gcrma.quantiles.pmonly.liwong); B. Sensitivity, Specificity, FDR (Percentage); C. Predicted SFPs (Number of SFPs); D. Genes Containing an SFP (Number of Genes); E. Differentially Expressed Genes (Number of Genes). Legend: MX, GP.]
Figure 6.2: Clustering (pane A.) of preprocessing routines (89 preprocessing routines) for the BB3 dataset with the performance characteristics sensitivity, specificity and FDR (pane B.), the number of called SFPs (pane C.), the number of genes containing a called SFP (pane D.) and the number of differentially expressed genes (pane E.).
[Figure 6.3. Panels: A. Complete Linkage Cluster of SFP Calls, with leaves labeled background.normalization.PMcorrection.summarization (e.g. none.quantiles.pmonly.plier); B. Sensitivity, Specificity, FDR (Percentage); C. Predicted SFPs (Number of SFPs); D. Genes Containing an SFP (Number of Genes); E. Differentially Expressed Genes (Number of Genes). Legend: MX, GP.]
Figure 6.3: Clustering (pane A.) of preprocessing routines (97 preprocessing routines) for the E-TABM-113 dataset with the performance characteristics sensitivity, specificity and FDR (pane B.), the number of called SFPs (pane C.), the number of genes containing a called SFP (pane D.) and the number of differentially expressed genes (pane E.).
The BB3 dataset represents a relatively large microarray study with 30 microarrays. Some of the preprocessing algorithms, in particular the probe set summarization subroutines MBEI and PLIER, use complex model based estimations which should benefit from the additional information available within a large dataset. The E-TABM-113 dataset, however, is a small study of six microarrays and may represent a common experimental design of a single factor comparison (i.e. Treatment vs Control). Comparing the performance of each preprocessing routine between the two datasets is important, as it should highlight how the number of arrays affects results and help inform the analyst of which method to use under which conditions. To compare the results from each of the preprocessing routines, a linear model was fitted to the data within each experiment. Analysis of variance tables were computed to determine the impact that preprocessing subroutines had on the sensitivity, specificity and FDR of SFP calls. Finally, a Tukey’s HSD post hoc analysis was performed for each subroutine with adjusted p-values (family-wise error rates) in order to compare individual algorithms within each subroutine. Figures 6.4 and 6.5 show box plots of each algorithm tested within each preprocessing subroutine in the BB3 and E-TABM-113 datasets, respectively. No significant or consistent interaction between preprocessing subroutines was observed.
\[
\text{sensitivity},\ \text{specificity},\ \text{FDR} = \text{background} + \text{normalization} + \text{MM correction} + \text{summarization} + \varepsilon \tag{6.2}
\]
6.4.1
Characteristics of Preprocessing Subroutines
[Figure 6.4. Box plots of Sensitivity (%), Specificity (%) and False Sequence Polymorphism Rate (%), grouped by Background Adjustment Method (gcrma, mas, none, rma), Normalization Method (invariantset, none, quantiles), MM Probe Adjustment Method (mas, pmonly) and Probeset Summarization Method (liwong, mas, medianpolish, plier).]
Figure 6.4: Sensitivity (in percent, higher is best), specificity (in percent, higher is best), and false discovery rate (in percent, lower is best), within each of the four preprocessing subroutines: background correction, normalization, PM probe adjustment, and probe set summarization for the BB3 dataset. Boxplots show standard information (25th, 50th and 75th percentiles, min/max and outliers). In addition, the box plot notches show the approximate 95% confidence interval for the medians. If the notches of two box plots do not overlap, this is "strong evidence" that the two medians are statistically different [47].
[Figure 6.5. Box plots of Sensitivity (%), Specificity (%) and False Sequence Polymorphism Rate (%), grouped by Background Adjustment Method (gcrma, mas, none, rma), Normalization Method (invariantset, none, quantiles), MM Probe Adjustment Method (mas, pmonly) and Probeset Summarization Method (liwong, mas, medianpolish, plier).]
Figure 6.5: Sensitivity (in percent, higher is best), specificity (in percent, higher is best), and false discovery rate (in percent, lower is best), within each of the four preprocessing subroutines: background correction, normalization, PM probe adjustment, and probe set summarization for the E-TABM-113 dataset. Boxplots show standard information (25th, 50th and 75th percentiles, min/max and outliers). In addition, the box plot notches show the approximate 95% confidence interval for the medians. If the notches of two box plots do not overlap, this is "strong evidence" that the two medians are statistically different [47].
Preprocessing of microarray data typically involves four steps: background correction, normalization, PM probe correction and summarization. The impact that each of these steps has on the estimate of gene expression, and therefore on SFP calls, is presented in Table 6.2. Results show that the choice of background correction, mismatch probe adjustment and summarization played a significant role in all measurements, whereas normalization was significant only for sensitivity in the BB3 dataset. The choice of background correction procedure had the largest overall impact on results for the smaller E-TABM-113 dataset (a 19.5% increase in sensitivity when choosing MAS over GCRMA, a 1.1% increase in specificity when choosing MAS over none, and a > 10% decrease in FDR when choosing MAS over GCRMA/RMA). Choice of probe set summarization routine had the largest overall impact on results for the BB3 dataset (a 6% increase in sensitivity when choosing MBEI over PLIER, a > 3% increase in specificity when choosing MBEI/PLIER over MAS, and a > 16% decrease in FDR when choosing MBEI/PLIER over MAS). PM correction had a significant impact on sensitivity and FDR in both datasets, and a significant but smaller impact on specificity. Apart from sensitivity in the BB3 dataset, choice of normalization routine did not have a significant effect on results. It is worth noting, however, that in both datasets the raw intensity density plots (data not shown) showed very consistent profiles across arrays, indicating that the adjustment made by any normalization routine would have been minimal in these datasets.
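The per-subroutine ranges discussed here and reported in Table 6.2 can be illustrated with a small Python sketch. It assumes that a range is the difference between the highest and lowest mean of a performance measure across the algorithms of one subroutine; the function names and the four result records are hypothetical and are not values from this study.

```python
from collections import defaultdict
from statistics import mean


def level_means(results, factor, metric):
    """Mean of `metric` for each algorithm (level) of one subroutine."""
    groups = defaultdict(list)
    for record in results:
        groups[record[factor]].append(record[metric])
    return {level: mean(values) for level, values in groups.items()}


def factor_range(results, factor, metric):
    """ASSUMPTION: range = best level mean minus worst level mean."""
    means = level_means(results, factor, metric)
    return max(means.values()) - min(means.values())


# Hypothetical results for four pipelines (not data from the dissertation).
results = [
    {"background": "mas", "pm": "mas", "sensitivity": 60.0},
    {"background": "mas", "pm": "pmonly", "sensitivity": 62.0},
    {"background": "gcrma", "pm": "mas", "sensitivity": 50.0},
    {"background": "gcrma", "pm": "pmonly", "sensitivity": 52.0},
]
print(factor_range(results, "background", "sensitivity"))  # 10.0
print(factor_range(results, "pm", "sensitivity"))          # 2.0
```

In this toy example the background choice moves mean sensitivity by ten points while the PM correction moves it by two, which mirrors how the ranges are used to rank the influence of each subroutine.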
Table 6.2: Range (difference between the best and worst performing routines) and ANOVA results showing the impact of each of the preprocessing subroutines (background correction, normalization, MM correction and summarization) for both datasets (E-TABM-113 and BB3) on the three measures (sensitivity, specificity and FDR) when calling SFPs.

                                    E-TABM-113                              BB3
Subroutine             Df          Sensitivity  Specificity  FDR           Sensitivity  Specificity  FDR
Background Adjustment  3   Range   19.5%        1.1%         11.6%         3.8%         1.5%         8.1%
                           Pr(>F)  < 2e-16      7.6e-12      1.1e-09       5.8e-4       6.7e-12      6.9e-14
Normalization Method   2   Range   1.4%         0.2%         1.3%          2.7%         0.1%         0.3%
                           Pr(>F)  0.53         0.27         0.7           2.8e-3       0.90         0.89
MM Adjustment          1   Range   11.5%        0.4%         8.7%          3.8%         0.3%         2.8%
                           Pr(>F)  < 2e-16      2.5e-06      8.4e-10       6.3e-07      0.04         4.6e-05
Summarization Method   3   Range   6.5%         1.1%         11%           6%           3.1%         17.0%
                           Pr(>F)  1.6e-4       2.1e-15      4.1e-08       1.6e-07      < 2e-16      < 2e-16
6.4.2
Evaluation of Background Routines
We compared the performance characteristics (sensitivity, specificity, FDR) of four background correction algorithms for their ability to detect known SFPs: none, MAS, GCRMA, and RMA. On average, the MAS algorithm outperformed all other algorithms across both datasets and in all performance measurements, except specificity in the BB3 dataset, where no background correction marginally outperformed the MAS background correction (0.1% average difference). The MAS background correction algorithm significantly (p < 0.05) outperformed the GCRMA background correction algorithm (the worst performing) in all three metrics and across both datasets. Further, MAS significantly outperformed the RMA background correction algorithm in all three metrics for the E-TABM-113 dataset and outperformed RMA, but not at a statistically significant level, in the BB3 dataset. On average, no background correction outperformed both the RMA and GCRMA background corrections, with a statistically significant difference for sensitivity in the E-TABM-113 dataset; however, it was also statistically the worst algorithm for specificity in E-TABM-113.
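The three performance measures compared throughout these evaluations can be sketched in Python as follows. The sketch takes per-probe p-values and log fold changes as given (the limma moderated test itself is not reproduced), applies a Benjamini and Hochberg adjustment and the thresholds used in this chapter, and scores the calls against a set of known SFPs. The function names, the use of the absolute log fold change and the toy inputs are assumptions for illustration.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    # Walk from the largest p-value down, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, i in enumerate(order):
        rank = m - steps_from_top  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_sfps(pvalues, logfc, alpha=0.05, min_logfc=0.5):
    """Indices of probes called as SFPs.

    ASSUMPTION: the fold change threshold is applied to abs(logFC).
    """
    adjusted = bh_adjust(pvalues)
    return {i for i, (q, fc) in enumerate(zip(adjusted, logfc))
            if q < alpha and abs(fc) > min_logfc}


def performance(called, known, n_probes):
    """Sensitivity, specificity and FDR of calls against known SFP probes.

    Assumes at least one known SFP probe and one known non-SFP probe.
    """
    tp = len(called & known)
    fp = len(called - known)
    fn = len(known - called)
    tn = n_probes - tp - fp - fn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fdr": fp / (tp + fp) if (tp + fp) else 0.0,
    }


# Hypothetical inputs for five probes; probes 0 and 3 are the known SFPs.
pvalues = [0.001, 0.002, 0.003, 0.5, 0.8]
logfc = [1.2, 0.1, -0.9, 2.0, 0.0]
called = call_sfps(pvalues, logfc)
scores = performance(called, known={0, 3}, n_probes=5)
print(sorted(called))  # [0, 2]
print(scores["sensitivity"], scores["fdr"])  # 0.5 0.5
```

Probe 1 is rejected by the fold change criterion and probe 3 by the adjusted p-value, which shows why both thresholds affect sensitivity and FDR.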
Overall rank (averaged across both datasets):
Sensitivity: MAS (64.7%), none (62.5%), RMA (57.5%), GCRMA (50.3%)
Specificity: MAS (98.7%), RMA (98.4%), none (98.2%), GCRMA (97.9%)
FDR: MAS (11.8%), none (16.7%), RMA (18.0%), GCRMA (21.4%)
6.4.3
Evaluation of Normalization Routines
We compared three normalization algorithms for their ability to detect known SFPs: none, invariant set and quantiles. Results show that the choice of normalization algorithm played no significant role in the detection of SFPs in the E-TABM-113 dataset, and produced only a slight, but significant, increase in sensitivity in the BB3 dataset when using the normalization algorithms none or quantiles rather than invariant set (average increase of 2.5%, p < 0.05).
Overall rank (averaged across both datasets):
Sensitivity: quantiles (60.3%), none (59.4%), invariant set (57.5%)
Specificity: invariant set (98.4%), none (98.3%), quantiles (98.2%)
FDR: none (16.6%), quantiles (16.9%), invariant set (16.9%)
6.4.4
Evaluation of PM Correction Routines
We compared two mismatch probe correction algorithms: PM only and the idealized mismatch routine from MAS. The PM only procedure, used in the preprocessing pipelines PLIER, RMA, GCRMA and dChip (PM only variant), performed significantly (p < 0.05) better than the idealized mismatch approach from the MAS5 algorithm in all performance characteristics and in both datasets, though the effect was less pronounced in the BB3 dataset. This is not surprising, and shows that the trend of recent preprocessing approaches to ignore mismatch probes also improves SFP calls. It is also worth noting that the subtract MM variant of dChip, the original implementation of dChip, performed worse than the PM only variant of dChip in all cases (see supplemental tables A), except for specificity in the BB3 dataset, where the subtract MM routine performed marginally better (+0.2%).
Overall rank (averaged across both datasets):
Sensitivity: PM only (63.0%), idealized mismatch (55.3%)
Specificity: PM only (98.5%), idealized mismatch (98.1%)
FDR: PM only (13.8%), idealized mismatch (19.7%)
6.4.5
Evaluation of Probe Set Summarization Routines
We compared the performance characteristics (sensitivity, specificity, FDR) of four probe set summarization algorithms: median polish from the RMA/GCRMA procedure, PLIER, MBEI and MAS5. More so than for any of the previous three preprocessing subroutines, the performance of the summarization algorithms varied across the two datasets. In the smaller E-TABM-113 dataset, the median polish algorithm performed best for specificity and FDR, statistically outperforming both MBEI and MAS. Median polish ranked second in sensitivity, statistically outperforming only PLIER. The MBEI algorithm provided the best sensitivity in the E-TABM-113 dataset, statistically outperforming both MAS and PLIER. In the BB3 dataset, however, the MBEI algorithm performed best in both sensitivity (statistically better than PLIER and MAS) and FDR (statistically better than median polish and MAS). PLIER marginally outperformed MBEI in specificity, but not at a statistically significant level. In general, MAS performed poorly and median polish performed consistently well across all three measures. Overall, MBEI had the highest sensitivity, PLIER the best specificity and median polish the lowest FDR. That the MAS5 algorithm performed poorly is not unexpected [32]; its biggest flaw is that it does not account for a "probe" effect, assuming that all probes have on average similar hybridization efficiencies. That MBEI performed poorly, with a significantly higher FDR, on the smaller E-TABM-113 dataset is also expected: Li and Wong state in their original papers that at least 10 arrays are needed to properly estimate the model parameters [43, 42].
Overall rank (averaged across both datasets):
Sensitivity: MBEI (62.3%), median polish (60.1%), MAS (58.0%), PLIER (56.0%)
Specificity: PLIER (99.0%), median polish (98.8%), MBEI (98.5%), MAS (96.9%)
FDR: median polish (13.1%), PLIER (13.4%), MBEI (15.4%), MAS (25.2%)
6.4.6
The Absolute Best Pipeline
In the E-TABM-113 dataset, the best pipeline for sensitivity was to perform no background correction and no normalization, and to use PM only followed by median polish. This pipeline also had a good balance of sensitivity (63.4%), specificity (99.3%) and FDR (7.43%). For specificity, the overall best pipeline was MAS background correction, any normalization method and idealized mismatch correction followed by PLIER summarization (99.5%). For FDR, the overall best choice was MAS background correction, no normalization and idealized mismatch correction followed by PLIER summarization (5.75%). Although this particular pipeline provided the absolute best specificity (also 99.5%) and FDR, it also produced a significantly worse sensitivity (55.6%).
In the BB3 dataset, multiple pipelines performed equally well for sensitivity: GCRMA background correction, quantiles normalization and PM only followed by dChip’s MBEI summarization algorithm, or GCRMA background correction, any normalization and PM only followed by median polish, all of which had a sensitivity of 77.8%. While these procedures produced the best sensitivity, they performed poorly in both specificity and FDR. For specificity, the pipelines using RMA background correction, no normalization (or quantiles) and idealized mismatch correction followed by PLIER summarization produced the overall best performance (99.6%). For FDR, the overall best performing pipeline was RMA background correction, no normalization and PM probes only followed by MBEI summarization (5.67%). This pipeline also produced a good overall balance of sensitivity, specificity and FDR (76.6%, 99.3%, 5.67%, respectively); however, the overall best balanced pipeline was RMA background correction, invariant set normalization and PM probes only followed by MBEI summarization (77.4%, 99.3%, 6.09%, respectively).
6.5
Discussion
Previous methods for evaluating preprocessing routines use direct comparisons of estimated expression levels from the microarray to expected expression levels determined with either spike-in experiments or qRT-PCR. These methods for determining expected expression levels are typically based on only a small number of genes and contain their own sources of technical error. Genetic variation between genotypes, however, is a discrete and absolute feature that produces a consistent and detectable difference in hybridization efficiency within a probe on a microarray, called a single feature polymorphism. By statistical comparison of probe signal intensity values, after correcting for mRNA expression values, we can predict candidate regions of genetic variation between samples. Accurate estimation of sample mRNA expression levels is therefore the most important factor in the successful determination of SFPs. Here we report a new method for evaluating preprocessing routines by their ability to detect single feature polymorphisms. We propose that the preprocessing pipeline which most successfully predicts SFPs (high sensitivity and specificity with a correspondingly low FDR) will also be the pipeline which best estimates expression. In general, Affymetrix’s MAS summarization performed poorly compared to the other algorithms, particularly in the larger BB3 dataset. The GCRMA algorithm performed significantly worse in the smaller E-TABM-113 dataset, both relative to the other algorithms and relative to its own performance in the larger BB3 dataset. GCRMA is a model based algorithm that would appear to need a large number of samples to accurately estimate background and non-specific binding. In this experiment, choice of normalization algorithm did not play a significant role in sensitivity, specificity or FDR. However, this is likely due to the high level of consistency in the signal intensity profiles across arrays, which left little for normalization to correct.
Finally, use of MM probes in general, as in the idealized mismatch routine, reduced the sensitivity to detect SFPs.
Results show that the choice of pipeline, and of individual algorithms, had a much larger effect on results for the smaller E-TABM-113 dataset, where choice of background routine significantly impacted sensitivity and FDR and choice of summarization routine significantly impacted FDR. In the larger BB3 dataset, avoiding MAS summarization had the largest impact on results. This is expected, since a larger dataset will usually produce higher confidence in the results from all algorithms and is more robust to the underlying preprocessing choices. These results also show that better results can be achieved by mixing and matching the subroutines from the different established pipelines. Of the five standard pipelines described above, only the GCRMA pipeline gave an overall best performance for sensitivity in BB3, but it had a correspondingly poor result in both specificity and FDR. Of the five established pipelines, the RMA pipeline (RMA background correction, quantiles normalization, PM only, median polish summarization) was found to provide the most consistent and balanced result, producing high sensitivity and specificity with low FDR across both datasets. These results provide a different perspective on preprocessing evaluation: how well a routine performs when used to predict the presence of genetic polymorphisms. Further, since the determination of an SFP depends primarily on the estimated gene expression value, we suggest that improved SFP calls imply better expression estimates. Differential expression results were most strongly (and positively) correlated with the number of SFPs and SFP containing genes. In the smaller E-TABM-113 dataset, the number of differentially expressed genes was highly correlated with sensitivity (r2 = 0.65). Conversely, in the larger BB3 dataset, the number of differentially expressed genes was highly correlated with both specificity and FDR (r2 = −0.72 and r2 = 0.74, respectively). This indicates that in smaller datasets the choice of preprocessing routine will mostly impact the sensitivity of detection of differentially expressed genes, whereas in larger datasets it will impact the specificity and FDR of differentially expressed genes. These results can be used to assist the analyst in deciding the most appropriate method for their specific application.
Chapter 7 DISCUSSION AND FUTURE WORK
As pointed out in the introduction, studies in ecology and conservation would benefit from simultaneous measurement of gene expression and genetic diversity. Typically this is done using two different assays: microarrays for gene expression and sequencing (or genotyping) for genetic diversity. With short-oligonucleotide microarrays there is the potential to assay both gene expression and genetic diversity simultaneously on the same platform. Discovery of candidate genetic polymorphisms (termed single feature polymorphisms) in microarray data has been performed in numerous studies [55, 14, 19, 56, 18, 41, 39, 21, 75]. Discovery of single feature polymorphisms in the presence of gene expression is difficult and requires the algorithm to accurately account for differences in expression levels between samples. Chapter 4 introduces a novel algorithm that effectively accounts for gene expression differences for improved detection of SFPs. Further, genetic polymorphisms can affect the accurate estimation of expression levels, which can in turn affect the reliability of single feature polymorphism detection, producing a chicken and egg problem. Chapter 5 outlines a procedure that addresses this issue. The procedure performs a preliminary prediction of single feature polymorphisms, masks those candidate SFPs from the dataset and re-evaluates the expression estimates. The results show that this procedure allows for a more accurate measurement of both gene expression and genetic diversity on the same platform. Finally, Chapter 6 shows how a database of known variants and prediction of SFPs can be used to evaluate microarray preprocessing routines.
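The predict, mask and re-estimate loop summarized above can be sketched on a single toy probe set. This is not the Chapter 5 implementation: the per-array median stands in for a real summarization routine, a simple threshold on the difference in group means stands in for the limma test, and all names and values are hypothetical.

```python
from statistics import mean, median


def summarize(log_probes, masked=frozenset()):
    """Per-array median of the unmasked log2 probe values (stand-in)."""
    kept = [row for p, row in enumerate(log_probes) if p not in masked]
    n_arrays = len(log_probes[0])
    return [median(row[a] for row in kept) for a in range(n_arrays)]


def candidate_sfps(log_probes, expression, group_a, group_b, min_diff=0.5):
    """Probes whose expression-corrected signal differs between groups.

    ASSUMPTION: a plain threshold on the difference in group means is used
    here in place of the moderated test described in the text.
    """
    hits = set()
    for p, row in enumerate(log_probes):
        resid = [value - expression[a] for a, value in enumerate(row)]
        diff = mean(resid[a] for a in group_a) - mean(resid[a] for a in group_b)
        if abs(diff) > min_diff:
            hits.add(p)
    return hits


def mask_and_reestimate(log_probes, group_a, group_b):
    """Preliminary SFP call, mask the candidates, then re-summarize."""
    first_pass = summarize(log_probes)
    masked = candidate_sfps(log_probes, first_pass, group_a, group_b)
    return first_pass, masked, summarize(log_probes, masked)


# Hypothetical log2 data: arrays 0-1 are genotype A, arrays 2-3 genotype B.
# True expression is one log2 unit higher in B; probe 3 carries an SFP in B.
log_probes = [
    [6, 6, 7, 7],
    [7, 7, 8, 8],
    [8, 8, 9, 9],
    [9, 9, 8, 8],
]
first, masked, second = mask_and_reestimate(log_probes, [0, 1], [2, 3])
print(first)   # [7.5, 7.5, 8.0, 8.0]
print(masked)  # {3}
print(second)  # [7, 7, 8, 8]
```

In the first pass the polymorphic probe shrinks the apparent expression difference to half a log2 unit; after masking it, the full one unit difference is recovered.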
7.1
Summary
The results presented in this dissertation provide a complete approach to dealing with genetic polymorphisms (single feature polymorphisms) in short-oligonucleotide microarray data:
• Chapter 4 presented an improved algorithm for the detection of SFPs. We evaluated the new algorithm using data from two publicly available Affymetrix® Barley (Hordeum vulgare) microarray datasets and compared the new algorithm to two previously published SFP detection algorithms. Algorithmic performance was measured by sensitivity, specificity and false discovery rate, determined using a reference database of known genetic polymorphisms. Results show that our algorithm outperformed both previously published algorithms in all performance characteristics but one in one of the datasets (sensitivity in the BB3 dataset).
• In Chapter 5 we evaluated the impact of genetic polymorphisms on differential expression analysis and SFP detection. In this chapter we quantified the false positive rate and false negative rate of differentially expressed genes with SFP probes compared to when SFP probes have been masked out (considering the masked dataset as "truth"). Results show the false positive rate to be significantly increased by the presence of SFPs for both differential expression analysis and SFP detection (14% and 5%, respectively). Further, we provide a procedure for masking out candidate SFPs prior to differential expression and final SFP prediction. The procedure described does not rely on having a reference database of known genetic polymorphisms and can be applied to any microarray dataset where comparisons are being made between populations.
• In Chapter 6, we explored the use of known genetic polymorphisms as a means to evaluate Affymetrix® microarray preprocessing pipelines. Previous studies show that the choice of microarray preprocessing pipeline has a significant impact on experimental results. Successful detection of SFPs relies most heavily on the accurate estimation of gene expression, and therefore the pipeline that best predicts SFPs implies better expression estimates. We tested five major preprocessing pipelines (RMA, GCRMA, dChip, MAS5 and PLIER) and all possible combinations of their background correction, normalization, mismatch probe correction and summarization routines (96 generated preprocessing pipelines in total) for their ability to detect single feature polymorphisms in two Barley microarray datasets. The important findings were: GCRMA performed poorly when used in a small experiment, MAS5 performed poorly in general, and the RMA preprocessing pipeline had the most consistent overall results for sensitivity, specificity and FDR across both datasets.
Together these results and algorithms provide researchers with new tools and techniques when using Affymetrix® 3’ IVT microarrays. Specifically, this dissertation provides new algorithms for improved SFP detection and SFP correction in gene expression microarray studies and provides a new method to evaluate the performance characteristics of Affymetrix® microarray preprocessing steps. These results provide ecologists with new tools and techniques when using Affymetrix® microarrays in ecological gene expression microarray studies.
7.2
Applications
Predicted single feature polymorphisms can be used as candidate genetic markers. Genetic markers are loci associated with a trait. They can be used to type individuals, such as in a breeding program, or as evidence to help narrow in on a genomic region of importance. Using the same algorithm described in Chapter 4, Coram et al. found SFPs (genetic markers) and gene expression markers (GEMs) between two recombinant inbred wheat lines, one of which contained a stripe rust resistance gene of unknown location [18]. West et al. were able to produce a high-density haplotype map between two lines of Arabidopsis thaliana [70]. Other examples include [15, 38, 60, 28, 53].

In expression quantitative trait loci (eQTL) studies, researchers seek to associate genetic polymorphisms with the expression levels they regulate. The methods described in Chapters 4 and 5 will allow researchers to conduct eQTL studies without the need for a separate genotyping assay. DeCook et al. performed an eQTL study in Arabidopsis thaliana in which single feature polymorphisms provided the data for genetic markers [22]. Alberts et al. further showed that genetic polymorphisms may generate a high number of false cis eQTLs [4]. The procedure described in Chapter 5 would help to curtail this particular issue.

Single feature polymorphisms can also be used to estimate evolutionary parameters. Single feature polymorphisms are a type of genotyping data, and therefore many analyses that require genotyping data can be modified to work with SFPs. For example, Jiang et al. used single feature polymorphism data to estimate the mutation and recombination parameters in an Arabidopsis thaliana population [35].

More accurate estimates of both single feature polymorphisms and expression levels reduce the noise in the above applications and provide a more biologically meaningful result with a reduced false positive rate.

7.3 Future Work
This work presents the utility of single feature polymorphisms in Affymetrix® microarray studies with two populations. However, studies can have more than two populations, or no defined populations (genetically diverse samples). We would like to detect single feature polymorphisms in these studies as well. The approach presented here conducts pair-wise comparisons from a linear model, which will work for studies with more than two populations, but not for those with no defined populations. When a study contains more than two populations, additional processing is needed to assign a genotype to each population and determine which populations share the same genotype. In the case of no defined populations, one potential approach, which would also work in the general case, would be to use a clustering technique such as k-means clustering instead of linear model statistical tests. Significant separation of the resulting clusters would indicate the presence of an SFP. Samples would then be assigned a genotype by cluster membership.

NimbleGen® microarrays are another microarray platform worth exploring in the context of single feature polymorphisms. This platform offers the significant advantage of being cheaper than Affymetrix when designing custom microarrays. NimbleGen offers the ability to design microarrays, in a cost-effective manner, for organisms for which there is no Affymetrix® microarray. In addition, we can generate microarrays specific to an experimental population. Finally, the NimbleGen platform allows for a flexible probe length. Whereas the Affymetrix® platform is fixed at 25mer probes, NimbleGen’s manufacturing processes allow for probes as small as 25mers and as large as 70mers. Probes could then be designed specifically for either SFP detection (short probes) or expression analysis (long probes).
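The clustering idea proposed above for studies without defined populations could be sketched as follows. This is only an illustration under stated assumptions, not an implementation from this dissertation: the input is assumed to be one probe's values across samples (for example, log-intensity residuals after removing the probe set's expression estimate), the function names are hypothetical, and the separation threshold of 4.0 is an arbitrary placeholder that would need to be calibrated on real data.

```python
from statistics import mean


def two_means_1d(values, max_iter=100):
    """Split 1-D values into two clusters with Lloyd's k-means (k = 2)."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("all values are identical; nothing to cluster")
    centers = [low, high]  # start from the extremes so both clusters are non-empty
    labels = []
    for _ in range(max_iter):
        new_labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in values
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = mean(members)
    return labels, centers


def separation(values, labels, centers):
    """Gap between the cluster centers in units of pooled within-cluster SD."""
    within = sum((v - centers[lab]) ** 2 for v, lab in zip(values, labels))
    pooled_sd = (within / len(values)) ** 0.5
    gap = abs(centers[1] - centers[0])
    return float("inf") if pooled_sd == 0 else gap / pooled_sd


def call_sfp(values, threshold=4.0):
    """Return (is_sfp, genotype label per sample, separation score).

    The threshold is a hypothetical cut-off, not a value from the dissertation.
    """
    labels, centers = two_means_1d(values)
    score = separation(values, labels, centers)
    return score >= threshold, labels, score


# Hypothetical probe residuals for eight samples: three samples hybridize poorly.
signal = [0.1, -0.05, 0.0, 0.08, -1.9, -2.1, -2.0, 0.02]
is_sfp, genotypes, score = call_sfp(signal)
```

Because k-means always returns two clusters, the decision rests entirely on the separation score; a formal test of cluster separation would be a natural replacement for the fixed cut-off used here.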
REFERENCES

[1] Affymetrix. GeneChip Data Analysis Fundamentals Manual, 2004.
[2] Affymetrix. GeneChip Expression Analysis Technical Manual, 2004.
[3] Affymetrix. Guide to Probe Logarithmic Intensity Error (PLIER) Estimation. Affymetrix Technical Note, 2005.
[4] Rudi Alberts, Peter Terpstra, Yang Li, Rainer Breitling, Jan-Peter Nap, and Ritsert C Jansen. Sequence polymorphisms cause many false cis eQTLs. PloS one, 2(7):e622, January 2007.
[5] David B Allison, Xiangqin Cui, Grier P Page, and Mahyar Sabripour. Microarray data analysis: from disarray to consolidation and consensus. Nature reviews. Genetics, 7(1):55–65, January 2006.
[6] Jill T. Anderson and Thomas Mitchell-Olds. Ecological genetics and genomics of plant defences: evidence and approaches. Functional Ecology, pages no–no, September 2010.
[7] Leonard H. Augenlicht and Diane Kobrin. Cloning and Screening of Sequences Expressed in a Mouse Colon Tumor. Cancer Res., 42(3):1088–1093, 1982.
[8] Line K Bay, Karin E Ulstrup, H Bjørn Nielsen, Hanne Jarmer, Nicolas Goffard, Bette L Willis, David J Miller, and Madeleine J H Van Oppen. Microarray analysis reveals transcriptional plasticity in the reef building coral Acropora millepora. Molecular ecology, 18(14):3062–75, July 2009.
[9] Y Benjamini and Y Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B, 57:289–300, 1995.
[10] David Benovoy, Tony Kwan, and Jacek Majewski. Effect of polymorphisms within probe-target sequences on oligonucleotide microarray experiments. Nucleic Acids Research, 36(13):4417–4423, 2008.
[11] Amy N Bernardo, Peter J Bradbury, Hongxiang Ma, Shengwa Hu, Robert L Bowden, Edward S Buckler, and Guihua Bai. Discovery and mapping of single feature polymorphisms in wheat using Affymetrix arrays. BMC Genomics, 10:251, 2009.
[12] B M Bolstad, R A Irizarry, M Astrand, and T P Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193, 2003.
[13] Ben Bolstad. Probe level quantile normalization of high density oligonucleotide arrays. Unpublished manuscript, 2001.
[14] Justin O Borevitz, David Liang, David Plouffe, Hur-Song Chang, Tong Zhu, Detlef Weigel, Charles C Berry, Elizabeth Winzeler, and Joanne Chory. Large-scale identification of single-feature polymorphisms in complex genomes. Genome Research, 13(3):513–523, March 2003.
[15] Liam H Childs, Hanna Witucka-Wall, Torsten Günther, Ronan Sulpice, Maria V Korff, Mark Stitt, Dirk Walther, Karl J Schmid, and Thomas Altmann. Single feature polymorphism (SFP)-based selective sweep identification and association mapping of growth-related metabolic traits in Arabidopsis thaliana. BMC Genomics, 11(1):188, 2010.
[16] Timothy J Close, Steve I Wanamaker, Rico A Caldo, Stacy M Turner, Daniel A Ashlock, Julie A Dickerson, Rod A Wing, Gary J Muehlbauer, Andris Kleinhofs, and Roger P Wise. A New Resource for Cereal Genomics: 22K Barley GeneChip Comes of Age. Plant Physiology, 134(3):960–968, 2004.
[17] Tristan E Coram, Matthew L Settles, and Xianming Chen. Large-scale analysis of antisense transcription in wheat using the Affymetrix GeneChip Wheat Genome Array. BMC Genomics, 10:253, 2009.
[18] Tristan E Coram, Matthew L Settles, Meinan Wang, and Xianming Chen. Surveying expression level polymorphism and single-feature polymorphism in near-isogenic wheat lines differing for the Yr5 stripe rust resistance locus. TAG. Theoretical and applied genetics. Theoretische und angewandte Genetik, 117(3):401–411, August 2008.
[19] Xinping Cui, Jin Xu, Rehana Asghar, Pascal Condamine, Jan T Svensson, Steve Wanamaker, Nils Stein, Mikeal Roose, and Timothy J Close. Detecting single-feature polymorphisms using oligonucleotide arrays and robustified projection pursuit. Bioinformatics, 21(20):3852–3858, 2005.
[20] Xinping Cui, Na You, Thomas Girke, Richard Michelmore, and Allen Van Deynze. Single feature polymorphism detection using recombinant inbred line microarray expression data. Bioinformatics, 26(16):1983–1989, 2010.
[21] Sayan Das, Prasanna R Bhat, Chinta Sudhakar, Jeffrey D Ehlers, Steve Wanamaker, Philip A Roberts, Xinping Cui, and Timothy J Close. Detection and validation of single feature polymorphisms in cowpea (Vigna unguiculata L. Walp) using a soybean genome array. BMC Genomics, 9(1):107, 2008.
[22] Rhonda DeCook, Sonia Lall, Dan Nettleton, and Stephen H Howell. Genetic regulation of gene expression during shoot development in Arabidopsis. Genetics, 172(2):1155–1164, February 2006.
[23] Patrik D’haeseleer, Shoudan Liang, and Roland Somogyi. Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics, 16(8):707–726, 2000.
[24] Arnis Druka, Gary Muehlbauer, Ilze Druka, Rico Caldo, Ute Baumann, Nils Rostoks, Andreas Schreiber, Roger Wise, Timothy Close, Andris Kleinhofs, Andreas Graner, Alan Schulman, Peter Langridge, Kazuhiro Sato, Patrick Hayes, Jim McNicol, David Marshall, and Robbie Waugh. An atlas of gene expression from seed to seed through barley development. Functional & integrative genomics, 6(3):202–11, July 2006.
[25] Laurent Gautier, Leslie Cope, Benjamin M Bolstad, and Rafael A Irizarry. affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20(3):307–315, 2004.
[26] Robert C Gentleman, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, Kurt Hornik, Torsten Hothorn, Wolfgang Huber, Stefano Iacus, Rafael Irizarry, Friedrich Leisch, Cheng Li, Martin Maechler, Anthony J Rossini, Gunther Sawitzki, Colin Smith, Gordon Smyth, Luke Tierney, Jean YH Yang, and Jianhua Zhang. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, 5(10):R80, 2004.
[27] Greg Gibson. Microarrays in ecology and evolution: a preview. Molecular ecology, 11(1):17–24, January 2002.
[28] P K Gupta, S Rustgi, and R R Mir. Array-based high-throughput DNA markers for crop improvement. Heredity, 101(1):5–18, July 2008.
[29] Balazs Gyorffy, Bela Molnar, Hermann Lage, Zoltan Szallasi, and Aron C Eklund. Evaluation of microarray preprocessing algorithms based on concordance with RT-PCR in clinical samples. PloS one, 4(5):e5645, January 2009.
[30] Youko Horiuchi, Yoshiaki Harushima, Hironori Fujisawa, Takako Mochizuki, Masanori Kawakita, Takayuki Sakaguchi, and Nori Kurata. A simple optimization can improve the performance of single feature polymorphism detection by Affymetrix expression arrays. BMC Genomics, 11:315, 2010.
[31] R A Irizarry, B Hobbs, F Collin, Y D Beazer-Barclay, K J Antonellis, U Scherf, and T P Speed. Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics, 4(2):249–264, 2003.
[32] Rafael A Irizarry, Benjamin M Bolstad, Francois Collin, Leslie M Cope, Bridget Hobbs, and Terence P Speed. Summaries of Affymetrix GeneChip probe level data. Nucl. Acids Res., 31(4):e15–, 2003.
[33] Rafael A Irizarry, Zhijin Wu, and Harris A Jaffee. Comparison of Affymetrix GeneChip expression measures. Bioinformatics, 22(7):789–794, 2006.
[34] Hongying Jiang, Ming Yi, Jianbing Mu, Louie Zhang, Al Ivens, Leszek J Klimczak, Yentram Huyen, Robert M Stephens, and Xin-Zhuan Su. Detection of genome-wide polymorphisms in the AT-rich Plasmodium falciparum genome using a high-density microarray. BMC Genomics, 9:398, 2008.
[35] Rong Jiang, Paul Marjoram, Justin O Borevitz, and Simon Tavaré. Inferring Population Parameters From Single-Feature Polymorphism Data. Genetics, 173(4):2257–2267, 2006.
[36] Jan E Kammenga, Michael A Herman, N Joop Ouborg, Loretta Johnson, and Rainer Breitling. Microarray challenges in ecology. Trends in ecology & evolution (Personal edition), 22(5):273–9, May 2007.
[37] Christina Kendziorski and Ping Wang. A review of statistical methods for expression quantitative trait loci mapping. Mammalian genome : official journal of the International Mammalian Genome Society, 17(6):509–17, June 2006.
[38] Sung Kim, Keyan Zhao, Rong Jiang, John Molitor, Justin O Borevitz, Magnus Nordborg, and Paul Marjoram. Association Mapping With Single-Feature Polymorphisms. Genetics, 173(2):1125–1133, 2006.
[39] Sung-Hyun Kim, Prasanna R Bhat, Xinping Cui, Harkamal Walia, Jin Xu, Steve Wanamaker, Abdelbagi M Ismail, Clyde Wilson, and Timothy J Close. Detection and validation of single feature polymorphisms using RNA expression data from a rice genome array. BMC Plant Biology, 9(1):65, 2009.
[40] D. A. Kulesh. Identification of Interferon-Modulated Proliferation-Related cDNA Sequences. Proceedings of the National Academy of Sciences, 84(23):8453–8457, December 1987.
[41] Rajesh Kumar, Jing Qiu, Trupti Joshi, Babu Valliyodan, Dong Xu, and Henry T Nguyen. Single Feature Polymorphism Discovery in Rice. PLoS ONE, 2(3):9, 2007.
[42] C Li and W H Wong. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology, 2:1–11, 2001.
[43] Cheng Li and Wing Hung Wong. Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. PNAS, 98(1):31–36, 2001.
[44] R Lipshutz, S Fodor, T Gingeras, and D Lockhart. High Density Synthetic Oligonucleotide Arrays. Nature Genetics, pages 20–24, 1999.
[45] Z W Luo, E Potokina, A Druka, R Wise, R Waugh, and M J Kearsey. SFP Genotyping From Affymetrix Arrays Is Robust But Largely Detects Cis-acting Expression Regulators. Genetics, 176(2):789–800, 2007.
[46] MAQC Consortium. MAQC-II: analyze that! Nature biotechnology, 28(8):761, August 2010.
[47] Robert McGill, John W Tukey, and Wayne A Larsen. Variations of Box Plots. American Statistician, 32(1):12–16, 1978.
[48] Jakub Mieczkowski, Magdalena E Tyburczy, Michal Dabrowski, and Piotr Pokarowski. Probe set filtering increases correlation between Affymetrix GeneChip and qRT-PCR expression measurements. BMC Bioinformatics, 11(1):104, 2010.
[49] Jeanine S Morey, James C Ryan, and Frances M Van Dolah. Microarray validation: factors influencing correlation between oligonucleotide microarrays and real-time PCR. Biological procedures online, 8(1):175–193, 2006.
[50] Frederick Mosteller and John W Tukey. Data Analysis and Regression: A Second Course in Statistics, volume 1 of Addison-Wesley Series in Behavioral Science. Addison-Wesley, Reading Mass, 1977.
[51] Marjorie F Oleksiak, Gary A Churchill, and Douglas L Crawford. Variation in gene expression within and among natural populations. Nature Genetics, 32(2):261–266, 2002.
[52] Craig Parman and Conrad Halling. affyQCReport : A Package to Generate QC Reports for Affymetrix Array Data. October, 2009.
[53] Elena Potokina, Arnis Druka, and Michael J Kearsey. Transcript profiling and expression level mapping. Methods In Molecular Biology Clifton Nj, 513:81–92, 2009.
[54] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2010.
[55] James Ronald, Joshua M Akey, Jacqueline Whittle, Erin N Smith, Gael Yvert, and Leonid Kruglyak. Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Research, 15(2):284–291, 2005.
[56] Nils Rostoks, Justin O Borevitz, Peter E Hedley, Joanne Russell, Sharon Mudie, Jenny Morris, Linda Cardle, David F Marshall, and Robbie Waugh. Single-feature polymorphism discovery in the barley transcriptome. Genome Biology, 6(6):R54, 2005.
[57] Alex Sanchez-Pla, Miquel Salicru, and Jordi Ocanya. Distance based Inference for Gene-Ontology Analysis of Microarray Experiments. Networks, (July 2006), 2006.
[58] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown. Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray. Science, 270(5235):467–470, October 1995.
[59] Leming Shi, Laura H Reid, Wendell D Jones, Richard Shippy, Janet A Warrington, Shawn C Baker, Patrick J Collins, Francoise De Longueville, Ernest S Kawasaki, Kathleen Y Lee, Yuling Luo, Yongming Andrew Sun, James C Willey, Robert A Setterquist, Gavin M Fischer, Weida Tong, Yvonne P Dragan, David J Dix, Felix W Frueh, Frederico M Goodsaid, Damir Herman, Roderick V Jensen, Charles D Johnson, Edward K Lobenhofer, Raj K Puri, Uwe Scherf, Jean Thierry-Mieg, Charles Wang, Mike Wilson, Paul K Wolber, Lu Zhang, Shashi Amur, Wenjun Bao, Catalin C Barbacioru, Anne Bergstrom Lucas, Vincent Bertholet, Cecilie Boysen, Bud Bromley, Donna Brown, Alan Brunner, Roger Canales, Xiaoxi Megan Cao, Thomas A Cebula, James J Chen, Jing Cheng, Tzu-Ming Chu, Eugene Chudin, John Corson, J Christopher Corton, Lisa J Croner, Christopher Davies, Timothy S Davison, Glenda Delenstarr, Xutao Deng, David Dorris, Aron C Eklund, Xiao-Hui Fan, Hong Fang, Stephanie Fulmer-Smentek, James C Fuscoe, Kathryn Gallagher, Weigong Ge, Lei Guo, Xu Guo, Janet Hager, Paul K Haje, Jing Han, Tao Han, Heather C Harbottle, Stephen C Harris, Eli Hatchwell, Craig A Hauser, Susan Hester, Huixiao Hong, Patrick Hurban, Scott A Jackson, Hanlee Ji, Charles R Knight, Winston P Kuo, J Eugene LeClerc, Shawn Levy, Quan-Zhen Li, Chunmei Liu, Ying Liu, Michael J Lombardi, Yunqing Ma, Scott R Magnuson, Botoul Maqsodi, Tim McDaniel, Nan Mei, Ola Myklebost, Baitang Ning, Natalia Novoradovskaya, Michael S Orr, Terry W Osborn, Adam Papallo, Tucker A Patterson, Roger G Perkins, Elizabeth H Peters, Ron Peterson, Kenneth L Philips, P Scott Pine, Lajos Pusztai, Feng Qian, Hongzu Ren, Mitch Rosen, Barry A Rosenzweig, Raymond R Samaha, Mark Schena, Gary P Schroth, Svetlana Shchegrova, Dave D Smith, Frank Staedtler, Zhenqiang Su, Hongmei Sun, Zoltan Szallasi, Zivana Tezak, Danielle Thierry-Mieg, Karol L Thompson, Irina Tikhonova, Yaron Turpaz, Beena Vallanat, Christophe Van, Stephen J Walker, Sue Jane Wang, Yonghong Wang, Russ Wolfinger, Alex Wong, Jie Wu, Chunlin Xiao, Qian Xie, Jun Xu, Wen Yang, Liang Zhang, Sheng Zhong, Yaping Zong, and William Slikker. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology, 24(9):1151–1161, 2006.
[60] Sung-Chur Sim, Matthew D Robbins, Charles Chilcott, Tong Zhu, and David M Francis. Oligonucleotide array discovery of polymorphisms in cultivated tomato (Solanum lycopersicum L.) reveals patterns of SNP variation associated with breeding. BMC Genomics, 10:466, 2009.
[61] Gordon K Smyth. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3(1):Article3, 2004.
[62] Gordon K Smyth. Limma: linear models for microarray data. In R Gentleman, V Carey, S Dudoit, R Irizarry, and W Huber, editors, Bioinformatics and Computational Biology Solutions using R and Bioconductor, number October, pages 397–420. Springer, 2005.
[63] A Subramanian, P Tamayo, and V Mootha. Gene set enrichment analysis: a knowledge-based approach for interpreting, 2005.
[64] J W Tukey. Exploratory Data Analysis. Addison-Wesley, 1977.
[65] Virginia Goss Tusher, Robert Tibshirani, and Gilbert Chu. Significance analysis of microarrays applied to the ionizing radiation response. PNAS, 98(9):5116–5121, 2001.
[66] Harkamal Walia, Clyde Wilson, Pascal Condamine, Abdelbagi M Ismail, Jin Xu, Xinping Cui, and Timothy J Close. Array-based genotyping and expression analysis of barley cv. Maythorpe and Golden Promise. BMC Genomics, 8:87, 2007.
[67] Nicole A R Walter, Shannon K McWeeney, Sandra T Peters, John K Belknap, Robert Hitzemann, and Kari J Buck. SNPs matter: impact on detection of differential expression, 2007.
[68] Minghui Wang, Xiaohua Hu, Gang Li, Lindsey J Leach, Elena Potokina, Arnis Druka, Robbie Waugh, Michael J Kearsey, and Zewei Luo. Robust Detection and Genotyping of Single Feature Polymorphisms from Gene Expression Data. PLoS Computational Biology, 5(3):10, 2009.
[69] Thomas Werner. Bioinformatics applications for pathway analysis of microarray data. Current Opinion in Biotechnology, 19(1):50–54, 2008.
[70] Marilyn A L West, Hans van Leeuwen, Alexander Kozik, Daniel J Kliebenstein, R W Doerge, Dina A St Clair, and Richard W Michelmore. High-density haplotyping with microarray-based expression and single feature polymorphism markers in Arabidopsis. Genome Research, 16(6):787–795, June 2006.
[71] F Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
[72] Elizabeth A Winzeler, Cristian I Castillo-Davis, Guy Oshiro, David Liang, Daniel R Richards, Yingyao Zhou, and Daniel L Hartl. Genetic diversity in yeast assessed with whole-genome oligonucleotide arrays. Genetics, 163(1):79–89, 2003.
[73] Zhijin Wu, Rafael A Irizarry, Robert Gentleman, Francisco Martinez-Murillo, and Forrest Spencer. A Model-Based Background Adjustment for Oligonucleotide Expression Arrays. Journal of the American Statistical Association, 99:909–917, December 2004.
[74] Wayne Wenzhong Xu, Seungho Cho, S Samuel Yang, Yung-Tsi Bolon, Hatice Bilgic, Haiyan Jia, Yanwen Xiong, and Gary J Muehlbauer. Single-feature polymorphism discovery by computing probe affinity shape powers. BMC Genet, 10:48, 2009.
[75] S S Yang, W W Xu, M Tesfaye, J F S Lamb, Hans-Joachim G Jung, D A Samac, C P Vance, and J W Gronwald. Single-Feature Polymorphism Discovery in the Transcriptome of Tetraploid Alfalfa. The Plant Genome, 2(3):224–232, 2009.
[76] Bin Zhang and Steve Horvath. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol, 4:Article17, 2005.
[77] Yongqing Zhang, Antonio Ferreira, Cheng Cheng, Yongchun Wu, and Jiong Zhang. Modeling oligonucleotide microarray signals. Applied Bioinformatics, 5(3):151–160, 2006.
Appendix A
EXPERIMENTAL RESULTS

This section contains tables of the complete results of the preprocessing experiments performed in Chapter 6.
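Each row of the tables in this appendix is one combination of a background correction, a normalization, a mismatch (MM) correction, and a summarization routine. As a reading aid, the sketch below enumerates such a grid. The option names are copied from the table columns; treating the 96 generated pipelines as the full Cartesian product of these options, with the dChip subtract-MM variant as one extra row, is an inference from the tables rather than a statement of how the pipelines were actually generated, and the function name is hypothetical.

```python
from itertools import product

# Option names as they appear in the table columns of this appendix.
BACKGROUND = ("mas", "none", "rma", "gcrma")
NORMALIZATION = ("invariantset", "none", "quantiles")
MM_CORRECTION = ("mas", "pmonly")
SUMMARIZATION = ("liwong", "mas", "medianpolish", "plier")


def pipeline_grid():
    """Return every combination of the four preprocessing steps."""
    return [
        {
            "background": b,
            "normalization": n,
            "mm_correction": m,
            "summarization": s,
        }
        for b, n, m, s in product(
            BACKGROUND, NORMALIZATION, MM_CORRECTION, SUMMARIZATION
        )
    ]


# The dChip subtract-MM variant appears in the tables as one additional row.
DCHIP_SUBTRACT_MM = {
    "background": "none",
    "normalization": "invariantset",
    "mm_correction": "subtractmm",
    "summarization": "liwong",
}

grid = pipeline_grid()
all_rows = grid + [DCHIP_SUBTRACT_MM]
print(len(grid), len(all_rows))
```

Under these assumptions the grid has 4 × 3 × 2 × 4 = 96 combinations, and adding the subtract-MM variant gives the 97 routines reported for the E-TABM-113 dataset.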
Table A.1: Sensitivity, specificity, and false discovery rate (FDR) as compared to the sequence confirmation dataset. Also shown are the number of SFPs called for each genotype (GP SFP and MX SFP), the number of genes containing at least one SFP for each genotype (GP Genes and MX Genes), and the number of differentially expressed genes (MX Exp and GP Exp) for each of the 89 preprocessing routines performed on the BB3 dataset. The standard pipelines dChip (both PM only and subtract MM variant), MAS5, GCRMA, RMA, and PLIER are shown in bold.

Dataset  Background  Normalization  MM Correction  Summarization  Sensitivity  Specificity  FDR    GP SFP  MX SFP  GP Genes  MX Genes  MX Exp  GP Exp
BB3      mas         invariantset   mas            liwong         70.3%        99.5%        7.18%  1498    1670    777       955       136     245
BB3      none        invariantset   mas            liwong         65.7%        99.5%        7.65%  1426    1538    692       851       137     246
BB3      rma         invariantset   mas            liwong         72.4%        99.3%        8.47%  1747    1713    817       927       157     432
BB3      gcrma       none           mas            liwong         75.3%        98%          13.9%  1979    2070    974       1098      243     747
BB3      mas         none           mas            liwong         75.3%        99.4%        6.25%  1688    1687    856       977       121     317
BB3      none        none           mas            liwong         71.1%        99.4%        6.59%  1637    1575    795       900       120     316
BB3      rma         none           mas            liwong         72.8%        99.1%        8.42%  1826    1701    863       948       117     497
BB3      gcrma       quantiles      mas            liwong         75.3%        97.9%        14.7%  1927    2128    971       1115      320     590
BB3      mas         quantiles      mas            liwong         73.6%        99.4%        6.38%  1582    1778    821       1013      146     237
BB3      none        quantiles      mas            liwong         70.3%        99.4%        6.67%  1539    1686    765       941       146     235
BB3      rma         quantiles      mas            liwong         74.1%        99.3%        7.81%  1779    1782    847       977       150     401
BB3      mas         invariantset   pmonly         liwong         74.5%        99.2%        6.32%  1338    1633    757       967       106     138
BB3      none        invariantset   pmonly         liwong         69.5%        99.4%        5.68%  1140    1451    662       877       85      114
BB3      rma         invariantset   pmonly         liwong         77.4%        99.3%        6.09%  1638    1915    884       1074      160     190
BB3      gcrma       none           pmonly         liwong         77%          98.1%        12.8%  1862    2163    1005      1207      224     550
BB3      mas         none           pmonly         liwong         75.7%        99.2%        6.22%  1510    1713    833       1002      94      197
BB3      none        none           pmonly         liwong         72.4%        99.2%        6.49%  1361    1577    768       942       85      166
BB3      rma         none           pmonly         liwong         76.6%        99.3%        5.67%  1552    1881    841       1088      92      250
BB3      gcrma       quantiles      pmonly         liwong         77.8%        98.1%        12.7%  1844    2239    1009      1225      396     344
BB3      mas         quantiles      pmonly         liwong         75.7%        99.2%        6.22%  1443    1775    809       1029      117     145
BB3      none        quantiles      pmonly         liwong         73.2%        99.2%        6.42%  1317    1610    753       952       103     128
BB3      rma         quantiles      pmonly         liwong         76.2%        99.3%        5.7%   1541    1883    861       1082      137     166
BB3      none        invariantset   subtractmm     liwong         61%          99.5%        7.89%  1738    1818    821       998       222     406
BB3      mas         invariantset   mas            mas            69.5%        96%          25.2%  2026    1851    925       1024      189     354
Continued on next page
Table A.1 – continued from previous page

Dataset  Background  Normalization  MM Correction  Summarization  Sensitivity  Specificity  FDR    GP SFP  MX SFP  GP Genes  MX Genes  MX Exp  GP Exp
BB3      none        invariantset   mas            mas            64.9%        96.3%        25.8%  1995    1706    821       884       188     356
BB3      rma         invariantset   mas            mas            63.6%        96.9%        24.4%  2234    1437    798       681       166     498
BB3      gcrma       none           mas            mas            61.9%        92.4%        41.7%  3103    2225    1089      957       356     1240
BB3      mas         none           mas            mas            72.8%        95.8%        25%    2240    1932    1015      1071      181     459
BB3      none        none           mas            mas            69%          95.6%        26.7%  2248    1793    955       961       180     464
BB3      rma         none           mas            mas            65.7%        97.1%        23%    2264    1429    830       706       142     544
BB3      gcrma       quantiles      mas            mas            63.2%        92.2%        41.5%  3100    2259    1093      969       403     1013
BB3      mas         quantiles      mas            mas            72%          96.4%        22.9%  2122    2016    971       1099      207     375
BB3      none        quantiles      mas            mas            69.5%        96%          25.2%  2167    1910    929       1004      208     379
BB3      rma         quantiles      mas            mas            66.1%        97.2%        22.5%  2281    1472    826       729       163     486
BB3      mas         invariantset   pmonly         mas            72%          97%          19.2%  1600    1798    837       1024      149     205
BB3      none        invariantset   pmonly         mas            68.2%        97.3%        18.9%  1330    1502    720       910       109     167
BB3      rma         invariantset   pmonly         mas            74.5%        96%          24.3%  2266    2246    1050      1177      241     344
BB3      gcrma       none           pmonly         mas            76.2%        94%          31.6%  2737    2685    1268      1361      346     934
BB3      mas         none           pmonly         mas            72.4%        96.8%        21%    1820    1949    933       1094      139     314
BB3      none        none           pmonly         mas            69.5%        96.8%        21%    1615    1742    853       1012      123     276
BB3      rma         none           pmonly         mas            74.5%        96.8%        19.8%  2106    2106    1026      1155      144     464
BB3      gcrma       quantiles      pmonly         mas            75.3%        94.1%        32.1%  2685    2802    1254      1400      580     601
BB3      mas         quantiles      pmonly         mas            72.8%        96.8%        20.5%  1790    1968    912       1101      182     242
BB3      none        quantiles      pmonly         mas            70.3%        96.8%        20.4%  1597    1774    837       1014      148     214
BB3      rma         quantiles      pmonly         mas            74.5%        96.7%        20.5%  2042    2162    1000      1176      217     284
BB3      mas         invariantset   mas            medianpolish   70.7%        98.7%        11.5%  1335    1539    794       965       137     246
BB3      none        invariantset   mas            medianpolish   66.9%        98.6%        12.6%  1260    1404    703       848       136     238
BB3      rma         invariantset   mas            medianpolish   66.1%        98.4%        16.8%  1531    1292    728       703       146     404
BB3      gcrma       none           mas            medianpolish   65.3%        96.7%        25%    2129    1860    1019      1016      281     1030
BB3      mas         none           mas            medianpolish   72.8%        98.7%        10.8%  1467    1607    863       1002      131     314
BB3      none        none           mas            medianpolish   69.5%        98.9%        10.3%  1384    1489    795       915       133     317
BB3      rma         none           mas            medianpolish   67.4%        98.5%        15.3%  1554    1314    757       724       121     459
BB3      gcrma       quantiles      mas            medianpolish   64.9%        96.5%        25.8%  2135    1873    1032      1029      343     833
BB3      mas         quantiles      mas            medianpolish   72.8%        98.8%        10.3%  1416    1656    842       1022      147     257
BB3      none        quantiles      mas            medianpolish   69.9%        98.8%        10.7%  1339    1548    770       941       149     255
BB3      rma         quantiles      mas            medianpolish   67.8%        98.4%        15.6%  1565    1336    757       739       144     385
BB3      mas         invariantset   pmonly         medianpolish   72.4%        98.9%        8.95%  1204    1532    752       976       108     132
BB3      none        invariantset   pmonly         medianpolish   68.6%        99.2%        6.29%  1016    1337    656       870       69      99
Continued on next page
Table A.1 – continued from previous page

Dataset  Background  Normalization  MM Correction  Summarization  Sensitivity  Specificity  FDR    GP SFP  MX SFP  GP Genes  MX Genes  MX Exp  GP Exp
BB3      rma         invariantset   pmonly         medianpolish   76.6%        98.9%        8.5%   1459    1785    879       1107      166     209
BB3      gcrma       none           pmonly         medianpolish   77.8%        98.3%        13.5%  1774    2050    1059      1261      258     700
BB3      mas         none           pmonly         medianpolish   74.5%        98.9%        8.72%  1339    1635    811       1031      108     205
BB3      none        none           pmonly         medianpolish   71.5%        98.8%        9.04%  1202    1505    747       959       83      161
BB3      rma         none           pmonly         medianpolish   76.6%        99%          8.04%  1421    1774    855       1103      108     286
BB3      gcrma       quantiles      pmonly         medianpolish   77.8%        98.4%        13.1%  1816    2067    1098      1260      392     429
BB3      mas         quantiles      pmonly         medianpolish   74.5%        98.9%        8.25%  1313    1648    791       1034      129     156
BB3      none        quantiles      pmonly         medianpolish   71.5%        98.9%        8.56%  1190    1513    746       963       105     126
BB3      rma         quantiles      pmonly         medianpolish   76.2%        99%          7.61%  1421    1782    854       1098      151     184
BB3      mas         invariantset   mas            plier          70.7%        99.1%        7.65%  1192    1514    730       933       127     214
BB3      none        invariantset   mas            plier          67.4%        99.5%        7.47%  1140    1398    648       827       128     215
BB3      rma         invariantset   mas            plier          61.1%        99.4%        8.75%  1023    998     518       579       97      211
BB3      gcrma       none           mas            plier          63.2%        99.2%        10.7%  1159    1309    667       787       131     393
BB3      mas         none           mas            plier          73.6%        99.2%        6.38%  1284    1544    778       964       114     279
BB3      none        none           mas            plier          69.9%        99.3%        7.22%  1239    1454    731       883       115     277
BB3      rma         none           mas            plier          63.2%        99.6%        6.79%  1065    985     549       595       81      204
BB3      gcrma       quantiles      mas            plier          61.5%        99.4%        9.82%  1192    1321    681       795       161     310
BB3      mas         quantiles      mas            plier          73.2%        99.1%        7.89%  1253    1589    762       986       137     227
BB3      none        quantiles      mas            plier          69.9%        99.4%        6.7%   1197    1506    709       914       134     227
BB3      rma         quantiles      mas            plier          64.9%        99.6%        8.28%  1087    1044    553       623       100     207
BB3      mas         invariantset   pmonly         plier          72%          99.1%        9.47%  1238    1548    719       932       163     207
BB3      none        invariantset   pmonly         plier          65.3%        99.5%        8.24%  1175    1505    620       839       167     203
BB3      rma         invariantset   pmonly         plier          57.3%        99.4%        9.87%  1272    1705    548       714       310     341
BB3      gcrma       none           pmonly         plier          66.5%        99.4%        8.09%  1319    1421    743       839       157     489
BB3      mas         none           pmonly         plier          72.8%        99.1%        9.38%  1380    1537    802       948       129     334
BB3      none        none           pmonly         plier          70.3%        99.2%        9.19%  1348    1474    745       861       131     332
BB3      rma         none           pmonly         plier          61.9%        99.5%        8.64%  1168    1159    605       663       86      269
BB3      gcrma       quantiles      pmonly         plier          66.5%        99.3%        8.62%  1266    1643    718       927       334     288
BB3      mas         quantiles      pmonly         plier          73.6%        99.1%        9.28%  1284    1618    760       986       175     215
BB3      none        quantiles      pmonly         plier          72%          99.3%        8.51%  1233    1603    689       921       177     217
BB3      rma         quantiles      pmonly         plier          70.3%        99.3%        8.7%   1382    1770    720       948       250     304
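The Sensitivity, Specificity, and FDR columns compare the SFP calls of each pipeline against the sequence confirmation dataset. A minimal sketch of how such rates can be computed is given here, assuming the standard confusion-matrix definitions; the exact definitions used for these tables are those given in the earlier chapters, and the function name, probe identifiers, and example sets below are hypothetical.

```python
def detection_rates(called, known_sfp, known_non_sfp):
    """Compare called SFP probes against a reference of confirmed probes.

    called        -- set of probe IDs called as SFPs by a pipeline
    known_sfp     -- probes confirmed by sequence to contain a polymorphism
    known_non_sfp -- probes confirmed by sequence to be free of polymorphisms
    Probes absent from both reference sets are ignored.
    """
    tp = len(called & known_sfp)        # true positives
    fn = len(known_sfp - called)        # false negatives
    fp = len(called & known_non_sfp)    # false positives
    tn = len(known_non_sfp - called)    # true negatives

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # confirmed SFPs that were called
        "specificity": ratio(tn, tn + fp),  # confirmed non-SFPs left uncalled
        "fdr": ratio(fp, tp + fp),          # called probes that are false
    }


# Hypothetical probe identifiers, for illustration only.
called = {"p1", "p2", "p3", "p9"}
known_sfp = {"p1", "p2", "p4"}
known_non_sfp = {"p3", "p5", "p6", "p7"}
rates = detection_rates(called, known_sfp, known_non_sfp)
```

In this toy example two of the three confirmed SFPs are called and one of the three evaluable calls is false, so the sensitivity is 2/3, the specificity is 3/4, and the FDR is 1/3.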
Table A.2: Sensitivity, specificity, and false discovery rate (FDR) as compared to the sequence confirmation dataset. Also shown are the number of SFPs called for each genotype (GP SFP and MX SFP), the number of genes containing at least one SFP for each genotype (GP Genes and MX Genes), and the number of differentially expressed genes (MX Exp and GP Exp) for each of the 97 preprocessing routines performed on the E-TABM-113 dataset. The standard pipelines dChip (both PM only and subtract MM variant), MAS5, GCRMA, RMA, and PLIER are shown in bold.

Dataset     Background  Normalization  MM Correction  Summarization  Sensitivity  Specificity  FDR    GP SFP  MX SFP  GP Genes  MX Genes  MX Exp  GP Exp
E-TABM-113  gcrma       invariantset   mas            liwong         37.6%        97.3%        32.7%  1617    1585    911       886       432     225
E-TABM-113  mas         invariantset   mas            liwong         57.6%        98.2%        19%    1754    1997    1033      1222      768     521
E-TABM-113  none        invariantset   mas            liwong         51.9%        96%          35.2%  2717    3191    1210      1441      765     513
E-TABM-113  rma         invariantset   mas            liwong         43.7%        97.6%        29.5%  1450    2000    734       1021      406     195
E-TABM-113  gcrma       none           mas            liwong         36.6%        97.6%        30.3%  1232    1657    686       957       529     183
E-TABM-113  mas         none           mas            liwong         53.2%        98.6%        16.5%  1351    2155    818       1337      1235    373
E-TABM-113  none        none           mas            liwong         53.2%        97.1%        28.6%  1929    4041    866       1835      1247    380
E-TABM-113  rma         none           mas            liwong         41%          97.8%        28.8%  1311    2317    659       1140      409     147
E-TABM-113  gcrma       quantiles      mas            liwong         35.9%        98%          28.4%  1316    1653    712       930       394     223
E-TABM-113  mas         quantiles      mas            liwong         57.6%        98.3%        18.3%  1663    1951    1001      1221      784     488
E-TABM-113  none        quantiles      mas            liwong         53.9%        96.7%        30.9%  2572    3111    1181      1443      797     490
E-TABM-113  rma         quantiles      mas            liwong         44.1%        97.6%        29%    1498    2092    753       1040      386     186
E-TABM-113  gcrma       invariantset   pmonly         liwong         45.4%        98.4%        18.3%  1476    1495    1043      1022      679     571
E-TABM-113  mas         invariantset   pmonly         liwong         61%          98.3%        15.9%  1698    1806    1126      1164      583     504
E-TABM-113  none        invariantset   pmonly         liwong         60.7%        97.9%        19.7%  2123    2076    1290      1275      384     351
E-TABM-113  rma         invariantset   pmonly         liwong         63.1%        98.4%        14.7%  1987    1888    1338      1206      717     558
E-TABM-113  gcrma       none           pmonly         liwong         44.1%        98.5%        17.7%  1266    1598    895       1089      1126    344
E-TABM-113  mas         none           pmonly         liwong         59%          98.6%        13.9%  1436    1929    945       1250      981     356
E-TABM-113  none        none           pmonly         liwong         60.7%        98.3%        17.1%  1736    2538    1059      1499      709     277
E-TABM-113  rma         none           pmonly         liwong         60%          98.8%        11.1%  1573    2033    1063      1285      951     362
E-TABM-113  gcrma       quantiles      pmonly         liwong         44.1%        98.4%        18.2%  1320    1542    931       1069      679     563
E-TABM-113  mas         quantiles      pmonly         liwong         60.3%        98.3%        16%    1640    1780    1090      1156      635     505
E-TABM-113  none        quantiles      pmonly         liwong         62.4%        97.7%        20.3%  2182    2134    1320      1299      430     385
E-TABM-113  rma         quantiles      pmonly         liwong         62.4%        98.3%        14.8%  1918    1934    1297      1230      723     551
Continued on next page
Table A.2 (continued)
Dataset | Background Correction | Normalization | MM Correction | Summarization | Sensitivity | Specificity | FDR | GP SFP | MX SFP | GP Genes | MX Genes | MX Exp | GP Exp
E-TABM-113 | none | invariantset | subtractmm | liwong | 51.5% | 96.4% | 33.6% | 2656 | 3284 | 1184 | 1447 | 778 | 496
E-TABM-113 | gcrma | invariantset | mas | mas | 28.1% | 97.2% | 40.7% | 1673 | 2298 | 678 | 923 | 438 | 205
E-TABM-113 | mas | invariantset | mas | mas | 51.9% | 98.5% | 16.8% | 1498 | 1918 | 894 | 1264 | 757 | 384
E-TABM-113 | none | invariantset | mas | mas | 45.8% | 96.9% | 32.2% | 2183 | 3216 | 944 | 1466 | 755 | 382
E-TABM-113 | rma | invariantset | mas | mas | 25.8% | 97.1% | 42.9% | 1351 | 2959 | 482 | 976 | 403 | 154
E-TABM-113 | gcrma | none | mas | mas | 27.8% | 97.2% | 42.3% | 1521 | 2552 | 601 | 1028 | 454 | 196
E-TABM-113 | mas | none | mas | mas | 50.5% | 98.7% | 15.8% | 1385 | 1874 | 836 | 1262 | 773 | 307
E-TABM-113 | none | none | mas | mas | 45.4% | 97.1% | 31.3% | 1941 | 3316 | 858 | 1543 | 817 | 332
E-TABM-113 | rma | none | mas | mas | 26.4% | 97% | 43.5% | 1374 | 3136 | 466 | 1030 | 369 | 140
E-TABM-113 | gcrma | quantiles | mas | mas | 29.8% | 97.2% | 40.1% | 1526 | 2554 | 610 | 1026 | 432 | 219
E-TABM-113 | mas | quantiles | mas | mas | 51.9% | 98.7% | 15% | 1464 | 1876 | 891 | 1255 | 755 | 369
E-TABM-113 | none | quantiles | mas | mas | 47.8% | 97.1% | 30.5% | 2226 | 3278 | 983 | 1525 | 777 | 370
E-TABM-113 | rma | quantiles | mas | mas | 27.5% | 97% | 42.1% | 1442 | 3151 | 503 | 1018 | 413 | 161
E-TABM-113 | gcrma | invariantset | pmonly | mas | 44.1% | 98.7% | 17.7% | 1372 | 1461 | 948 | 1058 | 579 | 413
E-TABM-113 | mas | invariantset | pmonly | mas | 60% | 98.3% | 15.3% | 1841 | 2017 | 1183 | 1369 | 731 | 508
E-TABM-113 | none | invariantset | pmonly | mas | 59.7% | 97.7% | 20.4% | 2253 | 2369 | 1371 | 1503 | 414 | 334
E-TABM-113 | rma | invariantset | pmonly | mas | 61.4% | 98.2% | 15.8% | 2050 | 2187 | 1317 | 1440 | 885 | 545
E-TABM-113 | gcrma | none | pmonly | mas | 43.4% | 98.6% | 19% | 1239 | 1493 | 860 | 1087 | 672 | 279
E-TABM-113 | mas | none | pmonly | mas | 58.6% | 98.4% | 14.8% | 1713 | 1986 | 1096 | 1358 | 913 | 398
E-TABM-113 | none | none | pmonly | mas | 60% | 97.5% | 21.7% | 2316 | 2507 | 1421 | 1586 | 627 | 289
E-TABM-113 | rma | none | pmonly | mas | 60% | 98.2% | 16.1% | 1925 | 2185 | 1216 | 1446 | 664 | 345
E-TABM-113 | gcrma | quantiles | pmonly | mas | 43.4% | 98.5% | 19.5% | 1285 | 1499 | 893 | 1090 | 552 | 412
E-TABM-113 | mas | quantiles | pmonly | mas | 59% | 98.3% | 15.5% | 1745 | 1978 | 1132 | 1355 | 788 | 516
E-TABM-113 | none | quantiles | pmonly | mas | 60.7% | 97.5% | 21.1% | 2359 | 2472 | 1439 | 1557 | 464 | 370
E-TABM-113 | rma | quantiles | pmonly | mas | 61.7% | 98.2% | 15.7% | 2019 | 2279 | 1305 | 1495 | 911 | 542
E-TABM-113 | gcrma | invariantset | mas | medianpolish | 33.2% | 98.6% | 23.4% | 1266 | 2097 | 675 | 986 | 475 | 158
E-TABM-113 | mas | invariantset | mas | medianpolish | 52.9% | 99.4% | 8.24% | 1184 | 1614 | 846 | 1158 | 793 | 344
E-TABM-113 | none | invariantset | mas | medianpolish | 52.9% | 98.4% | 18.8% | 1740 | 2503 | 930 | 1380 | 744 | 343
E-TABM-113 | rma | invariantset | mas | medianpolish | 34.6% | 98.2% | 27.7% | 1165 | 2498 | 550 | 984 | 338 | 136
E-TABM-113 | gcrma | none | mas | medianpolish | 32.5% | 98.5% | 25% | 1191 | 2060 | 641 | 989 | 446 | 144
E-TABM-113 | mas | none | mas | medianpolish | 51.5% | 99.4% | 7.32% | 1029 | 1645 | 738 | 1201 | 989 | 302
E-TABM-113 | none | none | mas | medianpolish | 51.9% | 98.5% | 18.2% | 1441 | 2799 | 791 | 1558 | 974 | 298
E-TABM-113 | rma | none | mas | medianpolish | 33.9% | 98.2% | 28.6% | 1134 | 2545 | 520 | 1002 | 279 | 108
E-TABM-113 | gcrma | quantiles | mas | medianpolish | 32.5% | 98.4% | 26.2% | 1167 | 2047 | 609 | 997 | 382 | 169
Table A.2 (continued)
Dataset | Background Correction | Normalization | MM Correction | Summarization | Sensitivity | Specificity | FDR | GP SFP | MX SFP | GP Genes | MX Genes | MX Exp | GP Exp
E-TABM-113 | mas | quantiles | mas | medianpolish | 51.5% | 99.4% | 8.43% | 1152 | 1563 | 829 | 1129 | 773 | 340
E-TABM-113 | none | quantiles | mas | medianpolish | 54.2% | 98.6% | 16.7% | 1695 | 2498 | 934 | 1402 | 760 | 343
E-TABM-113 | rma | quantiles | mas | medianpolish | 34.9% | 98% | 29.5% | 1236 | 2644 | 567 | 1026 | 347 | 142
E-TABM-113 | gcrma | invariantset | pmonly | medianpolish | 42.4% | 99.3% | 10.7% | 1059 | 1294 | 843 | 979 | 797 | 407
E-TABM-113 | mas | invariantset | pmonly | medianpolish | 60.3% | 99.3% | 7.29% | 1324 | 1591 | 961 | 1136 | 523 | 379
E-TABM-113 | none | invariantset | pmonly | medianpolish | 62.4% | 99.2% | 8% | 1529 | 1803 | 1075 | 1240 | 239 | 216
E-TABM-113 | rma | invariantset | pmonly | medianpolish | 61.4% | 99.3% | 7.65% | 1506 | 1700 | 1107 | 1196 | 732 | 424
E-TABM-113 | gcrma | none | pmonly | medianpolish | 40.3% | 99.3% | 11.2% | 1006 | 1278 | 810 | 975 | 979 | 246
E-TABM-113 | mas | none | pmonly | medianpolish | 59.3% | 99.3% | 7.89% | 1243 | 1590 | 913 | 1151 | 756 | 312
E-TABM-113 | none | none | pmonly | medianpolish | 63.4% | 99.3% | 7.43% | 1553 | 1845 | 1104 | 1278 | 400 | 175
E-TABM-113 | rma | none | pmonly | medianpolish | 61% | 99.3% | 7.22% | 1402 | 1697 | 1024 | 1197 | 745 | 322
E-TABM-113 | gcrma | quantiles | pmonly | medianpolish | 41.4% | 99.3% | 10.3% | 1010 | 1303 | 816 | 995 | 656 | 369
E-TABM-113 | mas | quantiles | pmonly | medianpolish | 59.7% | 99.3% | 7.37% | 1303 | 1564 | 949 | 1126 | 570 | 389
E-TABM-113 | none | quantiles | pmonly | medianpolish | 63.1% | 99.2% | 8.37% | 1589 | 1851 | 1123 | 1279 | 271 | 233
E-TABM-113 | rma | quantiles | pmonly | medianpolish | 62% | 99.2% | 8.04% | 1522 | 1722 | 1119 | 1213 | 715 | 428
E-TABM-113 | gcrma | invariantset | mas | plier | 30.5% | 98.9% | 18.2% | 800 | 1184 | 452 | 619 | 104 | 57
E-TABM-113 | mas | invariantset | mas | plier | 57.3% | 99.5% | 6.11% | 1169 | 1631 | 790 | 1123 | 739 | 333
E-TABM-113 | none | invariantset | mas | plier | 53.9% | 98.6% | 15% | 1658 | 2428 | 871 | 1322 | 718 | 329
E-TABM-113 | rma | invariantset | mas | plier | 30.5% | 98.8% | 26.2% | 912 | 1947 | 376 | 700 | 180 | 84
E-TABM-113 | gcrma | none | mas | plier | 29.2% | 99.4% | 11.3% | 681 | 1256 | 388 | 676 | 111 | 42
E-TABM-113 | mas | none | mas | plier | 55.6% | 99.5% | 5.75% | 1053 | 1663 | 722 | 1152 | 974 | 278
E-TABM-113 | none | none | mas | plier | 53.2% | 98.7% | 14.2% | 1406 | 2733 | 750 | 1487 | 973 | 285
E-TABM-113 | rma | none | mas | plier | 30.2% | 98.3% | 31.5% | 863 | 2030 | 355 | 737 | 166 | 62
E-TABM-113 | gcrma | quantiles | mas | plier | 30.8% | 98.8% | 19.5% | 706 | 1258 | 391 | 670 | 112 | 59
E-TABM-113 | mas | quantiles | mas | plier | 56.6% | 99.5% | 6.18% | 1158 | 1606 | 797 | 1114 | 739 | 329
E-TABM-113 | none | quantiles | mas | plier | 54.9% | 98.6% | 15.2% | 1661 | 2421 | 881 | 1336 | 746 | 336
E-TABM-113 | rma | quantiles | mas | plier | 32.2% | 98.4% | 30.7% | 1001 | 2182 | 409 | 749 | 204 | 89
E-TABM-113 | gcrma | invariantset | pmonly | plier | 38.3% | 98.6% | 21.5% | 1233 | 1487 | 656 | 785 | 236 | 154
E-TABM-113 | mas | invariantset | pmonly | plier | 58% | 99.1% | 10.9% | 1401 | 1844 | 913 | 1221 | 841 | 462
E-TABM-113 | none | invariantset | pmonly | plier | 55.6% | 97.1% | 27.8% | 2977 | 3769 | 1332 | 1683 | 852 | 576
E-TABM-113 | rma | invariantset | pmonly | plier | 42% | 98.7% | 21.5% | 1845 | 2335 | 715 | 950 | 370 | 214
E-TABM-113 | gcrma | none | pmonly | plier | 38.3% | 99% | 17.5% | 989 | 1640 | 567 | 870 | 436 | 99
E-TABM-113 | mas | none | pmonly | plier | 54.2% | 99.3% | 9.09% | 1160 | 1908 | 768 | 1291 | 1145 | 323
E-TABM-113 | none | none | pmonly | plier | 54.2% | 97.3% | 26.9% | 1985 | 4802 | 914 | 2136 | 1347 | 378
Table A.2 (continued)
Dataset | Background Correction | Normalization | MM Correction | Summarization | Sensitivity | Specificity | FDR | GP SFP | MX SFP | GP Genes | MX Genes | MX Exp | GP Exp
E-TABM-113 | rma | none | pmonly | plier | 38.6% | 98.9% | 19.1% | 1224 | 2512 | 512 | 1082 | 352 | 116
E-TABM-113 | gcrma | quantiles | pmonly | plier | 38% | 98.9% | 18.8% | 1111 | 1470 | 629 | 810 | 244 | 164
E-TABM-113 | mas | quantiles | pmonly | plier | 58% | 99.3% | 9.04% | 1344 | 1792 | 888 | 1206 | 856 | 457
E-TABM-113 | none | quantiles | pmonly | plier | 56.9% | 97.3% | 26% | 2882 | 3681 | 1337 | 1690 | 890 | 562
E-TABM-113 | rma | quantiles | pmonly | plier | 43.4% | 98.6% | 22.9% | 1771 | 2292 | 732 | 973 | 386 | 213
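
The Sensitivity, Specificity, and FDR columns of Table A.2 compare SFP calls against the sequence confirmation dataset. As a minimal sketch of how such rates are conventionally derived from a confusion matrix, the Python function below applies the standard definitions. The function name and the example counts are hypothetical and are not taken from the dissertation's data; the exact counting procedure used for the table is not restated here.

```python
def classification_rates(tp, fp, tn, fn):
    """Return (sensitivity, specificity, fdr) as fractions.

    tp: probes called SFP and confirmed polymorphic by sequencing
    fp: probes called SFP but not confirmed
    tn: probes not called and confirmed non-polymorphic
    fn: probes not called but confirmed polymorphic
    """
    sensitivity = tp / (tp + fn)   # share of true SFPs that were detected
    specificity = tn / (tn + fp)   # share of non-SFPs correctly left uncalled
    fdr = fp / (tp + fp)           # share of calls that were false
    return sensitivity, specificity, fdr


# Hypothetical counts, chosen only to illustrate the arithmetic.
sens, spec, fdr = classification_rates(tp=60, fp=15, tn=885, fn=40)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} FDR={fdr:.1%}")
```

Because the three rates use different denominators, a routine can pair high specificity with a substantial FDR when confirmed non-polymorphic probes greatly outnumber called SFPs, a pattern visible in several rows of the table.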