SINGLE FEATURE POLYMORPHISMS

SINGLE FEATURE POLYMORPHISMS: DETECTION AND INFLUENCE ON SHORT-OLIGONUCLEOTIDE MICROARRAYS

A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy with a Major in Bioinformatics and Computational Biology in the College of Graduate Studies, University of Idaho

by Matthew Lee Settles

July 2011

Major Professor: Terence Soule, Ph.D. Co-major Professor: Barrie Robison, Ph.D.


AUTHORIZATION TO SUBMIT DISSERTATION

This Dissertation of Matthew Lee Settles, submitted for the degree of Doctor of Philosophy with a major in Bioinformatics and Computational Biology and titled "Single Feature Polymorphisms: Detection and influence on short-oligonucleotide microarrays," has been reviewed in final form. Permission, as indicated by the signatures and dates given below, is now granted to submit final copies to the College of Graduate Studies for approval.

Co-Major Professor: Dr. Terence Soule
Co-Major Professor: Dr. Barrie Robison
Committee Member: Dr. Chris Williams
Committee Member: Dr. Robert Heckendorn
Program Administrator: Dr. Paul Joyce
Discipline’s College Dean: Dr. Scott Wood
Final Approval and Acceptance by the College of Graduate Studies: Dr. Jie Chen


ABSTRACT

Gene expression microarrays are a tool used by ecologists to study the genome-wide transcriptional variation that underlies complex interactions among and between organisms and their environment. Microarrays allow for the simultaneous measurement of thousands of gene products across many samples. These measurements rely on sequence similarity for efficient hybridization of sample mRNA targets to the microarray probes. Genetic polymorphisms located within a microarray’s probe affect this hybridization efficiency. These differences in probe level hybridization efficiencies are known as single feature polymorphisms (SFPs). SFPs are both a source of error, because they negatively impact the estimate of transcript abundance, and a source of information, because they are genetic markers. The Affymetrix® 3’ IVT microarray platform uses 11 to 16 short-oligonucleotide probes, covering different regions of a gene, to measure transcript abundance. The redundancy of probes within the same gene makes it possible to detect candidate SFPs from the data itself.

This dissertation presents three new results pertaining to SFPs in Affymetrix® gene expression microarray data. The first result is a novel, improved algorithm for the accurate detection of SFPs. The second result evaluates the contribution of SFPs to false positives in both differential expression analysis and the detection of SFPs themselves, and offers a solution to reduce the false positives induced by SFPs. Finally, the third result demonstrates how SFPs can be used to provide a new means for evaluating the accuracy of critical microarray preprocessing steps. These results will provide ecologists with new tools and techniques when using Affymetrix® 3’ IVT microarrays in ecological microarray studies.


ACKNOWLEDGMENTS

I would like to thank the following people and groups: My supervisors, Dr. Terence Soule and Dr. Barrie Robison, for many valuable discussions, suggestions, needed guidance, and mentoring. My committee members, Dr. Williams and Dr. Heckendorn, for their comments, guidance and patience in completing this document. My wife, Karin, son Carter and family for their support of me and my work. Dr. Larry Forney (Director of IBEST) and Drs. Holly Wichman, James Foster, and Paul Joyce (IBEST Research Oversight Team) for mentorship and support in all things, especially while completing this document. All of the IBEST faculty, staff and students for valuable discussions, mentoring and providing a great and stimulating environment in which to work. My fellow BCB students for being a fantastic group to be a part of.


DEDICATION I would like to dedicate this dissertation to my wife and son for helping me be the man I am today.

TABLE OF CONTENTS

Titlepage
Authorization to Submit Dissertation
Abstract
Acknowledgments
Dedication
Table of Contents
List of Figures
List of Tables

Chapter 1: Introduction
    1.1 General Introduction and Motivation
    1.2 Objectives of this Research
    1.3 Outline

Chapter 2: Scientific Background
    2.1 The Affymetrix® GeneChip® Microarray Platform
    2.2 Signal Calls
    2.3 Preprocessing Affymetrix Arrays
    2.4 Summary

Chapter 3: Introduction to the Research Chapters

Chapter 4: An improved algorithm for the detection of genomic variation using short oligonucleotide expression microarrays
    4.1 Abstract
    4.2 Introduction
    4.3 Materials and Methods
    4.4 Results
    4.5 Discussion

Chapter 5: Effect of Single Feature Polymorphisms on Computed Expression Levels
    5.1 Abstract
    5.2 Introduction
    5.3 Methods
    5.4 Results
    5.5 Discussion

Chapter 6: Evaluation of microarray preprocessing routines using single feature polymorphisms
    6.1 Abstract
    6.2 Introduction
    6.3 Methods
    6.4 Results
    6.5 Discussion

Chapter 7: Discussion and Future Work
    7.1 Summary
    7.2 Applications
    7.3 Future Work

References

Appendix A: Experimental Results

LIST OF FIGURES

2.1 Central dogma of molecular biology.
2.2 Example of three probe set images, with PM and MM rows, containing 20, 16 and 16 probe pairs respectively.
4.1 Single Feature Polymorphisms - Example
4.2 Position dependent sensitivity of each algorithm to detection of SFPs
4.3 Venn diagram of the overlap of SFP calls
4.4 Comparison of differential expression and SFP calling in the BB3 dataset.
4.5 Comparison of differential expression and SFP calling in the E-TAB-M dataset.
4.6 Single Feature Polymorphism detection using the RMA method when multiple probes are called as containing a SFP in the BB3 dataset.
5.1 Single Feature Polymorphisms - Example
5.2 Single Feature Polymorphism detection in a gene with a high number of called SFPs (10 of 11).
6.1 SFP detection using the RMA preprocessing pipeline in the E-TABM-113 dataset
6.2 Clustering of preprocessing routines with performance characteristics - BB3 dataset
6.3 Clustering of preprocessing routines with performance characteristics - BB3 dataset
6.4 Boxplots of sensitivity, specificity and FDR results for the BB3 dataset
6.5 Boxplots of sensitivity, specificity and FDR results for the E-TABM-113 dataset

LIST OF TABLES

2.1 Breakdown of the preprocessing steps for the MAS5, Plier, RMA, GCRMA, and dChip preprocessing pipelines.
4.1 Sensitivity, Specificity and FDR results.
4.2 SFP frequency in the BB3 and E-TABM-113 datasets.
5.1 Comparison of differential expression analysis and SFP prediction with and without the SFP mask.
5.2 Contingency table comparing the results between masked and unmasked data for differential expression.
5.3 Contingency table comparing the results between masked and unmasked data for SFP prediction.
6.1 Breakdown of the preprocessing steps for the MAS5, Plier, RMA, GCRMA, and dChip preprocessing pipelines.
6.2 Range in performance characteristics of each preprocessing subroutine.
A.1 SFP evaluation results for 89 preprocessing pipelines on the BB3 dataset.
A.2 SFP evaluation results for 97 preprocessing pipelines on the E-TABM-113 dataset.

Chapter 1
INTRODUCTION

1.1 General Introduction and Motivation

Gene expression studies in ecology and conservation seek to increase our understanding of natural and managed ecosystems, their constituent organisms and resources. These studies are performed on different scales, from individual organisms to communities and populations. DNA microarrays are a tool used to study gene expression in an ecological context on a genome-wide level [36, 27, 6]. A DNA microarray is a collection of DNA fragments (termed probes) laid out in a two-dimensional field on a substrate (such as a glass slide or silicon chip). The probes bind to complementary DNA fragments in a sample and are commonly used to measure the relative abundances of gene expression between samples [7, 40, 58].

The Affymetrix® microarray platform offers some unique advantages for ecological studies as compared to other microarray platforms (http://www.affymetrix.com). First, the platform uses many oligonucleotide probes (commonly 11 to 20, called a probe set) that cover different regions within the same gene, to more accurately measure gene expression. Second, the oligonucleotide probes are short, 25bp, which is both an advantage and a weakness. Signal measurements from short oligonucleotide probes are more susceptible to adjustments from genetic polymorphism than those from longer probes. These adjustments are a source of error, because they do not reflect a comparable measurement of gene expression. However, these adjustments can also be used as genetic markers and can add another dimension of data to the experiment.

Affymetrix® gene expression microarrays have recently been used as a high-throughput platform both for profiling genome-wide gene expression differences and for the discovery of candidate genetic polymorphisms. There are over 20,000 scientific publications that utilize Affymetrix® microarrays, most of which are gene expression studies. Where Affymetrix® gene expression microarrays were once only used in differential expression studies, today they are also being used for the discovery of genetic polymorphisms that can be used as genetic markers [14, 18, 70, 21, 35, 20, 30, 75, 60, 34, 28, 66, 53, 17]. Genetic polymorphisms in microarray data are typically described as single feature polymorphisms (SFPs), where a feature is a probe on the microarray. Therefore, the entire microarray probe acts as the genetic marker.

Algorithms have been proposed for the detection of SFPs in standard gene expression microarray experiments [74, 68, 15, 56, 19, 14]. Gene expression itself, and in particular expression differences, pose the most significant issue to be addressed for accurate and sensitive detection of SFPs. Conversely, the presence of SFPs also presents a problem for the precise estimation of gene expression. Gene expression estimates are modeled using the combined information from all probes within a probe set. SFPs present themselves as a type of error that can negatively impact expression estimates and induce false positive results in differential expression analysis.

The study of single feature polymorphisms in standard expression microarray datasets is the focus of this dissertation. Results are expected to provide ecologists with new methods and information for using Affymetrix® microarrays in ecological microarray studies.

1.2 Objectives of this Research

The objective of this dissertation is to provide researchers in ecology and conservation with new tools and results for enhanced analysis of gene expression microarray studies across genetically different populations. Specifically, this dissertation has the following three aims pertaining to single feature polymorphisms in Affymetrix® gene expression microarray datasets:

Aim #1: To provide a new and improved algorithm for the accurate and sensitive detection of single feature polymorphisms.

Aim #2a: To quantify the effect gene polymorphisms have on the false positive (and negative) rates in both differential expression estimates and the detection of single feature polymorphisms.

Aim #2b: To reduce the false positive rates in differential expression and detection of single feature polymorphisms by correcting for the presence of single feature polymorphisms.

Aim #3: To determine how single feature polymorphisms can be used to provide a new means of evaluating the effectiveness of critical microarray processing steps.

Completion of these aims provides new algorithms for SFP detection and SFP correction in gene expression microarray studies and a new method to evaluate the performance characteristics of Affymetrix® microarray preprocessing steps.

1.3 Outline

The next chapter (Chapter 2) covers the scientific background of this dissertation. Chapter 3 introduces the three research chapters. Chapter 4 addresses aim #1 of the objectives, Chapter 5 addresses aims #2a and #2b, and Chapter 6 addresses aim #3. Finally, Chapter 7 concludes this dissertation with a discussion and future work. The remaining parts of the dissertation are the references and an appendix. Appendix A provides supplemental information pertaining to Chapter 6.

Chapter 2
SCIENTIFIC BACKGROUND

Microarrays provide a means of comparing the relative expression of genes between experimental conditions. The central dogma of molecular biology states that the transfer of sequential information begins with DNA, is transcribed to messenger RNA (mRNA), and then translated into a protein (see figure 2.1). This process is collectively referred to as gene expression. The "state" of a sample at any given time is governed by which of its genes are expressed and by their relative abundances at that time. Microarrays aim to capture the "state" of a sample by measuring the relative quantities of mRNA in a sample at a particular moment in time. In a typical microarray expression experiment, the objective is to compare the relative abundances of genes across conditions. This allows researchers to make inferences regarding the differences between the "states" of experimental conditions.

Figure 2.1: Central dogma of molecular biology (Overview of the central dogma of molecular biology. Original work by Mike Jones for Wikipedia [http://upload.wikimedia.org/wikipedia/en/thumb/e/ee/Cdmb.svg/800px-Cdmb.svg.png]).

To make these comparisons the samples must be processed and the resulting data analyzed. Briefly, biological samples are processed by generating cDNA from RNA using reverse transcription. The cDNA is labeled with a fluorescent dye and hybridized to a microarray. Fundamental to microarrays is the process of hybridization, where complementary nucleic acid sequences specifically pair with each other by forming hydrogen bonds between complementary nucleotide base pairs. The greater the number of complementary base pairs, the tighter the resulting bond is between two sequences. Microarrays are manufactured by attaching a large number (hundreds to millions) of expected complementary nucleic acid sequences (termed probes) to a substrate, each of which is bound in high quantity (picomoles). Hybridization binds labeled nucleic acid sequences from a sample to complementary nucleic acid sequences on a microarray. After hybridization, the microarray is washed to remove any non-specifically bound nucleic acids, so that only the strongly bonded strands, which contain a significant number of complementary nucleotide base pair matches, remain hybridized. The microarray is then scanned using a laser to excite the fluorescently labeled sequences still bound. A signal is generated for each probe, the intensity of which is primarily dependent on the quantity of target sample bound to the probe and the strength of the hybridization bond.

Microarrays compare the relative expression of genes between conditions by comparing the average signal intensity of a probe set from one condition to the average intensity of the same probe set from a different condition. For this to occur, the microarray signals must first be quality checked and processed, and subsequent data analysis performed. Microarray signals are checked for quality and processed using one of many algorithms that produce estimates of gene expression. Quality is most often determined by comparing the microarray-wide signal intensity profiles across all microarrays in an experiment. A fundamental assumption in microarray gene expression experiments is that, on average, there is no change in gene expression across conditions. Poor quality microarrays will have signal intensity profiles that deviate from the average signal intensity profile for the experiment and can be removed from the experiment. Gene expression estimates are then produced from signal intensity values using a preprocessing algorithm that typically performs background correction, normalization and multi-probe summarization.

Evidence for gene expression differences between conditions can then be tested by statistical means. Genes are placed into one of three categories: no change, up-regulated, or down-regulated across conditions, and biological interpretations are made. Altogether, this process is referred to as differential expression analysis; however, differential expression alone does not utilize the full potential of microarray gene expression data to extract biologically meaningful information. Post differential expression analyses, or specialized experimental designs, are common and can enhance biological interpretation from microarray gene expression experiments. Examples of post differential expression analyses include: eQTL analysis [37], gene ontology analysis [57], pathway analysis [69], gene set enrichment analysis [63], gene co-expression network construction [23, 76] and single feature polymorphism (SFP) analysis [18]. SFP analysis is the focus of this dissertation; the other topics are beyond its scope and are not discussed further.

The remaining sections of the chapter present the relevant background information required for SFP analysis in gene expression microarray experiments. The next section describes the microarray platform used in this dissertation. It is followed by a section that describes an Affymetrix® algorithm to call a gene "expressed" or not. The remaining sections provide an overview of the most common preprocessing procedures used to produce gene expression estimates from microarray signal intensity values.

2.1 The Affymetrix® GeneChip® Microarray Platform

Affymetrix® GeneChip® microarrays consist of small DNA fragments, termed probes, chemically synthesized at specific locations on a coated quartz surface [44]. Each probe is an oligonucleotide sequence of length 25bp. The mRNA molecule of interest is typically represented by a probe set comprising 11 to 20 probe pairs. Each probe pair is composed of one perfect match (PM) probe and one paired mismatch (MM) probe (see figure 2.2). The mismatch probe is created by complementing the middle (13th) base of the PM probe’s sequence. The MM probes were originally intended to provide a means to estimate non-specific binding and background. However, studies have shown that the MM probes not only detect non-specific binding and background, but specific signal as well [32, 73]. Labeled mRNA samples are hybridized to the microarray and scanned, and image analysis is performed to provide intensity values for each of the PM and MM probes. The PM and MM intensity values are the raw data in microarray gene expression analysis.

Figure 2.2: Example of three probe set images, with PM and MM rows, containing 20, 16 and 16 probe pairs respectively.
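The construction of a mismatch probe from its perfect match partner can be illustrated with a minimal Python sketch. The function name and the example sequence are hypothetical and chosen only for illustration; the sketch assumes a 25-base DNA sequence written with the letters A, C, G and T.

```python
# Watson-Crick complement of each DNA base.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(pm_sequence: str, position: int = 13) -> str:
    """Return the MM sequence: the PM sequence with one base complemented.

    `position` is 1-based, so the default of 13 is the middle of a 25-mer.
    """
    if len(pm_sequence) != 25:
        raise ValueError("expected a 25-base probe sequence")
    index = position - 1
    swapped = COMPLEMENT[pm_sequence[index]]
    return pm_sequence[:index] + swapped + pm_sequence[index + 1:]


# Made-up PM sequence, for illustration only.
pm = "ACGTACGTACGTACGTACGTACGTA"
mm = mismatch_probe(pm)
```

The PM and MM sequences differ only at the 13th base, so applying the function to the MM sequence returns the original PM sequence.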

2.2 Signal Calls

Affymetrix provides an algorithm in their MicroArray Suite 5 software called Present/Marginal/Absent calls (PMA calls) that can be used to determine whether a probe set is reliably detected (Present), not detected (Absent), or marginally detected (Marginal) as being expressed. Discrimination scores are calculated for each probe pair in the probe set using the equation (PM-MM)/(PM+MM). A signal detection p-value is calculated for a probe set from a one-sided Wilcoxon signed rank test of the probe pair discrimination scores compared to a predefined constant τ (default τ = 0.015) [71]. A p-value less than α1 is assigned a present call, a p-value greater than α1 and less than α2 is assigned a marginal call, and a p-value greater than α2 is assigned an absent call (defaults α1 = 0.04 and α2 = 0.06). Probe sets with many probe pairs having discrimination scores close to 1.0 will result in a low p-value, and those having negative discrimination scores (i.e. MM > PM) will result in p-values close to 1. The signal call can be used to filter a dataset to only those genes which are considered to be expressed.
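The detection call described in this section can be sketched in Python as follows. This is an illustrative sketch rather than the MicroArray Suite 5 implementation: the function names are hypothetical, the p-value is computed exactly by enumerating sign assignments (practical for the 11 to 20 probe pairs of a probe set), and how a p-value exactly equal to α1 or α2 is handled is an assumption, since the text does not specify it.

```python
from typing import Sequence, Tuple


def signed_rank_p_value(diffs: Sequence[float]) -> float:
    """Exact one-sided p-value P(W+ >= observed) for the Wilcoxon signed rank test."""
    d = [x for x in diffs if x != 0.0]  # zero differences carry no sign information
    n = len(d)
    if n == 0:
        return 1.0
    # Rank |d| using average ranks for ties; ranks are doubled so they stay integers.
    order = sorted(range(n), key=lambda k: abs(d[k]))
    rank2 = [0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        doubled_average = (i + 1) + (j + 1)  # twice the mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            rank2[order[k]] = doubled_average
        i = j + 1
    observed = sum(r for r, x in zip(rank2, d) if x > 0)
    # Count the sign assignments that give each possible (doubled) positive rank sum.
    total = sum(rank2)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in rank2:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[observed:]) / 2 ** n


def pma_call(pm: Sequence[float], mm: Sequence[float], tau: float = 0.015,
             alpha1: float = 0.04, alpha2: float = 0.06) -> Tuple[str, float]:
    """Return ("P", "M" or "A", p-value) for one probe set."""
    scores = [(p - m) / (p + m) for p, m in zip(pm, mm)]
    p_value = signed_rank_p_value([r - tau for r in scores])
    if p_value < alpha1:
        return "P", p_value
    if p_value < alpha2:  # assumption: p == alpha1 is treated as marginal
        return "M", p_value
    return "A", p_value
```

For example, a probe set of 11 pairs in which every PM intensity is ten times its MM intensity receives a present call, while the reverse situation gives a p-value of 1 and an absent call.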

2.3 Preprocessing Affymetrix Arrays

The goals of preprocessing Affymetrix microarray data are threefold: 1) to remove variation due to technical sources, while preserving variation from biological sources; 2) to normalize a set of microarrays in order to make them comparable; and 3) to produce summarized expression values for each probe set. Preprocessing includes the following stages: background correction, probe level normalization, PM correction, probe set summarization and, sometimes, a second normalization at the probe set level. Not all preprocessing pipelines use all stages, but all pipelines are meant to provide the user with data ready for statistical analysis. The five most commonly used preprocessing pipelines are: MAS5 [1], Plier [3], RMA [32, 31, 12], GCRMA [73] and dChip [43, 42]. Table 2.1 details the specific algorithms employed at each preprocessing stage for these five pipelines. Other preprocessing pipelines exist; however, these five are the most commonly used in practice and are the only preprocessing pipelines used in this dissertation. The specific algorithms employed at each stage are described in greater detail below.

Table 2.1: Breakdown of the preprocessing steps for the MAS5, Plier, RMA, GCRMA, and dChip preprocessing pipelines.

Stage                      MAS5              Plier                   RMA                     GCRMA                   dChip
Background Correction      weighted average  none                    RMA (global model)      GCRMA (model based)     none
Probe Level Normalization  none              quantile normalization  quantile normalization  quantile normalization  invariant set
PM Correction              ideal mismatch    none                    none                    none                    subtract MM or none
Probe Set Summarization    Tukey biweight    Plier                   median polish           median polish           MBEI
Probe Set Normalization    mean scaled       none                    none                    none                    none

2.3.1 Background Correction Algorithms

There are many sources of error introduced by the processes by which gene expression is measured. The data are fluorescence intensities read by a scanner, which are only a proxy for the actual expression level of the gene. Deviations from actual expression levels are introduced by many sources, including non-specific binding (hybridization), RNA quality, reagents, etc., all of which can be considered background noise. Several algorithms have been developed to perform background correction.

RMA (global model)

RMA is the background adjustment implementation for the RMA preprocessing pipeline [32, 31, 12]. The PM probe intensities are corrected by using a global model for the distribution of probe intensities. Specifically, the observed PM probe intensities are modeled as the sum of a normal noise component (normal with mean $\mu$ and variance $\sigma^2$) and an exponential signal component (exponential with rate $\alpha$, i.e. mean $1/\alpha$). To avoid any possibility of negatives, the normal is truncated at zero. The expected value of the background adjusted probe intensity, $s$, can then be computed using the following equation:

$$ E(s \mid O = o) = a + b \, \frac{\phi\left(\frac{a}{b}\right) - \phi\left(\frac{o - a}{b}\right)}{\Phi\left(\frac{a}{b}\right) + \Phi\left(\frac{o - a}{b}\right) - 1} \qquad (2.1) $$

where $O$ is the observed intensity, $a = o - \mu - \sigma^2 \alpha$ and $b = \sigma$. Note that $\phi$ and $\Phi$ are the standard normal density and distribution functions, respectively.

Weighted average

This is the routine for background correction performed in the MAS5 pipeline [1]. The chip is broken into a grid of 16 rectangular regions. Within each region the lowest 2% of probe intensities are used to compute a background value for that region. Each probe is then adjusted based upon a weighted average of the backgrounds for each of the regions. The weights are based on the distances between the location of the probe and the centroids of the 16 regions.

GCRMA (model based)

This is the routine for background correction performed in the GCRMA pipeline [73]. GCRMA adjusts for optical noise and non-specific binding to produce background corrected intensity values. The algorithm uses probe sequence information to estimate probe affinity for non-specific binding. Sequence information is summarized by the base types (A, T, G, or C) at each position (1 through 25) along the probe to determine the affinity of each probe. A non-specific binding experiment was conducted as part of the algorithmic design in order to estimate the parameters of the position-specific base contributions to the probe affinity. The experiment was conducted such that only non-specific binding, and no gene-specific binding, was expected to occur. With the probe affinities available, GCRMA estimates the function for non-specific binding given the probe sequence affinities. Specifically, GCRMA estimates non-specific binding by fitting a loess curve through the relationship between MM probe intensities and predefined MM probe affinities. The background adjusted intensity values are then computed as the posterior mean of specific binding given the observed intensities and the probe sequences.
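As an illustration of the RMA global model, equation 2.1 can be evaluated directly once the parameters are known. The Python sketch below assumes that $\mu$, $\sigma$ and $\alpha$ have already been estimated from the data (the estimation step is not shown), takes $\alpha$ to be the rate of the exponential signal component so that $a = o - \mu - \sigma^2 \alpha$, and uses hypothetical function names.

```python
import math


def norm_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rma_expected_signal(o: float, mu: float, sigma: float, alpha: float) -> float:
    """Evaluate E(s | O = o) from equation 2.1 for one observed PM intensity."""
    a = o - mu - sigma ** 2 * alpha
    b = sigma
    numerator = norm_pdf(a / b) - norm_pdf((o - a) / b)
    denominator = norm_cdf(a / b) + norm_cdf((o - a) / b) - 1.0
    return a + b * numerator / denominator


# Hypothetical parameter values, for illustration only.
bright = rma_expected_signal(500.0, mu=100.0, sigma=20.0, alpha=0.01)
dim = rma_expected_signal(50.0, mu=100.0, sigma=20.0, alpha=0.01)
```

With these made-up parameters, a bright probe is adjusted to roughly its observed intensity minus the background mean, while a probe observed below the background mean still receives a small positive adjusted intensity, which is the purpose of the truncation.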

2.3.2 Probe Level Normalization Methods

The purpose of normalization is to produce comparable data across multiple samples. Intensities on different arrays may have systematic bias due to diverse technical effects such as efficiency of RNA extraction, reverse transcription, scanning, spot detection, etc. Further, there may also be systematic effects due to characteristics of the microarray, such as the effect of different probes, spotting effects, spatial effects, etc. The role of normalization is to compensate for the technical effects while preserving the effects due to the biology.

Quantile normalization

The normalization method used in the RMA, GCRMA, and Plier pipelines is the quantile normalization procedure first introduced in [12]. The goal is to give each chip the same empirical distribution. The algorithm for quantile normalization is:

1. given n microarrays of length p (p is the number of probe intensities), form the matrix $X$ of dimension p × n, where each microarray is a column of $X$ and each probe is a row of $X$;
2. sort each column of $X$ to get $X_{sort}$, keeping note of the original ordering;
3. take the mean across each row of $X_{sort}$ and assign the mean to each element in the row to get $X'_{sort}$;
4. get the normalized $X$, $X'$, by reordering each column of $X'_{sort}$ to the original ordering of $X$.
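The four steps of quantile normalization can be sketched in a few lines of Python. The function name is hypothetical, the matrix is represented as a list of rows, and tied values within a column are ordered by position rather than averaged, which is a simplification made for this sketch.

```python
from typing import List


def quantile_normalize(X: List[List[float]]) -> List[List[float]]:
    """Quantile normalize X, a list of p rows (probes) by n columns (arrays)."""
    p, n = len(X), len(X[0])
    # Step 1: treat each array as one column of the matrix.
    columns = [[X[i][j] for i in range(p)] for j in range(n)]
    # Step 2: sort each column, remembering the original row of each value.
    orders = [sorted(range(p), key=column.__getitem__) for column in columns]
    # Step 3: mean across each row of the sorted matrix.
    row_means = [
        sum(columns[j][orders[j][r]] for j in range(n)) / n for r in range(p)
    ]
    # Step 4: place the row means back in the original ordering of each column.
    normalized = [[0.0] * n for _ in range(p)]
    for j in range(n):
        for r, i in enumerate(orders[j]):
            normalized[i][j] = row_means[r]
    return normalized


# Small made-up example: 4 probes measured on 3 arrays.
X = [[5.0, 4.0, 3.0],
     [2.0, 1.0, 4.0],
     [3.0, 3.0, 6.0],
     [4.0, 2.0, 8.0]]
X_norm = quantile_normalize(X)
```

After normalization every column contains the same set of values, so all arrays share one empirical distribution, while the rank order of probes within each array is preserved.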

Invariant set

Invariant set normalization is the normalization routine performed in the dChip pipeline [43, 42]. The procedure selects an ad hoc set of genes to normalize by, instead of a set of housekeeping genes. The procedure assumes that there exists a subset of genes between two microarrays that have the same rank order, an invariant set. The microarray whose overall intensity is closest to the median overall intensity across all microarrays is chosen as the baseline microarray. Each remaining microarray is then normalized by selecting an invariant set of probes (as compared to the baseline microarray) and, using only those probes, fitting a non-linear relationship between that microarray and the baseline microarray.

2.3.3 PM Correction

The MM probes were originally designed to provide a measure of the error due to non-specific binding. PM correction routines are designed to account for this type of error specifically by use of the MM probes.

Subtract MM

Subtract MM was the algorithm used in the original Affymetrix® MicroArray Suite 4 software and is one of the two approaches used in dChip (the other is to perform no PM correction). This procedure simply subtracts the MM probe intensity value from the PM probe intensity value for each probe pair in a probe set.

Ideal mismatch

Subtracting MM probe intensity values from PM intensity values can lead to negative numbers, which are nonsensical. In the MicroArray Suite 5 (MAS5) pipeline, non-specific binding is corrected for by subtracting an ideal MM value from the PM value [1]. The ideal MM value is equal to MM when MM < PM, or to PM minus a small correction value when MM ≥ PM. The correction value is a function of the difference between PM and MM. The ideal MM value thus allows MM to be subtracted when possible (i.e. when MM is less than PM) and a substitute value to be subtracted when it is not. The ideal mismatch will always be less than the corresponding PM and thus does not produce negative values.

Probe Set Summarization

The purpose of probe set summarization is to produce a single value for a probe set that is the estimated expression level for the probe set (i.e. gene). Multiple probes (typically 16 to 20) are designed for each gene on a microarray in order to produce a robust measure of gene expression in the presence of probe specific binding differences, sequence variation and spatial artifacts that may occur on a microarray.

Tukey biweight
The summarization method employed in MAS5 is to perform a robust averaging across all the probe intensity values within a probe set. The probe intensity values are first log2 transformed and then the 1-step Tukey biweight procedure is used to produce the expression level estimate for a probe set [50]. This procedure does not use information across multiple arrays and can therefore be used on a single microarray.

Model Based Expression Index (MBEI)
MBEI is the method implemented in dChip [43, 42]. Li and Wong made the observation that the variance of a probe across multiple microarrays was less than the variance across probes within a probe set on a single microarray. In order to produce probe set expression level estimates, the following multi-chip model is fitted to each probe set:

$$ y_{ij} = \phi_i \theta_j + \epsilon_{ij} \tag{2.2} $$

where $y_{ij}$ is the probe intensity value for probe $i$ within the probe set and microarray $j$, and $\epsilon_{ij}$ is the error term. The $\phi_i$ parameter is a probe response parameter (measure of probe specific binding) and $\theta_j$ is the probe set expression level estimate for array $j$.
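As an illustration of how a multiplicative model of the form in Equation 2.2 can be fitted, the following is a minimal sketch using alternating least squares. It is not the dChip implementation: the function name `fit_mbei`, the fixed iteration count, and the identifiability constraint (the mean of the squared probe responses is set to 1) are assumptions made for this example, and no outlier handling is included.

```python
from typing import List, Tuple


def fit_mbei(y: List[List[float]], n_iter: int = 50) -> Tuple[List[float], List[float]]:
    """Fit y[i][j] ~ phi[i] * theta[j] by alternating least squares.

    Rows of y are probes (i), columns are arrays (j). The scaling
    mean(phi^2) == 1 is an assumption chosen here for identifiability.
    Assumes the data are not all zero.
    """
    n_probes, n_arrays = len(y), len(y[0])
    phi = [1.0] * n_probes
    theta = [0.0] * n_arrays
    for _ in range(n_iter):
        # Update theta_j with phi fixed: least squares slope through the origin.
        ss_phi = sum(p * p for p in phi)
        theta = [sum(phi[i] * y[i][j] for i in range(n_probes)) / ss_phi
                 for j in range(n_arrays)]
        # Update phi_i with theta fixed.
        ss_theta = sum(t * t for t in theta)
        phi = [sum(theta[j] * y[i][j] for j in range(n_arrays)) / ss_theta
               for i in range(n_probes)]
        # Rescale so that mean(phi^2) == 1 while keeping phi_i * theta_j unchanged.
        scale = (sum(p * p for p in phi) / n_probes) ** 0.5
        phi = [p / scale for p in phi]
        theta = [t * scale for t in theta]
    return phi, theta


# Toy probe set: 3 probes with different responses, 3 arrays.
true_phi = [0.5, 1.0, 1.5]
true_theta = [100.0, 200.0, 400.0]
toy = [[p * t for t in true_theta] for p in true_phi]
phi_hat, theta_hat = fit_mbei(toy)
```

On noise-free data such as `toy`, the fitted products `phi_hat[i] * theta_hat[j]` reproduce the input, and the ratios between the `theta_hat` values match the ratios between the true expression levels.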


Median polish
This is the summarization procedure used in the RMA and GCRMA pipelines and was first described in [32]. To produce probe set expression level estimates, the following multichip linear model is fit to data from each probe set:

$$ \log_2(y_{ij}) = \alpha_i + \beta_j + \epsilon_{ij} \tag{2.3} $$

where $y_{ij}$ is the probe intensity value for probe $i$ within the probe set and microarray $j$, $\alpha_i$ is a probe effect and $\beta_j$ is the log2 probe set expression level estimate for array $j$. Median polish is an algorithm (see [64]) for fitting this model robustly.

Probe Logarithmic Intensity ERror (Plier)
The Plier algorithm for probe set summarization was designed by Affymetrix as an update to the 1-step Tukey biweight procedure in MAS5. Plier accounts for the difference in signal between probes within the same probe set by means of a parameter termed probe affinity (probe specific binding estimates). Probe affinities represent the strength of signal produced at a specific concentration for a given probe. Plier seeks to produce a more accurate probe set expression level estimate by utilizing these inherent probe affinities, empirical probe performance, and handling error across low and high concentrations. Probe affinities were calculated using experimental data across multiple microarrays. Plier also utilizes an error model that assumes error is proportional to the probe, rather than the signal. This ensures that the error model can adjust appropriately for low and high concentrations (see [3] for a more thorough description of the algorithm).

2.3.5 Probe Set Normalization

Mean scaled
The MAS5 pipeline performs normalization only after probe set summarization rather than at the probe level like most other pipelines. Mean scaled normalization applies a linear scaling method to the summarized probe sets, so that all microarrays within an experiment have the same mean expression level. A scale value is predefined for the experiment (300 is a typical scale value). A scaling factor $\beta_j$ for each microarray is then computed by dividing the predefined scale value by the trimmed mean for microarray $j$. The normalized expression level estimate $x'_{ij}$ for probe set $i$ in microarray $j$ is determined by multiplying each of the original probe set expression level estimates $x_{ij}$ by the computed scale factor $\beta_j$ (Equation 2.4). The resulting normalized microarrays each have the same trimmed mean.

$$ \beta_j = \frac{\text{scale}}{\mu_j}, \qquad x'_{ij} = \beta_j x_{ij} \tag{2.4} $$

where $\mu_j$ is the trimmed mean of microarray $j$, with the highest and lowest 2% of probe set expression level estimates removed.

2.4 Summary

This chapter presented the relevant background information needed to understand the SFP analyses presented within this Dissertation. We described the platform used, Affymetrix, a signal calling algorithm for determining whether a gene is expressed, or not, and a number of preprocessing algorithms used to produce expression estimates from probe intensity signal values. While there are other microarray platforms and algorithms that can be used for SFP analysis, those presented in this chapter are the most common and the only platform and algorithms used in this dissertation.


Chapter 3
INTRODUCTION TO THE RESEARCH CHAPTERS

The research chapters (4, 5 and 6) address the objective of this dissertation, which is to provide new tools and techniques for enhanced analysis of microarray expression experiments comparing two genetically different populations. Chapter 4 addresses Aim #1 and presents a new and improved algorithm for the identification of SFPs in Affymetrix® expression microarrays. Chapter 5 quantifies the effect of genetic polymorphisms on the false positive rates of differential expression and SFP detection (Aim #2) and presents a procedure to reduce false positives by masking candidate SFPs (Aim #2b). Finally, Chapter 6 addresses Aim #3 and shows how a database of known genetic polymorphisms can be used as a means for evaluating existing microarray preprocessing routines.

In Chapter 4 we present a new and improved algorithm for the detection of SFPs, and evaluate and compare the algorithm to two previously published SFP detection algorithms using two publicly available Affymetrix® Barley (Hordeum vulgare) microarray datasets. Algorithmic performance is measured by sensitivity, specificity, and false discovery rate determined using a reference database of known genetic polymorphisms. The results show that our algorithm provides a more consistent and sensitive calling of SFPs with a lower false discovery rate (FDR).

Chapter 5 quantifies the effect of genetic polymorphisms on differential expression analysis and SFP detection. We quantify the false positive rates and false negative rates of differentially expressed genes and SFP detection by masking out possible genetic polymorphisms. The procedure described does not rely on having a reference database of known genetic polymorphisms and can be applied to any microarray dataset where comparisons are being made between populations. The results show that the false positive rate is significantly increased by the presence of genetic polymorphisms for both differential expression analysis and SFP detection (14% and 5%, respectively). Further, we provide a procedure for masking out possible genetic polymorphisms prior to differential expression and SFP prediction, which is expected to reduce the false positives in gene expression studies.

In the final research chapter (6), we explore the potential of using a reference database of known genetic polymorphisms to evaluate existing microarray preprocessing pipelines. Previous research has shown that the choice of microarray preprocessing pipeline has a significant influence on experimental results. We evaluate different preprocessing pipelines by comparing their sensitivity, specificity, and FDR in predicting known SFPs. Detection of SFPs relies on an accurate estimate of gene expression, and therefore the pipeline that most often predicts known SFPs with a lower FDR implies a better expression estimate. We tested five major preprocessing pipelines (RMA, GCRMA, dChip, MAS, and PLIER) and all possible combinations of their background correction, normalization, mismatch probe correction, and summarization routines for their ability to detect single feature polymorphisms in two Barley microarray datasets. The results produced the following important findings: GCRMA performs poorly when used in small experiments, MAS5 performed poorly in general, and the RMA preprocessing pipeline had the overall most consistent results for sensitivity, specificity and FDR across both datasets.


Chapter 4
AN IMPROVED ALGORITHM FOR THE DETECTION OF GENOMIC VARIATION USING SHORT OLIGONUCLEOTIDE EXPRESSION MICROARRAYS

4.1 Abstract

High-throughput experiments, such as microarrays, often generate far more biological information than the original experimental questions require. With many microarray analyses finished after differential expression, additional analyses are routinely not performed and untapped biological information is left undiscovered. This is especially true if the microarray experiment is from an ecological study containing multiple populations. Comparisons across populations may also contain genetic differences important to the original experimental questions; a subset of these differences may be identified using new techniques for detection of single feature polymorphisms (SFPs). SFPs are differences at the microarray probe level due to genetic polymorphisms, such as SNPs and INDELs, and not expression differences. In this study we provide a new algorithm for the detection of SFPs, evaluate the algorithm using existing data from two publicly available Affymetrix® Barley (Hordeum vulgare) microarray datasets and compare it to two previously published SFP detection algorithms. Results show that our algorithm provides a more consistent, sensitive calling of SFPs with a lower false discovery rate. Analysis of SFPs alongside gene expression differences is a low cost method for enhanced analysis and additional biological inferences.

4.2 Introduction

Ecological population studies using transcriptome data are increasingly common and are leading to new biological insights [51, 8, 36]. A common experimental design is to compare gene expression differences across populations in order to gain insights into the genetic response to environmental conditions. These experiments, however, are typically limited to the analysis of differential gene expression and do not incorporate underlying genetic variation. Currently, high-throughput RNA sequencing methods (RNA-seq) can be used to assess genetic variation and gene expression simultaneously; however, this technique is still costly, which severely limits the sample size for the experiment. Short-oligonucleotide microarrays have been used to predict candidate locations of genomic and transcriptional polymorphisms between populations, using both genomic DNA (gDNA) [14, 41, 38, 72] and messenger RNA (mRNA) [19, 56, 45, 55, 70]. The advantage of using mRNA over gDNA is that no additional experiments need be performed, as both expression and genetic variability can be assessed simultaneously. However, mRNA based polymorphism detection algorithms are generally more error prone, since they must also consider gene expression. An algorithm to reliably predict candidate locations of genetic polymorphisms from microarray gene expression experiments would provide a high-throughput technique for studying both the genetic and transcriptional basis of phenotypic variation between populations from routine microarray gene expression experiments. Short-oligonucleotide microarrays are commonly used to assess gene expression differences between populations, but have also been used to successfully predict candidate locations of single-feature polymorphisms (SFPs) between two populations. When a short-oligonucleotide probe is designed over a position with a genomic or transcriptional polymorphism, the hybridization efficiency of the sample is reduced.
SFPs are statistical differences in the probe level hybridization efficiency between two populations caused by an underlying genetic or transcriptional polymorphism. They are detected by comparing microarray probe level intensity signals, a proxy value for hybridization efficiency, between two populations. When hybridizing gDNA, SFPs are induced by single-nucleotide polymorphisms (SNPs) and small insertions/deletions (INDELs) [14]. When hybridizing mRNA, SFPs can also be induced by splicing variation and polyadenylation differences [56]. Probe length is also important: the shorter a probe is, the greater the likelihood of non-specific binding. Conversely, as a probe becomes longer, the impact that small localized polymorphisms (i.e. SNPs) have on hybridization efficiency is reduced, making SFP detection increasingly difficult. The Affymetrix® GeneChip microarray platform offers a whole genome solution with short-oligonucleotide probes (25-mers) that has been used to detect both expression and SFPs [18, 38, 11, 15].

Affymetrix® GeneChip microarrays consist of small DNA fragments, termed probes, where each probe is an oligonucleotide sequence of length 25bp. The mRNA molecule of interest is measured by multiple probe pairs, usually 11 to 20, assembled into a probe set. Each probe pair is composed of one perfect match (PM) probe and one mismatch (MM) probe; the PM probe matches the targeted mRNA sequence exactly while the MM probe is generated by complementing the middle (13th bp) nucleotide of the PM probe. The MM probe is intended to provide an estimate of non-specific binding and background. Labeled gDNA or cDNA samples are hybridized to a microarray and scanned, and image analysis is performed to provide intensity values for each of the PM and MM probes. When an SFP exists between two populations, the relative hybridization efficiency of the two populations is expected to be different and therefore the probe's relative intensity values are expected to be different (see Figure 4.1A). When hybridizing gDNA, prediction of SFPs becomes relatively straightforward: any significant probe intensity differences can be assumed to be due to hybridization efficiency differences and therefore an SFP. When predicting SFPs from mRNA, any prediction technique must also consider the expression of the mRNA molecule and be able to distinguish probe level differences due to underlying genetic variation from those due to expression differences (see Figure 4.1B). A probe's hybridization efficiency can be calculated by computing the difference between the probe's observed intensity (raw intensity value) and the expected intensity value (intensity due to expression) for each probe [56, 55]. The resulting measure has the signal due to expression removed and can be analyzed in a similar manner to gDNA hybridization data.

Here we provide two variations on a new algorithm for the detection of SFPs in standard expression microarray experiments. The two variations on our SFP prediction algorithm use different probe set summarization methods, RMA and MAS, to compute the expected intensity values. The algorithm is evaluated using data from two publicly available Affymetrix® Barley (Hordeum vulgare) microarray datasets and is compared to two previously published SFP detection algorithms [56, 55]. The two Barley datasets use the same two cultivars (Morex and Golden Promise), but differ in the number of microarrays and sample tissue types. A previously reported DNA sequence database of known polymorphisms between Morex and Golden Promise is used to determine the sensitivity, specificity and false feature polymorphism rate of each method. We assess the overlap of called SFPs across the four detection methods as well as between the two barley datasets. We further explore the sensitivity of each algorithm to predict known polymorphisms by their position within the probe. Finally, we compare the overlap between genes with called SFPs and their differential expression calls across each algorithm and both datasets. Results show that our algorithm using the RMA summarization method better estimates and removes signal from expression than the other methods. The result is a more consistent, sensitive calling of SFPs with a lower false feature polymorphism rate.

4.3 Materials and Methods

4.3.1 Experiment Datasets

The Affymetrix® Barley Genome Array contains 22,840 probe sets, each with 11 probe pairs (PM and MM probes), developed by an international collaboration of Barley researchers [16]. Raw data from two publicly available barley datasets were obtained from plexdb (http://www.plexdb.org, Experiment ID: BB3) and ArrayExpress (http://www.ebi.ac.uk/arrayexpress, Experiment ID: E-TABM-110). The first barley microarray dataset (BB3) was generated to provide a reference gene expression dataset across 15 tissues, 6 of which (stem, seminal root, vegetative shoot, seedling leaf, coleoptile and hypocotyl) contained samples from both Morex and Golden Promise (GP) cultivars [24]. The dataset consists of 3 biological replicates per tissue and cultivar for a total of 36 arrays, and is the same dataset used to detect SFPs in [56]. As determined in Rostoks et al., and verified here, one array of cultivar GP from the tissue type seminal root consistently clusters with the 3 replicates from the Morex cultivar (data not shown). This tissue type was removed from this experiment, leaving 5 tissue types and 30 microarrays. The second dataset (E-TABM-110) is also from seedling leaves and contains 3 replicates from the same two cultivars, Morex and GP, for a total of 6 microarrays. This dataset is used here to determine the sensitivity to call SFPs using a number of microarrays more typical of a small gene expression experiment, and to assess consistency of SFP calls across experiments.

Figure 4.1: Single feature polymorphism (SFP) detection using the RMA preprocessing pipeline with a low expression variability gene (A) and one with high expression variability (B) in the BB3 dataset. Each pane shows the log2 intensity values (y axis) of each array for each probe (x axis), with the Morex cultivar samples shown in red and Golden Promise samples in black. The left panes show unadjusted raw log2 intensity values with the average RMA computed expression values for each genotype (horizontal lines). The right panes show log2 intensity values after RMA expression adjustment; each probe can then be tested for a genotype effect. The bottom axis tick marks show the called SFPs (red for Morex, black for Golden Promise), while the top axis tick marks show probes with known SFPs (red/black) and known SFP free (green) probes, according to the sequence confirmation dataset.

4.3.2 Sequence Confirmation Dataset and Algorithmic Performance Evaluation

The sequence confirmation dataset used in this study is the same as that of [56] and can be found at their website (http://naturalsystems.uchicago.edu/naturalvariation/barley/SNPtable.csv). The sequences were collected from 3 barley sequence sources totaling 2,699 sequences. Of those, 30 were duplicated in 2 of the 3 sources and 1 was duplicated in all 3. Sixty-six probes contained polymorphisms in both Morex and GP genotypes as compared to the reference sequence on the microarray. After removing duplicates and sequences with polymorphisms in both genotypes, the sequence confirmation dataset consisted of 2,601 sequences, of which 223 were polymorphic for GP and 178 for Morex; the remaining 2,200 sequences did not contain a sequence polymorphism for either cultivar. Results of each SFP prediction algorithm were evaluated by direct comparison to the sequence confirmation dataset for calculations of sensitivity, specificity and false discovery rate. An algorithm's sensitivity is the proportion of known polymorphisms in the sequence confirmation dataset correctly called as an SFP by the algorithm. The specificity is the proportion of known negatives (no SNP) correctly identified as such. The false discovery rate is the proportion of SFPs called by the algorithm for which the sequence confirmation dataset indicates that no SNP is present.

4.3.3 Data Preprocessing

All analyses were conducted within the R statistical computing language using publicly available packages from CRAN and Bioconductor (R version 2.11.0, http://www.r-project.org; http://cran.fhcrc.org; http://www.bioconductor.org) [54, 26]. Raw CEL files were read into R using the Bioconductor package affy [25] and checked for quality using pseudo-chip images and residual error visualizations. Quality assurance of microarray data was further checked using the affyQAReport function from the Bioconductor package affyQCReport [52]. Hybridization and housekeeping controls, RNA degradation, sample clustering, NUSE plots, LPE plots, and RLE plots all showed high quality data (results not shown) and no additional microarrays were removed.

MicroArray Suite (MAS) PMA (Present/Marginal/Absent) calls were determined for each probe set within each microarray. In this study a marginal call was also considered to be absent. Arrays were then grouped by common attributes (i.e. strain and tissue). Final present/absent calls were determined for each probe set within each group; a probe set was called as present, for the group, if at least 5 of 6 samples within the group were called as present. A probe set was retained for further analysis if at least one group was called as present. This procedure is expected to remove probe sets that are unexpressed across the entire experiment, or expressed in only one cultivar.

A second probe set filter that is commonly performed is to filter out probe sets with a low interquartile range (IQR) across all arrays. This filter is intended to remove probe sets that are not changing and therefore potentially biologically uninteresting. This filter is useful for differential expression but not for detection of SFPs, as a probe set can include an SFP but have low expression variability. Therefore, and to maintain consistency of analyzed probe sets between both differential expression analysis and SFP calling, this filter was not applied in this study.

4.3.4 Analysis of Differential Expression

Differential expression analysis was conducted on each dataset in the following manner. First, each dataset was normalized using the Robust Multichip Averaging (RMA) preprocessing routine [12, 32, 31]. Differential expression was determined using the linear models for microarray data approach from the limma package [62] with empirical Bayes adjustment to the variance, followed by Benjamini and Hochberg (BH) correction for multiple testing [61, 9]. Differential expression was only determined for those probe sets which passed the PMA filter as described in preprocessing. A gene was considered to be differentially expressed if it had both a BH adjusted p-value less than 0.05 and a log fold change greater than 0.5. Using both p-value and fold change criteria to determine differential expression is recommended by the MicroArray Quality Control (MAQC) project [46, 59].

4.3.5 Single Feature Polymorphism Detection

Single Feature Polymorphism detection was conducted in the following manner. First, hybridization efficiencies for each probe were calculated using one of the four proposed models (models described below). Each probe was then fitted for a genotype effect using the limma approach with empirical Bayes adjustment to the variance, followed by BH correction for multiple testing. SFP detection was only performed for those probes within probe sets that passed the PMA filter as described in preprocessing. A probe was considered to contain an SFP if it had a BH adjusted p-value of less than 0.05 and a log fold change greater than 0.5. The genotype with the reduced signal was determined to contain the SFP.
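The calling rule described above can be sketched as follows. This is an illustrative example only, not the code used in this study (which relied on the limma package in R). The function names `bh_adjust` and `call_sfps` are hypothetical, the log fold change is assumed to be defined as Morex minus GP, and the 0.5 threshold is assumed to apply to its absolute value.

```python
from typing import List, Optional


def bh_adjust(pvals: List[float]) -> List[float]:
    """Benjamini and Hochberg adjusted p-values (step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda k: pvals[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, pvals[k] * m / rank)
        adjusted[k] = running_min
    return adjusted


def call_sfps(pvals: List[float], log_fc: List[float],
              alpha: float = 0.05, min_lfc: float = 0.5) -> List[Optional[str]]:
    """Return, per probe, the genotype called as carrying the SFP, or None.

    log_fc is assumed to be (Morex - GP) hybridization efficiency, so a
    positive value means GP has the reduced signal and carries the SFP.
    """
    calls: List[Optional[str]] = []
    for p_adj, lfc in zip(bh_adjust(pvals), log_fc):
        if p_adj < alpha and abs(lfc) > min_lfc:
            calls.append("GP" if lfc > 0 else "Morex")
        else:
            calls.append(None)
    return calls


example_calls = call_sfps([0.001, 0.04, 0.03, 0.5], [1.2, -0.9, 0.8, 2.0])
```

In this toy input only the first probe passes both criteria after adjustment; the second and third probes have raw p-values below 0.05 but adjusted p-values above it.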


Model # 1: Linear Model (LM)
This model of probe level hybridization efficiencies is the same as the model used in [56]. In this model, the relative probe hybridization efficiencies are modeled as the residuals from fitting the following linear model to each probe set on the array:

$$ \log_2(I_{pgtr}) = \mu + \mathrm{probe}_p + \mathrm{genotype}_g + \mathrm{tissue}_t + \mathrm{tissue}_t \times \mathrm{genotype}_g + \epsilon_{pgtr} \tag{4.1} $$

where $I_{pgtr}$ is the background corrected and normalized intensity of probe $p$, genotype $g$, tissue type $t$, replicate $r$ in a probe set. The residuals from the model are extracted and fitted for a genotype effect using the procedure described above.

Model # 2: Intensity to RMA Expression Ratio (RATIO)
The Intensity to RMA Expression Ratio model (RATIO) is similar to the method used in [55]. The difference is that we used values from the RMA summarization procedure to compute the expected expression values instead of the probe dependent nearest neighbor (PDNN) model [77]. This was done in order to make a more accurate comparison with the other models which use RMA and because of the complicated nature of preparing the energy parameterization files for the barley microarray needed for PDNN. Further, Ronald et al. reported that the RMA summary method exhibited similar and only slightly weaker performance than the PDNN model. In this model the relative probe hybridization efficiencies are modeled as:

$$ \frac{I_{pa}}{\hat{I}_a} \tag{4.2} $$

where $I_{pa}$ is the background adjusted and normalized intensity value of probe $p$ in array $a$ and $\hat{I}_a$ is the expected expression value of array $a$. The expected value of the ratio is 1 for probes which do not contain an SFP and significantly less than 1 for probes which do contain an SFP. It should be noted that the RMA model (described below) is equivalent to a log2 transformation of $I/\hat{I}$.

Model # 3: RMA subtraction (RMA)
The RMA model is the first variant of our new SFP calling algorithm, where the RMA preprocessing procedure is used to estimate the expected expression value of the probe set. From the LM model above we can interpret the sum of the genotype, tissue and any interaction terms as estimates of group level expression from the mean, with its own corresponding error (replicate deviations) that is nested inside the overall error term. This procedure adds unnecessary variance to the analysis of SFPs. We can remove this group specific error from the overall error and rewrite the linear model as:

$$ \log_2(I_{pa}) = \mathrm{probe}_p + \log_2(\hat{I}_a) + \epsilon_{pa} \tag{4.3} $$

$$ \mu + \left(\log_2(I_{pa}) - \log_2(\hat{I}_a)\right) = \mu + \mathrm{probe}_p + \epsilon_{pa} \tag{4.4} $$

where $\log_2(I_{pa})$ is the background adjusted and normalized log2 intensity value of array $a$ and probe $p$ in the probe set and $\log_2(\hat{I}_a)$ is the log2 expected expression value of array $a$ for the probe set. The $\mathrm{probe}_p$ term in the model represents a scalar adjustment to each probe that accounts for the general hybridization differences between the probes and can be ignored in this context. Scaling the relative probe hybridization efficiencies to the mean across all probes and arrays has the additional effect, when partnered with the empirical Bayes adjustment to the variance from the limma package, of giving decreased weight to those probe sets with low overall expression. The probe hybridization efficiencies are now modeled as the log2 difference of the probe's intensity value from the expected expression value, adjusted by the mean. In the RMA model the RMA summarization procedure (median polish) is used to compute the expected intensity values for each probe set of each array [12].
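The left-hand side of Equation 4.4 can be illustrated with a small sketch for a single probe set. This is not the code used in this study: the function names are hypothetical, the median polish is a plain textbook version run for a fixed number of sweeps, the input is assumed to be already background adjusted, normalized and log2 transformed, and µ is assumed to be the mean log2 intensity over all probes and arrays of the probe set.

```python
from statistics import mean, median
from typing import List, Tuple

Matrix = List[List[float]]


def median_polish(x: Matrix, n_iter: int = 10) -> Tuple[float, List[float], List[float]]:
    """Fit x[p][a] ~ overall + row[p] + col[a] by sweeping out medians."""
    n_rows, n_cols = len(x), len(x[0])
    resid = [row[:] for row in x]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        for p in range(n_rows):                      # sweep row (probe) medians
            m = median(resid[p])
            row_eff[p] += m
            resid[p] = [v - m for v in resid[p]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        for a in range(n_cols):                      # sweep column (array) medians
            m = median(resid[p][a] for p in range(n_rows))
            col_eff[a] += m
            for p in range(n_rows):
                resid[p][a] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff


def hybridization_efficiency(log2_intensity: Matrix) -> Matrix:
    """mu + (log2 I_pa - log2 I_hat_a) for every probe p and array a."""
    overall, _, col_eff = median_polish(log2_intensity)
    expected = [overall + c for c in col_eff]        # log2 expected expression per array
    mu = mean(v for row in log2_intensity for v in row)
    return [[mu + (v - expected[a]) for a, v in enumerate(row)]
            for row in log2_intensity]


# Toy probe set: 5 probes x 4 arrays (arrays 0-1 Morex, 2-3 GP; labels are illustrative).
probe_eff = [0.0, 1.0, -1.0, 0.5, -0.5]
array_expr = [8.0, 9.0, 8.0, 9.0]
toy = [[pe + ae for ae in array_expr] for pe in probe_eff]
toy[2][2] -= 2.0   # simulate an SFP: probe 2 hybridizes poorly in the GP arrays
toy[2][3] -= 2.0
eff = hybridization_efficiency(toy)
```

In `eff`, the four unaffected probes are constant across arrays (only their probe effect remains), while probe 2 is two log2 units lower in the last two arrays, which is the genotype effect that is subsequently tested.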


Model # 4: MAS subtraction (MAS)
The MAS model is the second variation of our new SFP calling algorithm, where the MAS5 microarray preprocessing procedure is used to estimate the expected expression value of a probe set instead of the RMA procedure. The probe level hybridization efficiencies are calculated in the same manner as the RMA model, with the exception that the Affymetrix® Microarray Suite (MAS5) summarization procedure is used to compute the expected intensity values for each probe set of each array [1]. MAS5 is the default preprocessing procedure used in Affymetrix's MicroArray Suite (MAS) software for their 3' IVT microarrays and, after the RMA procedure, is a commonly used preprocessing procedure for experiments which use Affymetrix® microarrays.
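For reference, the one-step Tukey biweight average that underlies MAS5 summarization can be sketched as below. This is a simplified illustration, not the Affymetrix implementation: the tuning constant `c = 5` and the small `eps = 0.0001` are assumed defaults, the function names are hypothetical, and the input is taken to be positive, already mismatch-corrected probe intensities for one probe set on one array.

```python
from math import log2
from statistics import median
from typing import List


def tukey_biweight(values: List[float], c: float = 5.0, eps: float = 1e-4) -> float:
    """One-step Tukey biweight: a weighted mean that down-weights outliers."""
    m = median(values)
    s = median(abs(v - m) for v in values)   # median absolute deviation
    total_w = 0.0
    total_wx = 0.0
    for v in values:
        u = (v - m) / (c * s + eps)          # scaled distance from the median
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        total_w += w
        total_wx += w * v
    return total_wx / total_w


def mas5_style_summary(intensities: List[float]) -> float:
    """log2 transform the probe values, then take the biweight average."""
    return tukey_biweight([log2(v) for v in intensities])


robust = tukey_biweight([8.0, 8.0, 8.0, 8.0, 20.0])   # the outlier gets weight 0
```

Because the estimate is computed from a single array, an SFP that lowers one probe in one genotype shifts the expected expression value less than it would shift a plain mean, but no information is shared across arrays.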

4.4 Results

4.4.1 Sensitivity, Specificity and False Discovery Rate

Microarray data from two publicly available Barley microarray datasets (E-TABM-113 and BB3) were preprocessed according to the procedures described in Materials and Methods. Each microarray experiment was then tested for both differentially expressed genes and single feature polymorphisms (SFPs), both with probe sets filtered by presence/absence calls (PMA calls) and without any filter applied. Filtering by PMA calls reduced the number of probe sets in the BB3 dataset from 22,840 to 17,457 (251,240 to 192,027 probes) and in the E-TABM-113 dataset from 22,840 to 14,232 (251,240 to 156,552 probes). Table 4.1 lists the sensitivity, specificity and false discovery rate for each of the four models of hybridization efficiency when applied to the filtered and unfiltered BB3 (4.1A) and E-TABM-113 (4.1B) datasets. Filtering the data by PMA calls to only those probe sets considered expressed resulted in an increase in the sensitivity to detect known SFPs (average increase of 8.5% in E-TABM-113; 2.0% in BB3) and only a slight decrease in the specificity (average decrease of 0.6% in E-TABM-113; 0.3% in BB3), while no consistent change in the expected false discovery rate was observed.


Analysis and comparisons were also performed on all probes (unfiltered dataset), because the comparison studies did not perform any probe set filtering [56, 55]. If a gene is unexpressed, any probe containing an SFP would not be detectable, nor would the gene be differentially expressed. It therefore makes little sense to include these probes in any detection analysis. The trends observed across the four models of hybridization efficiency were the same for both the filtered and unfiltered datasets; therefore from this point on we discuss results based only on the filtered dataset.
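The three performance measures reported in Table 4.1 follow directly from the definitions given in Materials and Methods. A minimal sketch of the calculation is shown below; the function name and the probe identifiers are hypothetical, and called probes that are absent from the sequence confirmation dataset are assumed to be ignored.

```python
from typing import Dict, Set


def evaluate_calls(called: Set[str], known_sfp: Set[str],
                   known_clean: Set[str]) -> Dict[str, float]:
    """Sensitivity, specificity and FDR of SFP calls against a reference set.

    known_sfp:   probes with a confirmed polymorphism
    known_clean: probes confirmed to be free of polymorphisms
    Assumes at least one reference probe of each kind and at least one
    called probe present in the reference.
    """
    tp = len(called & known_sfp)       # known SFPs that were called
    fn = len(known_sfp - called)       # known SFPs that were missed
    fp = len(called & known_clean)     # clean probes wrongly called
    tn = len(known_clean - called)     # clean probes correctly not called
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fdr": fp / (tp + fp),
    }


metrics = evaluate_calls(
    called={"p1", "p2", "p3", "p5", "p99"},          # p99 is not in the reference
    known_sfp={"p1", "p2", "p3", "p4"},
    known_clean={"p5", "p6", "p7", "p8", "p9", "p10"},
)
```

Because known polymorphism-free probes greatly outnumber known SFPs in the reference data (2,200 versus 401 sequences), specificity can remain high even when the false discovery rate is substantial, which is why both are reported.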

Table 4.1: Sensitivity, specificity and false discovery rate (FDR) of each of the four SFP calling algorithms in the BB3 (A) and E-TABM-113 (B) datasets as compared to the barley sequence confirmation dataset. Category best values are shown in bold font. In 8 of 12 possible categories the RMA procedure outperforms the other 3 algorithms, with the RATIO method performing the best in the remaining 4 categories. The RMA procedure outperformed the MAS and LM procedures in all categories.

A. BB3
              RMA                  MAS                  RATIO                LM
              Filter   No Filter   Filter   No Filter   Filter   No Filter   Filter   No Filter
Sensitivity   72.5%    70.6%       62%      59.9%       76.7%    74.1%       68.8%    67.3%
Specificity   97.1%    97.3%       96.8%    97.0%       95.0%    95.5%       94.0%    94.4%
FDR           21.1%    20.5%       28.4%    27.9%       29.9%    29.1%       36.0%    35.6%

B. E-TABM-113
              RMA                  MAS                  RATIO                LM
              Filter   No Filter   Filter   No Filter   Filter   No Filter   Filter   No Filter
Sensitivity   56.9%    50.6%       53.3%    43.9%       50.4%    38.2%       51.6%    44.6%
Specificity   98.7%    98.9%       95.2%    96.5%       98.6%    99.2%       98%      98.2%
FDR           11.5%    11.7%       37.3%    35.5%       14%      11.6%       19.5%    20.4%

Within the BB3 dataset, the sensitivity to detect known SFPs varied widely, from 62% using MAS to 76.7% using the RATIO model. The sensitivity reflects the ability of a model to detect known SFPs according to the barley sequence confirmation dataset. Specificity was similar for each of the four models, ranging from 94.0% using the LM model to 97.1% for RMA; specificity reflects the ability of a model to call a probe a non-SFP when it is known no SFP exists. A more dramatic difference was observed in the false discovery rate, ranging from 21.1% for the RMA model to 36% for the LM model. Within the E-TABM-113 dataset, the sensitivity to detect known SFPs also varied, from 50.4% using the RATIO model to 56.9% using the RMA model. Specificity was again similar for each of the four models, ranging from 95.2% using the MAS model to 98.7% for RMA. Finally, the false discovery rate ranged from 11.5% using the RMA algorithm to 37.3% for the MAS algorithm.

Overall, RMA outperformed the MAS and LM models in every category and outperformed the RATIO algorithm in five of the six evaluated criteria for filtered data. The RATIO method outperformed the LM model in all cases except sensitivity in the E-TABM-113 dataset, where RATIO performed the worst when compared to all other models. The MAS procedure performed comparably to the LM procedure for sensitivity and specificity. For false discovery rate, MAS performed significantly worse than all other models in the smaller E-TABM-113 dataset, but better than both RATIO and LM in the larger BB3 dataset. In general, the RMA method performed consistently well across both datasets.

Figure 4.2 shows the sensitivity of each model to call known SFPs by the SNP position within the probe. All models show an increased sensitivity to detecting known SFPs as the SNP position moves towards the center of the probe or if multiple SNPs occur within the same probe. A sharp increase in sensitivity was observed when the SNP did not occur in the outside 3 bases of the probe (i.e. occurred between bases 4 and 22 inclusive).

4.4.2 Comparison and Overlap of called SFPs

Table 4.2 shows the frequency in the number of called SFPs per gene, across all four models for the BB3 (4.2A) and E-TABM-113 (4.2B) datasets. Most genes contained a single called SFP and the number of SFPs per gene decreased steeply thereafter. Genes which contain many called SFPs are more likely to contain false positives and are a

[Figure 4.2: two panels, BB3 and E-TABM-113, each plotting Percent Sensitivity (y-axis, 0 to 100) against SFP Position (x-axis, 1 to 13 and M) for the RMA, MAS, RAT, and LM models.]

Figure 4.2: Position-dependent sensitivity of each algorithm for the detection of SFPs. The y-axis is the percent sensitivity of detection of known SNPs according to the SNP position in the probe, shown on the x-axis. Position 1 is the furthest position from the middle base (bp positions 1 and 25). The 13th position is the middle base in the probe (bp position 13), and M denotes probes in which multiple SNPs are known to exist.


warning sign that an accurate estimate of gene expression could not be obtained. As the number of true SFP-containing probes in a probe set increases, the ability to accurately estimate gene expression decreases, because the probes reflecting expression only are outnumbered by the probes reflecting both expression and genetic differences. Both the MAS and LM models contained a significant number of genes with more than six called SFPs relative to the RMA and RATIO models. Further, the number of SFPs is evenly split between the Morex and GP genotypes in the RMA, RATIO, and LM models across both datasets, a trend also observed in differential expression. However, the split was significantly skewed toward GP in the MAS method in the smaller E-TABM-113 dataset and towards Morex in the larger BB3 dataset. The RATIO model also had approximately 50% more called SFPs than the RMA model (12,766 versus 8,873) in the BB3 dataset, and the MAS model had approximately 100% more called SFPs than all other models in the E-TABM-113 dataset.

Table 4.2: Frequency of SFPs in the BB3 (A) and E-TABM-113 (B) datasets (number of genes containing 1 SFP, 2 SFPs, etc.), total SFPs found for the Morex (MX) and Golden Promise (GP) genotypes, total SFPs found across both genotypes, and total number of genes containing an SFP.

A.

BB3

          Number of called SFPs per gene
Model      1     2    3    4    5    6    7    8    9   10   11   MX SFP   GP SFP   Total SFPs   Genes
RMA     1921   738  330  172  131  103   98   91   56   53    7     4620     4253         8873    3700
MAS     1556   520  201  148  133  121  123  106  106   73   41     5419     3607         9026    3128
RATIO   2804  1097  513  273  241  191  128  114   68   33    3     6565     6201        12766    5466
LM      1826   570  250  190  120  127  133  123   99  116   68     5273     5279        10552    3622

B.

E-TABM-113

          Number of called SFPs per gene
Model      1     2    3    4    5    6    7    8    9   10   11   MX SFP   GP SFP   Total SFPs   Genes
RMA     1726   597  254  141   84   44   19   11    3    0    0     2609     2569         5178    2879
MAS     1976   648  301  174  160  143  166  130  112   92   50     4146     7063        11209    3952
RATIO   1949   636  233   85   40   18    3    1    1    0    0     2370     2249         4619    2969
LM      1516   447  200  116  100   73   48   41   26   10    2     2794     2638         5432    2579


An important evaluation is the proportion of SFP calls shared across the four methods and, conversely, the proportion of unique SFPs (an SFP called by that model only). Figure 4.3 shows a Venn diagram of the overlap of all called SFPs for the two datasets across all four models. A large core of called SFPs exists across both datasets and across all models, with relatively few unique SFPs. The exceptions for unique SFPs are the MAS and RATIO methods. The MAS method called a large number of unique SFPs in both datasets: 60% and 16% of all called SFPs were unique to the MAS method in the E-TABM-113 and BB3 datasets, respectively. The RATIO method called a large number (22%) of unique SFPs in the BB3 dataset. By comparison, the RMA model had only 4% and 1% unique SFPs in the E-TABM-113 and BB3 experiments, respectively. The RMA model had the greatest overall overlap, while the LM method was intermediate between the RMA method and the RATIO and MAS methods with regard to overlap. In the larger BB3 dataset, a substantial core of SFPs was called by all four methods, representing a large portion of each method’s called SFPs (64% RMA, 63% MAS, 44% RATIO, 54% LM). When the MAS method was not considered, the remaining three methods showed an even larger core of called SFPs (89% RMA, 62% RATIO, 75% LM). In the smaller E-TABM-113 dataset, the overlap among the four methods was smaller (50% RMA, 23% MAS, 56% RATIO, 48% LM). Again excluding the MAS method, the overlap among the three remaining methods increased substantially (70% RMA, 78% RATIO, 66% LM). Interestingly, the RATIO method has considerable overlap with the RMA and LM methods in the smaller E-TABM-113 dataset, but has a lower overlap and a large number of unique SFP calls in the larger BB3 dataset. Overall, the RMA method produced results that overlapped the most and had the fewest unique SFP calls compared to the other three models across both datasets.

4.4.3

SFP call agreement between BB3 and E-TABM-113

Both BB3 and E-TABM-113 datasets use the same two cultivars (Morex and Golden Promise) and should therefore have similar called SFPs within the common set of tested genes. A total of 154,110 probes (14,010 genes) were tested for SFP across both datasets.

[Figure 4.3: four-set Venn diagrams of called SFPs for the RMA, MAS, RAT, and LM algorithms in the E-TABM-113 (A) and BB3 (B) datasets.]

Figure 4.3: Venn diagram of the overlap of SFP calls between the four algorithms and across the two datasets E-TABM-113 (A) and BB3 (B). The value inside a cell represents the number of called SFPs in common between the algorithms represented by the overlap between the ovals. The value outside all ovals is the number of probes without a called SFP in any of the four algorithms.
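As an illustration of how the cell counts in Figure 4.3 and the shared and unique percentages can be derived, the following minimal Python sketch counts probes in each exclusive Venn region from per-method call sets. The function names (venn_cells, core_and_unique) and the toy probe identifiers are hypothetical and are not taken from the original analysis, which was carried out in R.

```python
def venn_cells(calls, all_probes):
    """Count probes in each exclusive Venn region.

    calls: dict mapping method name -> set of probe ids called as SFPs.
    all_probes: set of every probe tested.
    Returns a dict mapping a frozenset of method names (the methods that
    called the probe) to a count. The empty frozenset counts probes called
    by no method (the value outside all ovals).
    """
    cells = {}
    for probe in all_probes:
        key = frozenset(m for m, called in calls.items() if probe in called)
        cells[key] = cells.get(key, 0) + 1
    return cells


def core_and_unique(calls, method):
    """Percent of a method's calls shared by all methods, and percent unique to it."""
    own = calls[method]
    core = set.intersection(*calls.values())
    others = set().union(*(s for m, s in calls.items() if m != method))
    unique = own - others
    return 100.0 * len(core) / len(own), 100.0 * len(unique) / len(own)


# Toy probe ids, invented for illustration only.
calls = {
    "RMA": {"p1", "p2", "p3"},
    "MAS": {"p1", "p2", "p4", "p5"},
    "RATIO": {"p1", "p2", "p3", "p6"},
    "LM": {"p1", "p2", "p3"},
}
all_probes = {f"p{i}" for i in range(1, 9)}
cells = venn_cells(calls, all_probes)
mas_core_pct, mas_unique_pct = core_and_unique(calls, "MAS")
```

In this toy example, half of the MAS calls fall in the four-method core and half are unique to MAS; the same two quantities, computed per method, correspond to the percentages quoted in Section 4.4.2.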


Of these, the percentage of probes with a called SFP in at least one of the two datasets was 8.9% for MAS, 7.3% for RATIO, 6.5% for LM, and 5.6% for the RMA procedure. The agreement between the two datasets also varied greatly. Considering only probes that contained a called SFP in at least one of the two datasets, the agreement for the four models was 28.9% for RATIO, 31.9% for MAS, 27.5% for LM, and 39.5% for RMA. The majority of disagreements were a called SFP in one dataset and no call in the other; the number of conflicting calls (i.e., a Morex SFP in one dataset and a GP SFP in the other) was relatively small across all four methods (15 RMA, 22 LM, 30 RATIO, and 89 MAS probes). The relatively low concordance for all four methods can be explained by the differences between the two datasets: the number of samples (BB3 has five times as many microarrays) and the number of tissues (BB3 contains five tissues, whereas E-TABM-113 contains only one). One would therefore expect the BB3 dataset to have increased power and subsequently greater ability to detect SFPs. However, because the BB3 dataset has five different tissues, it will also have five different expression profiles. If, for instance, a gene containing an SFP was not expressed in one (or more than one, but not all) of the tissues, it would be difficult for any SFP-calling algorithm to account for both the differences in hybridization efficiency in expressed tissue due to the SFP and the lack of a signal in the unexpressed tissues.

4.4.4

Comparison of SFP Calls to Differential Expression

In addition to SFP calls, we also performed differential expression analysis for both datasets. Differential expression analysis resulted in 549 genes up-regulated and 680 down-regulated in Morex relative to Golden Promise in the BB3 experiment (1,229 total differentially expressed genes). For the BB3 dataset, tissue was included as a blocking factor in the linear model. The smaller E-TABM-113 experiment resulted in 760 genes up-regulated and 1,043 genes down-regulated in Morex relative to Golden Promise (1,803 total differentially expressed genes).


Of the 1,229 genes differentially expressed in the BB3 dataset, 74.3% (913 genes) contained at least one SFP when computing SFP calls using the RMA model. Conversely, of the 3,700 genes containing at least one SFP, 24.7% were also differentially expressed. Within the E-TABM-113 dataset this effect was less pronounced: of the 1,803 differentially expressed genes, 53.6% (967 genes) contained at least one SFP when computing SFP calls using the RMA model, and of the 2,879 genes containing at least one SFP, 33.5% were also differentially expressed. In general, for the RMA model, as the number of probes within a gene called as an SFP increased, the likelihood that the gene was also labeled as differentially expressed increased. Similar patterns were seen in both the MAS and LM models, where the percentage of differentially expressed genes increased as the number of called SFPs within the gene increased (see Figures 4.4 and 4.5). In the RATIO model, however, differential expression and SFP calls did not appear to be associated with each other, and no relationship was observed.

4.5

Discussion

We have described a new algorithm, with two variants using different preprocessing methodologies (RMA and MAS5), for the prediction of SFPs from standard Affymetrix® microarray gene expression experiments. We compared our two variants to two previously published methods for the prediction of SFPs between two barley cultivars (Morex and Golden Promise) across two existing barley microarray gene expression datasets. The two datasets differ in that one (BB3) contains more samples (30 microarrays) across five tissue types, while the other (E-TABM-113) contains only one tissue type and a smaller number of microarrays (6 microarrays). The E-TABM-113 dataset, however, represents a common experimental design: a direct comparison of two genotypes across a single factor using a small number of microarrays. Any SFP detection technique should also be evaluated on this type of experimental design. Our results show that the RMA model produced the overall best results across both datasets for sensitivity, specificity, and false discovery rate against a database of known SNP differences between the

[Figure 4.4: four panels (A to D), one each for the RMA, LM, MAS5, and RAT models, plotting Percentage of Genes Differentially Expressed (y-axis, 0.0 to 1.0) against Number of SFPs per Gene (x-axis, 0 to 11), with legend entries None, MX, and GP.]

Figure 4.4: Comparison of differential expression and SFP calling in the BB3 dataset. The y-axis gives the percentage of differentially expressed genes as the number of called SFPs within the probe set increases (x-axis).

[Figure 4.5: four panels (A to D), one each for the RMA, LM, MAS5, and RAT models, with legend entries None, MX, and GP.]

Figure 4.5: Comparison of differential expression and SFP calling in the E-TABM-113 dataset. The y-axis gives the percentage of differentially expressed genes as the number of called SFPs within the probe set increases (x-axis).


two genotypes. Further, the RMA model for calling SFPs was the most conservative and consistent across all evaluated statistics and both datasets. In addition, relatively few genes containing at least one SFP (24.7% BB3, 33.5% E-TABM-113) using the RMA method were also called as differentially expressed. This would imply that most SFPs are not cis-acting and are not associated with the gene’s expression. However, when a gene is differentially expressed, a majority (74.3% BB3, 53.6% E-TABM-113) were also found to contain at least one called SFP. These SFPs are then candidates for cis-acting regulators that impact gene expression and may be important ecological markers. Comparing the results of the LM and RATIO models to their respective original papers, we were able to maintain the sensitivity (67.3% in this study vs. 67% in the original study) while decreasing the false positive rate from 40% to 35.5% for the LM model. The slight sensitivity gain and lower false discovery rate are likely due to the use of the limma procedure for evaluating significance in differential hybridization rather than the significance analysis of microarrays (SAM) procedure [65] used in the original paper. The limma procedure further allows a more standard significance cutoff (BH ≤ 0.05) to be used rather than the SAM empirical p-value cutoff of ≤ 0.001 used in the original paper, with roughly the same number of called SFPs (10,552 limma, 10,504 SAM). The RATIO method’s original paper did not use the same dataset, so a direct comparison of the results is not possible. As the results and the model algorithms indicate, the ability of an algorithm to successfully predict SFPs is largely dependent on two factors: where the SFP occurs within the probe and the accuracy of the expression estimate. Results show that if the polymorphism occurs in the outside three bases (positions 1-3 and 23-25) of the probe, the likelihood of detection is reduced approximately threefold.
Factors that may impact expression estimates are the assumptions that all probes within a probe set have a single target and that this target is shared by the other probes within the same probe set (i.e., cross-hybridization and non-specific hybridization are rare). Therefore, before SFP


prediction, care should be taken to update the probe-to-transcript mapping, ensuring that each probe belongs within the probe set to which it is assigned and that it has a unique target. In addition, results show that as the number of probes containing a polymorphism increases within a probe set, the likelihood grows that the corresponding estimate of gene expression will not represent the true level of gene expression. Poor estimation of gene expression will lead to an increase in the false positive rate of both differential expression analysis and SFP prediction. Within the BB3 dataset, as the number of SFPs within a gene increased, the relative frequency of the majority genotype to the minority genotype moved towards one (i.e., became more even); this likely indicates that an entire genotype’s expression profile has been shifted (for example, see Figure 4.6). However, these genes are candidates for splice variation, multi-probe INDELs, and/or polyadenylation differences; these probes can then be mapped to exons for possible discovery of these types of polymorphisms. Finally, SFP prediction should be limited to only those probe

Figure 4.6: Single Feature Polymorphism detection using the RMA method when multiple probes are called as containing an SFP in the BB3 dataset. Here the expression profile of the Morex genotype is artificially decreased due to the presence of 5 likely SFPs (probes 4 to 8). This reduction in expression leads to an overadjustment of expression and likely false SFP calls in probes 1, 2, and 9 to 11.


sets where both genotypes are expressed (called present); many of the observed errors can be attributed to a likely unexpressed transcript in one genotype (data not shown). An intriguing possible use of SFP prediction in experiments where a confirmation sequence dataset exists is the evaluation of microarray preprocessing procedures. Calling SFPs requires an accurate determination of expression; intuitively, the preprocessing procedure that most accurately predicts SFPs, relative to the confirmation dataset, can be assumed to estimate expression the best as well. In addition, we may also use known SFPs to evaluate the overall effect that SFPs have on expression estimation. For example, we could estimate how many SFPs within a probe set will induce a false expression estimate. Probes containing candidate SFPs could then be removed from the probe set and expression recalculated; this process could be performed iteratively until the expression value stabilizes or until the number of probes becomes too few to estimate expression reliably. Expression values calculated using only probes without an SFP will provide a more accurate estimation of expression and lower false positive rates for both differential expression and SFP calls. Such a study would give insights not only into the effects of SFPs on gene expression, but also into how to correct for genetic variation. We provide an algorithm (RMA) for prediction of SFPs from standard expression microarray datasets. The algorithm is simple and quick to implement and produces results that are superior to those of the other comparison algorithms. The algorithm can be applied to small datasets and still be expected to perform well with a low FDR. Results indicate that the RMA algorithm is an effective technique for studying both gene expression differences and genetic polymorphisms in ecological microarray studies with two populations.

Acknowledgments

The authors would like to acknowledge Ms. Maia Benner for her comments and edits in preparation of this manuscript.


Chapter 5 EFFECT OF SINGLE FEATURE POLYMORPHISMS ON COMPUTED EXPRESSION LEVELS

5.1

Abstract

Microarrays are often used in ecological and conservation experiments to study the differences in gene expression between populations. They have also been used for the detection of genetic variation between these populations, known as single feature polymorphism (SFP) detection. Yet the degree to which underlying genetic variation influences differential gene expression analysis or the detection of SFPs is still largely unknown. In this chapter we quantify the false positive and false negative rates of differentially expressed genes in the presence of SFPs compared to when SFPs have been masked out. Results show the false positive rates to be significantly increased by SFPs for both differential expression analysis and SFP detection (14% and 5%, respectively). The false negative rates, however, were only marginally impacted by the presence of SFPs (0.06% and 0.1%, respectively). We also provide a procedure for masking out candidate SFPs prior to differential expression analysis and final SFP prediction. Estimating gene expression with SFPs masked out will produce more accurate gene expression estimates and reduce the false positive rates in both differential expression analysis and SFP prediction. The procedure described does not rely on having a reference database of known genetic polymorphisms and can be applied to any microarray dataset where comparisons are being made between populations.


5.2

Introduction

Microarrays remain an important tool in ecology for the study of gene expression differences between populations, but can also be used to detect genetic polymorphisms. Microarrays rely on sequence similarity of target mRNA molecules to oligonucleotide probes for efficient hybridization. The greater the sequence divergence, the less efficient the hybridization and the greater the effect on the microarray probe’s signal. Genetic variation can then be detected by a statistical comparison of microarray probe-level signal differences between samples and populations (see Chapter 4). However, the impact that genetic polymorphisms have on the accuracy of gene expression estimates, and therefore on differential gene expression analysis, is still largely unknown. It is reasonable to expect that microarray studies comparing natural populations, or even different species, will have polymorphic sites within microarray probes. These polymorphic sites, however, may result in inaccurate estimates of gene expression and false positive expression differences, and may also interfere with their own detection when the SFP detection algorithm uses gene expression in its calculations. In this chapter we show the conditions under which genetic variation impacts experimental outcomes. We also detail a procedure for identifying and masking out probes containing candidate polymorphic sites. We show that the procedure leads to a lower false positive rate for differential expression and prediction of genetic polymorphisms. Further, the procedure described does not rely on having a reference database of known genetic polymorphisms and can be applied to any microarray dataset where comparisons are being made between populations. In short-oligonucleotide microarrays, like the Affymetrix platform, gene expression estimates are computed by summarizing probe signal intensity values from multiple regions within the same gene.
When a genetic polymorphism exists within one of these regions, not only can the probe signal intensity value be affected, but also the resulting gene expression estimate (Figure 5.1). If the polymorphism is conserved in only one population, this can lead to a false determination of differential expression. The types of genetic polymorphisms that can affect probe hybridization are SNPs, insertions/deletions


(INDELs), and mRNA transcriptional differences such as splice variation and polyadenylation differences. When applied to microarrays these are all expressed as single feature polymorphisms (SFPs). An SFP, where a feature is a short oligonucleotide probe on a microarray, is a statistical difference in the probe-level hybridization efficiency between two groups. Previous work (see Chapter 4) provides an algorithm that is both consistent and sensitive, with a low false discovery rate when compared to a validation dataset of known SNPs. We showed that the location of the SNP in the probe was a determinant of whether the hybridization efficiency would be impacted and the resulting probe signal affected. SNPs that occur in the inner portion of a 25-mer probe (bases 4 to 22) have a greater chance of impacting the resulting probe signal. We have also shown that the number of probes containing an SFP within a probe set is associated with differential expression. The more SFPs in a probe set, the more likely the gene was also called as being differentially expressed. This is a strong indication that these genes may be false positives and that SFP-containing probes should be masked out prior to expression calculation and differential expression analysis. Previous studies have reported that SNPs affect differential expression. Walter et al. report that known polymorphisms in the mouse Affymetrix® 3’ IVT microarrays may be responsible for as much as 36% false positive and 13% false negative differentially expressed genes when preprocessing with MAS5, and 22% false positives and 12% false negatives when using the RMA preprocessing pipeline [67]. In addition, Benovoy et al. show that known polymorphic sites affect the determination of both exon expression (probe set) and gene expression (meta-probe set) in Affymetrix Human Exon arrays [10]. Also by masking, they show the false positive rates to be 86.6% and 8.1% at the probe set and meta-probe set levels, respectively, and the false negative rates to be negligible at 0.3% and 0.05%. Both of these studies, however, relied on the existence of a comprehensive sequence polymorphism database to mask out, a priori, probes believed to contain polymorphisms between the samples used. Here we present a detailed analysis of the effect of SFPs on differential expression


Figure 5.1: Single Feature Polymorphism detection using the RMA method with a low expression variability gene (A) and one with high expression variability (B) in the barley BB3 dataset. Each pane contains the log2 intensity values (y axis) of each array and each probe (x axis), with the Morex cultivar shown in red and Golden Promise in black. The left pane shows raw intensity values and average probe set expression values for each genotype (horizontal lines). The right pane shows log2 intensity values after RMA expression adjustment; each probe is then tested for a genotype difference. The bottom axis tick marks show the called SFPs (red for Morex, black for Golden Promise), while the top axis tick marks show probes with known SFPs (red/black) and known SFP-free probes (green), according to the sequence confirmation dataset.


analysis in an Affymetrix® 3’ IVT Barley microarray study. To improve the reliability of differential expression analysis and SFP calls, we propose a probe masking strategy based on a preliminary determination of candidate SFPs from the dataset. Finally, we report how the masking procedure improves overall results by reducing the number of false positives for SFP detection and differential gene expression analysis.

5.3

Methods

5.3.1

Experiment Datasets


pairs (PM and MM probes), developed by an international collaboration of barley researchers [16]. Raw data from a publicly available barley dataset was obtained from plexdb (http://www.plexdb.org, Experiment ID: BB3). We used a subset of 30 microarrays from the original experiment that contained good quality data from 5 tissues (stem, vegetative shoot, seedling leaf, coleoptile and hypocotyl), each with 3 replicates from two barley cultivars, Morex and Golden Promise (GP) [24]. The sequence confirmation dataset used in this study is the same as that of [56] and can be found at their website (http://naturalsystems.uchicago.edu/naturalvariation/barley/SNPtable.csv).

5.3.2

Data Preprocessing

All analyses were conducted within the R statistical computing language using publicly available packages from CRAN and Bioconductor (R version 2.11.0, http://www.r-project.org; http://cran.fhcrc.org; http://www.bioconductor.org) [54, 26]. Raw Affymetrix® CEL files were read into R using the Bioconductor package affy [25] and checked for quality using pseudo-chip images and residual error visualizations. The quality of the microarray data was further checked using the affyQAReport function from the Bioconductor package affyQCReport [52]. Hybridization and housekeeping controls, RNA degradation, sample clustering, NUSE plots, LPE plots, and RLE plots all showed high quality data (results not shown) and no additional microarrays


were removed. MicroArray Suite (MAS) PMA (Present/Marginal/Absent) calls were determined for each probe set within each array. In this study, a marginal call was also considered to be absent. A probe set was retained for further analysis only if all samples were called as present. This procedure is expected to keep only probe sets that are unambiguously "expressed" across the entire experiment. This is important because if a gene is unexpressed in a sample, an SFP in any of its probes would not be detectable, nor could the gene be differentially expressed. Filtering the BB3 dataset in this manner reduced the number of probe sets from 22,840 to 9,222 (251,240 to 101,499 probes). A second common filter is to remove probe sets with a low interquartile range (IQR) across all arrays; this filter is intended to remove probe sets that are "not changing" and therefore potentially biologically uninteresting. This filter is useful for differential expression analysis, but not for detection of SFPs, as a probe set can include an SFP but have low expression variability. Therefore, to maintain consistency of analyzed probe sets between differential expression analysis and SFP calling, this filter was not applied in this study.

5.3.3

Analysis of Differential Expression

Differential expression was determined only for those probe sets that passed the PMA filter described under Data Preprocessing, after first preprocessing the dataset using the RMA procedure [32, 31, 12]. Differentially expressed genes were then identified using the linear models for microarray data approach implemented in the limma package [62], with empirical Bayes adjustment to the variance [61]. We used a linear model that included both genotype and tissue as main effects. The resulting p-values were then adjusted for multiple testing using the Benjamini and Hochberg (BH) false discovery rate correction procedure [9]. A gene was considered to be differentially expressed if it had an adjusted p-value less than 0.05 and a log fold change (logFC) greater than 0.5. Using both p-value and fold change criteria to determine differential expression is recommended by the MicroArray Quality Control (MAQC) project [46, 59].
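The final calling step can be sketched as follows. This minimal Python example applies the BH adjustment and the two thresholds to a list of p-values and log fold changes. It does not reproduce the limma model fit or the empirical Bayes variance adjustment, and it assumes that the logFC threshold is applied to the absolute value. The function names and the example values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_significant(pvalues, logfc, alpha=0.05, min_logfc=0.5):
    """Flag entries with BH-adjusted p < alpha and |logFC| > min_logfc.

    Using the absolute logFC is an assumption of this sketch.
    """
    adj = bh_adjust(pvalues)
    return [a < alpha and abs(fc) > min_logfc for a, fc in zip(adj, logfc)]


# Invented example values, not data from the study.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
logfc = [1.2, -0.8, 0.9, 0.3, 2.0]
adjusted = bh_adjust(pvalues)
calls = call_significant(pvalues, logfc)
```

Here the third entry has a raw p-value below 0.05 but a BH-adjusted value just above it, so it is not called; the same two-threshold rule is reused for probe-level SFP calls in Section 5.3.4.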


5.3.4

Detection of Single Feature Polymorphisms (SFPs)

The model of hybridization efficiency used in this study is based on principles from the RMA preprocessing algorithm [32, 31, 12] and was described in detail in Chapter 4. Briefly, the relative hybridization efficiency of a probe was calculated as the difference between the log2 raw probe intensity value and the calculated log2 probe set expression value for the array, adjusted by the mean expression value across all arrays. The resulting values are log2 probe signal values adjusted for the mRNA expression differences between samples.

\[ H_{pa} = \left( \log_2(I_{pa}) - \log_2(\hat{I}_{a}) \right) + \mu \tag{5.1} \]

where $H_{pa}$ is the calculated hybridization efficiency of probe $p$ on array $a$; $\log_2(I_{pa})$ is the log2 raw intensity value of probe $p$ on array $a$ in the probe set, after RMA background correction and normalization; and $\log_2(\hat{I}_{a})$ is the expected intensity value (expression) of array $a$ for the entire probe set. The difference is then adjusted by the mean log expression value across all arrays, $\mu$, in order to scale each probe value by the overall intensity values. This has the additional effect, when partnered with the empirical Bayes adjustment to the variance in the limma package, of giving decreased weight to those probe sets with low overall expression. The RMA summarization procedure (median polish) is used to compute the expected intensity values for each probe set of each array [12]. SFP detection is performed only for those probes within probe sets that passed the PMA filter as described above, and is conducted in the following manner. First, hybridization efficiencies for each probe are calculated using the procedure described above. Each probe is then fitted for a genotype effect using the limma approach with empirical Bayes adjustment to the variance, followed by BH correction for multiple testing [61, 9]. A probe was considered to contain an SFP if it had a BH adjusted p-value of less than 0.05 and a log fold change (logFC) greater than 0.5. The genotype with the reduced hybridization signal was determined to contain the SFP.
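To make Equation 5.1 concrete, the following Python sketch computes hybridization efficiencies for a single probe set. It is a simplified illustration under stated assumptions: the input is taken to be a probes by arrays matrix of log2 intensities that has already been background corrected and normalized, the median polish is a plain Tukey implementation rather than the RMA code used in the study, and the function names and the toy matrix are hypothetical.

```python
from statistics import mean, median


def median_polish(y, max_iter=10, tol=1e-8):
    """Tukey median polish of a probes x arrays matrix of log2 intensities.

    Returns (overall, probe_effects, array_effects).
    """
    n_p, n_a = len(y), len(y[0])
    resid = [row[:] for row in y]
    overall = 0.0
    row_eff = [0.0] * n_p
    col_eff = [0.0] * n_a
    for _ in range(max_iter):
        # Sweep out probe (row) medians.
        row_med = [median(r) for r in resid]
        for p in range(n_p):
            resid[p] = [v - row_med[p] for v in resid[p]]
            row_eff[p] += row_med[p]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep out array (column) medians.
        col_med = [median(resid[p][a] for p in range(n_p)) for a in range(n_a)]
        for p in range(n_p):
            resid[p] = [resid[p][a] - col_med[a] for a in range(n_a)]
        col_eff = [c + d for c, d in zip(col_eff, col_med)]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff


def hybridization_efficiency(y):
    """H[p][a] = log2(I_pa) - log2(I-hat_a) + mu, following Equation 5.1."""
    overall, _, array_eff = median_polish(y)
    expr = [overall + c for c in array_eff]  # per-array expression estimate
    mu = mean(expr)                          # mean expression across arrays
    return [[v - e + mu for v, e in zip(row, expr)] for row in y]


# Toy additive matrix (3 probes x 3 arrays), invented for illustration.
y = [
    [6.5, 7.0, 7.5],
    [7.5, 8.0, 8.5],
    [8.5, 9.0, 9.5],
]
H = hybridization_efficiency(y)
```

Because the toy matrix is exactly additive, each probe's efficiency is constant across arrays once expression is removed. A probe carrying an SFP would instead show a shift in these values for one genotype, which is what the probe-level genotype test is designed to detect.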


5.3.5

SFP Masking Procedure and Evaluation

In order to mitigate the effect of SFPs on expression estimates, we first predict and then mask out SFPs in the dataset. The procedure described above is used to perform the initial prediction. However, instead of using a stringent multiple testing corrected BH p-value < 0.05, we chose to use instead a raw p-value < 0.05 as the criteria for selecting probes to mask out. This allows for borderline SFP containing probes to also be removed from the analysis. Probe sets that contained more than seven SFPs were removed from further analysis, as those with four or fewer probes remaining are not likely to generate reliable expression estimates. Figure 5.2 shows an example of a probe set with 10 of 11 probes called as containing an SFP. A significant number of Morex probes (probes 5 to 8) show a decrease in hybridization efficiency (possibly due to a splice variant). This results in an under estimate of expression for Morex and the SFP prediction algorithm calls an additional five probes as containing an SFP in the Golden Promise cultivar, which are clearly false positives. Once the probes and probe sets to mask are identified, they are removed from the dataset and the RMA median polish procedure is performed again to produce new estimates of gene expression. Differential expression analysis and SFP prediction is performed using the newly generated masked expression values. To evaluate how masking candidate SFPs affected both differential expressions analysis and SFP detection, we estimated the proportion of false-positives and false-negatives as compared to with and without masking. We treated the differential expression and SFP prediction results from the masked dataset as a reference (truth). These results were then compared to the results from the unmasked dataset in the manner described in Table 5.1. 
Genes that were significant for differential expression, or SFP (p-value below and logFC above threshold), in both the unmasked and SFP-masked datasets were considered true positives (TP); those that were non-significant (p-value above or logFC below threshold) in both datasets were considered true negatives (TN). A result was considered a false positive (FP) when it was significant in the unmasked dataset, but became non-significant in the


[Figure 5.2 plot: probe set rbaal30a10_s_at; SFPs called at probe(s): 1, 2, 4, 5, 6, 7, 8, 9, 10, 11. x axis: probes 1 to 11; y axis: log intensity; genotype labels MX and GP.]

Figure 5.2: Single Feature Polymorphism detection in a gene with a high number of called SFPs (10 of 11). Each pane contains the log2 intensity values (y axis) of each array (30 microarrays) and each probe (x axis), with the Morex cultivar shown in red and Golden Promise in black. The left pane shows raw intensity values and average probe set expression values for each genotype (horizontal lines). The right pane shows log2 intensity values after expression adjustment, each probe is then tested for a genotype effect. The bottom axis tick marks show the called SFP (red for Morex, black for Golden Promise).


Table 5.1: Comparison of differential expression analysis and SFP prediction with and without the SFP mask.

| Unmasked SFPs | Masked SFPs: positive for differential expression, or SFP | Masked SFPs: negative for differential expression, or SFP |
|---|---|---|
| Positive for differential expression, or SFP | TRUE POSITIVE | FALSE POSITIVE |
| Negative for differential expression, or SFP | FALSE NEGATIVE | TRUE NEGATIVE |

masked dataset. Conversely, a result was considered a false negative (FN) when it was significant in the masked dataset, but non-significant in the unmasked dataset. The rates are then computed as: false positive rate (FPR) = FP/(FP + TP) and false negative rate (FNR) = FN/(FN + TN).
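As a minimal Python sketch of this scheme, each gene or probe can be classified from its pair of calls, with the masked call treated as the reference, and the two rates computed from the resulting counts. The function names and the example calls are hypothetical.

```python
from collections import Counter


def classify(unmasked_significant, masked_significant):
    """Label one result, treating the masked analysis as the reference."""
    if masked_significant:
        return "TP" if unmasked_significant else "FN"
    return "FP" if unmasked_significant else "TN"


def masking_rates(call_pairs):
    """Return (counts, FPR, FNR) using FPR = FP/(FP+TP) and FNR = FN/(FN+TN)."""
    counts = Counter(classify(u, m) for u, m in call_pairs)
    fpr = counts["FP"] / (counts["FP"] + counts["TP"])
    fnr = counts["FN"] / (counts["FN"] + counts["TN"])
    return counts, fpr, fnr


# Hypothetical (unmasked, masked) significance calls for ten genes.
pairs = ([(True, True)] * 3 + [(True, False)]
         + [(False, True)] + [(False, False)] * 5)
counts, fpr, fnr = masking_rates(pairs)
print(dict(counts), fpr, fnr)
```

Note that the FPR defined here divides by the number of positive calls in the unmasked analysis (FP + TP), which is the definition given in the text rather than the more common FP/(FP + TN).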

5.4 Results

We preprocessed the dataset as described in "Materials and Methods", leaving 9,222 probe sets (101,499 probes) for analysis of differential expression and SFP detection. We chose to use only those probe sets which were unambiguously expressed across all samples. We felt this was the best set of genes with which to evaluate the impact of genetic variation on gene expression experiments, without needing to consider whether a gene was expressed or not. We first determined results for SFP detection and differential expression without any adjustments for genetic variation. This resulted in 3,203 called SFPs (1,782 Morex SFPs and 1,421 GP SFPs, 3.2% of all probes), which occurred in 1,665 genes, or 18.1% of all genes tested. Differential expression analysis resulted in 335 genes (151 up-regulated in Morex and 184 up-regulated in GP, 3.6% of all probe sets). These are the results a typical analysis will produce when genetic variation is not considered in the estimate of gene expression. Next, we determined a candidate set of probes to mask out of the dataset. Relaxing the criterion to an unadjusted p-value < 0.05 resulted in 3,414 candidate SFPs to mask (3.4%


of all probes). Interestingly, this is only 211 more candidate SFPs than the original BH adjusted set of 3,203 SFPs. This is at least partially due to the additional restriction of requiring a logFC > 0.5 to be called an SFP. Not including a minimum logFC would have resulted in 4,952 called SFPs after BH adjustment and 10,917 called SFPs with no BH adjustment, a twofold increase. This shows that most of the would-be "called" SFPs have a small logFC difference (< 0.5). Morey et al. found that including an expression fold change requirement increased the validation rate for differential expression by qPCR, and the MAQC Consortium further found an increase in the agreement of differentially expressed genes across microarray experiments [46, 49]. We see no reason to believe the same would not be true with SFP detection.

In addition to the total number of candidate SFPs, we also determined the number of SFP-containing probes within each probe set. A total of 1,749 probe sets contained at least one candidate SFP to mask, a 5% increase in the number of probe sets relative to the original BH adjusted set of 1,665. Most probe sets contained only a single called SFP (988 probe sets, 59%), and 55 probe sets contained eight or more called SFPs (488 SFPs in total). Since masking removes probes from a probe set for the purpose of expression estimation, a minimum number of probes must remain from which to estimate gene expression. For this experiment we chose four probes as the minimum, but others may choose a larger (or smaller) number of probes depending on the experiment. While probe sets that fall below this minimum are removed from further consideration for differential expression and SFP calling, they should not be discarded from the study. This set of genes represents the best candidates for possible structural variation and/or INDELs between the populations and requires further scrutiny (see Figure 5.2).
It is also worth noting that 50 of these 55 genes were also called as being differentially expressed (22 up-regulated in Morex and 28 up-regulated in GP); these are likely to be false positives. These 55 genes were removed from the next stage of the analysis (leaving 9,167 probe sets and 2,926 probes to mask), which was to mask the dataset for SFPs and redetermine SFP calls and differential expression. We masked the dataset by removing the 2,926 probes prior to the median polish


procedure from the RMA preprocessing routine. This produces new expression estimates for each gene and sample based on a subset of probes not likely to contain genetic variation between the populations. The new estimates of gene expression are then used for calculation of differential expression and in the calculations for SFP calls. While the 2,926 probes are removed for expression calling, they are not removed from the experiment and are re-evaluated for the presence of an SFP using the masked expression estimates.

After removing the 55 genes as described above, there remained 100,894 probes to test for the presence of an SFP. After masking, 2,678 probes were called as containing an SFP (1,152 Morex SFPs and 1,526 GP SFPs, 2.7% of all probes), occurring in 1,605 genes, which is 17.5% of the genes tested. This is a total drop of 525 called SFPs in 60 fewer genes. This is partially due to having removed the 55 genes which contained eight or more called SFPs. These genes represented 479 of the 525 probes and 49 of the 60 genes that differ between the masked and unmasked results. This does not mean that 479 of the 525 are false positive SFP calls, since these probe sets were removed for containing too many probable SFPs. However, many of these may still be false positives due to inaccurate expression estimates and should be investigated separately. Not considering the 55 removed genes, the masked results contain 136 probes previously called as SFPs that were no longer called as such, and 90 new probes called as containing an SFP that were not called in the unmasked analysis, for a net reduction of 46 called SFPs.

After masking the dataset, differential expression analysis resulted in 251 differentially expressed genes (117 up-regulated in Morex and 134 up-regulated in GP), a reduction of 84 genes from the unmasked analysis (335 differentially expressed genes in the unmasked results).
However, 50 of these 84 genes were from probe sets that contained eight or more probable SFPs and were removed from the masked analysis. These 50 genes still represent probable false positive differential expression calls, but differential expression cannot be reliably tested due to the small number of available probes after masking. Not considering the 55 genes removed for the masked analysis, differential expression of the masked dataset resulted in 40 genes no longer considered as differentially expressed that were labeled as


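Table 5.2: Contingency table comparing the results between masked and unmasked data for differential expression. Rows are the unmasked data; columns are the masked data.

| Differential expression (unmasked data) | Masked: Morex | Masked: Neither | Masked: Golden Promise |
|---|---|---|---|
| Morex | 113 | 16 | 0 |
| Neither | 4 | 8,876 | 2 |
| Golden Promise | 0 | 24 | 132 |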

such in the unmasked dataset, and six genes were now labeled as differentially expressed in the masked dataset, but were not in the unmasked dataset, for a net reduction of 34 differentially expressed genes.

Next, we assessed whether the masking procedure improved results for either differential expression or SFP calling, or both. We calculated the FPR and FNR for differential expression and SFP calls, using the four scenarios described in Table 5.1, where the masked dataset was considered to be the reference dataset (i.e. the truth) (Tables 5.2 and 5.3). It should be noted that the reference dataset may not be free of false positives or false negatives, as there may be sources of error other than SFPs, but the procedure still allows for the determination of the rates that are induced by SFPs. We found the SFP-induced false positive rate to be 14% for differential expression and 5.0% for SFP calls. This demonstrates that removing SFPs prior to differential expression analysis and performing a reanalysis of SFPs will reduce the rate of false positives in the study. SFPs, however, did not affect the false negative rate (FNR) as strongly; it was 0.07% for differential expression and 0.09% for SFP calls. In addition, these rates do not include the 55 genes removed for having eight or more masked probes. Those genes should be considered separately, although a typical analysis with no probe masking would not set them apart.


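Table 5.3: Contingency table comparing the results between masked and unmasked data for SFP prediction. Rows are the unmasked data; columns are the masked data.

| SFP prediction (unmasked data) | Masked: Morex | Masked: Neither | Masked: Golden Promise |
|---|---|---|---|
| Morex | 1,109 | 71 | 0 |
| Neither | 43 | 98,080 | 47 |
| Golden Promise | 0 | 65 | 1,479 |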
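The reported rates can be checked directly from the counts in Tables 5.2 and 5.3. The following Python sketch collapses each three-by-three table into TP, FP, FN and TN and applies the FPR and FNR definitions from the methods. The function name and the dictionary layout are hypothetical. The cells in which a call switches between genotypes are zero in both tables, and the sketch assumes they can be ignored.

```python
def sfp_induced_rates(table):
    """Collapse table[unmasked][masked] into (FPR, FNR).

    Categories are 'Morex', 'Neither' and 'GP'. Cells where the call flips
    between genotypes are zero in both tables and are ignored here.
    """
    called = ("Morex", "GP")
    tp = sum(table[g][g] for g in called)          # called in both analyses
    fp = sum(table[g]["Neither"] for g in called)  # call lost after masking
    fn = sum(table["Neither"][g] for g in called)  # call gained after masking
    tn = table["Neither"]["Neither"]
    return fp / (fp + tp), fn / (fn + tn)


# Counts copied from Table 5.2 (differential expression).
de_table = {
    "Morex":   {"Morex": 113, "Neither": 16,   "GP": 0},
    "Neither": {"Morex": 4,   "Neither": 8876, "GP": 2},
    "GP":      {"Morex": 0,   "Neither": 24,   "GP": 132},
}
# Counts copied from Table 5.3 (SFP prediction).
sfp_table = {
    "Morex":   {"Morex": 1109, "Neither": 71,    "GP": 0},
    "Neither": {"Morex": 43,   "Neither": 98080, "GP": 47},
    "GP":      {"Morex": 0,    "Neither": 65,    "GP": 1479},
}

de_fpr, de_fnr = sfp_induced_rates(de_table)
sfp_fpr, sfp_fnr = sfp_induced_rates(sfp_table)
print(f"DE:  FPR {de_fpr:.1%}, FNR {de_fnr:.2%}")
print(f"SFP: FPR {sfp_fpr:.1%}, FNR {sfp_fnr:.2%}")
```

For differential expression this gives FP = 40 and TP = 245, and for SFP prediction FP = 136 and TP = 2,588, which agree with the rates of 14% and 5.0% reported in the text.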

Finally, we compared the masked and unmasked SFP results for sensitivity, specificity and false discovery rate against a set of known SNPs found in the sequence confirmation dataset. The sequence confirmation dataset is a collection of SNPs from three barley sequence sources totaling 2,699 sequences. After removing sequences from probe sets that were filtered out (as described in section "Data Preprocessing" above) and those which contained eight or more masked SFPs (55 probe sets), the sequence confirmation dataset consisted of 1,391 sequences, of which 131 were polymorphic in Golden Promise and 101 in Morex; the remaining 1,159 sequences did not contain a SNP and serve as true negatives. The sensitivity is the proportion of known polymorphisms in the sequence confirmation dataset called correctly as an SFP by the algorithm. Masking probe sets increased the sensitivity of detection by 0.9 percentage points, from 76.3% to 77.2%. The specificity is the proportion of known negatives (no SNP) correctly identified as such. Masking improved the specificity slightly, from 99.4% to 99.5%. Finally, the false sequence polymorphism rate (FDR) is the proportion of SFPs called by the algorithm that are incorrectly identified as a polymorphism, when the sequence confirmation dataset indicates that no SNP is present. The lower this rate the better, and masking probes lowered the FDR by 0.56 percentage points, from 5.35% to 4.79%.
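The three validation metrics defined above can be written as a short Python sketch. The function name and the counts in the example are hypothetical and are not the counts behind the percentages reported in this chapter.

```python
def validation_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, FDR) from confirmation counts.

    tp: known polymorphic probes called as an SFP
    fn: known polymorphic probes not called
    tn: known non-polymorphic probes not called
    fp: known non-polymorphic probes called as an SFP
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fdr = fp / (fp + tp)
    return sensitivity, specificity, fdr


# Hypothetical counts, for illustration only.
sensitivity, specificity, fdr = validation_metrics(tp=80, fn=20, tn=990, fp=10)
print(f"sensitivity {sensitivity:.1%}, specificity {specificity:.1%}, FDR {fdr:.1%}")
```

Because true negatives greatly outnumber known polymorphisms in a dataset of this kind, a small change in the number of false positives moves the FDR far more than it moves the specificity.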


5.5 Discussion

Our analysis suggests that the presence of genetic variation within an Affymetrix microarray study will generate false positives and false negatives in both differential expression analysis and the detection of SFPs. We’ve described a procedure to mask out candidate SFPs and recalculate expression estimates from the data, prior to differential expression analysis and final SFP detection. Masking the dataset in this way was shown to have the potential to reduce the false positive rate (FPR) by 14% for differential expression and 5.0% for SFP calls in this experiment, while the false negative rate (FNR) did not appear to be significantly affected by SFPs. The FPR values determined here are on the same order as those found in previous studies that used sequence databases of known polymorphic sites. The FNR found in this study was similar to that found in the study with Affymetrix® Exon arrays [10], but was significantly less than that of the comparable study using Affymetrix® Mouse 3' IVT microarrays (12% compared to 0.06%) [67]. This may be at least partially due to the difference in the number of probes masked (> 13,000 compared to 3,500 probes).

Finally, we’ve detailed a procedure for identifying and masking out probes that leads to a lower false positive rate in both differential expression and prediction of genetic polymorphisms. The method described does not rely on any pre-existing knowledge of sequence variation, or require a reference database of known genetic polymorphisms. The procedure can be applied to any microarray dataset where comparisons are being made between populations and is well suited for studies in ecology and conservation.


Chapter 6 EVALUATION OF MICROARRAY PREPROCESSING ROUTINES USING SINGLE FEATURE POLYMORPHISMS

6.1 Abstract

The choice of microarray preprocessing pipeline will have a significant impact on the experimental results. There have been many previously published reports evaluating preprocessing pipelines, but all have used methods comparing expected gene expression levels, measured by spike-in or qRT-PCR, to the microarray calculated gene expression levels. This is potentially problematic, because expected gene expression levels determined by spike-in or qRT-PCR can only be obtained for a relatively small number of genes, and the conclusions are then assumed to apply to the much larger set measured by a microarray. Further, these methods also contain their own sources of technical error. Here we present a new method for evaluating preprocessing pipelines by their ability to detect genetic polymorphisms in expression microarray data. Detection of genetic polymorphisms in expression microarray data relies most heavily on an accurate estimation of gene expression; therefore, the pipeline that best predicts genetic polymorphisms implies better expression estimates. We compare five commonly used preprocessing pipelines (RMA, GCRMA, dChip, MAS, and PLIER) for their ability to detect single feature polymorphisms in two barley microarray datasets. Further, we compared individual subroutine performance using 96 new preprocessing pipelines generated from all possible combinations of the five standard preprocessing pipelines’ background correction, normalization, PM probe correction and summarization routines. Results show that, of the five standard preprocessing routines, the RMA preprocessing pipeline has the overall best results for sensitivity, specificity and FDR across both


datasets. Use of MM probes for PM probe correction reduces specificity in particular and performs poorly in general. Model-based summarization methods (PLIER, median polish, MBEI) significantly outperform the method that is not model-based (MAS), particularly with larger sample sizes. GCRMA background correction performed poorly with a small number of samples, and in general non-model-based background correction outperformed model-based background correction in both datasets. The results shown here can help inform the analyst’s choice of preprocessing algorithm for their datasets.

6.2 Introduction

DNA microarrays are a significant tool used for the study of genome-wide gene expression differences. With millions of probes on a DNA microarray, the user’s choice of preprocessing method will have a significant impact on the biological inferences made from the experiment. However, it is difficult to know which of the many preprocessing methods should be used and which ones will produce erroneous results. Previous methods for evaluating preprocessing routines have been based on comparisons of microarray computed expression levels to expected levels of expression determined using other molecular techniques (e.g., qRT-PCR). These comparisons, however, are typically based on only a very small number of genes, and the techniques used to determine expression have their own sources of technical error. A method for evaluating preprocessing routines that does not rely on knowledge of expression levels, which are difficult to absolutely quantify, but instead uses a feature which is more absolute, would provide new insights into the performance characteristics of microarray preprocessing. Here we present a new method for evaluating preprocessing routines by their effectiveness in detecting microarray probe level differences caused by genetic polymorphisms, or single feature polymorphisms (SFPs). Affymetrix® microarrays have been used for the discovery of SFPs in yeast [55], Arabidopsis thaliana [14], barley [19, 56], wheat [18] and others [41, 39, 21, 75]. SFPs, where a feature is a short-oligonucleotide probe on a microarray, are differences in probe


level hybridization efficiencies between samples caused by underlying genetic or transcriptional variation (see Chapter 4). A probe’s hybridization efficiency can be determined by computing the difference between the probe’s observed intensity (background corrected and normalized intensity value) and the expected intensity value (intensity due to expression) for each probe [56, 55] (see Figure 6.1). The resulting measure has any signal from gene expression removed and can then be analyzed for differences between sets of samples. Therefore, successful detection of SFPs relies on an accurate estimate of gene expression. A sequence database of known polymorphisms can be used to validate SFP detection and by extension evaluate a preprocessing routine’s accuracy in estimating expression. The preprocessing procedure which best predicts known SFPs will also be the procedure which best estimates expression.

Affymetrix® GeneChip® microarrays are the most common brand of microarray in use today; they consist of small DNA fragments, termed probes, chemically synthesized at specific locations on a coated quartz surface [44]. Each probe is an oligonucleotide sequence 25 bp in length. A probe set comprising 11 to 20 probe pairs typically represents the mRNA molecule of interest. Each probe pair is composed of one perfect match (PM) and one mismatch (MM) probe, where the mismatch probe contains the same DNA sequence as the PM probe, but with the middle (13th) base complemented. The MM probes were originally meant to provide an estimate of non-specific binding and background, but have since been shown to contain a significant amount of foreground signal as well [32, 73]. Labeled cDNA samples are hybridized to the microarray and scanned, and image analysis is performed to provide intensity values for each of the PM and MM probes. To determine the original expression levels of the cDNA sample on a microarray, a microarray preprocessing procedure is performed. The objectives of preprocessing microarray data are twofold: 1) to remove variation due to technical sources while preserving variation from biological sources, and 2) to produce summarized expression values for each gene (probe set). A typical workflow for preprocessing microarray data from raw probe level signal intensities is: background correction, across-sample normalization, MM


[Figure 6.1 plots: panels A and B, each with Morex and Golden Promise panes. x axis: probes 1 to 11; y axis: log intensity. Contig742_at: SFPs called at probe(s): 10, 11; sequence confirmed at probe(s): 10, 11. Contig4511_at: SFPs called at probe(s): 5; sequence confirmed at probe(s): 5.]

Figure 6.1: Single Feature Polymorphism detection using the RMA preprocessing pipeline with a low expression variability gene (A) and one with high expression variability (B) in the E-TABM-113 dataset. Each pane contains the log2 intensity values (y-axis) of each array (six microarrays) and each probe (x-axis), with the Morex cultivar shown in red and Golden Promise cultivar in black. The left pane shows raw intensity values and average probe set expression values for each genotype (horizontal lines). The right pane shows log2 intensity values after expression adjustment, each probe is then tested for a genotype difference. The bottom axis tick marks show the called SFP (red for Morex, black for Golden Promise), while the top axis tick marks show probes with known SFPs (red/black) and known SFP free (green) probes, according to the sequence confirmation dataset.


correction (if MM probes exist) and finally probe set summarization. These four stages are usually packaged together into a single preprocessing pipeline (e.g., MAS5 [2], RMA [32, 31, 12], GCRMA [35], PLIER [3], and dChip [43, 42]), and the user chooses a preprocessing pipeline to produce expression values for their experiment. The procedure which best approximates the technical variation within the experiment will produce a more accurate estimate of sample expression, be more sensitive, and produce fewer false positives in differential expression analysis.

Evaluating preprocessing routines is difficult, since we typically cannot absolutely quantify genome-wide expression levels prior to performing a microarray experiment. However, due to its importance, several methods have been proposed for evaluating preprocessing routines. Two of the most common are concordance with qRT-PCR expression analysis and the use of spike-in, or dilution, datasets. Evaluation of preprocessing routines by concordance of qRT-PCR expression results with microarray expression results usually covers only a small subset of the genes on a microarray [29, 48]. However, concordance in only a small number of genes (typically tens) may not translate to concordance on a genome-wide level, with tens of thousands of genes on a microarray. A second method has been to use spike-in datasets, where known quantities of mRNA from specific genes are added to the mixture, and dilution datasets, where multiple samples are pooled together in known proportions [33, 5]. Spike-in datasets have similar issues to qRT-PCR, in that typically only a few genes are spiked into the dataset. The dilution experiment, however, is a whole-array approach to evaluation. In a dilution experiment, the relative gene abundances are expected to change in a predictable manner as the dilution changes, assuming there is no error in the dilution process itself.
The affycomp online tool (http://affycomp.jhsph.edu/), which uses spike-in and dilution experiments, provides a way for researchers to evaluate and compare their preprocessing routines with other preprocessing routines. However, no single preprocessing routine has emerged as a clear "winner", and the authors instead leave it up to individual users to determine which routine is "best" for their experiment. Because of the ambiguity of these


previous methods, we introduce a new method to evaluate preprocessing routines using SFPs. SFPs offer two significant advantages. The first is that genetic variation is absolute and easily measurable. The second is that the number of SFPs and known non-SFPs can be much larger (thousands), which is two orders of magnitude greater than in the typical qRT-PCR or spike-in experiment. In this paper, we present an analysis of the performance of preprocessing algorithms by their ability to correctly detect known sequence polymorphisms (SFPs). We present a comparison of five commonly used preprocessing pipelines (RMA, GCRMA, dChip, MAS, and PLIER) and all possible combinations of their four background correction routines, three normalization routines, two MM probe adjustments and four probe set summarization routines (96 preprocessing routines). Comparisons are made by each pipeline’s ability to correctly detect known sequence polymorphisms in two publicly available Affymetrix® barley (Hordeum vulgare) microarray datasets. The two barley datasets use the same two cultivars (Morex and Golden Promise), but differ in their number of microarrays and tissue types. A previously reported DNA sequence database of known polymorphisms between Morex and Golden Promise is used to evaluate and compare the sensitivity, specificity and false feature polymorphism rate of each pipeline. Because identification of SFPs relies on accurate determination of gene expression, this analysis can inform the researcher of which preprocessing routine is most likely to produce better gene expression estimates. A more accurate estimate of gene expression will reduce the false positive rate in gene expression analysis and provide for more meaningful biological conclusions. Finally, we suggest how these results may be used in choosing a preprocessing routine for differential expression analysis in the general case.

6.3 Methods

6.3.1 Experiment Datasets


pairs (PM and MM probes), developed by an international collaboration of barley


researchers [16]. Raw data from two publicly available barley datasets were obtained from plexdb (Experiment ID: BB3) and ArrayExpress (http://www.ebi.ac.uk/arrayexpress, Experiment ID: E-TABM-110). The first barley genome microarray dataset (BB3) is a subset of 30 microarrays from the original experiment that contained good quality data from five tissues (stem, vegetative shoot, seedling leaf, coleoptile and hypocotyl), each with three replicates from two barley cultivars, Morex and Golden Promise (GP) [24]. The second dataset (E-TABM-110) is also from barley seedling leaves and contains three samples from each of the Morex and Golden Promise cultivars, for a total of six microarrays (unpublished study). This dataset is used to evaluate preprocessing routines using a number of microarrays more typical of a smaller study, and to assess consistency of preprocessing outcomes across experiments.

The sequence confirmation dataset used in this study is the same as that of [56] and can be found at their website (http://naturalsystems.uchicago.edu/naturalvariation/barley/SNPtable.csv). The sequences were collected from three barley sequence sources and contain a total of 2,699 sequences. Of those, 30 were duplicated in two of the three sources and one was duplicated in all three. Sixty-six probes contained polymorphisms for both Morex and GP genotypes as compared to the sequence on the microarray. After removing duplicates and sequences with polymorphisms in both genotypes, the sequence confirmation dataset consisted of 2,601 sequences, of which 223 were polymorphic in GP and 178 in Morex; the remaining 2,200 probes did not have a sequence polymorphism between the two cultivars and the microarray. Results of each preprocessing routine were evaluated by direct comparison to the sequence confirmation dataset with calculations of sensitivity, specificity and false discovery rate (FDR).
A pipeline’s sensitivity is the proportion of known polymorphisms in the sequence confirmation dataset called as an SFP by the algorithm and pipeline. The specificity is the proportion of known negatives (no SFP) correctly identified as such. False discovery rate is the proportion of called SFPs by the algorithm incorrectly identified


as a polymorphism, when the sequence confirmation dataset indicates that no SNP is present.

6.3.2 Data Preprocessing

All analyses were conducted within the R statistical computing language using publicly available packages from CRAN and Bioconductor (R version 2.11.0, http://www.r-project.org; http://cran.fhcrc.org; http://www.bioconductor.org) [54, 26]. Raw CEL files were read into R using the Bioconductor package affy [25] and checked for quality using pseudo-chip images and residual error visualizations. Quality of the microarray data was further checked using the affyQAReport function from the Bioconductor package affyQCReport. Hybridization and housekeeping controls, RNA degradation, sample clustering, NUSE plots, LPE plots, and RLE plots all showed high quality data (results not shown) and no additional microarrays were removed.

Microarray Suite (MAS) PMA (Present, Marginal, Absent) calls were determined for each probe set within each array; for this study a marginal call was also considered to be absent. A probe set was retained for further analysis only if all samples were called as present. This filter is expected to keep only probe sets that are unambiguously "expressed" across the entire experiment. If a gene is unexpressed in a sample, any probe containing an SFP would not be detectable, nor could the gene be differentially expressed. A second commonly performed probe set filter is to remove probe sets with a low interquartile range (IQR) across all arrays. This filter is intended to remove probe sets that are "not changing" and therefore potentially biologically uninteresting. However, this filter is only useful for differential expression and not for detection of SFPs, as a probe set can include an SFP but have low expression variability; it was therefore not applied in this study.

The most common preprocessing pipelines in use today are MAS5 (Affymetrix), PLIER (Affymetrix), RMA, GCRMA, and dChip. Table 6.1 shows the specific background correction, normalization, PM correction and summarization subroutines used by each.
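The PMA filter described above amounts to keeping a probe set only when every array is called present. A minimal Python sketch follows, assuming the calls have already been computed. The original work used R, and the function name and probe set identifiers here are hypothetical.

```python
def keep_present_probe_sets(pma_calls):
    """Keep probe sets called present ('P') on every array.

    pma_calls maps a probe set ID to its per-array calls: 'P', 'M' or 'A'.
    A marginal call ('M') is treated the same as absent ('A'), as in the text.
    """
    return [probe_set for probe_set, calls in pma_calls.items()
            if all(call == "P" for call in calls)]


# Hypothetical calls for three probe sets on four arrays.
example_calls = {
    "probe_set_1": ["P", "P", "P", "P"],
    "probe_set_2": ["P", "M", "P", "P"],  # one marginal call: dropped
    "probe_set_3": ["P", "P", "A", "P"],  # one absent call: dropped
}
kept = keep_present_probe_sets(example_calls)
print(kept)
```

Requiring a present call on every array is the strictest form of this filter, which matches the stated aim of retaining only unambiguously expressed probe sets.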
We also wanted to explore how each subroutine affected the expression estimate


and the ability to reliably detect SFPs. Therefore, new preprocessing pipelines were generated comprising all possible combinations of four background routines, three normalization routines, two mismatch probe corrections (the subtract MM procedure from dChip was not included) and four probe set summarization routines (96 total preprocessing pipelines). The four background routines were: no background correction (none), MAS5 background correction [2], RMA background correction [31] and GCRMA background correction [73]. Normalization used one of three algorithms: no normalization (none), invariant set normalization [43, 42] and quantile normalization (RMA) [13]. PM correction was done using the idealized mismatch routine from Affymetrix (this method prevents negative values when MM > PM) [2] and pmonly, which uses the perfect match probes only (no PM correction). The subtract mismatch routine from dChip was not used in this study because it will produce negative values when MM > PM, which prevents the log calculation needed for many of the probe set summarization routines. However, we did produce metrics for the standard dChip preprocessing pipeline: no background correction, invariant set normalization, and subtraction of mismatch probes, followed by MBEI summarization. Probe set summarization was performed using: MAS5 [2], PLIER [3], median polish [32] and the model based expression index (MBEI) [43, 42]. For a complete description of each of the preprocessing algorithms and routines see [32, 43, 42, 35, 2, 3].

6.3.3 Model of Hybridization Efficiency, Detection of SFPs and Differential Expression

The model of hybridization efficiency used in this study is based on principles from the RMA preprocessing algorithm [32, 31, 12], and was described in detail in Chapter 4. Briefly, the relative hybridization efficiency of a probe was calculated as the difference between the log2 raw probe intensity value and the calculated log2 probe set expression value for the array, adjusted by the mean expression value across all arrays. Resulting values are log2 probe signal values adjusted for the mRNA expression differences between samples.
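A minimal Python sketch of this calculation for a single probe set is given below. It is an illustration only: the original analysis was done in R, the function name and toy intensities are hypothetical, and the per-array expression estimates are passed in directly as a stand-in for the output of the pipeline's summarization step (for example, median polish).

```python
from statistics import mean


def hybridization_efficiency(log2_intensity, expected_log2):
    """Return H[p][a] = (log2 I_pa - expected log2 expression of array a) + mu.

    log2_intensity: one row per probe, one column per array, already
    background corrected and normalized.
    expected_log2: one probe set expression estimate per array.
    """
    mu = mean(expected_log2)  # mean log expression across all arrays
    return [
        [value - expected_log2[a] + mu for a, value in enumerate(row)]
        for row in log2_intensity
    ]


# Hypothetical probe set: 3 probes x 4 arrays (arrays 1-2 and 3-4 are
# two genotypes with different expression levels).
probes = [
    [8.0, 8.2, 9.0, 9.2],
    [7.5, 7.7, 8.5, 8.7],
    [8.1, 8.3, 7.9, 8.1],  # third probe loses signal on arrays 3 and 4
]
expression = [8.0, 8.2, 9.0, 9.2]  # stand-in for summarized expression
H = hybridization_efficiency(probes, expression)
for row in H:
    print([round(v, 2) for v in row])
```

After the adjustment, the first two probes are flat across arrays despite the expression difference, while the third probe shows a drop on the last two arrays, which is the pattern a genotype test would pick up as a candidate SFP.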


Table 6.1: Breakdown of the preprocessing steps for the MAS5, Plier, RMA, GCRMA, and dChip preprocessing pipelines.

| Step | MAS5 | Plier | RMA | GCRMA | dChip |
|---|---|---|---|---|---|
| Background Correction | weighted average | none | RMA (global model) | GCRMA (model based) | none |
| Probe Level Normalization | none | quantile normalization | quantile normalization | quantile normalization | invariant set |
| PM Correction | ideal mismatch | none | none | none | subtract MM |
| Probe Set Summarization | Tukey biweight mean | Plier | median polish | median polish | MBEI |
| Probe Set Normalization | scaled normalization | none | none | none | none |

\[ H_{pa} = \bigl(\log_2(I_{pa}) - \log_2(I_{\hat{a}})\bigr) + \mu \tag{6.1} \]

where $H_{pa}$ is the calculated hybridization efficiency of probe $p$ on array $a$, and $\log_2(I_{pa})$ is the log2 intensity value of probe $p$ on array $a$ in the probe set, after background correction, normalization and PM correction. $\log_2(I_{\hat{a}})$ is the expected log2 intensity value (expression) of array $a$ for the entire probe set. The difference is then adjusted by the mean log2 expression value across all arrays, $\mu$, in order to scale each probe value by the overall intensity of the probe set. When combined with the empirical Bayes adjustment to the variance in the limma package, this has the additional effect of giving decreased weight to probe sets with low overall expression. The pipeline's summarization procedure is used to compute the expected intensity values for each probe set on each array [12].

SFP detection and differential expression were determined only for probes within probe sets that passed the PMA filter described above. SFP detection was conducted in the following manner. First, hybridization efficiencies for each probe were calculated as described above, using the preprocessing subroutines from the pipeline being evaluated. Each probe was then fitted for a genotype effect using the limma approach with empirical Bayes adjustment to the variance, followed by Benjamini and Hochberg (BH) correction for multiple testing [61, 9]. A probe was considered to contain an SFP if it had a BH-adjusted p-value of less than 0.05 and a log fold change (logFC) greater than 0.5. The genotype with the reduced hybridization signal was determined to contain the SFP.

Differentially expressed genes were determined using the linear modeling approach from the limma package [62] with empirical Bayes adjustment to the variance [61]. We used a linear model that included both genotype and tissue as main effects for the BB3 dataset and genotype only for the E-TABM-113 dataset. The resulting p-values were then adjusted for multiple testing using the Benjamini and Hochberg (BH) false discovery rate correction procedure [9]. A gene was considered to be differentially expressed if it had an adjusted p-value less than 0.05 and a log fold change (logFC) greater than 0.5. Using both p-value and fold change criteria to determine differential expression is recommended by the MicroArray Quality Control (MAQC) project [46, 59].

6.4 Results

Data from two barley microarray datasets (E-TABM-113 and BB3) were checked for quality and then processed to determine which probe sets to filter for presence/absence of expression. Filtering the BB3 dataset by PMA calls reduced the number of probe sets from 22,840 to 9,222 (251,240 to 101,499 probes), and filtering the E-TABM-113 dataset reduced the number of probe sets from 22,840 to 12,238 (251,240 to 134,785 probes). We then computed expression estimates and hybridization efficiencies for each of the preprocessing routines, tested each probe for the presence of an SFP and tested each gene for differential expression. The sensitivity, specificity and false discovery rate, as compared to the confirmation dataset, as well as the number of SFPs called, the number of genes containing at least one SFP and the number of differentially expressed genes for each preprocessing routine are shown in Figures 6.2 and 6.3 (detailed tables are in supplemental tables A.1 and A.2). Within the BB3 dataset, however, GCRMA background correction followed by invariant set normalization did not yield enough "invariant" probes to normalize on, which left 89 successful preprocessing routines to evaluate.
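The SFP call rule and the three performance measures can be made concrete with a small sketch. It is illustrative only: in the study the p-values come from limma's moderated statistics, whereas here they are simply supplied as hypothetical inputs; applying the 0.5 threshold to the absolute logFC is an assumption; and `known` stands in for the confirmation dataset. All function names and numbers are invented for the example.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest raw p-value to the smallest so that the
    # adjusted values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_sfps(p_values, log_fc, alpha=0.05, min_abs_log_fc=0.5):
    """Flag probes with BH-adjusted p < alpha and |logFC| > threshold."""
    adjusted = bh_adjust(p_values)
    return [p < alpha and abs(fc) > min_abs_log_fc
            for p, fc in zip(adjusted, log_fc)]


def performance(called, known):
    """Sensitivity, specificity and FDR of calls against known status."""
    pairs = list(zip(called, known))
    tp = sum(1 for c, k in pairs if c and k)
    fp = sum(1 for c, k in pairs if c and not k)
    fn = sum(1 for c, k in pairs if not c and k)
    tn = sum(1 for c, k in pairs if not c and not k)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity, fdr


# Hypothetical per-probe test results for eight probes.
p_values = [0.001, 0.004, 0.03, 0.2, 0.5, 0.002, 0.8, 0.04]
log_fc = [1.2, 0.3, 0.9, 1.5, 0.1, -0.8, 0.0, 0.7]
known = [True, False, True, False, False, False, False, False]

called = call_sfps(p_values, log_fc)
sensitivity, specificity, fdr = performance(called, known)
print(called, sensitivity, specificity, fdr)
```

With these inputs two probes are called. One is a confirmed SFP and one is not, and one confirmed SFP is missed because its adjusted p-value is 0.06.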

[Figure 6.2 panels: A. Complete Linkage Cluster of SFP Calls; B. Sensitivity, Specificity, FDR (axis: Percentage); C. Predicted SFPs (axis: Number of SFPs); D. Genes Containing an SFP (axis: Number of Genes); E. Differentially Expressed Genes (axis: Number of Genes). Legend entries: MX, GP. Rows are labeled background.normalization.pmcorrection.summarization (for example rma.quantiles.pmonly.medianpolish), plus the standard dChip pipeline none.invariantset.subtractmm.liwong.]

Figure 6.2: Clustering (panel A) of preprocessing routines (89 preprocessing routines) for the BB3 dataset with the performance characteristics sensitivity, specificity and FDR (panel B), the number of called SFPs (panel C), the number of genes containing a called SFP (panel D) and the number of differentially expressed genes (panel E).

[Figure 6.3 panels: A. Complete Linkage Cluster of SFP Calls; B. Sensitivity, Specificity, FDR (axis: Percentage); C. Predicted SFPs (axis: Number of SFPs); D. Genes Containing an SFP (axis: Number of Genes); E. Differentially Expressed Genes (axis: Number of Genes). Legend entries: MX, GP. Rows are labeled background.normalization.pmcorrection.summarization (for example gcrma.invariantset.pmonly.medianpolish), plus the standard dChip pipeline none.invariantset.subtractmm.liwong.]

Figure 6.3: Clustering (panel A) of preprocessing routines (97 preprocessing routines) for the E-TABM-113 dataset with the performance characteristics sensitivity, specificity and FDR (panel B), the number of called SFPs (panel C), the number of genes containing a called SFP (panel D) and the number of differentially expressed genes (panel E).
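The routine counts quoted in the captions of Figures 6.2 and 6.3 can be reconciled with the 96 generated pipelines by a short enumeration. The sketch below is one reading that is consistent with the reported numbers, not a description of how the analysis was scripted. It assumes that the extra routine in each figure is the standard dChip pipeline, for which metrics were also produced, and that every BB3 routine combining GCRMA background correction with invariant set normalization was dropped. The label strings follow the dot-separated naming used in the figures; the variable names are illustrative.

```python
from itertools import product

# Levels of the four preprocessing subroutines, using the figure labels.
BACKGROUND = ["none", "mas", "rma", "gcrma"]
NORMALIZATION = ["none", "invariantset", "quantiles"]
PM_CORRECTION = ["mas", "pmonly"]
SUMMARIZATION = ["mas", "plier", "medianpolish", "liwong"]

# Every combination of one level per subroutine: 4 * 3 * 2 * 4 = 96.
generated = [
    ".".join(levels)
    for levels in product(BACKGROUND, NORMALIZATION, PM_CORRECTION, SUMMARIZATION)
]

# Standard dChip pipeline (subtract MM), evaluated in addition to the 96.
dchip_standard = "none.invariantset.subtractmm.liwong"
all_routines = generated + [dchip_standard]

# Assumption: in BB3 all GCRMA + invariant set routines failed.
bb3_routines = [
    name for name in all_routines
    if not name.startswith("gcrma.invariantset.")
]

print(len(generated), len(all_routines), len(bb3_routines))
```

Under these assumptions the counts are 96 generated pipelines, 97 routines for E-TABM-113 and 89 for BB3, which match the figure captions.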


The BB3 dataset represents a relatively large microarray study with 30 microarrays. Some of the preprocessing algorithms, in particular the probe set summarization subroutines MBEI and PLIER, use complex model-based estimation that should benefit from the additional information available in a large dataset. The E-TABM-113 dataset, however, is a small study of six microarrays and may represent a common experimental design: a single-factor comparison (e.g. treatment vs. control). Comparing the performance of each preprocessing routine between the two datasets is important, as it should highlight how the number of arrays affects results and help inform the analyst about which method to use under which conditions. To compare the results from each of the preprocessing routines, a linear model (Equation 6.2) was fitted to the data within each experiment. Analysis of variance tables were computed to determine the impact that the preprocessing subroutines had on the sensitivity, specificity and FDR of SFP calls. Finally, a Tukey's HSD post hoc analysis was performed for each subroutine, with adjusted p-values (family-wise error rates), in order to compare individual algorithms within each subroutine. Figures 6.4 and 6.5 show box plots of each algorithm tested within each preprocessing subroutine in the BB3 and E-TABM-113 datasets, respectively. No significant or consistent interaction between preprocessing subroutines was observed.

\[ \text{sensitivity},\ \text{specificity},\ \text{FDR} = \text{background} + \text{normalization} + \text{MM correction} + \text{summarization} + \varepsilon \tag{6.2} \]

6.4.1 Characteristics of Preprocessing Subroutines

Preprocessing of microarray data typically involves four steps: background correction, normalization, PM probe correction and summarization. The impact that each of these steps has on the estimate of gene expression, and therefore on SFP calls, is presented in Table 6.2. Results show that the choice of background correction, mismatch probe adjustment and summarization played a significant role in all measurements, whereas normalization was significant only for sensitivity in the BB3 dataset. The choice of background correction procedure had the overall largest impact on results for the smaller E-TABM-113 dataset (a 19.5% increase in sensitivity when choosing MAS over GCRMA, a 1.1% increase in specificity when choosing MAS over none and a > 10% decrease in FDR when choosing MAS over GCRMA/RMA). Choice of probe set summarization routine had the overall largest impact on results for the BB3 dataset (a 6% increase in sensitivity when choosing MBEI over PLIER, a > 3% increase in specificity when choosing MBEI/PLIER over MAS, and a > 16% decrease in FDR when choosing MBEI/PLIER over MAS). PM correction had a significant impact on sensitivity and FDR in both datasets, and a significant but smaller impact on specificity. Apart from sensitivity in the BB3 dataset, choice of normalization routine did not appear to have a significant effect on results. It is worth noting, however, that in both datasets the raw intensity density plots (data not shown) showed very consistent profiles across arrays, indicating that the adjustment made by any normalization routine would have been minimal in these datasets.

[Figure 6.4 panels: Sensitivity (%), Specificity (%), False Sequence Polymorphism Rate (%). Box plots are grouped along the horizontal axis by Background Adjustment Method (gcrma, mas, none, rma), Normalization Method (invariantset, none, quantiles), MM Probe Adjustment Method (mas, pmonly) and Probeset Summarization Method (liwong, mas, medianpolish, plier).]

Figure 6.4: Sensitivity (in percent, higher is better), specificity (in percent, higher is better), and false discovery rate (in percent, lower is better), within each of the four preprocessing subroutines: background correction, normalization, PM probe adjustment, and probe set summarization for the BB3 dataset. Box plots show standard information (25th, 50th and 75th percentiles, min/max and outliers). In addition, the box plot notches show an approximate 95% confidence interval for the medians; if the notches of two box plots do not overlap, this is "strong evidence" that the two medians are statistically different [47].

[Figure 6.5 panels: Sensitivity (%), Specificity (%), False Sequence Polymorphism Rate (%). Box plots are grouped along the horizontal axis by Background Adjustment Method (gcrma, mas, none, rma), Normalization Method (invariantset, none, quantiles), MM Probe Adjustment Method (mas, pmonly) and Probeset Summarization Method (liwong, mas, medianpolish, plier).]

Figure 6.5: Sensitivity (in percent, higher is better), specificity (in percent, higher is better), and false discovery rate (in percent, lower is better), within each of the four preprocessing subroutines: background correction, normalization, PM probe adjustment, and probe set summarization for the E-TABM-113 dataset. Box plots show standard information (25th, 50th and 75th percentiles, min/max and outliers). In addition, the box plot notches show an approximate 95% confidence interval for the medians; if the notches of two box plots do not overlap, this is "strong evidence" that the two medians are statistically different [47].

Table 6.2: Range (best performing routine minus worst performing routine) and ANOVA results showing the impact of each of the preprocessing subroutines (background correction, normalization, MM correction and summarization) for both datasets (E-TABM-113 and BB3) on the three measures (sensitivity, specificity and FDR) when calling SFPs.

Subroutine            | Df | Statistic | E-TABM-113  | E-TABM-113  | E-TABM-113 | BB3         | BB3         | BB3
                      |    |           | Sensitivity | Specificity | FDR        | Sensitivity | Specificity | FDR
Background Adjustment | 3  | Range     | 19.5%       | 1.1%        | 11.6%      | 3.8%        | 1.5%        | 8.1%
                      |    | Pr(>F)    | < 2e-16     | 7.6e-12     | 1.1e-09    | 5.8e-4      | 6.7e-12     | 6.9e-14
Normalization         | 2  | Range     | 1.4%        | 0.2%        | 1.3%       | 2.7%        | 0.1%        | 0.3%
                      |    | Pr(>F)    | 0.53        | 0.27        | 0.7        | 2.8e-3      | 0.90        | 0.89
MM Adjustment         | 1  | Range     | 11.5%       | 0.4%        | 8.7%       | 3.8%        | 0.3%        | 2.8%
                      |    | Pr(>F)    | < 2e-16     | 2.5e-06     | 8.4e-10    | 6.3e-07     | 0.04        | 4.6e-05
Summarization Method  | 3  | Range     | 6.5%        | 1.1%        | 11%        | 6%          | 3.1%        | 17.0%
                      |    | Pr(>F)    | 1.6e-4      | 2.1e-15     | 4.1e-08    | 1.6e-07     | < 2e-16     | < 2e-16
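The Range entries of Table 6.2 can be read as the gap between the best and worst performing level of a subroutine. The sketch below shows one way such a value could be computed. It assumes that the range is the difference between the highest and lowest level means of a metric, which is an interpretation of the caption rather than a documented definition. The function name, the tuple layout and the four example results are hypothetical.

```python
from collections import defaultdict
from statistics import mean

FACTORS = ["background", "normalization", "pm_correction", "summarization"]


def factor_ranges(results):
    """Range of level means for each preprocessing subroutine.

    results maps (background, normalization, pm_correction, summarization)
    to one performance metric in percent, e.g. sensitivity.
    """
    ranges = {}
    for position, factor in enumerate(FACTORS):
        by_level = defaultdict(list)
        for pipeline, value in results.items():
            by_level[pipeline[position]].append(value)
        level_means = [mean(values) for values in by_level.values()]
        # Assumed definition: best level mean minus worst level mean.
        ranges[factor] = max(level_means) - min(level_means)
    return ranges


# Hypothetical sensitivities (%) for four pipelines.
sensitivity_by_pipeline = {
    ("mas", "none", "pmonly", "plier"): 70.0,
    ("mas", "none", "mas", "plier"): 60.0,
    ("gcrma", "none", "pmonly", "plier"): 50.0,
    ("gcrma", "none", "mas", "plier"): 40.0,
}

ranges = factor_ranges(sensitivity_by_pipeline)
print(ranges)
```

In this toy input the background level means are 65 and 45, so the background range is 20 percentage points, while the PM correction range is 10. Subroutines with a single level have a range of zero.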

6.4.2 Evaluation of Background Routines

We compared the performance characteristics (sensitivity, specificity, FDR) of four background correction algorithms for their ability to detect known SFPs: none, MAS, GCRMA and RMA. On average the MAS algorithm outperformed all other algorithms across both datasets and in all performance measurements, except for specificity in the BB3 dataset, where no background correction marginally outperformed MAS background correction (0.1% average difference). The MAS background correction algorithm significantly (p < 0.05) outperformed the GCRMA background correction algorithm (the worst performing) in all three metrics and across both datasets. Further, MAS significantly outperformed the RMA background correction algorithm in all three metrics for the E-TABM-113 dataset; in the BB3 dataset it also outperformed RMA, but not at a statistically significant level. On average, no background correction outperformed both RMA and GCRMA background corrections, with a statistically significant difference for sensitivity in the E-TABM-113 dataset; however, it was also the statistically worst algorithm for specificity in E-TABM-113.

Overall rank (averaged across both datasets):
Sensitivity: MAS (64.7%), none (62.5%), RMA (57.5%), GCRMA (50.3%)
Specificity: MAS (98.7%), RMA (98.4%), none (98.2%), GCRMA (97.9%)
FDR: MAS (11.8%), none (16.7%), RMA (18.0%), GCRMA (21.4%)

6.4.3 Evaluation of Normalization Routines

We compared three normalization algorithms for their ability to detect known SFPs: none, invariant set and quantiles. Results show that choice of normalization algorithm played no significant role in the detection of SFPs in the E-TABM-113 dataset, and produced only a slight, but significant, increase in sensitivity for the BB3 dataset when using the normalization algorithms none or quantiles rather than invariant set (average increase of 2.5%, p < 0.05).

Overall rank (averaged across both datasets):
Sensitivity: quantiles (60.3%), none (59.4%), invariant set (57.5%)
Specificity: invariant set (98.4%), none (98.3%), quantiles (98.2%)
FDR: none (16.6%), quantiles (16.9%), invariant set (16.9%)

6.4.4 Evaluation of PM Correction Routines

We compared two mismatch probe correction algorithms: PM only and the idealized mismatch routine from MAS. The PM only procedure, which is used in the PLIER, RMA, GCRMA and dChip (PM only variant) preprocessing pipelines, performed significantly (p < 0.05) better than the idealized mismatch approach from the MAS5 algorithm in all performance characteristics and in both datasets, though the effect was less pronounced in the BB3 dataset. This is not surprising, and it shows that the trend in recent preprocessing approaches of ignoring mismatch probes also improves SFP calls. It is also worth noting that the subtract MM variant of dChip, the original implementation of dChip, performed worse than the PM only variant of dChip in all cases (see supplemental tables A), except for specificity in the BB3 dataset, where the subtract MM routine performed marginally better (+0.2%).

Overall rank (averaged across both datasets):
Sensitivity: PM only (63.0%), idealized mismatch (55.3%)
Specificity: PM only (98.5%), idealized mismatch (98.1%)
FDR: PM only (13.8%), idealized mismatch (19.7%)

6.4.5 Evaluation of Probe Set Summarization Routines

We compared the performance characteristics (sensitivity, specificity, FDR) of four probe set summarization algorithms: median polish from the RMA/GCRMA procedure, PLIER, MBEI and MAS5. More so than for any of the previous three preprocessing subroutines, the performance of the summarization algorithms varied across the two datasets. In the smaller E-TABM-113 dataset the median polish algorithm performed best for specificity and FDR, statistically outperforming both MBEI and MAS. Median polish ranked second in sensitivity, statistically outperforming only PLIER. The MBEI algorithm provided the best sensitivity in the E-TABM-113 dataset, statistically outperforming both MAS and PLIER. In the BB3 dataset, however, the MBEI algorithm performed best in both sensitivity (statistically better than PLIER and MAS) and FDR (statistically better than median polish and MAS). PLIER marginally outperformed MBEI in specificity, but not at a statistically significant level. In general, MAS performed poorly and median polish performed consistently well across all three measures. Overall, MBEI had the highest sensitivity, PLIER provided the best specificity and median polish the lowest FDR. That the MAS5 algorithm performed poorly is not unexpected [32]; its biggest flaw is that it does not account for a "probe" effect, assuming instead that all probes have, on average, similar hybridization efficiencies. That MBEI performed poorly on the smaller E-TABM-113 dataset, with a significantly higher FDR, is also expected: Li and Wong state in their original papers that at least 10 arrays are needed to properly estimate the model parameters [43, 42].

Overall rank (averaged across both datasets):
Sensitivity: MBEI (62.3%), median polish (60.1%), MAS (58.0%), PLIER (56.0%)
Specificity: PLIER (99.0%), median polish (98.8%), MBEI (98.5%), MAS (96.9%)
FDR: median polish (13.1%), PLIER (13.4%), MBEI (15.4%), MAS (25.2%)

6.4.6 The Absolute Best Pipeline

In the E-TABM-113 dataset, the best pipeline for sensitivity was no background correction, no normalization and PM only, followed by median polish. This pipeline also had a good balance of sensitivity (63.4%), specificity (99.3%) and FDR (7.43%). For specificity, the overall best pipeline was MAS background correction, any normalization method and idealized mismatch correction, followed by PLIER summarization (99.5%). For FDR, the overall best choice was MAS background correction, no normalization and idealized mismatch correction, followed by PLIER summarization (5.75%). While this particular pipeline provided the absolute best specificity (also 99.5%) and FDR, it also produced a significantly worse sensitivity (55.6%).

In the BB3 dataset, multiple pipelines performed equally well for sensitivity: GCRMA background correction, quantiles normalization and PM only, followed by dChip's MBEI summarization algorithm; or GCRMA, any normalization and PM only, followed by median polish. All of these had a sensitivity of 77.8%. While these procedures produced the best sensitivity, they performed poorly in both specificity and FDR. For specificity, the pipeline of RMA background correction, no normalization (or quantiles) and idealized mismatch correction, followed by PLIER summarization, produced the overall best performance (99.6%). For FDR, the overall best performing pipeline was RMA background correction, no normalization and PM probes only, followed by MBEI summarization (5.67%). This pipeline also produced a good overall balance of sensitivity, specificity and FDR (76.6%, 99.3%, 5.67%, respectively); however, the overall best balanced pipeline was RMA background correction, invariant set normalization and PM probes only, followed by MBEI summarization (77.4%, 99.3%, 6.09%, respectively).

6.5 Discussion

Previous methods for evaluating preprocessing routines use direct comparisons of estimated expression levels from the microarray to expected expression levels determined with either spike-in experiments or qRT-PCR. These methods for determining expected expression levels are typically based on only a small number of genes and contain their own sources of technical error. Genetic variation between genotypes, however, is a discrete and absolute feature that produces a consistent and detectable difference in hybridization efficiency within a probe on a microarray, called a single feature polymorphism. By statistical comparison of probe signal intensity values, after correcting for mRNA expression values, we can predict candidate regions of genetic variation between samples. Accurate estimation of sample mRNA expression levels is therefore the most important factor in the successful determination of SFPs. Here we report a new method for evaluating preprocessing routines by their ability to detect single feature polymorphisms. We propose that the preprocessing pipeline that most successfully predicts SFPs, with high sensitivity and specificity and a correspondingly low FDR, will also be the pipeline that best estimates expression.

In general, Affymetrix's MAS summarization performed poorly compared to the other algorithms, particularly in the larger BB3 dataset. The GCRMA algorithm performed significantly worse in the smaller E-TABM-113 dataset, relative both to the other algorithms and to its own performance in the larger BB3 dataset. GCRMA is a model-based algorithm that would appear to need a large number of samples to accurately estimate background and non-specific binding. In this experiment, choice of normalization algorithm did not play a significant role in sensitivity, specificity or FDR. However, this is likely due to the high level of consistency in the signal intensity profiles across arrays, which left little for normalization to correct. Finally, use of MM probes in general, as in the idealized mismatch routine, reduced the sensitivity to detect SFPs.

Results show that the choice of pipeline, and of individual algorithms, had a much larger effect on results for the smaller E-TABM-113 dataset, where choice of background routine significantly impacted sensitivity and FDR and choice of summarization routine significantly impacted FDR. In the larger BB3 dataset, avoiding MAS summarization produced the largest impact on results. This is expected, since more data from a larger dataset will usually produce higher confidence in the results from all algorithms and make the analysis more robust to the underlying preprocessing choices.

These results also show that better results can be achieved by mixing and matching the subroutines from the different established pipelines. Of the five standard pipelines described above, only the GCRMA pipeline gave an overall best performance, for sensitivity in BB3, but it had a correspondingly poor result in both specificity and FDR. Of the five established pipelines, the RMA pipeline (RMA background correction, quantiles normalization, PM only, median polish summarization) provided the most consistent and balanced result, producing high sensitivity and specificity with low FDR across both datasets.

These results provide a different perspective on preprocessing evaluation: how well a routine performs when used to predict the presence of genetic polymorphisms. Further, since the calculation used to determine SFPs depends primarily on the estimated gene expression value, we suggest that improved SFP calls imply better expression estimates. Differential expression results were most strongly (and positively) correlated with the number of SFPs and SFP-containing genes. In the smaller E-TABM-113 dataset, the number of differentially expressed genes was highly correlated with sensitivity (r = 0.65). Conversely, in the larger BB3 dataset, the number of differentially expressed genes was highly correlated with both specificity and FDR (r = −0.72 and r = 0.74, respectively). This indicates that in smaller datasets the choice of preprocessing routine will mostly impact the sensitivity of detection of differentially expressed genes, whereas in larger datasets it will impact the specificity and FDR of differentially expressed genes. These results can be used to assist the analyst in deciding the most appropriate method for their specific application.
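The correlations quoted above relate, across pipelines, the number of differentially expressed genes to a performance measure. A minimal sketch of that computation follows, assuming a Pearson correlation with one value pair per pipeline. The function name and all numbers are hypothetical and are not taken from the supplemental tables.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical per-pipeline values (one entry per preprocessing pipeline).
de_genes = [400, 600, 800, 1000]        # differentially expressed genes
specificity = [99.5, 99.1, 98.4, 98.0]  # SFP specificity (%)
fdr = [5.0, 9.5, 12.5, 17.0]            # SFP false discovery rate (%)

r_specificity = pearson_r(de_genes, specificity)
r_fdr = pearson_r(de_genes, fdr)
print(round(r_specificity, 3), round(r_fdr, 3))
```

In this toy input, pipelines that call more differentially expressed genes have lower SFP specificity and higher SFP FDR, so the first coefficient is negative and the second positive. That is the same sign pattern reported for the BB3 dataset.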

Chapter 7 DISCUSSION AND FUTURE WORK

As pointed out in the introduction, studies in ecology and conservation would benefit from simultaneous measurement of gene expression and genetic diversity. Typically this is done using two different assays: microarrays for gene expression and sequencing (or genotyping) for genetic diversity. With short-oligonucleotide microarrays there is the potential to assay both gene expression and genetic diversity simultaneously on the same platform. Discovery of candidate genetic polymorphisms (termed single feature polymorphisms) in microarray data has been performed in numerous studies [55, 14, 19, 56, 18, 41, 39, 21, 75]. Discovery of single feature polymorphisms in the presence of gene expression is difficult, and requires the algorithm to accurately account for differences in expression levels between samples. Chapter 4 introduces a novel algorithm that effectively accounts for gene expression differences for improved detection of SFPs. Further, genetic polymorphisms can affect the accurate estimation of expression levels, which can in turn affect the reliability of single feature polymorphism detection, producing a chicken-and-egg problem. Chapter 5 outlines a procedure that addresses this issue. The procedure performs a preliminary prediction of single feature polymorphisms, masks those candidate SFPs from the dataset and re-evaluates the expression estimates. The results show that this procedure allows for a more accurate measurement of both gene expression and genetic diversity on the same platform. Finally, Chapter 6 shows how a database of known variants and prediction of SFPs can be used to evaluate microarray preprocessing routines.
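The effect of masking candidate SFP probes before re-estimating expression can be illustrated with a toy sketch. It is not the Chapter 5 procedure: a per-array median of log2 probe intensities stands in for the pipeline's summarization routine, the set of masked probes is assumed to come from a preliminary SFP call, and all names and numbers are hypothetical.

```python
from statistics import median


def expression_estimate(log2_intensity, masked=frozenset()):
    """Per-array median of log2 probe intensities, skipping masked probes.

    log2_intensity: one row per probe, one column per array.
    masked: indices of probes flagged as candidate SFPs (assumed given).
    """
    kept = [row for p, row in enumerate(log2_intensity) if p not in masked]
    # zip(*kept) iterates over arrays (columns) of the remaining probes.
    return [median(column) for column in zip(*kept)]


# Hypothetical probe set: arrays 1-2 are one genotype, arrays 3-4 the other.
probes = [
    [8.0, 8.0, 8.0, 8.0],
    [8.5, 8.5, 8.5, 8.5],
    [9.5, 9.5, 6.5, 6.5],  # candidate SFP: weak signal on arrays 3-4
]

before = expression_estimate(probes)
after = expression_estimate(probes, masked={2})
print(before, after)
```

Before masking, the SFP probe pulls the estimate for arrays 3 and 4 down by half a log2 unit, which could be mistaken for differential expression. After masking, all four arrays receive the same estimate.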

7.1 Summary

The results presented in this dissertation provide a complete approach to dealing with genetic polymorphisms (single feature polymorphisms) in short-oligonucleotide microarray data:

• Chapter 4 presented an improved algorithm for the detection of SFPs. We evaluated the new algorithm using data from two publicly available Affymetrix® Barley (Hordeum vulgare) microarray datasets and compared it to two previously published SFP detection algorithms. Algorithmic performance was measured by sensitivity, specificity and false discovery rate, determined using a reference database of known genetic polymorphisms. Results show that our algorithm outperformed both previously published algorithms in all performance characteristics but one in one of the datasets (sensitivity in the BB3 dataset).

• In Chapter 5 we evaluated the impact of genetic polymorphisms on differential expression analysis and SFP detection. We quantified the false positive rate and false negative rate of differentially expressed genes with SFP probes present compared to when SFP probes have been masked out (considering the masked dataset as "truth"). Results show the false positive rate to be significantly increased by the presence of SFPs for both differential expression analysis and SFP detection (14% and 5%, respectively). Further, we provide a procedure for masking out candidate SFPs prior to differential expression analysis and final SFP prediction. The procedure does not rely on having a reference database of known genetic polymorphisms and can be applied to any microarray dataset where comparisons are being made between populations.

• In Chapter 6 we explored the use of known genetic polymorphisms as a means to evaluate Affymetrix® microarray preprocessing pipelines. Previous studies show that the choice of microarray preprocessing pipeline has a significant impact on experimental results. Successful detection of SFPs relies most heavily on the accurate estimation of gene expression, and therefore the pipeline that best predicts SFPs implies better expression estimates. We tested five major preprocessing pipelines (RMA, GCRMA, dChip, MAS5 and PLIER) and all possible combinations of their background correction, normalization, mismatch probe correction and summarization routines (96 generated preprocessing pipelines in total) for their ability to detect single feature polymorphisms in two Barley microarray datasets. The important findings were: GCRMA performed poorly when used in a small experiment, MAS5 performed poorly in general, and the RMA preprocessing pipeline had the overall most consistent results for sensitivity, specificity and FDR across both datasets.

Together these results and algorithms provide researchers with new tools and techniques when using Affymetrix® 3' IVT microarrays. Specifically, this dissertation provides new algorithms for improved SFP detection and SFP correction in gene expression microarray studies and provides a new method to evaluate the performance characteristics of Affymetrix® microarray preprocessing steps. These results provide ecologists with new tools and techniques when using Affymetrix® microarrays in ecological gene expression microarray studies.

7.2 Applications

Predicted single feature polymorphisms can be used as candidate genetic markers. Genetic markers are loci associated with a trait. They can be used to type individuals, such as in a breeding program, or as evidence to help narrow in on a genomic region of importance. Using the same algorithm described in Chapter 4, Coram et al. found SFPs (genetic markers) and gene expression markers (GEMs) between two recombinant inbred wheat lines, one of which contained a stripe rust resistance gene of unknown location [18].


West et al. were able to produce a high-density haplotype map between two lines of Arabidopsis thaliana [70]. Other examples include [15, 38, 60, 28, 53].

In expression quantitative trait locus (eQTL) studies, researchers seek to identify genetic polymorphisms that regulate expression levels. The methods described in Chapters 4 and 5 will allow researchers to conduct eQTL studies without the need for a separate genotyping assay. DeCook et al. performed an eQTL study in Arabidopsis thaliana in which single feature polymorphisms provided the data for genetic markers [22]. Alberts et al. further showed that genetic polymorphisms may generate a high number of false cis eQTLs [4]. The procedure described in Chapter 5 would help to curtail this particular issue.

Single feature polymorphisms can also be used to estimate evolutionary parameters. Single feature polymorphisms are a type of genotype data, and therefore many analyses that require genotyping data can be modified to work with SFPs. For example, Jiang et al. used single feature polymorphism data to estimate the mutation and recombination parameters in an Arabidopsis thaliana population [35].

More accurate estimates of both single feature polymorphisms and expression levels reduce the noise in the above applications and provide more biologically meaningful results with a reduced false positive rate.

7.3 Future Work

This work presents the utility of single feature polymorphisms in Affymetrix®

microarray studies with two populations. However, studies can have more than two populations, or no defined populations (genetically diverse samples). We would like to detect single feature polymorphisms in these studies as well. The approach presented here conducts pair-wise comparisons from a linear model, which will work for studies with more than two populations, but not those with no defined populations. When a study contains more than two populations, additional processing is needed to assign a genotype to each population and determine which populations are from the same genotype. In the case of no defined populations, one potential approach, that would also work in the general


case, would be to use a clustering technique such as k-means clustering instead of linear model statistical tests. Significant separation of the resulting clusters would indicate the presence of an SFP. Samples would then be assigned a genotype by cluster membership.

NimbleGen® microarrays are another microarray platform worth exploring in the context of single feature polymorphisms. This platform offers the significant advantage of being cheaper than Affymetrix when designing custom microarrays. NimbleGen offers the ability to design microarrays, in a cost effective manner, for organisms for which there is no Affymetrix® microarray. In addition, we can generate microarrays specific to an experimental population. Finally, the NimbleGen platform allows for a flexible probe length. Whereas the Affymetrix® platform is fixed at 25mer probes, NimbleGen’s manufacturing processes allow for probes as small as 25mers and as large as 70mers. Probes could then be designed specifically for either SFP detection (short probes) or expression analysis (long probes).
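
The clustering idea proposed above for studies without defined populations can be illustrated with a minimal sketch. It is not an implementation from this work: it assumes that each probe has already been reduced to one value per sample (for example, a probe-level residual after the gene-level expression estimate is removed), it uses a plain one-dimensional 2-means clustering, and the function names, the separation statistic and both thresholds are hypothetical choices made only for illustration.

```python
from statistics import mean


def two_means_1d(values, max_iter=100):
    """Plain k-means with k=2 on one-dimensional data; returns a 0/1 label per value."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # No variation at all: everything belongs to a single cluster.
        return [0] * len(values)
    centers = [lo, hi]  # start from the extremes so both clusters are non-empty
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in values
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = mean(members)
    return labels


def separation(values, labels):
    """Gap between the cluster means divided by the pooled within-cluster spread."""
    a = [v for v, lab in zip(values, labels) if lab == 0]
    b = [v for v, lab in zip(values, labels) if lab == 1]
    if not a or not b:
        return 0.0
    mean_a, mean_b = mean(a), mean(b)
    ss = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
    spread = (ss / len(values)) ** 0.5
    gap = abs(mean_b - mean_a)
    return float("inf") if spread == 0 else gap / spread


def call_sfp(values, min_separation=6.0, min_cluster_size=2):
    """Return (is_sfp, labels); both thresholds are arbitrary illustration values."""
    labels = two_means_1d(values)
    sizes = (labels.count(0), labels.count(1))
    is_sfp = (
        min(sizes) >= min_cluster_size
        and separation(values, labels) >= min_separation
    )
    return is_sfp, labels


if __name__ == "__main__":
    # Hypothetical probe residuals for eight samples with no population labels.
    residuals = [0.1, -0.05, 0.0, 0.08, -1.9, -2.1, -2.0, -1.95]
    print(call_sfp(residuals))
```

Because k-means always returns two clusters, even for pure noise, the separation check is what stands in for the significance test in this sketch; a real analysis would need a calibrated statistic and a multiple-testing correction across probes.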


REFERENCES

[1] Affymetrix. GeneChip Data Analysis Fundamentals Manual, 2004.
[2] Affymetrix. GeneChip Expression Analysis Technical Manual, 2004.
[3] Affymetrix. Guide to Probe Logarithmic Intensity Error (PLIER) Estimation. Affymetrix Technical Note, 2005.
[4] Rudi Alberts, Peter Terpstra, Yang Li, Rainer Breitling, Jan-Peter Nap, and Ritsert C Jansen. Sequence polymorphisms cause many false cis eQTLs. PloS one, 2(7):e622, January 2007.
[5] David B Allison, Xiangqin Cui, Grier P Page, and Mahyar Sabripour. Microarray data analysis: from disarray to consolidation and consensus. Nature reviews. Genetics, 7(1):55–65, January 2006.
[6] Jill T. Anderson and Thomas Mitchell-Olds. Ecological genetics and genomics of plant defences: evidence and approaches. Functional Ecology, pages no–no, September 2010.
[7] Leonard H. Augenlicht and Diane Kobrin. Cloning and Screening of Sequences Expressed in a Mouse Colon Tumor. Cancer Res., 42(3):1088–1093, 1982.
[8] Line K Bay, Karin E Ulstrup, H Bjørn Nielsen, Hanne Jarmer, Nicolas Goffard, Bette L Willis, David J Miller, and Madeleine J H Van Oppen. Microarray analysis reveals transcriptional plasticity in the reef building coral Acropora millepora. Molecular ecology, 18(14):3062–75, July 2009.
[9] Y Benjamini and Y Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B, 57:289–300, 1995.
[10] David Benovoy, Tony Kwan, and Jacek Majewski. Effect of polymorphisms within probe-target sequences on oligonucleotide microarray experiments. Nucleic Acids Research, 36(13):4417–4423, 2008.
[11] Amy N Bernardo, Peter J Bradbury, Hongxiang Ma, Shengwa Hu, Robert L Bowden, Edward S Buckler, and Guihua Bai. Discovery and mapping of single feature polymorphisms in wheat using Affymetrix arrays. BMC Genomics, 10:251, 2009.
[12] B M Bolstad, R A Irizarry, M Astrand, and T P Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185–193, 2003.
[13] Ben Bolstad. Probe level quantile normalization of high density oligonucleotide arrays. Unpublished manuscript, 2001.
[14] Justin O Borevitz, David Liang, David Plouffe, Hur-Song Chang, Tong Zhu, Detlef Weigel, Charles C Berry, Elizabeth Winzeler, and Joanne Chory. Large-scale identification of single-feature polymorphisms in complex genomes. Genome Research, 13(3):513–523, March 2003.
[15] Liam H Childs, Hanna Witucka-Wall, Torsten Günther, Ronan Sulpice, Maria V Korff, Mark Stitt, Dirk Walther, Karl J Schmid, and Thomas Altmann. Single feature polymorphism (SFP)-based selective sweep identification and association mapping of growth-related metabolic traits in Arabidopsis thaliana. BMC Genomics, 11(1):188, 2010.
[16] Timothy J Close, Steve I Wanamaker, Rico A Caldo, Stacy M Turner, Daniel A Ashlock, Julie A Dickerson, Rod A Wing, Gary J Muehlbauer, Andris Kleinhofs, and Roger P Wise. A New Resource for Cereal Genomics: 22K Barley GeneChip Comes of Age. Plant Physiology, 134(3):960–968, 2004.
[17] Tristan E Coram, Matthew L Settles, and Xianming Chen. Large-scale analysis of antisense transcription in wheat using the Affymetrix GeneChip Wheat Genome Array. BMC Genomics, 10:253, 2009.
[18] Tristan E Coram, Matthew L Settles, Meinan Wang, and Xianming Chen. Surveying expression level polymorphism and single-feature polymorphism in near-isogenic wheat lines differing for the Yr5 stripe rust resistance locus. TAG. Theoretical and applied genetics. Theoretische und angewandte Genetik, 117(3):401–411, August 2008.
[19] Xinping Cui, Jin Xu, Rehana Asghar, Pascal Condamine, Jan T Svensson, Steve Wanamaker, Nils Stein, Mikeal Roose, and Timothy J Close. Detecting single-feature polymorphisms using oligonucleotide arrays and robustified projection pursuit. Bioinformatics, 21(20):3852–3858, 2005.
[20] Xinping Cui, Na You, Thomas Girke, Richard Michelmore, and Allen Van Deynze. Single feature polymorphism detection using recombinant inbred line microarray expression data. Bioinformatics, 26(16):1983–1989, 2010.
[21] Sayan Das, Prasanna R Bhat, Chinta Sudhakar, Jeffrey D Ehlers, Steve Wanamaker, Philip A Roberts, Xinping Cui, and Timothy J Close. Detection and validation of single feature polymorphisms in cowpea (Vigna unguiculata L. Walp) using a soybean genome array. BMC Genomics, 9(1):107, 2008.
[22] Rhonda DeCook, Sonia Lall, Dan Nettleton, and Stephen H Howell. Genetic regulation of gene expression during shoot development in Arabidopsis. Genetics, 172(2):1155–1164, February 2006.
[23] Patrik D’haeseleer, Shoudan Liang, and Roland Somogyi. Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics, 16(8):707–726, 2000.
[24] Arnis Druka, Gary Muehlbauer, Ilze Druka, Rico Caldo, Ute Baumann, Nils Rostoks, Andreas Schreiber, Roger Wise, Timothy Close, Andris Kleinhofs, Andreas Graner, Alan Schulman, Peter Langridge, Kazuhiro Sato, Patrick Hayes, Jim McNicol, David Marshall, and Robbie Waugh. An atlas of gene expression from seed to seed through barley development. Functional & integrative genomics, 6(3):202–11, July 2006.
[25] Laurent Gautier, Leslie Cope, Benjamin M Bolstad, and Rafael A Irizarry. affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20(3):307–315, 2004.
[26] Robert C Gentleman, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, Laurent Gautier, Yongchao Ge, Jeff Gentry, Kurt Hornik, Torsten Hothorn, Wolfgang Huber, Stefano Iacus, Rafael Irizarry, Friedrich Leisch, Cheng Li, Martin Maechler, Anthony J Rossini, Gunther Sawitzki, Colin Smith, Gordon Smyth, Luke Tierney, Jean YH Yang, and Jianhua Zhang. Bioconductor: open software development for computational biology and bioinformatics. Genome Biology, 5(10):R80, 2004.
[27] Greg Gibson. Microarrays in ecology and evolution: a preview. Molecular ecology, 11(1):17–24, January 2002.
[28] P K Gupta, S Rustgi, and R R Mir. Array-based high-throughput DNA markers for crop improvement. Heredity, 101(1):5–18, July 2008.
[29] Balazs Gyorffy, Bela Molnar, Hermann Lage, Zoltan Szallasi, and Aron C Eklund. Evaluation of microarray preprocessing algorithms based on concordance with RT-PCR in clinical samples. PloS one, 4(5):e5645, January 2009.
[30] Youko Horiuchi, Yoshiaki Harushima, Hironori Fujisawa, Takako Mochizuki, Masanori Kawakita, Takayuki Sakaguchi, and Nori Kurata. A simple optimization can improve the performance of single feature polymorphism detection by Affymetrix expression arrays. BMC Genomics, 11:315, 2010.
[31] R A Irizarry, B Hobbs, F Collin, Y D Beazer-Barclay, K J Antonellis, U Scherf, and T P Speed. Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics, 4(2):249–264, 2003.
[32] Rafael A Irizarry, Benjamin M Bolstad, Francois Collin, Leslie M Cope, Bridget Hobbs, and Terence P Speed. Summaries of Affymetrix GeneChip probe level data. Nucl. Acids Res., 31(4):e15–, 2003.
[33] Rafael A Irizarry, Zhijin Wu, and Harris A Jaffee. Comparison of Affymetrix GeneChip expression measures. Bioinformatics, 22(7):789–794, 2006.
[34] Hongying Jiang, Ming Yi, Jianbing Mu, Louie Zhang, Al Ivens, Leszek J Klimczak, Yentram Huyen, Robert M Stephens, and Xin-Zhuan Su. Detection of genome-wide polymorphisms in the AT-rich Plasmodium falciparum genome using a high-density microarray. BMC Genomics, 9:398, 2008.
[35] Rong Jiang, Paul Marjoram, Justin O Borevitz, and Simon Tavaré. Inferring Population Parameters From Single-Feature Polymorphism Data. Genetics, 173(4):2257–2267, 2006.
[36] Jan E Kammenga, Michael A Herman, N Joop Ouborg, Loretta Johnson, and Rainer Breitling. Microarray challenges in ecology. Trends in ecology & evolution (Personal edition), 22(5):273–9, May 2007.
[37] Christina Kendziorski and Ping Wang. A review of statistical methods for expression quantitative trait loci mapping. Mammalian genome : official journal of the International Mammalian Genome Society, 17(6):509–17, June 2006.
[38] Sung Kim, Keyan Zhao, Rong Jiang, John Molitor, Justin O Borevitz, Magnus Nordborg, and Paul Marjoram. Association Mapping With Single-Feature Polymorphisms. Genetics, 173(2):1125–1133, 2006.
[39] Sung-Hyun Kim, Prasanna R Bhat, Xinping Cui, Harkamal Walia, Jin Xu, Steve Wanamaker, Abdelbagi M Ismail, Clyde Wilson, and Timothy J Close. Detection and validation of single feature polymorphisms using RNA expression data from a rice genome array. BMC Plant Biology, 9(1):65, 2009.
[40] D. A. Kulesh. Identification of Interferon-Modulated Proliferation-Related cDNA Sequences. Proceedings of the National Academy of Sciences, 84(23):8453–8457, December 1987.
[41] Rajesh Kumar, Jing Qiu, Trupti Joshi, Babu Valliyodan, Dong Xu, and Henry T Nguyen. Single Feature Polymorphism Discovery in Rice. PLoS ONE, 2(3):9, 2007.
[42] C Li and W H Wong. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology, 2:1–11, 2001.
[43] Cheng Li and Wing Hung Wong. Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. PNAS, 98(1):31–36, 2001.
[44] R Lipshutz, S Fodor, T Gingeras, and D Lockhart. High Density Synthetic Oligonucleotide Arrays. Nature Genetics, pages 20–24, 1999.
[45] Z W Luo, E Potokina, A Druka, R Wise, R Waugh, and M J Kearsey. SFP Genotyping From Affymetrix Arrays Is Robust But Largely Detects Cis-acting Expression Regulators. Genetics, 176(2):789–800, 2007.
[46] MAQC Consortium. MAQC-II: analyze that! Nature biotechnology, 28(8):761, August 2010.
[47] Robert McGill, John W Tukey, and Wayne A Larsen. Variations of Box Plots. American Statistician, 32(1):12–16, 1978.
[48] Jakub Mieczkowski, Magdalena E Tyburczy, Michal Dabrowski, and Piotr Pokarowski. Probe set filtering increases correlation between Affymetrix GeneChip and qRT-PCR expression measurements. BMC Bioinformatics, 11(1):104, 2010.
[49] Jeanine S Morey, James C Ryan, and Frances M Van Dolah. Microarray validation: factors influencing correlation between oligonucleotide microarrays and real-time PCR. Biological procedures online, 8(1):175–193, 2006.
[50] Frederick Mosteller and John W Tukey. Data Analysis and Regression: A Second Course in Statistics, volume 1 of Addison-Wesley Series in Behavioral Science. Addison-Wesley, Reading Mass, 1977.
[51] Marjorie F Oleksiak, Gary A Churchill, and Douglas L Crawford. Variation in gene expression within and among natural populations. Nature Genetics, 32(2):261–266, 2002.
[52] Craig Parman and Conrad Halling. affyQCReport: A Package to Generate QC Reports for Affymetrix Array Data. October, 2009.
[53] Elena Potokina, Arnis Druka, and Michael J Kearsey. Transcript profiling and expression level mapping. Methods In Molecular Biology Clifton Nj, 513:81–92, 2009.
[54] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2010.
[55] James Ronald, Joshua M Akey, Jacqueline Whittle, Erin N Smith, Gael Yvert, and Leonid Kruglyak. Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Research, 15(2):284–291, 2005.
[56] Nils Rostoks, Justin O Borevitz, Peter E Hedley, Joanne Russell, Sharon Mudie, Jenny Morris, Linda Cardle, David F Marshall, and Robbie Waugh. Single-feature polymorphism discovery in the barley transcriptome. Genome Biology, 6(6):R54, 2005.
[57] Alex Sanchez-Pla, Miquel Salicru, and Jordi Ocanya. Distance based Inference for Gene-Ontology Analysis of Microarray Experiments. Networks, (July 2006), 2006.
[58] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown. Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray. Science, 270(5235):467–470, October 1995.
[59] Leming Shi, Laura H Reid, Wendell D Jones, Richard Shippy, Janet A Warrington, Shawn C Baker, Patrick J Collins, Francoise De Longueville, Ernest S Kawasaki, Kathleen Y Lee, Yuling Luo, Yongming Andrew Sun, James C Willey, Robert A Setterquist, Gavin M Fischer, Weida Tong, Yvonne P Dragan, David J Dix, Felix W Frueh, Frederico M Goodsaid, Damir Herman, Roderick V Jensen, Charles D Johnson, Edward K Lobenhofer, Raj K Puri, Uwe Scherf, Jean Thierry-Mieg, Charles Wang, Mike Wilson, Paul K Wolber, Lu Zhang, Shashi Amur, Wenjun Bao, Catalin C Barbacioru, Anne Bergstrom Lucas, Vincent Bertholet, Cecilie Boysen, Bud Bromley, Donna Brown, Alan Brunner, Roger Canales, Xiaoxi Megan Cao, Thomas A Cebula, James J Chen, Jing Cheng, Tzu-Ming Chu, Eugene Chudin, John Corson, J Christopher Corton, Lisa J Croner, Christopher Davies, Timothy S Davison, Glenda Delenstarr, Xutao Deng, David Dorris, Aron C Eklund, Xiao-Hui Fan, Hong Fang, Stephanie Fulmer-Smentek, James C Fuscoe, Kathryn Gallagher, Weigong Ge, Lei Guo, Xu Guo, Janet Hager, Paul K Haje, Jing Han, Tao Han, Heather C Harbottle, Stephen C Harris, Eli Hatchwell, Craig A Hauser, Susan Hester, Huixiao Hong, Patrick Hurban, Scott A Jackson, Hanlee Ji, Charles R Knight, Winston P Kuo, J Eugene LeClerc, Shawn Levy, Quan-Zhen Li, Chunmei Liu, Ying Liu, Michael J Lombardi, Yunqing Ma, Scott R Magnuson, Botoul Maqsodi, Tim McDaniel, Nan Mei, Ola Myklebost, Baitang Ning, Natalia Novoradovskaya, Michael S Orr, Terry W Osborn, Adam Papallo, Tucker A Patterson, Roger G Perkins, Elizabeth H Peters, Ron Peterson, Kenneth L Philips, P Scott Pine, Lajos Pusztai, Feng Qian, Hongzu Ren, Mitch Rosen, Barry A Rosenzweig, Raymond R Samaha, Mark Schena, Gary P Schroth, Svetlana Shchegrova, Dave D Smith, Frank Staedtler, Zhenqiang Su, Hongmei Sun, Zoltan Szallasi, Zivana Tezak, Danielle Thierry-Mieg, Karol L Thompson, Irina Tikhonova, Yaron Turpaz, Beena Vallanat, Christophe Van, Stephen J Walker, Sue Jane Wang, Yonghong Wang, Russ Wolfinger, Alex Wong, Jie Wu, Chunlin Xiao, Qian Xie, Jun Xu, Wen Yang, Liang Zhang, Sheng Zhong, Yaping Zong, and William Slikker. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology, 24(9):1151–1161, 2006.
[60] Sung-Chur Sim, Matthew D Robbins, Charles Chilcott, Tong Zhu, and David M Francis. Oligonucleotide array discovery of polymorphisms in cultivated tomato (Solanum lycopersicum L.) reveals patterns of SNP variation associated with breeding. BMC Genomics, 10:466, 2009.
[61] Gordon K Smyth. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3(1):Article3, 2004.
[62] Gordon K Smyth. Limma: linear models for microarray data. In R Gentleman, V Carey, S Dudoit, R Irizarry, and W Huber, editors, Bioinformatics and Computational Biology Solutions using R and Bioconductor, number October, pages 397–420. Springer, 2005.
[63] A Subramanian, P Tamayo, and V Mootha. Gene set enrichment analysis: a knowledge-based approach for interpreting, 2005.
[64] J W Tukey. Exploratory Data Analysis. Addison-Wesley, 1977.
[65] Virginia Goss Tusher, Robert Tibshirani, and Gilbert Chu. Significance analysis of microarrays applied to the ionizing radiation response. PNAS, 98(9):5116–5121, 2001.
[66] Harkamal Walia, Clyde Wilson, Pascal Condamine, Abdelbagi M Ismail, Jin Xu, Xinping Cui, and Timothy J Close. Array-based genotyping and expression analysis of barley cv. Maythorpe and Golden Promise. BMC Genomics, 8:87, 2007.
[67] Nicole A R Walter, Shannon K McWeeney, Sandra T Peters, John K Belknap, Robert Hitzemann, and Kari J Buck. SNPs matter: impact on detection of differential expression., 2007.
[68] Minghui Wang, Xiaohua Hu, Gang Li, Lindsey J Leach, Elena Potokina, Arnis Druka, Robbie Waugh, Michael J Kearsey, and Zewei Luo. Robust Detection and Genotyping of Single Feature Polymorphisms from Gene Expression Data. PLoS Computational Biology, 5(3):10, 2009.
[69] Thomas Werner. Bioinformatics applications for pathway analysis of microarray data. Current Opinion in Biotechnology, 19(1):50–54, 2008.
[70] Marilyn A L West, Hans van Leeuwen, Alexander Kozik, Daniel J Kliebenstein, R W Doerge, Dina A St Clair, and Richard W Michelmore. High-density haplotyping with microarray-based expression and single feature polymorphism markers in Arabidopsis. Genome Research, 16(6):787–795, June 2006.
[71] F Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
[72] Elizabeth A Winzeler, Cristian I Castillo-Davis, Guy Oshiro, David Liang, Daniel R Richards, Yingyao Zhou, and Daniel L Hartl. Genetic diversity in yeast assessed with whole-genome oligonucleotide arrays. Genetics, 163(1):79–89, 2003.
[73] Zhijin Wu, Rafael A Irizarry, Robert Gentleman, Francisco Martinez-Murillo, and Forrest Spencer. A Model-Based Background Adjustment for Oligonucleotide Expression Arrays. Journal of the American Statistical Association, 99:909–917, December 2004.
[74] Wayne Wenzhong Xu, Seungho Cho, S Samuel Yang, Yung-Tsi Bolon, Hatice Bilgic, Haiyan Jia, Yanwen Xiong, and Gary J Muehlbauer. Single-feature polymorphism discovery by computing probe affinity shape powers. BMC Genet, 10:48, 2009.
[75] S S Yang, W W Xu, M Tesfaye, J F S Lamb, Hans-Joachim G Jung, D A Samac, C P Vance, and J W Gronwald. Single-Feature Polymorphism Discovery in the Transcriptome of Tetraploid Alfalfa. The Plant Genome, 2(3):224–232, 2009.
[76] Bin Zhang and Steve Horvath. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol, 4:Article17, 2005.
[77] Yongqing Zhang, Antonio Ferreira, Cheng Cheng, Yongchun Wu, and Jiong Zhang. Modeling oligonucleotide microarray signals. Applied Bioinformatics, 5(3):151–160, 2006.


Appendix A

EXPERIMENTAL RESULTS

This appendix contains tables of the complete results from the experiments performed in Chapter 6 on the effect of SFPs on preprocessing results.
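
As a reading aid for the Sensitivity, Specificity and FDR columns in the tables of this appendix, the short sketch below shows the standard confusion-matrix definitions of these three rates. It assumes the usual definitions, with probes in the sequence confirmation dataset counted as true or false positives and negatives; the function name and the counts in the example are hypothetical and are not taken from the experiments.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, fdr) from confusion-matrix counts.

    tp: known polymorphic probes that were called SFP
    fp: known non-polymorphic probes that were called SFP
    tn: known non-polymorphic probes that were not called
    fn: known polymorphic probes that were not called
    """
    sensitivity = tp / (tp + fn)   # fraction of true SFPs that were detected
    specificity = tn / (tn + fp)   # fraction of non-SFP probes correctly left uncalled
    fdr = fp / (tp + fp)           # fraction of SFP calls that are false
    return sensitivity, specificity, fdr


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    sens, spec, fdr = detection_metrics(tp=70, fp=10, tn=890, fn=30)
    print(f"sensitivity={sens:.1%} specificity={spec:.1%} FDR={fdr:.1%}")
```

Note that specificity can stay close to 100% while the FDR is large, because non-polymorphic probes greatly outnumber polymorphic ones; this is why the tables report both.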

Table A.1: Sensitivity, specificity, and false discovery rate (FDR) as compared to the sequence confirmation dataset. Also shown are the number of SFPs called for each genotype (GP SFP and MX SFP), the number of genes containing at least one SFP for each genotype (GP Genes and MX Genes), and the number of differentially expressed genes (MX Exp and GP Exp) for each of the 89 preprocessing routines performed on the BB3 dataset. The standard pipelines dChip (both PM only and subtract MM variant), MAS5, GCRMA, RMA, and PLIER are shown in bold.

Dataset  Background Correction  Normalization  MM Correction  Summarization  Sensitivity  Specificity  FDR     GP SFP  MX SFP  GP Genes  MX Genes  MX Exp  GP Exp
BB3      mas                    invariantset   mas            liwong         70.3%        99.5%        7.18%   1498    1670    777       955       136     245
BB3      none                   invariantset   mas            liwong         65.7%        99.5%        7.65%   1426    1538    692       851       137     246
BB3      rma                    invariantset   mas            liwong         72.4%        99.3%        8.47%   1747    1713    817       927       157     432
BB3      gcrma                  none           mas            liwong         75.3%        98%          13.9%   1979    2070    974       1098      243     747
BB3      mas                    none           mas            liwong         75.3%        99.4%        6.25%   1688    1687    856       977       121     317
BB3      none                   none           mas            liwong         71.1%        99.4%        6.59%   1637    1575    795       900       120     316
BB3      rma                    none           mas            liwong         72.8%        99.1%        8.42%   1826    1701    863       948       117     497
BB3      gcrma                  quantiles      mas            liwong         75.3%        97.9%        14.7%   1927    2128    971       1115      320     590
BB3      mas                    quantiles      mas            liwong         73.6%        99.4%        6.38%   1582    1778    821       1013      146     237
BB3      none                   quantiles      mas            liwong         70.3%        99.4%        6.67%   1539    1686    765       941       146     235
BB3      rma                    quantiles      mas            liwong         74.1%        99.3%        7.81%   1779    1782    847       977       150     401
BB3      mas                    invariantset   pmonly         liwong         74.5%        99.2%        6.32%   1338    1633    757       967       106     138
BB3      none                   invariantset   pmonly         liwong         69.5%        99.4%        5.68%   1140    1451    662       877       85      114
BB3      rma                    invariantset   pmonly         liwong         77.4%        99.3%        6.09%   1638    1915    884       1074      160     190
BB3      gcrma                  none           pmonly         liwong         77%          98.1%        12.8%   1862    2163    1005      1207      224     550
BB3      mas                    none           pmonly         liwong         75.7%        99.2%        6.22%   1510    1713    833       1002      94      197
BB3      none                   none           pmonly         liwong         72.4%        99.2%        6.49%   1361    1577    768       942       85      166
BB3      rma                    none           pmonly         liwong         76.6%        99.3%        5.67%   1552    1881    841       1088      92      250
BB3      gcrma                  quantiles      pmonly         liwong         77.8%        98.1%        12.7%   1844    2239    1009      1225      396     344
BB3      mas                    quantiles      pmonly         liwong         75.7%        99.2%        6.22%   1443    1775    809       1029      117     145
BB3      none                   quantiles      pmonly         liwong         73.2%        99.2%        6.42%   1317    1610    753       952       103     128
BB3      rma                    quantiles      pmonly         liwong         76.2%        99.3%        5.7%    1541    1883    861       1082      137     166
BB3      none                   invariantset   subtractmm     liwong         61%          99.5%        7.89%   1738    1818    821       998       222     406
BB3      mas                    invariantset   mas            mas            69.5%        96%          25.2%   2026    1851    925       1024      189     354
BB3      none                   invariantset   mas            mas            64.9%        96.3%        25.8%   1995    1706    821       884       188     356
BB3      rma                    invariantset   mas            mas            63.6%        96.9%        24.4%   2234    1437    798       681       166     498
BB3      gcrma                  none           mas            mas            61.9%        92.4%        41.7%   3103    2225    1089      957       356     1240
BB3      mas                    none           mas            mas            72.8%        95.8%        25%     2240    1932    1015      1071      181     459
BB3      none                   none           mas            mas            69%          95.6%        26.7%   2248    1793    955       961       180     464
BB3      rma                    none           mas            mas            65.7%        97.1%        23%     2264    1429    830       706       142     544
BB3      gcrma                  quantiles      mas            mas            63.2%        92.2%        41.5%   3100    2259    1093      969       403     1013
BB3      mas                    quantiles      mas            mas            72%          96.4%        22.9%   2122    2016    971       1099      207     375
BB3      none                   quantiles      mas            mas            69.5%        96%          25.2%   2167    1910    929       1004      208     379
BB3      rma                    quantiles      mas            mas            66.1%        97.2%        22.5%   2281    1472    826       729       163     486
BB3      mas                    invariantset   pmonly         mas            72%          97%          19.2%   1600    1798    837       1024      149     205
BB3      none                   invariantset   pmonly         mas            68.2%        97.3%        18.9%   1330    1502    720       910       109     167
BB3      rma                    invariantset   pmonly         mas            74.5%        96%          24.3%   2266    2246    1050      1177      241     344
BB3      gcrma                  none           pmonly         mas            76.2%        94%          31.6%   2737    2685    1268      1361      346     934
BB3      mas                    none           pmonly         mas            72.4%        96.8%        21%     1820    1949    933       1094      139     314
BB3      none                   none           pmonly         mas            69.5%        96.8%        21%     1615    1742    853       1012      123     276
BB3      rma                    none           pmonly         mas            74.5%        96.8%        19.8%   2106    2106    1026      1155      144     464
BB3      gcrma                  quantiles      pmonly         mas            75.3%        94.1%        32.1%   2685    2802    1254      1400      580     601
BB3      mas                    quantiles      pmonly         mas            72.8%        96.8%        20.5%   1790    1968    912       1101      182     242
BB3      none                   quantiles      pmonly         mas            70.3%        96.8%        20.4%   1597    1774    837       1014      148     214
BB3      rma                    quantiles      pmonly         mas            74.5%        96.7%        20.5%   2042    2162    1000      1176      217     284
BB3      mas                    invariantset   mas            medianpolish   70.7%        98.7%        11.5%   1335    1539    794       965       137     246
BB3      none                   invariantset   mas            medianpolish   66.9%        98.6%        12.6%   1260    1404    703       848       136     238
BB3      rma                    invariantset   mas            medianpolish   66.1%        98.4%        16.8%   1531    1292    728       703       146     404
BB3      gcrma                  none           mas            medianpolish   65.3%        96.7%        25%     2129    1860    1019      1016      281     1030
BB3      mas                    none           mas            medianpolish   72.8%        98.7%        10.8%   1467    1607    863       1002      131     314
BB3      none                   none           mas            medianpolish   69.5%        98.9%        10.3%   1384    1489    795       915       133     317
BB3      rma                    none           mas            medianpolish   67.4%        98.5%        15.3%   1554    1314    757       724       121     459
BB3      gcrma                  quantiles      mas            medianpolish   64.9%        96.5%        25.8%   2135    1873    1032      1029      343     833
BB3      mas                    quantiles      mas            medianpolish   72.8%        98.8%        10.3%   1416    1656    842       1022      147     257
BB3      none                   quantiles      mas            medianpolish   69.9%        98.8%        10.7%   1339    1548    770       941       149     255
BB3      rma                    quantiles      mas            medianpolish   67.8%        98.4%        15.6%   1565    1336    757       739       144     385
BB3      mas                    invariantset   pmonly         medianpolish   72.4%        98.9%        8.95%   1204    1532    752       976       108     132
BB3      none                   invariantset   pmonly         medianpolish   68.6%        99.2%        6.29%   1016    1337    656       870       69      99
BB3      rma                    invariantset   pmonly         medianpolish   76.6%        98.9%        8.5%    1459    1785    879       1107      166     209
BB3      gcrma                  none           pmonly         medianpolish   77.8%        98.3%        13.5%   1774    2050    1059      1261      258     700
BB3      mas                    none           pmonly         medianpolish   74.5%        98.9%        8.72%   1339    1635    811       1031      108     205
BB3      none                   none           pmonly         medianpolish   71.5%        98.8%        9.04%   1202    1505    747       959       83      161
BB3      rma                    none           pmonly         medianpolish   76.6%        99%          8.04%   1421    1774    855       1103      108     286
BB3      gcrma                  quantiles      pmonly         medianpolish   77.8%        98.4%        13.1%   1816    2067    1098      1260      392     429
BB3      mas                    quantiles      pmonly         medianpolish   74.5%        98.9%        8.25%   1313    1648    791       1034      129     156
BB3      none                   quantiles      pmonly         medianpolish   71.5%        98.9%        8.56%   1190    1513    746       963       105     126
BB3      rma                    quantiles      pmonly         medianpolish   76.2%        99%          7.61%   1421    1782    854       1098      151     184
BB3      mas                    invariantset   mas            plier          70.7%        99.1%        7.65%   1192    1514    730       933       127     214
BB3      none                   invariantset   mas            plier          67.4%        99.5%        7.47%   1140    1398    648       827       128     215
BB3      rma                    invariantset   mas            plier          61.1%        99.4%        8.75%   1023    998     518       579       97      211
BB3      gcrma                  none           mas            plier          63.2%        99.2%        10.7%   1159    1309    667       787       131     393
BB3      mas                    none           mas            plier          73.6%        99.2%        6.38%   1284    1544    778       964       114     279
BB3      none                   none           mas            plier          69.9%        99.3%        7.22%   1239    1454    731       883       115     277
BB3      rma                    none           mas            plier          63.2%        99.6%        6.79%   1065    985     549       595       81      204
BB3      gcrma                  quantiles      mas            plier          61.5%        99.4%        9.82%   1192    1321    681       795       161     310
BB3      mas                    quantiles      mas            plier          73.2%        99.1%        7.89%   1253    1589    762       986       137     227
BB3      none                   quantiles      mas            plier          69.9%        99.4%        6.7%    1197    1506    709       914       134     227
BB3      rma                    quantiles      mas            plier          64.9%        99.6%        8.28%   1087    1044    553       623       100     207
BB3      mas                    invariantset   pmonly         plier          72%          99.1%        9.47%   1238    1548    719       932       163     207
BB3      none                   invariantset   pmonly         plier          65.3%        99.5%        8.24%   1175    1505    620       839       167     203
BB3      rma                    invariantset   pmonly         plier          57.3%        99.4%        9.87%   1272    1705    548       714       310     341
BB3      gcrma                  none           pmonly         plier          66.5%        99.4%        8.09%   1319    1421    743       839       157     489
BB3      mas                    none           pmonly         plier          72.8%        99.1%        9.38%   1380    1537    802       948       129     334
BB3      none                   none           pmonly         plier          70.3%        99.2%        9.19%   1348    1474    745       861       131     332
BB3      rma                    none           pmonly         plier          61.9%        99.5%        8.64%   1168    1159    605       663       86      269
BB3      gcrma                  quantiles      pmonly         plier          66.5%        99.3%        8.62%   1266    1643    718       927       334     288
BB3      mas                    quantiles      pmonly         plier          73.6%        99.1%        9.28%   1284    1618    760       986       175     215
BB3      none                   quantiles      pmonly         plier          72%          99.3%        8.51%   1233    1603    689       921       177     217
BB3      rma                    quantiles      pmonly         plier          70.3%        99.3%        8.7%    1382    1770    720       948       250     304


Table A.2: Sensitivity, specificity, and false discovery rate (FDR) as compared to the sequence confirmation dataset. Also shown are the number of SFPs called for each genotype (GP SFP and MX SFP), the number of genes containing at least one SFP for each genotype (GP Genes and MX Genes), and the number of differentially expressed genes (MX Exp and GP Exp) for each of the 97 preprocessing routines performed on the E-TABM-113 dataset. The standard pipelines dChip (both PM only and subtract MM variant), MAS5, GCRMA, RMA, and PLIER are shown in bold.

Dataset     Background Correction  Normalization  MM Correction  Summarization  Sensitivity  Specificity  FDR     GP SFP  MX SFP  GP Genes  MX Genes  MX Exp  GP Exp
E-TABM-113  gcrma                  invariantset   mas            liwong         37.6%        97.3%        32.7%   1617    1585    911       886       432     225
E-TABM-113  mas                    invariantset   mas            liwong         57.6%        98.2%        19%     1754    1997    1033      1222      768     521
E-TABM-113  none                   invariantset   mas            liwong         51.9%        96%          35.2%   2717    3191    1210      1441      765     513
E-TABM-113  rma                    invariantset   mas            liwong         43.7%        97.6%        29.5%   1450    2000    734       1021      406     195
E-TABM-113  gcrma                  none           mas            liwong         36.6%        97.6%        30.3%   1232    1657    686       957       529     183
E-TABM-113  mas                    none           mas            liwong         53.2%        98.6%        16.5%   1351    2155    818       1337      1235    373
E-TABM-113  none                   none           mas            liwong         53.2%        97.1%        28.6%   1929    4041    866       1835      1247    380
E-TABM-113  rma                    none           mas            liwong         41%          97.8%        28.8%   1311    2317    659       1140      409     147
E-TABM-113  gcrma                  quantiles      mas            liwong         35.9%        98%          28.4%   1316    1653    712       930       394     223
E-TABM-113  mas                    quantiles      mas            liwong         57.6%        98.3%        18.3%   1663    1951    1001      1221      784     488
E-TABM-113  none                   quantiles      mas            liwong         53.9%        96.7%        30.9%   2572    3111    1181      1443      797     490
E-TABM-113  rma                    quantiles      mas            liwong         44.1%        97.6%        29%     1498    2092    753       1040      386     186
E-TABM-113  gcrma                  invariantset   pmonly         liwong         45.4%        98.4%        18.3%   1476    1495    1043      1022      679     571
E-TABM-113  mas                    invariantset   pmonly         liwong         61%          98.3%        15.9%   1698    1806    1126      1164      583     504
E-TABM-113  none                   invariantset   pmonly         liwong         60.7%        97.9%        19.7%   2123    2076    1290      1275      384     351
E-TABM-113  rma                    invariantset   pmonly         liwong         63.1%        98.4%        14.7%   1987    1888    1338      1206      717     558
E-TABM-113  gcrma                  none           pmonly         liwong         44.1%        98.5%        17.7%   1266    1598    895       1089      1126    344
E-TABM-113  mas                    none           pmonly         liwong         59%          98.6%        13.9%   1436    1929    945       1250      981     356
E-TABM-113  none                   none           pmonly         liwong         60.7%        98.3%        17.1%   1736    2538    1059      1499      709     277
E-TABM-113  rma                    none           pmonly         liwong         60%          98.8%        11.1%   1573    2033    1063      1285      951     362
E-TABM-113  gcrma                  quantiles      pmonly         liwong         44.1%        98.4%        18.2%   1320    1542    931       1069      679     563
E-TABM-113  mas                    quantiles      pmonly         liwong         60.3%        98.3%        16%     1640    1780    1090      1156      635     505
E-TABM-113  none                   quantiles      pmonly         liwong         62.4%        97.7%        20.3%   2182    2134    1320      1299      430     385
E-TABM-113  rma                    quantiles      pmonly         liwong         62.4%        98.3%        14.8%   1918    1934    1297      1230      723     551
Continued on next page


Table A.2 – continued from previous page

| Dataset | Background Correction | Normalization | MM Correction | Summarization | Sensitivity | Specificity | FDR | GP SFP | MX SFP | GP Genes |
|---|---|---|---|---|---|---|---|---|---|---|
| E-TABM-113 | none | invariantset | subtractmm | liwong | 51.5% | 96.4% | 33.6% | 2656 | 3284 | 1184 |
| E-TABM-113 | gcrma | invariantset | mas | mas | 28.1% | 97.2% | 40.7% | 1673 | 2298 | 678 |
| E-TABM-113 | mas | invariantset | mas | mas | 51.9% | 98.5% | 16.8% | 1498 | 1918 | 894 |
| E-TABM-113 | none | invariantset | mas | mas | 45.8% | 96.9% | 32.2% | 2183 | 3216 | 944 |
| E-TABM-113 | rma | invariantset | mas | mas | 25.8% | 97.1% | 42.9% | 1351 | 2959 | 482 |
| E-TABM-113 | gcrma | none | mas | mas | 27.8% | 97.2% | 42.3% | 1521 | 2552 | 601 |
| E-TABM-113 | mas | none | mas | mas | 50.5% | 98.7% | 15.8% | 1385 | 1874 | 836 |
| E-TABM-113 | none | none | mas | mas | 45.4% | 97.1% | 31.3% | 1941 | 3316 | 858 |
| E-TABM-113 | rma | none | mas | mas | 26.4% | 97% | 43.5% | 1374 | 3136 | 466 |
| E-TABM-113 | gcrma | quantiles | mas | mas | 29.8% | 97.2% | 40.1% | 1526 | 2554 | 610 |
| E-TABM-113 | mas | quantiles | mas | mas | 51.9% | 98.7% | 15% | 1464 | 1876 | 891 |
| E-TABM-113 | none | quantiles | mas | mas | 47.8% | 97.1% | 30.5% | 2226 | 3278 | 983 |
| E-TABM-113 | rma | quantiles | mas | mas | 27.5% | 97% | 42.1% | 1442 | 3151 | 503 |
| E-TABM-113 | gcrma | invariantset | pmonly | mas | 44.1% | 98.7% | 17.7% | 1372 | 1461 | 948 |
| E-TABM-113 | mas | invariantset | pmonly | mas | 60% | 98.3% | 15.3% | 1841 | 2017 | 1183 |
| E-TABM-113 | none | invariantset | pmonly | mas | 59.7% | 97.7% | 20.4% | 2253 | 2369 | 1371 |
| E-TABM-113 | rma | invariantset | pmonly | mas | 61.4% | 98.2% | 15.8% | 2050 | 2187 | 1317 |
| E-TABM-113 | gcrma | none | pmonly | mas | 43.4% | 98.6% | 19% | 1239 | 1493 | 860 |
| E-TABM-113 | mas | none | pmonly | mas | 58.6% | 98.4% | 14.8% | 1713 | 1986 | 1096 |
| E-TABM-113 | none | none | pmonly | mas | 60% | 97.5% | 21.7% | 2316 | 2507 | 1421 |
| E-TABM-113 | rma | none | pmonly | mas | 60% | 98.2% | 16.1% | 1925 | 2185 | 1216 |
| E-TABM-113 | gcrma | quantiles | pmonly | mas | 43.4% | 98.5% | 19.5% | 1285 | 1499 | 893 |
| E-TABM-113 | mas | quantiles | pmonly | mas | 59% | 98.3% | 15.5% | 1745 | 1978 | 1132 |
| E-TABM-113 | none | quantiles | pmonly | mas | 60.7% | 97.5% | 21.1% | 2359 | 2472 | 1439 |
| E-TABM-113 | rma | quantiles | pmonly | mas | 61.7% | 98.2% | 15.7% | 2019 | 2279 | 1305 |
| E-TABM-113 | gcrma | invariantset | mas | medianpolish | 33.2% | 98.6% | 23.4% | 1266 | 2097 | 675 |
| E-TABM-113 | mas | invariantset | mas | medianpolish | 52.9% | 99.4% | 8.24% | 1184 | 1614 | 846 |
| E-TABM-113 | none | invariantset | mas | medianpolish | 52.9% | 98.4% | 18.8% | 1740 | 2503 | 930 |
| E-TABM-113 | rma | invariantset | mas | medianpolish | 34.6% | 98.2% | 27.7% | 1165 | 2498 | 550 |
| E-TABM-113 | gcrma | none | mas | medianpolish | 32.5% | 98.5% | 25% | 1191 | 2060 | 641 |
| E-TABM-113 | mas | none | mas | medianpolish | 51.5% | 99.4% | 7.32% | 1029 | 1645 | 738 |
| E-TABM-113 | none | none | mas | medianpolish | 51.9% | 98.5% | 18.2% | 1441 | 2799 | 791 |
| E-TABM-113 | rma | none | mas | medianpolish | 33.9% | 98.2% | 28.6% | 1134 | 2545 | 520 |
| E-TABM-113 | gcrma | quantiles | mas | medianpolish | 32.5% | 98.4% | 26.2% | 1167 | 2047 | 609 |



Table A.2 – continued from previous page

| Dataset | Background Correction | Normalization | MM Correction | Summarization | Sensitivity | Specificity | FDR | GP SFP | MX SFP | GP Genes |
|---|---|---|---|---|---|---|---|---|---|---|
| E-TABM-113 | mas | quantiles | mas | medianpolish | 51.5% | 99.4% | 8.43% | 1152 | 1563 | 829 |
| E-TABM-113 | none | quantiles | mas | medianpolish | 54.2% | 98.6% | 16.7% | 1695 | 2498 | 934 |
| E-TABM-113 | rma | quantiles | mas | medianpolish | 34.9% | 98% | 29.5% | 1236 | 2644 | 567 |
| E-TABM-113 | gcrma | invariantset | pmonly | medianpolish | 42.4% | 99.3% | 10.7% | 1059 | 1294 | 843 |
| E-TABM-113 | mas | invariantset | pmonly | medianpolish | 60.3% | 99.3% | 7.29% | 1324 | 1591 | 961 |
| E-TABM-113 | none | invariantset | pmonly | medianpolish | 62.4% | 99.2% | 8% | 1529 | 1803 | 1075 |
| E-TABM-113 | rma | invariantset | pmonly | medianpolish | 61.4% | 99.3% | 7.65% | 1506 | 1700 | 1107 |
| E-TABM-113 | gcrma | none | pmonly | medianpolish | 40.3% | 99.3% | 11.2% | 1006 | 1278 | 810 |
| E-TABM-113 | mas | none | pmonly | medianpolish | 59.3% | 99.3% | 7.89% | 1243 | 1590 | 913 |
| E-TABM-113 | none | none | pmonly | medianpolish | 63.4% | 99.3% | 7.43% | 1553 | 1845 | 1104 |
| E-TABM-113 | rma | none | pmonly | medianpolish | 61% | 99.3% | 7.22% | 1402 | 1697 | 1024 |
| E-TABM-113 | gcrma | quantiles | pmonly | medianpolish | 41.4% | 99.3% | 10.3% | 1010 | 1303 | 816 |
| E-TABM-113 | mas | quantiles | pmonly | medianpolish | 59.7% | 99.3% | 7.37% | 1303 | 1564 | 949 |
| E-TABM-113 | none | quantiles | pmonly | medianpolish | 63.1% | 99.2% | 8.37% | 1589 | 1851 | 1123 |
| E-TABM-113 | rma | quantiles | pmonly | medianpolish | 62% | 99.2% | 8.04% | 1522 | 1722 | 1119 |
| E-TABM-113 | gcrma | invariantset | mas | plier | 30.5% | 98.9% | 18.2% | 800 | 1184 | 452 |
| E-TABM-113 | mas | invariantset | mas | plier | 57.3% | 99.5% | 6.11% | 1169 | 1631 | 790 |
| E-TABM-113 | none | invariantset | mas | plier | 53.9% | 98.6% | 15% | 1658 | 2428 | 871 |
| E-TABM-113 | rma | invariantset | mas | plier | 30.5% | 98.8% | 26.2% | 912 | 1947 | 376 |
| E-TABM-113 | gcrma | none | mas | plier | 29.2% | 99.4% | 11.3% | 681 | 1256 | 388 |
| E-TABM-113 | mas | none | mas | plier | 55.6% | 99.5% | 5.75% | 1053 | 1663 | 722 |
| E-TABM-113 | none | none | mas | plier | 53.2% | 98.7% | 14.2% | 1406 | 2733 | 750 |
| E-TABM-113 | rma | none | mas | plier | 30.2% | 98.3% | 31.5% | 863 | 2030 | 355 |
| E-TABM-113 | gcrma | quantiles | mas | plier | 30.8% | 98.8% | 19.5% | 706 | 1258 | 391 |
| E-TABM-113 | mas | quantiles | mas | plier | 56.6% | 99.5% | 6.18% | 1158 | 1606 | 797 |
| E-TABM-113 | none | quantiles | mas | plier | 54.9% | 98.6% | 15.2% | 1661 | 2421 | 881 |
| E-TABM-113 | rma | quantiles | mas | plier | 32.2% | 98.4% | 30.7% | 1001 | 2182 | 409 |
| E-TABM-113 | gcrma | invariantset | pmonly | plier | 38.3% | 98.6% | 21.5% | 1233 | 1487 | 656 |
| E-TABM-113 | mas | invariantset | pmonly | plier | 58% | 99.1% | 10.9% | 1401 | 1844 | 913 |
| E-TABM-113 | none | invariantset | pmonly | plier | 55.6% | 97.1% | 27.8% | 2977 | 3769 | 1332 |
| E-TABM-113 | rma | invariantset | pmonly | plier | 42% | 98.7% | 21.5% | 1845 | 2335 | 715 |
| E-TABM-113 | gcrma | none | pmonly | plier | 38.3% | 99% | 17.5% | 989 | 1640 | 567 |
| E-TABM-113 | mas | none | pmonly | plier | 54.2% | 99.3% | 9.09% | 1160 | 1908 | 768 |
| E-TABM-113 | none | none | pmonly | plier | 54.2% | 97.3% | 26.9% | 1985 | 4802 | 914 |



Table A.2 – continued from previous page

| Dataset | Background Correction | Normalization | MM Correction | Summarization | Sensitivity | Specificity | FDR | GP SFP | MX SFP | GP Genes | MX Genes | MX Exp | GP Exp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E-TABM-113 | rma | none | pmonly | plier | 38.6% | 98.9% | 19.1% | 1224 | 2512 | 512 | 1082 | 352 | 116 |
| E-TABM-113 | gcrma | quantiles | pmonly | plier | 38% | 98.9% | 18.8% | 1111 | 1470 | 629 | 810 | 244 | 164 |
| E-TABM-113 | mas | quantiles | pmonly | plier | 58% | 99.3% | 9.04% | 1344 | 1792 | 888 | 1206 | 856 | 457 |
| E-TABM-113 | none | quantiles | pmonly | plier | 56.9% | 97.3% | 26% | 2882 | 3681 | 1337 | 1690 | 890 | 562 |
| E-TABM-113 | rma | quantiles | pmonly | plier | 43.4% | 98.6% | 22.9% | 1771 | 2292 | 732 | 973 | 386 | 213 |
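The Sensitivity, Specificity, and FDR columns of Table A.2 score each routine's SFP calls against the sequence confirmation dataset. The Python sketch below shows one way such rates could be computed. It assumes the standard confusion-matrix definitions, which may differ in detail from those used to build the table. The function name `sfp_call_rates`, its three arguments, and the probe identifiers in the usage note are hypothetical and do not come from the dissertation.

```python
def sfp_call_rates(called, confirmed, sequenced):
    """Return (sensitivity, specificity, fdr) for one set of SFP calls.

    called    -- probe IDs called as SFPs by a preprocessing routine
    confirmed -- probe IDs whose target sequence contains a polymorphism
    sequenced -- all probe IDs covered by the sequence confirmation data

    Assumes the usual definitions:
      sensitivity = TP / (TP + FN)
      specificity = TN / (TN + FP)
      FDR         = FP / (TP + FP)
    """
    sequenced = set(sequenced)
    # Only probes with sequence data can be scored as true or false calls.
    called = set(called) & sequenced
    confirmed = set(confirmed) & sequenced

    tp = len(called & confirmed)   # called and truly polymorphic
    fp = len(called - confirmed)   # called but not polymorphic
    fn = len(confirmed - called)   # polymorphic but missed
    tn = len(sequenced) - tp - fp - fn

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a rate is undefined.
        return numerator / denominator if denominator else float("nan")

    return ratio(tp, tp + fn), ratio(tn, tn + fp), ratio(fp, tp + fp)
```

For instance, with ten sequenced probes `p1` to `p10`, confirmed polymorphisms in `p1`, `p2`, `p4`, and `p5`, and calls on `p1`, `p2`, and `p3`, the function gives a sensitivity of 0.5, a specificity of 5/6, and an FDR of 1/3.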
