Statistical evaluation of normalization methods for NanoString nCounter data

Elika Garg 1, Ivan Topisirovic 1, 2, and Robert Nadon 1, 3

1 McGill University, 2 Lady Davis Institute for Medical Research, 3 McGill University and Genome Quebec Innovation Centre

INTRODUCTION

NanoString is a novel medium-throughput technology that is becoming widely accepted in the biomedical community for the measurement of gene expression. The count data generated by its nCounter machine are sensitive to pre-processing methods, however, and there is as yet no consensus on how best to normalize the data.

The variance must be stabilized, and the candidate normalization methods evaluated, before any method is applied to a dataset. A statistical evaluation of popular normalization methods is discussed and applied to four different datasets. As required by the statistical tests, the variance within each dataset is stabilized before evaluation.

We illustrate how competing normalization methods can be evaluated for NanoString nCounter mRNA datasets.

Two evaluation strategies are presented, each of which has a first elimination step and a second selection step. The underlying logic of both strategies is that agreement between measurements within the same condition should be high, under the null hypothesis assumption of no differences, and should be low when measurements from different conditions are compared.

METHODS

Strategy-I
• Samples required: ≥ 2 replicates/condition (in pairs)
• Examined entity: Agreement, based on linear regression of pairs (can be coupled with regression diagnostics)
• Assessment tools: Concordance Correlation Coefficients (CCCs), scatter plots

Strategy-II
• Samples required: ≥ 4 replicates/condition (in groups)
• Examined entity: P-value distribution, generated from statistical tests of mean difference between groups (e.g. t-test)
• Assessment tools: Histograms, QQ-plots

Step-1
• Compares samples from the same condition
• Expects similar values
• Eliminates poor-performers

Step-2
• Compares samples from opposite conditions
• Expects more differences
• Selects best-performers

[Figure: example scatter plots of Replicate 1 against Replicate 2 (Concordance Correlation Coefficient = 0.99 and 0.82), histograms of P-values (Frequency against P-value), and QQ-plots (sample quantiles against the uniform distribution).]
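The two examined entities, agreement and P-value distribution, can each be reduced to a single number per normalization method. The sketch below is one possible way to do so and is not taken from the poster: it assumes Lin's definition of the concordance correlation coefficient for Strategy-I, and it swaps in a Kolmogorov-Smirnov distance from the uniform distribution as a numeric companion to the histograms and QQ-plots of Strategy-II. The function names and the example values are hypothetical, and the P-values are assumed to have been computed beforehand (e.g. by t-tests).

```python
from statistics import fmean


def concordance_correlation(x, y):
    """Concordance correlation coefficient (Lin's definition) of paired values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and hold at least two values")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    # Divide-by-n moments, as in Lin's definition of the coefficient.
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined when both samples are the same constant")
    return 2 * cov_xy / denominator


def ks_distance_from_uniform(pvalues):
    """Largest gap between the empirical CDF of the P-values and Uniform(0, 1).

    Assumption: this distance stands in for the visual histogram/QQ-plot check.
    """
    if not pvalues:
        raise ValueError("at least one P-value is required")
    ordered = sorted(pvalues)
    n = len(ordered)
    # At each sorted P-value the empirical CDF jumps from i/n to (i + 1)/n,
    # while the uniform CDF equals the P-value itself.
    return max(max((i + 1) / n - p, p - i / n) for i, p in enumerate(ordered))


if __name__ == "__main__":
    # Hypothetical normalized expression values for two replicates.
    replicate_1 = [5.1, 6.3, 7.8, 9.0, 10.2]
    replicate_2 = [5.0, 6.5, 7.7, 9.1, 10.4]
    print("CCC:", concordance_correlation(replicate_1, replicate_2))

    # Hypothetical per-gene P-values from a same-condition comparison.
    pvalues = [0.04, 0.21, 0.35, 0.48, 0.62, 0.77, 0.93]
    print("KS distance from uniform:", ks_distance_from_uniform(pvalues))
```

Under this reading, Step-1 would flag a normalization method when same-condition pairs give a low CCC or same-condition P-values sit far from uniform, and Step-2 would favour the method for which opposite-condition comparisons give the lowest CCC or the largest distance from uniform.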

DATA

• The primary dataset was generated as part of a study examining the role of Estrogen Receptor alpha (ER) in prostate cancer. Mouse cells with and without ER were grown in culture, then polysomal and total mRNA were extracted from each through polysomal fractionation.
• The other datasets were obtained from peer-reviewed journals.

Table 1: Description of datasets with respect to number of replicates, controls, and genes.
[Table columns: Datasets; Number of conditions used; Number of replicates (Biological, Technical); Number of controls (Positive, Negative); Number of genes (Reference, Endogenous). Cell values as extracted, row and column assignment not recoverable: 11 189 9 500 2 65 8 558; 1 4 3 6 8 2 2 3 6 2 3 2 3 2 6 8 4 2 6 6 8.]

RESULTS

            Strategy-I                        Strategy-II
Datasets    Data-1, Data-2, Data-3, Data-4    Data-4
Plots       Boxplots of CCCs                  Histograms and QQ-plots
Step-1      Elimination: low CCCs             Elimination: deviation from uniformity
Step-2      Selection: lowest CCCs            Selection: highest deviation from uniformity

[Figure, Step-1: boxplots of CCCs by normalization method (Strategy-I); histograms of P-values (Frequency against P-value) and QQ-plots (observed quantiles against the uniform distribution) by normalization method (Strategy-II).]

PRE-PROCESSING

• Normalization methods were pre-selected based on their popularity in high-impact journals.
• They were applied to all datasets through an R package called NanoStringNorm.

Table 2: Description of five pre-selected normalization methods.

Method           Positive control    Negative control    Reference genes
1 M-PosRef       Sum                 -                   Geometric mean
2 M-Pos          Geometric mean      -                   -
3 M-Ref          -                   -                   Geometric mean
4 M-PosNegRef    Sum                 Mean + 2sd          Geometric mean
5 M-NegRef       -                   Mean + 2sd          Geometric mean

• Variance stabilization methods were selected based on their ability to achieve mean-SD (standard deviation) independence.
• VSN transformation was applied to all datasets.

[Figure panels: No transformation, Log transformation, VSN transformation.]
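The control summaries named in Table 2 (sum, geometric mean, mean + 2sd) can be illustrated with a small sketch. It is not the NanoStringNorm implementation and makes several assumptions that the poster does not confirm: each sample is rescaled by the across-sample average summary divided by its own summary, the background is the sample standard deviation form of mean + 2sd, and the background is subtracted (floored at zero) before rescaling. All function names and counts are hypothetical.

```python
from statistics import fmean, geometric_mean, stdev


def scaling_factors(summaries):
    """One factor per sample: average summary across samples / own summary.

    Assumption: this rescaling convention is illustrative only.
    """
    target = fmean(summaries)
    return [target / s for s in summaries]


def background_threshold(negative_counts):
    """Mean + 2sd of one sample's negative-control counts (sample sd assumed)."""
    return fmean(negative_counts) + 2 * stdev(negative_counts)


def normalize_sample(endogenous_counts, factor, background=0.0):
    """Subtract the background, floor at zero, then rescale (assumed order)."""
    return [max(c - background, 0.0) * factor for c in endogenous_counts]


if __name__ == "__main__":
    # Hypothetical positive-control counts for two samples.
    positive = {"S1": [100, 400, 1600], "S2": [200, 800, 3200]}

    # Geometric-mean summary of the positive controls; use sum(v) instead
    # for the "Sum" summary listed in Table 2.
    summaries = [geometric_mean(v) for v in positive.values()]
    factors = dict(zip(positive, scaling_factors(summaries)))

    # Hypothetical negative-control and endogenous counts for sample S1.
    background = background_threshold([2, 4, 6])
    print(normalize_sample([10, 20, 500], factors["S1"], background))
```

A reference-gene factor would follow the same pattern, with the geometric mean taken over the reference genes instead of the positive controls.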

CONCLUSION

• Variance stabilization of data is a necessary preprocessing step.
• Normalization by negative controls can be disadvantageous for some NanoString nCounter mRNA data.
• These evaluation strategies can help identify the best normalization method for any given dataset.

[Figure, Step-2: histograms of P-values (Frequency against P-value) and QQ-plots (observed quantiles against the uniform distribution) for normalization methods 1 to 5: M-PosRef, M-Pos, M-Ref, M-PosNegRef, M-NegRef.]

Correspondence: elika.garg@mail.mcgill.ca, robert.nadon@mcgill.ca

Acknowledgements: McGill Integrated Cancer Research Training Program
