Microarray Data Preprocessing To Improve Exploration on Biological Databases
Fadoua Rafii and M’hamed Aït Kbir (1), B. D. Rossi Hassani (2)
(1) LIST Laboratory, UAE Tangier, Morocco
[email protected], [email protected]
(2) LABIPHABE Laboratory, UAE Tangier, Morocco
[email protected]
Abstract— Microarray technology makes it possible to measure the activity of thousands of genes simultaneously, and it is now widely applied to complex scientific questions. The information obtained from Microarray experiments has brought considerable improvements to biological studies. However, the huge amount of data generated by these experiments makes the resulting information harder to manipulate and analyze, even as it attracts growing interest from researchers, biologists in particular. In this paper, we focus on the preparation and pretreatment of Microarray data, in order to reduce the original data and address the errors introduced by the experimental procedures.
Keywords— Microarray; Pretreatments; Errors; Microarray experiments
I. INTRODUCTION
Before Microarray data can be used, pretreatment techniques must be applied to eliminate nonsignificant data and data affected by the imperfections that can arise in biological experimentation. Concretely, before any operation on Microarray data obtained through experimentation, a number of transformations are performed to adjust the measurements, remove the noise that can affect the data and select the genes that are significantly expressed. The ultimate goal is to facilitate comparisons and to prepare the Microarray data for more complex methods that aim to extract useful information.
A. Microarray Technology: Microarrays are miniaturized arrays of hundreds to thousands of discrete DNA fragments or synthetic oligonucleotides attached to a solid substrate (e.g., glass) using automated printing equipment, such that each spot (element) at a fixed position on the array corresponds to a unique DNA fragment [1] [2]. With the benefits of high-throughput technology, the Microarray is a promising tool that allows biologists to measure hundreds of thousands of gene expressions simultaneously [3]. The concept of microarrays was first proposed in the late 1980s [4]. One of the first descriptions of DNA Microarrays in the literature was provided by Augenlicht and his colleagues, who spotted 4000 complementary DNA (cDNA) sequences on nitrocellulose and used radioactive labeling to analyze differences in gene expression patterns among different types of colon tumors in various stages of malignancy [5] [6]. In a Microarray, many thousands of spots are placed on a rectangular grid, with each spot containing a large number of pieces of DNA from a particular gene [7].
B. Microarray Experiment: Like any experiment, a Microarray experiment starts with an experimental design that fixes the samples corresponding to the experimental conditions, the sample size and other important common aspects. Microarray experiments can be regarded as multilayered in the sense that they involve several nested levels at which variability may be introduced [8]. The main objective of these experiments is to reveal the function of many genes; genes that share common expression profiles provide valuable information for deducing the function of a new gene.
II. ELEMENTARY MICROARRAY CONCEPTS
A. Gene expression: DNA or oligonucleotide arrays have been used to monitor messenger RNA (mRNA or transcript) abundance levels of differentially expressed genes under different cell growth conditions or in response to environmental perturbations or genetic mutations [9] [10] [11] [12].
Gene Expression Profile:
Gene Profile: the description of expression values for one gene in many samples or conditions.
Array Profile: the description of expression values for many genes under one condition or one sample.
Fig. 1. The difference between Gene profile and Array profile
B. Microarray Matrix: A Microarray experiment yields a matrix of size N x M (Fig. 2), where:
N represents the number of genes (rows G1 ... GN);
M represents the number of samples or conditions (columns S1 ... SM).
Fig. 2. The structure of Microarray matrix of gene expression (each row G1 ... GN is a gene profile; each column S1 ... SM is an array profile)
III. MICROARRAY RESOURCES
A. Microarray Databases: Given the potential value of Microarray technology, it became necessary to create public databases for the management of Microarray data. The data obtained by Microarray research projects are stored in different databases because of differing needs, limitations and resources. The wide availability of Microarray data has fueled the development of exploratory research and the generation of new hypotheses about specific biological processes based on the analysis of large amounts of data [13].
B. GEO Database: Currently, there are numerous databases containing Microarray data. In our study, we have chosen GEO [14], which is the largest public repository of Microarray data. The primary role of GEO is data archiving, functioning as a hub for data deposit and retrieval [15] [16]. These data address a very broad diversity of biological themes, including disease, development, evolution, metabolism, toxicology, immunity, ecology, and transgenesis [17]. GEO stores over 20 000 microarray- and sequence-based functional genomics studies, and continues to handle the majority of direct high-throughput data submissions from the research community [18].
IV. PROBLEMATIC
The huge amount of data produced by a single Microarray experiment makes the extraction of significant biological information very complex. Microarray experiments offer a potential wealth of information but also present a significant data analysis challenge [19]. For this reason, it is necessary to resort to techniques that filter out the noise and correct the systematic errors present in experimental data. An ideal expression matrix would contain numerical values that reflect the true transcript abundance, or the true abundance ratio between the combined samples. The value measured by the instruments, however, always deviates from the exact value by a quantity that we call an error:
Measure = Exact value + Error
Many factors can influence the measured gene expression level, so the observed expression difference may be attributed to two parts: biological factors and measurement errors [20]. Potential sources of systematic errors include array surface chemistry, microarray printing, labeling methods, hybridization parameters, image analysis and RNA isolation [21] [22] [23] [24] [25].
V. MICROARRAY DATA PREPROCESSING
Microarrays provide a huge amount of data with which to address many hypotheses simultaneously, but the raw results are not consistent and must be carefully prepared and analyzed to yield satisfactory results. Gathering, organizing and preparing the data for further analysis are collectively referred to as data pretreatment.
A. Logarithm transformation: A commonly used transformation is to take the base-2 logarithm of the expression ratio, i.e. log2(expression ratio) [26]. By general convention, the logarithm transformation of most Microarray data provides a good approximation of the normal distribution, with minor exceptions. A log2 transformation makes ratios particularly convenient to work with because it is simple to conceptualize the fold change in a ratio given a log2 value [27]: a ratio of 4 maps to +2 and a ratio of 1/4 maps to -2. Note that log(x) - log(y) is equivalent to log(x/y).
B. Missing values: In contrast to many other analytical measurements, Microarray data are often characterized by a significant proportion of missing values [28]. The causes of missing values include:
Corruption of image
Insufficient resolution
Dust or scratches on the slide
Robotic methods used to create arrays
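The log2 transformation and the handling of missing values described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `log2_ratios`, the sample ratios and the choice to represent a missing spot as NaN are all assumptions made for the example.

```python
import math

def log2_ratios(ratios):
    """Return log2 of each expression ratio.

    Assumption for this sketch: a missing spot is given as None or NaN,
    and non-positive ratios (which have no logarithm) are also treated
    as missing. Missing entries are returned as NaN.
    """
    out = []
    for r in ratios:
        if r is None or (isinstance(r, float) and math.isnan(r)) or r <= 0:
            out.append(float("nan"))
        else:
            out.append(math.log2(r))
    return out

# Hypothetical expression ratios for one gene across six samples.
ratios = [4.0, 2.0, 1.0, 0.5, 0.25, None]
logged = log2_ratios(ratios)

# Number of missing values in the transformed profile.
missing = sum(1 for v in logged if math.isnan(v))

# log(x) - log(y) equals log(x / y), here with x = 8 and y = 2.
difference = math.log2(8.0) - math.log2(2.0)
quotient = math.log2(8.0 / 2.0)
```

With these inputs, a four-fold increase and a four-fold decrease become +2 and -2, which shows the symmetry that makes log ratios convenient to compare.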
C. Filtering data: By keeping only array elements that are significantly above the background, we can increase the reliability of measurements [29]. Filtering consists of deriving a smaller data matrix from the original one. The large matrix holds hybridizations in its columns and substances in its rows, and each hybridization may contribute several types of observations, each arranged in its own column. Filtering relies on three general concepts:
Selection
Average
Estimation
VI. THE COMPARED PREPROCESSING METHODS
A. Centering and Rescaling data (CR): Centering is a very commonly used method for comparing multiple arrays [30]. It is particularly important when ratios are used to track changes in gene expression. Centering subtracts the mean of each array profile from every value of that profile:
$X^{*}_{ij} = X_{ij} - \mathrm{mean}(A_j)$
where:
$A_j$ represents the jth array profile of the Microarray matrix
$X_{ij}$ is the original expression value
$X^{*}_{ij}$ is the new expression value
$\mathrm{mean}(A_j)$ is the mean of the jth array profile
Re-scaling divides a variable by its standard deviation so that the data have unit variance. The variance in a hybridization results from two components: the substances and the error. Re-scaling is a necessary and strong component of normalization. It is applied to the centered value:
$X^{*}_{ij} = X^{*}_{ij} / \mathrm{sd}(A_j)$
where $\mathrm{sd}(A_j)$ is the standard deviation of the jth array profile.
B. Quantile Method (QM): The goal of the Quantile method is to make the distribution of probe intensities the same for each array in a set of arrays [31]. Quantile normalization ensures the same empirical distribution of intensities across arrays and across channels. Let $D_i$ denote the empirical distribution of intensities on the ith array, and $F$ the empirical distribution of the averaged sample quantiles. To obtain identical empirical distributions, the intensities on the ith array are normalized by the composite function:
$F^{-1} \circ D_i$
C. Correlation Based Method (CBM): When the components of the attribute vector do not vary on the same order of magnitude, the distance between patterns may be insensitive to changes in some attributes. A pretreatment step based on the normalization of the vectors is therefore essential. When the attributes are correlated, a more adequate form was introduced by Fukunaga [32]:
$\tilde{X}_n = \Lambda^{-1/2} U^{t} (X_n - \bar{X})$
where:
$\Lambda$ is a diagonal matrix holding the eigenvalues of the covariance matrix of all the existing observations
$U$ is the matrix whose columns are the eigenvectors of the covariance matrix
$\bar{X}$ is the mean vector of the observations
It is evident that the new attribute vectors have zero mean and an identity covariance matrix.
VII. METHODOLOGY
A. Process of the methodology:
[Figure 3: flow diagram. A specific dataset is selected from the GEO database (themes shown: disease, toxicology, metabolism, immunity, ecology). Its Microarray matrix of n genes by m conditions (values V11 ... Vnm) is the original matrix, which is very big. Pretreatment methods are applied to remove empty values and biases. The result is the significant Microarray data, a reduced matrix of z genes by m conditions.]
Fig. 3. The methodology required to get significant Microarray data
To illustrate the pretreatment techniques defined above, we followed this process to obtain the desired results:
Selecting from the GEO database a specific experiment that addresses biological questions and hypotheses
Getting the Microarray matrix of the selected experiment
Filtering the genes representing valuable information
Applying the predefined preprocessing methods for normalizing Microarray data
Computing and comparing the number of outliers detected by each method
Getting the significant Microarray data for further analysis
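The normalization step of this process can be sketched for two of the compared methods, centering and rescaling (CR) and the Quantile method (QM). The code below is an illustrative sketch under stated assumptions, not the authors' implementation: the matrix is a hypothetical list of gene rows by array columns, the sample standard deviation is used for sd(A_j), and ties in the Quantile method are broken by position rather than averaged. The correlation based method is left out because it needs an eigen-decomposition that the standard library does not provide.

```python
import statistics

def center_and_rescale(matrix):
    """CR: X*_ij = (X_ij - mean(A_j)) / sd(A_j) for each array profile A_j.

    Assumption: sd is the sample standard deviation (n - 1 denominator).
    """
    scaled_cols = []
    for col in zip(*matrix):  # iterate over array profiles (columns)
        m = statistics.mean(col)
        s = statistics.stdev(col)
        scaled_cols.append([(v - m) / s for v in col])
    return [list(row) for row in zip(*scaled_cols)]

def quantile_normalize(matrix):
    """QM: give every array the same empirical distribution.

    The reference distribution is the mean of the k-th smallest value
    across arrays; each value is replaced by the reference value of its
    rank. Ties are broken by position (an assumption of this sketch).
    """
    cols = [list(col) for col in zip(*matrix)]
    n = len(matrix)
    sorted_cols = [sorted(col) for col in cols]
    reference = [statistics.mean(sc[k] for sc in sorted_cols) for k in range(n)]
    normalized_cols = []
    for col in cols:
        order = sorted(range(n), key=lambda i: col[i])  # gene indices by rank
        new_col = [0.0] * n
        for rank, gene_index in enumerate(order):
            new_col[gene_index] = reference[rank]
        normalized_cols.append(new_col)
    return [list(row) for row in zip(*normalized_cols)]

# Hypothetical matrix: 4 genes (rows) x 3 arrays (columns).
matrix = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 6.0, 6.0],
    [4.0, 2.0, 8.0],
]
cr = center_and_rescale(matrix)
qn = quantile_normalize(matrix)
```

After `center_and_rescale`, each column has zero mean and unit standard deviation; after `quantile_normalize`, every column contains the same set of values, and only their assignment to genes differs.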
B. Algorithms of pretreatments: The large volume of Microarray data must be prepared before any kind of transformation, in order to select the pertinent information. By studying Microarray data, we found several types of gene expression profiles that reveal biases in the data, and we implemented algorithms to treat them. This allowed us to reduce the original data obtained from a specific Microarray experiment. The types of biased gene expression profiles are:
Gene expression profiles where the values are marked as NaN (Not a Number)
Gene expression profiles whose variance is less than the 10th percentile of the variances of all gene profiles: such a profile is nearly constant across samples and is not important for further analysis
Gene expression profiles containing absolute values less than a critical value: we fix a threshold that allows us to detect the gene profiles that do not represent pertinent data.
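The three filters listed above can be sketched as follows. This is a hedged illustration, not the authors' algorithms: the names `percentile` and `filter_gene_profiles`, the threshold `min_abs = 1.0`, the linear-interpolation percentile and the reading of the third rule as "every absolute value of the profile is below the threshold" are assumptions made for the example.

```python
import math
import statistics

def percentile(values, p):
    """p-th percentile by linear interpolation between sorted values."""
    s = sorted(values)
    k = (len(s) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def filter_gene_profiles(matrix, min_abs=1.0, variance_percentile=10):
    """Drop gene profiles (rows) that show one of the three biases."""
    # 1) Profiles containing NaN values.
    complete = [row for row in matrix if not any(math.isnan(v) for v in row)]
    if not complete:
        return []
    # 2) Profiles whose variance is below the chosen percentile of all variances.
    variances = [statistics.variance(row) for row in complete]
    cutoff = percentile(variances, variance_percentile)
    varying = [row for row, var in zip(complete, variances) if var >= cutoff]
    # 3) Profiles whose absolute values all stay below the critical value.
    return [row for row in varying if max(abs(v) for v in row) >= min_abs]

nan = float("nan")
# Hypothetical matrix: 6 gene profiles x 3 samples.
matrix = [
    [1.0, 2.0, 3.0],
    [2.0, nan, 1.0],    # contains NaN
    [0.5, 0.5, 0.5],    # constant profile, lowest variance
    [0.1, -0.2, 0.3],   # all absolute values below 1.0
    [4.0, 0.0, -4.0],
    [1.0, 5.0, 9.0],
]
kept = filter_gene_profiles(matrix)
```

On this toy input three of the six profiles remain. On a real dataset the variance cutoff and the critical value would need to be chosen with the experiment in mind.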
VIII. MICROARRAY CASE STUDY
A. Input data: The present study used a microarray dataset from the Gene Expression Omnibus (GEO) database [33].
TABLE I. FEATURES OF THE IMPORTED MICROARRAY EXPERIMENT
GEO Accession   Rows                  Columns      Organism       Experiment type
GSE47108        33252 Gene profiles   10 Samples   Homo Sapiens   Expression profiling by array
The Microarray experiment, identified by GSE47108 [34], is described by the following elements:
The submission date to GEO is: May 20, 2013
The last update date is: March 18, 2015.
The title of this experiment: Gene expression profiling in true interval breast cancer reveals over activation of mTOR signaling pathway.
TABLE II. THE SAMPLES OF THE MICROARRAY EXPERIMENT USED IN THE PRESENT STUDY
GEO Accession   Title   Condition                       Phenotype
GSM1145068      TIBC1   True interval tumors (TIBC)     TN
GSM1145069      SDBC1   Screen-detected tumors (SDBC)   LumA
GSM1145070      TIBC2   True interval tumors (TIBC)     LumA
GSM1145071      SDBC2   Screen-detected tumors (SDBC)   HER2+
GSM1145072      SDBC3   Screen-detected tumors (SDBC)   LumB
GSM1145073      SDBC4   Screen-detected tumors (SDBC)   LumA
GSM1145074      TIBC3   True interval tumors (TIBC)     LumA
GSM1145075      SDBC5   Screen-detected tumors (SDBC)   LumA
GSM1145076      TIBC4   True interval tumors (TIBC)     LumB
GSM1145077      TIBC5   True interval tumors (TIBC)     HER2+
B. Results: Ratios measure the changes in expression, and most analyses of differential expression rely on the log transformation of the intensity values. To illustrate the importance of the log transformation, we plotted eight genes, chosen at random from the data set, in two graphs, before and after the transformation, so that their characteristics can be compared and visualized clearly.
Fig. 4. Comparison between the genes before and after the Logarithm transformation
Comparing the two plots of the eight genes, we observed that the logarithm transformation of the Microarray data provides a clearer visualization. It makes it possible to distinguish between genes, especially those that are under-expressed. The second graph in Fig. 4 shows that up- and down-regulated genes are treated in a similar way: the transformation treats the expression ratios symmetrically. To depict the difference between the selected preprocessing methods, we drew box plots of the gene expression values of each condition, that is, of each column of the Microarray matrix. These box plots enabled us to study the distributional characteristics of the genes as well as the level of gene expression. The methods were applied to the data obtained after the pretreatment algorithms defined above and after logarithm transformation. On each box, the central mark is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points that the algorithm does not consider outliers, and the outliers are plotted individually.
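The outlier rule behind such box plots can be sketched as follows, which is also one way to count the outliers detected for each condition. The paper does not state the whisker length, so the 1.5 x IQR rule used here is an assumption (it is a common default in plotting tools), as are the function names and the linear-interpolation percentile. Other tools compute percentiles slightly differently, so counts may differ at the margins.

```python
import math

def percentile(values, p):
    """p-th percentile by linear interpolation between sorted values."""
    s = sorted(values)
    k = (len(s) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def boxplot_outliers(values, whisker=1.5):
    """Return the points lying beyond the whiskers of a box plot.

    Assumption: a point is an outlier when it is more than
    `whisker` * IQR below the 25th or above the 75th percentile.
    """
    q1 = percentile(values, 25)
    q3 = percentile(values, 75)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical expression values for one condition (one matrix column).
condition = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0, -40.0]
outliers = boxplot_outliers(condition)
n_outliers = len(outliers)
```

In this toy column both a high and a low extreme point are flagged, so the rule reports outliers at both ends of the distribution.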
Fig. 5. The box plots of each condition of the experiment after applying SC
Fig. 6. The box plots of each condition of the experiment after applying QM
Fig. 7. The box plots of each condition of the experiment after applying CBM
The box plots in Figs. 5, 6 and 7 show large differences between the results obtained on each condition by the applied preprocessing methods. We note that the correlation-based method (CBM) produced results that differ from those of the other methods: outliers appear not only among the high extreme data points but also among the low extreme ones. This normalization method spreads out the condition values as much as possible, which improves the discrimination capability of the similarity measures that can be used in data mining algorithms.
IX. CONCLUSION
Microarrays have helped answer many hypotheses and questions in domains such as pharmacology, disease diagnosis and organism studies. To pursue the aims of this powerful technology, we chose to work on real data obtained from Microarray experiments. Many scientists, biologists in particular, have difficulty analyzing Microarray data because of its sheer volume; this encouraged us to investigate and apply preprocessing techniques with the aim of comparing them and selecting an efficient one. The selected preprocessing methods show clear differences between the starting Microarray data and the data obtained after applying them. Comparing SC (centering and rescaling), QM and CBM, the different results depicted in the box plots highlight the advantages of each method. We conclude that the best technique is the one that allows outliers to be detected and that produces discriminative data, so that it is useful for data mining techniques. Our aim is to ease the work of investigators interested in Microarrays by letting them handle only the significant data, save time and carry out further analysis. We look forward to building a complete solution, available to researchers, that resolves many of the remaining obstacles in the fields concerned with Microarrays.
REFERENCES
[1] Schena, M., Heller, R., Theriault, T., Konrad, K., Lachenmeier, E., and Davis, "Microarrays: biotechnology’s discovery platform for functional genomics", Trends Biotech. 16, pp. 301–306, 1998.
[2] Schena, M., "Microarray analysis", John Wiley & Sons, New York, NY., 2003.
[3] C.-R. Chen, W.-Y. Shu, M.-L. Tsai, W.-C. Cheng, and I. C. Hsu, "THEME: A web tool for loop-design microarray data analysis", Computers in Biology and Medicine, vol. 42, no. 2, pp. 228–234, Feb. 2012.
[4] Jizhong Zhou and Dorothea K. Thompson, "Microarray technology and applications in environmental microbiology", Advances in Agronomy, Volume 82, 2004.
[5] Augenlicht, L., Wahrman, M., Halsey, H., Anderson, L., Taylor, J., and Lipkin, M., "Expression of cloned sequences in biopsies of human colonic tissue and in colonic–carcinoma cells induced to differentiate in vitro", Cancer Res. 47, pp. 6017–6021, 1987.
[6] Augenlicht, L., Taylor, J., Anderson, L., and Lipkin, M., "Patterns of gene expression that characterize the colonic mucosa in patients at genetic risk for colonic cancer", Proc. Nat. Acad. Sci. 88, pp. 3286–3289, 1991.
[7] C. Naidu and Y. Suneetha, "Review Article: Current Knowledge on Microarray Technology - An Overview", Tropical Journal of Pharmaceutical Research, vol. 11, no. 1, Mar. 2012.
[8] T. K. Karakach, R. M. Flight, S. E. Douglas, and P. D. Wentzell, "An introduction to DNA microarrays for gene expression analysis", Chemometrics and Intelligent Laboratory Systems, vol. 104, no. 1, pp. 28–52, Nov. 2010.
[9] Lockhart, D. J., Dong, H., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., and Brown, E. L., "Expression monitoring by hybridization to high-density oligonucleotide arrays", Nat. Biotechnol. 14, pp. 1675–1680, 1996.
[10] Schena, M., Shalon, D., Heller, R., Chai, A., Brown, P., and Davis, R. W., "Parallel human genome analysis: microarray-based expression monitoring of 1000 genes", Proc. Natl. Acad. Sci. USA 93, pp. 10614–10619, 1996.
[11] DeRisi, J. L., Iyer, V. R., and Brown, P. O., "Exploring the metabolic and genetic control of gene expression on a genomic scale", Science 278, pp. 680–686, 1997.
[12] Wodicka, L., Dong, H., Mittmann, M., Ho, M.-H., and Lockhart, D. J., "Genome-wide expression monitoring in Saccharomyces cerevisiae", Nat. Biotechnol. 15, pp. 1359–1367, 1997.
[13] Z. Ma, "Database modeling in biology: practices and challenges", New York [etc.]: Springer, 2006.
[14] http://www.ncbi.nlm.nih.gov/geo/
[15] Barrett, T., Suzek, T. O., Troup, D. B., Wilhite, S. E., Ngau, W. C., Ledoux, P., Rudnev, D., Lash, A. E., Fujibuchi, W., and Edgar, "NCBI GEO: Mining millions of expression profiles—database and tools", Nucleic Acids Res. 33, pp. D562–D566, 2005.
[16] Edgar, R., Domrachev, M., and Lash, A. E., "Gene Expression Omnibus: NCBI gene expression and hybridization array data repository", Nucleic Acids Res. 30, pp. 207–210, 2002.
[17] T. Barrett and R. Edgar, "Gene Expression Omnibus: Microarray Data Storage, Submission, Retrieval, and Analysis", in Methods in Enzymology, vol. 411, Elsevier, pp. 352–369, 2006.
[18] T. Barrett, D. B. Troup, S. E. Wilhite, P. Ledoux, C. Evangelista, I. F. Kim, M. Tomashevsky, K. A. Marshall, K. H. Phillippy, P. M. Sherman, R. N. Muertter, M. Holko, O. Ayanbule, A. Yefanov, and A. Soboleva, "NCBI GEO: archive for functional genomics data sets--10 years on", Nucleic Acids Research, vol. 39, no. Database, pp. D1005–D1010, Jan. 2011.
[19] G. B. Whitworth, "An Introduction to Microarray Data Analysis and Visualization", in Methods in Enzymology, vol. 470, Elsevier, pp. 19–50, 2010.
[20] Yuanyuan Ding and Dawn Wilkins, "The Effect of Normalization on Microarray Data Analysis", pp. 635-642, 2004.
[21] Claverie, J.M., "Computational methods for the identification of differential and coordinated gene expression", Hum. Mol. Genet. 8, pp. 1821-1832, 1999.
[22] Schuchhardt, J., Beule, D., Malik, A., Wolski, E., Eickhoff, H., Lehrach, H., and Herzel, H., "Normalization strategies for cDNA microarrays", Nucl. Acids Res. 28: E47, 2000.
[23] Lou, X.J., Schena, M., Horrigan, F.T., Lawn, R.M., and Davis, R.W., "Expression monitoring using cDNA microarrays. A general protocol", Meth. Mol. Biol. 175, pp. 323-340, 2001.
[24] Tseng, G.C., Oh, M.K., Rohlin, L., Liao, J.C., and Wong, W.H., "Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects", Nucl. Acids Res. 29, pp. 2549-2557, 2001.
[25] Yue, H., Eastman, P.S., Wang, B.B., Minor, J., Doctolero, M.H., Nuttall, R.L., Stack, R., Becker, J.W., Montgomery, J.R., Vainer, M., and Johnston, R., "An evaluation of the performance of cDNA microarrays for detecting changes in global mRNA expression", Nucl. Acids Res. 29: E41, 2001.
[26] http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/chapterfinal.pdf
[27] G. B. Whitworth, "An Introduction to Microarray Data Analysis and Visualization", in Methods in Enzymology, vol. 470, Elsevier, pp. 19–50, 2010.
[28] Tobias K. Karakach, Robert M. Flight, Susan E. Douglas, Peter D. Wentzell, "An introduction to DNA microarrays for gene expression analysis", pp. 28-52, 2010.
[29] Xiaofeng Zhou, Hiroshi Egusa, Steven W. Cole, Ichiro Nishimura, and David T.W. Wong, "Methodology of Microarray Data Analysis", pp. 1529, 2005.
[30] Dov Stekel, "Microarray Bioinformatics", Cambridge, 2003.
[31] B. M. Bolstad, R. A. Irizarry, M. Åstrand, and T. P. Speed, "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias", Bioinformatics, vol. 19, no. 2, pp. 185-193, 2003.
[32] K. Fukunaga, "Introduction to pattern recognition (Second ed.)", Academic Press (San Diego), 1990.
[33] Barrett, T., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., Tomashevsky, M., et al., "NCBI GEO: Archive for functional genomics data sets—Update", Nucleic Acids Research, 41(Database issue), pp. D991–D995, 2013.
[34] http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE47108