Parallel processing of genomics data

Parallel processing of genomics data
Giuseppe Agapito, Pietro Hiram Guzzi, and Mario Cannataro
Citation: AIP Conference Proceedings 1776, 080007 (2016); doi: 10.1063/1.4965364
View online: http://dx.doi.org/10.1063/1.4965364
View Table of Contents: http://scitation.aip.org/content/aip/proceeding/aipcp/1776?ver=pdfcov
Published by AIP Publishing

Reuse of AIP Publishing content is subject to the terms at: https://publishing.aip.org/authors/rights-and-permissions IP: 66.34.33.160 On: Fri, 28 Oct 2016 15:10:06

Parallel Processing of Genomics Data

Giuseppe Agapito (1, b), Pietro Hiram Guzzi (1, c) and Mario Cannataro (1, a)

(1) University Magna Græcia of Catanzaro, Italy

a) Corresponding author: [email protected]
b) [email protected]
c) [email protected]

Abstract. The availability of high-throughput experimental platforms for the analysis of biological samples, such as mass spectrometry, microarrays and Next Generation Sequencing, has made it possible to analyze a whole genome in a single experiment. Such platforms produce an enormous volume of data per experiment, and the analysis of this flow of data poses several challenges in terms of data storage, preprocessing, and analysis. To face those issues, efficient, possibly parallel, bioinformatics software needs to be used to preprocess and analyze data, for instance to highlight genetic variations associated with complex diseases. In this paper we present a parallel algorithm for the preprocessing and statistical analysis of genomics data that copes with high-dimensional data and achieves good response times. The proposed system is able to find statistically significant biological markers that discriminate classes of patients who respond to drugs in different ways. Experiments performed on real and synthetic genomic datasets show good speed-up and scalability.

INTRODUCTION

The exponential growth of automated technologies capable of determining the complete sequence of an entire genome has contributed to a dramatic growth in data volume, especially in genomic analysis. Genomic analysis has made it possible to shed light on different properties and variations of the genome, producing data on a scale much larger than traditional genetic studies. It is performed using Next Generation Sequencing (NGS) as well as microarrays for the genotyping of single nucleotide polymorphisms (SNPs) [1]. Nucleotides are the chemical building blocks of DNA: adenine (A), cytosine (C), guanine (G) and thymine (T). Pharmacogenomics experiments involve gene sequencing and the identification of SNPs by means of microarray technology and computational analysis. Personalized medicine refers to the possibility of tailoring therapies to the genome of patients, under the assumption that different genomic variants may affect the response to drugs [2, 3, 4, 5]. A set of SNPs known to be related to adverse drug reactions (ADRs) has been determined in the past [6]. ADRs occur most frequently when a drug has a narrow therapeutic index. The therapeutic index compares the dose of a drug that causes toxic effects with the dose that produces the desired therapeutic effect. Consequently, the investigation of these polymorphisms may limit incorrect drug dosage and the onset of adverse reactions, since their presence or absence may favor ADRs. High-performance computing may play a significant role in many phases of the analysis of genomic data, from data preprocessing [7] to data integration [8] and analysis, including data exploration and visualization. For this reason, high-performance computing techniques are increasingly used in bioinformatics [9, 10, 11, 12].
The DMET (drug metabolism enzymes and transporters) Plus Premier Pack is a novel microarray assay developed by Affymetrix (www.affymetrix.com) for gene profiling, designed specifically to test drug metabolism associations [13]. It enables the investigation of 1936 different nucleotides that are candidates for possible polymorphisms in 255 genes related to drug absorption, distribution, metabolism and excretion (ADME). We recently implemented DMET-Analyzer [14] and DMET-Miner [15], two software tools able to read Affymetrix DMET data and extract significant information from them. DMET-Analyzer is a software platform for the automatic statistical analysis of DMET data that employs the well-known Fisher test and several statistical corrections, such as Bonferroni or False Discovery Rate. To provide efficient and scalable data analysis as the amount of data increases, we extended the core of DMET-Analyzer to analyze DMET datasets in parallel. Thus, we developed coreSNP [10], a platform-independent parallel software tool written in Java that supports the massively parallel Fisher's test analysis of DMET-based pharmacogenomics data. The main drawback of DMET-Analyzer and coreSNP is that they can discover only associations between a single allelic variant and the clinical condition. However, many diseases, such as cancer, are known to be multifactorial, i.e. related to variations in more than one gene. DMET-Miner [15] is a novel methodology for the simultaneous analysis of genomic variants in more than one gene. It mines association rules with an optimized version of the Frequent Pattern-Growth (FP-Growth) algorithm [16], a well-known method in the data mining field. The modified FP-Growth needs to scan the database only twice to build an FP-Tree, a structure based on an extended prefix tree that stores the crucial information about frequent patterns in compressed form. Despite the improvements introduced by FP-Growth, DMET-Miner exploits little parallelism and requires many resources (CPU and memory) when working on huge microarray-based genomics datasets. In this work we propose ParDMET-Miner, a multi-threaded version of DMET-Miner based on a modified version of Parallel FP-Growth (PFP-Growth) [17], to obtain better performance and higher parallelism even on huge biological datasets. The main contributions of ParDMET-Miner are: i) a customized multi-threaded version of PFP-Growth for the analysis of DMET data, and ii) a novel dataset-partitioning scheme that produces independent computational tasks that can be run concurrently on several independent threads. As described later, this makes it possible to overcome the limitations that traditional data mining tools such as Weka, RapidMiner, and Knime show when analyzing biological datasets, as stated in [15].

The rest of the paper is organized as follows. Section 2 presents the DMET data format, Section 3 describes the ParDMET-Miner algorithm, and Section 4 outlines future work and concludes the paper.

Numerical Computations: Theory and Algorithms (NUMTA–2016) AIP Conf. Proc. 1776, 080007-1–080007-4; doi: 10.1063/1.4965364 Published by AIP Publishing. 978-0-7354-1438-9/$30.00

DMET DATA

The DMET (drug metabolism enzymes and transporters) platform developed by Affymetrix enables the investigation of 1936 different nucleotides that are candidates for possible polymorphisms in 255 genes related to drug absorption, distribution, metabolism and excretion (ADME). Usually, in a case-control study, the output of the Affymetrix DMET platform is arranged as a large m × n table of Single Nucleotide Polymorphisms (SNPs), where m is the number of probes (m = 1936 for current DMET chips) and n is the number of samples (patients), as depicted in Figure 1(a). Each entry (i, j) of this matrix contains a string such as A/A, in which the first letter represents the nucleotide of the first allele and the second letter represents the nucleotide of the second allele. In this way, a genomic variant in two samples may be expressed in the form: A/A for patient 1, and A/A for patient 2. To mine DMET data such as those in Figure 1(a), the dataset needs to be converted into a format compatible with a transaction database. In a transaction database each entry (row) consists of a transaction identifier (e.g. a purchase in a shop) and a set of items (the purchased items). In our case, the transaction identifier (id) is the identifier of a patient s_i, while the items are the SNPs detected in that sample s_i (i = 1, ..., n) on each probe p_j (j = 1, ..., m). The input data table produced by DMET Console (see e.g. Figure 1(a)) is loaded and transposed, obtaining an n × m matrix of alleles named Transaction Database (TD), see Figure 1(b). In this way, each row of the TD table represents a transaction, whose items are all the SNPs detected in a patient on the various probes.
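The transposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `to_transaction_db`, the sample and probe identifiers, the genotype values, and the choice of labelling each item as `probe:genotype` (so that the same genotype on two different probes yields two distinct items) are all hypothetical.

```python
def to_transaction_db(probe_ids, sample_ids, table):
    """Transpose an m x n probe-by-sample genotype table into a transaction database.

    table[i][j] is the genotype string (e.g. "A/A") of probe i in sample j.
    The result maps each sample identifier to its list of items, one per probe.
    The "probe:genotype" item label is an assumption made for this sketch.
    """
    td = {}
    for j, sample in enumerate(sample_ids):
        # One transaction per sample: collect that sample's genotype on every probe.
        td[sample] = [f"{probe}:{table[i][j]}" for i, probe in enumerate(probe_ids)]
    return td


# Hypothetical toy data: 2 probes (rows) by 3 samples (columns).
probes = ["P1", "P2"]
samples = ["S1", "S2", "S3"]
genotypes = [
    ["A/A", "A/G", "A/A"],  # probe P1
    ["C/T", "C/C", "C/T"],  # probe P2
]
transactions = to_transaction_db(probes, samples, genotypes)
```

Each value of `transactions` then plays the role of one row of the TD table, with the sample identifier as transaction id.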

FIGURE 1: The left table (a) represents a simple DMET SNP microarray dataset. The right table (b) is the transposed DMET SNP microarray dataset. S and P refer to sample and probe identifiers, respectively.


FIGURE 2: Speedup and execution times obtained by ParDMET-Miner using 1, 2, 4 and 6 slave threads.
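When reading Figure 2, it may help to recall the usual definition of speedup. The paper does not state which definition it uses, so the formula below is an assumption; T(1) and T(p) denote the execution times measured with one and with p slave threads, and the efficiency E(p) is added here only for convenience.

```latex
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}, \qquad p \in \{1, 2, 4, 6\}
```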

METHOD

ParDMET-Miner is based on a multi-threaded Master-Slave architecture. The Master Thread (MT) is responsible for partitioning the load, distributing it to the slave threads and collecting the results, while the Slave Threads (STs) compute the association rules locally. Figure 2 reports the execution times and the speedup obtained by analyzing a synthetic input DMET dataset composed of 100 columns (subjects) and 1936 rows (probes) on 1, 2, 4 and 6 processors. The parallel algorithm of ParDMET-Miner shows good response times and speedup.

Master Thread (MT)

The MT receives as input the dataset, the number of available cores #cores (which can be detected automatically through an appropriate system call), and the minimum support minSupp and confidence values used to mine association rules. The first step of the MT is to compute the frequent items and build the FrequentItemsList (FLIST). FLIST contains only the frequent items, that is, every item i for which support(i) ≥ minSupp. The items in FLIST are then used to remove the infrequent items from each transaction and to compute the keys from which the independent transactions are built. Independent transactions are key-value pairs: the key is the least frequent item in the transaction under analysis, and the value is the prefix that remains once the key is removed. The pairs are obtained iteratively, removing the least frequent item at each step. They are collected in a hash-table-based data structure (see Figure 3(a)). Each key now identifies a set of independent transactions, which the MT partitions among the STs running on the #cores available cores (see Figure 3(b)). The MT then starts #cores instances of ST concurrently and, as a last step, collects and merges the results.
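The MT steps described above (building FLIST, filtering transactions, generating key-value pairs and partitioning the keys) can be sketched as follows. This is a minimal Python illustration, not the ParDMET-Miner code: the function names, the tie-breaking by item name, the round-robin partitioning and the toy items are assumptions introduced for the example.

```python
from collections import Counter, defaultdict


def frequent_items(transactions, min_supp):
    """Build FLIST: every item whose support is at least min_supp, with its support."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: c for item, c in counts.items() if c >= min_supp}


def independent_transactions(transactions, flist):
    """Turn each transaction into key-value pairs (key = least frequent item, value = prefix)."""
    table = defaultdict(list)
    for t in transactions:
        # Drop infrequent items and order by decreasing support (ties by name: an assumption).
        kept = sorted((i for i in set(t) if i in flist), key=lambda i: (-flist[i], i))
        # Repeatedly peel off the least frequent item; what is left is its prefix.
        while kept:
            key = kept.pop()
            table[key].append(tuple(kept))
    return dict(table)


def partition_keys(keys, n_workers):
    """Assign keys to workers round-robin (one possible partitioning, assumed here)."""
    ordered = sorted(keys)
    return [ordered[w::n_workers] for w in range(n_workers)]


# Hypothetical toy transactions, one per patient.
toy = [
    ["P1:A/A", "P2:C/T", "P3:G/G"],
    ["P1:A/A", "P2:C/T"],
    ["P1:A/A", "P3:G/A"],
]
flist = frequent_items(toy, min_supp=2)
table = independent_transactions(toy, flist)
parts = partition_keys(table, n_workers=2)
```

Because every pair stored under a key contains only items that are at least as frequent as that key, the groups of pairs can be mined without exchanging data between workers, which is what makes the tasks independent.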
Slave Thread (ST)

Each ST receives as input a set of key-value pairs, with which it builds its local FP-Tree. It then runs the core association rule mining algorithm on this FP-Tree to extract relevant knowledge from its own set of independent transactions. As a last step, each ST sends the significant association rules it has mined back to the MT.
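The coordination between the MT and the STs can be illustrated with a thread pool. The sketch below is written under strong assumptions and is not the authors' implementation: instead of building an FP-Tree, each worker counts itemsets by brute-force enumeration of prefix subsets (exponential in the prefix length, so usable only on toy data), and it returns frequent itemsets with their support rather than association rules filtered by confidence. The names `slave` and `master` and the input data are hypothetical. Note also that threads in standard CPython do not run CPU-bound code in parallel, so the example shows the pattern, not the speedup.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def slave(task, min_supp):
    """Mine the keys assigned to one worker (brute-force stand-in for FP-Growth)."""
    found = {}
    for key, prefixes in task.items():
        counts = Counter()
        for prefix in prefixes:
            # Every subset of the prefix, extended with the key, is supported once.
            for size in range(len(prefix) + 1):
                for subset in combinations(prefix, size):
                    counts[subset + (key,)] += 1
        for itemset, supp in counts.items():
            if supp >= min_supp:
                found[itemset] = supp
    return found


def master(table, n_workers, min_supp):
    """Partition the keys, run one slave per partition, then merge the results."""
    keys = sorted(table)
    tasks = [{k: table[k] for k in keys[w::n_workers]} for w in range(n_workers)]
    merged = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(lambda t: slave(t, min_supp), tasks):
            # Keys are disjoint across workers, so partial results never collide.
            merged.update(partial)
    return merged


# Hypothetical key -> prefixes table, prefixes ordered by decreasing item frequency.
pairs = {
    "P2:C/T": [("P1:A/A",), ("P1:A/A",)],
    "P1:A/A": [(), (), ()],
}
result = master(pairs, n_workers=2, min_supp=2)
```

Merging is a plain dictionary update because each itemset ends with its key and every key is handled by exactly one worker.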

FIGURE 3: In the left table, the first column contains the transaction identifier (Id), the second column contains the raw items, and the third column contains only the frequent items (items with support value ≥ 2) from which the keys are computed. The right table reports the independent transactions obtained from the transaction dataset.

CONCLUSIONS

We proposed ParDMET-Miner, a multi-threaded parallel version of the FP-Growth algorithm for the extraction of association rules from DMET data. However, despite the improvement in data processing speed obtained through parallel computing, the problem of the high number of possible candidate patterns remains, especially when facing huge datasets such as those produced by microarray-based Genome Wide Association Studies (GWAS) and Next Generation Sequencing (NGS). To handle those datasets efficiently, as future work we are defining a new methodology to improve the generation of key-value pairs and so limit the number of possible candidate patterns.

ACKNOWLEDGMENTS

This work has been partially supported by the PON03PE_00001_1 "BA2Know-Business Analytics to Know" research project, funded by MIUR.

REFERENCES

[1] C. Phillips, Single Nucleotide Polymorphisms, Volume 578, 43–71 (2009), Methods in Molecular Biology (Clifton, N.J.).
[2] P. H. Guzzi, G. Agapito, M. Milano, and M. Cannataro, Briefings in Bioinformatics p. bbv076 (2015).
[3] E. Rumiato, E. Boldrin, A. Amadori, and D. Saggioro, Cancer Chemotherapy and Pharmacology 72, 483–488 (2013).
[4] M. Arbitrio, M. T. Di Martino, V. Barbieri, G. Agapito, P. H. Guzzi, C. Botta, E. Iuliano, F. Scionti, E. Altomare, S. Codispoti, et al., Cancer Chemotherapy and Pharmacology 77, 205–209 (2016).
[5] M. T. Di Martino, M. Arbitrio, P. H. Guzzi, E. Leone, F. Baudi, E. Piro, T. Prantera, I. Cucinotto, T. Calimeri, M. Rossi, et al., British Journal of Haematology 154, 529–533 (2011).
[6] J. Li, L. Zhang, H. Zhou, M. Stoneking, and K. Tang, Human Molecular Genetics 20, 528–540 (2011).
[7] M. Cannataro, P. H. Guzzi, T. Mazza, G. Tradigo, and P. Veltri, "Preprocessing of mass spectrometry proteomics data on the grid," in 18th IEEE Symposium on Computer-Based Medical Systems (CBMS'05) (IEEE, 2005), pp. 549–554.
[8] M. Cannataro, C. Comito, A. Guzzo, and P. Veltri, "Integrating ontology and workflow in proteus, a grid-based problem solving environment for bioinformatics," in Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on, Vol. 2 (IEEE, 2004), pp. 90–94.
[9] P. H. Guzzi and M. Cannataro, BMC Bioinformatics 11, 315+ (2010).
[10] P. H. Guzzi, G. Agapito, and M. Cannataro, Computers, IEEE Transactions on 63, 2961–2974 (2014).
[11] B. Calabrese and M. Cannataro, Scalable Computing: Practice and Experience 16, 1–18 (2015).
[12] M. Cannataro, D. Talia, G. Tradigo, P. Trunfio, and P. Veltri, Future Generation Computer Systems 24, 222–234 (2008).
[13] T. M. Sissung, B. C. English, D. Venzon, W. D. Figg, and J. F. Deeken, Pharmacogenomics 11, 89–103.
[14] P. Guzzi, G. Agapito, M. Di Martino, M. Arbitrio, P. Tassone, P. Tagliaferri, and M. Cannataro, BMC Bioinformatics 13, p. 258 (2012).
[15] G. Agapito, P. H. Guzzi, and M. Cannataro, Journal of Biomedical Informatics 56, 273–283 (2015).
[16] C. Borgelt, "An implementation of the fp-growth algorithm," in Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations (ACM, 2005), pp. 1–5.
[17] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, "Pfp: parallel fp-growth for query recommendation," in Proceedings of the 2008 ACM conference on Recommender systems (ACM, 2008), pp. 107–114.
