2017 2nd International Conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE)
Processing Next Generation Sequencing Data in MapReduce Framework using Hadoop-BAM in a Computer Cluster
Rifki Sadikin, Andria Arisal
Rofithah Omar, Nur Hidayah Mazni
Research Center for Informatics Indonesian Institute of Sciences Bandung, Jawa Barat 40135, Indonesia {rifki.sadikin,andria.arisal}@lipi.go.id
Faculty of Science and Information Technology Universiti Teknologi PETRONAS 32610 Seri Iskandar, Perak, Malaysia {rofithah_22427, nurhidayah_22399}@utp.edu.my
Abstract— Next-Generation Sequencing in bioinformatics produces a massive volume of data. Big data technologies are needed to reduce the computation time of processing it. In this paper, we implement the Hadoop Map-Reduce framework for processing Next-Generation Sequencing data using the Hadoop-BAM library. Our implementation processes a Binary Alignment Map (BAM) file, which contains a reference sequence and many aligned/non-aligned reads, by splitting the BAM file into Hadoop data blocks. To process the BAM file in a computer cluster, we implement a mapper and a reducer for the Hadoop Map-Reduce framework: the mapper processes the BAM file to produce key-value pairs, while the reducer summarizes the key-value pairs into a meaningful output. Here, the mapper and reducer are created to summarize the number of bases in a BAM file. We conduct the experiment on a LIPI Hadoop cluster consisting of 96 CPU cores. The results of our experiments show that our Map-Reduce implementation gains speed-up compared to serial Next-Generation Sequencing processing with the Picard tools.
Keywords— bioinformatics; map-reduce; Next-Generation Sequencing;
I. INTRODUCTION
With the emergence of sequencing technologies in bioinformatics, sequencing reads have become one of the most pressing big data problems [1]. The introduction of these tools has made it possible to analyze gigantic amounts of biological data more accurately and quickly [2]. Moreover, advances in modern technology have made Next-Generation Sequencing (NGS) more systematic, to the point that data can be processed and analyzed in parallel using scalable techniques [3]. Apache Hadoop is the most notable platform currently used in the bioinformatics community for parallelizing work over such data sets [3]. In this paper, we describe the design and implementation of Hadoop Map-Reduce for Next-Generation Sequencing (NGS) data, applied to summarizing the number of bases in a BAM file. The Hadoop Map-Reduce implementation is crucial for executing Binary Alignment Map (BAM) files in parallel [4]. The mapper and reducer stages parallelize the work by splitting a big
978-1-5386-0658-2/17/$31.00 ©2017 IEEE
BAM file into Hadoop data blocks [1] and using the Picard tool to process the BAM files [5]. Broadly speaking, Hadoop-BAM is a library, written in the Java programming language on top of the Hadoop Map-Reduce framework, for manipulating Next-Generation Sequencing file formats on various platforms [4]. As a library for the Hadoop framework, Hadoop-BAM handles the issues related to BAM splitting behind a suitable Application Program Interface (API) [4]. We therefore use tools from Hadoop-BAM to process Next-Generation Sequencing (NGS) data in a Hadoop cluster.
The rest of this paper is arranged as follows: Section 2 describes Next-Generation Sequencing, including its definition, its data formats, and how NGS data is processed with big data technology. Section 3 presents the Map-Reduce implementation for summarizing Next-Generation Sequencing (NGS) data, comprising the Map and Reduce algorithms. Section 4 describes the LIPI Hadoop cluster and presents the results of the comparison between Hadoop-BAM and Picard, together with a discussion and the conclusions of this work.
II. NEXT-GENERATION SEQUENCING
A. Definition
Next-Generation Sequencing (NGS) technologies provide various types of applications that generate millions of read fragments in a single run by parallelizing the sequencing process [6, 7, 8, 9, 10]. NGS technologies were adopted and publicized internationally by genomic researchers in 2005 [9, 11] as a new platform for the DNA sequencing process. NGS technologies can accurately produce large amounts of sequencing data, have precipitously reduced the cost of research, have decreased the elapsed time of read alignment, and produce higher quality than previous sequencers [6, 8, 9, 12].
Over time, NGS technologies have been further enhanced, and sequencers have been adapted to particular processes [12].
B. NGS Data Format
To deal with sophisticated data and unforeseen genomic sequences [7], NGS relies on several applications and tools to read DNA sequences precisely, such as data from human genomes used in related research [6]. Accordingly, NGS technologies use different data formats for specific alignment tools [13], such as the SAM, BAM, FASTQ and VCF formats [14]. The Sequence Alignment Map (SAM) file format is applied in NGS for short-read alignments, can exceed 128 MB in size, and can be stored, split, indexed and processed by different tools [13]. The BAM format, or Binary Alignment Map, is the binary representation of SAM information [13]. BAM data is compressed into BGZF blocks, which is why a supporting library is needed to process it in the Hadoop Map-Reduce framework [4, 14]. FASTQ is the file format for paired-end reads, where two alignment read files can be joined together into a single file [14, 15].
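To make the SAM layout concrete, the following self-contained Java sketch parses one SAM alignment line according to the 11 mandatory tab-separated fields of the SAM specification and counts the bases of its read sequence. The read name and field values here are made up for illustration; they are not taken from the paper's data set.

```java
import java.util.Map;
import java.util.TreeMap;

public class SamLineExample {
    // Count occurrences of each base character in a read sequence.
    static Map<Character, Integer> baseCounts(String seq) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char base : seq.toCharArray()) {
            counts.merge(base, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // One hypothetical SAM alignment line with the 11 mandatory fields:
        // QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
        String samLine = "read001\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGA\tFFFFFFFF";
        String[] fields = samLine.split("\t");
        String seq = fields[9];  // field 10 (SEQ) holds the read bases

        System.out.println(baseCounts(seq));  // {A=3, C=2, G=2, T=1}
    }
}
```

A BAM file stores exactly the same record fields, but binary-encoded and BGZF-compressed, which is why splitting it for Hadoop needs library support.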
FASTQ files can be handled with two separate tools, the Joiner and Splitter tools: FASTQ Joiner attaches the reads of two files into one, while, conversely, the Splitter tool processes a FASTQ file and splits its data into two [15]. The Variant Call Format (VCF) was created to support reference genome merging, comparison, base quality scores and other related tooling; a VCF file is likewise compressed using BGZF blocks [14, 16].
C. Processing NGS Data with Big Data Technology
NGS technologies require enormous and flexible data storage that provides efficient ways of dealing with genomic sequence analysis [11]. Previous studies showed that implementing big data technologies with cloud computing, as a new approach in bioinformatics research, solves the data storage problem [5, 17]. At present, big data technologies are widely adopted in the shape of Apache Hadoop Map-Reduce and the Hadoop Distributed File System (HDFS) [17]. The growth of sequencing data has led researchers to implement the Hadoop-BAM Java library, which depends on Hadoop Map-Reduce and Picard, to run the sequencing process in parallel [4]. As a result, Hadoop code does not have to deal with BGZF-compressed blocks, read alignment boundaries, boundary detection, or deconstructing the binary data. Hadoop-BAM depends on the Picard API in order to make huge amounts of data usable with Hadoop [4]. NGS sequencing with big data in the Halvade framework also carries out sequence analysis using the Hadoop Map-Reduce framework for high-performance data processing in cloud computing [5]. The Halvade framework created a new approach to NGS data sequencing in which read alignment and variant calling run in parallel, using a different tool in each phase. To cope with massive data streams, Halvade runs the NGS tools as separate multithreaded instances, which makes them easy to run on multiple nodes and allows a tool to be replaced by a newer version without changing the framework [5].
III. MAP-REDUCE IMPLEMENTATION FOR SUMMARIZING NGS DATA
Map-Reduce is a programming platform for processing huge data sets in parallel, where the programmer writes the processing logic directly against the framework [1, 5]. The framework breaks the work down into two major phases, namely the Map phase and the Reduce phase [1, 5, 18]. The Map phase consists of input, splitting, mapping, and shuffling and sorting, while the Reduce phase reduces the tuples (key-value pairs) and produces the output [1, 18].
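As a minimal illustration of the splitting step, the number of Hadoop blocks a file occupies is a ceiling division by the block size. This sketch assumes HDFS's default 128 MB block size; the split counts reported later in Table I also depend on the files' exact byte sizes and record boundaries, so they need not match this formula exactly.

```java
public class BlockCountExample {
    // Number of HDFS blocks needed for a file of the given size,
    // assuming the default 128 MB block size.
    static long hadoopBlocks(long fileSizeBytes) {
        final long BLOCK = 128L * 1024 * 1024;
        return (fileSizeBytes + BLOCK - 1) / BLOCK;  // ceiling division
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        System.out.println(hadoopBlocks(oneGiB));      // 8 blocks for exactly 1 GiB
        System.out.println(hadoopBlocks(oneGiB + 1));  // 9: one extra byte spills into a new block
    }
}
```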
Fig. 1. Example of a Mapper algorithm
Using the Hadoop-BAM tools, the Map phase takes the set of genome data as input and converts it into another set of data [1]. In the splitting part, the Hadoop mapper breaks the data set into data blocks before storing them on disk [19]; each data block holds 128 MB [13], so one data set spans many data blocks. The Hadoop framework standardizes this block size in order to process the data more quickly [1, 5]. When mapping to key-value pairs, the mapper produces intermediate keys from the data in each data block [19]. As illustrated in Fig. 1, every word that appears in the data block is counted one by one. Then, in the shuffling and sorting part, all reads that map to the same intermediate key are grouped together [1, 5, 19]. The keys created by the mapper are sorted first, and their values are handed over to the Reduce phase.
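Ignoring the Hadoop API, the per-record logic of such a base-counting mapper can be sketched in plain Java as follows. The class and method names are illustrative only, not Hadoop-BAM's actual API: for each read's sequence, the mapper emits one (base, 1) intermediate pair per base.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;

public class BaseCountMapperSketch {
    // Emulates one map() call: a read's bases in, (base, 1) pairs out.
    static List<SimpleEntry<Character, Integer>> map(String readBases) {
        List<SimpleEntry<Character, Integer>> pairs = new ArrayList<>();
        for (char base : readBases.toCharArray()) {
            pairs.add(new SimpleEntry<>(base, 1));  // intermediate key-value pair
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(map("ACGT"));  // [A=1, C=1, G=1, T=1]
    }
}
```

In the real job, the framework's shuffle then groups these pairs by key before they reach the reducer.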
(Fig. 3 layout: TAUKE02 serves as NameNode/ResourceManager and Secondary NameNode; DataNodes such as KLEREK14 make up the LIPI cluster.)
Fig. 2. Example of a Reducer algorithm
Fig. 3. Example of the LIPI Clusters
In the second stage, the Reduce phase, the Hadoop reducer receives all the key-value pairs from the shuffling as input and filters them, examining how each key differs from the previous one in order to save time in the reduce phase. Next, it computes the total for each intermediate key [1, 19]. Afterwards, the resulting totals are collected and gathered into one output folder [19].
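Correspondingly, the reduce step can be sketched in plain Java, again independent of the actual Hadoop API and with illustrative names only: after the shuffle has grouped all values belonging to one intermediate key, the reducer simply sums them.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BaseCountReducerSketch {
    // Emulates one reduce() call: sums the grouped 1s for one intermediate key.
    static int reduce(char base, List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    public static void main(String[] args) {
        // After shuffle/sort, all values for the same key arrive together:
        Map<Character, List<Integer>> grouped = new TreeMap<>();
        grouped.put('A', List.of(1, 1, 1));
        grouped.put('C', List.of(1, 1));
        grouped.forEach((base, vals) ->
            System.out.println(base + "\t" + reduce(base, vals)));
        // A	3
        // C	2
    }
}
```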
IV. RESULT AND DISCUSSION
A. LIPI Hadoop Clusters
The Indonesian Institute of Sciences (LIPI) provided us with the Hadoop cluster of its Research Center for Informatics, known as P2I LIPI, for carrying out the experiment. We use the LIPI Hadoop cluster at the Bandung site, with the following specifications:
a) Basic nodes: 34 basic nodes, each containing 2 processors with 4 cores per processor. The P2I LIPI cluster uses Dual Intel Xeon E5-2609 processors at 2.4 GHz, with 8 GB of DDR3-1600 RAM and 500 GB of SATA hard disk space per node. The nodes run the Linux (CentOS) operating system and have a dual-GB interconnection network.
b) GPU nodes: 4 Graphics Processing Unit (GPU) nodes, each containing 2 processors with 4 cores per processor. The GPU nodes use Dual Intel Xeon E5-2609 processors at 2.4 GHz with 8 GB of DDR3-1600 RAM and 500 GB of SATA hard disk space. Each GPU node also provides a dual-GB interconnection and an NVIDIA Tesla M2075 GPGPU, and runs the Linux (CentOS) operating system.
c) Master nodes: 2 nodes, each with 2 processors and 8 cores per processor, using Dual Intel Xeon E5-2650 processors at 2.0 GHz with 128 GB of DDR3-1600 RAM and 24 TB of raw SATA disk in RAID 5. The nodes are interconnected by a dual 10 GB interconnection and run the Linux (CentOS) operating system.
d) TAUKE02: The total capacity of TAUKE02 is 3.6 TB, of which about 174.59 GB is used by the Distributed File System (DFS) and 466.82 GB for non-DFS storage. On TAUKE02, the block pool holds 174.59 GB, containing the block files of the namespace, across 12 name nodes.
e) KLEREK14: It contains 307.18 GB of storage, of which 14.88 GB is used for DFS and 43.14 GB for non-DFS. KLEREK14 holds 254 blocks and used 4.85% of its capacity (14.88 GB) to test the alignment-read processing of the Hadoop-BAM file.
B. Results and Discussion
In our experiments on NGS data processing in the Map-Reduce framework using Hadoop-BAM, we run reads of BAM files and compare the performance of Hadoop-BAM with that of Picard HTSJDK. The two tools were tested on elapsed time, the number of splits, and the number of read alignments, using Picard htsjdk-2.3.0 and Hadoop-BAM 7.8.1 with Hadoop 2.7.3. We investigate the performance of the tools on different genome data sizes: 1.0 GB, 2.1 GB, 3.1 GB, 4.0 GB, 5.1 GB, 6.0 GB, 7.0 GB, 8.1 GB, 9.1 GB and 10.0 GB BAM files. The genome data consist of the four types of nitrogenous bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T) [20]. For that reason, we use the Hadoop-BAM tools to read the alignments of nitrogenous bases. To evaluate the read counting of BAM files with Picard, we first configured the required Picard tool to count the alignment reads of the BAM files. Once configured, Picard processes the data serially, with one BAM file entry accessed at a time, in contrast to Hadoop-BAM, which processes the collection in parallel, with all entries running simultaneously. The data size, elapsed time, and result for each run are presented in Table I.
TABLE I. ELAPSED TIMES OF PICARD AND HADOOP-BAM, NUMBER OF BAM SPLITS, AND THE RESULTING SPEED-UP (PICARD / HADOOP-BAM)

Data size | Picard (s) | Hadoop-BAM (s) | BAM splits | Speed-up (Picard/Hadoop-BAM)
1.0 GB    | 37.09      | 4.64           | 9          | 7.99
2.1 GB    | 120.37     | 14.98          | 17         | 8.04
3.1 GB    | 135.28     | 16.01          | 25         | 8.45
4.0 GB    | 157.07     | 16.48          | 32         | 9.53
5.1 GB    | 225.00     | 23.38          | 41         | 9.62
6.0 GB    | 237.71     | 24.28          | 48         | 9.79
7.0 GB    | 289.77     | 29.38          | 57         | 9.86
8.1 GB    | 336.00     | 33.69          | 65         | 9.97
9.1 GB    | 390.52     | 37.63          | 74         | 10.38
10.0 GB   | 429.46     | 37.11          | 81         | 11.57
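The speed-up column of Table I is simply the serial Picard elapsed time divided by the Hadoop-BAM elapsed time. As an arithmetic check, this snippet reproduces the first and last rows of the table:

```java
public class SpeedupCheck {
    // Speed-up of the parallel run: serial time over parallel time.
    static double speedup(double picardSec, double hadoopBamSec) {
        return picardSec / hadoopBamSec;
    }

    public static void main(String[] args) {
        // First and last rows of Table I
        System.out.printf("%.2f%n", speedup(37.09, 4.64));    // 7.99 for the 1.0 GB file
        System.out.printf("%.2f%n", speedup(429.46, 37.11));  // 11.57 for the 10.0 GB file
    }
}
```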
Fig. 4. Speed up gain: Map-Reduce VS serial on summarizing BAM File

As shown in Figure 4, the Map-Reduce version of BAM file summarization gains significant speed-up. The reduction in computation time is supported by the number of splits produced by Hadoop-BAM: in the Hadoop file system, a large file is divided into a number of Hadoop blocks (128 MB each), and the Hadoop-BAM library splits a large BAM file into such Hadoop blocks accordingly.

Our Map-Reduce implementation processes these BAM splits in parallel, subject to the number of available data nodes. Consequently, different BAM data sizes produce large differences in runtime between the serial processing of Picard and Hadoop-BAM. The speed-up difference between 1.0 GB and 2.1 GB is not very pronounced, only 0.05, but the change becomes noticeable when comparing the speed-up of 2.1 GB with 3.1 GB, and of 3.1 GB with 4.0 GB, along with the number of Hadoop-BAM splits. As a consequence, the graph shows that the speed-up increases as the BAM data size grows.

V. CONCLUSION AND FURTHER STUDIES
In this work, we show that the Map-Reduce version of Next-Generation Sequencing data processing speeds up the computation. This means that Hadoop Map-Reduce running on a computer cluster can reduce the computation time of processing NGS data, which usually has a very large size. However, our study is limited to summarizing the reads in a large BAM file, a task that is naturally easy to parallelize. Further studies are needed to show how effective the approach is for other types of computation in NGS data processing.

ACKNOWLEDGMENT
The Hadoop cluster was fully provided by P2I LIPI, Research Center for Informatics, Indonesian Institute of Sciences. Financial funding for the student internship program was fully supported by the Yayasan Universiti Teknologi PETRONAS (YUTP) Scholarship.

REFERENCES
[1] Matti, N. (2013). Analysing sequencing data in Hadoop: The road to interactivity via SQL. Retrieved 09 28, 2017, from https://aaltodoc.aalto.fi/bitstream/handle/123456789/11886/master_niemenmaa_matti_2013.pdf
[2] Sehar, U., Ahmad, N., & Mehmood, M. A. (2014). Use of bioinformatics tools in different spheres of life sciences. Journal of Data Mining in Genomics & Proteomics. Retrieved 09 28, 2017, from https://www.omicsonline.org/open-access/use-of-bioinformatics-tools-in-different-spheres-of-life-sciences-2153-0602-5-158.pdf
[3] Driscoll, A. O., Daugelaite, J., & Sleator, R. D. (2013). 'Big data', Hadoop and cloud computing in genomics. Retrieved 09 30, 2017, from http://www.sciencedirect.com/science/article/pii/S1532046413001007
[4] Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., & Heljanko, K. (2012). Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Retrieved 09 30, 2017, from https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/bts054
[5] Decap, D., Reumers, J., Herzeel, C., Costanza, P., & Fostier, J. (2015). Halvade: scalable sequence analysis with MapReduce. Retrieved 09 30, 2017, from https://www.ncbi.nlm.nih.gov/pubmed/25819078
[6] Patel, R. K., & Jain, M. (2012). NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS ONE 7(2): e30619.
[7] Behjati, S., & Tarpey, P. S. (2013). What is next generation sequencing? Arch Dis Child Educ Pract Ed 98: 236–238.
[8] Buermans, H. P. J., & den Dunnen, J. T. (2014). Next generation sequencing technology: advances and applications. Biochimica et Biophysica Acta 1842: 1932–1941.
[9] Kchouk, M., Gibrat, J. F., & Elloumi, M. (2017). Generations of sequencing technologies: from first to next generation. Biol Med (Aligarh) 9: 395. doi:10.4172/0974-8369.1000395
[10] Shendure, J., & Ji, H. (2008). Next-generation DNA sequencing. Nat. Biotechnol. 26: 1135–1145.
[11] Morozova, O., & Marra, M. A. (2008). Applications of next-generation sequencing technologies in functional genomics. Genomics 92: 255–264.
[12] Tripathi, R., Sharma, P., Chakraborty, P., & Varadwaj, P. K. (2016). Next-generation sequencing revolution through big data analytics. Frontiers in Life Science 9(2): 119–149. doi:10.1080/21553769.2016.1178180
[13] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078–2079. doi:10.1093/bioinformatics/btp352
[14] Decap, D., Reumers, J., Herzeel, C., Costanza, P., & Fostier, J. (2017). Halvade-RNA: parallel variant calling from transcriptomic data using MapReduce. PLoS ONE 12(3): e0174575. https://doi.org/10.1371/journal.pone.0174575
[15] Blankenberg, D., Gordon, A., Von Kuster, G., Coraor, N., Taylor, J., Nekrutenko, A., et al. (2010). Manipulation of FASTQ data with Galaxy. Bioinformatics 26(14): 1783–1785.
[16] Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., et al. (2011). The variant call format and VCFtools. Bioinformatics 27(15): 2156–2158.
[17] Singh, P. (2016). Big genomic data in bioinformatics cloud. Appli Microbio Open Access 2: 113. doi:10.4172/2471-9315.1000113
[18] Maharjan, M. (2011). Genome analysis with MapReduce. Retrieved 10 30, 2017, from http://www.tcs.hut.fi/Studies/T79.5001/reports/2011-Maharjan.pdf
[19] Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce programming framework to clinical big data analysis: current landscape and future trends. BioData Mining. Retrieved 09 30, 2017, from https://biodatamining.biomedcentral.com/track/pdf/10.1186/1756-0381-7-22
[20] Francesco, E. D., Santo, G. D., Palopoli, L., & Rombo, S. E. (2009). A summary of genomic databases: overview and discussion. Retrieved 09 27, 2017, from http://math.unipa.it/rombo/files/publications/chapter09c_draft.pdf