Evolutionary Bioinformatics
Software or Database Review
SOLiDzipper: A High Speed Encoding Method for the Next-Generation Sequencing Data

Young Jun Jeon1, Sang Hyun Park1, Sung Min Ahn2 and Hee Joung Hwang1

1SDLAB, Gachon University of Medicine and Science, 406-799 Yeonsu-dong, Incheon, Korea. 2Laboratory of Genomics and Genomic Medicine, Lee Gil Ya Cancer and Diabetes Institute, Gachon University of Medicine and Science, Incheon, Korea.

Corresponding author email: [email protected]; [email protected]
Abstract

Background: Next-generation sequencing (NGS) methods pose the computational challenge of handling large volumes of data. Although cloud computing offers a potential solution, transferring a large data set across the internet remains the biggest obstacle, one that may be overcome by efficient encoding methods. When encoding is used to facilitate data transfer to the cloud, the time factor is as important as the encoding efficiency. Moreover, to take advantage of parallel processing in cloud computing, a technique to decode and split compressed data in parallel in the cloud is essential. Hence, in this review we present SOLiDzipper, a new encoding method for NGS data.

Methods: The basic strategy of SOLiDzipper is to divide and encode. NGS data files contain both sequence and non-sequence information, whose encoding efficiencies differ. In SOLiDzipper, encoded data are stored in binary data blocks that do not contain the characteristic information of a specific sequencing platform, which means that data can be decoded according to the desired platform even in the case of Illumina, Solexa or Roche 454 data.

Results: The main computation time using Crossbow was 173 minutes when 40 EC2 nodes were involved. In that case, an analysis preparation time of 464 minutes is required to encode data with a recent DNA compression method such as G-SQZ and transmit it over a 183 Mbit/s bandwidth, whereas encoding and transmitting the data with SOLiDzipper takes 194 minutes under the same bandwidth conditions. These results indicate that the entire processing time can be reduced according to the encoding method used, under the same network bandwidth conditions. Considering the limited network bandwidth, high-speed, high-efficiency encoding methods such as SOLiDzipper can make a significant contribution to higher productivity in labs seeking to take advantage of the cloud as an alternative to local computing.

Availability: http://szipper.dinfree.com. Academic/non-profit: binary available for direct download at no cost. For-profit: submit a request for a for-profit license through the website.

Keywords: bioinformatics, NGS, DNA compression, cloud computing
Evolutionary Bioinformatics 2011:7 1–6. doi: 10.4137/EBO.S6618
Introduction
Next-generation sequencing (NGS) methods, which are revolutionizing genomics research by reducing sequencing cost and increasing its efficiency,1 pose various computational challenges in handling large volumes of short-read data. For example, human genome re-sequencing at ∼30X sequencing depth requires a level of computational power achievable only via large-scale parallelization.2 One potential solution to these computational challenges is the use of cloud computing. Langmead and colleagues3 genotyped data comprising 38-fold coverage of the human genome in ∼4 h on the Amazon cloud (Amazon EC2) using the Crossbow genotyping program. In a recent study, Kudtarkar and colleagues4 computed orthologous relationships for 245,323 genome-to-genome comparisons on the Amazon cloud, cost-effectively, using the genomic tool Roundup. Applied Biosystems provides a cloud computing service to SOLiD system users (ABI SOLiD system) as an alternative to maintaining an in-house computing infrastructure for NGS data analysis (eg, SAMtools5).

Despite the promise and potential of cloud computing, the biggest obstacle to moving to the cloud may be network bandwidth, since it may take at least a week to transfer a 100-gigabyte NGS data file across the internet in a typical research environment.6 The more dramatic the advantages of NGS sequencing or analysis in a cloud environment become, the more apparent its access limitations become. For example, we currently work on the Amazon cloud, where we can control the number of nodes needed for an analysis at a time of our choosing and predict the cost of the analysis. However, we cannot be sure that our chosen transmission time will guarantee optimal bandwidth when transmitting large volumes of NGS data to the cloud, and the usable bandwidth in a lab is a limited resource. Thus, by applying a proper encoding method and using the transmission bandwidth efficiently, we can keep slow data transfer from offsetting the time benefits of running an experiment in the cloud. Furthermore, when adopting an encoding method aimed at cloud transmission, unlike a traditional DNA compression method aimed at efficient storage, the encoding/decoding time and the possibility of parallel and selective decompression must be taken into consideration as well as the compression rate.

Efficient encoding methods may make it feasible to transfer such large datasets. Recently, Tembe and colleagues7 showed that NGS data can be reduced in size by 70%–80% using their algorithm. However, when a large dataset is encoded for transfer, the time required for encoding and decoding is as important as the encoding efficiency. Accordingly, an ideal compression algorithm to be used in combination with cloud computing for sequence data analysis needs the following features: 1) a high encoding/decoding rate; 2) high encoding efficiency; 3) a technique to decode and split compressed data in parallel in the cloud. Here, we present SOLiDzipper, a new encoding method by which we can encode NGS data with high speed and high efficiency. SOLiDzipper is optimized to encode csfasta and QV files from the ABI SOLiD system.
Methods
The basic strategy of SOLiDzipper is to divide and encode. NGS data files contain both sequence and non-sequence information, whose encoding efficiencies differ. In SOLiDzipper, the non-sequence information, including the sequence IDs and numbers in plain-text format, is encoded with a general-purpose compression algorithm (eg, gzip, bzip2, LZMA (LZMA SDK)), whereas the sequence information, consisting of the digits '0123' in csfasta format, which has random patterns and thus a low encoding efficiency, is encoded with bitwise and shift operations. Figures 2 and 3 summarize the encoding process and the encoding methods of SOLiDzipper, respectively. Decoding in SOLiDzipper is essentially the reverse of encoding, except for non-calls. In SOLiDzipper, non-calls ('.' in csfasta files) are converted into temporary binary data when encoded; during decoding, the QV values are used to restore the temporary binary data to the original non-calls. Unlike other general encoding methods, SOLiDzipper does not use a compression dictionary scheme or statistical pattern matching (ie, palindromes, string comparisons, repeat detection, data permutation),7–11 thereby minimizing computing resource requirements and dictionary search time.
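As a concrete illustration of these bitwise and shift operations, the sketch below maps each csfasta color call to a 2-bit code and packs four calls per byte, reproducing the '0113' to 0x17 example of Figure 3b. This is a minimal sketch; the class and method names are ours, not SOLiDzipper's actual source.

```java
// Minimal sketch of 2-bit packing of csfasta color calls, four per byte.
public final class BasePacker {

    // Map one color call character to its 2-bit code ('0' -> 00, ..., '3' -> 11).
    private static int code(char c) {
        if (c < '0' || c > '3') {
            throw new IllegalArgumentException("not a color call: " + c);
        }
        return c - '0';
    }

    // Pack a run of color calls using shift operations, first call in the high bits.
    public static byte[] pack(String colors) {
        byte[] out = new byte[(colors.length() + 3) / 4];
        for (int i = 0; i < colors.length(); i++) {
            int shift = 6 - 2 * (i % 4);  // each call occupies 2 bits
            out[i / 4] |= (byte) (code(colors.charAt(i)) << shift);
        }
        return out;
    }

    public static void main(String[] args) {
        // '0113' -> 00 01 01 11 -> 0x17, matching the example in Figure 3b.
        System.out.printf("0x%02x%n", pack("0113")[0]);
    }
}
```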
[Figure 1. Rt changes according to the data transfer speed and encoding method (curves for uncompressed data, gzip, lzma, G-SQZ and SOLiDzipper; X axis: data transfer speed in megabytes/second, logarithmic scale; Y axis: Rt in hours, linear scale). Notes: When the data transfer speed across the internet exceeds a certain threshold, it offsets the advantages of encoding NGS data. For example, LZMA does not provide any advantage when the transfer speed is 10 megabytes per second. Within the current limitations of data transfer speed, SOLiDzipper shows the best performance among the algorithms compared, providing a definite advantage in Rt.]
For example, G-SQZ7 utilizes the Huffman coding12 method, which must generate a Huffman tree in the course of its highly efficient encoding, and the DNACompress10 program achieves fast and effective encoding through repeat detection.
[Figure 2. Encoding process of SOLiDzipper, shown as a flowchart of four stages: a read block (read data blocks while more remain), a preprocess block (tokenize on delimiters; merge split plaintext lines and split QV or DNA base lines), a main process block (extract plaintext lines; convert each quality value to 1 byte and reallocate 4 quality values into 3 bytes; map ACGT or 0123 to 2 bits and combine 4 bases into 1 byte; compress the plaintext), and a write block (write the encoded QV/csfasta data and the encoded plaintext). Notes: Sequence IDs are extracted from the QV and csfasta files, and the remaining data are bitwise-encoded (A and B). The extracted sequence IDs are combined and compressed using general-purpose compression methods (eg, gzip) (C). Encoded data are stored in a data block (D).]
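To make the divide-and-encode split of Figure 2 concrete, the following minimal sketch routes csfasta ID and comment lines to a gzip stream and sequence lines to the 2-bit packer sketched above. The names, stream framing and non-call handling are illustrative assumptions, not SOLiDzipper's actual format.

```java
// Minimal sketch of the divide-and-encode split from Figure 2.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public final class DivideAndEncode {

    public static void encode(BufferedReader csfasta,
                              OutputStream idSink,
                              OutputStream baseSink) throws IOException {
        try (GZIPOutputStream ids = new GZIPOutputStream(idSink)) {
            String line;
            while ((line = csfasta.readLine()) != null) {
                if (line.startsWith(">") || line.startsWith("#")) {
                    // Non-sequence information (sequence IDs, comments):
                    // general-purpose compression.
                    ids.write((line + "\n").getBytes(StandardCharsets.US_ASCII));
                } else {
                    // Sequence information: drop the leading primer base (eg, 'T').
                    // SOLiDzipper converts non-calls ('.') to temporary binary data
                    // and restores them from the QV file at decode time; this sketch
                    // simply maps them to '0' as a stand-in for that mechanism.
                    String colors = line.substring(1).replace('.', '0');
                    baseSink.write(BasePacker.pack(colors));  // 2-bit packer above
                }
            }
        }
    }
}
```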
Such dictionary search or statistical pattern matching can add substantial time to the encoding process, and this cannot be ignored when processing huge volumes of NGS data. Thus, when transferring data to the cloud for high-performance sequence analysis, it is more effective to complete encoding as quickly as possible, even at the cost of a slightly lower compression rate. SOLiDzipper performs high-speed, high-efficiency encoding at the bitwise level by taking advantage of the characteristic features of NGS data. In SOLiDzipper, encoded data are stored in binary data blocks that do not contain the characteristic information of a specific sequencing platform, which means that data can be decoded according to the desired platform, even in the case of Illumina, Solexa or Roche 454 data.
Implementation
SOLiDzipper is implemented in Java 1.6 and runs in command-line mode on a 64-bit Linux machine (Linux 2.6.29.4-167.fc11.x86_64, Fedora 11 64-bit; Intel(R) Core(TM)2 Duo CPU E8400, 3.00 GHz; 4 GB memory).
[Figure 3. Encoding methods of SOLiDzipper, illustrated with sample csfasta and QV records, the bitwise encoding of bases and quality values, and a hex dump of the resulting data blocks. Notes: a) The quality values in QV files from the ABI SOLiD system range from −1 to 40, which requires 6 bits; the remaining 2 bits of each byte can be used to store part of another quality value. b) Csfasta files carry the sequence information as the four digits '0123', each of which requires 2 bits. If the 1-byte character '0' is mapped to the binary value '00', '1' to '01', '2' to '10' and '3' to '11', the 4-byte string '0113' can be encoded into the single byte 0x17 (00010111) through shift operations. c) Sequence IDs are extracted from the QV and csfasta files, combined, and compressed using general-purpose compression methods. d) Encoded data are stored as data blocks of fixed size, which allows selective decoding.]
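The 6-bit repacking of panel a can be sketched as follows. The paper states only that the −1 to 40 range requires 6 bits; the +1 bias mapping −1..40 onto 0..41 and the high-bits-first order are assumptions of this sketch.

```java
// Sketch of Figure 3a: four quality values, 6 bits each, repacked into 3 bytes.
// The +1 bias and the bit order are our assumptions, not SOLiDzipper's spec.
public final class QvPacker {

    public static byte[] packFour(int q0, int q1, int q2, int q3) {
        int bits = 0;
        for (int q : new int[] {q0, q1, q2, q3}) {
            if (q < -1 || q > 40) {
                throw new IllegalArgumentException("QV out of range: " + q);
            }
            bits = (bits << 6) | (q + 1);  // bias -1..40 into the 6-bit range 0..41
        }
        // 4 x 6 = 24 bits, emitted as 3 bytes.
        return new byte[] {(byte) (bits >>> 16), (byte) (bits >>> 8), (byte) bits};
    }

    public static void main(String[] args) {
        // The first four quality values of the QV example in Figure 3.
        byte[] b = packFour(11, -1, 6, 2);
        System.out.printf("0x%02x 0x%02x 0x%02x%n",
                b[0] & 0xFF, b[1] & 0xFF, b[2] & 0xFF);
    }
}
```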
Table 1 shows the results of comparison experiments using the high-speed option (--fast) of the general-purpose compression tool gzip (version 1.3.12), the highest-efficiency option (-mx=9) of LZMA (version 4.65, LZMA SDK), and G-SQZ (version 0.6).7 133 gigabytes of mate-paired data from the ABI SOLiD 3.5 system were used as the test data set.
Results and Discussion
Encoding efficiency is usually regarded as the most important criterion for judging the performance of encoding algorithms, especially when encoding is used to reduce long-term storage cost. However, when encoding is used in combination with cloud computing, NGS data need to be encoded on the local servers and then decoded in the cloud as quickly as possible (ie, in this case, encoding is used to facilitate transfer, not for long-term storage).
When cloud computing is used for NGS data analysis, the ready-to-job time (Rt) is the sum of the time required to compress the data on the local servers, transfer the compressed data to the cloud, and decompress it in the cloud. Rt increases in proportion to the time required for compression and decompression, thereby offsetting the efficiency advantages of using cloud computing. Rt can be calculated using equation (1) below:

Rt = Encode(NGS data)_t + (size of encoded NGS data / data transfer speed) + (Decode(encoded NGS data)_t / decoding unit count)    (1)

where the _t terms denote the encoding and decoding times, and the decoding unit count is the number of nodes decoding in parallel.
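For illustration, equation (1) is easy to evaluate directly. The sketch below uses the units of Tables 1 and 2 (minutes, gigabytes, megabits per second); the class and method names are ours.

```java
// Evaluating equation (1): a minimal sketch, units as in Tables 1 and 2.
public final class ReadyToJobTime {

    static double rtMinutes(double encodeMin, double encodedSizeGB,
                            double bandwidthMbitPerSec,
                            double decodeMin, int decodingUnitCount) {
        // 1 GB = 8 * 1024 megabits; transfer time converted to minutes.
        double transferMin = encodedSizeGB * 8 * 1024 / bandwidthMbitPerSec / 60;
        return encodeMin + transferMin + decodeMin / decodingUnitCount;
    }

    public static void main(String[] args) {
        // SOLiDzipper on a 300 GB dataset: a 74.1% compression rate (Table 1)
        // leaves ~77.7 GB to send. At 183 Mbit/s with 40 decoding nodes this
        // gives ~194.5 minutes, consistent with the 135 + 57 + 2 = 194 minute
        // preparation time in Table 2 (differences are rounding).
        System.out.printf("Rt = %.1f minutes%n",
                rtMinutes(135, 300 * (1 - 0.741), 183, 62, 40));
    }
}
```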
Table 1. Comparison of encoding efficiencies and times between the different encoding methods (decoding unit count = 1), for 133 GBytes of csfasta and QV files.

Encoding method    | Compression time (minutes) | Decompression time (minutes) | Compression rate
SOLiDzipper        | 61                         | 62                           | 74.1%
gzip (--fast)      | 60                         | 54                           | 64.9%
lzma (-mx=9 ultra) | 3640                       | 72                           | 77.0%
G-SQZ              | 177                        | 571                          | 77.1%
When the time factor is considered, the advantages of using encoding methods are offset once the data transfer speed exceeds a certain threshold (Fig. 1). However, within the current limitations of data transfer, SOLiDzipper is more time-efficient than both gzip (low compression rate, high operation speed) and G-SQZ (high compression rate, low operation speed). In addition, SOLiDzipper does not use a dictionary-based compression scheme, and it generates data blocks of the same length. Since there is no link between the compressed data blocks, encoded data can be distributed for parallel decoding, thereby drastically enhancing the decoding rate in the cloud (see the sketch after Table 2).

The contribution of SOLiDzipper to bioinformatics is to address the cost of high-speed transmission infrastructure, which cannot easily be expanded in a typical lab, by making better use of a cloud-based DNA analysis environment such as Amazon EC2. The objective of SOLiDzipper is not merely to increase encoding speed or compression efficiency, but to minimize the share of preparation time in the entire DNA analysis process so that the analysis environment can move smoothly to the cloud. We divided the entire processing time into two parts: the preparation time, ie, the time required to compress DNA data produced on a sequencing platform, transmit them over a network, and decode them in the cloud; and the main computation time, ie, the time required to carry out the analysis in parallel in the cloud.

Table 2 presents the time required to reach the final analysis results, for different communication bandwidths and data compression methods, based on the whole-genome computation time of Crossbow. In the Amazon cloud environment, the main Crossbow computation on the whole genome took less than 7 hours with 10 workers and less than 3 hours with 40 workers. In addition, it took more than an hour to transmit the compressed data set (103 GigaBytes, at a 183 Megabit/second transfer speed) to Amazon S3. Where the transmission bandwidth is limited, the data compression time must also be considered, and this raises an important issue: preparing the analysis can take longer than the analysis itself in the cloud. For example, suppose a data set of about 300 GB is compressed with G-SQZ, which has a high compression rate, transmitted over a 45 Megabit/s bandwidth, and decompressed in parallel on 40 nodes for Crossbow analysis. In that case, preparing the operation takes roughly three times as long as the actual computation in the cloud.
Table 2. Comparison of the total processing time in the cloud-based NGS dataset computation. All processing times are in minutes.

Encoding method | Transfer speed (Megabit/s) | EC2 nodes (workers) | Encoding | Transfer | Parallel decoding | Crossbow computation | Total
Gzip            | 45  | 10 | 137 | 306 | 5  | 390 | 838
Gzip            | 45  | 40 | 137 | 306 | 1  | 173 | 617
Gzip            | 183 | 10 | 137 | 77  | 5  | 390 | 609
Gzip            | 183 | 40 | 137 | 77  | 1  | 173 | 388
G-SQZ           | 45  | 10 | 399 | 200 | 57 | 390 | 1047
G-SQZ           | 45  | 40 | 399 | 200 | 14 | 173 | 787
G-SQZ           | 183 | 10 | 399 | 50  | 57 | 390 | 896
G-SQZ           | 183 | 40 | 399 | 50  | 14 | 173 | 637
SOLiDzipper     | 45  | 10 | 135 | 226 | 6  | 390 | 758
SOLiDzipper     | 45  | 40 | 135 | 226 | 2  | 173 | 536
SOLiDzipper     | 183 | 10 | 135 | 57  | 6  | 390 | 588
SOLiDzipper     | 183 | 40 | 135 | 57  | 2  | 173 | 367
Notes: Encoding time was calculated by assuming a 300 GigaByte dataset and applying the encoding times and compression rates from Table 1. The parallel decoding time was obtained by dividing the decoding time from Table 1 by the number of EC2 nodes, on the assumption that the decoding operation is performed in the cloud. A transfer speed of 183 Megabit/second was measured during uploading in the Crossbow computation; 1/4 of 183 Mbit/s (about 45 Mbit/s) was also used in the calculations, analogous to the use of 1/4 of the 40 workers in the Crossbow computation.
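The parallel decoding column of Table 2 rests on the independence of the fixed-size blocks: because no block references another, each can be decoded on its own thread or its own cloud node. Below is a minimal thread-pool sketch under that assumption; the block size and the decodeBlock body are placeholders, not SOLiDzipper's actual block format.

```java
// Sketch of parallel decoding of independent fixed-size blocks.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public final class ParallelDecoder {

    static final int BLOCK_SIZE = 64 * 1024;  // hypothetical fixed block length

    // Placeholder for the real block decoder (the reverse of the bitwise encoding).
    static byte[] decodeBlock(byte[] block) { return block; }

    public static List<byte[]> decode(byte[] encoded, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (int off = 0; off < encoded.length; off += BLOCK_SIZE) {
                final int start = off;
                final int end = Math.min(off + BLOCK_SIZE, encoded.length);
                // Blocks are independent, so they may be decoded in any order.
                futures.add(pool.submit(() ->
                        decodeBlock(java.util.Arrays.copyOfRange(encoded, start, end))));
            }
            List<byte[]> out = new ArrayList<>();
            for (Future<byte[]> f : futures) out.add(f.get());  // preserve block order
            return out;
        } finally {
            pool.shutdown();
        }
    }
}
```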
Thus, compression time, not only compression efficiency, should be weighed when planning transmission to a cloud environment. These findings indicate that the entire processing time can be reduced according to the encoding method used, at the same communication bandwidth. Considering the limited network bandwidth, high-speed, high-efficiency encoding methods such as SOLiDzipper can make a significant contribution to higher productivity in labs seeking to take advantage of the cloud as an alternative to a local computing cluster.
Conclusions
The unique features of SOLiDzipper are: 1) it divides the information in csfasta files to achieve high encoding efficiency and speed; 2) it combines two different compression methods (ie, bitwise/shift operations and general-purpose compression), allowing an optimal preparation time (Rt) for cloud computing; 3) data can be decoded selectively, without unzipping the whole encoded file, because encoded data are stored as data blocks of fixed size; 4) in the cloud, encoded data can be distributed for parallel decoding; and 5) it requires minimal computing resources. In summary, SOLiDzipper is a fast encoding method that can efficiently encode and decode NGS data, and it is especially well suited to typical research environments where the data transfer speed across the internet is limited.
Disclosure
This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.
References
1. Metzker ML. Sequencing technologies - the next generation. Nature Reviews Genetics. 2010;11:31–46.
2. Ahn SM, et al. The first Korean genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome Research. 2009;19:1622–9.
3. Langmead B, et al. Searching for SNPs with cloud computing. Genome Biology. 2009;10:R134.
4. Kudtarkar P, et al. Cost-effective cloud computing: a case study using the comparative genomics tool, Roundup. Evolutionary Bioinformatics. 2010;6:197–203.
5. Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9.
6. Stein LD. The case for cloud computing in genome informatics. Genome Biology. 2010;11:207.
7. Tembe W, et al. G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics. 2010;26:2192–4.
8. Adjeroh D, et al. DNA sequence compression using the Burrows-Wheeler transform. Proc IEEE Comput Soc Bioinform. 2002.
9. Brandon M, et al. Data structures and compression algorithms for genomic sequence data. Bioinformatics. 2009;25:1731–8.
10. Chen X, et al. DNACompress: fast and effective DNA sequence compression. Bioinformatics. 2002;18:1696–8.
11. Soliman, et al. A lossless compression algorithm for DNA sequences. Int J Bioinform Res. 2009;5:593–602.
12. Huffman DA. A method for the construction of minimum-redundancy codes. Proceedings of the IRE. 1952;40:1098–102.