
2016 3rd International Conference On Computer And Information Sciences (ICCOINS)

Parallelisation of Maximal Patterns Finding Algorithm in Biological Sequences

Ahmad MohdAziz Hussein, Nuraini Abdul Rashid and Rosni Abdulah
School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia

Abstract: The rapid growth of biological data poses new challenges for scientists, who need new methods to manage, analyse and understand the data effectively. One way of analysing biological data is to look at the maximal patterns that exist in it. Relationships among biological sequences can be discovered from the maximal patterns that the sequences share, and these patterns can also be used to build indexes for faster search. In this research, we used parallel methods to improve the speed of an existing maximal pattern finding algorithm, TEIRESIAS. The algorithm has two phases, scanning and convolution. The first phase detects short patterns in the biological data and the second phase combines the short patterns into longer patterns without sacrificing their meaning. The output is the set of maximal patterns. The first phase of the algorithm is very compute intensive. We improve the overall process of finding maximal patterns by decomposing the biological database and distributing the parts as input to the TEIRESIAS algorithm. We applied the master-slave model and implemented it with OpenMP. Our results show 1.6-fold and 2.0-fold improvements in the overall speed of the algorithm with two and four threads respectively, whereas the performance decreased when we used 8 threads.

Keywords: Teiresias; finding patterns; discovery motifs; pattern matching; finding sequence similarities

I. INTRODUCTION

The genetic codes pertaining to inheritance and hormones are carried in the DNA molecule, which in turn gives rise to proteins. Proteins are the foundation material in the structure of most living things, and they play many important structural and functional roles in all biological systems. A protein can be described at four structural levels: primary, secondary, tertiary and quaternary. The primary structure is the linear arrangement of the amino acid residues of the protein. Advances in protein sequencing technology have led to the rapid growth of primary sequence databases. As these databases grow beyond human comprehension, automating the sequence analysis process becomes crucial. Computer scientists, mathematicians and engineers are concerned with the computation involved in analysing such large data, and several algorithms have been developed for mining large biological databases. One of the most popular data mining approaches is pattern discovery, or pattern finding.

Complex relations between protein sequences can be revealed by analysing the patterns that these sequences share. Such patterns can be used to represent a group of closely related protein sequences, called a family. The analysis of proteins is itself a major issue in biology [12], [16]. Providing a fast and scalable algorithm for discovering patterns in large databases has become crucial as database sizes grow rapidly. The SWISS-PROT database, for example, grew from 550116 entries in 2015 to 550299 in 2016.

This paper discusses the parallel design of the TEIRESIAS algorithm, the data partitioning method and the experimental environment. The paper is organized as follows. Section 2 introduces the state of the art and the background of previously proposed algorithms. Section 3 presents the parallel design of the TEIRESIAS algorithm in two parts: the parallel data partitioning and the design of the parallel algorithm. Section 4 discusses the analysis and the results. Finally, Section 5 gives the conclusions and future work.

II. RELATED WORK

Many algorithms have been proposed to solve the pattern finding problem. They include the Gibbs Motif Sampler, Delphi, MEME, TOP-DOWN motif discovery, HMM, PROSITE, PRATT2 and TEIRESIAS. These algorithms fall into two main groups: those that use multiple sequence alignment (MSA) and those that do not.

The Gibbs sampling algorithm was introduced by Lawrence et al. (1993) [8], [23]. It consists of a local MSA process designed to align n protein sequences and to find specific patterns/motifs of a fixed length W in all of the sequences, and it is very similar to the expectation maximization algorithm. It is guaranteed to find the best motif if run indefinitely, and it is useful for locating motifs and for determining possible binding sites for transcription factors in closely related species [1].

978-1-5090-2549-7/16/$31.00 ©2016 IEEE


Floratos et al. (2001) introduced the Delphi algorithm, a recent computational tool for identifying similarities between protein query sequences. Its main idea is the use of patterns, which are regular expressions that represent families of polypeptides. A polypeptide family is represented by one pattern, which is likely to include extensions of related amino acids [7], [18].

TABLE 1. SUMMARY OF PATTERN FINDING/DISCOVERY ALGORITHMS

MEME (Multiple EM for Motif Elicitation) is one of the widely used tools for searching for novel patterns in sets of biological sequences [3], [19]. It was developed by Bailey and Elkan (1994) [2], [19]. MEME works by searching for repeated, ungapped sequence patterns that appear in the protein sequences supplied by the user. The Gibbs Sampler and MEME are the two most commonly used motif discovery algorithms and are often compared with each other; MEME is reported to perform better than the Gibbs Sampler [4].

TOP-DOWN motif discovery was developed by Martinez et al. (1988) [6] as a solution that enhances motif discovery by taking advantage of genetic algorithms and data mining while reducing the cost of using MSA. These mining methods follow a bottom-up strategy, first finding shorter frequent items and gradually working towards the largest frequent item set [4], [20].

The Hidden Markov Model (HMM) has been applied successfully to many kinds of online recognition problems, including handwritten character recognition, speech recognition, emotion recognition in text and speech analysis [11], [21], [22].

The PROSITE algorithm is used to identify conserved motifs, for example in a translated polypeptide that is homologous to reported proteins [14]. The sequence of an unknown protein may be too distantly related to any protein of known structure for the resemblance to be detected by aligning the entire sequences. Nonetheless, the protein can be identified by the occurrence of a particular cluster of residue types within its sequence. Such a cluster is known as a signature, pattern, fingerprint or motif [5].

Jonassen (1995) introduced the PRATT2 algorithm for identifying patterns in biological sequences without the need to align them [9], [10]. PRATT2 looks for patterns that are conserved in at least k of the input sequences. It uses a part of the sequences to create a pattern graph, from which it then attempts to extract patterns [15].

In our study, we use the TEIRESIAS algorithm, developed by Rigoutsos and Floratos (1998), for pattern discovery in biological sequences. This algorithm is able to discover and report all existing patterns in a set of input sequences. TEIRESIAS works in two phases, scanning and convolution. In the scanning phase, the sequences in the input set are scanned and all elementary patterns with at least the minimum support are located; the length of each elementary pattern lies between L and W.
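The scanning phase described above can be illustrated with a brute-force sketch. This is not the authors' implementation (which was written in C) and not the optimised procedure of [13]; it assumes that an elementary pattern starts and ends with a residue, contains exactly L residues, spans at most W positions, and that support is counted as the number of distinct sequences containing the pattern. The function name `scan_elementary_patterns` and the use of `.` as a wildcard are illustrative choices.

```python
from collections import defaultdict
from itertools import combinations


def scan_elementary_patterns(sequences, L, W, K):
    """Return {pattern: offset list} for elementary patterns with support >= K.

    Assumptions (illustrative only): a pattern has exactly L residues, begins
    and ends with a residue, spans at most W positions, and '.' is a wildcard.
    Support is the number of distinct sequences in which the pattern occurs.
    """
    if L < 2 or W < L:
        raise ValueError("require 2 <= L <= W")
    offsets = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq)):
            longest = min(W, len(seq) - start)
            for width in range(L, longest + 1):
                window = seq[start:start + width]
                # The first and last positions are always residues, so we
                # choose the remaining L - 2 residues among the inner positions.
                for inner in combinations(range(1, width - 1), L - 2):
                    keep = {0, width - 1, *inner}
                    pattern = "".join(
                        c if i in keep else "." for i, c in enumerate(window)
                    )
                    offsets[pattern].append((seq_id, start))
    return {
        pattern: occ
        for pattern, occ in offsets.items()
        if len({seq_id for seq_id, _ in occ}) >= K
    }


if __name__ == "__main__":
    found = scan_elementary_patterns(["ABCD", "ABXD", "QABCD"], L=3, W=4, K=2)
    for pattern in sorted(found):
        print(pattern, found[pattern])
```

Each surviving pattern keeps its offset list of (sequence index, start position) pairs, which is the information a later combination step would need. The exhaustive enumeration is only practical for toy inputs.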


In the convolution phase, maximal patterns are constructed from the elementary patterns in a time- and space-efficient way.

TEIRESIAS begins by scanning the sequences of the input set S and locating all elementary patterns that have support of at least k. An elementary pattern contains exactly L residues, and each elementary pattern has an associated offset list. The result of the scanning phase is a set containing all elementary patterns that satisfy the minimum support requirement. The convolution phase then reconstructs an original pattern P by combining pairs of intermediate patterns A and B in which a suffix of A matches a prefix of B. Using convolution as the main tool for reconstructing the original patterns is far more efficient than the naive approach. In this study, we discuss TEIRESIAS, an algorithm for the discovery of patterns in unaligned biological sequences [13]. Table 1 summarises the pattern finding algorithms reviewed in this section.

III. PARALLEL DESIGN OF TEIRESIAS ALGORITHM

A. Parallel Data Partitioning

The input data for the pattern finding process are protein sequences. We used input sequences in FASTA format, downloaded from the SWISS-PROT database onto the cluster. To partition these protein sequences, the set of sequences is divided into subsets according to the number of available processors, and the results obtained from the subsets are then merged, as shown in Fig. 1.
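The partition-and-merge scheme described above can be sketched as follows. This is a structural illustration only: the authors' code is in C with OpenMP, whereas this sketch uses a Python thread pool, and the `worker` function is a hypothetical stand-in that counts 3-residue substrings rather than running TEIRESIAS. The names `read_fasta`, `partition`, `worker` and `master` are all assumptions introduced here.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def partition(records, n_workers):
    """Split the records into n_workers subsets of near-equal size."""
    return [records[i::n_workers] for i in range(n_workers)]


def worker(subset, k=3):
    """Hypothetical stand-in for TEIRESIAS: count, for every substring of
    length k, the number of sequences in the subset that contain it."""
    support = Counter()
    for _, seq in subset:
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return support


def master(fasta_text, n_workers):
    """Partition the input, run one worker per subset, merge the results."""
    subsets = partition(read_fasta(fasta_text), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(worker, subsets))
    merged = Counter()
    for result in partial_results:
        merged.update(result)  # adds the per-subset supports together
    return merged


if __name__ == "__main__":
    sample = ">a\nELRLRY\n>b\nCELRLK\n>c\nAPPRE\n"
    print(master(sample, n_workers=2).most_common(2))
```

Because each sequence is assigned to exactly one subset, summing the per-subset counts gives the same support as a single pass over the whole input. A real merge of TEIRESIAS output would need more care, since a pattern below the support threshold within every subset could still meet it globally.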

[Figure 1: the master partitions the FASTA input (for example, the TVFV2E envelope protein sequences) into subsets and sends one subset to each of slave1, slave2, slave3, ..., slaven; every slave runs TEIRESIAS on its subset and returns its result to the master.]

Fig. 1. Partitioning the data and distributing it to the nodes using the Master-Slave model


B. Design of Parallel Algorithm

The paradigm illustrated in Fig. 1, which we use to implement our parallel system, is the master/slave paradigm running on N nodes: one node is the master and the others are slaves. The data sets are decomposed among the nodes. In this section we discuss the parallel methods applied to the TEIRESIAS algorithm. Current parallel computing technology makes it possible to achieve better computing time at lower cost. Our emphasis is on a low-cost cluster of workstations that still produces results helpful to biologists running protein sequence analysis. Each node searches the group of protein sequences assigned to it, finds the frequency of the sequences that contain elementary patterns in the same family, and then processes these patterns in the convolution phase. The output is the set of maximal patterns.

IV. RESULTS AND DISCUSSIONS

We evaluated the research in two stages. The first stage was the sequential evaluation: we ran the program with L=12 and K=10 and with W set to 15, 20, 25 and 30, on databases containing 10 sequences. The second stage was the parallel evaluation, which used 2, 4 and 8 processors with L=12, W=15 and K=10. The aim of our experiments is to evaluate the results in terms of speedup and efficiency, which measure the performance of the parallelization.

A. Sequential evaluation

In this test we set L=12 and K=10 and varied W over 15, 20, 25 and 30. The results are shown in Fig. 2, where the x axis gives the value of W and the y axis gives the running time. We chose a high value of L to demonstrate that the running time is high when L is large; W must likewise be large when L is large.

In general, the running time increases as the elementary pattern size indicated by W increases, because more time is needed for processing. The main reason is that the computation time is determined by the number and size of the elementary patterns: finding short patterns takes less time than finding longer ones.
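One way to see why the running time grows with W is to count how many elementary pattern shapes are possible. The sketch below assumes that an elementary pattern begins and ends with a residue, has exactly L residues and spans at most W positions; under that assumption it counts the possible arrangements of residue and wildcard positions for the parameter values used in this section. The function name `count_templates` is hypothetical, and the counts say nothing about the measured running times themselves.

```python
from math import comb


def count_templates(L, W):
    """Number of residue/wildcard layouts with exactly L residues, a residue
    at both ends, and a total width between L and W (assumed definition)."""
    if L < 2 or W < L:
        raise ValueError("require 2 <= L <= W")
    # For a given width, the two end positions are fixed as residues and the
    # remaining L - 2 residues are placed among the width - 2 inner positions.
    return sum(comb(width - 2, L - 2) for width in range(L, W + 1))


if __name__ == "__main__":
    for W in (15, 20, 25, 30):
        print(W, count_templates(12, W))
```

With L=12 this gives 364 layouts for W=15 and 75582 for W=20, so the space that the scanning phase may have to examine at every sequence position grows very quickly with W.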

TABLE 2. EXECUTION TIME/MINUTE FOR SEQUENTIAL AND PARALLEL TEIRESIAS USING OPENMP

Note: L=12, W=15 and K=10 are fixed, and the number of processors varies.

B. Parallel Implementation Part

The results show that the parallel run time is less than the sequential run time, as shown in Table 2, which lists the running times of the sequential and parallel runs. We used large values of W and L because the sequential algorithm takes a long time to find the elementary patterns. The run time decreases as the number of threads increases. However, when the number of threads reaches 8, the run time increases because the overhead between the master and the slaves becomes high. The computation time also increased when we increased the number of sequences.

B.1. Speed Up

Fig. 3 shows the speedup of the parallel version of the TEIRESIAS algorithm when it is run on 2, 4 and 8 threads. With 2 or 4 threads the speedup is acceptable, but with 8 threads the speedup is at its worst because of the high communication overhead between the master and the slaves.
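For reference, speedup is conventionally defined as the ratio of the sequential run time to the parallel run time. The symbols below are introduced here for illustration and do not appear in the original text; it is assumed that this standard definition underlies the values plotted in Fig. 3.

```latex
S_p = \frac{T_1}{T_p}
```

Here $T_1$ is the run time of the sequential program and $T_p$ is the run time on $p$ threads. The ideal value is $S_p = p$; a speedup that falls when $p$ goes from 4 to 8 means that the added overhead outweighs the extra computing capacity.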

Fig. 3. Speedup of parallel TEIRESIAS using OpenMP

Fig. 2. Execution time for sequential TEIRESIAS with L=12, K=10 and varying W


Fig. 4 shows the average speedup of the parallel TEIRESIAS algorithm for inputs of 10, 15, 20 and 25 sequences. In the figure, number 1 denotes the average speedup for 10 sequences over 2, 4 and 8 threads, number 2 denotes the average speedup for 15 sequences over 2, 4 and 8 threads, and so on. The average speedup increases as the number of sequences increases.


Fig. 4. Average speedup of parallel TEIRESIAS using OpenMP

B.2. Efficiency

Table 3 shows the efficiency of the parallel TEIRESIAS algorithm using OpenMP. The efficiency is high, and acceptable, with two threads. With four threads the efficiency decreases to about half of that obtained with two threads, and with 8 threads it is worse still because the speedup in that case is minimal. The best possible efficiency is 1.

TABLE 3. EFFICIENCY TABLE FOR PARALLEL TEIRESIAS USING OPENMP

Fig. 5 shows the efficiency for each number of threads. Each number on the figure has three points, from top to bottom. Number 1 corresponds to 10 sequences: its first, second and third points show the results of executing 10 sequences with 2, 4 and 8 threads respectively. Number 2 corresponds to 15 sequences, with its three points again showing the results for 2, 4 and 8 threads. Likewise, numbers 3 and 4 represent 20 and 25 sequences respectively. As the number of sequences increases, the efficiency increases at most points. The results show that with four threads the efficiency of processor utilization increases as the number of sequences increases. This is because all four cores are on one CPU: the four cores share the same cache, which minimizes communication overhead.
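As a worked illustration of the efficiency measure, the sketch below assumes the standard definition, speedup divided by the number of threads, which is consistent with the statement that the best efficiency is 1. The speedups used are the overall figures quoted in the abstract (1.6 for two threads, 2.0 for four); they are not the per-dataset values of Table 3, so the output is indicative only. The function name `efficiency` is an illustrative choice.

```python
def efficiency(speedup, threads):
    """Parallel efficiency under the standard definition: speedup / threads."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return speedup / threads


if __name__ == "__main__":
    # Overall speedups quoted in the abstract for two and four threads.
    reported = {2: 1.6, 4: 2.0}
    for threads, speedup in reported.items():
        print(f"{threads} threads: efficiency {efficiency(speedup, threads):.2f}")
```

Under these assumptions the efficiency is 0.80 with two threads and 0.50 with four, which matches the qualitative trend described in this section: adding threads raises the speedup more slowly than it raises the thread count.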

Fig. 5. Efficiency of parallel TEIRESIAS using OpenMP

V. CONCLUSIONS AND FUTURE WORK

Pattern or motif discovery is widely used as a means of ‘understanding’ large volumes of data such as DNA or protein sequences. The basic assumption is that highly repetitious patterns are a potential source of information. Unique patterns, sometimes referred to as maximal patterns, can be used as indexes for database search. This can speed up the search because the search space is smaller than the original database. Providing a fast and scalable algorithm for discovering patterns in large databases has become crucial as database sizes grow rapidly. In this work we chose the pattern matching algorithm TEIRESIAS because it is very compute intensive; its known computation time is given in [13]. Advances in parallel computing, together with current multicore machines, encouraged us to apply this technology to improving existing sequential algorithms. The technology has motivated computer scientists to revisit existing sequential pattern finding algorithms and modify them so that they can be executed on large datasets in an acceptable time.

We implemented the parallel algorithm on the Khawarizmi multicore cluster available at the PDCC lab at the School of Computer Sciences, USM. The code was written in C, with OpenMP as the parallel library. The cluster consists of 2 processors, each of which is a quad core. Our results show that the average speedup of the algorithm is acceptable for 2 and 4 processors but decreases for 8 processors. The result is encouraging and can be used as a basis for future experiments.

ACKNOWLEDGMENT

This research is supported by Universiti Sains Malaysia and has been funded by the Ministry of Higher Education Malaysia via the Fundamental Research Grant Scheme (FRGS) project titled “Image Segmentation Using Hybrid Fuzzy C-Means Clustering and Watershed Regions-Based Segmentation Algorithms on General-Purpose Graphics Processing Units” (203/PKOMP/6711425). We would like to thank the Bioinformatics group at the PDCC lab of Universiti Sains Malaysia (USM) for their cooperation and technical support.


REFERENCES

[1] G. Thijs et al., "A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes," Journ. Comp. Biol., vol. 9, no. 2, pp. 447-464, 2002.
[2] T.L. Bailey and C. Elkan, "Discovering novel sequence motifs with MEME," Curr. Prot. Bioinf. [Online], 2006. California [Accessed Jan. 24, 2016]. Available from World Wide Web: .
[3] T.L. Bailey, N. Williams, C. Misleh and W. Li, "MEME: discovering and analyzing DNA and protein sequence motifs," Nucleic Acids Research, vol. 34, pp. 369-373, 2006.
[4] U.B. Baloglu and M. Kaya, "Topdown motif discovery in biological sequence datasets by genetic algorithm," IEEE Intern. Confe. Hybr. Infor. Tech. (ICHIT'06), Washington, DC, USA, IEEE Computer Society, vol. 2, pp. 103-107, 2006.
[5] P. Bork and M. Sudol, "The PROSITE database of protein domains, families and functional sites," User Manual [Online], 1998, S.-P. Ed. 20 ed. Geneva [Accessed Jan. 30, 2016]. Available from World Wide Web:
[6] M. Ester and X. Zhang, "A top-down method for mining most specific frequent patterns in biological sequence data," Proc. SIAM Int. Conf. on Data Mining (SDM'2004), pp. 90-99, 2004.
[7] A. Floratos, I. Rigoutsos, L. Parida and Y. Gao, "DELPHI: A pattern-based method for detecting sequence similarity," IBM J. Res. & Dev., vol. 45, pp. 455-473, May 2001.
[8] H. Ishwaran and L.F. James, "Gibbs sampling methods for stick-breaking priors," American Statistical Association, vol. 96, pp. 453-466, 2001.
[9] I. Jonassen, "Efficient discovery of conserved patterns using a pattern graph," Comp. Appl. Biosci., vol. 13, pp. 509-522, 1997.
[10] I. Jonassen, F. Collins and D.G. Higgins, "Finding flexible patterns in unaligned protein sequences," Protein Sci., vol. 4, pp. 1587-1595, 1995.
[11] Y. Mitobe, H. Miyao and M. Maruyama, "A fast HMM algorithm based on stroke lengths for on-line recognition of handwritten music scores," Proceedings of the 9th Int'l Workshop on Frontiers in Handwriting Recognition (IWFHR-9), Japan, IEEE, pp. 521-526, 2004.
[12] K. Ning, H.K. Ng and H.W. Leong, "Finding patterns in biological sequences by longest common subsequences and shortest common supersequences," IEEE Comp. Soci., Washington, pp. 53-60, Oct. 2006.
[13] I. Rigoutsos and A. Floratos, "Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm," Bioinformatics, vol. 14, pp. 55-67, 1998.
[14] E.J. Tarcha, V. Basrur, C.-Y. Hung, M.J. Gardner and G.T. Cole, "A recombinant aspartyl protease of Coccidioides posadasii induces protection against pulmonary coccidioidomycosis in mice," Infect. Immun., vol. 74, pp. 516-527, 2006.
[15] K. Ye, W.A. Kosters and A.P. IJzerman, "An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences," Oxford, vol. 23, pp. 687-693, 2007.
[16] L. David, A. Miguel and R-L. Alvaro, "Finding patterns in protein sequences by using a hybrid multiobjective teaching learning based optimization algorithm," IEEE/ACM Trans. Comput. Biol. Bioinform., vol. 12, no. 3, pp. 656-666, May/June 2015.
[17] C. Vens, M-N. Rosso and E.G.J. Danchin, "Identifying discriminative classification-based motifs in biological sequences," vol. 27, no. 9, pp. 1231-1238, Mar. 2011.
[18] T.-H. Tsai and S.-Y. Lee, "SmSearcher: A local similarity search engine for biological sequence databases," IEEE Fif. Intern. Sympo. Multi. Soft. Eng., no. 3, pp. 305-312, 2003.
[19] D. Quang and X. Xie, "EXTREME: an online EM algorithm for motif discovery," Oxford University Press, vol. 30, no. 12, pp. 1667-1673, Feb. 2014.
[20] H. Liu, X. Wang, J. He, J. Han, D. Xin and Z. Shao, "Top-down mining of frequent closed patterns from very high dimensional data," Information Sciences, vol. 179, no. 7, pp. 899-924, Mar. 2009.
[21] C. Quan and F. Ren, "Weighted high-order hidden Markov models for compound emotions recognition in text," Information Sciences, vol. 329, pp. 581-596, Jan. 2016.
[22] R. Takhanov and V. Kolmogorov, "Inference algorithms for pattern-based CRFs on sequence data," ICML Intern. Conf. Machine Learning, vol. 28, 2013.
[23] D.B. Woodard and J.S. Rosenthal, "Convergence rate of Markov chain methods for genomic motif discovery," Institute of Mathematical Statistics, vol. 41, no. 1, pp. 91-124, Mar. 2013.
