Journal of Circuits, Systems, and Computers  World Scientific Publishing Company

PARALLEL PPI PREDICTION PERFORMANCE STUDY ON HPC PLATFORMS

Ali A. El-Moursy
Electrical and Computer Eng., University of Sharjah, Sharjah, UAE*
[email protected]

Wael S. Afifi
Computer and Systems, Electronics Research Institute, Cairo, Egypt
[email protected]

Fadi N. Sibai
Computer Operations Dept., Saudi Aramco, Dhahran, Saudi Arabia
[email protected]

Salwa M. Nassar
Computer and Systems, Electronics Research Institute, Cairo, Egypt
[email protected]

Received (Day Month Year)
Revised (Day Month Year)
Accepted (Day Month Year)

STRIKE is an algorithm that predicts protein-protein interactions (PPIs) by determining that two proteins interact if they contain similar substrings of amino acids. Unlike other PPI prediction methods, STRIKE achieves a reasonable improvement over the existing approaches. Despite its high accuracy as a PPI prediction method, however, STRIKE has a long execution time and is therefore considered a compute-intensive application. In this paper, we develop and implement a parallel STRIKE algorithm for high-performance computing systems. Using a large-scale cluster, the execution time of the parallel implementation of this bioinformatics algorithm was reduced from about a week on a serial uniprocessor machine to about 16.5 hours on 16 computing nodes, and down to about 2 hours on 128 parallel nodes. Communication overheads between nodes are thoroughly studied.

Keywords: protein-protein interaction; parallel computing; performance analysis; cluster computing; protein classification.

1. Introduction

Proteins are crucial for almost all functions in the cell, including metabolic cycles, DNA transcription and replication, and signaling cascades. They rarely perform their functions alone; instead, they cooperate with other proteins by forming a huge network of protein-protein interactions (PPIs). In cells, protein-protein interaction (PPI) is responsible for the
majority of cellular functions and controls biological processes. To decode the mechanisms of diseases, investigators attempt to predict PPIs. Over the past decades, many innovative techniques for studying PPIs have been developed to identify the functional relationships between proteins. However, these experimental techniques are significantly time consuming and mathematically complex to program. Accordingly, there is a growing need for computational tools capable of simplifying and speeding up PPI identification.

Researchers have developed many impressive computational techniques to predict PPIs. Each of these techniques has its own advantages and disadvantages, especially with regard to the sensitivity and specificity of the method. Some of the state-of-the-art techniques have employed domain knowledge to predict PPIs, such as the Association Method (AM) 1, Maximum Likelihood Estimation (MLE) 2, Maximum Specificity Set Cover (MSSC) 3 and the Domain-based Random Forest 4. The idea behind the domain-based methods is that molecular interactions are typically mediated by a great variety of interacting domains. Other techniques are based on the assumption that some of the interactions between proteins are mediated by a finite number of short polypeptide sequences. PIPE (Protein-Protein Interaction Prediction Engine) 5 is the best-known method representing this assumption. One advantage of the techniques mentioned above is the identification of short polypeptide sequences that are typically shorter than the classical domains and are used repeatedly in different proteins and contexts within the cell. However, identifying domains or short polypeptide sequences is a long and computationally intensive process. Domain-based methods and those depending on short polypeptide sequences rely on the domain information of the protein partners. Hence, their accuracy and reliability are negatively affected, and they remain unpopular techniques for PPI prediction.

A novel algorithm termed STRIKE 22 was also introduced to predict PPIs. It is based on the "String Kernel" (SK) approach, which has been shown to achieve good performance on text categorization tasks 7 and protein sequence classification 8. The basic idea of this approach is to compare two protein sequences by looking at common subsequences of fixed length. The string kernel is built on the kernel method introduced in 9 and 10. The kernel computes similarity scores between protein sequences without ever explicitly extracting the features. A subsequence is any ordered sequence of amino acids occurring in the protein sequence, where the amino acids are not necessarily contiguous. The subsequences are weighted by an exponentially decaying factor of their full length in the sequence, hence emphasizing occurrences that are more contiguous. We understand that subsequence similarity between two proteins may not necessarily indicate interaction; however, it indicates a high probability that an interaction occurs. Subsequence similarity also helps in inferring homology, and homologous sequences usually have the same or very similar structural relationships.

Multi-core/many-core 11,12 computing has emerged as a strong force to be reckoned with. Today, products from various hardware vendors pack hundreds of processing cores per chip and provide hundreds of Gflops of performance per Watt. Software development
kits such as the Stream SDK and languages such as CUDA 13 and OpenCL 14 support the development of HPC code on many-core GPUs. Other products, such as those from Sun Microsystems 15, also include multiple simple multithreaded cores to achieve high throughput in data centers. In 6, we developed a parallel STRIKE algorithm based on the multithreading technique and implemented this version on multicore systems.

The Message Passing Interface (MPI) is a standard library based on the consensus of the MPI Forum. The goal of MPI is to establish a portable, efficient, and flexible standard for message passing to be widely used for writing message-passing programs for distributed systems. As such, MPI is the first standardized, vendor-independent message-passing library. The advantages of developing message-passing software using MPI closely match the design goals of portability, efficiency, and flexibility. MPI is not an IEEE or ISO standard but has, in fact, become the "industry standard" for writing message-passing programs on HPC platforms. Message-passing massively parallel multicomputers 16 have existed for several decades. Such massive computers employ message passing for communication between compute nodes because cache coherency cannot scale to such a large number of nodes with relatively large inter-node distances 17. Multicomputers by Cray 18 and IBM 19, such as the IBM BlueGene/Q 20, and the world's top supercomputers fit this category. The performance of such systems depends more on the high number of nodes than on the computational might of each node.

Many studies have been conducted to optimize scientific applications/kernels on today's HPC platforms and to compare the performance of those applications on those platforms. The authors of 22 proposed a new technique termed "STRIKE" to predict PPIs based on the "SK" method. We also implemented our novel technique on multicore systems using multithreading. The study in this paper compares the MPI implementation of STRIKE to the multithreaded implementation of the application done in 6. The focus of the research in this paper is to extend the parallel STRIKE proposed in 6 to handle even larger data sets, with scalability in mind, rather than to show the performance and accuracy of STRIKE itself. In this paper, we develop an MPI-based version of the STRIKE algorithm that is implemented and tested on two HPC systems.

Section 2 presents a short survey of related work. Section 3 explains the serial protein sequence decomposition and matching algorithm. Section 4 formally presents the parallel sequence matching algorithm and its complexity analysis, and Sections 5 and 6 discuss the multithreaded and MPI implementations, respectively, of the application on multiple nodes. Section 7 describes the HPC platforms used in our experiments. Section 8 presents the performance analysis on both computer cluster systems. Section 9 concludes the paper.

2. Related Work

Protein-protein interactions (PPIs) are crucial for almost all cellular processes, including metabolic cycles, DNA transcription and replication, and signaling cascades. Unfortunately, the experimental methods for identifying PPIs are both time-consuming
and expensive. Researchers have developed many computational approaches for predicting PPIs: some of them depend on domain knowledge (e.g., AM 1, MLE 2, MSSC 3 and DRF 4), other methods depend on short polypeptide sequences (e.g., PIPE 5), and more recent methods use only the information of the protein sequences (e.g., STRIKE 6). The best way to compare the performance of these methods is to examine the sensitivity and specificity of each method using the same source of data. When the protein domain family database Pfam 21 is used on yeast protein interaction pairs in domain-based methods to predict interactions between proteins, they yield about 39% specificity and 79.7% sensitivity 8. These results are superior to those of PIPE. PIPE's performance resulted in 89% specificity, 61% sensitivity, and 75% overall accuracy when tested to detect yeast protein interaction pairs 11. STRIKE was able to achieve 83.1% precision (i.e., specificity), 98% recall (i.e., sensitivity) and 89% accuracy. STRIKE has two further advantages when compared to PIPE. First, the PIPE method is computationally intensive: the evaluation of PIPE's performance over the same dataset took around 1,000 hours of computation time, compared to only 54.9 minutes using STRIKE. Second, as indicated by the PIPE authors, their method is expected to be weak if it is used for the detection of novel interactions among genome-wide large-scale datasets, which is not the case for STRIKE 22. It is clear that STRIKE achieved the highest performance results among all the algorithms mentioned above.

A parallel PIPE (Massively-Parallel PIPE, MP-PIPE) is proposed in 23 for large-scale, high-throughput protein interaction prediction. MP-PIPE is shown to be more scalable than PIPE. However, MP-PIPE used the same PIPE database, which is carefully constructed to avoid false data and stores only protein interactions that have been independently verified by multiple experiments. Although they achieved good scalability, the aforementioned comparison between STRIKE and PIPE motivates us to focus on STRIKE parallelization in this research.

Another novel algorithm proposed in 24 also depends on the information of protein sequences only. It is based on a learning algorithm called the "Extreme Learning Machine (ELM)" combined with a novel representation of local protein sequence descriptors. The local descriptors account for the interactions between residues in both continuous and discontinuous regions of a protein sequence, and thus this method enables researchers to extract more PPI information from the protein sequences. When applied to the PPI data of Pfam, the proposed method achieved 89.09% prediction accuracy with 89.25% sensitivity at a precision of 88.96%. These results did not achieve a noticeable improvement in performance compared to STRIKE. Hence, we focus on parallelizing the STRIKE algorithm, which has proven its superiority compared to other algorithms, as the literature demonstrates 21,8,11,22,24.

We are also encouraged by the success of a recently published work employing pair-wise alignment as a way to extract meaningful features to predict PPIs. The PPI based on Pairwise Similarity (PPI-PS) method consists of a representation of each protein sequence by a vector of pair-wise similarities against large subsequences of amino acids. Those subsequences are created by a sliding window which passes over concatenated protein
training sequences. Each coordinate of the vector is typically the E-value of the Smith-Waterman score 25. One major drawback of PPI-PS is that each protein is represented by computing the Smith-Waterman score against a large subsequence, and those subsequences are created by concatenating protein training sequences. However, comparing short sequences to very long ones may end up missing some valuable alignments. The SK, however, tackles this weakness by capturing any match or mismatch present in the protein sequence of interest.

Another algorithm is used in 26 for multiple string matching. This algorithm, called the Aho-Corasick algorithm, consists of the following two steps: (1) constructing a string-matching machine (a finite state machine) for a given set of patterns, and (2) processing the input string using the string-matching machine in a single pass. The algorithm is shown to be scalable when parallelized on multicores using multithreading, but one disadvantage of this algorithm is that the time consumed in creating the state machine noticeably affects the total time taken for pattern matching.

The authors in 27 analyze the performance and characteristics of another bioinformatics application, Multiple Sequence Alignment (MSA), using parallel approaches. MSA can be seen as a generalization of pair-wise sequence alignment, as it allows comparison of multiple sequences by simultaneously aligning a set of these sequences. The authors proposed implementing this approach on multiprocessors using multithreading or MPI, but they did not actually implement their proposal. The authors in 28 used another MSA application, ClustalW 29, which is widely used to perform multiple sequence alignment; they evaluate the performance impact of implementing this application on multicore architectures. The authors in 30 used yet another MSA application, the center star method proposed by Gusfield 31, and they evaluate the performance of two different implementations (i.e., MPI and CUDA) on clusters and GPGPUs.

Many studies have been conducted to optimize scientific applications/kernels on today's HPC platforms and to compare the performance of those applications on those platforms. The authors in 32 propose new techniques to implement LU decomposition for the IBM Cyclops-64 many-core architecture. They propose a method that adjusts the load and data distribution during each successive elimination step of the LU computation to keep all cores usefully busy, thus maximizing the register tiling performance potential of the following two techniques. They also developed a method for register tiling that determines the optimal data tile parameters and maximizes data reuse according to register size constraints. They demonstrate that their method is inherently general and should have much broader applicability beyond the Cyclops-64 architecture. The authors in 32 also show that a Blue Gene/L core is five times slower than a Cell BE SPU for the High-Performance Linpack (HPL) implementation of the LU algorithm; no further analysis of the communication performance is conducted. In 33, the authors show that the Cell BE delivers an order of magnitude higher performance for the FFT application compared with the IBM 970 (G5), Pentium 4 Xeon, Opteron 275, and Freescale 7448. Hiroki Nakano 34 discusses the
differences between the Blue Gene/L and the Cell BE but does not compare the performance of those two systems on the same application. The authors in 35 propose a new 3D-transpose method on multi-core processors for volumetric three-dimensional data structures, the data representation mostly used in scientific Cartesian-space computations; they show highly utilized machine resources and a very efficient algorithm on the Cell BE processor. In 36, the authors develop an algorithm for Cell processors using thread-level parallelization, exploiting fine-grained task granularity and a lightweight decentralized synchronization to achieve an efficient parallelization between the short-vector Single-Instruction, Multiple-Data (SIMD) cores for solving systems of linear equations. Guo et al. 37 discuss porting different scientific applications to different HPC platforms. In 38,39,40, image processing applications are ported to the BlueGene and their performance is studied. We previously compared in 41 the Cell BE to the Blue Gene/L on two of the most popular image processing kernels, image resizing and edge detection. We have also developed a parallel HMM algorithm in 42, but the study in this paper is, to our knowledge, the first to compare multicores to clusters on a bioinformatics application used to predict PPIs.

3. Serial Implementation of STRIKE Algorithm

Two strings of characters S1="lql" and S2="lqal" represent two short protein sequences. Each sequence is decomposed into a number of substrings; the shortest substring (pattern to match) is two characters long. In other words, these sequences are implicitly transformed into feature vectors, where each feature vector is indexed by two-character-long substrings whose characters need not be consecutive. Table 1 shows the decomposition of each of the two sequences into 2-character substrings. Each sequence is decomposed into all possible ordered (from left to right) combinations of characters included in the sequence such that the 2 characters need not be consecutive. The first three (from the left) 2-character substrings represent the decomposition of the S1 sequence, while all six 2-character substrings represent the decomposition of the second sequence S2.

Table 1. Mapping the two strings "lql" and "lqal" to a six-dimensional feature space.

              lq     ll     ql     la     qa     al
S1 = Ø(lql)   λ2     λ3     λ2     0      0      0
S2 = Ø(lqal)  λ2     λ4     λ3     λ3     λ2     λ2


Fig.1.a Flowchart of the serial code of STRIKE algorithm

Fig.1.b Flowchart of the processing done inside the matching block in the serial code of STRIKE algorithm


When a 2-character substring appears in a sequence such that these 2 characters are consecutive, the substring's dom (degree of matching) in that sequence is represented by λ2, where λ is a decay factor. For instance, the substring "lq" fits this case in the first sequence. When these 2 characters are separated by another character (a gap of 1), the substring's dom is λ2+gap = λ3; the substring "ll" fits this case in the first sequence. In general, when the 2 characters of the appearing substring are spaced by exactly gap characters, the dom is represented by λ2+gap. The doms for the substrings of the first sequence and all substrings of the second sequence are computed in that fashion, as shown in Table 1. When matching the 2 sequences, only 2-character substrings that appear exactly in both sequences impact and increase the degree of matching. To reflect the degree of matching between the S1 and S2 sequences, the un-normalized string kernel (SK) for the 2 sequences, k(lql,lqal), can be computed as the dot product of the 2 rows of Table 1 containing the doms, i.e. λ4+λ7+λ5. Assuming that the decay factor λ equals 0.5, k(lql,lqal) = 0.102. The higher the un-normalized kernel, the stronger the indication of matching between the two sequences, and the more likely the interaction.

The serial implementation consists of a main procedure which reads the input protein sequences, one from the training sequence set and the other from the testing sequence set, together with the amino acid matrix. Then a pairing operation is performed for each amino acid in the training/testing sequence set with one subsequent amino acid. After that, a matching process is performed, as illustrated in the previous example, between each protein sequence in the training set and all sequences found in the testing set. This matching algorithm generates the pairs of amino acids and their inter-distances and computes the whole score matrix. The matrix contains all amino acid weights corresponding to all characters in the protein sequence. Afterwards, control is passed back to the main procedure for printing the score matrix and computing the execution time. Flowcharts are shown in Figure 1.a and Figure 1.b to simplify tracking the levels of processing done inside the serial algorithm.

4. Parallel Protein Sequence Matching Algorithm

Performing the STRIKE algorithm computations for protein sequences is a compute-intensive process, which motivates its parallelization. To parallelize this algorithm, we describe a high-level parallel algorithm consisting of the following 3 steps: (i) decomposition; (ii) sorting; (iii) inner product. In the "decomposition" step, the amino acid sequences are allocated to processing nodes, one sequence per node. For instance, let us assume that the SKs of the 4 amino acid sequences "lyq", "qyla", "yqla" and "qla" are computed on four parallel computing nodes. The goal is to find mutual interactions between these 4 sequences. The processing node allocation of the 4 sequences proceeds as shown in Figure 2.
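As a reference point for the parallel steps that follow, the serial 2-character string-kernel computation of Section 3 (the per-pair work that the parallel algorithm distributes) can be written as a minimal C sketch. This is our illustration, not the authors' published code; with λ = 0.5 it reproduces k(lql,lqal) ≈ 0.102 from Table 1.

/* Minimal sketch: the un-normalized 2-character string kernel k(s1,s2) is the
 * sum of lambda^(n+m) over all exactly matching ordered character pairs, where
 * n and m are the pair spans (2 + gap) in each sequence. Compile: gcc sk.c -lm */
#include <stdio.h>
#include <string.h>
#include <math.h>

static double string_kernel_2(const char *s1, const char *s2, double lambda)
{
    double k = 0.0;
    size_t l1 = strlen(s1), l2 = strlen(s2);

    /* Every ordered pair (i < j) in s1 is the substring s1[i]s1[j] with
     * dom lambda^(j - i + 1): lambda^2 for adjacent characters,
     * lambda^3 for a gap of one, and so on. */
    for (size_t i = 0; i + 1 < l1; i++)
        for (size_t j = i + 1; j < l1; j++)
            /* Compare it against every ordered pair (p < q) in s2. */
            for (size_t p = 0; p + 1 < l2; p++)
                for (size_t q = p + 1; q < l2; q++)
                    if (s1[i] == s2[p] && s1[j] == s2[q])
                        k += pow(lambda, (double)((j - i + 1) + (q - p + 1)));
    return k;
}

int main(void)
{
    /* Reproduces the worked example: lambda^4 + lambda^7 + lambda^5 = 0.1016 */
    printf("k(lql,lqal) = %.4f\n", string_kernel_2("lql", "lqal", 0.5));
    return 0;
}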


Fig. 2 Allocation of sequences to processing cores (nodes)

In all nodes, the decomposition of each protein sequence proceeds in parallel, and their execution times overlap. Each sequence is decomposed into 2-amino-acid substrings starting with adjacent amino acids, as shown in Figure 3. The "2" in "(ly 2)" refers to the power of the weighted decay factor λ (i.e. λ2), indicating no gap (no intervening character) between the "l" and the "y".

Fig. 3 Decomposition of each protein sequence into substrings of Length=2 and Distance=1

Since the amino acids in the derived substrings are not required to be contiguous, the decomposition into 2-amino-acid substrings whose amino acids are separated by one other amino acid also takes place, as illustrated in Figure 4.

Fig. 4 Decomposition into substrings of Distance=2

Again, the "3" in "(lq 3)" refers to the power of the weighted decay factor λ (i.e. λ3), meaning that the "l" and "q" are separated by another amino acid ("y") in the sequence "lyq". Finally, the decomposition into 2-amino-acid substrings composed of non-adjacent amino acids separated by two other amino acids takes place, as shown in Figure 5.


Fig. 5 Decomposition into substrings of Distance=3

As nodes 1 and 4 have shorter sequences to process in the "decomposition" step, they will complete the "decomposition" step ahead of processing nodes 2 and 3. Thus nodes 1 and 4 can immediately proceed to the "sorting" step, while nodes 2 and 3 will proceed to the "sorting" step immediately after completing the "decomposition" step. In the "sorting" step of the parallel algorithm, the 2-amino-acid substrings generated in the "decomposition" step are sorted alphabetically based on their 2-letter string content. Again, each node sorts its strings alphabetically in parallel with the other nodes, so the string sortings in all 4 nodes overlap in time. After step two is completed, the 4 processing nodes will have the sorted strings shown in Figure 6.
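A minimal sketch of these per-node "decomposition" and "sorting" steps follows. It is our illustration (the Substring record and helper names are assumptions, not the authors' code); for node 2's sequence "qyla" it produces the alphabetically sorted weighted substrings of the kind shown in Figures 3-6.

/* Hedged sketch of the per-node decomposition and alphabetical sorting steps. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char pair[3];   /* the 2-amino-acid substring, e.g. "ly" */
    int  exp;       /* exponent of the decay factor lambda: 2 + gap */
} Substring;

static int cmp_pair(const void *a, const void *b)
{
    return strcmp(((const Substring *)a)->pair, ((const Substring *)b)->pair);
}

/* Decompose a sequence into all ordered 2-character substrings, then sort them. */
static size_t decompose_and_sort(const char *seq, Substring *out)
{
    size_t n = 0, len = strlen(seq);
    for (size_t i = 0; i + 1 < len; i++)
        for (size_t j = i + 1; j < len; j++) {
            out[n].pair[0] = seq[i];
            out[n].pair[1] = seq[j];
            out[n].pair[2] = '\0';
            out[n].exp = (int)(j - i + 1);   /* adjacent = 2, one gap = 3, ... */
            n++;
        }
    qsort(out, n, sizeof(Substring), cmp_pair);
    return n;
}

int main(void)
{
    Substring subs[64];                          /* large enough for the short example sequences */
    size_t n = decompose_and_sort("qyla", subs); /* node 2's sequence */
    for (size_t i = 0; i < n; i++)               /* prints (la 2) (qa 4) (ql 3) (qy 2) (ya 3) (yl 2) */
        printf("(%s %d)\n", subs[i].pair, subs[i].exp);
    return 0;
}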

Fig. 6 Contents of the nodes after sorting the substrings

In the "inner product" step, the inner products are carried out on half (4/2 = 2) of the nodes, those with the largest substring set cardinality. This choice is made to minimize the total inter-node communication time. In our example, nodes 2 and 3 have the highest number of generated substrings. Each of these nodes maintains its own 2-amino-acid substrings and receives the 3 two-amino-acid substrings generated by the node which is allocated the other sequence to match with its sequence. To simplify this example, let us say that our goal is to match only the protein sequences "lyq" (node 1) and "qyla" (node 2) together, and the protein sequences "yqla" (node 3) and "qla" (node 4) together, in the "inner product" step, and not all 4 sequences with each other. As a result, the following data communications take place, as shown in Figure 7.


Fig.7 Inter-node communication

Node 1 sends its generated 2-amino-acid substrings to node 2, and node 4 sends its generated 2-amino-acid substrings to node 3. The communication method depends on the parallel paradigm used, as will be discussed in the following sections. After the communication is done, the contents of each node are as shown in Figure 8.
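For the distributed-memory case, this exchange could be expressed as a plain MPI point-to-point transfer. The sketch below is our illustration, not the authors' code: the Substring layout is an assumption, ranks 1 and 2 stand in for nodes 1 and 2, and sending the struct as raw bytes assumes a homogeneous cluster.

/* Hedged sketch of the inter-node exchange in Fig. 7: rank 1 sends its
 * 2-amino-acid substrings to rank 2 (count first, then the array as bytes).
 * Run with at least 3 ranks, e.g. mpirun -np 4 ./a.out */
#include <mpi.h>

typedef struct { char pair[3]; int exp; } Substring;

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {                       /* "node 1": holds sequence "lyq" */
        Substring subs[3] = { {"ly", 2}, {"lq", 3}, {"yq", 2} };
        int count = 3;
        MPI_Send(&count, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
        MPI_Send(subs, count * (int)sizeof(Substring), MPI_BYTE, 2, 1, MPI_COMM_WORLD);
    } else if (rank == 2) {                /* "node 2": will perform the inner product */
        int count;
        Substring recv[64];
        MPI_Recv(&count, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(recv, count * (int)sizeof(Substring), MPI_BYTE, 1, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}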

Fig. 8 Contents of each node after inter-node communications

In the same example, nodes 2 and 3 then start performing the inner products between their own strings generated in the "decomposition" step and the received strings generated by the neighboring node, as shown in Figure 9. The inner product (α n) . (β m) succeeds when the 2 strings match, i.e. α = β, producing the number n+m (representing λn+m). Otherwise, if α is different from β, it is a mismatch (resulting in 0). Thus nodes 2 and 3 simultaneously perform the following inner products. Note that node 2 will take the product of ly (followed by lq and yq, respectively) with the substrings in the other set. The results produced by each node involved in the inner product step are as follows:

Node   Result
2      0
3      4, 6, 4: λ4 + λ6 + λ4 = 2λ4 + λ6
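A minimal sketch of this inner-product step (our illustration, reusing the assumed Substring layout): each received substring is compared with the locally generated ones, and λ^(n+m) is accumulated on every exact match. Run on node 3's sets, it yields 2λ4 + λ6, matching the result listed above.

/* Hedged sketch of the "inner product" step between a node's own substrings
 * and the substrings received from its neighbour. Compile: gcc ip.c -lm */
#include <stdio.h>
#include <string.h>
#include <math.h>

typedef struct { char pair[3]; int exp; } Substring;

static double inner_product(const Substring *own, size_t n_own,
                            const Substring *recv, size_t n_recv, double lambda)
{
    double k = 0.0;
    for (size_t i = 0; i < n_own; i++)
        for (size_t j = 0; j < n_recv; j++)
            if (strcmp(own[i].pair, recv[j].pair) == 0)      /* exact 2-character match */
                k += pow(lambda, (double)(own[i].exp + recv[j].exp));
    return k;
}

int main(void)
{
    /* Node 3's own substrings (from "yqla") and those received from node 4 ("qla"). */
    Substring own[]  = { {"la", 2}, {"qa", 3}, {"ql", 2}, {"ya", 4}, {"yl", 3}, {"yq", 2} };
    Substring recv[] = { {"la", 2}, {"qa", 3}, {"ql", 2} };
    printf("k = %.6f\n", inner_product(own, 6, recv, 3, 0.5));  /* 2*lambda^4 + lambda^6 = 0.140625 */
    return 0;
}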


Fig. 9 Inner products

On a processing node, matching two substrings starting with the same amino acid speeds up the kernel computation step. After matching a string with all other substrings starting with the same amino acid, the remaining strings in the second sequence can be skipped, as the strings have been sorted in alphabetical order in the "sorting" step. For instance, referring to the above results, after matching (ly 2) to (la 2), processing of the string (ly 2) stops, as the remaining strings in the second set do not start with the amino acid l. This could be implemented by a simple indexing mechanism based on the starting amino acid of the substrings. In the absence of such a mechanism, a string has to be matched (i.e. its inner product taken) with all strings in the other set until a match is found or until all the strings in the other set have been exhausted. The "inner product" step can be repeated as many times as needed to match other protein sequences allocated to other processing nodes. For instance, to match the lyq (allocated to processing node 1) and qla (allocated to processing node 4) sequences, processing node 4 sends its 2-amino-acid substrings generated in the "decomposition" step to processing node 1, which carries out the inner product step. Thus the parallel algorithm is capable of matching as many sequences in parallel as desired, based on the availability of processing nodes. STRIKE is scalable and is expected to achieve excellent performance scalability with increasing hardware resources.

5. Shared-Memory (Multithreaded) Implementation on Multicore

First, for efficiency and proper indexing, we skip the sorting step and perform matching between all the 2-character substrings of the two protein sequences to be matched. Second, to improve the matching accuracy, we modify the SK to be the weighted inner product of the doms (α n) . (β m), i.e. from λn+m to λn+m x matrix(c1) x matrix(c2), where c1 and c2 are the first and second characters appearing in the matching substrings α = β = "c1 c2", and matrix(c1) is a weight given to each character, such that the alphabetical characters A, B, C, etc. can be assigned different weights, helping direct the matching towards specific characters.
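A hedged sketch of this weighted variant follows; the per-character weight table and its indexing are our assumptions (the authors' actual amino acid matrix is not reproduced here). With uniform weights it reduces to the plain λn+m score.

/* Weighted match score of the modified SK: lambda^(n+m) * weight(c1) * weight(c2). */
#include <stdio.h>
#include <math.h>
#include <ctype.h>

static double weighted_match(char c1, char c2, int n, int m,
                             double lambda, const double weight[26])
{
    /* 'weight' is an assumed per-character table indexed by the letters A..Z. */
    return pow(lambda, (double)(n + m))
         * weight[toupper((unsigned char)c1) - 'A']
         * weight[toupper((unsigned char)c2) - 'A'];
}

int main(void)
{
    double weight[26];
    for (int i = 0; i < 26; i++)
        weight[i] = 1.0;                 /* illustrative uniform weights */
    printf("%.6f\n", weighted_match('l', 'q', 2, 2, 0.5, weight));  /* 0.0625 */
    return 0;
}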


The parallel multithreaded implementation consists of a main procedure which reads the input protein sequences, one from the training set and the other from the testing set, together with the amino acid matrix, and launches parallel jobs which are assigned an equal number of sequences to match. These jobs generate the pairs of amino acids and their inter-distances and compute the portion of the score matrix corresponding to the sequences assigned to them. The matrix contains all amino acid weights corresponding to all characters in the protein sequence. Afterwards, control is passed back to the main procedure for printing the score matrix and computing the execution time. In our shared-memory system, processing nodes 2 and 3 read the 2-amino-acid substring data generated by nodes 1 and 4 from shared memory. After the data is received or read by the destination nodes, processing nodes 2 and 3 hold the substrings shown in Figure 8. There is no need to keep nodes 1 and 4 active.

6. Distributed-Memory (MPI) Implementation on HPC

The main difference between the MPI implementation and the multithreaded implementation lies in the way data is shared among nodes. In a distributed-memory system, the data communication takes place in the form of messages sent by the sender nodes (1 and 4) to the destination nodes (2 and 3) of the example in Figure 8. The processing done in the parallel MPI implementation is similar to the multithreaded implementation until the pairing operation is finished for all amino acids in the training/testing sequence set. After that, our proposed parallel MPI algorithm divides the training sequence pairs equally among all nodes used to run the code. A matching process is then performed, inside each node, between each protein sequence in the training set that is allocated to the current node and all sequences found in the testing set. This matching step generates the pairs of amino acids and their inter-distances and computes the partial score matrix related to each node. The whole score matrix contains all amino acid weights corresponding to all characters in the protein sequence; it is collected from all nodes to one node using an MPI collective communication API (e.g., MPI_Gather or MPI_Allgather). A flowchart is shown in Figure 10 to simplify tracking the levels of processing done inside the MPI parallel code.

7. Experimental Setup

For the shared-memory platforms, we implemented the multithreaded STRIKE application and tested it on x86 dual-core and quad-core PCs. The dual-core PC has an Intel Core 2 Duo dual-core 2 GHz processor with 1 GB DRAM running Windows XP SP3, while the quad-core PC has an Intel Core 2 quad-core 2.66 GHz processor with 4 GB DRAM running Windows Server 2003. In the experiments, we launched NumThreads parallel threads with an equal load of sequences (except for the last thread); each thread processes its sequences in shared memory so that no communication time is incurred, but the drawback of this implementation is the synchronization needed among threads. The maximum number of threads we could reach in this case is 16, due to the hardware limitations of the available multicore systems.
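A minimal sketch of such an equal-load thread launch is shown below. This is our illustration under assumptions (POSIX threads rather than the native Windows threading actually used on these PCs; hypothetical names such as match_worker), not the authors' code.

/* Hedged sketch: launch NUM_THREADS workers, each handling an equal,
 * contiguous share of the sequences; the last thread absorbs the remainder. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS   16
#define NUM_SEQUENCES 2010          /* assumed testing-set size */

typedef struct { int first, count; } WorkSlice;

static void *match_worker(void *arg)
{
    WorkSlice *w = (WorkSlice *)arg;
    /* Here each thread would run the substring matching for sequences
     * [w->first, w->first + w->count) and fill its part of the shared
     * score matrix; the placeholder below only reports the slice. */
    printf("thread handles %d sequences starting at %d\n", w->count, w->first);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];
    WorkSlice slice[NUM_THREADS];
    int per_thread = NUM_SEQUENCES / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        slice[t].first = t * per_thread;
        slice[t].count = (t == NUM_THREADS - 1)
                       ? NUM_SEQUENCES - slice[t].first    /* last thread takes the remainder */
                       : per_thread;
        pthread_create(&tid[t], NULL, match_worker, &slice[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}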


Fig. 10 Flowchart of the MPI code of STRIKE algorithm
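As a rough, hedged companion to the flowchart of Figure 10, the sketch below shows how each rank's partial score matrix could be assembled on one node with MPI_Gather. It is our illustration, not the authors' code: the sizes, the strike_score placeholder, and the assumption that the sequence count divides evenly by the node count are all ours.

/* Hedged sketch: each rank fills its slice of the score matrix, then rank 0
 * collects all slices with MPI_Gather. Compile with mpicc. */
#include <mpi.h>
#include <stdlib.h>

#define N_TRAIN 128                 /* assumed training-set size */
#define N_TEST  128                 /* assumed testing-set size  */

/* Hypothetical stand-in for the per-pair string-kernel score computation. */
static float strike_score(int train_idx, int test_idx)
{
    return (float)(train_idx + test_idx);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Assumes N_TRAIN is divisible by the number of ranks, for simplicity. */
    int rows_per_node = N_TRAIN / size;
    int first_row = rank * rows_per_node;

    /* Partial score matrix: rows_per_node training rows x N_TEST columns. */
    float *partial = malloc((size_t)rows_per_node * N_TEST * sizeof(float));
    for (int i = 0; i < rows_per_node; i++)
        for (int j = 0; j < N_TEST; j++)
            partial[i * N_TEST + j] = strike_score(first_row + i, j);

    /* Rank 0 assembles the full N_TRAIN x N_TEST score matrix. */
    float *full = (rank == 0) ? malloc((size_t)N_TRAIN * N_TEST * sizeof(float)) : NULL;
    MPI_Gather(partial, rows_per_node * N_TEST, MPI_FLOAT,
               full,    rows_per_node * N_TEST, MPI_FLOAT, 0, MPI_COMM_WORLD);

    free(partial);
    if (rank == 0) free(full);
    MPI_Finalize();
    return 0;
}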

For the distributed-memory platforms, we also implemented the MPI STRIKE and tested it on two different high-performance computing (HPC) platforms (clusters); the main difference between them is the number of nodes we could scale to. The small-scale cluster is a heterogeneous system of 10 PCs: one acts as the server, while the others are clients. The server has two single-core processors, each a 3.4 GHz Intel Pentium 4 CPU, while the clients are classified as follows: (1) three clients with a single-core processor, each processor a 3.4 GHz Intel Pentium 4 CPU; (2) six clients with two dual-core processors, each processor a 2.4 GHz Intel Pentium Dual CPU. The cluster nodes are connected via a 3Com 10/100/1000 Mbps LAN switch and LAN cables of types CAT5 (which support data rates up to 100 Mbps) and CAT5e (which support data rates up to 1000 Mbps). We used the gcc compiler, version 3.4.6 20060404 (Red
Hat 3.4.6-8), which has many optimized implementations of the different libraries on Linux families. For the MPI experiments, we used the mpicc compiler from mpich-1.2.4, a freely available 43, portable implementation of MPI (Message Passing Interface) used to allow computers to communicate with each other. The large-scale cluster is the Sun Microsystems cluster, which is considered a homogeneous system, with the specifications shown in Table 2 15.

Table 2. Bibliotheca Alexandrina Sun Microsystems cluster technical description

Number of nodes:                 128 eight-core compute nodes
Processors/node:                 2 quad-core sockets per node, each an Intel Quad Xeon E5440 @ 2.83 GHz
Memory/node:                     8 GB memory per node; total memory 1.05 TBytes (132 x 8 GB)
Node-node interconnect:          Ethernet & 4x SDR InfiniBand network for MPI; 4x SDR InfiniBand network for I/O to the global Lustre filesystems
Pre- and post-processing nodes:  6 management nodes, incl. two batch nodes for job submission with 64 GB RAM
OS:                              Compute nodes: RedHat Enterprise Linux 5 (RHEL5); front-end & service nodes: RedHat Enterprise Linux 5 (RHEL5)

For the MPI experiments, we vary the number of tasks to run our experiments on 1, 2, 4, 6, 8, 16, 32, 64 and 128 nodes. Each node reads all the training sequences, while the testing sequences are divided equally among all processing nodes. Each processing node then processes the two sequence sets in its local memory and produces a partial score matrix based on the partial sequences it possesses; finally, the score matrix is collected in the memory of one node (the node of rank zero). The communication time results account for assembling the score matrix in the memory of that node (rank zero). The score matrix is a two-dimensional array of type "float", where the length of its first dimension equals the number of testing sequences used and the length of its second dimension equals the number of training sequences used. This is because the score matrix reflects the degree of matching between a training sequence and a corresponding testing sequence, as illustrated in detail in Section 4. Descriptions of the experiments performed on both the multicores and the clusters are shown in Table 3.

The number of nodes/cores is not the only parameter we vary in our experiments. Another parameter is the sequence length, for which we have three cases:

1) Short-sequence set: we used a training file that contains 18 sequences and a testing file that contains 12 sequences; each sequence in both files consists
of 166 amino acids at maximum, and each amino acid is represented by an alphabetical letter which occupies 1 byte in memory.

2) Medium-sequence set: we used a training file that contains 128 sequences and a testing file that contains 128 sequences; each sequence in both files consists of 5000 amino acids at maximum, and each amino acid is represented by an alphabetical letter which occupies 1 byte in memory.

3) Long-sequence set: we used 8 training files, each containing 1800 sequences at maximum; each file has a corresponding testing file which contains 2010 sequences at maximum. Each sequence in any file consists of 5000 amino acids at maximum, and each amino acid is represented by an alphabetical letter which occupies 1 byte in memory.

Table 3. Experiments for both multithreaded and MPI implementations

Experiment ID   Measured Time                                  Platform
1               Communication and Computation Time             Small-scale cluster
2               Communication and Computation Time             Large-scale cluster
3               Setup, Communication and Computation Time      Large-scale cluster
4               Overall Time                                   Dual-core and quad-core PCs
5               Overall Time                                   Large-scale cluster

8. Results and Analysis

8.1. Distributed Memory Experiments

8.1.1. Short Sequence Set

For experiments 1 and 2 in Table 3, the first factor to examine is the computation time. Figure 11 shows the computation speedup of the STRIKE application (on the y-axis), relative to one node, for an increasing number of nodes (on the x-axis). As expected, the computation time decreases in a semi-linear manner as the number of nodes increases, since the data size processed by each node decreases. The reason for the semi-linearity is the unequal number of amino acids inside each sequence, which causes an imbalance in the computational load on the processing nodes. This semi-linearity turns into non-linearity in the longer sequence set
experiments. This is mainly due to the higher variation in the amino-acid sequence lengths, as we discuss later. The next factor to examine is the communication overhead. Figure 12 compares the relative communication time with the relative computation time (on the y-axis) for an increasing number of nodes (on the x-axis). The main observation is that the communication overhead is negligible for small numbers of nodes (i.e., 2 and 4 nodes), accounting for less than 7% of the total execution time. Even with a higher number of nodes (6 nodes), the communication overhead does not exceed 12%. The low communication overhead is due to the coarse-grain data decomposition used in our application. When comparing the performance of the large-scale cluster execution to the small-scale cluster execution, as seen in Figure 13, we can infer that the small-scale cluster takes slightly lower computation time on 6 nodes than the large-scale cluster. This is due to the heterogeneous cluster used in the small-scale system, in which the speedup from parallelization is dominated by the slowest machine in the system. Accordingly, when we measure the performance of the serial run on the fastest machine, we are not able to achieve a linear speedup with the parallel run, since the slow machines introduce idle time compared to the fast machines. Regarding the communication time, we can also observe in Figure 13 that the small-scale cluster has slightly higher communication time on 6 nodes than the large-scale cluster. This is attributed to the InfiniBand interconnect, which has higher throughput and lower latency than the 3Com 46 interconnect. InfiniBand is also characterized by quality of service, failover and scalability. More about InfiniBand will be discussed in the next experiment.

Fig. 11 Computation speedup vs. number of nodes using short-sequence set on small-scale cluster for the MPI implementation


Fig. 12 Application performance using short-sequence set w.r.t one node on small-scale cluster for the MPI Implementation


Fig. 13 Comparing Relative Execution Time between 1 and 6 nodes on both small-scale cluster and large-scale cluster using the short-sequence set for the MPI implementation.

8.1.2. Medium Sequence Set


For experiment 3, the first factor to examine is the scalability of the computation component on the large-scale cluster, as shown in Figure 14 (recall that this experiment runs on the large-scale cluster). The increase in computation speedup (on the y-axis) is not linearly proportional to the increase in the number of nodes (on the x-axis). This is due to the way processing is done in the STRIKE algorithm, as discussed in Section 3. In our application, a string has to be matched (i.e. its inner product taken) with all strings in the other set until a match is found or until all the strings in the other set have been exhausted. Hence, decreasing the number of sequences processed on each node decreases the computation on the one hand, but on the other hand the variation in string length increases. Also, the score matrix is not calculated until a match happens. This leads to an unbalanced computational load across nodes. This deep non-linearity in the computation speedup did not appear in the previous experiments because the variations in sequence lengths in the medium-sequence set are larger than those in the short-sequence set.

Fig. 14 Computation time versus number of nodes on large-scale cluster for the MPI implementation

Second, in experiment 3 we analyze the communication time on the large-scale cluster using the medium-sequence set. Figure 15 shows the communication time (on the y-axis), relative to 8 nodes, for all considered numbers of nodes on the x-axis. There are two opposing factors behind the irregular changes in the communication time as the number of nodes increases. On the one hand, increasing the number of processing nodes reduces the amount of data transferred per node, and accordingly the total time to transfer the data should decrease. In addition, since the communication is carried over the InfiniBand network, which is a switched-fabric communication link, it can function as a simple point-to-point interconnect and also scale to handle thousands of nodes, resulting in an increase in the usable bandwidth that consequently reduces the time to transfer the data. On the other hand, the latency of single-data-rate
switch chips is 200 nanoseconds, and the end-to-end MPI latency ranges from 1.07 microseconds to 2.6 microseconds. InfiniBand also provides Remote Direct Memory Access (RDMA) capabilities for low CPU overhead; RDMA operations add about 1 microsecond to the overall latency 47.

Fig. 15 Communication time versus number of nodes on large-scale cluster for the MPI implementation

For experiment 3 we observe a new component of time in Figure 16-a: the setup time required to divide the work among nodes and to handle I/O operations (i.e., opening/closing files and reading from/writing to files). Parallelizing the I/O operations does not help, since parallel access to the file system negates any benefit from the parallelization itself due to the slow secondary storage of the system drive (we performed this test but do not show its results in the paper). We need to observe the contribution of these components to the overall execution time. We noticed that the setup time increases slightly with an increasing number of nodes. This is mainly due to the increase in I/O operations as the number of nodes grows: the many "read" operations issued by each node are limited by the file server's ability to handle these multiple read requests at the same time, and accordingly the buffering time increases. Also, as the I/O operations must be conducted over the network due to the file server system used in our HPC systems (i.e., NFS), this causes severe bottlenecks. However, the setup time can be neglected, as it does not exceed 11.6% of the overall time even for larger numbers of nodes. Figure 16-a shows the relative times on a log scale in order to distinguish the very small setup-time component from the other time components. Figure 16-a distinguishes between the different components of the overall execution time (i.e., computation, communication and setup time) on a log scale (y-axis) and compares them for different numbers of nodes (x-axis). As noticed, most of the time is spent in computation for any number of nodes, as the communication time does not exceed
3% of the overall time even for the largest number of nodes (i.e., 128 nodes). The raw time values (in hours) of all time components are also shown in Table 4.

Table 4. Values of the different components of execution time for different numbers of nodes on the large-scale cluster for the MPI implementation

                     1 node     8 nodes    16 nodes   32 nodes   64 nodes   128 nodes
Setup time (hours)   0.02232    0.06791    0.08172    0.31803    0.6334     0.92690033
Comm time (hours)    0          0.016257   0.01515    0.01574    0.01817    0.21888342
Comp time (hours)    167.6548   28.9408    16.7982    10.7951    8.17581    6.85820375
Overall (hours)      167.6771   29.02496   16.89503   11.12884   8.82739    8.0039875

Finally, putting it all together for experiment 3, the overall speedup (on the y-axis) increases with the number of nodes (on the x-axis), as seen in Figure 16-b. This increase in overall speedup, which corresponds to a decrease in overall execution time, is not linear with the number of nodes, since the variations in each time component (i.e., computation, communication and setup time) affect the resulting total execution time. The scalability of the application is now dominated by the amount of work done per node and by the atomic characteristics of the search and comparisons per string. The efficiency of utilizing the compute nodes is reduced significantly (about 16%: using the overall times in Table 4, the 128-node speedup is 167.68/8.00, roughly 21x, i.e., an efficiency of about 21/128 or 16%), yet speedup can still be achieved. The availability of a larger-scale system may yield more speedup, but at the cost of compute-resource efficiency.


Fig. 16.a Overall execution time (in log scale) versus number of nodes on large-scale cluster for the MPI implementation


Fig. 16.b Overall speedup versus number of nodes on large-scale cluster for the MPI implementation

8.2. Shared Memory Experiments

In experiment 4, the execution time (on the y-axis) needed to run the application (with input files that have many long protein sequences) drops to 1.5 days on the quad-core system versus a whole week for the single-threaded implementation, as shown in Figure 17. This was achieved by the 16-thread version. Other versions with 8, 4 and 2 threads consumed longer execution times on the quad-core and dual-core systems in comparison with the 16-thread version. Figure 18 plots the ratio of the execution time with 2 threads over the execution times with 4, 8 and 16 threads, measured on the Core 2 Duo dual-core 2 GHz machine, reflecting the performance with an increasing number of threads. Higher relative performance with 8 and 16 threads is obtained on the quad-core machine with more hardware resources and, in general, on machines with more cores and memory. For the long-sequence files, the sequences were partitioned into 8 different files, with 8 different runs executing concurrently on 8 independent quad-core x86 systems, cutting the execution time to 4.5 hours (compared to 1.5 days on a single quad-core x86 computer and a week with the single-threaded version), as seen in Figure 17. Launching more than 16 parallel threads resulted in shrinking speedup gains due to context-switching overhead.


Fig. 17 Application performance using long-sequence set on multicores for the multithreaded implementation

Fig. 18 Relative performance with 1, 2, 4, 8, and 16 threads with respect to performance with one thread on the dual-core laptop 6.

8.3. Platform Comparison

To compare the multithreaded implementation for shared-memory systems to the MPI implementation for distributed-memory systems, we test the application with multiple long-sequence files (i.e., 8 files) on the large-scale cluster in experiment 5, as we did on the multicore systems in experiment 4. We then compare the overall execution time the application takes on both platforms. In this experiment we have three cases
depending on the number of processing nodes used and/or the way the sequences are read from the input files (i.e., serially or concurrently). For the one-processing-node run (serial run), we notice, as shown in Figure 19, that the application takes lower execution time in the MPI implementation. We believe this is due to the better technical specifications of the cluster machines compared to the multicore PCs. In the second case, we run the application executing the eight input sequence files serially (i.e., one after the other) on a total of sixteen processing nodes. For the multicore/multithreaded machine, 1 PC launches 16 threads on the shared-memory system, and we compare it to the run with 16 nodes on the distributed-memory system. We can observe that the execution time shrinks by 50% in the case of the MPI implementation compared to the multithreaded implementation. Double the speed is achieved on the distributed system due to the dedicated resources per node and the insignificance of the communication overhead, compared to the resource sharing of the multithreaded implementation. The last case is to execute the 8 sequence files concurrently, on 8 PCs in the case of the shared-memory system, in which each PC launches 16 threads, and on 128 nodes in the case of the distributed-memory system (i.e., each group of 16 nodes from the overall 128 nodes processes 1 file). In this scenario we are using our cluster in a high-throughput mode. This leads to approximately the same shrinking ratio we obtained in the previous trial. Scaling between 16 nodes and 128 nodes is linear. We believe this happens because the technical specifications of the cluster machines are better than those of the multicore PCs, and this advantage leads to better computation processing in the cluster than in the multicore PCs. Instead of achieving an ideal 8X speedup on the massively parallel system, and due to the reduced efficiency in using the nodes, we can achieve only double the speedup.

Fig. 19 Relative execution time comparison of the parallel STRIKE application between multithreaded and MPI implementations using eight long-sequence files


The scalability that can be achieved through the shared-memory (multicore/multithreaded) systems is dominated by the ability to integrate more processing nodes (cores/threads) on a single chip. Figure 20 compares the scalability of the two implementations (multithreaded and MPI) for a single long-sequence file. Parallel STRIKE can achieve 4X and 6X speedups with eight threads on shared memory and eight nodes on distributed memory, respectively. The multithreaded machine can hardly add 10% performance by doubling the threads to sixteen, while the massively parallel distributed system achieves almost double the speedup. The speedup reaches a saturation level for the shared memory, while the distributed memory can go up to a 21X speedup with 128 nodes. Sharing the processing resources limits the achievable speedup, and massively parallel systems can still compete in scalability.

Fig. 20 Performance comparison of the STRIKE application between multithreaded and MPI implementations using one long-sequence file.

9. Conclusion

STRIKE is able to achieve a reasonable improvement over the existing protein-protein interaction prediction methods. We developed a parallel STRIKE algorithm for both distributed- and shared-memory systems. We also studied the performance of our parallel STRIKE implementation on multicore computers (i.e., dual-core and quad-core PCs) and on two HPC platforms, namely the heterogeneous cluster and the 8-core compute nodes of the massively parallel Sun Microsystems supercomputer, and then compared our results on both of those systems. On long protein sequence sets, the execution time of the parallel implementation of this bioinformatics algorithm was reduced from about a week on a serial uniprocessor machine to about 16.5 hours on 16 computing nodes, and down to about 2 hours on 128 parallel nodes. The PC cluster with 128 nodes takes a long
communication time compared to a lower number of nodes, but it scales the computation time nearly linearly. A higher number of nodes would improve computation performance well beyond 128 nodes if available, but this would also affect the communication cost. Our implementation was shown to scale very well with increasing data size and number of nodes. Our application was also shown to take lower execution time in the distributed-memory implementation than in the shared-memory one. Scalable applications can still take advantage of massively parallel systems compared to newly developed multicore systems.

References

1. Sprinzak E. and Margalit H., "Correlated sequence-signatures as markers of protein-protein interaction". J. Mol Biol., 311, 2001, pp. 681-692.
2. Deng M., Mehta S., Sun F. and Cheng T., "Inferring domain-domain interactions from protein-protein interactions". Genome Res., 12, 2002, pp. 1540-1548.
3. Huang T.W., Tien A.C., Huang W.S., Lee Y.C., Peng C.L., Tseng H.H., Kao C.Y. and Huang C.Y., "POINT: a database for the prediction of protein-protein interactions based on the orthologous interactome". Bioinformatics, 20, 2004, pp. 3273-3276.
4. Xue-Wen C. and Mei L., "Prediction of protein-protein interactions using random decision forest framework". Bioinformatics, 21, 2005, pp. 4394-4400.
5. Sylvain P., Frank D., Albert C., Jim C., Alex D., Andrew E., Marinella G., Jack G., Mathew J., Nevan K., Xuemei L. and Ashkan G., "PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs". BMC Bioinformatics, 7, 2006, pp. 365.
6. Sibai F.N. and Zaki N., "Parallel protein sequence matching on multicore computers", IEEE Soft Computing and Pattern Recognition (SoCPaR), 2010.
7. Lodhi H., Saunders C., Shawe-Taylor J., Cristianini N. and Watkins C., "Text Classification using String Kernels". J. of Machine Learning Res., 2, 2002, pp. 419-444.
8. Zaki N.M., Deris S. and Illias R.M., "Application of string kernels in protein sequence classification". Applied Bioinformatics, 4, 2005, pp. 45-52.
9. Haussler D., "Convolution kernels on discrete structures". Technical Report UCSC-CRL-99-10, University of California Santa Cruz, 1999.
10. Watkins C., "Dynamic alignment kernels". Advances in Large Margin Classifiers, Cambridge, MA, MIT Press, 2000, pp. 39-50.
11. Sanders J. and Kandrot E., "CUDA by Example: An Introduction to General-Purpose GPU Programming". Pearson Education, Inc., 2010.
12. Sibai F. N., "Evaluating the performance of single and multiple core processors with PCMARK® 05 and benchmark analysis", ACM SIGMETRICS Performance Evaluation Review 35 (4), pp. 62-71, 2008.
13. "NVIDIA CUDA Programming Guide 2.0": http://www.nvidia.com/object/cuda_develop.html
14. Munshi A., Gaster B., Mattson T., Fung J. and Ginsburg D., "OpenCL Programming Guide", Pearson Education, Inc., 2011.
15. http://www.bibalex.org/ISIS/Frontend/Projects/ProjectDetails.aspx
16. Kirk D. B. and Hwu W. W., "Programming Massively Parallel Processors: A Hands-on Approach", Morgan Kaufmann, 2010.
17. Sibai F. N., "On the performance benefits of sharing and privatizing second and third-level cache memories in homogeneous multi-core architectures", Microprocessors and Microsystems 32 (7), pp. 405-412, 2008.
18. www.cray.com

19. Haring R.A., Ohmacht M., Fox T.W., Gschwind M.K., Satterfield D.L., Sugavanam K., Coteus P.W., Heidelberger P., Blumrich M.A., Wisniewski R.W., Gara A., Chiu G.L.-T., Boyle P.A., Christ N.H. and Kim C., "The IBM Blue Gene/Q Compute Chip", IEEE Micro, Volume 32, Issue 2, 2012.
20. Lakner G. and Knudson B., "IBM System Blue Gene Solution: Blue Gene/Q System Administration", IBM Redbooks, 2013.
21. Bateman E., Birney E., Durbin R., Eddy S., Howe K. and Sonnhammer E., "The Pfam Protein Families Database", Nucleic Acids Research 28, pp. 263-266, 2000.
22. Zaki N., El-Hajj W., Kamel H. and Sibai F., Chapter 26, "A Protein-Protein Interaction Classification Approach", in Arabnia H. and Tran Q., "Software Tools and Algorithms for Biological Systems", Springer, 2011.
23. Schoenrock A., Dehne F., Green J., Golshani A. and Pitre A., "MP-PIPE: a massively parallel protein-protein interaction prediction engine", Proceedings of the International Conference on Supercomputing ICS '11, pp. 327-337, 2011.
24. You Z., Ming Z., Huang H. and Peng X., "A Novel Method to Predict Protein-Protein Interactions Based on the Information of Protein Sequence", 2012 IEEE International Conference on Control System, Computing and Engineering, Malaysia, 2012.
25. Zaki N., Lazarova-Molnar S., El-Hajj W. and Campbell P., "Protein-protein interaction based on pair-wise similarity", BMC Bioinformatics, 2009, pp. 10-150.
26. Arudchutha S., Nishanthy T. and Ragel R. G., "String Matching with Multicore CPUs: Performing Better with the Aho-Corasick Algorithm", Cornell University Library, 2014.
27. Sharma C. and Vyas A.K., "Parallel Approaches in Multiple Sequence Alignments", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 2, February 2014.
28. Isaza S. et al., "Parametrizing Multicore Architectures for Multiple Sequence Alignment", ACM CF'11, 2011, Ischia, Italy.
29. Thompson J., Higgins D. and Gibson T., "CLUSTALW: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-specific Gap Penalties and Weight Matrix Choice", Nucleic Acids Research, 22:4673-4680, 1994.
30. Vasconcellos J., Nishibe C., Almeida N. and Cáceres E., "Efficient Parallel Implementations of Multiple Sequence Alignment using BSP/CGM Model", Proceedings of Programming Models and Applications on Multicores and Manycores PMAM '14, pp. 130-138, 2014.
31. Gusfield D., Algorithms on Strings, Trees and Sequences, chapter 14, Cambridge University Press, 1997.
32. Venetis I.E. and Gao G.R., "Mapping the LU decomposition on a many-core architecture: Challenges and solutions", ACM International Conference on Computing Frontiers, Ischia, Italy, 18-20 May 2009, pp. 71-80.
33. Cico L., Cooper R. and Greene J., "Performance and programmability of the IBM/Sony/Toshiba Cell broadband engine processor", White Paper, 2006.
34. Nakano H., "Technology and applications of Bluegene supercomputer and cell broadband engine", International Symposium on VLSI Technology, Systems and Applications, Hsinchu, Taiwan, 23-25 April 2007, pp. 1-4.
35. El-Moursy A., El-Mahdy A. and El-Shishiny H., "An efficient in-place 3D transpose for multicore processors with software managed memory hierarchy", IFMT, 2008.
36. Kurzak J., Buttari A. and Dongarra J., "Solving systems of linear equations on the CELL processor using Cholesky factorization", IEEE Transactions on Parallel and Distributed Systems, 19(9), 2008, pp. 1175-1186.
37. Guo X., Gray A., Simpson A. and Trew A., "Moving HPCx applications to future systems", Technical Report, HPCx Capability Computing, HPCxTR0610, 2006.

38. Peterka T., Yu H., Ross R.B., Ma K.-L. and Latham R., "End-to-end study of parallel volume rendering on the IBM BlueGene/P", Proceedings of the ICPP'09 Conference, Vienna, Austria, September 2009.
39. Rao A.R., Cecchi G.A. and Magnasco M., "High performance computing environment for multidimensional image analysis", International Workshop on Multiscale Biological Imaging, Data Mining and Informatics, Santa Barbara, CA, U.S.A., September 2006, pp. 35-37.
40. Commer M., Newman G.A., Carazzone J.J., Dickens T.A., Green K.E., Wahrmund L.A., Willen D.E. and Shiu J., "Massively parallel electrical-conductivity imaging of hydrocarbons using the IBM Blue Gene/L supercomputer", IBM Journal of Research and Development, 52(1-2), 2008, pp. 93-104.
41. El-Moursy A. and Sibai F., "Image processing applications performance study on Cell BE and Blue Gene/L", Concurrency and Computation, John Wiley & Sons, Ltd., 2010.
42. El-Moursy A.A., ElAzhary H. and Younis A., "High-accuracy hierarchical parallel technique for hidden Markov model-based 3D magnetic resonance image brain segmentation", Concurrency and Computation: Practice and Experience, J. Wiley, 26(1): 194-216 (2014).
43. http://www.mpich.org/
44. "Intel Snaps Up InfiniBand Technology, Product Line from QLogic", http://www.hpcwire.com/hpcwire/2012-0123/intel_snaps_up_infiniband_technology,_product_line_from_qlogic.html
45. "An Introduction to the InfiniBand Architecture", O'Reilly, http://www.oreillynet.com/pub/a/network/2002/02/04/windows.html
46. "3Com® Switch 4500G Family Configuration Guide", http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02579469/c02579469.pdf
47. Liu J., Wu J., Kini S. P., Wyckoff P. and Panda D. K., "High Performance RDMA-Based MPI Implementation over InfiniBand", The 17th Annual ACM International Conference on Supercomputing (ICS '03), 2003.
