GPU-based Cloud Computing for Comparing the Structure of Protein Binding Sites
Matthias Leinweber1, Lars Baumgärtner1, Marco Mernberger1, Thomas Fober1, Eyke Hüllermeier1, Gerhard Klebe2, Bernd Freisleben1
1 Department of Mathematics & Computer Science and Center for Synthetic Microbiology, University of Marburg, Hans-Meerwein-Str. 3, D-35032 Marburg, Germany
2 Department of Pharmacy and Center for Synthetic Microbiology, University of Marburg, Marbacher Weg 6, D-35037 Marburg, Germany
1 {leinweberm, lbaumgaertner, mernberger, thomas, eyke, freisleb}@informatik.uni-marburg.de
2 [email protected]
Abstract— In this paper, we present a novel approach for using a GPU-based Cloud computing infrastructure to efficiently perform a structural comparison of protein binding sites. The original CPU-based Java version of a recent graph-based algorithm called SEGA has been rewritten in OpenCL to run on NVIDIA GPUs in parallel on a set of Amazon EC2 Cluster GPU Instances. This new implementation of SEGA has been tested on a subset of protein structure data contained in the CavBase, providing a structural comparison of protein binding sites on a much larger scale than in previous research efforts reported in the literature.

Index Terms— GPU, Cloud computing, protein binding sites, structure comparison, graph alignment, OpenCL.
I. INTRODUCTION

A major goal in synthetic biology is the manipulation of the genetic setup of living cells to introduce novel biochemical pathways and alter existing ones. A prerequisite for the constitution of new biochemical pathways in microorganisms is a working knowledge of the biochemical function of the proteins of interest. Since assessing protein function experimentally is time-consuming and in some cases even infeasible, the prediction of protein function is a central task in bioinformatics. Typically, the function of a protein is inferred from similar proteins with known functions, most prominently by sequence comparison, owing to the observation that proteins with an amino acid sequence similarity larger than 40% tend to have similar functions [19]. Accordingly, a plethora of algorithms exists for comparing protein sequences, including the well-known NCBI BLAST algorithm [1]. Yet, below this threshold of 40%, the results of sequence comparisons become increasingly uncertain [11]. In cases where a sequence-based inference of protein function remains inconclusive, a structural comparison can provide further insights and uncover more remote similarities [18], especially when focusing on functionally important regions of proteins, such as protein binding sites. Several algorithms exist for comparing putative protein binding sites based on structural data [9], [17], [3]. However, such algorithms have much longer runtimes than their sequence-based counterparts, severely limiting their use for large-scale comparisons.
In this paper, we present a novel approach to significantly speed up the computation times of a recent graph-based algorithm for performing a structural comparison of protein binding sites, called SEGA [15], by using the digital ecosystem of a GPU-based Cloud computing infrastructure. The original CPU-based Java version of SEGA has been rewritten in OpenCL to run on NVIDIA GPUs in parallel on a set of Amazon EC2 Cluster GPU Instances. This new implementation of SEGA has been tested on protein structure data of the CavBase [16], providing a structural comparison of protein binding sites on a much larger scale than in previous research efforts reported in the literature.

This paper is organized as follows. Section II discusses related work. The SEGA algorithm is described in Section III, and its GPU implementation is presented in Section IV. Experimental results are discussed in Section V. Section VI concludes the paper and outlines areas for future work.

II. RELATED WORK

Several graph-based algorithms for protein structure analysis have been proposed in the literature. For example, a subgraph isomorphism algorithm [20] has been used by Artymiuk et al. [2] to identify amino acid side chain patterns. Furthermore, Xie and Bourne [22] have proposed an approach utilizing weighted subgraph isomorphism, while Jambon et al. [6] employ heuristics to find correspondences. A more recent approach based on fuzzy histograms to find similarities in structural protein data has been presented by Fober and Hüllermeier [5]. Fober et al. [4] have shown that pair-wise or multiple alignments of structural protein information can be computed using labeled point clouds, i.e., sets of vertices in a three-dimensional coordinate system. To the best of our knowledge, algorithms for performing a structural comparison of protein binding sites have not been designed to run on modern GPUs. However, several sequence-based protein analysis approaches have been ported to GPUs. For example, NCBI BLAST runs on GPUs to achieve significant speedups [21].
Other projects, such as CUDASW++ and CUDA-BLASTP [13], [14], [12], [8], have shown that GPUs can be used as cheap and powerful accelerators for well-known local sequence alignment algorithms, such as the Smith-Waterman algorithm.

III. THE SEGA ALGORITHM

The SEGA algorithm constructs a global graph alignment of complete node-labeled and edge-weighted graphs, i.e., a 1-to-1 correspondence of nodes. In principle, SEGA realizes a divide-and-conquer strategy by first solving a correspondence problem on a local scale to derive a distance measure on nodes. This local distance measure is used in a second step to solve another correspondence problem on a global scale, by deriving a mutual assignment of nodes to construct a global graph alignment.

To derive a local distance measure, nodes are compared in terms of their immediate surroundings, i.e., the node neighborhood. This node neighborhood is defined by the subgraph formed by the n nearest neighbor nodes. Since SEGA has been developed for graphs representing protein binding sites based on CavBase data [16], nodes represent pseudocenters, i.e., spatial descriptors of physicochemical properties present within a binding site. Edges are weighted with the Euclidean distance between pseudocenters. The basic assumption is that the more similar the immediate surroundings of two pseudocenters are, the higher the likelihood that they belong to corresponding protein regions. Comparing the node neighborhood thus corresponds to comparing the spatial constellation of physicochemical properties in close proximity of these pseudocenters. If these are highly similar, a mutual assignment of these nodes should be favored.

Given two input graphs G1 = (V1, E1) and G2 = (V2, E2) with |V1| = m1 and |V2| = m2, a local m1 × m2 distance matrix D = (dij), 1 ≤ i ≤ m1, 1 ≤ j ≤ m2, is obtained by extracting the induced neighborhood subgraph for each center node vi ∈ V1 and vj ∈ V2, as given by the set of nodes comprising the center node itself and its n closest neighbor nodes. To obtain a distance measure between two nodes vi and vj, the corresponding subgraphs are decomposed into the set of all triangles containing the center node vc (see Figure 1). Then, an assignment problem is solved to obtain the number of matching triangles. Triangles are considered to match if a mutual assignment of nodes exists for which the node labels of corresponding neighbor nodes are identical and all corresponding edge weights are within an ε-range of each other. In other words, a superposition preserving node labels (exempting the center node) and edge lengths is obtained. The node labels of the center nodes are not required to match, to introduce a certain level of tolerance, which is necessary when dealing with molecular structure data. Likewise, the parameter ε ≥ 0 is a tolerance threshold determining the allowed deviation of edge lengths. The obtained distance matrix D can be considered as a cost matrix, indicating the cost for each potential assignment of nodes vi ∈ V1 and vj ∈ V2. In the second step of the algorithm, an optimal assignment of nodes from V1 and V2 is derived incrementally, by first realizing the assignment of nodes that have the smallest distance to each other before assigning the next pair of nodes.
Fig. 1. Decomposition of the neighborhood of node vc with nneigh = 4. The subgraph defined by the nneigh nearest nodes is decomposed into triangles containing the center node vc .
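To make the local matching step concrete, the following sketch (plain Java, with hypothetical class and method names, not the authors' code) decomposes a neighborhood into center triangles and tests whether two triangles match under the label and ε-distance criteria described above.

```java
// Illustrative sketch of SEGA's local triangle matching (hypothetical names,
// not the original implementation). A neighborhood of n neighbor nodes around
// a center node yields n*(n-1)/2 triangles; two triangles match if the
// neighbor labels can be paired identically and all corresponding edge
// lengths differ by at most epsilon. The center node labels are exempt.
import java.util.ArrayList;
import java.util.List;

final class Triangle {
    final String labelA, labelB;       // labels of the two neighbor nodes
    final double centerToA, centerToB; // edge lengths from the center node
    final double aToB;                 // edge length between the two neighbors

    Triangle(String labelA, String labelB, double centerToA, double centerToB, double aToB) {
        this.labelA = labelA; this.labelB = labelB;
        this.centerToA = centerToA; this.centerToB = centerToB; this.aToB = aToB;
    }
}

final class LocalMatching {
    /** Enumerate all triangles formed by the center node and two of its n nearest neighbors.
     *  dist is an (n+1) x (n+1) matrix of pair-wise distances; row/column 0 is the center node. */
    static List<Triangle> decompose(String[] neighborLabels, double[][] dist) {
        int n = neighborLabels.length;
        List<Triangle> triangles = new ArrayList<>();
        for (int a = 1; a <= n; a++) {
            for (int b = a + 1; b <= n; b++) {
                triangles.add(new Triangle(neighborLabels[a - 1], neighborLabels[b - 1],
                        dist[0][a], dist[0][b], dist[a][b]));
            }
        }
        return triangles;
    }

    /** Two triangles match if neighbor labels agree (in either orientation)
     *  and all corresponding edge lengths are within epsilon. */
    static boolean matches(Triangle s, Triangle t, double epsilon) {
        boolean direct = s.labelA.equals(t.labelA) && s.labelB.equals(t.labelB)
                && Math.abs(s.centerToA - t.centerToA) <= epsilon
                && Math.abs(s.centerToB - t.centerToB) <= epsilon;
        boolean swapped = s.labelA.equals(t.labelB) && s.labelB.equals(t.labelA)
                && Math.abs(s.centerToA - t.centerToB) <= epsilon
                && Math.abs(s.centerToB - t.centerToA) <= epsilon;
        return (direct || swapped) && Math.abs(s.aToB - t.aToB) <= epsilon;
    }
}
```

Counting how many triangles of one neighborhood can be matched one-to-one against the triangles of another neighborhood then yields the local distance entry dij.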
If ambiguities arise, SEGA resorts to global information by selecting assignments for which both nodes preferably show a small deviation with respect to an already obtained partial solution. More precisely, the relative position of the candidate nodes to each node in the partial solution is determined and used to calculate another cost matrix, containing a measure of the geometric deviation for each candidate pair. The actual assignments are then obtained by solving another optimal assignment problem, using the Hungarian algorithm [10]. A more detailed description of the approach can be found in Mernberger et al. [15].

IV. SEGA IN A GPU CLOUD

In this section, a version of the SEGA algorithm running on GPU hardware and a pipelined computation framework for performing large scale GPU-based structural comparisons of protein binding sites in a Cloud environment are presented.

A. GPU Implementation of SEGA

A common problem when developing applications to run on GPU hardware is that it is not easy to utilize all resources of a computational node efficiently. If the complete algorithm is implemented to run on a GPU, the host CPU's work consists only of controlling the device, which usually is not sufficient to operate the processor at full load. The SEGA algorithm is well suited for a division into a GPU and a CPU part. The part of the algorithm that solves the correspondence problem has been rewritten to run on GPU hardware using OpenCL. The iterative part constructing a global alignment is computed on the host CPU, supported by intermediate results generated by the GPU part of the implementation.

The creation of the cost matrix D (see Section III) is divided into four OpenCL kernels. The first OpenCL kernel builds input graphs G = (V, E) from the point cloud information provided by the protein cavity database.
The data is stored in an m × m matrix, where m = |V| is the number of points describing the cavity. Based on the data parallelism in this task, this kernel can run with m² threads at once, where each thread computes one pair-wise distance.

The second OpenCL kernel constructs an intermediate matrix for a protein cavity. This matrix contains, for each node v ∈ V, the indices of its n nearest neighbors. Each line in this matrix is data-independent and contains the indices of the n smallest values in the corresponding line of the m × m distance matrix. It is calculated by m · (m/2) threads, where m/2 threads determine the n smallest values of one line using parallel reduction and block-shared memory.

A neighborhood size of n results in l = n · (n − 1)/2 triangles for each node v ∈ V. These triangles are stored in an m × l matrix Z that is created by the third OpenCL kernel. This kernel is executed with m · l threads in parallel, using a vector of index pairs indicating which of the n nearest neighbors is combined with which other neighbor.

The last OpenCL kernel combines two triangle matrices Z1 and Z2 into the distance matrix D with m1 × m2 elements. It is executed with m1 · m2 threads, where each thread loops over the l · l triangle pairs and computes the cost of a match. The final alignment based on the distance matrix D is computed as described in Section III, supported by the intermediate results generated by the OpenCL part.
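As an illustration of the first kernel's data-parallel layout, the following sketch shows a possible OpenCL kernel, embedded as a Java source string as is common when working with JOCL. Kernel and argument names as well as the memory layout are assumptions for illustration and are not taken from the SEGA GPU code. Each of the m² work-items computes one entry of the m × m distance matrix.

```java
// Hedged sketch: a possible formulation of the pair-wise distance kernel.
// The actual SEGA GPU kernels may be organized differently.
public final class DistanceKernelSketch {

    // OpenCL C source, intended to be launched with a global work size of m * m.
    public static final String SOURCE =
        "__kernel void pairwiseDistances(__global const float4 *points,\n" +
        "                                __global float *dist,\n" +
        "                                const int m) {\n" +
        "    int gid = get_global_id(0);\n" +
        "    if (gid >= m * m) return;\n" +
        "    int i = gid / m;              // row: first pseudocenter\n" +
        "    int j = gid % m;              // column: second pseudocenter\n" +
        "    float4 d = points[i] - points[j];\n" +
        "    // Euclidean distance of the (x, y, z) coordinates; w is ignored.\n" +
        "    dist[gid] = sqrt(d.x * d.x + d.y * d.y + d.z * d.z);\n" +
        "}\n";

    private DistanceKernelSketch() { }
}
```

With JOCL, such a source string would be compiled at runtime and enqueued with a global work size of m²; the subsequent kernels operate analogously on the rows of the resulting matrix.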
Fig. 2. SEGA GPU architecture overview.

B. Management Framework

We have developed a software framework for managing the GPU and CPU computations involved in our implementation. The framework consists of six major components. Three components control the GPU hardware, the fourth component is responsible for selecting objects for comparison, the fifth component offers a service to manage thread pools for CPU workloads, and the sixth component provides progress monitoring functionality. The six components communicate via queues, which enable multithreading inside each component and additionally provide a viable way of utilizing multiple GPU devices on a single compute node. Furthermore, this design offers the possibility of repeatedly executing a computation on GPU and CPU hardware. This can easily be realized by states inside a calculation object that contains a set of tasks to handle a group of comparisons. A calculation object contains two important pieces of information: (a) a description of the entities to be compared, and (b) a set of instructions that are to be issued when a comparison is performed.

Figure 2 shows the orchestration of the six components involved in the comparison of protein binding sites using our GPU-enhanced implementation of the SEGA algorithm. Furthermore, it also illustrates the data flow through the framework. The Selector component is the entry point of the framework. It provides both an interconnection to a data store with caching capabilities and the program logic that controls which entities should be compared next. To perform the SEGA comparisons, the Selector combines a set of protein cavity identifiers and loads the corresponding point cloud data. This information is passed via a queue to the DataProcessor. Additionally, the Selector stores meta-information, such as the tasks in progress, in the Monitor component. In our case, no further work of the algorithm depends on the CPU at this point, so the next component belongs to the GPU.

The decision to split the GPU part into three components is mainly due to the design of modern GPU hardware. The latest generation of GPU hardware offers independent control flows for memory reads, memory writes and kernel execution induced by the host system. Therefore, the DataProcessor component, containing an arbitrary number of threads, is responsible for converting (if needed) and transferring data from the host system to the GPU device memory. Moreover, each GPU device is controlled by its own set of GPU components to ensure maximum utilization of the given resources. For SEGA, the point cloud data is copied into OpenCL buffers and transferred to the GPU. At this point, we encountered a possible bottleneck in the management of OpenCL memory objects: handling several thousands of objects dramatically reduced the allocation performance. Thus, we had to introduce an additional component responsible for ensuring a simple and efficient reuse of memory objects.
Additionally, this allows a safer use of GPU device memory, because such a pre-allocation guarantees that the GPU device memory is not exceeded during execution, and it also limits the number of computations currently in progress. After a successful write operation to the GPU, the calculation object containing the meta-information is passed via an additional queue to the Launcher component. The Launcher executes the corresponding GPU kernels, which in the case of SEGA are responsible for creating the polygon data and combining two distance matrices. After completion, the calculation object is pushed into the next queue. The last GPU-related component is the Dispatcher. It is responsible for reading back the results of the kernel execution to the host memory and, if necessary, for processing the data further. Afterwards, the results are pushed to the ThreadService. Here, the alignments of the polygons are calculated, and the results are stored. After successfully finishing a computation, the Monitor component is informed.

The Monitor fulfills two major tasks. First, it creates an interconnection between the Selector and the ThreadService for storing the results. This is necessary to know whether all combinations have been successfully calculated. Additionally, it records the progress of the computation on persistent storage. If a computation is interrupted for unpredictable reasons, such as system failures or disk I/O errors, the computation can be resumed at the correct position. The described framework has been implemented in Java using the JogAmp JOCL library [7] for controlling the OpenCL platform.
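To make the queue-based orchestration more tangible, the following sketch (plain Java, hypothetical class names, not the actual framework code) shows how components such as the DataProcessor, Launcher and Dispatcher can be chained via blocking queues, so that each stage runs its own threads and multiple GPU devices can be served independently.

```java
// Hedged sketch of a queue-connected processing pipeline similar in spirit to
// the described framework. All names are illustrative assumptions.
import java.util.concurrent.BlockingQueue;

// A calculation object: which cavities to compare and the current state.
final class Calculation {
    final String cavityIdA;
    final String cavityIdB;
    Calculation(String cavityIdA, String cavityIdB) {
        this.cavityIdA = cavityIdA;
        this.cavityIdB = cavityIdB;
    }
}

// One pipeline stage: takes calculations from an input queue, processes them,
// and forwards them to the next stage's queue.
abstract class Stage implements Runnable {
    private final BlockingQueue<Calculation> in;
    private final BlockingQueue<Calculation> out;

    Stage(BlockingQueue<Calculation> in, BlockingQueue<Calculation> out) {
        this.in = in;
        this.out = out;
    }

    protected abstract void process(Calculation c) throws Exception;

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                Calculation c = in.take();   // blocks until work is available
                process(c);                  // e.g., transfer data, launch kernels, read back
                if (out != null) {           // the final stage has no successor queue
                    out.put(c);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (Exception e) {
            e.printStackTrace();             // a real framework would report this to the Monitor
        }
    }
}
```

Concrete stages corresponding to the DataProcessor, Launcher and Dispatcher would each override process(), and one such chain would be instantiated per GPU device.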
C. Cloud Deployment

A common approach for parallelizing a computational problem is its division into three steps: work partitioning and distribution, task computation, and result collection. In the case of a commutative comparison where a self-comparison is not necessary, an input set of n elements results in a total number of n · (n − 1)/2 computations. A straightforward approach is to divide the total number of computations by the available number of Cloud nodes. If every comparison is indexed by a single unique identifier, a single node simply needs the identifier to perform a comparison. However, a better approach is to divide the total number of comparisons by an arbitrary number that is larger than the available number of nodes. This allows one to start the result collection phase before the end of the task computation phase and, moreover, enables an on-demand scheduling of tasks to other nodes in case a node fails. The work partitioning and distribution phase also includes the distribution of the input data. For this purpose, several approaches are possible, such as data replication, network exports, and cluster file systems. Fortunately, in our case the required data of the cavity database could be reduced to about 140 MB. Consequently, the data has been transferred to and loaded into the main memory of each node. Due to the overall runtimes, this has a negligible impact on the total computation time. After data and task distribution, the nodes can calculate their part(s). When a task has finished, its results can be collected from the Cloud and stored locally.
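The following sketch (plain Java, hypothetical names, illustrative of the partitioning idea described above and not the actual framework code) maps the n · (n − 1)/2 pair indices to a fixed number of packages that can then be assigned to nodes on demand.

```java
// Hedged sketch of partitioning all pair-wise comparisons into packages.
// Pair (i, j) with i < j is identified by a single linear index, so a node
// only needs the package number to know which comparisons to perform.
final class Partitioning {
    /** Total number of commutative comparisons without self-comparisons. */
    static long totalComparisons(long n) {
        return n * (n - 1) / 2;
    }

    /** Map a linear comparison index back to the pair (i, j), i < j. */
    static long[] pairForIndex(long index, long n) {
        long i = 0;
        long remaining = index;
        while (remaining >= n - 1 - i) {   // skip full "rows" of pairs starting at i
            remaining -= n - 1 - i;
            i++;
        }
        long j = i + 1 + remaining;
        return new long[] { i, j };
    }

    /** First (inclusive) and last (exclusive) comparison index of a package. */
    static long[] packageRange(int packageId, int numPackages, long n) {
        long total = totalComparisons(n);
        long start = packageId * total / numPackages;
        long end = (packageId + 1L) * total / numPackages;
        return new long[] { start, end };
    }

    public static void main(String[] args) {
        long n = 144_849L;                          // number of selected binding sites
        System.out.println(totalComparisons(n));    // 10,490,543,976 comparisons
        long[] range = packageRange(0, 4096, n);    // comparisons handled by package 0 of 4096
        System.out.println(range[0] + " .. " + range[1]);
    }
}
```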
V. EVALUATION

To assess the performance of our approach, several experiments have been conducted. The evaluation is split into two parts. First, the performance gains of SEGA GPU compared to the original SEGA algorithm are investigated. Second, the results of a large scale comparison of protein binding sites on Amazon's EC2 Cluster GPU Instances are presented. The structural data has been taken from the CavBase [16], maintained by the Cambridge Crystallographic Data Centre.

Fig. 3. SEGA benchmarks: (a) GPU part of SEGA GPU; (b) CPU part of SEGA GPU; (c) maximum of the GPU and CPU parts of SEGA GPU; (d) original SEGA. Runtimes in ms as a function of the number of pseudocenters in each of the two compared cavities.
A. SEGA vs. SEGA GPU

The performance of the original SEGA implementation has been measured on a single core of an Intel Core i7-2600 @ 3.40 GHz with 8 GB RAM, whereas the performance of SEGA GPU has been measured on a single NVIDIA GeForce GTX 580 with 3 GB RAM. The runtimes depend on the number of pseudocenters present in the protein cavities, and thus both SEGA versions have been benchmarked using a subset of the CavBase covering a large spectrum of pseudocenter numbers. In particular, the subset consists of cavities whose numbers of pseudocenters range from 15 to 250. For each comparison, cavities matching certain size requirements were selected and compared several times (100 times for SEGA GPU; 10 times for the original SEGA) to calculate the average runtimes for a particular size combination.

The plots in Figure 3 show the runtimes depending on the number of pseudocenters of each cavity. Figure 3(a) shows the average runtime of the GPU part of a SEGA GPU run, and Figure 3(b) shows the runtime of the CPU part of a SEGA GPU run. It is evident that the required CPU runtime is often higher than, but never twice as high as, the GPU runtime. One could argue that typical cluster nodes offering GPU hardware provide at least two physical CPU cores per GPU. Instead, we decided to consider the worst case, shown in Figure 3(c): this plot shows the maximum of the two preceding graphs. Finally, Figure 3(d) shows the runtimes of the original SEGA implementation. Figure 4 shows the SEGA GPU and original SEGA runtimes in a single plot. It is important to note that the z-axis has a logarithmic scale. It is evident that the SEGA GPU implementation is 10 to 200 times (110 times on average) faster than the original SEGA implementation, depending on the number of pseudocenters in each cavity.

Fig. 4. Comparison of the original SEGA and SEGA GPU benchmarks.

Fig. 5. Pseudocenter distribution among the selected subset of the CavBase.
B. SEGA GPU @ Amazon EC2

The main target platform for SEGA GPU is Amazon's EC2 Cluster GPU Instances. Each node (instance type: cg1.4xlarge) has two Intel Xeon X5570 CPUs, 22 GB RAM and two NVIDIA Tesla M2050 GPUs with 2 GB RAM. Benchmarks comparing the Tesla M2050 and the GeForce GTX 580 have shown that the GTX 580 is about two times faster than the Tesla. This matches the theoretical GFLOPS specifications from NVIDIA (single precision floating point). Thus, the GPU runtime measured in the previous section corresponds to a single EC2 node.
The subset of the CavBase used in our experiments has been selected based on the following (pharmaceutically meaningful) criteria: the resolution of a cavity must be larger than 2.5 Å; the volume must be between 350 Å³ and 3500 Å³; and a protein must have at least 11 pseudocenters. This resulted in n = 144,849 protein binding sites, leading to n · (n − 1)/2 = 10,490,543,976 comparisons in total.

Fig. 6. Boxplots showing the randomly sampled runtime distributions: (a) runtime distribution for SEGA GPU; (b) runtime distribution for SEGA GPU (OpenCL) compared to the original SEGA implementation (pure Java).
Using Amazon's EC2 resources with their associated costs makes it important to predict the expected total runtime of a computation, especially if a hard limit on the financial budget must be respected. According to Figure 5, the number of pseudocenters of the proteins in the selected subset of the CavBase is not uniformly distributed. Thus, to predict the total runtime, the runtimes of randomly sampled pairs from the CavBase were visualized with boxplots. The blue box enclosed by the lower and upper quartile contains the middle 50% of the data. The distance between the upper and lower quartile defines the interquartile range (IQR), a measure of the variance of the data. The (lower and upper) whiskers visualize the remaining data that is not contained in the box defined by the lower and upper quartile; their length is bounded by 1.5 · IQR. Data outside the whiskers are outliers and marked by a cross. The 50th percentile (median) is visualized by a red line, the confidence interval (α = 0.05) for the mean by a triangle. Figure 6(a) shows the boxplot for the SEGA GPU implementation. Figure 6(b) shows a comparison between the original SEGA implementation and SEGA GPU to exemplify the performance gain. Based on the boxplot, a runtime of 1.7 ms per comparison was expected.

To efficiently use the infrastructure provided by Amazon EC2, the entire computation was divided to run on 8 Amazon EC2 Cluster GPU Instances in parallel. The comparisons were grouped into 4096 packages, which were distributed by assigning 512 packages to each node. Given the runtime of a single comparison and a total number of about 10.5 billion comparisons, a runtime of about 24 days on eight EC2 nodes was expected. In reality, the computation took about 22 days to complete. The cost was about 6,700 US-$ (10,490,543,976 comparisons · 1.7 ms / 3,600,000 ms/h · 1.234 US-$/h = 6,113 US-$ for computations, the rest for storage and network traffic). In contrast, performing the 10.5 billion comparisons on a single core of an Intel Core i7-2600 @ 3.40 GHz, with about 300 ms runtime per comparison (see Figure 6(b)), would require about 36,425 days (∼100 years); on a quad-core node with the same specifications, about 9,106 days (∼25 years) would be required. If an Amazon High Quad CPU Instance with a cost of 0.40 US-$ per hour were used, the total cost would amount to about 87,421 US-$.
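As a quick sanity check of the cost estimate, the following sketch (plain Java, illustrative only) reproduces the arithmetic from the figures given above.

```java
// Reproduces the back-of-the-envelope cost estimate from the text.
public final class CostEstimate {
    public static void main(String[] args) {
        long comparisons = 10_490_543_976L;   // total number of pair-wise comparisons
        double msPerComparison = 1.7;         // expected GPU runtime per comparison
        double usdPerHour = 1.234;            // GPU instance price per hour used in the text

        double totalHours = comparisons * msPerComparison / 3_600_000.0;
        double computeCost = totalHours * usdPerHour;

        System.out.printf("GPU hours: %.0f%n", totalHours);          // about 4,954 hours
        System.out.printf("Compute cost: %.0f US-$%n", computeCost); // about 6,113 US-$
    }
}
```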
VI. CONCLUSIONS

In this paper, we have presented a novel approach to significantly speed up the computation times of the SEGA algorithm for a structural comparison of protein binding sites by using the digital ecosystem of a GPU-based Cloud computing infrastructure. The original CPU-based Java version of SEGA has been rewritten in OpenCL to run on NVIDIA GPUs in parallel on a set of Amazon EC2 Cluster GPU Instances. This new implementation of SEGA has been tested on a subset of the protein structure data of the CavBase, requiring an acceptable computation time of about three weeks. Thus, a structural approach to comparing protein binding sites becomes a viable alternative to sequence-based alignment algorithms.

There are several directions for future work. For example, a comparative analysis could be performed for the entire protein space in the CavBase, which would not only allow a classification of the protein space into structurally and functionally similar, homologous and non-homologous protein groups, but would also support the systematic search for unexpected similarities and functional relationships. Furthermore, other algorithms for a structural comparison of protein binding sites could be rewritten to run on GPU hardware to provide further insights.
ACKNOWLEDGEMENTS

This work is partially supported within the LOEWE program of the State of Hesse, Germany, by the German Research Foundation (DFG), and by a research grant provided by Amazon Web Services (AWS) in Education.

REFERENCES

[1] S. F. Altschul. BLAST Algorithm. John Wiley & Sons, Ltd, 2001.
[2] P. J. Artymiuk, A. R. Poirrette, H. M. Grindley, D. W. Rice, and P. Willett. A Graph-theoretic Approach to the Identification of Three-dimensional Patterns of Amino Acid Side-chains in Protein Structures. Journal of Molecular Biology, 243(2):327–344, 1994.
[3] T. Binkowski and A. Joachimiak. Protein functional surfaces: global shape matching and local spatial alignments of ligand binding sites. BMC Structural Biology, 8(1):45–68, 2008.
[4] T. Fober, G. Glinca, G. Klebe, and E. Hüllermeier. Superposition and Alignment of Labeled Point Clouds. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(6):1653–1666, 2011.
[5] T. Fober and E. Hüllermeier. Similarity Measures for Protein Structures Based on Fuzzy Histogram Comparison. Computational Intelligence, pages 18–23, 2010.
[6] M. Jambon, A. Imberty, G. Deléage, and C. Geourjon. A new bioinformatic approach to detect common 3D sites in protein structures. Proteins, 52(2):137–145, 2003.
[7] JogAmp Community. JogAmp JOCL. http://jogamp.org/jocl/www/, 2012.
[8] M. A. Kentie. Biological Sequence Alignment on Graphics Processing Units. Master's thesis, Delft University of Technology, 2010.
[9] K. Kinoshita and H. Nakamura. Identification of protein biochemical functions by similarity search using the molecular surface database eF-site. Protein Science, 12(8):1589–1595, 2003.
[10] H. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics, 52(1):7–21, 2005.
[11] D. Lee, O. Redfern, and C. Orengo. Predicting protein function from sequence and structure. Nature Reviews Molecular Cell Biology, 8(12):995–1005, 2007.
[12] W. Liu, B. Schmidt, and W. Müller-Wittig. CUDA-BLASTP: Accelerating BLASTP on CUDA-enabled graphics hardware. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(6):1678–1684, Nov. 2011.
[13] Y. Liu, W. Huang, J. Johnson, and S. Vaidya. GPU Accelerated Smith-Waterman. In Proceedings of the 6th International Conference on Computational Science – Volume Part IV, ICCS'06, pages 188–195, Berlin, Heidelberg, 2006. Springer-Verlag.
[14] Y. Liu, D. Maskell, and B. Schmidt. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Research Notes, 2(1):73, 2009.
[15] M. Mernberger, G. Klebe, and E. Hüllermeier. SEGA: Semi-global graph alignment for structure-based protein comparison. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8(5):1330–1343, 2011.
[16] S. Schmitt, D. Kuhn, and G. Klebe. A New Method to Detect Related Function Among Proteins Independent of Sequence and Fold Homology. Journal of Molecular Biology, 323(2):387–406, 2002.
[17] A. Stark and R. Russell. Annotation in three dimensions. PINTS: Patterns in Non-homologous Tertiary Structures. Nucleic Acids Research, 31(13):3341–3344, 2003.
[18] J. M. Thornton. From genome to function. Science, 292(5524):2095–2097, 2001.
[19] A. Todd, C. Orengo, and J. Thornton. Evolution of function in protein superfamilies, from a structural perspective. Journal of Molecular Biology, 307(4):1113–1143, 2001.
[20] J. R. Ullmann. An Algorithm for Subgraph Isomorphism. Journal of the ACM, 23(1):31–42, 1976.
[21] P. D. Vouzis and N. V. Sahinidis. GPU-BLAST: using graphics processors to accelerate protein sequence alignment. Bioinformatics, 27(2):182–188, 2011.
[22] L. Xie and P. E. Bourne. Detecting evolutionary relationships across existing fold space, using sequence order-independent profile–profile alignments. Proceedings of the National Academy of Sciences of the United States of America, 105(14):5441–5446, 2008.