Design of Algorithms for Bioinformatics

Kayhan Erciyeş
İzmir University, Computer Engineering Department
Gursel Aksel Bulv. No. 14, Uckuyular, Izmir 35350, Turkey
[email protected]
Abstract
We provide a detailed review of basic algorithmic techniques as applied to bioinformatics problems. Dynamic programming and graph algorithms are of particular concern due to their wide range of applications in bioinformatics. Some bioinformatics problems have no known solutions in polynomial time and are called NP-Complete. For these problems, approximation algorithms may be used. We show several examples where approximation algorithms provide sub-optimal solutions to such problems. Finally, we provide sample results of our ongoing work on building phylogenetic trees for Y-haplogroup data.
1. Introduction
Recent studies in bioinformatics necessitate cooperation between various disciplines such as biology, mathematics, computer science, and statistics. The problems to be solved in bioinformatics require immense computational power due to the huge size of biological data. The design of efficient algorithms is therefore key to providing solutions to these problems. There are various algorithmic methods, such as greedy algorithms, divide and conquer, and dynamic programming, which provide efficient solutions to some bioinformatics problems. However, many problems have no known solutions in polynomial time, and approximation algorithms that find sub-optimal solutions are usually the only choice. The aim of this study is first to provide a detailed survey of algorithmic methods as applied to bioinformatics problems. We then provide results of our recent work [1] on building phylogenetic trees from Y-haplogroup data. The rest of the paper is organized as follows. Section 2 provides a detailed algorithm background, Section 3 describes our work on building phylogenetic trees, and finally conclusions are outlined in Section 4.
2. Algorithms for Bioinformatics
An algorithm is basically a finite set of precise instructions for performing a computation or for solving a problem. The name is derived from the Persian mathematician Al-Khowarizmi, who was the first to formalize the rules for the four basic arithmetic operations. When one is given the task of solving a problem of any kind, intuition, experience, awareness of the current state of global knowledge about that particular problem, talent and sometimes luck all play a role. There are several basic requirements for a good algorithm. Firstly, it should work correctly for all valid inputs. It should also be efficient, that is, it should have low time complexity with respect to the input size. An algorithm should be scalable, meaning it should retain its efficiency as the input size grows. The limitations of an algorithm, if any, should also be clearly stated. For many non-trivial problems, there is a natural brute-force search algorithm that checks every possible solution, which takes ~2^N time or worse for inputs of size N; this is unacceptable in practice. The desirable scaling property is that there exist constants c > 0 and d > 0 such that the running time is bounded by c·N^d steps.
In this section, we briefly review basic algorithm paradigms, namely greedy algorithms, the divide and conquer method, dynamic programming and graph algorithms, as applied to bioinformatics problems. We then show that some problems are intractable, known as NP-Complete problems, and describe the use of approximation algorithms which give sub-optimal solutions to these problems.

2.1. Greedy Algorithms
Greedy algorithms build up a solution by choosing the next piece that offers the most obvious and immediate benefit. There is usually a greedy criterion that is optimized at each step of the algorithm [2]. Greedy algorithms are not suitable for many computational tasks but provide optimum results for problems like the minimum spanning tree computation of a graph.

2.1.1. Cabbage and Turnip Example [3]
Although cabbages and turnips share a recent common ancestor, they look and taste different. When we compare the gene sequences in their mtDNA, the sequences are about 99% identical and carry little evolutionary information in themselves. However, the gene sequences differ in gene order, and evolution is manifested as the divergence of gene order. The key problem to be addressed is revealing the evolutionary scenario for transforming one genome into another. Our goal is, given two permutations, to find the shortest series of reversals that transforms one into the other. In algorithmic notation:
• Input: Permutations p and q
• Output: A series of reversals r1, ..., rt transforming p into q, such that t is minimum
For example, given the gene sequence p = 1 2 3 4 5 6 7 8 of a genome, we need to find the minimum number of reversals required to obtain the gene sequence q = 1 2 5 4 6 3 7 8 of another genome. By inspection, we can see that the first reversal can be on 3 4 5, yielding 1 2 5 4 3 6 7 8, and a second reversal on 3 6 will give the required sequence q. This number gives us an idea of the relatedness of two genomes. Greedy methods which employ reversal sort, as sketched below, may be used to obtain the target genome sequence.
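To make the greedy reversal idea concrete, the following Python sketch transforms p into q by relabeling q as the identity permutation and then repeatedly reversing the segment that brings the next required element into place. This is our own minimal illustration, not the algorithm of [3]; greedy reversal sort does not guarantee the minimum number of reversals in general.

def greedy_reversal_sort(p, q):
    # Relabel genes so that the target permutation q becomes 0, 1, ..., n-1.
    rank = {gene: i for i, gene in enumerate(q)}
    p = [rank[gene] for gene in p]
    reversals = []
    for i in range(len(p)):
        if p[i] != i:
            j = p.index(i)                     # element that belongs at position i
            p[i:j + 1] = reversed(p[i:j + 1])  # greedy step: one reversal
            reversals.append((i, j))
    return reversals

# Example from the text; the sketch reports the same two reversals
# found by inspection (0-based segments 2..4 and 4..5).
print(greedy_reversal_sort([1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 5, 4, 6, 3, 7, 8]))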
2.2. Divide and Conquer Method
This method, in general, has three phases. We first break a problem into subproblems that are themselves smaller instances of the same type of problem. We then recursively solve these subproblems and appropriately combine their answers [2]. One classical example of this method is the mergesort algorithm. Formally, we are given an array of numbers and asked to sort them in descending or ascending order. The mergesort algorithm divides the array into two halves, recursively sorts each half and merges the two halves into a sorted whole. It has O(n log n) time complexity.

2.3. Graph Algorithms
Many problems can be expressed clearly and precisely with graphs. Formally, a graph G(V,E) is specified by a set of vertices V and by edges E between pairs of vertices. Viewing and analyzing vast amounts of biological data as a whole is very difficult, and it is much easier to interpret the data if they are partitioned into clusters joining similar data points.

2.3.1. A Simple Clustering Algorithm
A clique is a graph in which every vertex is connected to all other vertices. Assume we have the distances between genes, provided by some means, as a distance matrix D of n x n elements, where D(i,j) is the distance between genes i and j. We need to form clusters of this data. One possible algorithm is shown in Fig. 1, where our aim is to form clusters of similar genome segments.
1. Work out the distance matrix D.
2. Choose a distance threshold μ.
3. If the distance between two vertices is below μ, draw an edge between them.
4. The resulting graph may contain cliques. These cliques represent clusters of closely located data points.
Figure 1. Clustering Algorithm for Gene Data
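A minimal Python sketch of the Fig. 1 procedure is given below. The threshold-graph construction is direct; for step 4 we enumerate maximal cliques with the Bron-Kerbosch recursion, which is our choice of clique routine (any other would do) and is exponential in the worst case.

from itertools import combinations

def threshold_graph(D, mu):
    # Fig. 1, steps 1-3: connect genes i, j whenever D[i][j] < mu.
    n = len(D)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if D[i][j] < mu:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def maximal_cliques(adj):
    # Fig. 1, step 4: enumerate maximal cliques (Bron-Kerbosch).
    cliques = []
    def expand(R, P, X):
        if not P and not X:
            cliques.append(R)
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P.remove(v)
            X.add(v)
    expand(set(), set(adj), set())
    return cliques

# Toy distance matrix for 4 genes; genes 0-2 are mutually close.
D = [[0, 1, 2, 9],
     [1, 0, 1, 9],
     [2, 1, 0, 9],
     [9, 9, 9, 0]]
print(maximal_cliques(threshold_graph(D, mu=3)))   # [{0, 1, 2}, {3}]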
2.3.2. Phylogenetic Trees
Evolutionary trees are graph structures built from DNA sequences where the leaves of the tree represent existing species, nodes between the root and the leaves represent ancestors, and the root is the oldest ancestor. Edges usually have weights showing the number of mutations from one specimen to another, which gives an estimate of their separation in time. Given n species, one can compute the distance matrix D of n x n elements, where D(i,j) is the genetic distance between two species. The evolution of these species is described by a tree that we do not know beforehand. Therefore, an algorithm is required that builds a tree which provides distances between the species as close as possible to the distances in D.

Figure 2. A Phylogenetic Tree Example

The neighbor-joining (NJ) algorithm, proposed in 1987 by Saitou and Nei, is an iterative, greedy algorithm that builds a phylogenetic tree for the genes under consideration [4]. It assumes the distance matrix D is available. Each iteration consists of the following steps:
1. Based on the current distance matrix, calculate the matrix Q of new distances.
2. Find the pair of taxa in Q with the lowest value. Create a node on the tree that joins these two taxa (i.e., join the closest neighbors, as the algorithm name implies).
3. Calculate the distance of each of the taxa in the pair to this new node.
4. Calculate the distance of all taxa outside of this pair to the new node.
5. Start the algorithm again, considering the pair of joined neighbors as a single taxon and using the distances calculated in the previous step.
This algorithm usually finds a tree that is quite close to the original phylogenetic tree but may not find the true tree, as it is an approximation algorithm. However, the NJ algorithm has polynomial time complexity, and although other methods such as maximum parsimony and maximum likelihood offer more accurate construction of trees, NJ may be preferred for large data sets due to its low time complexity.
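A compact Python sketch of one full NJ run, following the five steps above, is shown below. The dict-of-dicts distance input, the invented internal node names (node0, node1, ...) and the edge-list output are our illustrative conventions, not part of [4].

def neighbor_joining(D, names):
    D = {a: dict(D[a]) for a in names}          # work on a copy
    nodes = list(names)
    edges, fresh = [], 0
    while len(nodes) > 2:
        r = len(nodes)
        total = {a: sum(D[a][b] for b in nodes if b != a) for a in nodes}
        # Steps 1-2: minimize Q(i,j) = (r-2)*d(i,j) - total(i) - total(j).
        i, j = min(((a, b) for a in nodes for b in nodes if a < b),
                   key=lambda p: (r - 2) * D[p[0]][p[1]] - total[p[0]] - total[p[1]])
        u = f"node{fresh}"; fresh += 1
        # Step 3: branch lengths from i and j to the new node u.
        li = D[i][j] / 2 + (total[i] - total[j]) / (2 * (r - 2))
        edges += [(i, u, li), (j, u, D[i][j] - li)]
        # Step 4: distance from every other taxon k to u.
        D[u] = {k: (D[i][k] + D[j][k] - D[i][j]) / 2
                for k in nodes if k not in (i, j)}
        for k in D[u]:
            D[k][u] = D[u][k]
        # Step 5: replace the joined pair by the new node.
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, D[a][b]))               # join the last two nodes
    return edges

# Additive toy distances; NJ recovers branch lengths 2, 3, 4, 2
# and an internal edge of length 3.
D = {"A": {"B": 5, "C": 9, "D": 7},
     "B": {"A": 5, "C": 10, "D": 8},
     "C": {"A": 9, "B": 10, "D": 6},
     "D": {"A": 7, "B": 8, "C": 6}}
print(neighbor_joining(D, ["A", "B", "C", "D"]))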
2.4. Dynamic Programming
Dynamic programming is a very powerful algorithmic paradigm in which a problem is solved by identifying a collection of subproblems and tackling them one by one, starting from the smallest, using the answers to small problems to help figure out larger ones, until the whole problem is solved. When solving a problem by dynamic programming, the most important step is to identify the subproblems. Once these are found, it is an easy matter to iteratively solve the subproblems in order of increasing size [2].

2.4.1. Gene Finding
A gene is a sequence of nucleotides which codes for a protein. Gene prediction is determining the beginning and end positions of a gene in a genome. Every triplet of nucleotides, called a codon, codes for exactly one amino acid in a protein. In the human genome, only about 3% of the DNA sequence consists of genes, which shows that there is an abundance of DNA that seems to be of no particular use. A newly sequenced gene may be similar to a known gene. Finding sequence similarities with genes of known function is a common approach to understanding a newly sequenced gene's function. In 1984, Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene [3]. A normal growth gene switched on at the wrong time causes cancer. Computing a similarity score between two genes indicates how likely they are to have similar functions. Dynamic programming may be used to find the similarities between genes.

2.4.2. Longest Common Subsequence Problem
Given two sequences x = x1 x2 ... xm and y = y1 y2 ... yn, the Longest Common Subsequence (LCS) of x and y is a sequence of positions in x, 1 ≤ i1 < i2 < ... < it ≤ m, and a sequence of positions in y, 1 ≤ j1 < j2 < ... < jt ≤ n, such that the letter of x at position ik equals the letter of y at position jk, and t is maximal. An LCS is basically an alignment without mismatches. Fig. 3 shows an example where the LCS of two genome segments is shown with 4 insertions and deletions.

GT–TGAG–A
–TGTG–GCA

Figure 3. LCS Example

The LCS problem can be solved like the well-known dynamic programming example of the Manhattan Tourist Problem (MTP) [2]. In MTP, we are given a two-dimensional weighted grid which represents the streets and are allowed to travel only eastward and southward. We are asked to find the longest path from a given source node to the sink. A greedy algorithm that always chooses the heaviest edge from the current vertex may find a path that is quite different from the optimal one. The dynamic programming solution fills in the vertices of the matrix as:

s_{i,j} = max { s_{i-1,j-1} + 1 if v_i = w_j ; s_{i-1,j} ; s_{i,j-1} }     (1)

The first row of the grid is labeled with the source genome segment and the first column with the destination. A dynamic programming algorithm based on this recurrence provides the solution in time proportional to the number of grid vertices, i.e., O(mn).
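The recurrence in Eq. (1) translates directly into the following Python sketch, which fills the table row by row and then traces back to recover one longest common subsequence (ties may be broken differently, so other equally long answers are possible):

def lcs(x, y):
    m, n = len(x), len(y)
    # s[i][j] = LCS length of the prefixes x[:i] and y[:j]; O(mn) time.
    s = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                s[i][j] = s[i - 1][j - 1] + 1            # match: extend diagonal
            else:
                s[i][j] = max(s[i - 1][j], s[i][j - 1])  # skip a letter
    # Trace back to recover one LCS string.
    out, i, j = [], m, n
    while i and j:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1]); i -= 1; j -= 1
        elif s[i - 1][j] >= s[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

# The two sequences of Fig. 3; prints one LCS of length 5, e.g. "TTGGA".
print(lcs("GTTGAGA", "TGTGGCA"))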
2.4.3. Suffix Trees
The suffix tree of a string S is a tree whose edges are labeled with strings, such that each suffix of S corresponds to exactly one path from the tree's root to a leaf. It is thus a radix tree for the suffixes of S. Fig. 4 depicts such a suffix tree for the word “ananas”. The leaves of the tree are labeled with the starting positions of the corresponding suffixes. For genomes, each path from the root corresponds to a substring of the genome.
Figure 4. Suffix Tree for the word “ananas”
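As a toy illustration of the idea behind Fig. 4, the sketch below builds an uncompressed suffix trie (nested dictionaries) rather than a true edge-compressed suffix tree, and does so naively in quadratic time; linear-time constructions such as Ukkonen's algorithm exist but are considerably more involved.

def suffix_trie(s):
    s += "$"                       # unique terminator so no suffix is a prefix
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:       # insert one suffix, character by character
            node = node.setdefault(ch, {})
        node["index"] = start      # label the leaf with the suffix position
    return root

def find(root, pattern):
    # True if pattern occurs in the indexed string: walk down from the root.
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

t = suffix_trie("ananas")
print(find(t, "nas"), find(t, "anas"), find(t, "sna"))   # True True False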
Constructing such a tree for a string S takes time and space linear in the length of S. Once constructed, operations such as locating a substring or locating matches for a regular expression pattern can be performed quickly. Suffix trees also provided one of the first linear-time solutions for the longest common substring problem. However, storing a string's suffix tree typically requires significantly more space than storing the string itself.

2.4.4. Pattern Matching
Genomic repeats are tandem repeats of DNA segments. It is highly desirable to detect such repeats in DNA for a number of reasons; for example, repeats carry hereditary information, and many tumors exhibit repeat structure. An example is shown in Fig. 5a, where the repeat pattern is “TTGA”. The repeat structure may not be easy to detect, as in Fig. 5b, where a G has mutated to T and a T has mutated to A.

(a) ATATTGATATATTGATAGTATTGA
(b) ATATAGATATATTGATAGTATTTA

Figure 5. A Repeat Pattern in a DNA Segment

We may need to find the locations (start indices in the DNA) of these repeats. If we are required to find long repeats, we can first find short l-mer repeats and then extend them to longer repeats. For example, given the DNA sequence GCGTAGGCTTACTCAGTCTTGTAGGCTTA, the 4-mer GTAG repeats at locations 3 and 21. We can extend this 4-mer repeat to the maximal repeat GTAGGCTT. A simple way to find l-mer repeats is hashing, where a hash-table index is generated for each l-mer and, at each index of the table, the genome start locations of the l-mers that produced that index are stored. Finally, l-mer repeats can be extended to maximal repeats.
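A minimal Python sketch of this hashing idea follows; a dictionary serves as the hash table mapping each l-mer to its list of start positions (0-based here, whereas the text counts from 1):

from collections import defaultdict

def lmer_locations(genome, l):
    # Hash every l-mer to the list of its start positions.
    table = defaultdict(list)
    for i in range(len(genome) - l + 1):
        table[genome[i:i + l]].append(i)
    return table

def repeats(genome, l):
    # Keep only l-mers occurring more than once, as in the GTAG example.
    return {lmer: pos for lmer, pos in lmer_locations(genome, l).items()
            if len(pos) > 1}

g = "GCGTAGGCTTACTCAGTCTTGTAGGCTTA"
print(repeats(g, 4))   # includes 'GTAG' at 0-based positions 2 and 20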
If, instead of finding the maximal repeats in a genome, we are required to find the locations of occurrence of a given pattern in the genome, the problem is called the pattern matching problem. Formally, we are given an input pattern s = s1, s2, ..., sn and a genome g = g1, g2, ..., gm, and we are asked to find all positions 1 ≤ i ≤ (m – n + 1) such that the n-letter substring of g starting at i matches s. We can use suffix trees for this purpose.

2.5. NP-Completeness
We would like to classify problems by whether they can be solved in polynomial time or not. Many fundamental problems are difficult to classify, let alone the difficulty of finding solutions for them. However, these fundamental problems are computationally equivalent and appear to be manifestations of one really hard problem. Informally, for a problem to be considered NP-Complete, it should have a polynomial-time certificate which verifies a solution of an instance of the problem and, secondly, another NP-Complete problem should be reducible to it. The first problem shown to be NP-Complete is the CIRCUIT-SAT problem, which is stated as follows: given a combinational circuit built out of AND, OR, and NOT gates, is there a way to set the circuit inputs so that the output is 1? Another example is the vertex cover problem: given a graph G = (V, E), find a minimum subset of vertices S ⊆ V such that each edge has at least one of its endpoints in S. The decision version of this problem asks, given an integer k, whether there is a subset of vertices S with |S| ≤ k such that each edge has an endpoint in S. Vertex cover is NP-Complete, but there are various approximation algorithms to solve it.

2.6. Approximation Algorithms
When we need a solution to an NP-Complete problem, we may be able to find a workable sub-optimal algorithm if we sacrifice the optimal solution. A ρ-approximation algorithm runs in polynomial time and guarantees to find a solution within a ratio ρ of the optimum. The real challenge here is proving that a solution value is close to the optimum without knowing what the optimum is. The approximation ratio ρ of an algorithm A on input p is A(p) / OPT(p), where A(p) is the solution produced by algorithm A and OPT(p) is the optimal solution of the problem. As an example, an approximation algorithm for vertex cover is as follows. Consider an arbitrary edge (u, v) in the graph. One of its two vertices must be in the cover, but we do not know which one; therefore, we simply put both vertices into the vertex cover. We then remove all edges that are incident to u and v, since they are now all covered, and recurse on the remaining edges until all edges are covered. For every vertex that must be in the cover, we put at most two into our cover, so the cover we generate is at most twice the size of the optimum cover.
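The following Python sketch implements this 2-approximation; iterating over the edge list and skipping already-covered edges is equivalent to the removal-and-recursion phrasing above:

def vertex_cover_2approx(edges):
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:   # edge still uncovered
            cover.add(u)                        # take both endpoints
            cover.add(v)
    return cover

# On the path 1-2-3-4 the optimum cover is {2, 3}; the heuristic
# returns {1, 2, 3, 4}, meeting the factor-2 bound exactly.
print(vertex_cover_2approx([(1, 2), (2, 3), (3, 4)]))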
2.6.1. DNA Sequencing
In DNA sequencing, DNA is partitioned into millions of small fragments and about 500-700 nucleotides are read from each fragment, to be assembled into a single genomic sequence called the superstring. This is equivalent to the shortest superstring problem: formally, we are given a set of strings s1, s2, ..., sn and asked to find a single shortest superstring that contains all of them. This problem is NP-Complete, and therefore approximation algorithms may be used. Close inspection shows that this problem can be modelled as the well-known Travelling Salesman Problem (TSP) on graphs, where a salesman is required to travel to a number of cities, visiting each city exactly once, such that the path traversed is minimal and he returns to the start location. A simple approximation algorithm for (metric) TSP is to find the minimum spanning tree of the graph and perform a pre-order traversal of this tree. This procedure yields an approximation factor of 2.
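A sketch of this MST-based 2-approximation is given below, using Prim's algorithm for the spanning tree and a depth-first preorder for the tour; the factor-2 guarantee requires the distances to satisfy the triangle inequality.

import heapq

def tsp_2approx(dist):
    n = len(dist)
    # Prim's algorithm from vertex 0.
    tree = {i: [] for i in range(n)}
    seen = {0}
    heap = [(dist[0][j], 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    while len(seen) < n:
        w, u, v = heapq.heappop(heap)
        if v in seen:
            continue
        seen.add(v)
        tree[u].append(v)                  # MST edge u-v
        for j in range(n):
            if j not in seen:
                heapq.heappush(heap, (dist[v][j], v, j))
    # Preorder walk of the MST gives the tour.
    tour, stack = [], [0]
    while stack:
        u = stack.pop()
        tour.append(u)
        stack.extend(reversed(tree[u]))
    return tour + [0]                      # return to the start city

# L1 distances between the corners of a unit square (a metric).
dist = [[0, 1, 2, 1],
        [1, 0, 1, 2],
        [2, 1, 0, 1],
        [1, 2, 1, 0]]
print(tsp_2approx(dist))                   # [0, 1, 2, 3, 0], cost 4 (optimal here)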
2.6.2. Center Selection for Clustering
Given a set of n points in a plane, we are required to select k centers so that the maximum distance from a point to its closest center is minimized. The solution to this problem yields the cluster centers around which clusters can be formed. The greedy approach is to repeatedly choose the next center to be the point farthest away from all existing centers. Fig. 6 displays the greedy approximation algorithm for center selection.

Greedy_Center_Find(k, n, s1, s2, ..., sn) {
  C ← ∅;
  repeat k times {
    Select point si with max(dist(si, C));
    Add si to C;
  }
  return C;
}

Figure 6. Greedy Center Selection Approximation Algorithm

This algorithm provides an approximation ratio of 2, and no algorithm is known to achieve a better ratio.
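A direct Python transcription of Fig. 6 follows; the first center is chosen arbitrarily (here, the first point), and dist is any metric supplied by the caller:

def greedy_centers(points, k, dist):
    centers = [points[0]]                    # arbitrary first center
    while len(centers) < k:
        # Pick the point farthest from its closest existing center.
        farthest = max(points,
                       key=lambda p: min(dist(p, c) for c in centers))
        centers.append(farthest)
    return centers

# Example on the plane with Euclidean distance.
euclid = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
pts = [(0, 0), (1, 0), (10, 0), (11, 1), (5, 9)]
print(greedy_centers(pts, 3, euclid))        # [(0, 0), (11, 1), (5, 9)]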
2.6.3. Fuzzy Clustering
Clustering, in fact, deals with partitioning a data set into homogeneous groups with respect to a proper similarity measure, in such a way that patterns in the same cluster are as similar as possible to each other and as dissimilar as possible to those in different clusters. In hard cluster analysis, the boundaries between clusters are crisp, so that a pattern is assigned to exactly one cluster. In practice, the data are usually not well distributed, so the boundaries may not be precisely defined; that is, a data point could belong to more than one cluster with different degrees of membership. Therefore, the partitions of a data set can be either hard, as in the k-means clustering algorithms, or fuzzy, as in the fuzzy c-means (FCM) algorithm.

Fuzzy c-Means (FCM) Algorithm
The FCM algorithm partitions a collection of n vectors X = {x1, x2, ..., xn} ⊂ R^p into c fuzzy groups such that the weighted within-group sum of squared errors is minimized [5]. The objective function and constraints for FCM are defined as

J_m(u, v) = Σ_{i=1}^{c} Σ_{j=1}^{n} u_{ij}^m d^2(v_i, x_j) → min     (2)

subject to Σ_{i=1}^{c} u_{ij} = 1, u_{ij} ∈ [0,1], and 0 < Σ_{j=1}^{n} u_{ij} < n.

In Eq. (2), u_{ij} is the membership of the j-th data point in the i-th cluster, v_i is the i-th cluster center, and d(v_i, x_j) is the distance between v_i and x_j:

d^2(v_i, x_j) = (v_i − x_j)^T A (v_i − x_j).     (3)

The matrix A is chosen depending on the geometric and statistical properties of the data. The parameter m in Eq. (2) is the fuzzy exponent controlling the amount of fuzziness in the classification process, with m ∈ (1, ∞). In general, the larger m is, the fuzzier the membership assignments of the clustering are. By differentiating J_m with respect to v_i (for fixed u_{ij}, i = 1,...,c; j = 1,...,n) and with respect to u_{ij} (for fixed v_i, i = 1,...,c), the necessary conditions for J_m to reach its minimum are

v_i = Σ_{j=1}^{n} u_{ij}^m x_j / Σ_{j=1}^{n} u_{ij}^m ,  i = 1,...,c,     (4)

and

u_{ij} = 1 / Σ_{l=1}^{c} ( ||x_j − v_i|| / ||x_j − v_l|| )^{2/(m−1)} ,  i = 1,...,c; j = 1,...,n.     (5)
The pseudocode of the FCM algorithm is given in Fig. 7.
1. Fix the number of clusters c and the fuzziness exponent m. Choose initial cluster centers v^0 = (v_1^0, ..., v_c^0) arbitrarily and a termination criterion ε > 0. Set t = 1.
2. Compute all memberships u^t = [u_{ij}^t], i = 1,...,c; j = 1,...,n, using Eq. (5).
3. Compute the new cluster centers v_i^t using Eq. (4).
4. Compute ε_t = ||v^t − v^{t−1}||^2. If ε_t < ε stop, else set t = t + 1 and go to Step 2.
Figure 7. FCM Algorithm
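A minimal NumPy sketch of Fig. 7 is given below; it fixes A to the identity in Eq. (3) (Euclidean distance) and initializes the centers from randomly chosen data points, both of which are our simplifying choices:

import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    v = X[rng.choice(n, size=c, replace=False)]    # initial centers
    for _ in range(max_iter):
        # Eq. (5): memberships from distances to the current centers.
        d = np.linalg.norm(X[None, :, :] - v[:, None, :], axis=2)   # (c, n)
        d = np.fmax(d, 1e-12)                      # guard against d = 0
        u = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2 / (m - 1))).sum(axis=1)
        # Eq. (4): centers as membership-weighted means.
        w = u ** m
        v_new = (w @ X) / w.sum(axis=1, keepdims=True)
        if np.linalg.norm(v_new - v) < eps:        # Step 4: termination test
            return v_new, u
        v = v_new
    return v, u

# Two well-separated Gaussian blobs; the centers land near (0,0) and (5,5).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
centers, u = fcm(X, c=2)
print(np.round(centers, 1))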
Cluster validation is an important issue in cluster analysis since the correct structure of a data set is unknown. Once a partition is obtained by a clustering method, a validity function can help one to validate whether it accurately represents the data structure or not. Hence, detecting cluster validity is a basic problem of cluster analysis. Cluster validity indexes are defined for identifying the optimal cluster number; it is impossible to detect the real structure of the data if a mistake is made in determining the number of clusters.

Fuzzy-Neighborhood DBSCAN (FN-DBSCAN)
Although the FCM algorithm has been used in many different disciplines, it has some constraints and disadvantages in terms of application. For instance, the number of clusters must be given a priori, and different choices of initial cluster centers may lead to different local optima. Moreover, the conventional FCM algorithm can identify only spherical-shaped clusters, and the elements of a data set are distributed to clusters in approximately equal numbers. Density/neighborhood-based clustering methods have been developed to discover clusters with irregular shapes. Such methods typically regard clusters as dense regions of objects in the data space that are separated by regions of low density (representing noise). In addition, in this kind of clustering, there is no need to predict the number of clusters. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) was the first clustering algorithm to use the neighborhood concept [6]. The DBSCAN algorithm grows regions with sufficiently high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise. It defines a cluster as a maximal set of density-connected points.
The DBSCAN algorithm has a low time complexity (O(n log n)). However, the values of its parameters Eps and MinPts, which represent the neighborhood radius and the minimum number of neighbors within that radius, respectively, vary for each data set depending on its scale. Although the algorithm runs fast for any value of the parameters, the process of adjusting the optimal parameter values for a certain data set can be difficult. In terms of adjusting its parameters, a more robust algorithm, NRFJP (Noise-Robust Fuzzy Joint Points), was proposed (Nasibov & Ulutagay, 2007). The NRFJP algorithm makes use of the advantages of fuzzy set theory in calculating the neighborhood relations. However, the worst-case time complexity of the NRFJP algorithm is O(n^4). The FN-DBSCAN algorithm combines the speed of DBSCAN with the robustness of NRFJP [7]. The main advantage of integrating these algorithms in FN-DBSCAN is the usability of various neighborhood membership functions that regularize different neighborhood sensitivities. Thus, the FN-DBSCAN method is more robust to variations of density within clusters and to the scale of the data set. Note that the fuzziness concept in the NRFJP and FN-DBSCAN algorithms is different from that of FCM: NRFJP and FN-DBSCAN use the level of detail of features as fuzziness, while in FCM this concept is understood as membership of various classes simultaneously, with membership degrees between 0 and 1. Some of the concepts used in FN-DBSCAN are defined below.

Definition 1. The fuzzy neighborhood set of a point x ∈ X with parameter ε1 is a fuzzy set determined as follows:

FN(x; ε1) = { (y, N_x(y)) | N_x(y) ≥ ε1 },     (6)

where N_x : X → [0,1] is any membership function that determines the neighborhood relation between points, and ε1 is a parameter that determines the minimal threshold of neighborhood degrees.

Definition 2. A point x ∈ X is called a fuzzy core point with parameters ε1 and ε2 if

card FN(x; ε1) = Σ_{y ∈ N(x; ε1)} N_x(y) ≥ ε2     (7)

holds, where ε2 is a parameter that determines the noise threshold, and N(x; ε1) is the neighborhood set

N(x; ε1) = { y ∈ X | N_x(y) ≥ ε1 }.     (8)
The pseudocode of the FN-DBSCAN algorithm is given in Fig. 8.
1. Mark all the points in the data set X as unclassified.
2. Specify the fuzzy membership threshold parameters ε1 and ε2. Set t = 1.
3. Find an unclassified fuzzy core point p with parameters ε1 and ε2.
4. Mark p as classified. Start a new cluster C_t and assign p to this cluster.
5. Find all the unclassified points in the neighborhood set N(p; ε1). Create a set of seeds S and put all these points into S.
6. Take a point q from the set S, mark q as classified, assign q to the cluster C_t, and remove q from S.
7. Check if q is a fuzzy core point with parameters ε1 and ε2; if so, add all the unclassified points in the neighborhood set N(q; ε1) to S.
8. Repeat Steps 6 and 7 until the set S is empty.
9. Find a new unclassified fuzzy core point p with parameters ε1 and ε2, set t = t + 1, and repeat Steps 4 through 8 until there is no unclassified fuzzy core point.
10. Mark all the points which do not belong to any cluster as noise.
Figure 8. FN-DBSCAN Algorithm
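The sketch below follows Fig. 8 step by step. The membership function N_x(y) = max(0, 1 − d(x, y)/d_max) is one simple choice made here for illustration ([7] studies several alternatives), and the parameters eps1, eps2 correspond to ε1, ε2 above.

def fn_dbscan(X, dist, eps1, eps2):
    n = len(X)
    dmax = max(dist(X[i], X[j]) for i in range(n) for j in range(n)) or 1.0
    member = lambda i, j: max(0.0, 1.0 - dist(X[i], X[j]) / dmax)

    def neighborhood(i):                       # Eq. (8)
        return [j for j in range(n) if member(i, j) >= eps1]

    def is_core(i):                            # Eq. (7): fuzzy cardinality
        return sum(member(i, j) for j in neighborhood(i)) >= eps2

    labels = [None] * n                        # Step 1: all unclassified
    t = 0
    for p in range(n):
        if labels[p] is not None or not is_core(p):
            continue                           # Step 3: find a core point
        labels[p] = t                          # Step 4: start cluster t
        seeds = [q for q in neighborhood(p) if labels[q] is None]
        while seeds:                           # Steps 5-8: grow the cluster
            q = seeds.pop()
            labels[q] = t
            if is_core(q):
                seeds += [r for r in neighborhood(q)
                          if labels[r] is None and r not in seeds]
        t += 1                                 # Step 9: next cluster
    return [-1 if l is None else l for l in labels]   # Step 10: noise

# One-dimensional toy data: two dense groups and an outlier.
euclid = lambda a, b: abs(a - b)
print(fn_dbscan([1, 1.1, 1.2, 5, 5.1, 9], euclid, eps1=0.8, eps2=1.5))
# [0, 0, 0, 1, 1, -1]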
3. Our Work
A Single Nucleotide Polymorphism (SNP) is a variation in the DNA sequence in which a single nucleotide (A, T, C or G) differs between members of a species. Single nucleotides may be changed, deleted or inserted. SNPs may be in the coding sequences of genes, in non-coding regions of genes or in the intergenic regions between genes. SNPs provide important information about how humans develop diseases and respond to chemicals, pathogens, vaccines, drugs, etc. A Short Tandem Repeat (STR) is an adjacently repeated pattern in DNA. By identifying the count of repeats of STRs, it is possible to create a genetic profile of an individual. There are about 10,000 discovered STRs in the human genome, and the analysis of STRs is the main method used in forensic science to determine the genetic profile of an individual. The Y-chromosome is found only in males and is passed from father to son almost intact. All males can be classified based on the SNPs of their Y-chromosome, and Y-haplogroups have been formed based on these SNPs. There are about 18 Y-haplogroups among existing males in the world today, as shown in Fig. 9. All currently living males in the world can be traced back to a single male who is called the Genetic Adam. Females similarly pass their mtDNA to their daughters, and all existing females may be traced back to a single woman called the Genetic Eve.
Figure 9. Y-DNA Haplogroup Tree

A DNA Y-Chromosome Segment (DYS) is a segment of DNA containing STRs. The possible variations of repeat counts at a DYS marker are called alleles. STR values are commonly called markers in genealogy testing. For example, DYS19 repeats the “TAGA” sequence, and an individual having DYS19 = 15 has 15 tandem repeats of “TAGA”. Fig. 10 displays 12 basic DYS values for two individuals A and B in Y-haplogroup G, with a distance of 4.
DYS   393  390   19  391  385a  385b  426  388  439  389i  392  389ii
A      14   21   15   11    13    15   11   12   12    12   11     28
B      14   21   15   10    14    14   11   12   12    12   11     29

Figure 10. DYS Values for Two Individuals
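For illustration, the sketch below computes the distance of Fig. 10 as the plain sum of absolute repeat-count differences over the 12 markers; our actual distance computation additionally weights markers by the Chandler mutation rates [9], as described below.

def str_distance(a, b):
    # Sum of absolute repeat-count differences (step-wise mutation counts).
    return sum(abs(a[m] - b[m]) for m in a)

A = {"DYS393": 14, "DYS390": 21, "DYS19": 15, "DYS391": 11,
     "DYS385a": 13, "DYS385b": 15, "DYS426": 11, "DYS388": 12,
     "DYS439": 12, "DYS389i": 12, "DYS392": 11, "DYS389ii": 28}
B = {"DYS393": 14, "DYS390": 21, "DYS19": 15, "DYS391": 10,
     "DYS385a": 14, "DYS385b": 14, "DYS426": 11, "DYS388": 12,
     "DYS439": 12, "DYS389i": 12, "DYS392": 11, "DYS389ii": 29}
print(str_distance(A, B))   # 4, as in Fig. 10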
The study of Y-haplogroups is important for finding ancestors, for understanding migration patterns and also susceptibility to certain diseases. In this study, we have used data of the G haplogroup to provide phylogenetic trees for the G2a3a (M406) subgroup [8]. G2a3a is characterized by the M201, P287, P5 and L30 SNPs and is generally found in Turkey, Greece and the Middle East. G2a3a persons usually have the value DYS390 = 21 and, based on this observation, it is estimated that 5% of Anatolian Turks are G2a3a, the highest percentage in any country sampled to date. Lebanon, Jordan and Palestine also have significant G2a3a populations. G2a3a, which is estimated to be about 4,000 years old, probably spread from the East Mediterranean by trading, sea routes, slave trading, etc. The Phoenicians, who were settled in the Jordan, Israel and Lebanon area, may have been one of the main carriers of G2a3a to Greece, southern Italy and southern Spain along the sea routes. To construct the phylogenetic tree for a small sample of G2a3a and G2a3b individuals, we first calculate the distances between individuals based on their STR distances and the mutation rates for STRs based on the Chandler data [9]. We then apply three clustering algorithms: the first is the basic clustering algorithm described in Section 2.3.1, the second is FCM, and the third is FN-DBSCAN, described in Section 2.6.3. We show that FN-DBSCAN is superior to the others based on the output for this data set. Finally, we build phylogenetic trees using the NJ method on the clusters.
Figure 11. Clustering by the FN-DBSCAN Algorithm for G Data

Our intended contribution is twofold. Firstly, we perform clustering before building the phylogenetic trees in a hierarchical manner and then apply the NJ algorithm within each cluster, which simplifies processing and also provides a basis for a distributed setting; secondly, we show that FN-DBSCAN provides favorable clusters. We applied the FN-DBSCAN clustering algorithm to the M406 data to obtain the three distinct clusters shown in Fig. 11 [1]. The phylogenetic tree obtained for the cluster on the right of Fig. 11 is depicted in Fig. 12.
Figure 12. Phylogenetic Tree for the Right Cluster of Fig. 11
4. Conclusions
The design of efficient algorithms to provide solutions to biocomputing and bioinformatics problems such as LCS, gene finding and clustering is of paramount importance. In this study, we have provided a detailed survey of basic algorithmic methods for finding efficient solutions to such problems. Some of these problems have no known polynomial-time solutions and are classified as NP-Complete problems. Approximation algorithms provide sub-optimal solutions to these problems with polynomial time complexity. One such important problem is clustering, and we have shown various approaches to clustering. Finally, we have briefly presented the results of our recent work on building phylogenetic trees for G Y-haplogroup data.

References
[1] Ruzgar, E., Erciyes, K., Phylogenetic Tree Construction for Y-Haplogroups, The 1st Int'l Symposium on Computing in Science and Engineering, June 2010, Kusadasi, Turkey.
[2] Dasgupta, S., Papadimitriou, C. H., Vazirani, U. V., Algorithms, McGraw-Hill, 2006.
[3] Pevzner, P. A., An Introduction to Bioinformatics Algorithms, MIT Press, 2004, ISBN-10: 0-262-10106-8.
[4] Saitou, N., Nei, M., The neighbor-joining method: a new method for reconstructing phylogenetic trees, Mol. Biol. Evol., 4, 406-425, 1987.
[5] Bezdek, J. C., Fuzzy Mathematics in Pattern Classification, PhD thesis, Cornell University, 1973.
[6] Ester, M., Kriegel, H., Sander, J., Xu, X., A density-based algorithm for discovering clusters in large spatial datasets with noise, 2nd Int. Conf. KDD, Portland, Oregon, 1232-1239, 1996.
[7] Nasibov, E. N., Ulutagay, G., Robustness of density-based clustering methods with various neighborhood relations, Fuzzy Sets and Systems, 160, 3601-3615, 2009.
[8] Y-DNA Haplogroup G Project, http://www.members.cox.net/morebanks/Diagram.html
[9] Chandler, Estimating Per-Locus Mutation Rates, Journal of Genetic Genealogy, 2, 27-33, 2006.