A Comparative Study of Biological Distances for EST Clustering

Scott Hazelhurst¹, Zsuzsanna Lipták², and Judith Zimmerman³

¹ School of Computer Science, University of the Witwatersrand, Johannesburg, Private Bag 3, 2050 Wits, South Africa ⋆
² Universität Bielefeld, Technische Fakultät, AG Genominformatik, 33594 Bielefeld, Germany †
³ Research Group 'Algorithms, Data Structures, and Applications', Institute of Theoretical Computer Science, ETH Zurich, CH-8092 Zurich
Abstract. This paper presents the results of an experimental study in which different string distance measures were compared and evaluated with respect to their applicability to EST clustering. We implemented two tools, SeqGen (Sequence Generator) and ECLEST (Evaluator for Clusterings of ESTs). These were used, respectively, to generate simulated ESTs from input human cDNAs, and to run EST clustering on these ESTs and compute a score for the quality of the clustering. We advocate the use of simulated data for comparative studies of this type, because it allows evaluation with respect to a known ideal solution (in this case, the correct clustering), which is rarely possible with real-life data. The distance measures we compared include both subword-based and alignment-based measures. We ran a large number of tests and obtained statistically significant results on the applicability of the distance measures included. For example, we show that certain subword-based measures produce output that is, in a significant number of cases, comparable to that of alignment-based ones, and that certain (easy-to-compute) measures are well suited for a preprocessing step. Our results have significant applications in studies of gene expression and in the discovery of products of alternative splicing, where there is a pressing need for fast clustering of increasingly large sets of ESTs.
Key words: string distance measures, EST clustering, simulated data, clustering evaluation, benchmarks
1 Introduction

The notion of biological distance is critical in many applications in computational molecular biology, where sequences (DNA or protein) must be compared with each other. For example, searching for an approximate match in a DNA or protein database is done by thousands of people all over the world every day. These searches rely on some notion of what it means for two sequences to approximately match, i.e. on what biological distance means. EST clustering, the application we study, is another good example: here we are given thousands of short DNA sequences and must cluster them, putting sequences that are 'close' to each other in the same cluster.

Many distances have been proposed. There are two key criteria for a biological distance:

– accuracy (biological fitness for the application);
– computational cost (how efficiently it can be implemented).

Computational cost is relatively easy to measure, analytically or experimentally; determining accuracy is more difficult. Our research started as an attempt to find a more computationally efficient algorithm for computing distance, and we were faced with the challenge of demonstrating that it would lead to biologically correct answers. The way we were told we had to do this was to show that our algorithm would produce the same answers as an existing one (which had validated itself by showing that it produced the same answers as one before that). Thus, the gold standard for correctness is the output of an algorithm that someone else

⋆ Partially supported by SA National Research Foundation (GUN2053410)
† Most of this work was done while Zs.L. was working in the Research Group 'Algorithms, Data Structures, and Applications', Institute of Theoretical Computer Science, ETH Zurich, and at the South African National Institute of Bioinformatics (SANBI), Cape Town
has developed, rather than biological fitness. A literature survey confirmed that arguments for biological accuracy tend to be qualitative (based on some model) rather than quantitative. The algorithms that are common today probably do give good biological answers (as demonstrated by the number of people who put their faith in them). However, if any answer that differs is ipso facto worse, this operational definition of correctness must hinder innovation.

We argue that accuracy is best measured by direct comparison with the biological reality: this enables us to come up with better definitions of distance, as well as to make appropriate trade-offs between cost and accuracy. Nor is it necessarily the case that the same distance is best in all circumstances. In some cases, we use a distance to measure a biological reality (e.g. in a phylogenetic study, we may use a distance between sequences as a way to measure evolutionary distance); in others, a distance may be used to filter out artifactual differences (e.g. lab errors).

Contributions: If, in an application, different distances yield significantly different accuracy results, then it really matters which one is used. If, on the other hand, the results are robust with respect to the distance used, this gives us a useful tool in algorithm design. This paper proposes a methodology for evaluating distances for EST clustering with respect to biological accuracy. The methodology is demonstrated by an empirical comparison of five different distances. The two tools we implemented for this research, SeqGen and ECLEST, as well as their descriptions ([10] and a preliminary user manual for ECLEST), are available at www.cebitec.uni-bielefeld.de/~zsuzsa/ESTclust.html.

The rest of this section motivates and gives background to the choice of EST clustering, and then outlines the paper.

1.1 EST clustering
An Expressed Sequence Tag (EST) is a short (typically 300–500 base pair) DNA sequence. ESTs are typically produced in applications where we wish to discover what gene products are being produced by some cells' DNA at a particular point in time. Examples of their use are a time-series study of an organism, differential studies of similar healthy and diseased cells, or the discovery of products of alternative splicing.

Full-length mRNA sequences are extracted from some cells of interest. These mRNAs are the product of the active genes within the cells. Since mRNA is difficult to sequence, the molecules are reverse transcribed into complementary DNA (cDNA) sequences, cloned, and then sequenced. The biological sample should contain a number of cDNA sequences from each active gene in the cells. Those cDNAs that are the product of the same gene will be almost identical, with very few differences due to errors in transcription. The sequencing process produces reads of between 300 and 500 base pairs in length; these are the ESTs. This is a high-throughput process, but it is also error-prone, as typically only single reads are done.

The result of this process is a large set of ESTs. Each long cDNA sequence will be broken into a number of ESTs, and for each cDNA manufactured from the mRNA there will be multiple cloned copies; so for each gene product there will be a set of overlapping ESTs. The EST clustering problem is to take a set of ESTs produced from a set of mRNAs and cluster them so that all the ESTs that are the product of the same gene are put in the same cluster. This is done by clustering those ESTs that are close together, which can be formalised as a string problem: x and y are in the same cluster if they contain a sufficiently common subsequence. Note the transitivity implicit in this definition: if x and y have a common subsequence, and y and z have another common subsequence, then x and z will be in the same cluster even if they are very different sequences.
This makes sense, since ESTs that are produced from either end of the same long cDNA (and hence gene) should be put in the same cluster even if there is no similarity between them at the string level.

More formally, let B be the alphabet {A, C, G, T}. An EST is a sequence over B. Clustering is done with respect to

– a distance d : B⁺ × B⁺ → ℝ, and
– a threshold θ ∈ ℝ.

Then:

– C = {C_1, …, C_m} is a partition of S if ∪_i C_i = S and C_i ∩ C_j = ∅ for i ≠ j.
– If C and D are two partitions of S, C is a finer partition than D (or D is coarser than C) if ∀C ∈ C, ∃D ∈ D : C ⊆ D.
– C is a clustering of S with respect to d and θ if it is the finest partition of S such that d(x, y) ≤ θ implies x, y ∈ C_j for some j.
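By this definition the clustering is the transitive closure of the relation "d(x, y) ≤ θ", which can be computed with a union-find (disjoint-set) structure. The following sketch is illustrative only (not the authors' implementation); the distance function and threshold are supplied by the caller:

```python
def cluster(seqs, dist, theta):
    """Finest partition in which d(x, y) <= theta implies same cluster
    (the transitive closure of the 'close' relation), via union-find."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if dist(seqs[i], seqs[j]) <= theta:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[ri] = rj

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values())
```

With a toy Hamming distance and θ = 1, sequences 0 and 1 close and 1 and 2 close put 0, 1, 2 in one cluster even though 0 and 2 may differ by more than θ: exactly the transitivity noted above.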
One way of improving the computational performance of a clustering algorithm G is to find a cheaper clustering algorithm F that produces a coarser clustering, to first run F on S, and then to run G separately on each of the clusters that F finds. We want the first clustering produced to be coarser than the actually desired clustering, so as not to lose accuracy. However, if it is too coarse, we gain no computational advantage; if it is too fine, it may no longer be coarser than the one we want. In more biological terms: we want to reduce the number of false positives (pairs that are incorrectly clustered together) without producing any false negatives (pairs that are incorrectly clustered separately).
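The two-phase scheme above can be sketched as follows; `coarse` and `fine` are placeholder clustering functions supplied by the caller (this illustrates the scheme only, not the paper's implementation):

```python
def two_phase_cluster(seqs, coarse, fine):
    """Run a cheap clustering `coarse` first, then the expensive clustering
    `fine` separately inside each coarse cluster.  This is sound as long as
    the coarse clustering is coarser than the desired one (no false
    negatives introduced by the first phase)."""
    result = []
    for group in coarse(seqs):                 # group: list of indices into seqs
        sub = [seqs[i] for i in group]
        for sub_cluster in fine(sub):          # indices into sub
            result.append(sorted(group[i] for i in sub_cluster))
    return sorted(result)
```

The expensive phase then runs only on pairs within a coarse cluster, rather than on all pairs in S.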
1.2 Organisation of paper
Section 2 discusses the notion of biological distance in more detail, describing the different types of distances, and presenting five distance functions in detail. Section 3 discusses methodology. The main points explored are: (i) the use of simulated data for experimentation; (ii) the method used for evaluating the quality of the clusterings produced, and (iii) the statistical tests used for interpreting the results. Section 4 presents the experimental results, and briefly discusses them. Section 5 discusses related research. Finally, Section 6 concludes and presents future work.
2 The Notion of Biological Distance

Many different distance functions have been proposed. Motivations are principled (how they model the underlying biological reality) or pragmatic (efficiency of implementation). An important class of distance functions are metrics, i.e. functions with the following properties:

– positivity: d(x, y) ≥ 0, and d(x, y) = 0 ⟺ x = y;
– symmetry: d(x, y) = d(y, x);
– triangle inequality: d(x, y) + d(y, z) ≥ d(x, z).
If the positivity condition is weakened to d(x, y) ≥ 0 only, then the function is called a pseudo-metric. A metric is desirable because its properties allow efficient algorithm design (e.g. pre-indexing). However, in EST clustering it is common for non-metrics to be used, and in some implementations the distance function has none of the above properties.

Distance functions can also be characterised as alignment-based or alignment-free. In alignment-based approaches, we try to align the two strings (globally or locally) and then test their level of similarity. Alignment-free distance functions use word frequency counts or information-theoretic methods. See [19] for details.

Notation: If x is a string, then |x| is its length. For a set X, |X| denotes its cardinality. For two strings y = y_1 y_2 … y_k and x = x_1 x_2 … x_n, y is a substring of x, denoted y ⊑ x, if there is a position 1 ≤ i ≤ n such that y = x_i x_{i+1} … x_{i+k−1}. If w is a string with |w| ≤ |x|, then c_x(w) is the number of times w occurs in x; more formally, c_x(w) = |{i | w = x_i … x_{i+|w|−1}}|. Note that this definition allows overlapping occurrences of w.

The goal of EST clustering is to find overlaps between ESTs; even if sequences x and y are very different when considered globally, they are considered similar if they have a nearly common subsequence of sufficient size. Often, therefore, to compute the distance between x and y, a window size m is fixed, all subsequences (windows) of length m in x and y are considered, and the minimum distance between these windows is computed. Formally, given a distance function f, we compute f̂(x, y) = min{ f(x′, y′) : x′ ⊑ x, y′ ⊑ y, |x′| = |y′| = m }. Note that by taking the minimum over all pairs of windows, even if f is a metric, f̂ will not be, since the triangle inequality no longer holds.
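The windowed minimum f̂ can be sketched directly from the definition. This naive version scans all window pairs and is quadratic in the number of windows; practical implementations exploit incremental updates between overlapping windows:

```python
def windowed_min(f, x, y, m):
    """f_hat(x, y): minimum of f over all pairs of length-m windows,
    one window from x and one from y (naive quadratic scan)."""
    best = None
    for i in range(len(x) - m + 1):
        for j in range(len(y) - m + 1):
            v = f(x[i:i + m], y[j:j + m])
            if best is None or v < best:
                best = v
    return best
```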
2.1 Edit distance
Edit distance was one of the first distance measures to be used, both because of its intuitive power and because it has been used in other approximate string matching applications. The (unit cost) edit distance, also called Levenshtein distance, between two strings is the minimum number of edit operations required to convert one string into the other. The edit operations allowed are insertion, deletion, and substitution. Non-unit cost edit distance allows different penalties for different classes of operations, and even for different operations within a class (e.g. a lower penalty would be given for substituting an A for a G than for a T). Further refinements are possible, e.g. affine gap functions, which penalise opening a gap (a contiguous stretch of insertions or of deletions) differently from extending one. Alignment using edit distance is naturally implemented by dynamic programming, which makes alignment a quadratic algorithm – too expensive in many applications. Many papers, however, refer to edit distance as the ideal measure to use. To improve on this, heuristic algorithms like BLAST [2] and FASTA [14] are used, which employ filtering techniques to skip areas where matches are unlikely to occur. By reducing sensitivity, performance can be improved significantly. See [17, chapter 7] for a discussion.
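For reference, the unit-cost (Levenshtein) variant can be sketched with the standard dynamic program; the non-unit penalties and affine gaps used later in the paper are refinements of this scheme:

```python
def edit_distance(x, y):
    """Unit-cost Levenshtein distance via the standard O(|x|*|y|) DP,
    keeping only one row of the table at a time."""
    prev = list(range(len(y) + 1))          # row for the empty prefix of x
    for i, cx in enumerate(x, 1):
        curr = [i]
        for j, cy in enumerate(y, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cx != cy)))   # (mis)match
        prev = curr
    return prev[-1]
```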
2.2 Common word
This distance function associates two sequences if they share a common word of a given length. It can be formalised as: cword_k(x, y) = 0 if there exists w with |w| = k, c_x(w) > 0, and c_y(w) > 0; and cword_k(x, y) = 1 otherwise. Note that this distance function is symmetric but has neither the positivity nor the triangle inequality property, and so is not a metric. The advantage of the common word distance function is that clustering can be implemented in linear time, at least for reasonable values of k. The clusters produced are not very good, but provided k is chosen well, this is often a good first phase in a clustering algorithm. The definition can be refined so that cword gives degrees of similarity based upon the number of common words shared.
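A sketch of cword_k using a hash set of k-mers, so that the check runs in time linear in |x| + |y| (illustrative, not the authors' code):

```python
def cword(x, y, k):
    """Common word distance: 0 if x and y share any word of length k,
    else 1.  Linear time via a hash set of x's k-mers."""
    kmers = {x[i:i + k] for i in range(len(x) - k + 1)}
    if any(y[j:j + k] in kmers for j in range(len(y) - k + 1)):
        return 0
    return 1
```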
2.3 The d² distance function
The d² distance function⁴ uses word-frequency counts and was originally developed for database search [6]. It has, however, been successfully applied to EST clustering [3]. This function is motivated both on biological sensitivity and on performance grounds.

Define d²_k(x, y) = Σ_{|w|=k} (c_x(w) − c_y(w))². In the most general form, the d² score between two sequences x and y is defined by d²(x, y) = Σ_{k=l}^{L} d²_k(x, y). However, in practice, experimental evidence has shown that fixing k is satisfactory, and commonly d²₆ is used (i.e. we only look at words of length 6). This is followed in the rest of the paper, and for simplicity we refer to d²₆ just as d². Note that d² is a pseudo-metric, since d²(x, y) = 0 does not imply x = y. In fact, there are example sequences of length 100 that have a d² score of 0 and a unit cost edit distance of 30. Note that d² is the square of the Euclidean distance between the word frequency vectors of the two sequences, hence the name. In EST clustering, d̂² is employed, as defined before: minimising over all pairs of windows, usually of length 100. Thus, as applied here, the triangle inequality does not hold.

⁴ Usually pronounced "d squared".
Moreover, in some implementations, in order to speed up the algorithm, rather than looking at all pairs of windows, all windows in one sequence are compared to every 50th window in the other. The resulting distance function is then not symmetric, and the clustering produced will depend on the order in which the sequences are compared.
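The fixed-k form of d² on a pair of sequences (or windows) follows directly from the definition; the windowed d̂² would apply this to all window pairs. This is illustrative code, not the authors' implementation:

```python
from collections import Counter

def d2(x, y, k=6):
    """d^2 on two sequences (or windows): squared Euclidean distance
    between their k-word count vectors, counting overlapping words."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[j:j + k] for j in range(len(y) - k + 1))
    return sum((cx[w] - cy[w]) ** 2 for w in set(cx) | set(cy))
```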
2.4 q-grams
q-grams, also called q-tuples, are simply words (strings) of length q. q-grams have been used in approximate string matching, in particular in database search, as a filtering method: first, based on occurrences of common q-grams, regions in the text are identified where an approximate match of the search pattern could occur; in a second step, this is verified by an in-depth approximate string matching algorithm such as the Smith-Waterman algorithm. Jokinen and Ukkonen [13] developed a necessary condition on the minimal number of common q-grams for an approximate match to occur. The QUASAR algorithm [4] used a suffix array to implement the q-gram method. When using q-grams as a distance measure, there exist conflicting definitions. We use the following definition, which we call Boolean q-grams: bqg(x, y) = Σ_{|w|=q} b_x(w) · b_y(w), where b_x(w) = 1 if w ⊑ x, and 0 otherwise.
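The Boolean q-gram count amounts to a set intersection. Note that, unlike the distance functions above, bqg is a similarity: larger values mean closer sequences (illustrative sketch):

```python
def bqg(x, y, q):
    """Boolean q-gram similarity: number of distinct q-grams occurring
    in both x and y (occurrence is 0/1; multiplicity is ignored)."""
    qx = {x[i:i + q] for i in range(len(x) - q + 1)}
    qy = {y[j:j + q] for j in range(len(y) - q + 1)}
    return len(qx & qy)
```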
3 Experimental Methodology

This section describes our experimental methodology, serving both as a record of the experiments we carried out and as a template for such comparative analyses. Section 3.1 describes and justifies the use of simulated data for EST clustering and our approach to generating such data. Section 3.2 explains and justifies how we measure the quality of a clustering produced by an algorithm. Section 3.3 presents the details of our experiments: the data used; the parameters set for EST generation with our sequence generation tool SeqGen; the distance measures compared by our clustering comparison tool ECLEST; and the method employed for evaluating the clusterings. Finally, Section 3.4 discusses the statistical tests used to evaluate the results.

3.1 Using simulated data for benchmarking
We propose the use of synthetic data for evaluation. Other approaches are possible:

1. Expert curation. The output of a clustering algorithm is examined by experts, who carefully analyse the output, possibly changing it based on their expert knowledge, before giving it their imprimatur. Such data sets are ideal for benchmarking; unfortunately, good data of this kind appears to be rare.
2. Comparison to existing algorithms. Although there are very pragmatic reasons for adopting this approach, we have argued that it inherently skews the results unless there is compelling biological knowledge that justifies them. An additional factor to bear in mind is that, since ESTs are relatively error-prone, there is no guarantee that one measure is superior in all conditions.

We propose that artificial but realistic data sets should be used as complementary benchmarks, and have designed a tool called SeqGen to produce such benchmarks. The objective of SeqGen is to produce large amounts of artificial – but realistic – test data for testing the effectiveness of different distance measures used in clustering DNA (or related) sequences. SeqGen creates artificial ESTs from any number of given cDNA sequences, using a number of criteria. Creating ESTs artificially in this way yields an EST set whose exact final clustering is known, so when testing a new algorithm or measure, we can compare its output with the known right answer. In addition, the use of artificial test data enables us to produce data with a range of different error models. Thus, if some measures are better than others in different circumstances, we will be able to provide some insight; this would be difficult to test with real data.
Because EST sequences are submitted in bulk and are single, unverified reads, their quality is on average low. We simulate the production of ESTs from cDNAs (or any genomic-like data) with a variety of models that mimic what happens in the real world. We do not aim to produce a single model for ESTs, since the biological processes used by different labs at different times manifest different types of errors and error rates. Rather, we provide a tool that can generate different types of data with user-specified error models:

1. Given a long original sequence (e.g. mRNA or cDNA), the sequence is split into fragments. Splitting can be done either at random (with user-specified parameters) or at 'restriction' sites.
2. Each fragment is copied a user-specified number of times.
3. Each copy of each fragment is then mutated according to user-specified fault models.

Errors modelled: Real ESTs have various sources of error, including: contaminants (vector, rRNA, mitochondrial RNA, possibly other species, genomic sequence); repeat sequences (simple repeats, complex repeats); base pair errors; frameshift errors; chimeras resulting from artificial ligation of unrelated ESTs; and stutters. SeqGen has been built on an understanding of the types of errors that are produced in the laboratory. Our methodology was to try to understand the biological processes and, with the aid of a biologist, to draw the types of error curves. We then found convenient mathematical functions that simulated these curves and that could be easily programmed. The errors modelled are:

– Single base errors. Single bases may be read incorrectly. Three physical phenomena are modelled. Random noise: bases are misread at a low background rate. Polymerase decay: as the EST is read, the polymerase decays, increasing the error rate; there is gentle decay for the bulk of the EST followed by very rapid decay at the end. Primer interference: interference from the primer makes the beginning of reads particularly unstable.
If an error does occur, there are four possible events: a base can be arbitrarily changed, deleted, or inserted, or an N can be inserted. The probabilities of these events are parameterisable.

– Stuttering. Stuttering is caused by a problem in the reading of the EST: the transcription slips and a portion of the mRNA is re-read. Stuttering can happen anywhere, but is most likely after repeated Gs or Ts; we model only these cases, since other stuttering events are very rare.
– Ligation. Ligation happens when two ESTs bond together, giving the appearance of a new EST. The two ESTs that join need not come from adjacent parts of an mRNA; indeed, they could come from different mRNAs. In the creation of the faults, ligations happen first, since the physical faults being modelled here occur to the real ESTs; the other faults discussed are artifacts of the sequencing process.

SeqGen provides the user with a number of parameters so that a range of different test cases can be produced. SeqGen does not model contamination or repeats, since these are typically masked out. It has been built in a modular way, so that new fault models are relatively easy to add. At the moment there are about 10 parameters for the fault models, plus other program parameters, and we believe that this design space gives the user the ability to build realistic data sets. Full details can be found in [10]. The approach of SeqGen is similar to that of GenFrag [7, 8], though it is tailored to EST data and supports more sophisticated error models.
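As an illustration of the kind of fault model SeqGen implements, the sketch below applies single-base errors with a position-dependent rate. The rate curve (noise floor, primer interference at the start, polymerase decay at the end) and all constants are invented for illustration; they are not SeqGen's actual functions or parameters:

```python
import random

def error_rate(pos, length):
    """Schematic per-base error probability: a constant noise floor,
    elevated at the start (primer interference) and rising sharply
    near the end of the read (polymerase decay).  Shapes and constants
    are illustrative only."""
    noise = 0.005
    primer = 0.02 * max(0.0, 1.0 - pos / 30.0)   # fades over the first ~30 bases
    decay = 0.08 * (pos / length) ** 8           # sharp rise at the end
    return noise + primer + decay

def mutate(seq, rng):
    """Apply single-base errors: substitution, deletion, insertion,
    or the base being read as 'N'."""
    out = []
    n = len(seq)
    for pos, base in enumerate(seq):
        if rng.random() < error_rate(pos, n):
            event = rng.choice(["sub", "del", "ins", "N"])
            if event == "sub":
                out.append(rng.choice([b for b in "ACGT" if b != base]))
            elif event == "ins":
                out.append(base)
                out.append(rng.choice("ACGT"))
            elif event == "N":
                out.append("N")
            # "del": emit nothing for this base
        else:
            out.append(base)
    return "".join(out)
```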
3.2 Clustering evaluation
The advantage of using simulated data is that the quality of a clustering algorithm on particular data can be computed directly, by comparing its output to the known correct answer. We considered two indexes of clustering quality that are commonly used: a graph-theoretic approach that uses a maximal perfect matching (Matching Index), and the so-called Rand Index. For an overview of different methods of clustering evaluation, see [11]; for clustering techniques in general, [12]. Most of the clustering techniques discussed there refer to data with some geometric properties, and are thus not applicable here; as discussed earlier, we use transitive closure for clustering, i.e. x and y are clustered together iff there is a finite chain x = x_1, x_2, …, x_k = y such that x_i and x_{i+1} are close, i.e. d(x_i, x_{i+1}) < θ, for all i = 1, …, k−1.

Let S be the ground set of size |S| = n, let C_1 and C_2 be the two clusterings under consideration, and denote by C_i(x) the cluster to which x ∈ S belongs under clustering C_i. For the Matching Index, a complete bipartite graph is generated with vertex set A_1 ∪ A_2, where A_i = C_i, i = 1, 2. If |C_1| ≠ |C_2|, then the one with fewer clusters is first padded with a sufficient number of empty clusters. The edges are assigned weights according to how alike the corresponding clusters are, the simplest weight function being the size of their intersection. Then a maximal perfect matching is computed and its weight normalised by dividing it by n.

The Rand Index, on the other hand, returns a normalised count of the number of pairs of elements that were "treated alike" by both clusterings. More formally, let a = |{{x, y} : x ≠ y, C_i(x) = C_i(y), i = 1, 2}| be the number of pairs that are clustered together in both clusterings, and d = |{{x, y} : C_i(x) ≠ C_i(y), i = 1, 2}| the number of pairs that are separated in both clusterings.⁵ The Rand Index of C_1, C_2 is then

    RI(C_1, C_2) = (a + d) / C(n, 2),

where C(n, 2) = n(n−1)/2 is the total number of pairs.
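The Rand Index can be sketched directly from the definition. Here clusterings are maps from element to cluster label; the computation is quadratic in n, which is fine for evaluation purposes (illustrative code, not ECLEST's implementation):

```python
from itertools import combinations

def rand_index(c1, c2):
    """Rand Index: fraction of element pairs treated alike by both
    clusterings (together in both, or apart in both), over all
    n-choose-2 pairs.  c1, c2 map each element to its cluster label."""
    elems = list(c1)
    agree = 0
    for x, y in combinations(elems, 2):
        together1 = c1[x] == c1[y]
        together2 = c2[x] == c2[y]
        if together1 == together2:
            agree += 1
    n = len(elems)
    return agree / (n * (n - 1) // 2)
```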
The Rand Index can also be normalised by the expected difference to a randomly chosen clustering with the given number of clusters. However, this value is so small that the computational effort does not seem worth the added accuracy, so we chose to use the non-normalised version of the Rand Index. A closer look at the two functions showed that the Rand Index is more appropriate for our problem. The reason is that the Matching Index penalises clustered elements independently of the size of their clusters, and is less sensitive to different types of incorrect clustering. To illustrate, consider the two examples in Figures 1 and 2:
Fig. 1. Element x is included in the wrong cluster in Case 1, while it is put into a singleton cluster in Case 2
Figure 1: The Matching Index penalises both cases by 1/n, while the Rand Index decreases by (|A| + |B|)/C(n, 2) in Case 1, but by only |A|/C(n, 2) in Case 2, thus differentiating between the two cases. In our application, though, Case 2 is clearly the less severe error; in particular, it can much more easily be detected by manual post-processing. Note also that the Rand Index penalty grows proportionally to |A| + |B| and |A|, respectively, while the Matching Index penalty is independent of the cluster sizes.

⁵ If one of the two clusterings is considered the "correct" one, as is the case in our application, then we can also say that C(n, 2) − (a + d) is the number of false positives and false negatives.
Figure 2: Here, one cluster of size m is split up in two different ways: once into two subclusters of equal size, and in the second case into one big subcluster of size m/2 and m/2 singletons. Again, the Matching Index does not differentiate between the two cases, penalising both by (m/2)/n. The Rand Index, on the other hand, is reduced by (m²/4)/C(n, 2) in Case 1, and by (3m²/4 − m/2)/C(n, 2), i.e. by nearly 3 times as much, in Case 2. In this case, for our application, Case 1 is the less severe error.
Fig. 2. A big cluster is split up into two equal parts (Case 1), and into one big and many little subclusters (Case 2)
3.3 Details of our experiments
Data used in our experiments. For the experiments reported here, we used SeqGen to generate ESTs from a collection of human cDNAs:

– We took the cDNA sequences from the Mammalian Gene Collection at http://mgc.nci.nih.gov/.
– Non-human ESTs were removed using BLAST and human genome information.

We normalised the input data so that there was one cDNA sequence per gene in a collection, by performing a complete pair-wise comparison: if the similarity between two sequences was stronger than e-85, one of them was deleted from the file, because the probability that they came from the same gene was quite high. The parameters of SeqGen were then set to produce a set of good quality ESTs:

– ESTs of length between 300 and 500 are produced, where every base appears on average in 5 ESTs; no reverse reads were produced.
– The probability of a single base error, taking into account random noise, polymerase decay, and primer interference, is shown in Figure 3.
– We allowed a modest amount of stuttering.
– No ligation was used in our initial tests. Note that when ligation happens, accurate clustering may be impossible, because ligated cDNAs could cause two separate clusters to be merged.
– We split the original set of cDNAs into sets of between 5 and 10 sequences per test (depending on the length of the sequences). Each of these sets was used for a test.
– We ran 10 preliminary tests to estimate the distributions of the five distance measures used, and 124 tests for testing our hypotheses.

A more formal discussion of the setting of SeqGen's parameters is out of place here. For the record, the following settings were used: samplerandom 300 500 5 0; α = 0.005, ν = 20, ξ = 2, ζ = 1, β = 30, γ = 1/24, ν = 20, η = 20. See [10] for details.

Distance measures compared. We implemented five distance measures, with the following parameters (θ is the threshold):

1. symmetric d²: all windows of length 100 in the two sequences are compared pairwise, k = 6, θ = 50;
Fig. 3. Error probability functions for the different single base errors (random noise, polymerase decay, primer interference), as a function of base position for a sequence of length 300. For clarity, the y-axis is cropped at y = 0.1.
2. asymmetric d²: all windows of length 100 in one sequence are compared to every 50th window of length 100 in the other sequence, k = 6, θ = 50;
3. edit distance: penalty function p(match) = −1, p(mismatch) = +3, p(gap opening) = +5, p(gap extension) = +2, and p(side gap) = 0 (i.e. gaps at the beginning and end of the sequence), θ = 20;
4. common word: word length 19, θ = 1;
5. Boolean q-grams: all windows of length 100 are compared pairwise, q = 6, θ = 13.

The thresholds were found empirically: we did a few test runs in order to see which threshold yields a clustering with the best scores.

ECLEST. The ECLEST tool takes as input a set of DNA strings in FASTA format and a text file specifying the ideal clustering, and computes and evaluates a clustering

– using a specified distance measure,
– using a specified clustering algorithm, and
– using a specified clustering evaluation method.

It has been designed so that at all three points new algorithms can be plugged in. We have at present implemented the five distance measures detailed above, transitive closure as the only clustering algorithm, and the Rand Index as the only clustering evaluation method. A preliminary version of the ECLEST manual can be found at www.cebitec.uni-bielefeld.de/~zsuzsa/ESTclust.html.

3.4 Statistical evaluation
We ran 10 preliminary test sets to estimate the distributions of the clustering scores computed with the different distance measures, and 124 test sets to test our hypotheses. From the preliminary test sets it could be seen that the convenient assumption of normal distribution was untenable, thus we were restricted in the kind of statistical tests we could use. In all cases, we used either the Friedman ranking test or the binomial test. The Friedman test utilises only a ranking of the scores rather than their values.
Another limitation is that the Friedman test is designed for continuous distributions, an assumption that slightly skews the results when several identical scores appear in one test. We therefore also ran the test using only those test sets where the results differed for all distance measures in question; however, this dramatically reduced the number of test sets that could be evaluated. All our hypotheses have one of the following forms, where the Di are distance measures and 0 ≤ λ ≤ 100:

– Di has the same distribution as Dj.
– Di performs no worse than Dj in λ percent of the cases.
– Di is better than Dj in at least λ percent of the cases.
– Di followed by Dj performs as well as Dk (in λ percent of the cases).
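Hypotheses of the second and third form reduce to an exact one-sided binomial test on the number of test sets in which Di came out ahead. A minimal sketch (the counts 110 and 134 in the example are hypothetical, chosen only for illustration):

```python
from math import comb

def binom_pvalue_at_least(k, n, p=0.5):
    """Exact one-sided binomial test: P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical example: suppose Di performed no worse than Dj in 110 of
# 134 test sets. Under the null hypothesis lambda = 50 (a fair coin),
# that many successes is extremely unlikely, so we would reject the null
# at the 95% confidence level.
p_value = binom_pvalue_at_least(110, 134, p=0.5)
```

For other values of λ one simply sets p = λ/100 in the null hypothesis.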
Because the normality assumption does not hold, we were unable to make quantitative statements about how much better one distance measure performs than another. In one case, the difference between two distance measures (namely, symmetric d2 and asymmetric d2) can be assumed to be normally distributed, so there a quantitative statement will be possible as well; however, this requires additional test runs, so that result is not included in the present paper. All our statements (acceptances or rejections of hypotheses) have been made with 95% confidence. We used the statistics program R, version 1.6.1 [9], to evaluate our data.
4 Results

4.1 Statistical results
The clustering scores do not follow a normal distribution. Since nothing can be said about the underlying distribution, we can only make comparative statements. The first (and unsurprising) result is that the five string distance measures do not follow the same distribution. This could be shown both with the data that included ties and with the much smaller data set without ties. Next we compared those distance measures where it seemed reasonable to assume either that they differed significantly or that they were similar. Since the distribution of the common word distance was very different from the others, we excluded it from these individual comparisons. Here, we were able to show (always with 95% confidence) that

– symmetric d2 performs as well as edit distance only with probability < 0.5;
– asymmetric d2 performs as well as symmetric d2 with probability > 0.5;
– symmetric d2 performs better than Boolean q-grams with probability > 0.95;
– asymmetric d2 performs better than Boolean q-grams with probability > 0.95.
Finally, we tested whether certain distance measures can be used as a preprocessing step for others in order to achieve a good result.

– We could show that common word followed by symmetric d2 performs as well as symmetric d2 alone, with probability > 0.95.
– On the other hand, we could not prove that common word followed by edit distance performs better than symmetric d2 alone with probability > 0.5.

4.2 Quantitative comparison
Although we cannot draw statistically valid conclusions about how much better one method is than another, Table 1 shows the results we obtained. The first data column of the table shows the average performance of the measure across all experiments. The remaining columns show how two methods compared: if the entry in row i, column j is a/b/c, then method i beat method j a times, the two were tied b times, and method j won c times.
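The a/b/c entries can be derived from the per-test-set scores of two methods with a helper of the following kind (a sketch; the function name and the example scores are ours):

```python
def pairwise_record(scores_i, scores_j):
    """Return (a, b, c) for Table 1: over the same sequence of test sets,
    how often method i beat method j, how often they tied, and how often
    method j won. Inputs are parallel lists of clustering scores."""
    a = sum(1 for x, y in zip(scores_i, scores_j) if x > y)
    c = sum(1 for x, y in zip(scores_i, scores_j) if x < y)
    return a, len(scores_i) - a - c, c
```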
        Average   D2 S      D2 AS     ED        CW        BQG
D2 S    0.9768    -         39/95/0   2/56/76   64/21/49  134/0/0
D2 AS   0.9718    0/95/39   -         1/35/98   64/13/57  134/0/0
ED      0.9861    76/56/2   98/35/1   -         66/49/19  134/0/0
CW      0.8674    49/21/64  57/13/64  19/49/66  -         103/0/31
BQG     0.8262    0/0/134   0/0/134   0/0/134   31/0/103  -

Table 1. Quantitative comparison of measures (D2 S = symmetric d2, D2 AS = asymmetric d2, ED = edit distance, CW = common word, BQG = Boolean q-grams)

4.3 Discussion
These results show that the choice of distance measure is important. The performance indexes of d2 and edit distance are better than those of the others. However, except for Boolean q-grams, no method is better than all the other methods in all circumstances; moreover, the better methods are very close together in performance. The finding that using the common word distance in conjunction with symmetric d2 is as good as using symmetric d2 alone is interesting and important, because it justifies certain algorithm design decisions that have been made in practice. There was a belief that for the d2 score between two sequences to fall below a certain threshold, the sequences had to share a common 19-word. Recently, however, counterexamples were found in which two sequences with a d2 score of 0 shared only 11-words. Our results support the view that such examples are anomalies and do not affect the overall quality of the clustering.
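The common word preprocessing step discussed above amounts to a cheap exact-word filter: a pair of sequences is passed on to the more expensive measure only if it shares at least one word of length 19. A minimal sketch (the function name is ours):

```python
def share_common_word(s, t, w=19):
    """True iff s and t share at least one exact word of length w.
    Used as a cheap filter before an expensive measure such as d2:
    pairs that fail the filter are never compared in full."""
    words = {s[i:i + w] for i in range(len(s) - w + 1)}
    return any(t[j:j + w] in words for j in range(len(t) - w + 1))
```

Building the word set is linear in the sequence lengths, which is why this filter is attractive compared with the quadratic window comparisons of d2.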
5 Related Work

Many papers have compared distance measures. However, most take one approach as the standard and then see how another compares with it. A good example is [16], where BLAST and Smith-Waterman are compared. There it is shown that Smith-Waterman is more sensitive, but that there are some matches BLAST makes that Smith-Waterman does not. Although it is implicit in their discussion that Smith-Waterman is better, there is no way in their comparative framework to determine which of the matches found are correct. A more rigorous example is [3], which compares d2_cluster results to UniGene. The authors show that d2_cluster is more sensitive, and argue that it is correctly more sensitive by:

– looking at the differences and then using biological expertise to judge correctness (this, we think, is a very good technique to use; unfortunately it is very expensive); and
– deriving upper bounds on the probability of wrongly clustering sequences.

In addition, they do a comparison with Smith-Waterman, explicitly making the assumption that it is correct. Work has also been done on clustering of microarray expression data. The focus of [18] is a comparison of different clustering algorithms, though an obiter remark is made that the choice of distance function has a ‘profound effect’. The different clustering algorithms are shown to produce different clusterings, though ‘without a biological basis for interpreting these results, there is no way to decide which grouping is right and which is wrong.’ In their comparison of different clustering algorithms, Datta and Datta evaluate the algorithms based upon their internal consistency [5]. This gives some objective way of evaluating the algorithms, though the link to the ‘real’ answer is indirect.
Our methodology is most similar to that used in phylogenetic studies, where a phylogenetic tree is synthetically generated according to some evolutionary model, and then phylogenetic algorithms are evaluated according to how well they can reconstruct the known tree from the leaf data [15].
6 Conclusion

This paper has made two contributions:
– the development of a methodology for comparing distance measures for EST clustering; and
– a preliminary study using common distance measures and typical values for EST error models.

Our evaluation methodology comprises:

1. A system called SeqGen that generates realistic simulated data for clustering. The advantage of this is that we know the correct answers that the clustering algorithm should provide, and we can also test using different error models.
2. A system called ECLEST that tests different measures on the given data and computes a measure of clustering quality.
3. A statistical framework for evaluating the significance of the results.

The particular experiments we ran illustrate the value of the approach. They show that certain measures are better than others, but also that some measures compute very similar results. This gives algorithm designers greater choice in designing algorithms, and also gives rigorous ways of justifying the choice of certain heuristics to speed up the clustering process.

Extending the work. We would like to extend our work in various ways:

– evaluating other distance measures; BLAST [2] and Chaos Game Representation [1] are ones we particularly wish to explore;
– using more data;
– using different error models to see how the relative performance changes, particularly in very error-prone situations;
– exploring how the choice of parameters of the particular distances affects clustering; we may find that the choice of parameters is much more important than the choice of distance, which would have important consequences both for biological correctness and for algorithm design;
– exploring whether other statistical techniques can be used to give better quantitative comparisons; and
– making our tools publicly available.
Acknowledgements: Professor Winston Hide from the South African National Bioinformatics Institute (SANBI) gave significant encouragement and help at the various stages of this research, which was started while two of the authors were visiting SANBI. Dr Anton Bergheim was central to the development of SeqGen and helped us with the choice of biological parameters. We thank Andrew D. Barbour and the Statistics Department of ETH Zurich for advice on the statistical test arrangement, Thomas Erlebach at the Computer Engineering and Networks Laboratory (TIK) and Kai Nagel, Institute for Scientific Computing, both ETH Zurich, for computing time. The program ECLEST and the tests presented in this paper constitute part of Judith Zimmermann’s thesis for a Diploma (Masters) in Computer Science.
References

1. J.S. Almeida, J.A. Carrico, A. Maretzek, P.A. Noble, and M. Fletcher. Analysis of genomic sequences by Chaos Game Representation. Bioinformatics, 17(5):429–437, 2001.
2. S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.
3. J. Burke, D. Davison, and W. Hide. d2_cluster: A validated method for clustering EST and full-length cDNA sequences. Genome Research, 9(11):1135–1142, 1999.
4. S. Burkhardt, A. Crauser, P. Ferragina, H.-P. Lenhof, E. Rivals, and M. Vingron. q-gram based database searching using a suffix array. In S. Istrail, P. Pevzner, and M. Waterman, editors, Proceedings of the 3rd Annual International Conference on Computational Molecular Biology (RECOMB), pages 77–83, Lyon, France, 1999. ACM Press.
5. S. Datta and S. Datta. Comparisons and validation of statistical clustering techniques for microarray data. Bioinformatics, 19(4):459–466, 2003.
6. D.C. Torney, C. Burks, D. Davison, and K.M. Sirotkin. Computation of d2: A measure of sequence dissimilarity, pages 109–125. Addison-Wesley, 1990.
7. M.L. Engle and C. Burks. Artificially generated data sets for testing DNA fragment assembly algorithms. Genomics, 1(1):286–288, 1993.
8. M.L. Engle and C. Burks. GenFrag 2.1: new features for more robust fragment assembly benchmarks. Computer Applications in the Biosciences, 10(5):567–568, 1994.
9. The R Foundation for Statistical Computing. Available at http://www.r-project.org/.
10. S. Hazelhurst and A. Bergheim. SeqGen: A tool for creating benchmarks for EST clustering algorithms. Technical Report TR-Wits-CS-2003-1, School of Computer Science, University of the Witwatersrand, April 2003. ftp://ftp.cs.wits.ac.za/pub/research/reports/TR-Wits-CS-2003-1.pdf.
11. L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2:193–218, 1985.
12. A.K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, September 1999.
13. P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In A. Tarlecki, editor, Proceedings of Mathematical Foundations of Computer Science (MFCS ’91), volume 520 of LNCS, pages 240–248, Berlin, Germany, September 1991. Springer.
14. D.J. Lipman and W.R. Pearson. Rapid and sensitive protein similarity searches. Science, pages 1435–1441, 1985.
15. B.M.E. Moret, L.-S. Wang, and T. Warnow. Towards new software for computational phylogenetics. IEEE Computer, 35(7):55–64, 2002.
16. H. Nash, D. Blair, and J. Greffenstette. Comparing algorithms for large-scale sequence analysis. In Proceedings of the Second IEEE International Symposium on Bioinformatics and Bioengineering, pages 89–96. IEEE Computer Society Press, March 2001.
17. P.A. Pevzner. Computational Molecular Biology. MIT Press, Cambridge, Massachusetts, 2000.
18. J. Quackenbush. Computational analysis of microarray data. Nature Reviews Genetics, 2:418–427, June 2001.
19. S. Vinga and J. Almeida. Alignment-free sequence comparison – a review. Bioinformatics, 19(3):513–524, 2003.