MEDITE: A Unilingual Textual Aligner
Julien Bourdaillet and Jean-Gabriel Ganascia
Université Pierre et Marie Curie - Laboratoire d’Informatique de Paris 6
8 rue du Capitaine Scott - 75015 Paris - France
[email protected] -
[email protected]
Abstract. This paper addresses a problem of natural language text alignment, from a humanities discipline called textual genetic criticism where different text versions must be compared. The paper shows that this task is hard because such versions can be very different and texts with a lot of internal repetitions present specific difficulties. MEDITE is a natural language text aligner that compares texts written in the same language. It detects modifications at character level, as opposed to related applications which either remain at word level or give poor results at character level. The detection of moved blocks in the text, induced by our formalism based on edit distance with moves, is introduced. The algorithm is closely related to sequence alignment in bioinformatics as similar building blocks are used and applied to this natural language processing task. A benchmark analysis has been carried out to compare MEDITE with other aligners and it shows that our approach is superior to existing ones especially in hard cases.
1 Introduction
MEDITE has been designed as an application to assist philologists in their practice of textual genetic criticism [1, 2]. Genetic criticism is part of the humanities and was developed thirty years ago as an important original French school of literary study [3–5]. This discipline introduced a temporal dimension into literary criticism by studying not only the final version of a literary work but also writers’ drafts in order to highlight the genesis of the text. It seeks to understand how a text is produced but remains close to the aesthetics of the work. Philologists suggest interpretative hypotheses when they read the final version of a text, which they corroborate (or invalidate) through the study of previous versions. This study is based on text version comparison and considers every modification between two versions. These modifications need to be detected at character level because a writer can proceed by one- or two-character long modifications, which can seriously alter the sentence meaning, especially for a morphologically rich language such as French. Techniques arising from genetic criticism have been applied to epistemology as in the following example. Claude Bernard was a nineteenth century physiologist who contributed to the birth of modern medicine. In order to study the evolution of his medical theories, philologists want to compare his experiment
notebooks and their synthesis written some years later. The notebooks relate observations in a telegraphic style while observations are written in an academic style in the synthesis (in which new ideas are also inserted). An example of comparison of these two texts using Microsoft Word is given in Figure 1 and using MEDITE in Figure 2. It can be seen that MEDITE identifies considerably more invariants (in black and white) between the two texts than Word, resulting in a better alignment (as presented in Section 4). Furthermore, the visualization interface impacts on the readability of the alignment. (This example uses French texts but MEDITE works for West-European languages.) Comparison and visualization problems are common in existing file comparison tools. These tools are generally descendants of diff [6] in which two files are compared line by line and a list of inserted and deleted lines is produced. This kind of program comes from the community that created Unix where their main interest was source code comparison. For this task, line by line comparison is sufficient because program structure is very constrained and the syntax is strong. This results in well-organized texts (i.e. source code) and the assertion “one line, one instruction” is generally verified. Most of the modifications occurring between two versions are line modifications. The limits of these comparers appear with texts such as those of Claude Bernard because intra-line modifications are not well identified. For example, the modification of one character in a line will lead to a “deletion of one line, insertion of one line” analysis. This is acceptable for source code but is a bad result for natural language. The precision of detections is of crucial importance for genetic criticism and this is not addressed correctly by existing aligners. Furthermore, Claude Bernard’s texts contain a lot of repeated text blocks. 
In the left text of Figure 2, the word mouvements is repeated three times and it is repeated more times in the whole text. With simple alignment algorithms, several repetitions may not be found, resulting in missing invariant or moved blocks in the final alignment, as in Figure 1. This is due to the fact that invariants (and moves) between the two texts are blocks repeated at least twice and if some repetitions are missed then invariants will be missed. Similar problems existed in previous versions of MEDITE: when processing Claude Bernard’s texts, our results were similar to those of Word. We present here a new algorithm that addresses these problems. This task can be defined as unilingual textual alignment that compares two related texts, written in the same language, and identifies invariants and differences between them. More precisely, the term alignment refers to the identification of these invariants and their pairing. Once identified, differences can be deduced, but there is no one way of doing this. We also address the move detection task, but since moves can be seen as a deletion plus an insertion, this task involves ambiguity. Using our formalism, based on edit distance with moves, it is possible to handle moves and this is presented in Section 2. Machine translation is based on alignment, but it is bilingual and sentence or word-based. Most of the methods rely on machine learning where a statistical model is trained from a bilingual reference corpus [7]. In our case, there exists
no unilingual aligned corpus, so supervised learning is not possible. Moreover, our aim is to detect modifications between texts whereas in machine translation each word or expression in the first text must match a similar unit in the second. Machine translation alignment therefore involves no deletions or insertions, whereas they are central to our problem. In bioinformatics, sequences of nucleic acids (DNA) or amino acids (proteins) are aligned. This is unilingual alignment because sequences are expressed in the same alphabet and the grain of the alignment is character-based. Two alignment types exist in bioinformatics, local and global, descending from [8] and [9] respectively. Local aligners try to find regions in sequences that match exactly or with a maximum similarity. Regions of low similarity can be left unaligned because not all regions are of equal importance. On a DNA strand, coding regions (exons) will code for proteins and non-coding regions (introns) will be eliminated during the transformation process to RNA. Global aligners try to match two sequences completely. One character from one of the two sequences either matches one character of the other sequence or matches a blank character, meaning it is inserted or deleted. We are not interested in finding regions of high similarity between two texts without considering low similarity regions because each character of our texts must be aligned. Our algorithm is related to bioinformatics global aligners and will be presented in Section 3 but, whereas bioinformatics aligners can hide repetitions in sequences before alignment using tools such as RepeatMasker [10], we must address this problem. MEDITE was evaluated using a benchmark with other file comparison tools, as presented in Section 4. General conclusions are presented in Section 5.
Fig. 1. The alignment of Claude Bernard’s texts using Microsoft Word
Fig. 2. The alignment of Claude Bernard’s texts using MEDITE
2 Formalism
This alignment problem can be formalized as the computation of edit distance with moves [11, 12], detailed below. We have two sequences $s_1$ and $s_2$ over the common West-European Latin alphabet $\Sigma = \{a, \ldots, z\} \cup \{A, \ldots, Z\} \cup \{\text{accented characters}\} \cup \{\text{separators}\}$. Four operators are given: character insertion, character deletion, character substitution and block move. A block is a 3-tuple $(p, q, l)$ where $p$ is the position in $s_1$, $q$ the position in $s_2$ and $l$ the length of the block. The goal is to find a sequence of operations of minimum cost which transforms $s_1$ into $s_2$. Characters not involved in an edit operation are called invariant characters; they are present in both $s_1$ and $s_2$. The decomposition of $s_1$ and $s_2$ into a list of inserted, deleted, substituted, moved and invariant blocks forms an alignment. This problem is NP-complete [13, 12] and, following the formalism described in [14], it can be reduced to the block permutation problem.
The block permutation problem considers two sequences $s^h_1$ and $s^h_2$ over $\Sigma$ such that $|s^h_1| = |s^h_2|$ and a predicate $P$ that defines a bijection between the characters of the two strings:
$$P(S, S') = (\forall i,\ 1 \le i \le |S|,\ \exists! j,\ 1 \le j \le |S'|,\ S[i] = S'[j]) \wedge (\forall i,\ 1 \le i \le |S'|,\ \exists! j,\ 1 \le j \le |S|,\ S'[i] = S[j]) \tag{1}$$
$P(s^h_1, s^h_2)$ defines a minimal trivial partition of the two strings into invariant or moved blocks, the partition into 1-character long blocks. The goal is to find a maximal partition under a certain measure $M$, that is, a partition maximizing invariant block size and minimizing moved block size. This enables us to define
$$M = \sum_{b_i \in \{\mathit{invariants}\}} |b_i| - \sum_{b_m \in \{\mathit{moves}\}} |b_m| \tag{2}$$
A function $\mathit{part}$ extracts a partition from $s^h_1$ and $s^h_2$ such that:
$$\mathit{part}(s^h_1, s^h_2) \rightarrow \{\mathit{invariants}\}, \{\mathit{moves}\} \quad \text{with} \quad \forall x \in (s^h_1 \cup s^h_2),\ x \in \{\mathit{invariants}\} \vee x \in \{\mathit{moves}\} \tag{3}$$
Fig. 3. (a): Decorator transformation, (b): Block split
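As an illustration of equations (1) and (2), the following Python sketch checks a permutation-style reading of $P$ and evaluates $M$ for two candidate partitions of a toy example. The names `satisfies_p` and `measure_m`, the 0-based positions, and the reading of the uniqueness quantifier as a one-to-one pairing of characters (so that the two strings are permutations of each other) are assumptions made for this example; they are not taken from the MEDITE implementation.

```python
from collections import Counter
from typing import List, Tuple

# (p, q, l): position in the first string, position in the second, length.
Block = Tuple[int, int, int]


def satisfies_p(s: str, t: str) -> bool:
    """Assumed reading of P: the characters of s and t can be paired one-to-one."""
    return Counter(s) == Counter(t)


def measure_m(invariants: List[Block], moves: List[Block]) -> int:
    """Equation (2): total invariant length minus total moved length."""
    return sum(l for _, _, l in invariants) - sum(l for _, _, l in moves)


sh1, sh2 = "abcde", "cdeab"
assert satisfies_p(sh1, sh2)

# Candidate 1: "cde" is kept as the invariant block and "ab" is moved.
m1 = measure_m(invariants=[(2, 0, 3)], moves=[(0, 3, 2)])
# Candidate 2: "ab" is kept as the invariant block and "cde" is moved.
m2 = measure_m(invariants=[(0, 3, 2)], moves=[(2, 0, 3)])
print(m1, m2)  # 1 -1
```

Under this measure the first candidate is preferred, since it keeps the larger block invariant and moves the smaller one.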
Predicate $P$ declares that every character in $s^h_1$ must be present in $s^h_2$ and vice versa. The concatenation of the homologies between $s_1$ and $s_2$ verifies the predicate $P$ because homologies are present in both sequences. Hence a function $\mathit{hom}(s_1, s_2) \rightarrow s^h_1, s^h_2$ transforms the problem of the computation of edit distance with moves into the block permutation problem. The size of the set of all homologies between two related sequences $s_1$ and $s_2$ is exponential. In order to reduce this size, we consider only maximal exact matches (MEMs), which are matches that cannot be extended to the left or to the right without losing the homology property. Non-maximal matches are included in MEMs and are of no interest. Furthermore, this is consistent with, although not sufficient for, the requirement that the homologies extracted by $\mathit{hom}$ be disjoint, because a block cannot be invariant and moved at the same time.
All moves can be seen as a deletion plus an insertion. When a small moved block is situated between two deleted (or inserted) blocks it may be better to consider it as part of the deleted (or inserted) blocks and to merge them. These moved sub-blocks can then be seen as decorators of the enclosing block, rather than as operators like the moved blocks presented so far, as shown in Figure 3(a).
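In order to capture the two different types of move, we introduce typed decorated blocks as a tuple such that:
$$\mathit{block}_{td} : (\mathit{type}, \mathit{begin}, \mathit{end}, \mathit{decorators}) \quad \text{with} \quad \mathit{type} \in \{ins, del, sub, mov, inv\},\quad (1 \le \mathit{begin} < \mathit{end} \le |s_1|) \vee (1 \le \mathit{begin} < \mathit{end} \le |s_2|),\quad \mathit{decorators} = \{(b_d, e_d),\ \mathit{begin} \le b_d < e_d \le \mathit{end}\} \tag{4}$$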
As decoration makes no sense for blocks of type mov and inv, the following restriction is applied to them: $\mathit{decorators} = \emptyset$. An empty block named None is introduced for convenience. This formalism makes it possible to capture homologies considered as blocks of type mov as well as homologies considered as moves inside a block of another type (except inv). Furthermore an alignment $A$ becomes an increasing list of pairs of typed decorated blocks:
$$A \Leftrightarrow [(B_{s_1}, B_{s_2})\ \text{where}\ B_{s_1}\ \text{and}\ B_{s_2}\ \text{are}\ \mathit{block}_{td},\quad 1 \le \mathit{Pred}(B_{s_1})[\mathit{end}] < B_{s_1}[\mathit{begin}] < B_{s_1}[\mathit{end}] < \mathit{Succ}(B_{s_1})[\mathit{begin}] \le |s_1|,\quad 1 \le \mathit{Pred}(B_{s_2})[\mathit{end}] < B_{s_2}[\mathit{begin}] < B_{s_2}[\mathit{end}] < \mathit{Succ}(B_{s_2})[\mathit{begin}] \le |s_2|] \tag{5}$$
where $\mathit{Pred}$ and $\mathit{Succ}$ are the predecessor and successor functions respectively. Blocks of type sub and inv are aligned pairwise whereas other blocks are aligned with None. We name this structure a bi-block list. Hence an alignment algorithm has to build it.
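The typed decorated blocks of equation (4) and the ordering constraint of equation (5) could be represented along the following lines. This is a minimal sketch: the class name `TypedBlock`, the helper `is_increasing` and the use of Python's `None` for the empty block are choices made for this illustration and are not taken from the MEDITE implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

BLOCK_TYPES = ("ins", "del", "sub", "mov", "inv")


@dataclass
class TypedBlock:
    """A typed decorated block (type, begin, end, decorators), 1-based positions."""
    type: str
    begin: int
    end: int
    decorators: List[Tuple[int, int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.type not in BLOCK_TYPES:
            raise ValueError(f"unknown block type: {self.type}")
        if not 1 <= self.begin < self.end:
            raise ValueError("expected 1 <= begin < end")
        if self.type in ("mov", "inv") and self.decorators:
            raise ValueError("mov and inv blocks carry no decorators")
        for bd, ed in self.decorators:
            if not self.begin <= bd < ed <= self.end:
                raise ValueError("decorator outside its block")


# A bi-block pairs a block of s1 with a block of s2; None is the empty block.
BiBlock = Tuple[Optional[TypedBlock], Optional[TypedBlock]]


def is_increasing(bi_blocks: List[BiBlock], side: int, length: int) -> bool:
    """Check the ordering constraint of equation (5) for one sequence (side 0 or 1)."""
    previous_end = 0
    for pair in bi_blocks:
        block = pair[side]
        if block is None:
            continue
        if not previous_end < block.begin:
            return False
        previous_end = block.end
    return previous_end <= length


alignment: List[BiBlock] = [
    (TypedBlock("inv", 1, 5), TypedBlock("inv", 1, 5)),
    (TypedBlock("del", 6, 12, [(7, 9)]), None),  # deleted block with one decorator
    (None, TypedBlock("ins", 6, 9)),
]
print(is_increasing(alignment, 0, 12), is_increasing(alignment, 1, 9))  # True True
```

Validating the constraints in the constructor keeps ill-formed blocks, such as an invariant block with decorators, from entering the list at all.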
3 Algorithm
Our algorithm is an instantiation of the formalism described in Section 2. It processes text in two phases. The first phase solves the block permutation problem and the second processes the remaining text and builds the bi-block list.
3.1 Block Permutation Problem
In bioinformatics most of the recent global aligners [15–17] proceed in three steps:
1. searching for anchors in sequences
2. aligning them in order to determine invariants and moves
3. processing recursively between invariants
We proceed in the same way, as shown below, and these three steps correspond to the hom function in Section 2.
Anchors are homologies under a certain similarity criterion. Our criterion is exact homology, where the two substrings must match exactly. Firstly, a generalized suffix tree [18] is built over the two sequences in order to extract all the homologies. A minimum size parameter is chosen by the user (by default, five characters long). This method generates overlaps between homologies. However, it is necessary to resolve them in order to obtain a proper partition of the complete sequences into disjoint blocks. We use a heuristic based on a property of natural language: if the overlap contains separators, it is better to cut it on one of them, since an inter-word cut is preferable to an intra-word cut. Most of the time this condition is verified, but if not the block is cut arbitrarily.
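The anchor search and the overlap heuristic described above can be sketched in Python as follows. Two caveats apply. First, the sketch replaces the generalized suffix tree by a simple quadratic dynamic programming table of common-suffix lengths, which is easier to read but much slower on long texts. Second, the function names, the separator set, the 0-based half-open intervals and the choice of attaching the separator to the second block are all assumptions made for this illustration.

```python
from typing import List, Tuple

SEPARATORS = set(" \t\n.,;:!?'\"()-")  # assumed separator set


def maximal_exact_matches(s1: str, s2: str, min_len: int = 5) -> List[Tuple[int, int, int]]:
    """Return the maximal exact matches (p, q, l) of length >= min_len, 0-based."""
    mems = []
    prev = [0] * (len(s2) + 1)  # prev[j + 1]: common-suffix length ending at s1[i-1], s2[j]
    for i, c1 in enumerate(s1):
        cur = [0] * (len(s2) + 1)
        for j, c2 in enumerate(s2):
            if c1 != c2:
                continue
            cur[j + 1] = prev[j] + 1  # left-maximal by construction
            length = cur[j + 1]
            at_end = i + 1 == len(s1) or j + 1 == len(s2)
            # Keep the match only if it cannot be extended to the right.
            if length >= min_len and (at_end or s1[i + 1] != s2[j + 1]):
                mems.append((i - length + 1, j - length + 1, length))
        prev = cur
    return mems


def cut_overlap(text: str, first: Tuple[int, int], second: Tuple[int, int]):
    """Make two overlapping intervals of one text disjoint.

    Assumes first starts before second. The cut is placed on the first
    separator found in the overlap, otherwise at the start of the overlap
    (an arbitrary choice).
    """
    (a1, b1), (a2, b2) = first, second
    if a2 >= b1:
        return first, second  # no overlap
    cut = next((k for k in range(a2, b1) if text[k] in SEPARATORS), a2)
    return (a1, cut), (cut, b2)


print(maximal_exact_matches("the curare was injected", "curare was injected in the leg"))
# [(4, 0, 19)]
text = "he saw me there"
print(cut_overlap(text, (0, 10), (6, 15)))  # ((0, 6), (6, 15)): cut on a space
print(cut_overlap(text, (0, 9), (7, 15)))   # ((0, 7), (7, 15)): no separator in "me"
```

In the first example the only match of at least five characters is "curare was injected"; the shorter common string "the " is filtered out by the minimum size parameter. A full implementation would also have to trim the matching interval in the other text whenever a block is cut.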
In the second step, blocks must be aligned to determine which are invariant and which are moved. The space of all possible alignments is explored by an A* procedure using an alignment cost function, a heuristic based on the measure M of Section 2. During the search, an alignment cost is decomposed into the cost of the already aligned blocks plus an estimate of the cost of the remaining blocks to be aligned, such that $\mathit{cost} = \mathit{cost}_{ab} + \mathit{cost}_{rb}$, where $\mathit{cost}_{ab}$ is the sum of the sizes of the moved blocks among the already aligned blocks and $\mathit{cost}_{rb}$ is the sum of the sizes of the blocks in the symmetric difference between the remaining blocks to be aligned in the two sequences. Because it will not be possible to align these remaining blocks in the rest of the alignment process, they will be considered as moved in the final alignment and counted as a penalty due to M. This corresponds to the part function in Section 2.
Finally, these two steps are repeated recursively. The only difference lies in the input sequences. We loop over the alignment resulting from step two and consider the subsequences between each pair of aligned invariant blocks. These subsequences are then used as input to the first step. The output of the recursive steps one and two is an alignment of the two subsequences. Invariants and moves identified in this alignment are added to the main invariant and moved blocks. This recursive step makes it possible to find alignments which would otherwise not have been found.
3.2 Bi-block List Building
The first phase produces two sets of invariant and moved blocks. Text between two invariant blocks in the first sequence is a deleted block and text between two invariant blocks in the second sequence is an inserted block. By definition moved blocks overlap deleted and inserted blocks, hence all moved blocks (identified during the first phase) overlap a deleted or an inserted block and are considered as their decorators by including them in the decorator set of each deleted or inserted block.
Then two heuristics are used to determine substituted blocks and moved blocks considered as operators. If in the bi-block list two bi-blocks of type (del, None) and (None, ins) follow each other then we examine the size of blocks del and ins. If the ratio between their sizes reaches a certain threshold, they are considered as two substituted blocks, and are replaced in the bi-block list by one pair of type (sub, sub) with the same features. By default the ratio is set to 0.5, and the user is free to modify it. For instance, if a bi-block ((del, 'He saw me'), None) is immediately followed in the list by (None, (ins, 'I saw him')) and their size ratio exceeds the threshold then they are replaced by ((sub, 'He saw me'), (sub, 'I saw him')), meaning that 'He saw me' has been replaced by 'I saw him' in the text.
In a similar way, for each block of type del and ins, we examine the ratio between the size of the block and the sum of the sizes of its decorators. If it is above another threshold we split the block into several blocks of type mov or of the original type, such that the intervals covered by these new blocks are the same as the original block. An example is presented in Figure 3(b). This ratio is also set to 0.5 by default and the user can modify it. Finally a bi-block list that defines an alignment between the two sequences results from this phase.
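The first of these two heuristics, the pairing of a deleted block with the inserted block that follows it, can be sketched as below. The text does not say how the size ratio is computed, so the sketch assumes it is the shorter length divided by the longer one, and it treats the threshold as inclusive. Bi-blocks are simplified to pairs of (type, text) tuples, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

SimpleBlock = Tuple[str, str]  # (type, text), a simplified block for this sketch
BiBlock = Tuple[Optional[SimpleBlock], Optional[SimpleBlock]]


def size_ratio(a: str, b: str) -> float:
    """Assumed definition: shorter length divided by longer length."""
    longest = max(len(a), len(b))
    return min(len(a), len(b)) / longest if longest else 0.0


def pair_substitutions(bi_blocks: List[BiBlock], threshold: float = 0.5) -> List[BiBlock]:
    """Replace (del, None) followed by (None, ins) by (sub, sub) when sizes are comparable."""
    result: List[BiBlock] = []
    i = 0
    while i < len(bi_blocks):
        left, right = bi_blocks[i]
        nxt = bi_blocks[i + 1] if i + 1 < len(bi_blocks) else None
        if (left is not None and left[0] == "del" and right is None
                and nxt is not None and nxt[0] is None
                and nxt[1] is not None and nxt[1][0] == "ins"
                and size_ratio(left[1], nxt[1][1]) >= threshold):
            result.append((("sub", left[1]), ("sub", nxt[1][1])))
            i += 2  # both bi-blocks have been consumed
        else:
            result.append(bi_blocks[i])
            i += 1
    return result


blocks: List[BiBlock] = [
    (("inv", "Then "), ("inv", "Then ")),
    (("del", "He saw me"), None),
    (None, ("ins", "I saw him")),
    (("del", " yesterday evening"), None),
    (None, ("ins", "!")),
]
for bi_block in pair_substitutions(blocks):
    print(bi_block)
```

With the default threshold, 'He saw me' and 'I saw him' (ratio 1.0) become a substitution, whereas the long deletion followed by a one-character insertion (ratio 1/18) is left as a separate deletion and insertion.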
4 Experiments and Evaluation
In bioinformatics, the evaluation of sequence alignment, i.e. finding an objective criterion to tell whether an alignment is good or not, is a difficult task and remains an open problem [18, 19]. The first classic measure to evaluate an alignment $A$ of two sequences $s^a_1$ and $s^a_2$ is the character-weight measure
$$M_2(s^a_1, s^a_2) = \sum_{k=1}^{|A|} S(s^a_1[k], s^a_2[k])$$
where $S$ is a scoring matrix between two characters of the alphabet such as Dayhoff or BLOSUM matrices. These matrices encode the probability of the substitution of one character for another. In natural language such matrices do not make sense. The second classic measure is an operator-weight measure where a weight is assigned to each kind of edit operation, such as
$$M_3(s^a_1, s^a_2) = \sum_{b_i \in \{\mathit{invariants}\}} W_i |b_i| - \sum_{b_d \in \{\mathit{deletions}\}} W_d |b_d| - \sum_{b_s \in \{\mathit{substitutions}\}} W_s |b_s| - \sum_{b_{ins} \in \{\mathit{insertions}\}} W_{ins} |b_{ins}| - \sum_{b_m \in \{\mathit{moves}\}} W_m |b_m|.$$
Our measure $M$ is similar to $M_3$ but it cannot be used for evaluation for three reasons. Firstly, in our algorithm, $M$ drives the alignment process, so using it to evaluate itself is of no interest. Secondly, this measure gives a blind evaluation of the alignment as it is character-based and counts each block of each type but this does not evaluate the relevance of the resulting alignment. Thirdly, there exists no processable representation of the alignments for the applications we tested in Section 4.1 (except for ours), because this information is encoded only in the visualization interface. Furthermore, there exists no annotated unilingual corpus (a corpus where texts would have been aligned correctly by human annotators) which could be considered as a reference corpus or a gold standard. Hence evaluation cannot be based on measures such as precision, recall, BLEU, BLANC or ROUGE [20]. These facts led us to evaluate our system in different ways.
4.1 Benchmark
MEDITE has been compared with ten aligners, the most famous being the one present in Microsoft Word. For each application, four file comparisons were made, where three points were tested (identified with capitalized letters below). The first comparison is between two versions of a Python language source file. In the second version, large pieces of text were inserted at the beginning and the end of the file, and a lot of lines were modified in the body of the file. These modifications occur mainly line by line, though some occur within lines. This comparison was expected to be easy and serves as a baseline. To pass the first test, inserted and deleted paragraphs must be found (A); for the second test, line by line alignment must be correct (B) and for the third test intra-line modifications must be found (C).
The second comparison is between two versions of a short story by Pascal Quignard entitled “Bernon l’Enfant”. Small modifications of some characters were introduced throughout the text. Lexical words were changed, misspellings corrected and words moved. The goal is to find such modifications. Paragraphs must be aligned (D); word modifications must be found (E) and character modifications must be found (F).
The third comparison is between a news agency dispatch and an article which is rather different but derived from it. Two paragraphs were kept with some internal modifications, and the remaining text was replaced completely by another one. The two paragraphs must be aligned (G); modifications inside these paragraphs must be found (H) and similar lexical words must be found (I).
The fourth comparison is the one described in the introduction. Texts from Claude Bernard’s experiment notebooks and their synthesis must be aligned. This task is very hard because the existing content remained the same but the form changed and new content was inserted. Paragraphs must be aligned (J); word groups must be aligned (K) and isolated words must be aligned (L).
Table 1. Benchmark Results
Application          A  B  C  D  E  F  G  H  I  J  K  L  Total
MEDITE               1  1  1  1  1  1  1  1  1  1  1  1  12
DiffDoc              1  1  0  1  1  0  1  1  0  1  1  0   8
Word                 1  1  1  1  1  0  0  0  0  1  1  0   7
Compare It           1  1  1  1  1  1  0  0  0  0  0  0   6
Araxis Merge         1  1  1  1  0  0  0  0  0  0  0  0   4
Beyond Compare       1  1  1  1  0  0  0  0  0  0  0  0   4
Visual Comparer      1  1  0  1  0  0  0  0  0  1  0  0   4
Compare Suite        1  1  0  1  0  0  0  0  0  0  0  0   3
WinMerge             1  1  0  1  0  0  0  0  0  0  0  0   3
Active File Compare  1  1  0  0  0  0  0  0  0  0  0  0   2
Perforce P4diff      1  1  0  0  0  0  0  0  0  0  0  0   2
The results of this experiment are presented in Table 1; detailed results for each application are accessible at http://www-poleia.lip6.fr/~bourdaillet/comparison. Line by line Python code alignment (A and B) is correct for all the applications, but intra-line modifications (C) are detected by only about half of them. This is a problem since intra-line modifications are necessary to detect a variable name change for instance. Only four applications detect word changes in test E and only MEDITE and Compare It detect character changes (F). The others detect character changes as word changes, whereas often only one or two characters have been modified. By contrast MEDITE focuses on the modified characters.
For the third comparison, only DiffDoc and MEDITE align the two paragraphs (G) and find small internal modifications (H). All the other applications fail to detect this. This test is useful because the longest invariant sequence is 752 characters long for two texts of 14 KB and 18 KB, and so represents about 5% of the size of each file. As it does not change, we could expect all software to find it but only two of them do. Because the themes of the two texts are related, common lexical words are used in the remainder of the texts but only MEDITE aligns them correctly (I). The fourth comparison is the hardest one. Paragraphs are aligned correctly only by DiffDoc, MEDITE and Word (J). Several word groups are aligned by DiffDoc and Word but a lot are missed (K). We know they are missed because MEDITE detects them. As DiffDoc and Word miss numerous word groups, they miss isolated word changes whereas MEDITE aligns them pairwise correctly (L). The absence of these alignment anchors results in a bad alignment because a lot of information is not discovered, and it harms the readability of the alignment. Our result can be viewed in Figure 2. The worse the texts are aligned, the worse the visualization. In earlier versions of MEDITE [2] we had similar problems but the introduction of recursion in our algorithm enabled us to address them. None of the applications except MEDITE detects moved blocks, though we have already said that this is crucial for philology. It also matters for source code comparison: detecting that a code line has been moved from one function to another is an important piece of information. It is also important for any natural language text, because it makes it possible to detect rearrangements of ideas, for instance.
4.2 Visualization
As in the case of bioinformatics [21], visualization is an important criterion for the evaluation of text alignment applications. Human judges can evaluate a natural language alignment empirically but in order to do that a good visualization interface is mandatory. Figure 2 presents MEDITE’s visualization interface. Although the figure is small it can be seen that the colors clearly identify the different types of blocks. Deleted blocks are red (or grey in the grayscale printed proceedings), inserted blocks are green (light grey) and substituted blocks are blue (dark grey) while invariant blocks remain black and white. These colors can be customized by the user. Moved blocks are underlined and have a bold font, enabling decorators to be represented. Applications tested in Section 4.1 have poor visualization in comparison to MEDITE. Not only do bad alignments result in bad visualization but, in addition, graphical user interfaces (GUIs) are generally ill-suited. Another serious problem is that a lot of them present a merged text that mixes deletions and insertions: when texts are very different, the visualization is bad, as is the case with Word in Figure 1.
In MEDITE, when the user clicks on an invariant block its corresponding block is presented side-by-side in the other window. It is thus possible to browse the text in an intuitive way, following the blocks the user is interested in. This differs from other applications, where scroll bars are locked, so when big parts of text are deleted or inserted it is sometimes impossible to look at them side-by-side. MEDITE also generates an HTML report which is a direct visualization of the bi-block list. Each block is displayed with its match and both are colored according to their type. This kind of visualization can be useful especially for source code.
5 Conclusion
This paper presents a textual alignment system and addresses the problem of sequence alignment when applied to natural language. We show that it can be very difficult and that results from existing aligners are not satisfactory for texts studied by textual genetic criticism where there are a lot of repeated blocks. Our experiments show that both existing algorithms and their visualization give poor results. Only two systems, DiffDoc and Word, compete with MEDITE, and both still perform worse. We present a method to detect moved blocks in textual comparisons; none of the applications we tested was able to do this. In addition the way we decompose moved blocks into operators and decorators enables users to handle them as they wish: if the user considers move detection more important, operators will be favored by shifting up the ratio and vice versa. It is interesting to remark that this is a direct application of a current theoretical problem, edit distance with moves. This problem is harder in bioinformatics due to the huge quantity of data but it is viable in the area of file comparison. We are also interested in medieval philology where spelling was not stable. Between two text versions, a word could be spelled in different ways because the copyist could decide arbitrarily to modify it. The challenge is to align such text versions correctly despite these difficulties. More generally, this problem is interesting because sequence alignment is an old problem but texts resulting from genetic criticism have shown hard cases that were handled incorrectly by classic file comparison tools. In addition, and this was the original aim of this work which is now completed, natural language processing brings new facilities to researchers in textual genetic criticism via a tool such as MEDITE.
References
1. Ganascia, J.G., Fenoglio, I., Lebrave, J.L.: Manuscrits, genèse et documents numérisés. EDITE: une étude informatisée du travail de l’écrivain. Document Numérique 8(4) (2004) 91–110
2. Ganascia, J.G., Bourdaillet, J.: Alignements unilingues avec MEDITE. In: Huitièmes Journées Internationales d’Analyse Statistique des Données Textuelles, To appear. (2006)
3. Deppman, J., Ferrer, D., Groden, M., eds.: Genetic Criticism - Texts and Avant-textes. University of Pennsylvania Press (2004)
4. Hay, L., ed.: Essais de critique génétique. Flammarion, coll. Textes et Manuscrits (1979)
5. de Biasi, P.M.: La Génétique des Textes. Nathan Université (2000)
6. Hunt, J.W., McIlroy, M.D.: An Algorithm for Differential File Comparison. Technical Report CSTR 41, Bell Laboratories, Murray Hill, NJ (1976)
7. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press (1999)
8. Smith, T.F., Waterman, M.S.: Identification of Common Molecular Subsequences. Journal of Molecular Biology 147 (1981) 195–197
9. Needleman, S., Wunsch, C.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48(3) (1970) 443–453
10. Smit, A.F.: Identification of a new, abundant superfamily of mammalian LTR-transposons. Nucleic Acids Res 21 (1993) 1863–72
11. Tichy, W.F.: The String-to-String Correction Problem with Block Moves. ACM Trans. Comput. Syst. 2(4) (1984) 309–321
12. Lopresti, D.P., Tomkins, A.: Block Edit Models for Approximate String Matching. Theor. Comput. Sci. 181(1) (1997) 159–179
13. Shapira, D., Storer, J.A.: Edit Distance with Move Operations. In Apostolico, A., Takeda, M., eds.: CPM. Volume 2373 of Lecture Notes in Computer Science, Springer (2002) 85–98
14. Kaplan, H., Shafrir, N.: The greedy algorithm for edit distance with moves. Information Processing Letters 97(1) (2006) 23–27
15. Delcher, A.L., Kasif, S., Fleischmann, R.D., Peterson, J., White, O., Salzberg, S.L.: Alignment of whole genomes. Nucl. Acids. Res. 27(11) (1999) 2369–2376
16. Bray, N., Dubchak, I., Pachter, L.: AVID: A Global Alignment Program. Genome Res. 13(1) (2003) 97–102
17. Darling, A.C., Mau, B., Blattner, F.R., Perna, N.T.: Mauve: Multiple Alignment of Conserved Genomic Sequence With Rearrangements. Genome Res. 14 (2004) 1394–1403
18. Gusfield, D.: Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press (1997)
19. Batzoglou, S.: The many faces of sequence alignment. Briefings in Bioinformatics 6(1) (2005) 6–22
20. Lita, L., Rogati, M., Lavie, A.: BLANC: Learning Evaluation Metrics for MT. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, British Columbia, Canada, Association for Computational Linguistics (2005) 740–747
21. Raghava, G., Searle, S.M., Audley, P.C., Barber, J.D., Barton, G.J.: OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics 4(47) (2003)