Approximate Name Matching
Finding similar personal names in large international name lists
MENGMENG DU
Master’s Degree Project Stockholm, Sweden 2005
TRITA-NA-E05137
Numerisk analys och datalogi KTH 100 44 Stockholm
Department of Numerical Analysis and Computer Science Royal Institute of Technology SE-100 44 Stockholm, Sweden
Master’s Thesis in Computer Science (20 credits)
at the School of Computer Science and Engineering, Royal Institute of Technology, 2005
Supervisor at Nada: Viggo Kann
Examiner: Stefan Arnborg
Abstract

Looking up a person’s name in a list is a common operation in information systems. The list can be a customer or employee database, a phone directory, a passenger list, etc. Finding an exact match to a certain name is easy. Personal names, however, are easily misspelled, especially in an international setting. Hence, there is a need for a solution which can find approximate matches to a name in a list. In this thesis, several algorithms for approximate name matching are evaluated. The goal was to find a solution which is fast and language independent, suited for frequent, effective lookups in large lists of international names. Of the different approaches evaluated, the algorithms based on edit distance give the most effective results, while being completely language independent. Among these, reverse edit distance with a Bloom filter is the best solution if speed and minimising the number of irrelevant matches returned are most important. If maximising the number of relevant matches retrieved is more important than speed and precision, edit distance with a trie data structure and constant first letter is better.
Approximativ namnmatchning
Sökning av liknande namn i stora internationella namnlistor

Sammanfattning (translated from Swedish)

Looking up a personal name in a list is a common operation in information systems. The list can be a database of customers or employees, a phone directory, a passenger list, etc. Finding an exact match to a given name is easy. Personal names, however, are often misspelled, especially in an international setting. There is therefore a need to find approximate matches to a name in a list. In this report, several algorithms for approximate name matching are evaluated. The goal was to find a solution which is fast and language independent, suitable for frequent, effective lookups in large lists of names from different parts of the world. Of the approaches evaluated, the algorithms based on edit distance give the most effective results while being completely language independent. Among these, reverse edit distance with a Bloom filter is the best solution if speed and minimising the number of irrelevant matches are most important. If maximising the number of relevant matches is more important than speed and precision, edit distance with a trie and constant first letter is the better alternative.
Acknowledgements

The project this thesis is based on was carried out at Amadeus, Sophia Antipolis, France, between February and July 2005. I would like to thank Professor Viggo Kann, my supervisor at Nada, for his excellent advice and invaluable support. I would also like to thank Svend Fjerdingstad, my supervisor at Amadeus, for being there throughout the project, and Loic Taloc in my team, who helped me prepare the test queries used for evaluating various approaches.
Contents

1 Introduction
  1.1 Problem Definition
  1.2 Purpose
  1.3 Limitations of Scope
  1.4 Method
  1.5 Thesis Outline

2 Spelling Personal Names
  2.1 Variation in Names
  2.2 Spelling Error Patterns

3 Name Matching Algorithms
  3.1 Minimum Edit Distance Algorithms
    3.1.1 Damerau’s Algorithm
    3.1.2 Damerau-Levenshtein Metric
    3.1.3 Weighted Edit Distance
    3.1.4 Edit Distance with Upperbound and Cut-Off Criterion
    3.1.5 Edit Distance with a Trie
  3.2 Similarity Key Algorithms
    3.2.1 Phonetic Algorithms
    3.2.2 Knowledge-Based Algorithms
  3.3 N-gram Analysis
    3.3.1 N-gram Similarity Measures
    3.3.2 Variations
  3.4 Rule-Based Algorithms
  3.5 Probabilistic Algorithms
  3.6 Neural Nets
  3.7 Hybrid Algorithms

4 List Partitioning
  4.1 First Letter
  4.2 Word Length
  4.3 Word Halves
  4.4 Similarity Keys
  4.5 N-grams
  4.6 Bloom Filter

5 Selecting Suitable Algorithms
  5.1 Measuring Performance
    5.1.1 Speed
    5.1.2 Effectiveness
  5.2 Selection
    5.2.1 Limitations
    5.2.2 Selecting the Best Algorithms
    5.2.3 General Comparisons
  5.3 Algorithms Selected

6 Implementation
  6.1 General
  6.2 Edit Distance (Bloom Filter)
  6.3 Edit Distance (Trie)
  6.4 Soundex
  6.5 N-gram Analysis

7 Evaluation of Selected Algorithms
  7.1 Test Lists and Queries
  7.2 Evaluating Performance
    7.2.1 Speed
    7.2.2 Effectiveness
  7.3 Results
    7.3.1 Speed
    7.3.2 Effectiveness
    7.3.3 Performance Averages

8 Discussion
  8.1 General
  8.2 Subjectiveness of Measured Effectiveness
  8.3 Defining Acceptable Performance
  8.4 Intra-Category Algorithm Comparisons
    8.4.1 Edit Distance Algorithms
    8.4.2 Soundex Algorithms
    8.4.3 N-gram Analysis
  8.5 Inter-Category Algorithm Comparisons
  8.6 Spelling Error Patterns for Names

9 Conclusion
  9.1 Recommendations
  9.2 Future Work

Bibliography

A More Evaluation Results

List of Figures

3.1 Calculating the minimum edit distance between Filips and Phillips.
3.2 Matching Philips in an example edit distance trie, t = 2.
5.1 The two effectiveness measures, precision and recall.
7.1 Time for matching a single name.
7.2 Retrieval effectiveness.

List of Tables

3.1 Soundex phonetic codes.
3.2 Phonix phonetic codes.
7.1 Matching speed.
7.2 Performance averages.
A.1 Initialisation speed.
A.2 Build speed.
Chapter 1

Introduction

1.1 Problem Definition
Finding a person’s name in a list is a common operation in information systems. A match is easily found if the name searched for is entered exactly as it is recorded in the list. Personal names, however, are easily misspelled. Hence, there is a need for a matching algorithm capable of finding similar variations of the name, i.e. performing approximate name matching. Amadeus, the initiator of the project presented in this thesis, is the leading global distribution system provider, serving the marketing, sales, and distribution needs of the world’s travel and tourism industries in over 210 markets. The company has many applications which deal with lists of names of various origins, such as passenger lists and customer profile databases. Amadeus requires an efficient and language independent approximate name matching solution which can handle large amounts of data. An example application is travel reservations, where it is sometimes necessary to issue various kinds of alerts, for both commercial and safety purposes, if the customer’s name is a close variation of some name in a predefined watch list.
1.2 Purpose
The main purpose of this thesis is to propose an approximate name matching solution suited for efficient, high-volume searches of personal names in an international environment. An ideal solution would be both language independent and fast, since the name list could be large (around 100 000 items) and contain names from all around the world. It is also desirable to retrieve as many relevant matches as possible, while keeping the number of irrelevant matches found low.
1.3 Limitations of Scope
In this thesis, we assume that all names use the standard English alphabet of 26 letters. Only personal surnames are matched. A solution capable of matching surnames effectively could most likely be extended to handle complete personal names, should this be required in the future. Finally, no commercial applications are considered, in order to keep costs down and avoid becoming dependent on a single system.
1.4 Method
The following tasks were performed as part of this Master’s project:

1. Make a literature survey of existing approximate name matching algorithms.
2. Select a few possible algorithms for evaluation.
3. Gather testing data.
4. Implement the selected algorithms in C++ in a Unix environment.
5. Perform comparative tests of the implementations on the test sample.
6. Evaluate the testing results with regard to speed and retrieval effectiveness.
7. Propose a final solution.
1.5 Thesis Outline
In Chapter 2, we look at how names can be misspelled and some existing error patterns. Chapter 3 contains a survey of existing algorithms for approximate name matching. Chapter 4 gives a short overview of different ways to partition a list for fast retrieval. After covering existing research, we select in Chapter 5 the algorithms best suited for matching similar names. Implementation details are described in Chapter 6. In Chapter 7, we try the selected algorithms on some test name lists and present the evaluation results. These are discussed in Chapter 8. Finally, in Chapter 9, we conclude this thesis by proposing a final solution and discussing some possible future improvements.
Chapter 2

Spelling Personal Names

Personal names are given to individuals and act as an identifier or label for the individual. According to [21], common components of personal names are a surname (used by all members of a family), a given name (identifying the individual), additional given names or middle names, and in some countries also titles. Nicknames can sometimes substitute an individual’s real name. As mentioned in Section 1.3, only surnames are considered in this study.
2.1 Variation in Names
An approximate name matching algorithm must be able to allow for variations in how a name is written. There may be several legitimate spelling or phonetic variations of a name (e.g. Filips–Phillips, Ericson–Eriksson). In an international setting, there can also be many ways of transcribing names usually not written in Latin letters, with different countries having their own transcription systems (e.g. Koo–Gu, Mohammad– Imhemed). A name can also be simply misspelled when entered into a directory. Kukich [8] mentions three different types of misspellings. Typographic errors (e.g. the–teh) are mainly due to motor coordination slips. It is assumed that the writer knows the correct spelling. Cognitive errors (e.g. receive–recieve) are due to misconceptions or a lack of knowledge on the part of the writer about the correct spelling. Phonetic errors (e.g. naturally–nacherly) occur when the writer substitutes a phonetically correct but alphabetically incorrect sequence of letters for the intended word. It is usually difficult to distinguish which category a misspelling belongs to. Phonetic errors tend to distort spellings more than typographic and cognitive errors.
2.2 Spelling Error Patterns
According to Damerau [1] and Peterson [13], more than 80 percent of all spelling errors fall into one of four classes of single-error misspellings. These four classes are
(1) substitution of a single letter, (2) omission of a single letter, (3) insertion of a single letter, and (4) transposition of two adjacent letters. Damerau states that these errors arise from misreading, hitting a key twice, or letting the eye move faster than the hand.

Peterson finds that one wrong letter is the most common error, followed by omission of a letter, insertion of an extra letter, and transposition. These errors constitute over 90 percent of the errors in his study. The next most common errors are cases of multi-error misspellings: two extra letters, two missing letters, and two letters transposed around a third. Peterson claims that there is most probably an error distribution for each letter, due to keyboard layout (an r is more likely to be mistyped as a t than as a p on a qwerty keyboard).

Kukich [8] references a study by Grudin showing that 58 percent of all substitution errors made when typewriting involve adjacent keys. She states, however, that the error distribution can vary greatly, depending on the technique used to enter words. An optical character recogniser (OCR) is more likely to confuse letters which look similar to each other (e.g. u–v). A human ear might misinterpret similar sounding letters (e.g. d–t, m–n).

In Pollock and Zamora’s study [15] on scientific texts with 50 000 misspelled words, over 90 percent of the misspelled words have a single error. Another interesting result of the study is that about 20 percent of the errors occur in the third position of the word, while only 8 percent of the errors occur at the beginning. Kukich agrees with Pollock and Zamora that few misspellings occur in the first letter of a word.

Since all the studies presented above are on words, it is unclear to what extent their results hold when personal names are involved. They are, nonetheless, interesting and should provide at least an indication of how names tend to be misspelled.
Chapter 3

Name Matching Algorithms

In this chapter, we look at some existing algorithms for approximate name matching. The problem of finding similar names can be viewed as a special case of the more general approximate string matching problem, which has been studied ever since the 1960s. Research has focused on methods for automatic spelling correction and text recognition. Matching a certain name with names in a predefined list corresponds to finding possible corrections to a misspelled word among dictionary entries. The algorithms which have been devised can in general be divided into six different categories:

• Minimum edit distance algorithms
• Similarity key algorithms
• N-gram analysis
• Rule-based algorithms
• Probabilistic algorithms
• Neural nets
3.1 Minimum Edit Distance Algorithms
According to Kukich [8], the most studied spelling correction algorithms are those calculating edit distances between the misspelled word and words in the dictionary. Edit distance is the number of simple edit operations required to transform one string into another. The operations allowed are substitution of a letter, deletion of a letter, insertion of a letter, and transposition of two adjacent letters. The smaller the edit distance between two strings, the more similar the strings are to each other.
3.1.1 Damerau’s Algorithm
Damerau [1] pioneered minimum edit distance for spelling correction in 1964. His idea was to generate all possible variations of the misspelled word which are at most a single error apart from the word. These correction candidates are then tested against the dictionary, using exact matching. With a word of length n and an alphabet of l letters, there are l(2n + 1) + n − 1 possible corrections to test.1 Using the English alphabet, a ten-letter word would yield 555 words to be matched against the dictionary, and a twenty-letter word would give 1 085 variations. Some of these variations could be discarded before testing against the dictionary due to impossible letter combinations, i.e. those combinations not existent in the dictionary (see e.g. Kann et al. [6]).
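Damerau’s candidate-generation step can be sketched as follows. This is an illustrative Python sketch, not the thesis’s C++ implementation; duplicates are kept so that the length of the result matches the l(2n + 1) + n − 1 ways of making a single error.

```python
def single_error_variants(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Generate every way of making one edit to `word` (Damerau, 1964).

    Duplicates are kept, so for a word of length n over l letters the
    result has exactly l(2n + 1) + n - 1 entries.
    """
    n = len(word)
    variants = []
    # (n - 1) transpositions of two adjacent letters
    for i in range(n - 1):
        variants.append(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    # n omissions of a single letter
    for i in range(n):
        variants.append(word[:i] + word[i + 1:])
    # l(n + 1) insertions of one letter
    for i in range(n + 1):
        for a in alphabet:
            variants.append(word[:i] + a + word[i:])
    # (l - 1)n substitutions of one letter for another
    for i in range(n):
        for a in alphabet:
            if a != word[i]:
                variants.append(word[:i] + a + word[i + 1:])
    return variants

print(len(single_error_variants("washington")))  # 26*21 + 9 = 555
```

Each candidate would then be looked up in the dictionary with exact matching, optionally after discarding impossible letter combinations.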
3.1.2 Damerau-Levenshtein Metric
Damerau’s pioneering algorithm uses a reverse technique to generate possible corrections and then match these to the dictionary. Kukich [8] mentions several other variants of edit distance algorithms which use edit distance differently. These algorithms allow multiple errors and use the Damerau-Levenshtein metric to calculate the similarity between two words x and y. This way, all entries in the dictionary can be ranked by their proximity to the query word. The metric is, according to Pfeifer et al. [14], given by

    d(0, 0) = 0
    d(i, j) = min[ d(i−1, j) + 1,
                   d(i, j−1) + 1,
                   d(i−1, j−1) + c(x_i, y_j),
                   d(i−2, j−2) + c(x_{i−1}, y_j) + c(x_i, y_{j−1}) + 1 ].      (3.1)

The function d gives the minimum edit distance between the two words x and y, or in other words, the minimum number of edit operations required to transform x to y. If the length of x is |x|, and the length of y is |y|, then the minimum edit distance between x and y can be expressed as d(|x|, |y|). x_i denotes the ith letter of the word x. The function c is defined as

    c(x_i, y_j) = 0 if x_i = y_j, and 1 if x_i ≠ y_j.      (3.2)

The first expression in the min function in Equation 3.1 regards insertion, the second omission, the third substitution, and the last transposition.

The Damerau-Levenshtein metric is most commonly calculated using dynamic programming. Wagner and Fischer’s algorithm [18] from 1974 is one of the best known, allowing the edit operations insertion, substitution, and omission. It was later extended by Lowrance and Wagner [10] to also include transposition. The time complexity for calculating the edit distance between two words x and y with their dynamic programming approach is O(|x||y|). A zero-indexed matrix with dimensions (|x| + 1) × (|y| + 1) is used. The matrix is filled according to Equation 3.1 either one column or one row at a time. The matrix element at (|x|, |y|) gives the minimum edit distance between the two words x and y. Figure 3.1 shows an example dynamic programming matrix for calculating the edit distance between the two names Filips and Phillips.

        ε   p   h   i   l   l   i   p   s
    ε   0   1   2   3   4   5   6   7   8
    f   1   1   2   3   4   5   6   7   8
    i   2   2   2   2   3   4   5   6   7
    l   3   3   3   3   2   3   4   5   6
    i   4   4   4   3   3   3   3   4   5
    p   5   4   5   4   4   4   4   3   4
    s   6   5   5   5   5   5   5   4   3

    Figure 3.1: Calculating the minimum edit distance between Filips and Phillips.

In general, the whole dictionary has to be processed in order to find the possible matches to a word, which makes matching rather slow.

1. There are n − 1 possible ways to transpose two adjacent letters in the string, n ways to omit a single letter, l(n + 1) ways to insert one letter, and (l − 1)n ways to substitute one letter for another. The total number of ways to make a single error is thus (n − 1) + n + l(n + 1) + (l − 1)n = l(2n + 1) + n − 1.
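The dynamic programming computation can be sketched as follows. This is an illustrative Python sketch (the thesis implementations were in C++); the transposition case of Equation 3.1 is applied here in the common restricted form, i.e. only when two adjacent letters are actually swapped.

```python
def damerau_levenshtein(x, y):
    """Damerau-Levenshtein distance computed by dynamic programming."""
    # d[i][j] = minimum edits needed to turn x[:i] into y[:j]
    d = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        d[i][0] = i                                   # i omissions
    for j in range(len(y) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1   # the function c
            d[i][j] = min(d[i - 1][j] + 1,            # omission
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution
            if (i > 1 and j > 1 and x[i - 1] == y[j - 2]
                    and x[i - 2] == y[j - 1]):        # transposition
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(x)][len(y)]

print(damerau_levenshtein("filips", "phillips"))  # 3, as in Figure 3.1
```

For Filips and Phillips the function returns 3, the value in the bottom-right cell of Figure 3.1.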
3.1.3 Weighted Edit Distance
As mentioned in Section 2.2, some letters are more easily confused with each other, due to e.g. keyboard layout, similar shapes, or phonetic similarity. In the Damerau-Levenshtein metric presented in Section 3.1.2, all errors are given the same cost, or distance measure. For a more realistic cost measure, however, unusual errors should have a higher cost than common mistakes. As an example, both conspiracy and conspirxcy have edit distance 1 to conspiricy according to the standard Damerau-Levenshtein metric, but the first string is obviously a better match than the second. According to Kukich [8], statistical data from spelling errors could be used to derive suitable distance costs between any two letters. This information could be stored in an l × l matrix, where l is the number of letters in the alphabet, and the element (i, j) corresponds to the distance cost between the ith and the jth letter. A vector with l elements would also be needed to hold the distance costs between a letter and no letter, for handling insertion and omission. This approach is very sensitive, since the distribution of errors varies depending on e.g. the input method (see Section 2.2), the language used, and the kind of text involved. The statistics used to derive distance costs must be tailored to the application in mind.
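A minimal sketch of this idea follows. The cheap letter pairs below are invented for illustration only; a real system would derive the full cost matrix from spelling-error statistics, as Kukich suggests.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

# Substitution costs between any two letters (the l x l matrix).
# The confusable pairs below are illustrative assumptions, not data.
SUBST_COST = {(a, b): (0.0 if a == b else 1.0)
              for a in ALPHABET for b in ALPHABET}
for a, b in [("a", "i"), ("e", "i"), ("m", "n"), ("d", "t"), ("u", "v")]:
    SUBST_COST[(a, b)] = SUBST_COST[(b, a)] = 0.5

def weighted_distance(x, y, ins_cost=1.0, del_cost=1.0):
    """Weighted edit distance (insertion, omission, substitution only).

    The per-letter insertion/omission costs (the l-element vector
    mentioned in the text) are collapsed into two flat costs here.
    """
    d = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, len(y) + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d[i][j] = min(d[i - 1][j] + del_cost,
                          d[i][j - 1] + ins_cost,
                          d[i - 1][j - 1] + SUBST_COST[(x[i - 1], y[j - 1])])
    return d[len(x)][len(y)]

# conspiracy is now a closer match to conspiricy than conspirxcy is:
print(weighted_distance("conspiricy", "conspiracy"),   # 0.5
      weighted_distance("conspiricy", "conspirxcy"))   # 1.0
```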
3.1.4 Edit Distance with Upperbound and Cut-Off Criterion
Du and Chang [2] present several algorithms to improve the search time of the edit distance approach. One algorithm is built on the idea that in order for the words x and y to be a match within a maximum edit distance t, the following condition must hold:

    |x| − t ≤ |y| ≤ |x| + t.      (3.3)
It is assumed that the words in the dictionary are partitioned into groups by word length. If t is the error threshold, and n is the length of the misspelled word, then the groups with word length m, where n − t ≤ m ≤ n + t, are searched for close matches. If the distance between two words is found to be greater than t during calculation of the dynamic programming matrix, the attempt is aborted, and the search moves on to the next dictionary word.
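Both ideas, searching only the length groups allowed by Equation 3.3 and aborting a matrix computation as soon as every value in the current row exceeds t, can be sketched as follows (an illustrative Python sketch using insertion, omission, and substitution only; the name list is a hypothetical example):

```python
from collections import defaultdict

def bounded_distance(x, y, t):
    """Edit distance capped at t; returns None as soon as it exceeds t."""
    prev = list(range(len(y) + 1))
    for i in range(1, len(x) + 1):
        curr = [i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        if min(curr) > t:        # cut-off criterion: abort this word early
            return None
        prev = curr
    return prev[len(y)] if prev[len(y)] <= t else None

def search(dictionary_by_length, query, t):
    """Search only the length groups m with n - t <= m <= n + t."""
    n = len(query)
    matches = []
    for m in range(n - t, n + t + 1):
        for word in dictionary_by_length.get(m, ()):
            if bounded_distance(query, word, t) is not None:
                matches.append(word)
    return matches

# Build the length-partitioned dictionary once:
by_length = defaultdict(list)
for name in ["filips", "fillips", "phillip", "phillips", "phan"]:
    by_length[len(name)].append(name)
print(search(by_length, "philips", 2))  # ['filips', 'phillip', 'phillips']
```

Note that phan (length 4) is never even compared against the length-7 query, and fillips is abandoned as soon as its distance is known to exceed t = 2.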
3.1.5 Edit Distance with a Trie
Navarro and Baeza-Yates [12] propose an edit distance algorithm using the trie data structure to improve search time. A trie is built on the set of words in the dictionary. This trie is a tree with labelled edges where every node corresponds to a unique prefix of one or more words. The root corresponds to the empty string ε. If a node corresponding to the string z has a child by an edge labelled a, then that child node corresponds to the string za. The leaves of the trie correspond to complete words in the dictionary. There may, however, also be non-leaf nodes corresponding to complete dictionary words.

Each node of the trie represents a unique prefix of the set of dictionary words. Therefore, the trie can be used to avoid repeated calculations of edit distance between the query word and the shared prefixes of many words in the dictionary. When matching a word x, the trie is traversed depth first. For every node, a new column of the dynamic programming matrix for the branch the node belongs to is calculated according to Section 3.1.2. If the node corresponds to the prefix z, then the matrix filled so far gives the minimum edit distance between x and z.

First, the initial matrix column is filled. This column corresponds to the trie root, i.e. the empty string ε, which is a common prefix of all words. The branches of the trie are then visited recursively. Child nodes generate their column from the columns of their predecessors. As a result, two words having a common prefix will also share the matrix columns up to the column corresponding to that prefix. When the node marked as the end of a word y is reached, the last cell of the newly computed column gives the minimum edit distance between the words x and y. If it is within a given error threshold, y is an approximate match to x. It is often possible to determine before reaching the end of a word whether the current branch can produce a relevant match.
If all values in the current column are larger than the error threshold, then a match cannot occur, since the edit cost can only increase or remain constant further on. The rest of the branch can therefore be skipped.

Figure 3.2 shows how a trie containing the words Filips, Fillips, Phillip, Phillips, and Phan is searched for approximate matches to Philips, assuming an error threshold t = 2. The trie is shown in Figure 3.2a. Nodes corresponding to the end of a name are shaded. The edit distance matrices between each name in the trie and the query name Philips are shown in Figures 3.2b–f. The trie is visited depth first, left to right, and matrices are filled column by column. Thus, Matrix 1 in Figure 3.2b is calculated first, then Matrix 2 in Figure 3.2c, etc. The matches with edit distance d ≤ t = 2 are Filips, Phillip, and Phillips.

In Figures 3.2b–f, if a matrix column is marked with a (?), then that column has already been calculated earlier and can be reused. A (×) over a column means that the column is never actually calculated: before it is reached, the branch has already been judged not to contain a match and has been cut. As an example, in the edit distance matrix calculations for Philips–Fillips in Figure 3.2c, it is already possible to see in the fifth column that all column values exceed the tolerated error threshold t = 2. The rest of that trie branch is therefore skipped, and the remaining matrix columns are never filled out.

In the above example, only 18 matrix columns need to be calculated when using the trie data structure. The traditional edit distance approach in Section 3.1.2 would have required 37 column computations. This difference is due to the trie’s ability to take advantage of shared prefixes and to cut away irrelevant branches early on. With a larger word dictionary, there would be more shared prefixes, and the reduction in the number of computed matrix columns would be even more noticeable.
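The trie traversal can be sketched as follows. This is an illustrative Python sketch, not the thesis’s C++ implementation: it uses insertion, omission, and substitution only, a dict-based trie, and a "$" end-of-word marker, all of which are implementation choices made here.

```python
def build_trie(words):
    """Each node is a dict of child edges; "$" marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def trie_search(root, query, t):
    """Depth-first traversal, computing one new DP column per trie node."""
    matches = []
    first_col = list(range(len(query) + 1))   # column for the empty prefix

    def visit(node, prefix, col):
        # col[i] = edit distance between query[:i] and this node's prefix
        if node.get("$") and col[len(query)] <= t:
            matches.append(prefix)
        for ch, child in node.items():
            if ch == "$":
                continue
            new = [col[0] + 1]
            for i in range(1, len(query) + 1):
                cost = 0 if query[i - 1] == ch else 1
                new.append(min(col[i] + 1, new[i - 1] + 1, col[i - 1] + cost))
            if min(new) <= t:                 # otherwise cut the whole branch
                visit(child, prefix + ch, new)

    visit(root, "", first_col)
    return matches

trie = build_trie(["filips", "fillips", "phillip", "phillips", "phan"])
print(sorted(trie_search(trie, "philips", 2)))
```

Run on the example above, the search returns Filips, Phillip, and Phillips, the same matches as in Figure 3.2.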
3.2 Similarity Key Algorithms
The general idea of the similarity key algorithms is, according to Kukich [8], to map every word to a key such that approximately matching words have identical or similar keys. The corresponding key of each dictionary entry can be stored and indexed, allowing fast lookup of entries having the same key. Similarity key algorithms therefore have a speed advantage, since there is no need to look through the entire dictionary. A key computed for the misspelled word provides a pointer to correction candidates in the dictionary.
3.2.1 Phonetic Algorithms
There are several algorithms which match words by phonetic equivalence. Two words are considered close matches if they sound alike when pronounced. The main advantage of these algorithms is speed. The main disadvantage is language dependency. Different languages pronounce letters differently, and since phonetic algorithms rely on phonetics specific to one language, they most often do not work well with languages other than the one they were designed for.
[Figure 3.2: Matching Philips in an example edit distance trie, t = 2. (a) The trie built over Filips, Fillips, Phillip, Phillips, and Phan, with word-end nodes shaded. (b)–(f) Matrices 1–5: the edit distance matrices between the query Philips and each name in the trie. Columns marked (?) are reused from a shared prefix; columns marked (×) are never computed, since the branch has been cut.]
Soundex

Soundex, presented in Hall and Dowling’s paper [4], was invented by Odell and Russell in 1918 and used by the U.S. Census to match American English names. It is perhaps the best known and most cited of the similarity key algorithms. Soundex translates a name into a four-character code based on the sound of each letter. The first letter of the name is kept constant, while the rest of the letters are coded into digits according to Table 3.1.

    Letters            Code
    a e h i o u w y     0
    b f p v             1
    c g j k q s x z     2
    d t                 3
    l                   4
    m n                 5
    r                   6

    Table 3.1: Soundex phonetic codes.

Letters with the same Soundex digit as their preceding letter are ignored. After coding the entire name, all zeros are eliminated. Finally, the code is truncated or padded with zeros to one initial letter and three digits. As an example, Dickson → D022205 → D0205 → D25 → D250. Also, Pfister → P102306 → P02306 (since f has the same Soundex digit as its preceding letter p) → P236.

The encoding algorithm is very fast in practice. After the code has been calculated, it can be used to quickly look up possible matches in a name list indexed by Soundex codes. According to Hall and Dowling, the Soundex algorithm is rather crude and can sometimes go very wrong. Two names with different initial letters will never have the same Soundex code, even though they have the same pronunciation (e.g. Karlsson → K642 and Carlson → C642). The algorithm is designed for English, but even with common English names it fails easily (e.g. Rodgers → R326 and Rogers → R262).

Phonix

There are many elaborations and adaptations of the Soundex encoding. According to Zobel and Dart [23], Phonix, invented by Gadd in 1988, is one of the more ambitious variants. It is far more complex than Soundex and therefore slower. While Soundex only removes certain letters and duplicate code characters, Phonix applies a large set of rules to transform the name before it is mapped to a set of codes according to Table 3.2. Altogether, there are about 160 transformations (for details, see Gadd [3]). The process is described by Pfeifer et al. [14] as the following:
1. Perform the phonetic transformations by replacing certain letter groups with others (e.g. gn, ghn, and gne are mapped to n; and tjV, where V is any vowel, is mapped to chV if it is at the start of the name).

2. Replace the initial letter with v if it is a vowel or the consonant y.

3. Remove the ending sound from the name (roughly the part after the last vowel or y).

4. Remove all remaining vowels, the consonants h, w, and y, and all repeated characters.

5. Create the Phonix code of the name without its ending sound by replacing all remaining letters except the first with the corresponding digits in Table 3.2.

6. Create the Phonix code of the ending sound by replacing every letter with the corresponding digit in Table 3.2.

The maximum length of any Phonix code is restricted to eight characters. Using the Phonix code of the name, each entry in the name list can be ranked as (1) identical, (2) similar, or (3) unrelated. Using the Phonix code of the ending sound, the similar rank can be divided into three subranks: (2a) agree on ending sounds, (2b) agree on a prefix of the ending sounds, or (2c) have different ending sounds.

    Letters     Code
    aehiouwy    0
    bp          1
    cgjkq       2
    dt          3
    l           4
    mn          5
    r           6
    fv          7
    sxz         8

    Table 3.2: Phonix phonetic codes.
Metaphone

Metaphone, invented by Lawrence Philips, is another replacement for Soundex. It is based on commonplace rules of how English words are pronounced and is, like Soundex, designed primarily for American English names. According to Lait and Randell [9], Metaphone reduces the alphabet to sixteen consonant sounds, retaining vowels only when they occur as the initial letter of a name. The phonetic rules used can be found in detail in Lait and Randell's paper.
As an improvement of Soundex, Metaphone is able to distinguish names such as Smith and Saneed, or Van Hoesen and Vincenzo, to which Soundex assigns the same code. The many rules, however, make it slower than Soundex.²

Non-English Phonetic Algorithms

Soundex, Phonix, and Metaphone are all designed for English names, but there are algorithms dealing with other languages. Lait and Randell mention Fonem, for use with French names only, and Daitch-Mokotoff Soundex, adapted for Slavic and German spellings of Jewish names, as two examples.
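As an illustration of the phonetic approach, the Soundex encoding described above can be sketched in a few lines of C++. This is a minimal version restricted to the letters a–z; the thesis implementations are also in C++, but this sketch is not taken from them:

```cpp
#include <cctype>
#include <string>

// Soundex digit for a letter, per Table 3.1 ('0' marks letters that are dropped).
char soundexDigit(char c) {
    static const std::string groups[] = {
        "aehiouwy", "bfpv", "cgjkqsxz", "dt", "l", "mn", "r"};
    c = std::tolower(static_cast<unsigned char>(c));
    for (int d = 0; d < 7; ++d)
        if (groups[d].find(c) != std::string::npos)
            return static_cast<char>('0' + d);
    return '0';
}

// Keep the first letter, drop letters sharing the digit of their predecessor,
// eliminate zeros, then truncate or pad to one initial letter and three digits.
std::string soundex(const std::string& name) {
    std::string code(1, std::toupper(static_cast<unsigned char>(name[0])));
    char prev = soundexDigit(name[0]);
    for (std::size_t i = 1; i < name.size(); ++i) {
        char d = soundexDigit(name[i]);
        if (d != prev && d != '0') code += d;  // skip repeats and vowel codes
        prev = d;
    }
    code.resize(4, '0');  // e.g. Dickson -> D250, Pfister -> P236
    return code;
}
```

Note how Karlsson and Carlson receive different codes (K642 and C642) solely because of their initial letters, illustrating the weakness discussed above.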
3.2.2 Knowledge-Based Algorithms
Unlike phonetic algorithms, knowledge-based algorithms do not rest on phonetics, but on knowledge of error types and letter characteristics. SPEEDCOP, designed by Pollock and Zamora [15] in 1984, is an example of an error correction application using knowledge-based algorithms. SPEEDCOP is limited to single-error misspellings, motivated by research showing that over 80 percent of all spelling mistakes belong to this kind of error. The two knowledge-based algorithms used by SPEEDCOP are the skeleton key and the omission key, which complement each other. The two keys are generated for each entry in the dictionary. The dictionary is then sorted in key order. A misspelling is corrected by locating words with keys close to the keys of the misspelled word.

The Skeleton Key

The skeleton key of a word consists of the first letter of the word, followed by first the remaining unique consonants in order of occurrence, and then the unique vowels in order of occurrence (e.g. chemomagnetic → chmgntceoai). Pollock and Zamora believe that (1) the first letter is likely to be correct, (2) consonants carry more information than vowels, and (3) the original consonant order is mostly preserved. This motivates the design of the skeleton key. Furthermore, the key has the advantage of not being altered by doubling or undoubling of letters or most transpositions.

The main weakness of the skeleton key is its emphasis on initial letters. The earlier an incorrect consonant occurs in the word, the greater is the distance between the key of the misspelled word and the keys of relevant corrections.

The Omission Key

The omission key is built to correct the above-mentioned weakness of the skeleton key. Pollock and Zamora noticed that early consonant damage is most often
² There exists an improved variant, called Double Metaphone (see http://aspell.net/metaphone/).
caused by omission errors. They use omission frequencies from their study of English scientific texts as the basis for their omission key. Consonants were found to be omitted from words in the following frequency order, starting with the most common omission: RSTNLCHDPGMFBYWVZXQKJ. The omission key of a word consists of its unique consonants, sorted in the reverse of the above frequency order, followed by the unique vowels of the word in their original order (e.g. caramel → mclrae). An example of a pair of matching names for which the omission key does not work is Carlson → CLNSRAO and Karlsson → KLNSRAO.
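Both SPEEDCOP keys can be sketched compactly. The sketch below assumes lower-case a–z input and treats a, e, i, o, and u as vowels; the paper's exact handling of y is not reproduced here:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

static bool isVowel(char c) {
    return std::string("aeiou").find(c) != std::string::npos;
}

// Skeleton key: first letter, then the remaining unique consonants in order
// of occurrence, then the unique vowels in order of occurrence.
std::string skeletonKey(std::string w) {
    for (char& c : w) c = std::tolower(static_cast<unsigned char>(c));
    std::string cons, vows;
    for (std::size_t i = 1; i < w.size(); ++i) {
        std::string& part = isVowel(w[i]) ? vows : cons;
        if (part.find(w[i]) == std::string::npos) part += w[i];
    }
    return w.substr(0, 1) + cons + vows;  // chemomagnetic -> chmgntceoai
}

// Omission key: unique consonants sorted in the reverse of Pollock and
// Zamora's omission-frequency order RSTNLCHDPGMFBYWVZXQKJ, then the unique
// vowels in their original order.
std::string omissionKey(std::string w) {
    for (char& c : w) c = std::tolower(static_cast<unsigned char>(c));
    const std::string rank = "jkqxzvwybfmgpdhclntsr";  // least-often omitted first
    std::string cons, vows;
    for (char c : w) {
        std::string& part = isVowel(c) ? vows : cons;
        if (part.find(c) == std::string::npos) part += c;
    }
    std::sort(cons.begin(), cons.end(),
              [&rank](char a, char b) { return rank.find(a) < rank.find(b); });
    return cons + vows;  // caramel -> mclrae
}
```

The Carlson/Karlsson failure above falls out directly: omissionKey yields clnsrao and klnsrao respectively, which differ in the very first character.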
3.3 N-gram Analysis
The n-grams of a word are all the overlapping n-letter sequences in the word. As an example, the bigrams (n = 2) of the word method are {me, et, th, ho, od}, while the trigrams (n = 3) of the same word are {met, eth, tho, hod}. The main advantage of n-gram analysis is that it is language independent, since it only involves comparison of letters. According to Pfeifer et al. [14], the more n-grams two words have in common, the more similar they are.
3.3.1 N-gram Similarity Measures
There are several different n-gram similarity measures. A simple measure given by Zobel and Dart [23] is the count of the total number of n-grams two words have in common,

    gram-count = |N1 ∩ N2|                                    (3.4)

where N1 and N2 are the sets of n-grams of the two words. The similarity coefficient is then, according to Pfeifer et al. [14],

    gram-coefficient = |N1 ∩ N2| / |N1 ∪ N2|.                 (3.5)

The words receieve and receive have in total eight different bigrams, of which they have five in common. According to Equation 3.5, their similarity coefficient is therefore 5/8. Another measure used by Zobel and Dart is Ukkonen's n-gram distance function,

    gram-dist = |N1| + |N2| − 2|N1 ∩ N2|.                     (3.6)
|N1| and |N2| denote the number of n-grams in the two words and can be easily calculated from the length of the words. In the above example, receieve has seven bigrams, and receive has six bigrams. They share five bigrams. The n-gram distance between them is thus 7 + 6 − 2 · 5 = 3.

The similarity measures presented above do not take into account the ordering of letters within words. However, Zobel and Dart's study shows that the majority of
names do not contain repeated n-grams. Less than 2 percent of the 30 000 names in their data set contain a repeated bigram, and almost none contain a repeated trigram. Therefore, the information which ordering contains is usually retained implicitly in the above-mentioned measures.
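Both measures follow directly from the n-gram sets. A sketch using sets of distinct n-grams (which, as just noted, loses almost no information for names):

```cpp
#include <set>
#include <string>

// Distinct n-grams of a word; repeated n-grams are rare in names, so a set suffices.
std::set<std::string> ngrams(const std::string& w, std::size_t n) {
    std::set<std::string> grams;
    for (std::size_t i = 0; i + n <= w.size(); ++i) grams.insert(w.substr(i, n));
    return grams;
}

// Ukkonen's n-gram distance, Equation 3.6: |N1| + |N2| - 2|N1 ∩ N2|.
int gramDist(const std::string& a, const std::string& b, std::size_t n) {
    std::set<std::string> n1 = ngrams(a, n), n2 = ngrams(b, n);
    int common = 0;
    for (const std::string& g : n1)
        if (n2.count(g)) ++common;
    return static_cast<int>(n1.size()) + static_cast<int>(n2.size()) - 2 * common;
}

// Similarity coefficient, Equation 3.5: |N1 ∩ N2| / |N1 ∪ N2|.
double gramCoefficient(const std::string& a, const std::string& b, std::size_t n) {
    std::set<std::string> n1 = ngrams(a, n), n2 = ngrams(b, n);
    int common = 0;
    for (const std::string& g : n1)
        if (n2.count(g)) ++common;
    int unionSize = static_cast<int>(n1.size()) + static_cast<int>(n2.size()) - common;
    return static_cast<double>(common) / unionSize;
}
```

For receieve and receive with n = 2, gramDist gives 7 + 6 − 2 · 5 = 3 and gramCoefficient gives 5/8, matching the examples above.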
3.3.2 Variations
Salton [17] states that bigrams and trigrams (n = 2, 3) are best suited for retrieving similar words to a given word. The use of unigrams (n = 1) would produce too many matches, while longer string sequences (n ≥ 4) would miss the common roots of many short words.

Pfeifer et al. [14] propose the use of additional blanks at the beginning and the end of the word to emphasise the first and last letters and at the same time increase the total number of n-grams involved. This could give a more precise result, especially for shorter words. In general, the number of n-grams in a word of length m, with k blanks added to both the beginning and the end, is m + 2k − n + 1.

Matching speed could be improved by indexing dictionary entries by the n-grams they contain, to allow for fast dictionary lookup, and limiting the matching process to words which have at least one n-gram in common with the misspelled word (see Section 4.5).
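The padding with blanks suggested by Pfeifer et al. can be sketched as follows; a word of length m with k blanks on each side yields m + 2k − n + 1 n-grams:

```cpp
#include <string>
#include <vector>

// N-grams of a word padded with k blanks at each end, so that the first and
// last letters take part in more n-grams.
std::vector<std::string> paddedNgrams(const std::string& w, std::size_t n, std::size_t k) {
    std::string padded = std::string(k, ' ') + w + std::string(k, ' ');
    std::vector<std::string> grams;
    for (std::size_t i = 0; i + n <= padded.size(); ++i)
        grams.push_back(padded.substr(i, n));
    return grams;
}
```

For method (m = 6) with n = 2 and k = 1 this gives 6 + 2 − 2 + 1 = 7 bigrams, the first being " m" and the last "d ".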
3.4 Rule-Based Algorithms
Rule-based algorithms attempt to represent knowledge of common spelling error patterns in the form of various rules for how to transform a misspelled word into a valid one. According to Kukich [8], correction candidates are generated by applying all applicable rules to the misspelled word and retaining the valid dictionary entries which result. The candidates can then be ranked. Frequently, a numerical score is assigned to each candidate, based on the probability of having made the particular error corrected by the corresponding rule. Higher probability means a closer match.
3.5 Probabilistic Algorithms
Probabilistic algorithms use probabilities to determine suitable corrections for the misspelled word. Kukich [8] describes two kinds of probabilities used in spelling correction, transition probabilities and confusion probabilities. Transition probabilities represent probabilities that a letter x will be followed by another letter y. These probabilities are language dependent and can be estimated using n-gram frequency statistics from a relevant data source. Confusion probabilities represent probabilities that a letter x will be confused and substituted with another letter y. These probabilities are source dependent (see Section 2.2).
An example of a probabilistic algorithm is the one invented by Bledsoe and Browning presented in Kukich's paper. Their basic idea is that the conditional probability of the dictionary entry X being the correct spelling given the misspelled word Y, P(X|Y), can be calculated using Bayes' rule,

    P(X|Y) = P(Y|X) · P(X) / P(Y).                            (3.7)
P(X) is the unigram probability of the word X, i.e. the likelihood of X occurring, estimated from its general frequency of occurrence in the data. P(Y|X) is the conditional probability of observing Y when X is the correct word and can be calculated using estimated probabilities of individual letters. Research on text recognition has shown, however, that the use of probabilistic information alone is insufficient to achieve acceptable error correction rates.
3.6 Neural Nets
According to [20], a neural network is an interconnected group of neurons. The human brain is a prime example of a neural network. Artificial neural networks, or neural nets for short, are built to resemble the workings of the brain. A neural net consists of a massively parallel collection of simple processing units called nodes or 'neurons'. The interconnections between these nodes form a large part of the network's intelligence.

Neural nets can, according to Kukich [8], be used for spelling correction, since they have an inherent ability to retrieve relevant information despite incomplete or noisy input, or in this case misspellings. They can also be trained to adapt to users' error patterns and gradually improve and eventually maximise correction accuracy. However, they have a major drawback. Kukich shows in [7] that running the learning cycles needed to obtain acceptable correction accuracy requires a very long time even for a small dictionary of about 500 words. Training time increases non-polynomially with dictionary size.
3.7 Hybrid Algorithms
There are some algorithms which combine several of the approaches mentioned above. Zobel and Dart [23] propose, for example, first using a coarse search method such as n-grams or some kind of index to partition the dictionary and remove words unlikely to be a match, and then performing a finer search on the remaining words using e.g. edit distance or phonetic coding.
Chapter 4

List Partitioning

When matching personal names, the size of the name list can be extensive. It is therefore important to avoid exhaustive searches of the list every time a new name needs to be matched. A partitioning method can be used to retrieve the part of the list which is most likely to be interesting. Then, an approximate name matching algorithm is used on this partition to find the relevant matches.
4.1 First Letter
A simple way to partition the list is by first letter. This is e.g. done implicitly when using Soundex (see Section 3.2.1), which does not encode the first letter of a name. Hall and Dowling [4] state that the main drawback of this method is that matching names with different initial letters will never be found. This might not be a problem for words, as mentioned in Section 2.2, since they are seldom misspelled at the beginning. With names, the situation is unclear.
4.2 Word Length
Hall and Dowling [4] also mention partitioning the list by word length. If the maximum number of errors allowed is set to a threshold value t, then the length of a possible match must be the length of the name ± t. The edit distance with upper bound and cut-off criterion algorithm described in Section 3.1.4 partitions the list according to this idea.

Both Hall and Dowling [4] and Winkler [22] state, however, that partitioning by word length is doubtful for name lists. The approach may not have much impact on search time, since entries in name lists usually do not vary much in length. The distribution of length when it comes to surnames is very non-uniform.
4.3 Word Halves
Another approach presented by Hall and Dowling [4], suggested by Knuth, is indexing the first and last halves of all words. The motivation is that words differing by a single error must match exactly in either the first or the second half. The word halves indices can be implemented with a prefix tree and a suffix tree similar to the trie described in Section 3.1.5. The word to be matched is split in two halves, a prefix and a suffix. These halves are then looked up in the two trees. All words branching out below the word halves in each tree share the same prefix or suffix as the query word. The smallest partition of the list is retrieved by choosing the index with the least number of words which start or end in the same way as the query word.
4.4 Similarity Keys
When using a similarity key algorithm (see Section 3.2), it is possible to partition the list by first key character or simply the entire key. Depending on how the key is constructed, the problem of not being able to find matching names with different initial letters might be avoided.
4.5 N-grams
Zobel and Dart [23] propose using an inverted index of n-grams (see Section 3.3) to avoid going through the entire list. First, all names in the list are given a unique number. Then, for every possible n-gram¹, a list of all the numbers of names containing that n-gram is constructed. In order to find close matches to a specific name, the union of all lists with names having an n-gram in common with that name is taken. Answers can be stored in a heap, sorted by n-gram distance, and the answers with too large distances can be skipped to save sorting time.
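Such an inverted index can be sketched as follows. The heap-based ranking is omitted; `candidates` (a hypothetical helper name) simply returns the union described above:

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Inverted index: each n-gram maps to the numbers of the names containing it.
// Candidates for a query are the union of the lists of its n-grams.
struct NgramIndex {
    std::size_t n;
    std::vector<std::string> names;
    std::map<std::string, std::vector<int>> postings;

    explicit NgramIndex(std::size_t n_) : n(n_) {}

    void insert(const std::string& name) {
        int id = static_cast<int>(names.size());
        names.push_back(name);
        std::set<std::string> grams;  // index each distinct n-gram once
        for (std::size_t i = 0; i + n <= name.size(); ++i)
            grams.insert(name.substr(i, n));
        for (const std::string& g : grams) postings[g].push_back(id);
    }

    // Numbers of all names sharing at least one n-gram with the query.
    std::set<int> candidates(const std::string& query) const {
        std::set<int> ids;
        for (std::size_t i = 0; i + n <= query.size(); ++i) {
            auto it = postings.find(query.substr(i, n));
            if (it != postings.end())
                ids.insert(it->second.begin(), it->second.end());
        }
        return ids;
    }
};
```

For a bigram index over karlsson, carlson, and smith, the query carlsson retrieves the first two names (they share ar, rl, ls, ...) but not smith, which has no bigram in common with it.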
4.6 Bloom Filter
A rather different approach to avoid looking through the entire list is presented by Kann et al. [6]. They use a Bloom filter in order to find spelling corrections in Swedish. A Bloom filter is a special kind of hash table, where every entry is either 0 or 1. The main advantage of the Bloom filter is that it allows extremely fast lookup of words. To look up a word, it is hashed repeatedly with several hash functions. If all hash entries are equal to 1, the word is in the list. If any of the entries are equal to 0, then the word is not part of the list. There is, however, a small possibility that a random word is considered to be in the list. This occurs when the word has the
¹ With an alphabet of l letters, there are l^n possible n-grams.
same hash signature, i.e. ones in the same positions, as an actual word in the list. The probability of such collisions occurring can be minimised by adjusting the size of the hash table and the number of hash functions to the size of the word list. The hash table is, according to Kann et al., used optimally when it is half-filled with ones. If N is the size of the word list, and M is the chosen size of the hash table, then k hash functions should be used to minimise the error probability f(k), where

    k = −ln 2 / (N · ln(1 − 1/M)) ≈ ln 2 · M/N ≈ 0.69 · M/N   (4.1)

and

    f(k) = 2^−k.                                              (4.2)

As an example, if the list contains N = 100 000 words, and the hash table size is chosen to be M = 2 000 000, then

    k = ln 2 · 2 000 000 / 100 000 ≈ 13.9 ≈ 14                (4.3)

hash functions should be used in order to minimise the probability of a random word being accepted to

    f(14) = 2^−14 ≈ 6 · 10^−5 = 0.006%.                       (4.4)

Kann et al.'s idea for spelling correction is based on Damerau's reverse edit distance algorithm (see Section 3.1.1). They first generate all possible words a single error apart from the original word and then check, using the Bloom filter, if any of these are in the list. The maximum number of words checked is always l(2n + 1) + n − 1, where l is the size of the alphabet, and n is the length of the original word, independent of the size of the word list. When using a Bloom filter, new words can be inserted into the list, but in order to delete an existing word, the entire Bloom filter must be rebuilt.
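The combination of reverse edit distance and a Bloom filter can be sketched as follows. The hash functions here are derived from std::hash with different seeds purely for illustration; Kann et al. use Jenkins' hash function, and `lookupSimilar` is a hypothetical helper name:

```cpp
#include <functional>
#include <set>
#include <string>
#include <vector>

// Bloom filter sketch: a bit vector probed by k seeded hash functions.
class BloomFilter {
    std::vector<bool> bits;
    int k;

    // Illustrative seeding scheme, not Kann et al.'s construction.
    std::size_t position(const std::string& word, int seed) const {
        return std::hash<std::string>{}(word + static_cast<char>('A' + seed)) % bits.size();
    }

public:
    BloomFilter(std::size_t m, int hashes) : bits(m, false), k(hashes) {}

    void insert(const std::string& word) {
        for (int i = 0; i < k; ++i) bits[position(word, i)] = true;
    }

    // False means definitely not in the list; true means in the list, up to a
    // small false-positive probability f(k) = 2^-k.
    bool mayContain(const std::string& word) const {
        for (int i = 0; i < k; ++i)
            if (!bits[position(word, i)]) return false;
        return true;
    }
};

// All words a single edit apart from `word` (deletion, transposition,
// substitution, insertion), as in Damerau's reverse edit distance.
std::set<std::string> singleErrorAlternatives(const std::string& word) {
    const std::string alphabet = "abcdefghijklmnopqrstuvwxyz";
    std::set<std::string> alts;
    const std::size_t n = word.size();
    for (std::size_t i = 0; i < n; ++i)                    // deletions
        alts.insert(word.substr(0, i) + word.substr(i + 1));
    for (std::size_t i = 0; i + 1 < n; ++i) {              // transpositions
        std::string t = word;
        std::swap(t[i], t[i + 1]);
        alts.insert(t);
    }
    for (char c : alphabet)
        for (std::size_t i = 0; i < n; ++i) {              // substitutions
            std::string t = word;
            t[i] = c;
            alts.insert(t);
        }
    for (char c : alphabet)
        for (std::size_t i = 0; i <= n; ++i)               // insertions
            alts.insert(word.substr(0, i) + c + word.substr(i));
    return alts;
}

// Names in the filter at most one error from the query.
std::vector<std::string> lookupSimilar(const BloomFilter& filter, const std::string& query) {
    std::vector<std::string> hits;
    for (const std::string& alt : singleErrorAlternatives(query))
        if (filter.mayContain(alt)) hits.push_back(alt);
    return hits;
}
```

For the query carlsson, the generated alternatives include both carlson (one deletion) and karlsson (one substitution), so both are found if they are in the filter.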
Chapter 5

Selecting Suitable Algorithms

In this chapter, we look at which algorithms are best suited for the goals stated in Section 1.2. We wish to perform fast approximate matching of names originating from all over the world. In order to choose algorithms suited for this purpose, we must first define how we will measure performance.
5.1 Measuring Performance
We need to evaluate algorithms both according to the time it takes to perform a search and the quality of the results returned by that search.
5.1.1 Speed
Since we are dealing with large lists of names, time is an issue. Intuitively, the more time we have available to go through alternatives, the easier it is to find good matches. The goal, however, is to find a solution which can return relatively many interesting matches in the shortest time possible.
5.1.2 Effectiveness
In information retrieval, the two measures recall and precision are often used as retrieval effectiveness criteria. According to Raghavan et al. [16], high recall means retrieving as many relevant items as possible, while high precision means retrieving as few irrelevant items as possible. More specifically, recall is the proportion of relevant matches actually retrieved, and precision is the proportion of retrieved matches which are relevant (see Figure 5.1). A match is relevant if it is judged by the user to be of interest.

[Figure 5.1: The two effectiveness measures, precision and recall. With A the set of retrieved matches and B the set of relevant matches, Precision = |A ∩ B|/|A| and Recall = |A ∩ B|/|B|.]

Ideally, we would like to have 100 percent precision and 100 percent recall. There is, however, often a trade-off between recall and precision. In order to get more relevant items, usually more irrelevant items are retrieved as well. The converse is also true. Traditionally, according to Raghavan et al., an algorithm A is considered to be better than an algorithm B if, for every recall level, A has a higher precision value than B.
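The two measures are straightforward to compute once the retrieved set A and the relevant set B are known (a small sketch with hypothetical helper names):

```cpp
#include <set>
#include <string>

// Number of matches both retrieved and relevant, |A ∩ B|.
static int overlap(const std::set<std::string>& a, const std::set<std::string>& b) {
    int n = 0;
    for (const std::string& x : a)
        if (b.count(x)) ++n;
    return n;
}

// Precision = |A ∩ B| / |A|: the proportion of retrieved matches that are relevant.
double precision(const std::set<std::string>& retrieved,
                 const std::set<std::string>& relevant) {
    if (retrieved.empty()) return 0.0;
    return static_cast<double>(overlap(retrieved, relevant)) / retrieved.size();
}

// Recall = |A ∩ B| / |B|: the proportion of relevant matches actually retrieved.
double recall(const std::set<std::string>& retrieved,
              const std::set<std::string>& relevant) {
    if (relevant.empty()) return 0.0;
    return static_cast<double>(overlap(retrieved, relevant)) / relevant.size();
}
```

For example, if an algorithm retrieves {Carlson, Karlsson, Smith} and the relevant matches are {Carlson, Karlsson, Carlsen}, both precision and recall are 2/3.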
5.2 Selection
Approximate name matching algorithms from different categories tend to return quite different results, since they are based on different ideas. For this reason, we want to first pick out the best algorithms within each of the categories presented in Chapter 3 and then compare their performances. However, there are some categories we will not consider.
5.2.1 Limitations
We will not look at rule-based and probabilistic algorithms, since these are built on error probabilities, which are derived through frequency analysis. According to Winkler [22], frequency-based matching will not work for large lists of names. The reason is that the variation is too great to give any useful information on frequencies and probabilities.

We also disregard neural nets, since they require extensive knowledge of linguistics and a great deal of time-consuming training. The time required for training increases non-polynomially with list size. Furthermore, according to Kukich [7], neural nets are actually slower than edit distance for large lists, while not offering a much better error correction rate.
5.2.2 Selecting the Best Algorithms
The categories remaining for us to explore are edit distance algorithms, similarity key algorithms, and n-gram analysis.
Edit Distance Algorithms

Most of the comparative research on approximate name matching algorithms shows that edit distance gives the best recall–precision performance, but takes the longest time (see e.g. Zobel and Dart [23]). Calculating the dynamic programming matrix for each name in the list is not an alternative, but we could use a trie as in Section 3.1.5 or partition the name list by word length and apply the cut-off criterion as in Section 3.1.4. Since name lists usually have non-uniform word length distribution, it is unclear whether partitioning by word length has any greater impact. In contrast, edit distance with a trie will almost certainly be faster than the traditional dynamic programming approach. Therefore, we choose to test edit distance with a trie (ED (Trie)), but will not consider the cut-off approach further.

With the trie approach, by setting different thresholds for the number of edit operation errors allowed, we will get varying effectiveness and speed. The lower the error threshold, the faster the matching of a name against the trie, since less of the trie is visited. It is uncertain, however, in which way effectiveness is affected by the set threshold. According to tests of Pfeifer et al. [14] on a list of 14 000 English names, allowing more than three typing errors does not improve effectiveness markedly.

Since empirical studies have shown that few misspellings occur in the first letter of a word (see Section 2.2), we will also try using a trie where only names starting with the same letter as the query name are matched, i.e. only a single branch going from the tree root is visited (ED (Trie-C)). This should save a great deal of time compared to the full trie approach, but might not return as effective results.

We also choose to evaluate the reverse edit distance approach which uses a Bloom filter (ED (Bloom)) for extremely fast lookup of the alternatives (see Section 4.6).
This might give as good matches as full edit distance, since several sources have shown that more than 80 percent of all spelling mistakes have a single error (see Section 2.2). No previous results from using a Bloom filter to match similar names have been found. This makes it even more interesting to try the algorithm.

With the Bloom filter approach, the user could also decide how many errors generated alternatives may have. Instead of only allowing a single error, we could generate all alternatives having an edit distance t ≥ 0 to the query name. The number of alternatives grows exponentially with the maximum allowed edit distance t and is O((ln)^t), where l is the number of letters in the alphabet used, and n is the length of the query name. As an example, instead of 555 alternative names for a ten-letter query name and 1 085 alternatives for a twenty-letter name as in Section 3.1.1, with t = 2, we would have around 300 000 alternatives for the ten-letter name and 1.2 million alternatives for the twenty-letter name.

Thus, for short names, using the Bloom filter approach with edit distance 2 could be an option, since the number of alternatives to hash is still low. With longer names, however fast the hash functions are, it is doubtful that hashing all generated alternatives is a viable option. We could keep track of existent n-grams in the name list and only hash alternatives without impossible n-letter combinations, as
described by Kann et al. [6]. It is, however, unclear how much time would be saved, since checking for impossible n-grams also takes time, and in an international name list, few letter combinations may be impossible. We will not pursue this extended Bloom filter approach further for now.

The partitioning by word halves approach presented in Section 4.3 is somewhat related to the Bloom filter algorithm. It is also built to only match names a single error apart from the query name. First, the smallest partition of the name list is retrieved where the names have at least half of their letters in common with the query name. Then, the edit distances for the names in the retrieved partition are calculated. This approach could be quite fast, but will in general slow down as the size of the name list increases and more names share the same prefixes and suffixes. The Bloom filter algorithm, on the other hand, is known to be extremely fast (see Kann et al.), and its speed performance is independent of list size. Since the matches they return should be the same, we choose to only look at the Bloom filter approach and skip the word halves approach.

We will not consider weighted edit distance either. Looking up and adding the appropriate weights require extra computation time. Furthermore, by using weights, we risk making the solution language dependent, since the weights most likely vary from language to language.

Similarity Key Algorithms

Soundex (Soundex), one of the oldest phonetic algorithms, is fast and simple. Although it is designed for English, it might still give acceptable matches for international names. According to Pfeifer et al. [14], for mixed language lists, the Soundex algorithm is better suited than other phonetic algorithms due to its simplicity. Zobel and Dart [23] have shown that Soundex, curiously enough, outperforms the 70 years more recent Phonix.
Lait and Randell [9] have found that Soundex is, for English names, both faster and more accurate than Metaphone as well. We therefore choose Soundex as the representative of the phonetic algorithms, and we will index all names in the list by their Soundex codes.

We will also test a slightly extended version of Soundex where the initial letter of the query name is also encoded (Soundex-E). Using this approach, names with different initial letters, but similar pronunciations, can be found. This will, however, slow down the matching process, since more entries are involved, and may return more false matches.

Regarding the knowledge-based similarity key algorithms, Pollock and Zamora [15] chose to complement their skeleton key with the omission key, since it did not work well on its own with words misspelled at the beginning. The omission key, however, is designed with the English language in mind. Since we have already chosen to try a language dependent algorithm, Soundex, we will not test this slower skeleton–omission key combination.
N-gram Analysis

According to Salton [17], bigrams and trigrams (n = 2, 3) are best suited for approximate string matching. Pfeifer et al. [14] tested n-gram analysis on their list of 14 000 international names. They found that the best performance is obtained by using bigrams with one blank, with trigrams with two blanks being an acceptable alternative. Blanks are used as described in Section 3.3.2.

We will test both bigrams and trigrams, with a varying number of blanks. We will use the n-gram distance similarity measure described in Section 3.3.1, since this is easier to calculate than the similarity coefficient, while containing as much information. We will build an inverted index of the n-grams of all names in the list as described in Section 4.5. As with the trie algorithms, by varying the threshold for the n-gram distance allowed, we will get varying effectiveness and speed. The different n-gram algorithms will be denoted N-gram (n, k), where n is the number of letters in each n-gram, and k is the number of blanks added to the beginning and the end of the names.
5.2.3 General Comparisons
Zobel and Dart [23] have compared Soundex, Phonix, n-gram analysis, and edit distance using a name list of about 32 000 names. They found phonetic coding algorithms to be poor matching mechanisms, even for personal names for which they were designed. Unexpectedly, edit distance was far better at finding good matches than phonetic algorithms. N-gram analysis with indexing and ranking by n-gram distance was the best approach when both speed and effectiveness were considered. Pfeifer et al.'s [14] tests on about 14 000 names of edit distance, the skeleton and omission keys, Soundex, Phonix, and n-gram analysis also showed that n-gram analysis was the best overall approach.
5.3 Algorithms Selected
In summary, these are the algorithms we will evaluate:

• ED (Bloom): Reverse edit distance with a Bloom filter
• ED (Trie): Edit distance with a trie
• ED (Trie-C): Edit distance with a trie, matching only names starting with the same letter as the query name
• Soundex: Original Soundex
• Soundex-E: Extended Soundex with encoding of the initial letter
• N-gram (n, k): N-gram analysis using k ≥ 0 blanks, for (n, k) = (2, 0), (2, 1), (3, 0), (3, 1), (3, 2)
Chapter 6

Implementation

The approximate name matching algorithms we have selected for evaluation are described thoroughly in Section 4.6 (Bloom filter), Section 3.1.5 (trie), Section 3.3 and Section 4.5 (n-gram analysis), and Section 3.2.1 (Soundex). Here, we look briefly at some implementation details.
6.1 General
All algorithms are implemented in C++ and compiled using GNU C++ version 4.0.0. In general, for all the approaches, it is assumed that the name list is stored as a text file, with each name on a separate line. There is no duplicate check when new names are inserted.
6.2 Edit Distance (Bloom Filter)
According to Kann et al. [6], when using a Bloom filter to check misspellings, more than 80 percent of the execution time is spent on computing the hash functions. It is therefore important to find fast and efficient hash functions. Kann et al. used a hash function constructed by Jenkins [5]¹, which has also been adopted here. With Jenkins' hash function, every bit of the key affects every bit of the hash value returned, and we get an even distribution of bits.

The hash table size used must be a power of 2 to be able to calculate mod operations by using an extremely fast bitmask. The number of hashes and the size of the hash table must be adjusted for the size of the name list, according to Equation 4.1. The maximum word length allowed is set to 50 characters, but can be changed at will. The matches returned for a query name are unranked.
¹ The hash function is also described in detail at http://burtleburtle.net/bob/hash/doobs.html.
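The two tuning steps described above are each a one-liner: Equation 4.1 for the number of hash functions, and the bitmask trick for tables whose size is a power of two. A sketch with hypothetical helper names:

```cpp
#include <cmath>
#include <cstdint>

// Optimal number of hash functions by Equation 4.1: k ≈ ln 2 · M / N,
// where M is the hash table size and N the number of words.
int optimalHashCount(double tableSize, double wordCount) {
    return static_cast<int>(std::lround(std::log(2.0) * tableSize / wordCount));
}

// With M a power of two, h mod M reduces to a single bitmask operation.
std::uint32_t tableIndex(std::uint32_t hash, std::uint32_t tableSize) {
    return hash & (tableSize - 1);
}
```

optimalHashCount(2000000, 100000) reproduces the k = 14 of the example in Section 4.6.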
6.3 Edit Distance (Trie)
The same implementation is used for ED (Trie) and ED (Trie-C), except that ED (Trie-C) partitions the name list by first letter and only explores a single branch going from the trie root. Each node in the trie is implemented as an array of pointers to possible child nodes. The array size is l, where l is the number of letters in the alphabet used. The node also keeps track of whether it represents the end of a name and the number of names below it in the trie. Matching names with edit distance below the threshold given by the user are returned together with their respective distances, ranked by edit distance to the query name.
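The trie search can be sketched as a depth-first traversal that carries one row of the dynamic programming matrix per node. This sketch uses plain Levenshtein distance over an a–z alphabet; the transposition operation, the name count per node, the ranking of results, and the ED (Trie-C) restriction are omitted for brevity, and nodes are never freed:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Trie node as described above: an array of child pointers, one per letter,
// plus an end-of-name flag.
struct TrieNode {
    TrieNode* child[26] = {nullptr};
    bool isName = false;
};

void insert(TrieNode* root, const std::string& name) {
    for (char c : name) {
        int i = c - 'a';
        if (!root->child[i]) root->child[i] = new TrieNode;
        root = root->child[i];
    }
    root->isName = true;
}

// Extend the previous DP row by one letter; prune a branch as soon as every
// entry in its row exceeds the error threshold.
void search(TrieNode* node, char letter, const std::vector<int>& prevRow,
            const std::string& query, std::string& prefix, int threshold,
            std::vector<std::pair<std::string, int>>& matches) {
    std::vector<int> row(query.size() + 1);
    row[0] = prevRow[0] + 1;
    for (std::size_t j = 1; j <= query.size(); ++j) {
        int cost = (query[j - 1] == letter) ? 0 : 1;
        row[j] = std::min({row[j - 1] + 1, prevRow[j] + 1, prevRow[j - 1] + cost});
    }
    prefix += letter;
    if (node->isName && row.back() <= threshold)
        matches.push_back({prefix, row.back()});
    if (*std::min_element(row.begin(), row.end()) <= threshold)
        for (int i = 0; i < 26; ++i)
            if (node->child[i])
                search(node->child[i], 'a' + i, row, query, prefix, threshold, matches);
    prefix.pop_back();
}

std::vector<std::pair<std::string, int>> match(TrieNode* root,
                                               const std::string& query,
                                               int threshold) {
    std::vector<std::pair<std::string, int>> matches;
    std::vector<int> firstRow(query.size() + 1);
    for (std::size_t j = 0; j <= query.size(); ++j) firstRow[j] = static_cast<int>(j);
    std::string prefix;
    for (int i = 0; i < 26; ++i)
        if (root->child[i])
            search(root->child[i], 'a' + i, firstRow, query, prefix, threshold, matches);
    return matches;
}
```

With carlson, karlsson, and smith in the trie, the query carlsson with threshold 1 returns carlson and karlsson, each at edit distance 1; the smith branch is abandoned after a few letters.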
6.4 Soundex
Soundex and Soundex-E share the same implementation, with the only difference that the first letter is also encoded when using Soundex-E. The matches returned for a query name are unranked.
6.5 N-gram Analysis
A general implementation of n-gram analysis was written. The user defines which n and k to use, where n is the number of letters in each n-gram, and k is the number of blanks added to the beginning and the end of the name. The user also sets the error threshold. Only names with an n-gram distance to the query name below this threshold are returned. The matches returned are ranked by distance to the query name.
Chapter 7

Evaluation of Selected Algorithms

In this chapter, we evaluate the performances of the approximate name matching algorithms selected in Chapter 5 and present the results. We describe the test data used and the way results were measured.
7.1 Test Lists and Queries
Two test lists of surnames were built in order to evaluate the performance of the selected algorithms. The first name list, WORLD, consists of 4 735 distinct surnames. These names were collected from the following sources:

• Customer lists of travel agencies from all around the world, accessed through Amadeus' systems
• Amadeus' employee list, containing mostly Spanish, French, English, and German names
• An online onomastikon (dictionary of names), see [11]
• Lists of most common surnames from various countries, see [19]

The WORLD list was used to evaluate effectiveness. A total of 150 query names were picked from WORLD, with fifteen names from each of ten different world regions and language groups.¹ The query names were selected largely at random.

For each query name, the WORLD list was manually searched for potentially relevant matches. The potential matches were then examined by two people independently, and only the names they both agreed upon were kept as relevant matches. A name in the list was judged to be relevant if it looked or sounded like the query name. We envisioned a situation where a name may have been misspelled
[1] The names in the ten different categories were (1) East Asian, (2) South Asian, (3) Middle Eastern, (4) African, (5) Mediterranean, (6) East European, (7) North European, (8) English, (9) German, and (10) Spanish, French, or Italian.
when making a travel reservation by telephone. Since all the 150 query names were selected from the name list, they all had at least one relevant match, i.e. their exact match.

The second name list, LARGE, consists of 195 183 distinct surnames retrieved from a German travel agency chain through Amadeus' systems. It was used to test the speed of various approaches. Ten different query names were chosen randomly from LARGE for matching using each of the selected algorithms. Each algorithm was run on seven different name lists with sizes ranging from 5 000 to 195 000 names.[2] The names in each name list were distinct and picked randomly from LARGE.
7.2 Evaluating Performance
We clarify here how the performance of the different algorithms was measured.
7.2.1 Speed
The execution times of the approximate name matching algorithms were timed using the C++ function clock. The algorithms were run on an Intel Pentium machine with a 2.80 GHz CPU and 1 GB RAM. Three kinds of times were measured for each algorithm. The first measure is the time needed for building any required data structures and storing these to file. This is in most cases a one-time operation which does not need to be repeated as long as the name list does not undergo dramatic, sudden changes. The second time measured is the time required to initialise the data structures needed for matching. This includes loading previously stored structures from file. Initialisation is necessary only at startup. The third measure is the time needed for matching a single name against the name list. This is calculated as an average of the matching times of the ten randomly selected names described in Section 7.1.
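The per-run timing can be sketched in this fashion (a minimal sketch; the actual measurement harness used in the evaluations is not reproduced in the thesis):

```cpp
#include <ctime>

// Measure the CPU time consumed by one run of a matching routine
// using std::clock. The timer's resolution is platform dependent,
// which is why matching times were averaged over several queries.
double elapsed_seconds(void (*run)()) {
    std::clock_t start = std::clock();
    run();
    return static_cast<double>(std::clock() - start) / CLOCKS_PER_SEC;
}
```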
7.2.2 Effectiveness
It was decided to use recall–precision graphs to depict the effectiveness of the different algorithms relative to each other. This kind of graph is, according to Raghavan et al. [16], often used as a combined evaluation measure of the quality of retrieval systems. Each of the 150 queries described in Section 7.1 was matched against the WORLD list, using each of the selected algorithms in turn. The average precision values of each algorithm at the conventional standard recall levels 0, 0.1, 0.2, . . . , and 1 were then plotted for comparison.
[2] The different list sizes were 5 000, 10 000, 25 000, 50 000, 100 000, 150 000, and 195 000 names.
Recall and Precision for Unranked Queries

Calculating precision at different recall levels for a single query is straightforward when the algorithm used returns unranked results. Let NR denote the number of relevant items found so far. For a query with n possible relevant matches, 0 ≤ NR ≤ n. Then, for every recall point R = NR/n we encounter during the query, precision is simply calculated as

P = NR / (NR + NIR)    (7.1)
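The computation of precision at each recall point of a single unranked query can be sketched as follows (a hypothetical helper; it scans the returned matches, each flagged as relevant or irrelevant):

```cpp
#include <vector>

// Precision at every recall point of a single unranked query
// (equation 7.1): scan the returned matches and, each time a
// relevant item is found, record P = NR / (NR + NIR).
std::vector<double> precision_at_recall_points(const std::vector<bool>& relevant) {
    std::vector<double> precisions;
    int nr = 0;   // relevant items found so far (NR)
    int nir = 0;  // irrelevant items found so far (NIR)
    for (bool is_relevant : relevant) {
        if (is_relevant) {
            ++nr;
            precisions.push_back(static_cast<double>(nr) / (nr + nir));
        } else {
            ++nir;
        }
    }
    return precisions;
}
```

For a query returning a relevant, an irrelevant, and then another relevant match, this yields precision 1 at the first recall point and 2/3 at the second.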
where NIR is the number of irrelevant items found at that point in the query.

Recall and Precision for Ranked Queries

When the algorithm used returns ranked results, the above method for calculating precision must be modified. Since the system may generate many possible retrieval orders for a single query, we need a notion of probabilistic precision. Raghavan et al. [16] use the Probability of Relevance (PRR) measure to represent precision. PRR corresponds to the probability that a retrieved item is relevant. Assume that we require NR relevant items. We start the search from the top rank and continue until we reach a final rank k where our request can be satisfied. The PRR value corresponding to the recall point R = NR/n is given by

PRR = NR / (NR + j + (i · s)/(r + 1))    (7.2)
where

NR  number of relevant items required
j   number of irrelevant items found in ranks 1, . . . , k − 1
tr  number of relevant items found in ranks 1, . . . , k − 1
s   number of relevant items left to be retrieved in rank k, s = NR − tr
i   number of irrelevant items in rank k
r   number of relevant items in rank k
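Equation 7.2 translates directly into code (a sketch; the parameter names follow the definitions above, and the division is carried out in floating point):

```cpp
// Probability of Relevance (equation 7.2): the probability that a
// retrieved item is relevant when we stop at the final rank k needed
// to satisfy a request for NR relevant items.
double prr(int NR, int j, int s, int i, int r) {
    return NR / (NR + j + (static_cast<double>(i) * s) / (r + 1));
}
```

For example, requiring NR = 2 relevant items where one irrelevant item appears in the earlier ranks (j = 1) and the final rank contains one relevant and one irrelevant item (r = 1, i = 1, s = 1) gives PRR = 2/3.5 ≈ 0.57.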
We get precision values for different recall levels by varying NR.

Averaging Results from Multiple Queries

In order to arrive at statistically significant evaluation results, we average precision results at standard recall levels from the 150 queries in the test sample. Since each of these queries might have a different number of relevant items, leading to different recall levels, we cannot average their precision values directly, but need to interpolate them at the selected standard recall levels.
According to Raghavan et al. [16], each query is first processed individually. Define

Rr = r/n,  0 ≤ r ≤ n    (7.3)

as the recall point where r relevant items have been retrieved of a query's n possible relevant items in the list. Let R be the set of such recall points which we actually encounter when running the query. For every actual recall point Ra ∈ R, we calculate the corresponding precision value P(Ra) as explained above. Next, these precision values are updated so that

P(Ra) = max_{Rb ≥ Ra, Rb ∈ R} P(Rb)    (7.4)

in order to reflect the empirical fact that precision on average decreases with recall. After this stage, we determine the precision value corresponding to each of the standard recall levels for the query. Let Rs denote one of the standard recall levels such that

Rr ≤ Rs ≤ Rr+1,  0 ≤ r < n    (7.5)

then

P(Rs) = P(Rr+1).    (7.6)
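The interpolation in equations 7.4–7.6 amounts to taking, at each standard recall level, the highest precision attained at any actual recall point at or above it. A minimal sketch (a query with no actual recall point at or above the standard level contributes no value there; this sketch returns 0 in that case):

```cpp
#include <utility>
#include <vector>

// Interpolated precision at a standard recall level rs (equations
// 7.4-7.6): the highest precision attained at any actual recall
// point Ra >= rs. `points` holds the (recall, precision) pairs
// actually encountered when running the query.
double interpolated_precision(double rs,
                              const std::vector<std::pair<double, double>>& points) {
    double best = 0.0;
    for (const std::pair<double, double>& p : points)
        if (p.first >= rs && p.second > best)
            best = p.second;
    return best;
}
```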
Finally, when all 150 queries have been processed this way, we average the precision values of all queries at each standard recall level.[3]
7.3 Results
Below are the evaluation results for the selected name matching algorithms.
7.3.1 Speed
The most interesting speed measure to look at is the time required to match a single name. According to the results shown in Figure 7.1 and also listed in Table 7.1, the two Soundex variants and ED (Bloom) are the fastest. These three approaches are barely affected by increases in list size. The n-gram analysis variants have acceptable execution times for small lists, but slow down notably for larger name lists. The two bigram variants are the slowest of all evaluated algorithms at list sizes above 100 000 names. ED (Trie) also slows down with increasing list size. It is the slowest algorithm for small lists and among the slower algorithms for large lists. ED (Trie-C), on the other hand, is much faster and among the fastest algorithms overall.

The initialisation and building times of the approaches are included in Appendix A. Since initialisation is done only at startup, and building most likely only occurs once for each new name list, these times are not very interesting as long as they are within reasonable limits. All approaches have build and initialisation times below 1.5 minutes for a list with 195 000 names. In general, the n-gram variants are the slowest, and the edit distance approaches are the fastest, with the Soundex approaches in between.

[3] If we have a query for which the number of relevant items finally found, r, is less than the total number of relevant items in the list, n, then we skip that query when averaging precision values at standard recall levels larger than r/n, since the query does not have any precision values at those recall levels.

Time for Matching a Single Name (s)
Algorithm       5 000   10 000  25 000  50 000  100 000 150 000 195 000
ED (Bloom)      0.002   0.002   0.002   0.002   0.002   0.002   0.002
ED (Trie)       0.186   0.288   0.494   0.734   1.048   1.284   1.461
ED (Trie-C)     0.022   0.033   0.063   0.099   0.149   0.189   0.220
Soundex         0.000   0.000   0.000   0.000   0.000   0.000   0.000
Soundex-E       0.000   0.000   0.000   0.000   0.000   0.000   0.000
N-gram (2, 0)   0.017   0.038   0.130   0.338   1.042   1.567   2.860
N-gram (2, 1)   0.025   0.053   0.152   0.377   0.974   2.777   2.149
N-gram (3, 0)   0.003   0.005   0.013   0.028   0.064   0.091   0.166
N-gram (3, 1)   0.005   0.011   0.027   0.061   0.147   0.477   0.870
N-gram (3, 2)   0.011   0.020   0.059   0.155   0.480   1.430   1.300

Table 7.1: Matching speed.
7.3.2 Effectiveness
The recall–precision graphs for all selected algorithms are shown in Figure 7.2. According to the definition presented in Section 5.1.2, ED (Bloom) is the best algorithm, since it has the highest precision values at all standard recall levels. At 100 percent recall, it has 90 percent precision. Overall, the edit distance algorithms are the best performers. ED (Trie-C) has slightly lower precision than ED (Bloom) at high recall levels. ED (Trie) performs well at low recall levels, but drops in performance as recall increases. The n-gram algorithms share a similar trend of having high precision at lower recall levels and decreasing precision with increasing recall. The best n-gram variants are N-gram (2, 1) and N-gram (3, 2). The two Soundex algorithms lie on a rather stable precision level of around 65 percent and 45 percent respectively. They have the worst precision values at low recall levels. Soundex performs better than most n-gram variants at higher recall levels. Soundex-E is, however, the worst performer overall.
7.3.3 Performance Averages
The graphs in Figure 7.2 show how precise the matches returned by the approximate name matching algorithms are if a certain recall level is reached. What they do not say is how often the algorithms reach that recall level. In general, with the n-gram and trie algorithms, we can reach all recall levels by setting a sufficiently large error threshold. The higher the threshold, the higher the recall level achieved. Using ED (Bloom) and the Soundex algorithms, however, we might not always achieve 100 percent recall, since these algorithms only look at a partition of the name list.

[Figure 7.1: Time for matching a single name. (a) Edit distance and Soundex algorithms; ED (Bloom) and the Soundex algorithms have matching times close to zero (see Table 7.1), so their bars are invisible and would remain nearly invisible even if the figure were zoomed in. (b) N-gram algorithms.]

[Figure 7.2: Retrieval effectiveness (recall–precision graph for all selected algorithms).]

In order to be able to compare the selected algorithms, the average recall and precision of each algorithm were calculated from the values at the end of each test query. Varying error thresholds were used for the n-gram and trie algorithms to get comparable results. The average time it took to match a query name was also calculated. The complete results can be found in Table 7.2, where the average precision, recall, and matching times of the algorithms are listed together with the error threshold at which these results were achieved.

ED (Bloom) is, once again, the best algorithm. It achieves on average around 65 percent recall and 94 percent precision, and it is one of the fastest algorithms. In order to achieve the same average recall level with any other approach, we would have to sacrifice either precision or speed, and in many cases both.
Averages for 150 Queries
Algorithm       Threshold  Precision  Recall  Time (ms)
ED (Bloom)      –          0.943      0.647   1.0
ED (Trie)       1          0.945      0.647   7.2
ED (Trie)       2          0.588      0.892   55.0
ED (Trie)       3          0.256      0.969   190.1
ED (Trie-C)     1          0.969      0.592   2.5
ED (Trie-C)     2          0.686      0.771   8.9
ED (Trie-C)     3          0.350      0.811   18.6
Soundex         –          0.531      0.645   0.1
Soundex-E       –          0.268      0.706   0.1
N-gram (2, 0)   1          0.975      0.408   5.0
N-gram (2, 0)   2          0.858      0.518   4.9
N-gram (2, 0)   3          0.651      0.632   5.0
N-gram (2, 0)   4          0.436      0.786   5.1
N-gram (2, 1)   1          0.998      0.331   9.2
N-gram (2, 1)   2          0.995      0.335   8.1
N-gram (2, 1)   3          0.961      0.482   8.0
N-gram (2, 1)   4          0.859      0.708   8.1
N-gram (2, 1)   5          0.685      0.797   8.1
N-gram (2, 1)   6          0.489      0.863   8.0
N-gram (3, 0)   1          0.994      0.385   0.5
N-gram (3, 0)   2          0.919      0.481   0.5
N-gram (3, 0)   3          0.743      0.588   0.5
N-gram (3, 0)   4          0.562      0.630   0.5
N-gram (3, 0)   5          0.409      0.710   0.5
N-gram (3, 1)   1          1.000      0.314   1.0
N-gram (3, 1)   2          1.000      0.314   1.0
N-gram (3, 1)   3          0.981      0.408   1.0
N-gram (3, 1)   4          0.909      0.515   1.0
N-gram (3, 1)   5          0.705      0.608   1.0
N-gram (3, 1)   6          0.590      0.738   1.0
N-gram (3, 1)   7          0.425      0.796   1.1
N-gram (3, 2)   1          1.000      0.314   3.9
N-gram (3, 2)   2          1.000      0.314   3.9
N-gram (3, 2)   3          0.998      0.330   4.0
N-gram (3, 2)   4          0.994      0.333   4.0
N-gram (3, 2)   5          0.967      0.480   4.0
N-gram (3, 2)   6          0.884      0.699   4.0
N-gram (3, 2)   7          0.711      0.760   4.2
N-gram (3, 2)   8          0.534      0.805   4.1
N-gram (3, 2)   9          0.371      0.851   4.2

Table 7.2: Performance averages.
Chapter 8
Discussion

Judging from the evaluation results in Section 7.3, which approximate name matching algorithm is most fit for fast and effective matching of international names? In this chapter, we discuss these results. We look at how well they fit theory and previous experimental studies, and we compare the algorithms with each other.
8.1 General
In general, the implementations of the different name matching algorithms have not been optimised to the fullest. Care has been taken to write efficient code, but it could most probably be improved further. The absolute times are therefore only an indication of the capability of the various algorithms. The relative speeds of the algorithms should, however, remain approximately the same with more optimised code, which allows us to compare them.
8.2 Subjectiveness of Measured Effectiveness
Regarding retrieval effectiveness, we want to find as many relevant matches as possible and as few irrelevant matches as possible. In the evaluations, we have measured precision and recall of the algorithms. A problem with these measures, and any other measure based on relevancy, however, is that they are highly subjective. Whether two names are similar or not depends on the situation. In this study, the situation considered was travel reservation by telephone. Misspellings in this case could be a combination of both phonetic errors and typographic errors. People might mispronounce letters when spelling out their names on the phone, especially if they are not accustomed to the language of the country in which they are making their reservation. Booking agents might mishear parts of the name or single letters if the name is spelled out. They might also make mistakes when typing in the name. If we had considered online booking instead, phonetic misspellings would have been rare and typographic errors more common, since people usually know their own names, but might slip when typing.
Moreover, different people might have differing opinions on the similarity of two particular names, depending e.g. on their language knowledge and experience. In this project, a wide range of possible approximate matches for each query name was picked out by one person, and this was then reviewed by her and another team member. It would have been far better if language specialists had reviewed the test lists and queries instead. Another alternative would have been to ask more people to go through the test list independently and select the names they thought were similar to each query name. However, neither of these options was possible for this project. The two people making relevancy judgements were together familiar with names of English, French, Spanish, Portuguese, Italian, German, North European, East Asian, and Arab origin. Thus, their relevancy judgements for the various international names in the query set should still be quite reasonable, with perhaps slightly less accuracy for South Asian, East European, and African names. With relevancy being dependent on the situation and the people making the judgements, the results from this project may not be generalised. However, they should give an indication of how well the evaluated algorithms perform relative to each other when the name list in question contains surnames from most parts of the world and the query name is randomly chosen. Also worth mentioning is that the query names used in the evaluations were all in the name list and therefore had at least one relevant match. Searching for names without relevant matches in the list might affect the general effectiveness results. Such a query would always have 100 percent recall, since there are no matches to be found. Hence, the overall average recall for the algorithms should be higher than stated in Table 7.2, but the relative performance of the algorithms should remain the same.
Precision should not be much affected, since from the matching algorithms' point of view, the same matches will be returned for a name regardless of whether it is in the list or not. Furthermore, if a name has an exact match in the name list, then all the approximate name matching algorithms evaluated will return it as a relevant match.
8.3 Defining Acceptable Performance
In order to judge which algorithms are more suitable for our purposes, we first need to discuss what we think is acceptable performance. When it comes to speed, it was decided, after running the tests, that anything with an average matching time longer than 1 second for a list of 100 000 names on the test machine used was uninteresting. Regarding effectiveness, the acceptability of an algorithm's performance depends on several factors. A lower precision is tolerable if the returned matches are reviewed manually by the user. If an algorithm is used for issuing automatic alerts for the matches found, we need much higher precision. It also matters whether the algorithm used returns ranked or unranked results. If the matches are ranked by similarity to the query name, then it is likely that
relevant matches are returned before more irrelevant ones. In that case, a lower average precision is more acceptable than if the matches returned had been unranked, as long as precision is initially high when running the query. In general, for unranked results, precision lower than 50 percent is intolerable. A user who manually reviews the matches would have a hard time sifting out the interesting ones with half of them being junk. An automatic system causing false alarms half of the time would most likely not remain in use for long.
8.4 Intra-Category Algorithm Comparisons
Three categories of approximate name matching algorithms were evaluated. In this section, we compare the different algorithms within each category against each other. The results discussed can be found in Table 7.1 (speed), Figure 7.2 (effectiveness), and Table 7.2 (performance averages).
8.4.1 Edit Distance Algorithms
ED (Bloom) is the most promising option overall with its extremely high matching speed and relatively high effectiveness. According to Figure 7.2, it has over 90 percent precision at all standard recall levels, which is close to the ideal. On average it can find 65 percent of all interesting matches to a query name in the name list. The algorithm has practically the same average precision and recall as ED (Trie) when setting a threshold t = 1 for the trie. This is expected, since according to theory, these two algorithms should return the same matches. The former has slightly lower average precision than the latter (94.3 percent compared to 94.5 percent) due to hash collisions. However, ED (Bloom) is significantly faster than ED (Trie), and the difference in speed increases drastically with list size. With a list size of 100 000 names, ED (Trie) can take over 1 second to match a single name in the worst case.[1] Even when only allowing a single typing error, the trie approach is around seven times slower than ED (Bloom) on a small list of 5 000 names (see Table 7.2). This makes ED (Trie) too slow by our standards. Its advantage is that it can achieve higher recall than ED (Bloom) by allowing more typing errors. However, this comes at the cost of a sharp decrease in average precision. The most interesting matches are returned first, and the more typing errors we allow, the more false matches we get. As mentioned in Section 5.2.2, tests by Pfeifer et al. [14] show that allowing more than three typing errors does not improve the effectiveness of the traditional edit distance algorithm by much. This study confirms these findings. ED (Trie) returns the same matches as the traditional edit distance approach would. Setting the error threshold t = 3, we get an average recall of 97 percent, i.e. almost all interesting
[1] In the worst case scenario, we have to visit the entire trie, since it is not possible to tell whether a name's edit distance to the query name exceeds the error threshold before reaching the end of that name.
matches are found. However, this leads to an intolerable average precision of 26 percent.

ED (Trie-C) is much faster than ED (Trie), since only one branch from the root is visited, but still slower than ED (Bloom) due to the edit distance matrix calculations. As seen in Figure 7.2, it has the second best effectiveness after ED (Bloom). However, it also drops in average precision with a higher error threshold t. Since it has acceptable speed also at larger list sizes and can achieve higher recall on average than ED (Bloom), ED (Trie-C) could be an option to consider if we are willing to tolerate slower matching and can afford to lose precision. Setting t = 2, we get better recall than with ED (Bloom) (77 percent compared to 65 percent), but much lower precision (69 percent compared to 94 percent).
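The edit distance matrix calculations referred to above follow the classic dynamic-programming scheme of Wagner and Fischer [18]; a minimal sketch:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Classic Wagner-Fischer dynamic programming: d[i][j] is the minimum
// number of insertions, deletions, and substitutions needed to turn
// the first i letters of a into the first j letters of b.
int edit_distance(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1,
                                    std::vector<int>(b.size() + 1));
    for (std::size_t i = 0; i <= a.size(); ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= b.size(); ++j) d[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i)
        for (std::size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::min({ d[i - 1][j] + 1,       // deletion
                                 d[i][j - 1] + 1,       // insertion
                                 d[i - 1][j - 1] + (a[i - 1] != b[j - 1]) });
    return d[a.size()][b.size()];
}
```

The trie-based variants interleave this row-by-row computation with the trie traversal, pruning a branch as soon as every entry in the current row exceeds the error threshold.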
8.4.2 Soundex Algorithms
Soundex and Soundex-E are the fastest algorithms overall; it was impossible to distinguish which of the two is faster. On the other hand, they are also the two worst algorithms overall when it comes to effectiveness (see Figure 7.2). Soundex-E finds slightly more relevant names than Soundex (71 percent compared to 65 percent), since it can find similar names that do not start with the same letter, but loses a great deal in precision. It has an average precision of 27 percent, which is extremely low, especially since the matches returned are not ranked. Soundex has an average precision of 53 percent, which is also below par.
8.4.3 N-gram Analysis
In general, the bigram variants are slower than the trigram variants of n-gram analysis. This is reasonable, since a smaller n means that the query name is split into more n-grams, which in turn implies more index lookups and a larger list of names with at least one n-gram in common with the query name. There is also a trend that the larger the number of blanks k added, the slower the matching. Added blanks increase the number of n-grams to look up, which decreases speed. However, they also increase effectiveness, as mentioned in Section 3.3.2, since the beginning and the end of the name are emphasised, and more n-grams give better results, especially for shorter names. The worst n-gram variant evaluated is N-gram (2, 0). It has the lowest effectiveness and is the second slowest after N-gram (2, 1). The fastest variant is N-gram (3, 0), which has the second lowest effectiveness. The two approaches with the best effectiveness are N-gram (2, 1) and N-gram (3, 2), the approaches using the most blanks. This is in accordance with the findings of Pfeifer et al. [14] from their study of 14 000 names.
8.5 Inter-Category Algorithm Comparisons
After comparing the algorithms within each of the three evaluated categories, we now compare the performance of these categories against each other. The Soundex algorithms are, as mentioned earlier, the fastest overall (see Table 7.1), which is no surprise considering that encoding a name into its Soundex key is extremely fast and only one index lookup by that key is needed. N-gram analysis is expectedly slower than Soundex, since it requires as many lookups as there are n-grams in the query name, typically more than one. It also has to calculate the n-gram distance between each looked-up name and the query name. The edit distance algorithms have very different speed performance. Unsurprisingly, ED (Trie) is among the slowest algorithms due to the trie traversal and the edit distance matrix calculations. ED (Trie-C) saves a great deal of time, since it only visits a single branch from the trie root. ED (Bloom) is almost as fast as Soundex.

In Section 5.2.3, two earlier comparative studies were presented which concluded that n-gram analysis is the best overall approach. These studies compared n-gram analysis with the traditional edit distance approach and some phonetic algorithms, and found that edit distance is more accurate but much slower, while Soundex is much faster but not as effective. The evaluation results in Section 7.3 are similar to those in the above studies. However, they do not support the conclusion that n-gram analysis is the best approach when both speed and effectiveness are considered, but instead suggest that ED (Bloom) and ED (Trie-C) are the overall best approaches. The results in Figure 7.2 show, in accordance with the other studies, that n-gram analysis is more effective than Soundex. Soundex-E is far worse, while Soundex catches up at high recall levels. Nevertheless, n-gram analysis has the advantage of returning ranked results.
This means that the most relevant matches are returned at lower recall levels, where n-gram analysis has much higher precision than the evaluated Soundex algorithms. Moreover, Soundex suffers from language dependency, while both the n-gram and edit distance algorithms are completely language independent. In this study, the edit distance algorithms are, in line with the results of the other studies, more effective than n-gram analysis. Moreover, ED (Bloom) is far faster than all evaluated n-gram variants at all list sizes and comparable to the Soundex algorithms in speed. ED (Trie-C) is also much faster than most n-gram variants for list sizes above 50 000 names. While the other studies used the traditional edit distance approach for comparison and found it too slow, this study shows that it is possible to have the cake and eat it too, combining the high effectiveness of edit distance algorithms with good speed performance.
8.6 Spelling Error Patterns for Names
In Section 2.2, some empirical findings on spelling error patterns in written text were presented. For words, more than 80 percent of all misspellings were found to contain only a single error. Also, the misspelled words were rarely wrong in the first letter. How valid are these findings when it comes to names, which might be less regular than words? ED (Bloom) can only return matches a single error apart from the query name. The algorithm seems to perform well for names, with 65 percent recall on average. Figure 7.2 shows that it outperforms all other evaluated algorithms at all standard recall levels and has close to ideal precision. If we compare the effectiveness of ED (Trie) and ED (Trie-C), where the latter partitions the name list by first letter, the recall–precision graphs in Figure 7.2 show that ED (Trie-C) is the more accurate approach. The same holds when comparing Soundex, which partitions the list by first letter, with Soundex-E. The former has higher precision than the latter at all standard recall levels and only slightly lower recall on average (65 percent compared to 71 percent). According to these results, it would seem that the general spelling error patterns found for words apply to international names as well. This might be of interest when designing better approximate name matching approaches in the future.
Chapter 9
Conclusion

The goal of the project presented in this thesis was to find an approximate name matching solution suited for fast, effective searches of personal names in an international environment. We wanted a solution which is language independent, fast, and capable of finding as many interesting matches and as few irrelevant matches as possible. In other words, we wanted flexibility, speed, high recall, and high precision. However, there is a trade-off between these factors, and which solution to recommend depends on which of them we value the most.
9.1 Recommendations
ED (Bloom) (reverse edit distance with a Bloom filter) and ED (Trie-C) (edit distance with a trie, keeping the first letter constant) are two approaches which are both language independent, quite fast, and have relatively high recall and precision. If we require speed or high precision most of all, the Bloom filter algorithm is the best approach. If high recall is the most important factor and we are willing to sacrifice precision, the trie algorithm would be better.
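To make the recommendation concrete, the reverse edit distance idea behind ED (Bloom) can be sketched as follows. This is a simplified illustration, not the thesis implementation: it uses a toy Bloom filter with three hash probes over a fixed-size bitset, and candidate generation is limited to the letters a–z.

```cpp
#include <bitset>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter over the name list: each name sets three bits
// derived from differently seeded hashes. Lookups may yield false
// positives (hash collisions) but never false negatives. The bitset
// is large, so keep instances static or on the heap.
class NameBloom {
    std::bitset<1 << 20> bits_;
    std::size_t probe(const std::string& s, int seed) const {
        return std::hash<std::string>{}(s + static_cast<char>('0' + seed))
               % bits_.size();
    }
public:
    void add(const std::string& s) {
        for (int k = 0; k < 3; ++k) bits_.set(probe(s, k));
    }
    bool maybe_contains(const std::string& s) const {
        for (int k = 0; k < 3; ++k)
            if (!bits_.test(probe(s, k))) return false;
        return true;
    }
};

// Reverse edit distance: instead of scanning the list, generate every
// string one typing error away from the query and probe the filter
// with each candidate.
std::vector<std::string> single_error_variants(const std::string& q) {
    const std::string alphabet = "abcdefghijklmnopqrstuvwxyz";
    std::vector<std::string> out;
    for (std::size_t i = 0; i < q.size(); ++i)        // deletions
        out.push_back(q.substr(0, i) + q.substr(i + 1));
    for (std::size_t i = 0; i <= q.size(); ++i)       // insertions
        for (char c : alphabet)
            out.push_back(q.substr(0, i) + c + q.substr(i));
    for (std::size_t i = 0; i < q.size(); ++i)        // substitutions
        for (char c : alphabet)
            if (c != q[i]) {
                std::string v = q;
                v[i] = c;
                out.push_back(v);
            }
    return out;
}
```

The candidate set grows only with the length of the query, not with the size of the name list, which is why this approach is barely affected by list size; filter hits are then verified against the actual list.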
9.2 Future Work
Both of the proposed algorithms leave room for improvement. The Bloom filter approach is extremely fast and precise, but misses some good matches. The trie approach is slower and much less precise, but finds more of the interesting matches in the list. It is hard to think of another approach with equally fast matching time which can maintain the same level of retrieval effectiveness; even with more time, the edit distance algorithms would still return the most relevant results. Improving precision without losing recall is difficult. We would have to somehow filter out irrelevant matches. This could be done by only presenting the intersection of matches returned by different algorithms. However, it is likely that recall
would decrease with this approach. Another way is to apply weights to screen out non-credible matches. The problem with this approach is, as mentioned earlier, that weights are language specific. In order to get better results, expert language analysis is needed, and we would no longer have language independence. In the future, it would be interesting to investigate whether recall can be improved by combining the returned matches from a few different approaches. We would want to choose relatively fast algorithms, in order not to decrease matching speed considerably. To get different results, these algorithms need to be from different categories. They must also have high precision, since we wish to reach a higher recall level without losing precision. A possible combination would be ED (Bloom), which is extremely fast, has very high precision, and relatively high recall, together with one of the faster n-gram variants, with a relatively low error threshold to keep precision high and matching time low.
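The intersection filter suggested above could be sketched as follows (hypothetical helper name; each algorithm's matches are assumed to be collected into a set of names):

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

// Hypothetical post-filter: keep only the matches that two different
// algorithms agree on (set intersection), trading recall for
// precision.
std::set<std::string> agreed_matches(const std::set<std::string>& a,
                                     const std::set<std::string>& b) {
    std::set<std::string> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.begin()));
    return out;
}
```

Taking the union instead of the intersection would raise recall at the cost of precision, which is the combination discussed above.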
Bibliography

[1] F. J. Damerau. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3):171–176, March 1964.
[2] M.-W. Du and S. C. Chang. An approach to designing very fast approximate string matching algorithms. IEEE Transactions on Knowledge and Data Engineering, 6(4):620–633, August 1994.
[3] T. N. Gadd. Phonix: The algorithm. Program, 24(4):363–366, October 1990.
[4] P. A. V. Hall and G. R. Dowling. Approximate string matching. ACM Computing Surveys, 12(4):381–402, December 1980.
[5] R. J. Jenkins. A hash function for hash table lookup. Dr Dobb's Journal, 22(9):107–109, September 1997.
[6] V. Kann, R. Domeij, J. Hollman, and M. Tillenius. Implementation aspects and applications of a spelling correction algorithm. In L. Uhlirova, G. Wimmer, G. Altmann, and R. Koehler, editors, Text as a Linguistic Paradigm: Levels, Constituents, Constructs. Festschrift in honour of Ludek Hrebicek, volume 60 of Quantitative Linguistics. WVT, 2001.
[7] K. Kukich. A comparison of some novel and traditional lexical distance metrics for spelling correction. In Proceedings of INNC-90-Paris, Paris, France, July 1990, pages 309–313, 1990.
[8] K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439, December 1992.
[9] A. J. Lait and B. Randell. An assessment of name matching algorithms. Department of Computing Science, University of Newcastle upon Tyne, UK, 1993.
[10] R. Lowrance and R. A. Wagner. An extension of the string-to-string correction problem. Journal of the ACM, 22(2):177–183, April 1975.
[11] K. Monk. Kate Monk's onomastikon. URL http://www.gaminggeeks.org/Resources/KateMonk/. Accessed 2005-04-08.
[12] G. Navarro and R. Baeza-Yates. Matchsimile: A flexible approximate matching tool for searching proper names. Journal of the American Society for Information Science and Technology, 54(1):3–15, January 2003.
[13] J. L. Peterson. A note on undetected typing errors. Communications of the ACM, 29(7):633–637, July 1986.
[14] U. Pfeifer, T. Poersch, and N. Fuhr. Searching proper names in databases. In R. Kuhlen and M. Rittberger, editors, Hypertext - Information Retrieval - Multimedia, Synergieeffekte elektronischer Informationssysteme, pages 259–276. Universitätsverlag Konstanz, Konstanz, Germany, 1995.
[15] J. J. Pollock and A. Zamora. Automatic spelling correction in scientific and scholarly text. Communications of the ACM, 27(4):358–368, April 1984.
[16] V. V. Raghavan, G. S. Jung, and P. Bollman. A critical investigation of recall and precision as measures of retrieval system performance. ACM Transactions on Information Systems, 7(3):205–229, July 1989.
[17] G. Salton. Automatic text transformations. In G. Salton, editor, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, pages 425–470. Addison-Wesley, MA, USA, 1988.
[18] R. A. Wagner and M. J. Fischer. The string-to-string correction problem. Journal of the ACM, 21(1):168–173, January 1974.
[19] Wikipedia: The Free Encyclopedia. List of most common surnames. URL http://en.wikipedia.org/wiki/List_of_most_common_surnames. Accessed 2005-04-08.
[20] Wikipedia: The Free Encyclopedia. Neural network. URL http://en.wikipedia.org/wiki/Neural_net. Accessed 2005-03-11.
[21] Wikipedia: The Free Encyclopedia. Personal name. URL http://en.wikipedia.org/wiki/Personal_name. Accessed 2005-03-02.
[22] W. E. Winkler. The state of record linkage and current research problems. Technical report, U.S. Bureau of the Census, 1999.
[23] J. Zobel and P. Dart. Finding approximate matches in large lexicons. Software - Practice and Experience, 25(3):331–345, March 1995.
Appendix A
More Evaluation Results

                        Time for initialisation (s)
Algorithm       N = 5 000   10 000   25 000   50 000   100 000   150 000   195 000
ED (Bloom)          0.016    0.016    0.032    0.062     0.125     0.235     0.234
ED (Trie)           0.016    0.047    0.125    0.234     0.438     0.649     0.836
ED (Trie-C)         0.016    0.047    0.125    0.234     0.438     0.649     0.836
Soundex             0.039    0.078    0.195    0.402     0.879     1.391     1.863
Soundex-E           0.039    0.078    0.195    0.402     0.879     1.391     1.863
N-gram (2, 0)       0.297    0.649    2.117    5.180    14.876    28.726    48.781
N-gram (2, 1)       0.375    0.852    2.719    7.070    20.438    40.555    70.665
N-gram (3, 0)       0.266    0.531    1.423    3.531     6.610    11.055    15.789
N-gram (3, 1)       0.328    0.719    2.000    4.329    10.243    17.852    27.368
N-gram (3, 2)       0.422    0.930    2.625    6.047    15.344    29.164    45.742

Table A.1: Initialisation speed.

                          Time for building (s)
Algorithm       N = 5 000   10 000   25 000   50 000   100 000   150 000   195 000
ED (Bloom)          0.031    0.047    0.086    0.188     0.368     0.696     0.743
ED (Trie)           0.074    0.149    0.379    0.789     1.614     2.484     2.856
ED (Trie-C)         0.074    0.149    0.379    0.789     1.614     2.484     2.856
Soundex                 –        –        –        –         –         –         –
Soundex-E               –        –        –        –         –         –         –
N-gram (2, 0)       0.352    0.766    2.108    4.282     9.024    14.016    16.868
N-gram (2, 1)       0.453    1.016    2.750    5.758    12.055    18.383    21.922
N-gram (3, 0)       0.328    0.711    1.922    4.079     8.555    13.586    15.515
N-gram (3, 1)       0.438    0.946    2.578    5.437    11.492    17.961    20.750
N-gram (3, 2)       0.547    1.172    3.235    6.789    14.258    21.860    25.821

Table A.2: Build speed.