A Hybrid Cross-Language Name Matching Technique using Novel Modified Levenshtein Distance

Doaa Medhat, Ahmed Hassan, Cherif Salama
Computer and Systems Department, Faculty of Engineering, Ain Shams University, Cairo, Egypt
[email protected], [email protected], [email protected]
Abstract—Name matching is a key component of various applications in our life, such as record linkage and data mining. This process suffers from multiple complexities, such as matching data from different languages or data written by people from different cultures. In this paper, we present a new modified Cross-Language Levenshtein Distance (CLLD) algorithm that supports matching names across different writing scripts with many-to-many character mappings. In addition, we present a hybrid cross-language name matching technique that combines a phonetic matching technique with our proposed CLLD algorithm to improve the overall f-measure and speed up the matching process. Our experiments demonstrate that this method substantially outperforms a number of well-known standard phonetic and approximate string similarity methods in terms of precision, recall, and f-measure.

Keywords—cross language, multilingual, name matching, record linkage, entity resolution, deduplication

I. INTRODUCTION

Name matching is an important component of various applications in our life, such as Customer Relation Management (CRM), Customer Data Integration (CDI), anti-money laundering, criminal investigation, healthcare, and genealogy services. All these applications involve a variety of data sources and face the problem of identifying similar records using various fields without a unique identifier [1]. Data sources may include customer, vendor, and employee databases, contact directories, watch lists, purchased information sources, and freely available directories on the Internet. Although name matching is not the only component of an identity or data matching process [2], it is the most challenging part, the main starting point, and a problematic issue that needs to be addressed carefully.

An important aspect of data matching is determining whether two records of personal details relate to the same individual; this is where record name matching comes into play. The complexity of names originates mainly from the numerous ways to write the same name due to linguistic, formatting, and presentation variations. Linguistic variations [2] include different writing methods or phonetic sounds. Non-linguistic variations include changing the order of name parts, compound names, initials, nicknames, missing name components, or simple typing errors [1] [3]. Moreover, each culture, writing system, and language adds its own influences to the ways in which names can vary.

Both language and culture play an important role in the development of proper names. Many different structures of proper names are used in our world today, including the Western, Russian, Arabic, and Chinese naming systems, and the structure of names may vary between them. For example, the basic Western naming standard combines one or two given names with a family name as first name, optional middle name, and last name. The Arabic naming system, however, is one of the most complex: names either follow a system close to the Western one or follow the traditional form of an Arabic name [4], which consists of up to five different elements: Kunya, Ism, Nasab, Laqab, and Nisbah.

A more problematic source of variation in proper names is that caused by transferring a name from one writing system to another, as the same non-Latin name may be represented by many different valid versions. The process of converting a name from one writing system to another is referred to as transcription or transliteration [5].

To obtain an accurate and reliable matching process, efficient handling of linguistic and non-linguistic variations is a very important aspect that represents a big challenge. Different name matching techniques [1] are currently available, such as phonetic encoding, pattern matching, dictionary-based matching, and other techniques that will be illustrated in Sections II and III. However, to our knowledge, none of these techniques can be used as a complete system for solving all these challenges.
978-1-4673-9971-5/15/$31.00 ©2015 IEEE
The remaining parts of this paper are organized as follows: Section II presents an overview of name matching techniques. Section III presents related work and the different quality measures. Section IV demonstrates the proposed methodology. Section V presents the results and discussion of experiments comparing the currently available techniques against the proposed ones. Finally, Section VI concludes the paper.

II. NAME MATCHING OVERVIEW

Name matching can be defined as the process of determining whether two name strings are instances of the same name. The two main approaches for matching names are phonetic encoding and pattern matching [1]. Phonetic encoding approaches aim to reduce names to a key or code based on their pronunciation, so that similar-sounding names share the same key. Pattern matching approaches, on the other hand, aim to quantify the differences in the character sequences between names and are commonly used in approximate string matching. Another approach that can be used in name matching is the dictionary-based approach, which lists all possible spelling variations, abbreviations, and derivative forms of proper names. Moreover, there are other techniques that will be discussed in the next section.

A. Phonetic Matching Techniques
These techniques aim to convert a name string into a code according to the way the name is pronounced; Soundex can be considered the grandfather of several modern name-matching methods. Most of these techniques have been developed mainly for the English phonetic structure, although several have been adapted for other languages as well. This category includes Soundex, Daitch-Mokotoff Soundex, Metaphone, Double Metaphone, Metaphone 3, Phonex, Phonix, NYSIIS, Fuzzy Soundex, etc. [1] [6]. Phonetic matching techniques are generally fast and produce high recall by finding most of the correct matches, but they have low precision because they produce many false hits. Precision drops further when matching non-Latin-script names, which must first be transliterated to Latin characters before these techniques can be applied; the transliteration adds further difficulty because of the name variations it generates.

B. Pattern Matching Techniques
These techniques calculate a character-by-character distance between two names and are commonly used in approximate string matching; a normalized similarity measure between 1.0 (strings are the same) and 0.0 (strings are completely different) is usually calculated. Each technique uses a different approach to calculate this similarity measure.

The Levenshtein distance is defined as the smallest number of edit operations (insertions, deletions, and substitutions) required to change one string into another; it can be calculated using dynamic programming. The greater the Levenshtein distance, the more different the strings are. By default, the Levenshtein distance is calculated between words from the same writing script, so in order to compare two names from different scripts, one of them must first be transliterated to the other script. For example, the name "Richard" can take many different forms when written in Arabic. The example in Fig. 1 demonstrates the usage of Levenshtein distance to match "Richard" with one such Arabic form transliterated as "rtshard"; the minimum distance between them equals two, which represents replacing "t" with "i" and "s" with "c".

        r   i   c   h   a   r   d
    0   1   2   3   4   5   6   7
r   1   0   1   2   3   4   5   6
t   2   1   1   2   3   4   5   6
s   3   2   2   2   3   4   5   6
h   4   3   3   3   2   3   4   5
a   5   4   4   4   3   2   3   4
r   6   5   5   5   4   3   2   3
d   7   6   6   6   5   4   3   2

Fig. 1. Levenshtein Distance Example

Many other techniques exist, such as N-gram methods, Damerau-Levenshtein distance, Smith-Waterman, longest common substring (LCS), skip-grams, Jaro, Jaro-Winkler, etc. Both the Levenshtein distance and N-gram methods have similar drawbacks when used for name matching: they do not take the phonetic nature of the characters into account, and linguistic variations are treated the same way as random differences.

A comprehensive comparison between the different phonetic and pattern matching algorithms has been carried out [1] [6] [7] [8]; it shows that no technique performs better than all others in all cases.

C. Dictionary-Based Matching Techniques
These techniques use dictionaries of name variations to identify relationships between names that should be matched, by listing all possible spelling variations of each name. The dictionary may contain many different classes of name variants, including transcription variants (such as Yeltshin, Jelzin, and Eltsine), homophones (such as Thompson and Thomson), nicknames (such as Will, Bill, and Billy for William), and common abbreviations (such as Ltd and Limited). Having all these classes would provide a perfect solution for matching names; however, it is not easy to find or create such a dictionary. The performance of dictionary-based methods can be slow because very large lists must be maintained. Furthermore, name variations not appearing in the lists will not be matched.

III. RELATED WORK

Multiple studies have addressed the name matching problem, including hybrid techniques that use two or more of the previously described methods, allowing the strengths of one to compensate for the weaknesses of another. Some techniques combine phonetic encoding and pattern matching methods to improve matching quality, such as Editex [9], which improves phonetic matching accuracy by combining edit-distance-based methods with the letter-grouping techniques of Soundex and Phonix, in Latin script only.
Evaluating the standard string comparison measures on matching Arabic-language person names with their corresponding English representations shows poor performance, due to the variety of transliteration conventions in both languages and the fact that Arabic script does not usually represent short vowels. A personal name matching methodology is proposed in [10] that matches Arabic-language person names with their corresponding English representations by augmenting the classical Levenshtein edit-distance algorithm with character equivalency classes, i.e., sets of rules that treat certain characters as identical in the computation of the edit distance. However, this method only considers a one-to-many mapping between single characters from Arabic to English.
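For reference, the classical Levenshtein recurrence that these extensions modify can be sketched as follows. This is a minimal illustration of the standard dynamic-programming algorithm, not the implementation used in [10]:

```python
def levenshtein(source: str, target: str) -> int:
    """Classical edit distance: the minimum number of insertions,
    deletions, and substitutions turning source into target."""
    m, n = len(source), len(target)
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # delete all of source[:i]
    for j in range(n + 1):
        dist[0][j] = j          # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n]

# Reproduces the Fig. 1 example: two substitutions (t -> i, s -> c).
print(levenshtein("rtshard", "richard"))  # 2
```

Running it on the transliterated pair from Fig. 1 gives the distance of two reported there.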
Another framework for name matching between the Arabic language and other languages is proposed in [11] and extended in [12]. It uses a phonetic dictionary-based approach on top of a new version of Soundex, called the "Arabic Combined Soundex" algorithm, to handle compound Arabic names (such as Abdel Aziz and Essam El Din). The dictionary is consulted to find the transliteration of an Arabic name, and results are verified by consulting the dictionary in the reverse direction. This framework can be considered a fast cross-language matching method with high recall, but, like the other phonetic techniques, it suffers from lower precision.

While we focus on matching single names, other research has addressed the matching of full person names within the same language, for instance by finding the best transformation path for a name [13] using a graph-based algorithm and a Support Vector Machine (SVM). Another full person name matching approach is proposed in [14], which uses an extended Smith-Waterman alignment algorithm and logistic regression.

IV. METHODOLOGY

The proposed solution represents a generalized framework that can be used for matching names between different languages, along with a new cross-language technique (CLLD) that improves the quality of matching names across different languages. As shown in Fig. 2, the proposed name matching process consists of five main phases: Language Identification, Normalization, Parsing, Pair-wise Comparison, and Classification.
For evaluating the quality of the matching technique, the following measures are usually used [11]: Precision (positive predictive value) is the fraction of retrieved instances that are relevant.
P = TP/ (TP + FP)
Recall (sensitivity) is the fraction of relevant instances that are retrieved. R = TP/ (TP + FN)
F-Measure is a measure that combines precision and recall by computing their harmonic mean.
F-Measure = 2 * (P * R)/ (P + R)
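A minimal sketch of how these three measures are computed from the match counts; the example counts are the CLLD (0.95) figures reported later in TABLE II:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F-measure from TP/FP/FN counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    return p, r, f

p, r, f = precision_recall_f1(tp=5494, fp=3892, fn=903)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.585 0.859 0.696
```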
Fig. 2. Solution Architecture: Language Identification; Cleaning and Normalization (General Normalization and Language-Specific Normalization, driven by Norm. Rules); Parsing (driven by Grammar Rules); Pair-wise Comparison (Phonetic Similarity: Soundex, Metaphone, NYSIIS; Pattern Similarity: Levenshtein, Jaro-Winkler, Editex, CLLD; using the Characters Map); Classification
Where TP (true positive) is the number of items correctly labeled as belonging to the positive class, FP (false positive) is the number of items incorrectly labeled as belonging to the positive class, and FN (false negative) is the number of items that were not labeled as belonging to the positive class but should have been.

A. Language Identification
The language identification process determines the writing script of the presented names using the Unicode values of the characters in a name.

B. Normalization
The text normalization process is an essential step for cleaning and standardizing the input data. It consists of general and language-specific normalization rules.

1) General Normalization: A large number of Natural Language Processing (NLP) tasks require the text to be free of punctuation. Therefore, simple rules are used to remove punctuation such as the comma (,), semi-colon (;), colon (:), exclamation mark (!), question mark (?), etc. In more detail, the following normalization steps are executed: (1) Case normalization: converting all letters into lowercase. (2) Punctuation suppression: removing all punctuation, with the ability to exclude some marks that can be used later in the parsing phase, such as left and right brackets. (3) Link stripping: replacing links between words, such as the dot (.), underscore (_), and dash (-) characters, with blanks. (4) Blank normalization: replacing any whitespace (blank characters, tabs, and carriage returns) with a single blank character.

2) Language-Specific Normalization: Another issue that should be addressed while normalizing the text is the variation of language-specific letters.

a) Arabic Characters Normalization: Different forms of some Arabic characters may be written interchangeably. These misspelling variants can affect some words, so such characters need to be normalized to a single letter; this includes normalizing all hamzated Alef forms to plain Alef, Taa marbutah to Haa, and Alef maqsurah to Yaa. In addition, some extra characters should be removed, such as the Tatweel character and the Arabic diacritics damma, fatha, kasra, and sukon.

b) Latin Characters Normalization: Different forms of some Latin characters may also be written interchangeably, such as (a, à, â), (e, é, è, ê, ë), (i, ï, î), (o, ó, ö, ô), and (u, ú, ü); these characters are normalized to their unaccented forms. In addition, when dealing with Arabic informal text written in Latin script (the Arabic chat alphabet), some numbers are used to represent Arabic characters that have no phonetic equivalent in English; these numbers are normalized to an equivalent English character, such as normalizing '7' to 'h' and '3' to 'a'.

C. Parsing
After normalizing the input data, if the data is not represented by a single word or has a specific format that should be considered while processing it, a language-based parser handles this issue. For example, a person's full name may be presented in many forms; an optional title like "Doctor" may precede the name, and an optional suffix like "II" may follow the last name.

For each language, parser and lexer rules are defined by the user in the grammar rules repository; an automatic parser generator (such as ANTLR [15]) is then used to generate the parser and lexer code. For person names, the parser is able to parse the different name constituents according to their structure, for example Titles (Doctor, Mr., Mrs., Ms.), Honorifics (Prince, King, Queen), Generational (Jr., Sr.), Order (II, Second, Two), Location (of Egypt), Description (the Greater, the German), and other name parts; a set of rules is used to identify these fields according to their positions. The parser is also able to identify compound names that include prefixes and/or suffixes, such as (Abdel Rahman, Abd Allah, Salah Eldeen). After parsing a compound name, the spaces between its parts are removed so that it is represented as a single word.

D. Pair-wise Comparison
The proposed pair-wise name comparison is a hybrid implementation that makes a first pass using a phonetic encoding method, taking advantage of its high recall by filtering the highest-scoring matches, and then performs a slower but higher-precision second pass using the proposed CLLD algorithm to compare names directly in their original scripts.

The standard implementation of the Levenshtein edit distance is designed for one-to-one character comparison without considering the phonetic structure of the characters. The proposed CLLD is able to calculate the distance between two names from different writing scripts, consider phonetic mappings between characters, and handle many-to-many character mappings between them. The proposed method depends on having a predefined list of possible mappings between characters of the source and target languages, including characters with no mapping and many-to-many substring mappings. TABLE I demonstrates part of the map, with Arabic as the source language and English as the target language. Note that "NULL" represents characters that may have no representation in the other language.

TABLE I. CHARACTERS MAP

  Source (Arabic)   Target (English)
  -                 b, p
  -                 f, ph, v
  -                 sh, ch
  -                 ch, sch
  -                 x
  -                 NULL
  -                 NULL
  NULL              a, e, i, o, u
  -                 a, e, i, o, u

For gathering the character mappings, an automated process was implemented using GIZA++ [5] to collect one-to-many character mappings from Arabic to English and from English to Arabic over a parallel corpus of names; the two directions were then merged to obtain many-to-many character mappings. This method can be used to bootstrap the creation of the characters map, after which additional mappings can be added or filtered.
The following modifications are applied to the original Levenshtein distance algorithm in order to calculate the new distance using the proposed CLLD algorithm:

- Initialization: load the characters mapping.
- The value of each cell equals the minimum of the following:
  - The cost of the previous source cell plus the insertion cost. The insertion cost equals one; however, if the source character can be aligned with null, the insertion cost equals zero.
  - The cost of the previous target cell plus the deletion cost. The deletion cost equals one; however, if the target character can be aligned with null, the deletion cost equals zero.
  - For each mapping that ends with the current characters, the cost of the cell before the longest match. Otherwise, the cost of the diagonal cell plus a substitution cost of one.

The example shown in Fig. 3 demonstrates the new CLLD algorithm on the same example discussed in Section II. The final distance between the two words equals zero, because the insertion cost of "i" equals zero and the two-letter Arabic substring تش is matched with "ch".

        r   i   c   h   a   r   d
    0   1   2   3   4   5   6   7
ر   1   0   0   1   2   2   3   4
ت   2   1   1   1   2   2   3   4
ش   3   2   2   2   0   0   1   2
ا   4   3   2   3   1   0   1   2
ر   5   4   3   3   2   1   0   1
د   6   5   4   4   3   2   1   0

Fig. 3. CLLD Example

To improve the overall f-measure and speed up the matching process, language-specific dissimilarity rules are used to exclude clearly unmatched names, for example excluding matches between Arabic names with a particular ending and English names ending with 'yah' or 'ia' (such as 'Samia').

Other variations that can occur between the compared names, such as one name being the initial of the other, two initials being compared, or a variation/nickname of a name, are handled by adding these cases to a dictionary.

E. Classification
A threshold-based classifier determines whether two names represent a match based on the distance calculated in the previous stage. The calculated distance is converted into a similarity measure (between 0.0 and 1.0) using the equation below, where a greater similarity score indicates a stronger match:

Sim(s1, s2) = 1.0 - distance(s1, s2) / max(|s1|, |s2|)

V. EXPERIMENTS

To our knowledge, there is no standard public dataset for evaluating the name-matching task. Therefore, we collected a dataset from the actors' names available on the elcinema.com website (an online Arabic movie database), which provides information about movies, actors, news, videos, etc. The data was collected by parsing the HTML pages and extracting the following fields: ID, Arabic name, English name, biography, place of birth, date of birth, photos, videos, titles, etc. The collected dataset contains more than 80,000 records and is available at http://goo.gl/a7rhNQ. In order to build a gold standard for evaluating the experiments, some of the collected records were manually filtered out, as they did not contain correct data or referred to entities that are not person names. The actors' records with Arabic nationality were then gathered, and GIZA++ was used to align each Arabic and English full name within the same record to extract pairs of similar single names (with around 90% of names correctly aligned, as some data is noisy and not written correctly, which affects the overall quality of matching). Finally, a distinct set of Arabic-English single-name pairs was used as the gold standard (the set contains 6,397 distinct name mappings).

A. Experiment 1: Phonetic and Pattern-Based Matching
In the first experiment, we used an open-source Java toolkit [16] of name-matching methods to experimentally compare the different string distance metrics on the collected dataset against the Arabic Soundex, Editex, and the proposed CLLD algorithm. For algorithms other than Arabic Soundex and CLLD, the Arabic name is first transliterated using predefined rules before being compared with the other name.

The results are demonstrated in TABLE II. For each metric in that table, the colors in the original represent the following: Green for best results, Light Green for very good results, Yellow for neutral results, and Red for bad results.
TABLE II. EXPERIMENT 1 RESULTS

  Matching Algorithm   TP    FP     FN    Precision  Recall  F-Measure
  Soundex              5443  68937  954   0.073      0.851   0.135
  Arabic Soundex       6011  75913  386   0.073      0.940   0.136
  Metaphone            4184  42746  2213  0.089      0.654   0.157
  Double Metaphone     5031  56898  1366  0.081      0.786   0.147
  Daitch-Mokotoff      3940  42030  2457  0.086      0.616   0.150
  NYSIIS               804   5347   5593  0.131      0.126   0.128
  Jaro-Winkler         1403  3732   4994  0.273      0.219   0.243
  Levenshtein          263   24     6134  0.916      0.041   0.079
  Editex               336   57     6061  0.855      0.053   0.099
  CLLD (0.95)          5494  3892   903   0.585      0.859   0.696
  CLLD (0.90)          5522  4014   875   0.579      0.863   0.693
  CLLD (0.85)          5813  27577  584   0.174      0.909   0.292
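For concreteness, the CLLD recurrence described in Section IV can be sketched as follows. This is a simplified illustration, not the authors' implementation: the characters map below is a small assumed subset covering only the Fig. 3 example, and the null-alignment sets are illustrative.

```python
# Characters map: (source substring, target substring) pairs considered
# equivalent. Illustrative Arabic -> English entries only (assumed subset).
CHAR_MAP = {
    ("ر", "r"), ("ت", "t"), ("ش", "sh"), ("تش", "ch"),
    ("ا", "a"), ("د", "d"),
}
SOURCE_NULL = set()                       # source chars alignable with null
TARGET_NULL = {"a", "e", "i", "o", "u"}   # short vowels, often unwritten in Arabic

def clld(source: str, target: str) -> int:
    """Modified Levenshtein distance with null alignments and
    many-to-many substring mappings (simplified sketch)."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        # Insertion cost is zero if the source char aligns with null.
        d[i][0] = d[i - 1][0] + (0 if source[i - 1] in SOURCE_NULL else 1)
    for j in range(1, n + 1):
        # Deletion cost is zero if the target char aligns with null.
        d[0][j] = d[0][j - 1] + (0 if target[j - 1] in TARGET_NULL else 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            ins = d[i - 1][j] + (0 if source[i - 1] in SOURCE_NULL else 1)
            dele = d[i][j - 1] + (0 if target[j - 1] in TARGET_NULL else 1)
            best = min(ins, dele, d[i - 1][j - 1] + 1)  # substitution = 1
            # Many-to-many: a mapped pair of substrings ending here is free;
            # the cost is taken from the cell before the match.
            for s_sub, t_sub in CHAR_MAP:
                if source[:i].endswith(s_sub) and target[:j].endswith(t_sub):
                    best = min(best, d[i - len(s_sub)][j - len(t_sub)])
            d[i][j] = best
    return d[m][n]

# The Fig. 3 example: distance zero with the mapping above.
print(clld("رتشارد", "richard"))  # 0
```

The linear scan over the map at every cell is naive; a real implementation would index mappings by their final characters.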
According to the experimental results, the proposed CLLD algorithm outperforms the other algorithms in terms of precision while maintaining high recall; only the Arabic Soundex shows a better recall. The overall f-measure shows that the proposed algorithm outperforms all other techniques.
By manually reviewing some of the results, we noticed that, given the structure of Arabic names, some names can be matched with other names that are not present in our gold standard, which inflates the number of false positives.
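The threshold-based classifier of Section IV-E, which produces the CLLD (0.95/0.90/0.85) variants compared above, can be sketched as follows; the helper names are hypothetical, and any distance function can be plugged in:

```python
def similarity(s1: str, s2: str, distance: int) -> float:
    """Normalize a distance to [0.0, 1.0]; 1.0 means identical
    (Sim = 1 - distance / max(|s1|, |s2|), as in Section IV-E)."""
    return 1.0 - distance / max(len(s1), len(s2))

def is_match(s1: str, s2: str, distance: int, threshold: float = 0.95) -> bool:
    """Classify a pair as a match when similarity reaches the threshold."""
    return similarity(s1, s2, distance) >= threshold

# "richard" vs. its transliteration "rtshard": plain Levenshtein distance 2.
print(round(similarity("richard", "rtshard", 2), 3))  # 0.714
print(is_match("richard", "rtshard", 2))              # False
print(is_match("richard", "rtshard", 0))              # True (CLLD distance)
```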
According to the results, the threshold selected for classifying the CLLD similarity score affects precision and recall; accordingly, the best threshold is 0.95.

B. Experiment 2: Proposed Hybrid Matching Technique
Based on the results of the first experiment, CLLD and Arabic Soundex were selected to create our hybrid matching solution, in order to evaluate the effect of applying the CLLD algorithm to names filtered by a phonetic algorithm.

Fig. 4. Experiment 2 Results

From the results shown in Fig. 4, the overall f-measure of the hybrid algorithm with the dissimilarity rules outperforms all other techniques, with high recall and precision as well. In addition, the time consumed is better than when using CLLD alone.

VI. CONCLUSION

In this paper, we have proposed a novel cross-language matching algorithm (CLLD) that is able to match names in different scripts with high precision and that outperforms well-known standard phonetic and approximate string similarity methods. In addition, we proposed the use of phonetic name matching and dissimilarity rules to speed up the matching process and improve the results. As future work, we will evaluate the system with other languages.

REFERENCES
[1] P. Christen, "A comparison of personal name matching: Techniques and practical issues," in Proc. Sixth IEEE Int. Conf. on Data Mining Workshops (ICDM Workshops 2006), 2006, pp. 290-294.
[2] B. Lisbach and V. Meyer, Linguistic Identity Matching. Springer Science & Business Media, 2013.
[3] A. Lait and B. Randell, "An assessment of name matching algorithms," Technical Report Series, University of Newcastle upon Tyne Computing Science, 1996.
[4] K. Shaalan and H. Raza, "Person name entity recognition for Arabic," in Proc. 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources, 2007, pp. 17-24.
[5] M. M. Kashani et al., "Automatic transliteration of proper nouns from Arabic to English," in Proc. Second Workshop on Computational Approaches to Arabic Script-based Languages, 2007.
[6] C. Snae, "A comparison and analysis of name matching algorithms," International Journal of Applied Science, Engineering and Technology, vol. 4, pp. 252-257, 2007.
[7] T. Peng et al., "A comparison of techniques for name matching," International Journal on Computing, vol. 2, 2012.
[8] R. Shah and D. Kumar Singh, "Analysis and comparative study on phonetic matching techniques," International Journal of Computer Applications, vol. 87, pp. 14-17, 2014.
[9] J. Zobel and P. Dart, "Phonetic string matching: Lessons from information retrieval," in Proc. 19th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1996, pp. 166-172.
[10] A. T. Freeman et al., "Cross linguistic name matching in English and Arabic: a one to many mapping extension of the Levenshtein edit distance algorithm," in Proc. Human Language Technology Conf. of the NAACL, 2006, pp. 471-478.
[11] A. H. Yousef, "Cross-language personal name mapping," International Journal of Computational Linguistics Research, vol. 4, 2014.
[12] A. H. Yousef, "Cross language duplicate record detection in big data," in Big Data in Complex Systems, vol. 9, 2015, pp. 147-171.
[13] J. Gong et al., "Matching person names through name transformation," in Proc. 18th ACM Conf. on Information and Knowledge Management, 2009, pp. 1875-1878.
[14] P. Treeratpituk and C. L. Giles, "Name-ethnicity classification and ethnicity-sensitive name matching," in AAAI, 2012.
[15] T. Parr, The Definitive ANTLR 4 Reference. Pragmatic Bookshelf, 2013.
[16] W. Cohen et al., "A comparison of string metrics for matching names and records," in KDD Workshop on Data Cleaning and Object Consolidation, 2003.