Cross-language Phonetic Similarity Measure on Terms Appeared in Asian Languages Ohnmar Htun, Shigeaki Kodama, Yoshiki Mikami International Journal of Intelligent Information Processing Volume 2, Number 2, June 2011

Cross-language Phonetic Similarity Measure on Terms Appeared in Asian Languages Ohnmar Htun, Shigeaki Kodama, Yoshiki Mikami Dept. of Management & Information Systems Science Nagaoka University of Technology, Japan [email protected], [email protected], [email protected] doi : 10.4256/ijiip.vol2.issue2.2

Abstract This study aims to develop a phonetic similarity measurement method across Asian languages. The method, the cross-language similarity (CLS) algorithm, combines language-specific Romanization, transcription into the International Phonetic Alphabet, the Soundex algorithm, and the Levenshtein distance. To evaluate the proposed algorithm, this study involves an experiment using ninety-two chemical element names in nine different languages. Name-similarity scores were calculated between a source language and each target language. For each language, a threshold could be drawn that divides the similarity scores into two groups (phonetic and semantic adoption). Evaluation by precision, recall, and F-measure shows that the proposed methodology successfully differentiates the phonetic and semantic groups via these thresholds in all of the Asian languages examined, with the exception of Chinese. The results reported here suggest that the proposed method can be applied to cross-language information retrieval and a variety of linguistic studies.

Keywords: Phonetic similarity, Normalization, Cross-language information retrieval, Romanization, International phonetic alphabet, Soundex, Levenshtein distance

1. Introduction

In recent years, large numbers of scientific and technical documents have emerged in many languages on the Internet, and search engines with multi-language support play a critical role on the World Wide Web. However, the information retrieval techniques of many search engines (Google being an exception) still lack support for cross-language information retrieval (CLIR) [1]. In general, CLIR requires not only bilingual and/or multilingual dictionaries, corpora, and thesauri but also the ability to identify the many proper nouns and terms that come from a variety of sources. Most terms in many domains appear as loanwords, or borrowed words, from other languages. A foreign word can be adopted into another language either by creating a new word from purely local vocabulary (semantic adoption) or by representing a pronunciation close to that of the original language (phonetic adoption) [2]. For example, "oxygen" in English translates into Japanese as "sanso" (酸素) by semantic adoption, whereas "silicon" translates into Japanese as "shirikon" (シリコン) by phonetic adoption. Many Asian languages represent loanwords with their phonograms, particularly for science terms, technology-related words, and proper nouns. Although the literature includes many techniques for calculating the similarity between words, much of this research is limited to European languages. The lack of studies on other languages (such as Asian languages) has restricted the performance of many CLIR and cross-language search applications. A reliable phonetic similarity measurement methodology would help CLIR and various kinds of linguistic research to be carried out more efficiently. In this article, we present a new cross-language algorithm that can effectively measure the phonetic similarity of words. We evaluated the algorithm on eight Asian languages, using English as the source language.
The algorithm can also measure the similarity between words belonging to the same language [3]. Based on the International Phonetic Alphabet (IPA) transcription table [4], some trivial distinctions are grouped under a single symbol (e.g., ʃ, ʂ, and ɕ) to simplify corresponding sounds across languages. The similarities between the language pairs were first measured by the Levenshtein distance algorithm.



In the second step, the calculated Levenshtein distance values were normalized. We compared these two results to evaluate our methodology. This paper is organized as follows. Section 2 reviews related work in the field. Section 3 describes our methodology in detail. Section 4 presents the experimental results. Section 5 evaluates our findings. Finally, Section 6 presents our conclusions and future work.

2. Related Work

Many algorithms have been developed to measure the similarity between words, both written and spoken; however, most of this research is based mainly on Soundex for phonetic matching and on the Levenshtein distance (also called edit distance) for string matching. Soundex, the best-known such algorithm, was developed by Russell and O'Dell (1918) as an early effort to assign a common phonetic code to similar-sounding words in the Latin alphabet [5] [6]. The algorithm converts each name into a four-character code based on six sound groups (plosive, fricative, affricate, glide, liquid, and nasal). It retains the first letter of the name and drops the vowels and vowel-like letters (i.e., a, e, i, o, u, w, h, y) from the rest of the word. If the resulting code is shorter than four characters, zeros are appended to complete the length; if it is longer than four, the extra characters are discarded. The Soundex phonetic codes are given in Table 1.

Table 1. Soundex phonetic codes

  Letters                  Assigned Code
  a, e, i, o, u, w, h, y   0
  b, f, p, v               1
  c, g, j, k, q, s, x, z   2
  d, t                     3
  l                        4
  m, n                     5
  r                        6
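The coding procedure described above can be sketched as follows (a minimal illustration of the variant described in this section; the full Russell Soundex additionally collapses adjacent letters that share a code, which is omitted here):

```python
# Letter-to-code table from Table 1.
SOUNDEX_CODES = {
    **dict.fromkeys("aeiouwhy", "0"),
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}

def soundex(name: str) -> str:
    """Four-character Soundex code: first letter kept, rest encoded."""
    name = name.lower()
    # Encode every letter after the first, then drop the vowel class (code 0).
    tail = [SOUNDEX_CODES[c] for c in name[1:] if c in SOUNDEX_CODES]
    tail = [c for c in tail if c != "0"]
    # Pad with zeros, or truncate, to a fixed length of four.
    return (name[0].upper() + "".join(tail) + "000")[:4]
```

For example, "robert" encodes to "R163": b→1, r→6, t→3, with the vowels discarded.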

The Soundex algorithm is not multilingual: it is language-dependent, being based in particular on English pronunciation. The code assignments and phonetic categories of other languages cannot be fitted directly into the original Soundex categories; therefore, Soundex has been adapted by other languages according to their specific characteristics. The Levenshtein distance, on the other hand, originated as an algorithm for a channel model, addressing the problem of constructing optimal codes capable of correcting deletions, insertions, and reversals [7]. The distance is the least number of edit operations necessary to transform one string into another, with the cost normally set to one unit per operation. However, the Levenshtein algorithm performs only edit operations between two strings; it does not directly provide a knowledge base for identifying phonetic similarity among languages with different phoneme inventories. Many studies have therefore assigned different costs to the operations in order to integrate such knowledge into the Levenshtein algorithm [8]. According to the literature, research on similarity measurement can be categorized into two main classes: multilingual studies, and bilingual or monolingual studies. Although much research exists in both categories, studies of non-European languages have been restricted mainly to the latter. Here, we review significant research on non-European languages based on Soundex and the Levenshtein distance. Researchers measuring phonetic similarity have presented an algorithm for Thai-English cross-language transliterated-word retrieval based on the Soundex algorithm and the Levenshtein distance [9]. It supports only Thai-English (bilingual) word retrieval, and their results scored 80% accuracy on recall and precision measurements; however, it could efficiently evaluate only words of more than four characters.
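For reference, the unit-cost edit distance described above can be computed with standard dynamic programming (a generic sketch, not the weighted variant introduced later in this paper):

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions (unit cost)."""
    prev = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (free on match)
        prev = curr
    return prev[-1]
```

The classic example "kitten" → "sitting" requires three operations, so the distance is 3.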
A similar study was carried out on the Myanmar language to match personal names [10]. It offered a sound-group mapping algorithm based on seven manners of articulation and measured phonetic similarity within the Myanmar language, achieving an F-measure of 95.88%.



Going beyond the concepts of Soundex and the Levenshtein distance, Rahul Bhagat et al. (2007) proposed a new method that combines a noisy-channel model (pronunciation generation) with the Soundex method to generate numerous candidate variant spellings of a name [11]. They used a list of about 89,000 last names and 5,500 first names from the US census as their test bed. Even though the results showed improvement, with a respectable precision rate of 68%, their algorithms could not work accurately on names derived from different languages. A similar attempt was made by Justin Zobel et al. (1996), who developed a new algorithm called Editex that combines the properties of edit distance with Soundex and Phonix [12]. Even though their algorithm calculated the measurements accurately, it worked only on monolingual data. Many other phonetic algorithms integrated with further techniques can be found, such as Fuzzy Soundex and Fusion [13], but these too are limited to mono- or bilingual studies. A quite different study was conducted by Freeman et al. (2006), who experimented with cross-language name matching between English and Arabic [14]. They used the Basis Artrans transliteration tool to transform Arabic letters into English, created equivalent sound classes (Character Equivalence Classes), and developed two new algorithms, called baseline and enhancements, based on SecondString and the Levenshtein distance. The enhancements algorithm builds on the character equivalence classes and on normalization of character strings. The results confirmed that the enhancements method is more effective than the baseline method. In summary, most research on non-European languages has been limited to monolingual and bilingual studies, mainly because the Soundex algorithm is monolingual and the Levenshtein distance does not directly use a knowledge base for its measurements.
This paper presents a multilingual algorithm similar to the enhancements approach of Freeman et al. [14], but one that uses the International Phonetic Alphabet and phonetic matching techniques with language-specific sound classes to measure phonetic similarity across many Asian languages.

3. Proposed Method

Cross-language similarity (CLS) is an algorithm we have developed based on phonetic matching and string-measurement techniques. The algorithm consists of five steps: Romanization, vowel deletion, simplification of similar sounds, calculation of the Levenshtein distance, and normalization. Figure 1 depicts the steps of the methodology.

Figure 1. Cross-language Similarity (CLS) Process



3.1. Romanization

Measuring phonetic similarity requires the character strings to share the same character set. Therefore, in the first step, terms in non-Latin scripts are converted into the Latin alphabet during pre-processing. Although various Romanization rules exist for each non-Latin-script language, we select one standard Romanization rule per language: the "99SHIKI" Romanization system for Japanese [15], the Revised Romanization of Korean for Korean [16], Pinyin Romanization of Mandarin Chinese (Pinyin)¹ for Chinese, the Myanmar Language Commission Transcription System (MLCTS) for Myanmar [17], and the Royal Thai General System of Transcription for Thai [18]. Table 2 provides some examples of character Romanization for each target language.

Table 2. Examples of Romanization characters in Japanese, Korean, Chinese, Myanmar, and Thai

  Japanese   カ ka    キ ki    ク ku    ケ ke     コ ko
  Korean     ㄱ g     ㅋ k     ㄴ n     ㄷ d      ㅌ t
  Chinese    合 ge    土 tu    竹 zhu   小 xiao   用 yong
  Myanmar    က k      ဂ g      င ng     ျ -y-     ူ ù
  Thai       k        s        d        t         b

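A table-driven Romanization step can be sketched as follows (a toy illustration using only the Table 2 entries; real Romanizers for these systems require full character inventories and context-dependent rules):

```python
# Hypothetical per-language lookup tables built from the Table 2 examples.
ROMANIZATION = {
    "ja": {"カ": "ka", "キ": "ki", "ク": "ku", "ケ": "ke", "コ": "ko"},
    "ko": {"ㄱ": "g", "ㅋ": "k", "ㄴ": "n", "ㄷ": "d", "ㅌ": "t"},
}

def romanize(text: str, lang: str) -> str:
    """Map each character through the language's table; unknowns pass through."""
    table = ROMANIZATION[lang]
    return "".join(table.get(ch, ch) for ch in text)
```

For example, the kana sequence カキク Romanizes character by character to "kakiku".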
3.2. Vowel Deletion

In this step, all vowels (i.e., a, e, i, o, u, y) are eliminated from both the source and the target languages. A vowel at the initial position of a word is not deleted (e.g., iodine → idn). This process follows the approach of Soundex.
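The vowel-deletion rule can be sketched as follows (assuming lowercase Romanized input):

```python
VOWELS = set("aeiouy")

def delete_vowels(word: str) -> str:
    """Drop vowels, keeping an initial vowel intact (e.g. "iodine" -> "idn")."""
    head, tail = word[0], word[1:]
    return head + "".join(c for c in tail if c not in VOWELS)
```

The initial letter survives even when it is a vowel, which is why "iodine" keeps its leading "i".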

3.3. Simplifying Similar Sounds

Although the various language scripts can be written in Latin script, the spelling does not always correspond directly to the pronunciation. Therefore, we need to transcribe spellings into phonetic notation [19]. Simplifying similar sounds is divided into two sub-steps: language-specific simplification and baseline simplification (Figure 2). We developed a sound-mapping class that corresponds to the native phonemes of each target language. Table 3 shows some examples of the language-dependent sound-mapping class for Myanmar [17], Thai [18], and Vietnamese [20].

Table 3. Part of the language-dependent sound-mapping class

  Language     Sound change
  Myanmar      /jh/ = "z", /kh/ = "k", /ch/ = "s"
  Thai         /ch/ = "k", /dg/ = "j", /qu/ = "k"
  Vietnamese   /x/ = "s", /ng/ = "n", /q/ = "k"

The next sub-step, baseline simplification, is language-independent and based on the IPA phonetic mapping table [4]. In principle, the IPA can represent the phonetic transcription of speech sounds for all languages, but our research does not require such fine distinctions. Thus, we grouped some different IPA symbols into one symbol. Depending on the place and manner of articulation, those symbols are grouped and

¹ Romanization of Mandarin Chinese (Pinyin): http://www.mandarintools.com/



each group is assigned a code. Table 4 specifies the relation between the phonetic features, phonetic symbols, baseline IPA symbols, and the assigned codes used in our algorithm.

Table 4. Baseline phonetic alphabet group mapping

  Articulatory place   Articulatory manner   Symbol              IPA                                      Assigned Code
  Labial               Plosive               p, b                p, b                                     1
  Labial               Fricative             f, P, v             f, (ɸ), v, (β)                           2
  Dental               Plosive               T, d                t, (ʈ), d, (ɖ)                           3
  Dental               Fricative             T, D, s, z, S, Z    θ, ð, s, z, ʃ, (ʂ, ɕ), ʒ, (ʐ, ʑ)         4
  Dental               Affricate             C, J, j             ʧ, ʦ (c), ʤ, ʣ (j)                       5
  Velar                Plosive, Fricative    K, c, q, x, g,      k, (q), g, (ʛ), χ, x,                    6
                                             X, h, H, Q          h (ɦ, ħ, ʕ), ɣ (ʝ)

  Other sounds         Symbol        IPA                                                   Assigned Code
  r-sounds             r, R          r; R is used when the language has another r-sound    7
  l-sounds             l, L          l; L is used for other laterals                       8
  Nasals               m, n, G, N    m, n (ɳ), ŋ; N is used for other nasals               9
  Approximants         y, w, Y, W    j, w (ɥ), (ʍ)                                         A
  Initial zero         $             Ø (ʔ)                                                 B
  consonant
For example, after vowel deletion, a term such as "irn" (iron) in English is converted first into the symbol string "$rn" and then into the assigned code "B79". Figure 2 depicts the steps of the sound-simplification process.

Figure 2. Steps of Simplifying Sound Process
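The two sub-steps can be sketched together as follows; the sound changes come from Table 3, while the symbol-to-code entries are a small assumed subset chosen to be consistent with the worked example ("$rn" → "B79") rather than the full Table 4:

```python
# Language-specific simplification: digraph substitutions from Table 3 (partial).
SOUND_CHANGES = {
    "my": [("jh", "z"), ("kh", "k"), ("ch", "s")],   # Myanmar
    "th": [("ch", "k"), ("dg", "j"), ("qu", "k")],   # Thai
    "vi": [("ng", "n"), ("x", "s"), ("q", "k")],     # Vietnamese
}

# Baseline simplification: assumed symbol-to-code entries in the spirit of
# Table 4 and the example "$rn" -> "B79"; the real table covers every group.
SYMBOL_CODES = {
    "p": "1", "b": "1",   # labial plosives
    "t": "3", "d": "3",   # dental plosives
    "s": "4", "z": "4",   # dental fricatives
    "k": "6", "g": "6",   # velar plosives
    "r": "7",             # r-sounds
    "l": "8",             # l-sounds
    "m": "9", "n": "9",   # nasals
    "$": "B",             # initial zero consonant
}

def simplify(word: str, lang: str) -> str:
    """Apply language-specific sound changes, then encode to baseline codes."""
    for pattern, replacement in SOUND_CHANGES.get(lang, []):
        word = word.replace(pattern, replacement)
    return "".join(SYMBOL_CODES.get(ch, ch) for ch in word)
```

With these entries, the English symbol string "$rn" encodes to "B79", matching the iron example above.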

3.4. Levenshtein Distance

The Levenshtein distance (LD) is calculated using the Levenshtein method [7], which is based on the edit operations of deletion, insertion, and substitution. LD is defined as:

LD = D + I + ω · S                                                        (1)


where
  D = the number of deletions,
  I = the number of insertions,
  S = the number of substitutions, and
  ω = a variable weight (referring to the group of sounds sharing a place and manner of articulation).

In this calculation, each operation counts as 1; however, we set a variable weight (ω) for substitutions. If the two sound symbols belong to the same place and manner of articulation, we set ω to 0.5; if they belong to different places or manners of articulation, we set ω to 1. All these weight values are taken into account in the calculation of the Levenshtein distance. In the resulting LD, a score of zero represents a perfect match between two words.
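A weighted edit distance of this kind can be sketched as follows (the sound classes shown are illustrative placeholders, not the full Table 4 grouping):

```python
# Illustrative sound classes; a within-class substitution costs 0.5,
# a cross-class substitution costs 1.0, matching the ω rule above.
SAME_CLASS = [set("pb"), set("td"), set("sz"), set("kg"), set("mn")]

def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    for cls in SAME_CLASS:
        if a in cls and b in cls:
            return 0.5
    return 1.0

def weighted_ld(s: str, t: str) -> float:
    """Levenshtein distance with weighted substitutions (equation 1)."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [float(i)]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + sub_cost(cs, ct)))   # weighted substitution
        prev = curr
    return prev[-1]
```

Substituting "p" for "b" (same class) contributes 0.5 to the distance, while an arbitrary substitution contributes 1.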

3.5. Normalization

Finally, the LD values are normalized. The normalized similarity [11] [12] [19], denoted here by NS, is defined as:

NS = 1 − LD / (L1 + L2)                                                   (2)

where L1 and L2 are the lengths of the strings after conversion by the sound-simplification process. The normalization is intended to eliminate the effect of string length: LD is divided by the combined length of both strings so as to minimize the weight of a mismatched character in longer strings [12] [19] [21]. An NS score of 1 (0 ≤ NS ≤ 1) represents a perfect match between two words.
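Equation (2) amounts to the following one-liner (variable names are ours):

```python
def normalized_similarity(ld: float, l1: int, l2: int) -> float:
    """NS = 1 - LD / (L1 + L2), where l1 and l2 are the simplified-string lengths."""
    return 1.0 - ld / (l1 + l2)

# A perfect match (LD = 0) yields NS = 1; larger distances push NS toward 0.
```

For two identical three-character code strings, LD = 0 and NS = 1.0; a distance of 1.5 over two four-character strings gives NS = 1 − 1.5/8 = 0.8125.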