
Chun-Jen Lee et al.

September, 2005

Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Models and Multiple Knowledge Sources

Chun-Jen LEE 1, 2, Jason S. CHANG 2, Jyh-Shing R. JANG 2

1 Telecommunication Labs., Chunghwa Telecom Co., Ltd., Chungli, Taiwan
2 Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan

[email protected]
{jschang, jang}@cs.nthu.edu.tw

Abstract

Named entity (NE) extraction is one of the fundamental tasks in natural language processing (NLP). Although many studies have focused on identifying NEs within monolingual documents, aligning NEs in bilingual documents has not been investigated extensively due to the complexity of the task. In this article, we introduce a new approach to aligning bilingual NEs in parallel corpora by incorporating statistical models with multiple knowledge sources. In our approach, we model the process of translating an English NE phrase into a Chinese equivalent using lexical translation/transliteration probabilities for word translation and alignment probabilities for word reordering. The method involves automatically learning phrase alignment and acquiring word translations from a bilingual phrase dictionary and parallel corpora, and automatically discovering transliteration transformations from a training set of name-transliteration pairs. The method also involves language-specific knowledge functions, including abbreviation handling, Chinese person name recognition, and acronym expansion. At run time, the proposed models are applied to each source NE in a pair of bilingual sentences to generate and evaluate the target NE candidates, and the source and target NEs are aligned based on the computed probabilities. Experimental results demonstrate that the proposed approach, which integrates statistical models with extra knowledge sources, is highly feasible and offers significant improvement in performance compared to our previous work as well as the traditional approach of IBM Model 4.


Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing – Machine translation; I.5.1 [Pattern Recognition]: Models – Statistical; I.5.4 [Pattern Recognition]: Application – Text processing

General Terms: Algorithms, Language, Performance

Additional Key Words and Phrases: Named entity alignment, Phrase translation, Transliteration, Parallel corpora, Machine translation

1 Introduction

NEs are essential components of texts, especially in the news genre. NE extraction and translation are vital in NLP for research on machine translation, cross-language information retrieval, bilingual lexicon construction, and so on. There are three types of NEs [Chinchor 1997]: entity names (organizations (ORG), persons (PER), and locations (LOC)), temporal expressions (dates and times), and number expressions (monetary values and percentages). Temporal expressions and number expressions, being more regular than entity names, can generally be described by grammar rules. On the other hand, PER, LOC, and ORG named entities are difficult to handle with a fixed set of rules, since new entity names are constantly being created. Thus, there is an increasing need to investigate techniques for NE extraction and translation. In this article, we focus on extracting bilingual pairs of entity names¹.

Transforming NEs between two languages involves both translation and transliteration. In general, it is difficult for human translators to translate or transliterate unfamiliar person names, place names, and organization names. Person names in particular are almost always transliterated. In this case, words can be decomposed into transliteration units (TUs) and transliterated (Section 3.4). For example, in the NE pair (Ada, 艾妲 “Ai Ta”), the alignments of the TU matching pairs are “a-ai,” “d-t,” and “a-a.” Transformation of location names and organization names between two languages is typically performed via a combination of translation and transliteration. For example, in the NE pairs (Little Smoky River, 小斯莫基河) and (Carnegie Mellon University, 卡內基麥隆大學), “Little,” “River,” and “University” are translated as “小,” “河,” and “大學,” respectively, and “Smoky,” “Carnegie,” and “Mellon” are transliterated as “斯莫基 (Ssu Mo Chi),” “卡內基 (Ka Nei Chi),” and “麥隆 (Mai Lung),” respectively. There are no fixed rules for deciding which words should be translated and which transliterated. For example, in contrast to (Little Smoky River, 小斯莫基河), “Smoky” in (Great Smoky Mountains, 大煙山脈) is translated as “煙,” not transliterated as “斯莫基.” Moreover, word order is not preserved when transforming organization names, as in the case of (University of California, 加州大學), where “大學” and “加州” are translations of “University” and “California,” respectively. From the above observations, it is advantageous to combine phrase translation and transliteration when extracting or translating NEs.

Extracting bilingual NE pairs is a crucial step in retrieving NE translation knowledge from bilingual documents. Currently, English NE identifiers are well established and are already being used in commercial products, such as BBN’s IdentiFinder [Bikel et al. 1999], whereas Chinese NE identifiers are still immature, partly due to the difficulty of Chinese word segmentation [Chen and Liu 1992; Chien et al. 1999]. To extract bilingual NEs from parallel corpora, we introduce here a new approach that first identifies the NEs in an English sentence and then extracts the corresponding Chinese equivalents from the aligned sentence by integrating a phrase translation model, a transliteration model, and extra language-specific sources [Lee et al. 2004a; 2004b].
At run time, for a given English NE identified in the source sentence, we extract NE pairs from aligned sentences as follows:
(1) transform the source NE into a set of possible translation candidates;
(2) find the set of candidates occurring in the target sentence to extract a set of possible NE equivalents;
(3) evaluate the set of possible NE equivalents;
(4) align the source and target NE pair with the highest probability.

¹ For simplicity, NEs referred to in the rest of this article are in fact entity names, unless otherwise noted.


A formal description of the proposed approach will be given in Section 3. The remainder of the article is organized as follows. We review related work in Section 2. Section 3 describes the proposed methods for acquiring bilingual NE pairs from parallel corpora. The experimental setup and a quantitative assessment of the achieved performance enhancement are presented in Section 4. Concluding remarks and future research directions are given in Section 5.

2 Related Work

Two major approaches to automatically harvesting bilingual translation pairs from corpora have received much attention recently. One approach is to mine translation pairs from the Web, whereas the other is to extract them from bilingual corpora. Several studies have been conducted on the web-based approach [Kraaij et al. 2003; Cheng et al. 2004; Lu et al. 2004; Zhang and Vines 2004]. Most of these studies focused on crawling large numbers of web pages to gather sufficient statistics; hence, compared with the approach of using existing parallel corpora, they require more disk space (for storing and indexing the crawled pages) and more time (due to bottlenecks in the Internet and the need to remove noisy pages from the crawled data). Our work, on the other hand, addresses the task of extracting bilingual pairs from bilingual corpora. Relevant studies on this topic are described next.

Although much work on NE identification within monolingual documents has been reported [Chen et al. 1998; Mikheev et al. 1998; Bikel et al. 1999; Borthwick 1999; Black and Vasilakopoulos 2002; Carreras et al. 2002; Wu et al. 2002; Zhou and Su 2002; Sun et al. 2003], little work has been reported on NE translation. Machine transliteration plays an important role in NE translation and is performed in order to convert a proper noun in the source NE into an approximate phonetic equivalent in the target language. In the past few years, machine transliteration has been studied by numerous researchers working with various language pairs, including English/Arabic [Stalls and Knight 1998], English/Chinese [Wan and Verspoor 1998; Lin and Chen 2002; Lee and Chang 2003; Lee et al. 2003], English/Japanese [Knight and Graehl 1998; Tsuji 2002], and English/Korean [Lee and Choi 1997; Oh and Choi 2002]. Unlike previous studies, we proposed a method [Lee and Chang 2003] for tackling this issue which requires neither a pronunciation dictionary for converting source words into phonetic symbols nor manually assigned phonetic similarity scores between bilingual name pairs. The proposed method is easier to port to other language pairs as long as some transliteration training data are available. Since NE transformation involves both transliteration and phrase translation, the goal of extracting NE pairs cannot be achieved via transliteration alone.

Previous work on mapping between bilingual NEs is closely related to the study presented here. As mentioned by Moore [2003], two different strategies, asymmetric and symmetric, can be applied to the mapping problem. The asymmetric strategy assumes that NEs in the source part are given and that the task is to identify the translation equivalents in the target part. The symmetric strategy, on the other hand, tries to find NEs in both languages and then establish the associations between NE pairs. The symmetric approach is more difficult to apply since it requires that NEs be identified in both languages. Moreover, the errors and inconsistencies induced by NE identification are subsequently propagated to NE alignment. Therefore, in this article, we adopt the asymmetric approach to extracting bilingual NE equivalents from parallel corpora. Studies based on the symmetric strategy include those of Chen et al. [2003], Huang et al. [2003], and Kumano et al. [2004], while studies based on the asymmetric strategy include those of Al-Onaizan and Knight [2002], Moore [2003], Lee et al. [2004a; 2004b], and Feng et al. [2004].

Research based on the symmetric strategy is briefly described in the following. Chen et al. [2003] investigated formulation and transformation rules for English-Chinese NEs. In that study, they used a frequency-based method to construct rules for identifying keywords of NEs from phrase-aligned corpora.
Their study focused on constructing transformation rules for NE mapping between languages. However, the performance of NE alignment was not well reported in that study. Huang et al. [2003] proposed a method for acquiring English-Chinese NE pairs from a parallel corpus. Their method is based on a linear combination of the transliteration cost, translation cost, and tagging cost. Kumano et al. [2004] proposed a method for acquiring English-Japanese NE pairs from content-aligned corpora. Their approach tries to find correspondences between bilingual NE groups based on the similarity of the appearance order in each document. However, the methods proposed by both Huang et al. [2003] and Kumano et al. [2004] require that NEs on both sides be identified beforehand, which is not suitable for our task, since we only identify NEs on the source side.

Research based on the asymmetric strategy is briefly described in the following. Al-Onaizan and Knight [2002] proposed an algorithm for translating NEs from Arabic to English using monolingual and bilingual resources. Given an Arabic NE, they use transliteration models (including a phonetic-based model and a spelling-based model), a bilingual dictionary, and an English news corpus to generate a list of English candidates. Then, the list is re-ranked using monolingual cues, such as web counts. However, the accuracy achieved in their experiment left much to be desired. Moore [2003] proposed an approach to choosing English-French NE pairs in parallel corpora based on a sequence of refined models. This approach depends heavily on linguistic information, such as the same NE phrase occurring in the source and target parts, and cues from capitalization. Thus, it is not suitable for language pairs from different language families, such as English/Chinese. In our previous work [Lee et al. 2004a], we proposed an approach that uses phrase translation and transliteration models to extract English-Chinese NE pairs from parallel corpora. The parameters of the proposed models are automatically estimated using the EM training algorithm in an unsupervised manner. This approach can be further improved by incorporating language-specific knowledge sources [Lee et al. 2004b], which will be explained in this article. Feng et al. [2004] proposed an approach to English-Chinese NE alignment in parallel corpora.
After recognizing English NEs, they use a maximum entropy model that integrates the translation score, transliteration score, co-occurrence score, and distortion score to extract corresponding Chinese equivalents from the aligned sentences. To train the maximum entropy model, their method adopts a supervised learning approach with a bootstrapping strategy.

In contrast to the previous research on bilingual NE processing, we present here a framework for aligning bilingual NEs in parallel corpora by incorporating the proposed statistical models, i.e., a phrase translation model and a transliteration model, along with multiple knowledge sources, including abbreviation handling, person name recognition, and acronym expansion. To reduce the errors and inconsistencies that could be propagated by NE identifiers in both the source and target languages, we only require that NEs be identified on the source side.

3 Bilingual NE Alignment

Bilingual NE alignment plays a vital role in extracting key information from bilingual corpora, which is essential in multilingual language processing. However, traditional statistical MT approaches, such as that of Brown et al. [1993], mainly focus on aligning words in parallel corpora. These approaches cannot achieve the goal of NE alignment due to their inability to handle many-to-many alignment within bilingual NE phrases. To align NE pairs, we adopt a many-to-many word alignment scheme based on statistical models and additional knowledge clues. Moreover, we avoid the error-prone process of Chinese word segmentation by using an asymmetric approach to aligning English-Chinese NE pairs.

3.1 Problem Statement

We aim to return a list of English-Chinese NE pairs automatically extracted from parallel corpora. As mentioned previously, without relying on NE identification in the Chinese part, we focus on aligning an English NE (which has been automatically or manually labeled) with its Chinese equivalent in the aligned sentence. A formal statement of the problem is given below.

Problem Statement: Given an English-Chinese parallel corpus and a set of English (i.e., source) NE phrases {e} labeled in the corpus, our goal is to extract bilingual NE pairs {(e, f)} from the corpus, where f is the translation equivalent of e. For this purpose, each e is transformed into a set of translation candidates {f*} via our proposed models, such that {f*} is likely to contain the translation equivalent f of e.

Our solution to the above problem is described in the rest of this section. An outline of our approach to aligning NE pairs in parallel corpora is given in Section 3.2. In contrast to previous studies, the proposed method integrates a statistical phrase translation model (SPTM) (Section 3.3), a transliteration model (TM) (Section 3.4), and other language-specific modules in a unified way. The language-specific modules include abbreviation handling (AH) (Section 3.5), Chinese person name recognition (CPNR) (Section 3.6), and acronym expansion (AE) (Section 3.7). The framework, which incorporates SPTM, TM, AH, CPNR, and AE to extract NE pairs from parallel corpora, is presented in Section 3.8.

3.2 Outline of the Proposed Approach

We attempt to align source NEs with their translation equivalents in parallel corpora. The following steps are performed in the NE alignment process:
(1) Employ a sentence alignment procedure and a source NE identifier to align parallel texts at the sentence level and to label NEs in each source sentence, respectively.
(2) Utilize phrase translation and transliteration models to generate a set of translation candidates that appear in the target sentence for each source NE e.
(3) Sort the translation scores associated with the set of translation candidates in descending order. Choose the top-1 candidate f with the highest score as the target NE.
(4) Denote {(e, f)} as NE pairs.

As mentioned previously, NE translation involves word translation and word reordering. Thus, to translate a source NE e into its target NE f, we propose using a phrase translation model to approximate the translation score function, $\mathrm{Score}(f \mid e)$, by decomposing it into a lexical translation score function, $\mathrm{Score}_{LEX}(f \mid e)$, and a position alignment score function, $\mathrm{Score}_{ALI}(f \mid e)$, as shown in Eq. (1):

$$\mathrm{Score}(f \mid e) = \mathrm{Score}_{LEX}(f \mid e) + \mathrm{Score}_{ALI}(f \mid e). \tag{1}$$
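As an illustration only, steps (2) to (4) of the outline combined with Eq. (1) can be sketched in Python as follows. The `Candidate` type, the function names, and all score values are hypothetical; the component scores are assumed to be log-probabilities that have already been computed by the models described in the following sections.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    text: str         # candidate target NE f
    score_lex: float  # Score_LEX(f | e), assumed to be a log-probability
    score_ali: float  # Score_ALI(f | e), assumed to be a log-probability


def total_score(c: Candidate) -> float:
    # Eq. (1): the overall score is the sum of the two component scores.
    return c.score_lex + c.score_ali


def align_ne(candidates: List[Candidate], target_sentence: str) -> Optional[Candidate]:
    # Step (2): keep only candidates that actually occur in the target sentence.
    present = [c for c in candidates if c.text in target_sentence]
    if not present:
        return None
    # Step (3): choose the top-1 candidate by score.
    return max(present, key=total_score)


# Hypothetical candidates for the source NE "University of California".
candidates = [
    Candidate("加州大學", -1.2, -0.4),
    Candidate("大學加州", -1.2, -2.5),
    Candidate("加利福尼亞大學", -2.0, -0.4),
]
best = align_ne(candidates, "他畢業於加州大學。")
```

Filtering by occurrence in the target sentence before ranking mirrors the asymmetric strategy: no Chinese NE identifier or word segmenter is needed, only substring lookup in the aligned sentence.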

Translating NEs also involves transliteration, especially in the case of person names. We therefore adopt a transliteration model for proper nouns. In the following sections, we formally explain how the phrase translation and transliteration models and the language-specific knowledge sources can be integrated in a unified way to align NE pairs in parallel corpora.

3.3 Statistical Phrase Translation Model (SPTM)

In the noisy channel approach to machine translation proposed by Brown et al. [1993], a source sentence e is fed into a noisy channel and translated into a target sentence f. Following Brown et al. [1993], we model the probability of translating an English phrase e with l words into a Mandarin Chinese phrase f with m words by decomposing the channel function into two independent probabilistic functions: (a) a lexical translation probability (LTP) function, $P(f_{a_i} \mid e_i)$, where $e_i$ is the i-th word in e and $e_i$ is aligned with $f_{a_i}$ in f under the alignment a, and (b) a position alignment probability (PAP) function, $P(a \mid l, m)$.

Based on the above model, finding the best translation f* for a given e is expressed as follows:

$$f^* = \arg\max_{f} P(f \mid e) = \arg\max_{f} \sum_{a} P(f, a \mid e). \tag{2}$$

For simplicity, the best alignment with the highest probability is chosen to decide the most probable translation f*, instead of summing over all possible alignments a. Eq. (2) can thus be expressed as

$$f^* = \arg\max_{f} \max_{a} P(f, a \mid e) = \arg\max_{f} \max_{a} P(a \mid l, m) \times \prod_{i=1}^{l} P(f_{a_i} \mid e_i). \tag{3}$$

In the original formulation, Brown et al. decomposed the alignment probability for a sentence into the product of the alignment probabilities of the i-th word for i = 1 to l. Since l is usually quite small for phrase-to-phrase translation, it might be better to compute the phrase alignment probability as a whole instead of as the product of individual word alignment probabilities, $P(a_i \mid i, l, m)$. Therefore, we have

$$P(a \mid l, m) \equiv P(a_1, a_2, \ldots, a_l \mid l, m).$$

For example, consider the case where the source phrase e = “Ichthyosis Concern Association” and its translation equivalent f = “關懷魚鱗癬協會.” Reasonable word segmentations for f, in this case, are “關懷,” “魚鱗癬,” and “協會.” The correct alignment is ($a_1 = 2$, $a_2 = 1$, $a_3 = 3$). Thus, the phrase translation probability is represented as

$$P(\text{關懷 魚鱗癬 協會} \mid \text{Ichthyosis Concern Association}) \approx P(\text{魚鱗癬} \mid \text{Ichthyosis}) \times P(\text{關懷} \mid \text{Concern}) \times P(\text{協會} \mid \text{Association}) \times P(2, 1, 3 \mid 3, 3).$$

Based on the modified formulation for alignment probability, Eq. (3) can be written as

$$f^* = \arg\max_{f} \max_{a} P(f, a \mid e) = \arg\max_{f} \max_{a} P(a_1 a_2 \ldots a_l \mid l, m) \times \prod_{i=1}^{l} P(f_{a_i} \mid e_i).$$

The integrated score function for the target phrase f, given e, is defined as follows by regarding the score function as a log probability function:

$$\mathrm{Score}(f \mid e) \equiv \log\Big(P(a \mid l, m) \prod_{i=1}^{l} P(f_{a_i} \mid e_i)\Big) = \mathrm{Score}_{ali}(a \mid l, m) + \sum_{i=1}^{l} \mathrm{Score}_{lex}(f_{a_i} \mid e_i).$$

Accordingly, $\mathrm{Score}_{LEX}(f \mid e)$ and $\mathrm{Score}_{ALI}(f \mid e)$ in Eq. (1) can be defined as follows:

$$\mathrm{Score}_{LEX}(f \mid e) \equiv \sum_{i=1}^{l} \mathrm{Score}_{lex}(f_{a_i} \mid e_i),$$

$$\mathrm{Score}_{ALI}(f \mid e) \equiv \mathrm{Score}_{ali}(a \mid l, m).$$
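The log-probability score and the maximization over alignments can be illustrated with a small Python sketch. The probability tables `LTP` and `PAP` below contain hypothetical values, not trained parameters, and the search is restricted to one-to-one alignments with l = m, which is a simplifying assumption for this example only.

```python
import math
from itertools import permutations

# Hypothetical lexical translation probabilities P(f_word | e_word).
LTP = {
    ("Ichthyosis", "魚鱗癬"): 0.9,
    ("Concern", "關懷"): 0.5,
    ("Association", "協會"): 0.8,
}
# Hypothetical whole-phrase alignment probabilities P(a1, a2, a3 | l=3, m=3).
PAP = {(1, 2, 3): 0.6, (2, 1, 3): 0.3, (1, 3, 2): 0.1}


def score(e_words, f_words, alignment):
    """Score(f | e) = Score_ali(a | l, m) + sum_i Score_lex(f_{a_i} | e_i)."""
    p_ali = PAP.get(tuple(alignment), 0.0)
    if p_ali == 0.0:
        return -math.inf
    total = math.log(p_ali)  # Score_ALI
    for e_i, a_i in zip(e_words, alignment):
        p_lex = LTP.get((e_i, f_words[a_i - 1]), 0.0)  # a_i is 1-based
        if p_lex == 0.0:
            return -math.inf
        total += math.log(p_lex)  # one term of Score_LEX
    return total


def best_alignment(e_words, f_words):
    # Brute-force max over alignments; feasible because l is small for phrases.
    l = len(e_words)
    return max(permutations(range(1, l + 1)),
               key=lambda a: score(e_words, f_words, a))


e = ["Ichthyosis", "Concern", "Association"]
f = ["關懷", "魚鱗癬", "協會"]
a_star = best_alignment(e, f)
s_star = score(e, f, a_star)
```

With these illustrative tables, the search recovers the alignment (2, 1, 3) discussed in the text, because every other permutation pairs at least one English word with a Chinese word that has no lexicon entry.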

Although we focus here on the alignment of NE pairs in parallel corpora, words in NEs are not always translated literally. For instance, there is one insertion and one deletion operation in the transformation of the NE pair (East Coast National Scenic Area, 東海岸風景特區), as shown in Figure 1. To tackle the insertion and the deletion issues that are frequently encountered in English-Chinese NE phrase translation, we are motivated to apply the method proposed by Damerau [1964] to approximate ScoreLEX ( f | e) . We assume that l and m are the numbers of words in e and f, respectively. Then, ScoreLEX ( f | e) between e and f can be calculated using the recurrence relation as shown below.


Step 1 (Initialization):

$$g(0, 0) = 0.$$

Step 2 (Recursion):

$$g(i, j) = \max \left\{ \begin{array}{l} g(i-1, j) - c_{\gamma_1}, \\ g(i, j-1) - c_{\gamma_2}, \\ g(i-1, j-1) + \mathrm{Score}_{lex}(f_{a_i} \mid e_i) \end{array} \right\}, \quad 0 \le i \le l, \; 0 \le j \le m.$$

Step 3 (Termination):

$$\mathrm{Score}_{LEX}(f \mid e) = g(l, m),$$

where $c_{\gamma_1}$ and $c_{\gamma_2}$ are penalty score values for an insertion operation and a deletion operation at the word level.
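The three steps above can be sketched as a small dynamic program in Python. This is a minimal illustration under stated assumptions: the target words are taken to be already arranged in source order by the alignment, so that word i of e is compared with word j of f; the lexicon, the floor score of -10, the penalty values, and the segmentation of the Chinese phrase are all hypothetical.

```python
import math


def score_lex_phrase(e_words, f_words, score_lex, c_gamma1=2.0, c_gamma2=2.0):
    """Compute g(l, m) with the recurrence of Steps 1 to 3."""
    l, m = len(e_words), len(f_words)
    g = [[-math.inf] * (m + 1) for _ in range(l + 1)]
    g[0][0] = 0.0  # Step 1 (Initialization)
    for i in range(l + 1):  # Step 2 (Recursion)
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = -math.inf
            if i > 0:  # advance in e only (source word left unmatched)
                best = max(best, g[i - 1][j] - c_gamma1)
            if j > 0:  # advance in f only (target word left unmatched)
                best = max(best, g[i][j - 1] - c_gamma2)
            if i > 0 and j > 0:  # match e_i with the target word at position j
                best = max(best, g[i - 1][j - 1]
                           + score_lex(e_words[i - 1], f_words[j - 1]))
            g[i][j] = best
    return g[l][m]  # Step 3 (Termination)


# Hypothetical word-level log scores; unknown pairs get a low floor score.
LEXICON = {("East", "東"): -0.1, ("Coast", "海岸"): -0.2,
           ("National", "國家"): -0.3, ("Scenic", "風景"): -0.2,
           ("Area", "區"): -0.4}


def lookup(e_word, f_word):
    return LEXICON.get((e_word, f_word), -10.0)


e = ["East", "Coast", "National", "Scenic", "Area"]
f = ["東", "海岸", "風景", "特定", "區"]  # illustrative segmentation
result = score_lex_phrase(e, f, lookup)
```

In this example the best path matches four word pairs and pays one penalty for the unmatched “National” and one for the unmatched “特定”, which is cheaper than forcing either word into a floor-score match.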



[Figure 1: the English NE “East Coast National Scenic Area” is shown passing through three stages, Translation, Position Alignment, and Insertion & Deletion, to produce the Chinese NE.]

Figure 1. Transformation of an NE pair with insertion and deletion operations

To estimate the parameters LTP and PAP of the SPTM model, an EM algorithm for maximizing the likelihood of generating the target f given e is adopted. We use a bilingual dictionary and parallel corpora as the training data for learning SPTM. First, we make some initial estimates, such as by using a uniform segmentation scheme. Then, we perform an iterative learning process to find the optimal phrase alignment under the model and re-estimate the parameters according to the optimal alignment found. Further details about the process of estimating parameters can be found in [Chang et al. 2001]. In addition, a method for automatically exploiting domain-specific bilingual lexicons from relevant bilingual corpora has also been developed [Wu and Chang 2004].

For a source word $e_i$ in a given source NE e, the probability $P(f_{a_i} \mid e_i)$ for a translation candidate $f_{a_i}$ is estimated using a weighted average strategy as follows:

$$P(f_{a_i} \mid e_i) = \lambda_1 P_{gen}(f_{a_i} \mid e_i) + \lambda_2 P_{ne}(f_{a_i} \mid e_i) + \lambda_3 P_{cor}(f_{a_i} \mid e_i), \quad \lambda_1 + \lambda_2 + \lambda_3 = 1, \tag{4}$$

where $P_{gen}(f_{a_i} \mid e_i)$, $P_{ne}(f_{a_i} \mid e_i)$, and $P_{cor}(f_{a_i} \mid e_i)$ are estimated from a general bilingual dictionary, an NE-pair list, and a domain-relevant corpus, respectively, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting factors to be determined empirically. Similarly, the probability $P(a \mid l, m)$ is also estimated using a weighted average strategy:

$$P(a \mid l, m) = \lambda_4 P_{gen}(a \mid l, m) + \lambda_5 P_{ne}(a \mid l, m) + \lambda_6 P_{cor}(a \mid l, m), \quad \lambda_4 + \lambda_5 + \lambda_6 = 1,$$

where $P_{gen}(a \mid l, m)$, $P_{ne}(a \mid l, m)$, and $P_{cor}(a \mid l, m)$ are estimated from a general bilingual dictionary, an NE-pair list, and a domain-relevant corpus, respectively, and $\lambda_4$, $\lambda_5$, and $\lambda_6$ are weighting factors to be determined empirically.

3.4 Transliteration Model (TM)

Each of the nouns in the phrase being translated may be a common noun or a proper noun. Proper nouns, especially person names, are typically transliterated into phonetic equivalents. For common nouns, we rely on Eq. (4) to estimate LTP. For proper nouns, we apply machine transliteration to estimate LTP. Since Chinese and English are disparate languages and no simple rules are available for direct mapping between them based on sounds, one possible solution is to adopt a Chinese romanization system to represent the pronunciation of each Chinese character and then find the mapping rules between them. In the following discussion, E and F are assumed to be an English word and a romanized Chinese character sequence, respectively.

One can consider machine transliteration as a noisy channel. The language model, P(E), generates a source proper name, E, and the transliteration model, P(F|E), converts E into a target transliteration, F. P(E) describes the probability associated with E, whereas P(F|E) estimates the probability of F, conditioned on E. P(F|E) can be approximated by decomposing E and F into transliteration units (TUs). A TU is defined as a sequence of characters transliterated as a group [Lee and Chang 2003; Lee et al. 2005]. For instance, Figure 2 shows the TU alignment of the word pair (Smith, 史密斯 “Shih Mi Ssu”). A formal statement of this approximation scheme is given below.

[Figure 2: the TUs “S,” “m,” “i,” and “th” of “Smith” are aligned with “Shih,” “m,” “i,” and “ssu,” respectively.]

Figure 2. TU alignment of (Smith, 史密斯 “Shih Mi Ssu”)

A word E with l characters and a romanized word F with m characters are denoted by $E_1 E_2 \ldots E_l$ and $F_1 F_2 \ldots F_m$, respectively. We can represent the mapping of (E, F) as a sequence of n matched TUs, $\{(u_1, v_1), (u_2, v_2), \ldots, (u_n, v_n)\}$:

$$\left\{ \begin{array}{l} E = E_1 E_2 \ldots E_l = u_1 u_2 \ldots u_n, \\ F = F_1 F_2 \ldots F_m = v_1 v_2 \ldots v_n. \end{array} \right.$$

Hence, the alignment a between E and F can be represented as a match type sequence $(m_1 m_2 \ldots m_n)$, where $m_i$ denotes the pair of lengths of $u_i$ and $v_i$. Therefore, P(F|E) is expressed as follows:


$$P(F \mid E) = \sum_{a} P(F, a \mid E). \tag{5}$$

According to the above definitions and independence assumptions, Eq. (5) can be further derived as follows:

$$P(F \mid E) = \sum_{a} P(v_1 v_2 \ldots v_n \mid u_1 u_2 \ldots u_n) P(m_1 m_2 \ldots m_n) = \sum_{a} \prod_{i=1}^{n} P(v_i \mid u_i) P(m_i).$$

Then, the process of finding the most probable transliteration F*, for a given E, can be approximated as

$$F^* = \arg\max_{F} \max_{a} P(F, a \mid E) = \arg\max_{F} \max_{a} \prod_{i=1}^{n} P(v_i \mid u_i) P(m_i).$$

By regarding the score function as a log probability function, we can formulate the transliteration score function for F, given E, as

$$\mathrm{Score}_{tm}(F \mid E) = \max_{a} \log\Big(\prod_{i=1}^{n} P(v_i \mid u_i) P(m_i)\Big).$$

Thus, the transliteration score can be used to estimate LTP for proper nouns, which are transliterated during the process of NE translation. Suppose that there is an entry $(e_i, w_f)$ with probability p (or score $c_p = \log p$) in the derived bilingual lexicon based on Eq. (4). $\mathrm{Score}_{lex}(f_{a_i} \mid e_i)$ can be formulated as

$$\mathrm{Score}_{lex}(f_{a_i} \mid e_i) = \begin{cases} c_p, & \text{if } f_{a_i} = w_f, \\ \mathrm{Score}_{tm}(R(f_{a_i}) \mid e_i), & \text{if } f_{a_i} \ne w_f \text{ and } \mathrm{Score}_{tm}(R(f_{a_i}) \mid e_i) \ge Thr_1, \\ c_{\gamma_3}, & \text{otherwise}, \end{cases} \tag{6}$$

where $R(f_{a_i})$ is the romanization of $f_{a_i}$, and $c_{\gamma_3}$ and $Thr_1$ denote a floor score value and a threshold, respectively.
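The maximization over TU alignments in the transliteration score lends itself to dynamic programming over character positions. The Python sketch below is one possible way to compute it, not the training or decoding procedure of [Lee et al. 2005]. The tables `TU_PROB` and `MATCH_TYPE_PROB`, the length limits, and the requirement that every TU be non-empty on both sides are assumptions made for this example.

```python
import math

# Hypothetical TU translation probabilities P(v | u).
TU_PROB = {("s", "shih"): 0.4, ("m", "m"): 0.9,
           ("i", "i"): 0.8, ("th", "ssu"): 0.3}
# Hypothetical match type probabilities P(m_i), keyed by (len(u_i), len(v_i)).
MATCH_TYPE_PROB = {(1, 4): 0.05, (1, 1): 0.5, (2, 3): 0.1}


def score_tm(E, F, max_u=2, max_v=4):
    """Score_tm(F | E) = max over TU alignments of sum_i log(P(v_i|u_i) P(m_i))."""
    E, F = E.lower(), F.lower()
    l, m = len(E), len(F)
    # best[i][j]: best log score for aligning E[:i] with F[:j].
    best = [[-math.inf] * (m + 1) for _ in range(l + 1)]
    best[0][0] = 0.0
    for i in range(1, l + 1):
        for j in range(1, m + 1):
            for du in range(1, min(max_u, i) + 1):
                for dv in range(1, min(max_v, j) + 1):
                    prev = best[i - du][j - dv]
                    if prev == -math.inf:
                        continue
                    u, v = E[i - du:i], F[j - dv:j]  # last TU pair (u, v)
                    p_tu = TU_PROB.get((u, v), 0.0)
                    p_m = MATCH_TYPE_PROB.get((du, dv), 0.0)
                    if p_tu > 0.0 and p_m > 0.0:
                        cand = prev + math.log(p_tu) + math.log(p_m)
                        best[i][j] = max(best[i][j], cand)
    return best[l][m]


s = score_tm("Smith", "shihmissu")
```

With these illustrative tables, the only segmentation with nonzero probability is the one shown in Figure 2 (S-shih, m-m, i-i, th-ssu); a pair with no valid segmentation receives a score of negative infinity, which would fall below any threshold $Thr_1$ in Eq. (6).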


To estimate the parameters $P(v_i \mid u_i)$ and $P(m_i)$ of the TM model, an English-Chinese name list and several Chinese romanization schemes, such as Wade-Giles and Hanyu Pinyin, are used to train the TM. For more details about the learning process, please refer to [Lee et al. 2005].

3.5 Abbreviation Handling (AH)

In practice, the transformation of NEs from English to Chinese is more complicated than the transformation process described above. An English NE phrase often has several equally acceptable Chinese NE candidates. For example, the NE “International Commercial Bank of China” can be translated as “中國國際商業銀行,” “中國商業銀行,” or “中國商銀.” We can simply measure the similarity between two Chinese NE candidates when estimating the phrase translation probability. For example, a high probabilistic value for P(“中國商業銀行” | “International Commercial Bank of China”) and a high similarity measure between “中國商業銀行” and “中國商銀” imply that we should also give a high probabilistic value for P(“中國商銀” | “International Commercial Bank of China”). Therefore, we can enhance the lexical score function by using approximate string matching [Damerau 1964]. As a result, Eq. (6) can be modified as follows:

$$\mathrm{Score}_{lex}(f_{a_i} \mid e_i) = \begin{cases} c_p, & \text{if } f_{a_i} = w_f, \\ c_p \times \left(1 + \dfrac{J - I}{J}\right) - c_{\gamma_4}, & \text{if } 0 < I < J, \\ \mathrm{Score}_{tm}(R(f_{a_i}) \mid e_i), & \text{if } I = 0 \text{ and } \mathrm{Score}_{tm}(R(f_{a_i}) \mid e_i) \ge Thr_1, \\ c_{\gamma_3}, & \text{otherwise}, \end{cases} \tag{7}$$

where J is the number of Chinese characters in $w_f$, I is the number of matched Chinese characters between $f_{a_i}$ and $w_f$, and $c_{\gamma_4}$ denotes a floor score value.
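A minimal Python sketch of the piecewise function in Eq. (7) is given below. Two points are assumptions of this example rather than statements about the method: the number of matched characters I is computed as the length of the longest common subsequence, and the values chosen for $Thr_1$, $c_{\gamma_3}$, and $c_{\gamma_4}$ are purely illustrative.

```python
import math


def lcs_len(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for k, bh in enumerate(b, 1):
            cur.append(prev[k - 1] + 1 if ch == bh else max(prev[k], cur[k - 1]))
        prev = cur
    return prev[-1]


def score_lex(f_cand, w_f, c_p, score_tm_value,
              thr1=-8.0, c_gamma3=-10.0, c_gamma4=0.5):
    """Eq. (7) with hypothetical threshold and floor values."""
    J = len(w_f)              # characters in the lexicon entry w_f
    I = lcs_len(f_cand, w_f)  # matched characters (assumed: LCS length)
    if f_cand == w_f:
        return c_p            # exact lexicon match
    if 0 < I < J:
        # Approximate match: the log score is scaled by the unmatched fraction.
        return c_p * (1 + (J - I) / J) - c_gamma4
    if I == 0 and score_tm_value >= thr1:
        return score_tm_value  # fall back to the transliteration score
    return c_gamma3            # floor score


c_p = math.log(0.8)  # hypothetical lexicon score for (Bank, 銀行)
exact = score_lex("銀行", "銀行", c_p, -math.inf)
approx = score_lex("銀", "銀行", c_p, -math.inf)
```

Because $c_p$ is a log-probability and therefore non-positive, multiplying it by a factor greater than one lowers the score, so a partial match such as “銀” for “銀行” always ranks below the exact match.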

Figure 3 shows the transformation of the NE pair (International Commercial Bank of China, 中國商銀) with one deletion and two approximate matching operations.


[Figure 3: the English NE “International Commercial Bank of China” is shown passing through three stages, Translation (國際, 商業, 銀行, 中國), Position Alignment (中國, 國際, 商業, 銀行), and Deletion & Approximate Matching (中國, 商, 銀), to produce the Chinese NE.]

Figure 3. Transformation of an NE pair with one deletion and two approximate matching operations

3.6 Chinese Person Name Recognition (CPNR) In some cases, the association between the members of a bilingual NE pair is hard to obtain through only phrase translation, transliteration, and abbreviation handling. More language-dependent features, such as CPNR, can be introduced to improve the performance. The following example is taken from the magazine Sinorama [Sinorama 2002]: (S1) English sentence: “…But Sinorama boldly broke through, under the direction of then publisher King-yuh Chang, and under the planning of editors Wang Chi and Gypsy Chang, produced the "Greater China" special report. This expressed our concern about mainland China by reporting on the evaporation of Tungting Lake and the desertification of the Huangtu Plateau.” Chinese sentence: “…,但「光華」勇於突破,在當時發行人張京育指示、總編輯汪琪與 編輯張靜茹策劃下,推出「大地中國」專題,以報導洞庭湖日益淤淺、黃土高原沙漠化 的現況,來表達我們對大陸的關懷。” In (S1), there are seven labeled bilingual pairs of entity names. Among them, (King-yuh Chang, 張京育), (Wang Chi, 汪琪), and (Gypsy Chang, 張靜茹) are person names. In this example, (King-yuh Chang, 張

16

Chun-Jen Lee et al.

September, 2005

京育) and (Wang Chi, 汪琪) are well aligned via the proposed method mentioned above. However, in (Gypsy Chang, 張靜茹), “Gypsy” is an English name which does not have any direct relationship with “靜茹,” a traditional Chinese name. To deal with the mapping between a foreign name and a Chinese name in parallel corpora, we apply CPNR to extract the Chinese part of the PER-type NE. Chinese person names consist of surnames and given names. In most cases, surnames and given names are composed of one or two characters. Our CPNR model is automatically trained from a large person name corpus consisting of one million entries. We use Chinese surnames as anchor points and then determine whether the following one or two characters form a Chinese given name. Suppose that $c_1c_2$ are two consecutive Chinese characters. The decision function $d(c_1c_2)$ for the two-character given name is defined as follows:

\[
d(c_1c_2) =
\begin{cases}
\text{true}, & \text{if } P(c_1c_2 \mid GN_2^{12}) > \mathrm{Thr}_2 \ \text{ or } \ P(c_1 \mid GN_2^{1}) \times P(c_2 \mid GN_2^{2}) > \mathrm{Thr}_3, \\
\text{false}, & \text{otherwise},
\end{cases}
\]

where $GN_2^{12}$, $GN_2^{1}$, and $GN_2^{2}$ stand for the two-character given name, the first character of the two-character given name, and the second character of the two-character given name, respectively, and $\mathrm{Thr}_2$ and $\mathrm{Thr}_3$ are constants. The decision function $d(c_1)$ for a single-character given name is defined as follows:

\[
d(c_1) =
\begin{cases}
\text{true}, & \text{if } P(c_1 \mid GN_1) > \mathrm{Thr}_4, \\
\text{false}, & \text{otherwise},
\end{cases}
\]

where $GN_1$ is the one-character given name and $\mathrm{Thr}_4$ is a constant. The threshold values $\mathrm{Thr}_2$, $\mathrm{Thr}_3$, and $\mathrm{Thr}_4$ were empirically determined so as to let 95% of the training set pass the verification test. Since bilingual sentences are well aligned and surnames are used as anchor points, this approach works quite well for aligning foreign names with their corresponding Chinese names.
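The two decision functions and the surname anchoring can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the probability tables, the surname set, the threshold values, and all function names are made up for the example, whereas a real model would be estimated from a large person name corpus. The two-character test is tried before the one-character test, which is how the text orders them.

```python
def is_two_char_given_name(c1, c2, model, thr2, thr3):
    """d(c1c2): pass if the joint estimate or the per-character product is high enough."""
    joint = model["pair"].get(c1 + c2, 0.0)
    product = model["first"].get(c1, 0.0) * model["second"].get(c2, 0.0)
    return joint > thr2 or product > thr3


def is_one_char_given_name(c1, model, thr4):
    """d(c1): pass if the single-character given-name estimate is high enough."""
    return model["single"].get(c1, 0.0) > thr4


def find_person_names(sentence, surnames, model, thr2, thr3, thr4):
    """Use each surname character as an anchor and test the characters that follow it."""
    names = []
    for i, ch in enumerate(sentence):
        if ch not in surnames:
            continue
        following = sentence[i + 1:i + 3]
        if len(following) == 2 and is_two_char_given_name(
            following[0], following[1], model, thr2, thr3
        ):
            names.append(ch + following)
        elif following and is_one_char_given_name(following[0], model, thr4):
            names.append(ch + following[0])
    return names


# Hypothetical toy values, chosen only so that the example runs.
MODEL = {
    "pair": {"京育": 0.001, "靜茹": 0.002},
    "first": {"京": 0.005, "靜": 0.01},
    "second": {"育": 0.004, "茹": 0.008},
    "single": {"琪": 0.01},
}
SURNAMES = {"張", "汪"}
SENTENCE = "在當時發行人張京育指示、總編輯汪琪與編輯張靜茹策劃下"

NAMES = find_person_names(SENTENCE, SURNAMES, MODEL, thr2=1e-4, thr3=1e-5, thr4=1e-3)
print(NAMES)
```

With these toy values, the three person names in (S1) are recovered from the Chinese sentence alone, which is what allows “Gypsy Chang” to be paired with “張靜茹” once the surname is anchored.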

CPNR is applied only when the given NE is a named person and $\mathrm{Score}_{tm}(R(f_{a_i}) \mid e_i)$ in Eq. (7) is less than $\mathrm{Thr}_1$. Then, given a named person, the transliteration score function is reformulated as

\[
\mathrm{Score}_{tm}(R(f_{a_i}) \mid e_i) =
\begin{cases}
\max\{\log P(c_1c_2 \mid GN_2^{12}),\ \log(P(c_1 \mid GN_2^{1}) \times P(c_2 \mid GN_2^{2}))\}, & \text{if } d(c_1c_2) \text{ is true}, \\
\log P(c_1 \mid GN_1), & \text{otherwise, if } d(c_1) \text{ is true}.
\end{cases}
\]
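The reformulated score can be sketched in a few lines. The probability tables and thresholds below are hypothetical, and returning `None` when neither decision function holds is an assumption of this sketch, since the text does not say what happens in that case. The thresholds are assumed to be non-negative, so that a passing estimate is always strictly positive.

```python
import math


def cpnr_score(c1, c2, model, thr2, thr3, thr4):
    """Log-probability score for a given-name candidate that follows a surname anchor."""
    joint = model["pair"].get(c1 + c2, 0.0)
    product = model["first"].get(c1, 0.0) * model["second"].get(c2, 0.0)
    if joint > thr2 or product > thr3:
        # d(c1c2) is true: take the larger of the two log-probabilities,
        # skipping a zero estimate so that log(0) is never evaluated.
        return max(math.log(p) for p in (joint, product) if p > 0.0)
    single = model["single"].get(c1, 0.0)
    if single > thr4:
        # d(c1) is true: fall back to the one-character estimate.
        return math.log(single)
    return None  # assumption: no CPNR score when neither test passes


# Hypothetical toy values for illustration only.
MODEL = {
    "pair": {"瑩芬": 0.002},
    "first": {"瑩": 0.01},
    "second": {"芬": 0.02},
    "single": {"琪": 0.01},
}

TWO_CHAR = cpnr_score("瑩", "芬", MODEL, 1e-4, 1e-5, 1e-3)
ONE_CHAR = cpnr_score("琪", "與", MODEL, 1e-4, 1e-5, 1e-3)
NO_MATCH = cpnr_score("表", "示", MODEL, 1e-4, 1e-5, 1e-3)
print(TWO_CHAR, ONE_CHAR, NO_MATCH)
```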

3.7 Acronym Expansion (AE)

Acronyms carry significant information in corpora. However, they are frequently created in a domain-specific manner and cannot be completely covered by any existing dictionary. If acronym-expansion pairs are not mined, it is very difficult to align the corresponding NE pairs in the corpora. To acquire acronym-translation pairs, a two-stage approach is proposed in this article. In the first stage, in which an acronym-expansion list is compiled, a simple algorithm is applied to extract a possible expansion candidate for each acronym in the source sentence. In the second stage, we apply the proposed bilingual NE alignment algorithm to extract the translation of the expansion from the aligned target sentence. The strategy of acronym expansion is based on the observation that an acronym and its expansion typically appear within a sentence in a specific pattern. This usually occurs in one of the following two canonical forms: a pair of parentheses around an acronym, such as “World Health Organization (WHO)”; or a pair of parentheses around an expansion, such as “WHO (World Health Organization).” In this study, we used the algorithm proposed by Schwartz and Hearst [2003] to extract acronym-expansion pairs from the corpora we employed. Table 1 shows a partial list of the acronym-expansion pairs extracted from Sinorama Magazine. This list can be applied in subsequent bilingual NE extraction.
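The first stage can be illustrated with a simplified sketch in the spirit of the Schwartz and Hearst matching procedure: acronym characters are matched right to left against the candidate text, and the first acronym character must begin a word. The function names, the regular expression, and the validity check on short forms are choices made for this sketch and should not be read as the exact procedure used in the experiments.

```python
import re

PAREN = re.compile(r"\(([^()]+)\)")


def best_long_form(short, candidate):
    """Match the characters of `short` right to left inside `candidate`."""
    s = len(short) - 1
    pos = len(candidate) - 1
    while s >= 0:
        ch = short[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until the character matches; the first acronym character
        # must additionally sit at the start of a word.
        while (pos >= 0 and candidate[pos].lower() != ch) or (
            s == 0 and pos > 0 and candidate[pos - 1].isalnum()
        ):
            pos -= 1
        if pos < 0:
            return None
        pos -= 1
        s -= 1
    start = candidate.rfind(" ", 0, pos + 1) + 1
    return candidate[start:]


def extract_pairs(sentence):
    """Return (acronym, expansion) pairs for both parenthesis patterns."""
    pairs = []
    for match in PAREN.finditer(sentence):
        inside = match.group(1).strip()
        before = sentence[:match.start()].split()
        if not before:
            continue
        if len(inside.split()) <= 2:
            # Pattern "expansion (ACRONYM)": search a window of preceding words.
            short = inside
            window = min(len(short) + 5, len(short) * 2)
            candidate = " ".join(before[-window:])
        else:
            # Pattern "ACRONYM (expansion)": the word before the parenthesis.
            short = before[-1]
            candidate = inside
        if not (2 <= len(short) <= 10 and any(c.isalpha() for c in short)):
            continue
        long_form = best_long_form(short, candidate)
        if long_form:
            pairs.append((short, long_form))
    return pairs


print(extract_pairs("Under a chorus of pleas, the Council of Labor Affairs (CLA) stepped in."))
print(extract_pairs("WHO (World Health Organization) experts agreed."))
```

The window of preceding words keeps the search local to the acronym, so unrelated words earlier in the sentence are not pulled into the expansion.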

Table 1. A partial list of the acronym-expansion pairs automatically extracted from the corpora

| Acronym | Text | Expansion |
|---|---|---|
| CSC | Three years ago when Yeh Man-sheng, president of China Shipbuilding Corp. (CSC)… | China Shipbuilding Corp. |
| NTNU | …states Professor Wang Ying of the National Taiwan Normal University (NTNU) Department of Biology. | National Taiwan Normal University |
| CLA | Under a chorus of pleas, the Council of Labor Affairs (CLA)… | Council of Labor Affairs |
| WHO | We could link up with the World Health Organization (WHO)… | World Health Organization |
| MOEA | The amended Taiwan Energy Policy, produced by the Energy Commission of the Ministry of Economic Affairs (MOEA)… | Ministry of Economic Affairs |

3.8 Framework for Aligning Bilingual NEs

We propose a framework that aims to align bilingual NEs in parallel corpora. The overall process is performed via a two-stage approach. The proposed approach involves integrating the SPTM, TM, AH, CPNR, and AE modules, and applying them to align bilingual NEs in parallel corpora. Figure 4 summarizes the framework of the overall process. Table 2 shows some examples of bilingual NE pairs extracted from aligned sentences in Sinorama. For the sake of clarity, NEs are underlined in Table 2.

I. Data preprocess:
   (I.1) Perform sentence alignment.
   (I.2) Label English named entities.
II. Main process: For each English named entity e in an English sentence Se, align the corresponding Chinese named entity f in the aligned Chinese sentence Sf as follows:
   (II.1) Generate all possible Chinese NE candidates by means of the proposed SPTM model using general-purpose and domain-specific lexicons. More specifically, for each labeled e, apply SPTM and AH to find translation equivalents {f1} in Sf.
   (II.2) For each content word w in e that does not have a corresponding translation in Sf, apply the proposed TM and CPNR modules to extract the corresponding translation equivalents {f2} in Sf.
   (II.3) Merge {f1} with {f2} to form a set of potential translation equivalents {f}.
   (II.4) Rank {f} based on the scores. Choose the candidate f with the maximum score as the answer to form the pair (e, f) as the result.

Figure 4. The process of aligning bilingual NEs in parallel corpora

In Step (I.1) in Figure 4, a sentence alignment procedure based on length and lexical information [Chuang et al. 2002] is applied to align parallel texts at the sentence level. In Step (I.2), an HMM-based English NE identifier, based on case information, POS tags, the words themselves, and previously predicted NE tags, is applied to approximately label NE candidates for each sentence in the English text. Next, the labels are manually corrected. A general overview of NE recognition systems adopted in the NE recognition shared task of CoNLL-2003 can be found in [Sang and Meulder 2003]. Many studies have focused on identifying monolingual NEs, especially in English. In this study, on the other hand, our focus
is the alignment of bilingual NEs. We shall further clarify the main process by means of illustrative examples in the following. In Step (II.1), an acronym can be expanded to its full name by looking up an acronym-expansion table, as shown in Table 1. For instance, in example (1) in Table 2, “CLA” is expanded into “Council of Labor Affairs.” In this step, a set of potential Chinese NE candidates {f1} for each e is generated via the proposed phrase translation model. For instance, in example (1) in Table 2, possible translations of “Council,” “Labor,” and “Affairs” are {協會, 委員會, …}, {勞工, 人工, …}, and {事務, 行政, …}, respectively. Therefore, after applying SPTM and AH, we find that the set {f1} of “CLA” is {勞委會, 勞委, 勞會, 委會, …}. In this case, {f2} is empty, since neither TM nor CPNR is activated in Step (II.2). Thus, {f} is the same as {f1}. Finally, in Step (II.4), we can extract the NE pair (CLA, 勞委會) by choosing the top-1 ranking of the candidates in {f}. Note that “勞委會” is an abbreviation of “勞工事務委員會,” which is a translation equivalent of “Council of Labor Affairs.”

Table 2. Examples of NE pairs in aligned sentences

| Example | Bilingual Sentences | Bilingual NE Pairs |
|---|---|---|
| (1) | According to statistics of the CLA, nearly 30,000 local households have hired housekeepers, of which foreign nationals constitute two-thirds. That means that 20,000 households more or less now rely on foreign housemaids to look after the children and the home. 根據勞委會統計,目前國人家中雇有女傭的家庭將近三萬戶,其中外籍約佔三分之二,亦即二萬戶左右的家庭已在仰賴外籍女傭照顧幼兒及管家。 | (CLA, 勞委會) |
| (2) | “Visas are hard to come by,” says Amy Hung, who is currently studying at the Lincoln College Center. Quite a few people who come to Vancouver and find out things aren’t quite right try to switch to the U.S., but they usually wind up coming right back. 「簽證很難拿」,目前就讀「林肯大學中心」的洪瑩芬表示,有不少人來到溫哥華後發現情況不對,想轉往美國,結果都被打了回票, | (Amy Hung, 洪瑩芬), (Lincoln College Center, 林肯大學中心), (Vancouver, 溫哥華), (U.S., 美國) |

In the next example, we will demonstrate how CPNR helps to improve the performance of PER-type alignment. Person names are almost always transliterated. However, in example (2) in Table 2, “瑩芬 (Ying Fen)” is neither transliterated nor translated from “Amy” in the NE pair (Amy Hung, 洪瑩芬 “Hung Ying Fen”). In this example, “Hung” is a Chinese last name and can be translated as “洪,” which forms {f1}. But “Amy” is a foreign name that does not have a corresponding transliteration in the aligned Chinese sentence. The goal of the next step is to find the association between a foreign name and a Chinese name. Therefore, in Step (II.2), CPNR is activated by “Amy,” and the Chinese given name “瑩芬” is then detected by CPNR, forming {f2}. Thus, the pair (Amy Hung, 洪瑩芬) is successfully detected by merging {f1} with {f2}. To extract bilingual NE pairs, the symmetric approach in previous studies requires identifying NEs in both languages. However, developing two NE identifiers instead of one requires a lot of effort. Moreover, two NE identifiers do not always extract NE pairs consistently, especially when one of the NE identifiers is not as capable as the other one. For example, the Chinese NE identifier is currently not well developed due to the fact that there are no spaces between Chinese characters, leading to ambiguity in word segmentation. The framework proposed here, on the other hand, only needs a reliable English NE identifier together with the proposed models associated with multiple knowledge sources to extract NE pairs. Experimental results show that our approach can achieve excellent performance for bilingual NE alignment. More details about the experiments will be reported in Section 4.
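Steps (II.3) and (II.4) for the “Amy Hung” example can be sketched as follows. The paper does not spell out how {f1} and {f2} are merged, so this sketch makes its own assumptions: partial candidates are character spans in the Chinese sentence, two spans are joined when they are adjacent, and the joined candidate takes the sum of the two log scores. All scores and helper names are hypothetical.

```python
def find_span(sentence, text, score):
    """Locate `text` in `sentence` and return a (start, end, score) candidate."""
    start = sentence.index(text)
    return (start, start + len(text), score)


def merge_adjacent(sentence, f1, f2):
    """Join every f1 span with every f2 span that touches it (assumed merge rule)."""
    merged = []
    for s1, e1, score1 in f1:
        for s2, e2, score2 in f2:
            if e1 == s2:
                merged.append((sentence[s1:e2], score1 + score2))
            elif e2 == s1:
                merged.append((sentence[s2:e1], score1 + score2))
    return merged


def best_candidate(candidates):
    """Step (II.4): keep the candidate with the maximum score."""
    return max(candidates, key=lambda cand: cand[1])[0]


SENTENCE = "目前就讀「林肯大學中心」的洪瑩芬表示"

# {f1}: the surname "Hung" translated as 洪 (toy log score).
F1 = [find_span(SENTENCE, "洪", -0.5)]
# {f2}: given-name candidates proposed by CPNR (toy log scores).
F2 = [find_span(SENTENCE, "瑩芬", -6.2), find_span(SENTENCE, "瑩", -7.0)]

MERGED = merge_adjacent(SENTENCE, F1, F2)
BEST = best_candidate(MERGED)
print(MERGED, BEST)
```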

4 Experiments

This section describes the experimental setup and performance evaluation of the proposed approach to bilingual NE alignment in parallel corpora.

4.1 Experimental Setup

Several corpora were collected to estimate the parameters of the proposed models. Noun phrases from the BDC Electronic Chinese-English Dictionary [BDC 1992] were used to train SPTM. A bilingual organization name corpus and a bilingual location name corpus compiled by the Central News Agency [CNA 2003] and Britannica Concise Encyclopedia [BCE 2003], respectively, were used to train an NE-specific SPTM. The parallel corpus collected from Sinorama Magazine was used to construct the corpus-based lexicon and to estimate LTP. To train TM, 2,430 pairs of English names together with their Chinese transliterations [Huai 1989] and Chinese romanization tables were used. To train CPNR, a Chinese person name corpus containing one million Chinese person names was used. Test cases were also drawn from Sinorama Magazine to evaluate the performance of bilingual NE alignment. Sinorama covers a wide range of topics, including personalities, places, and events in Taiwan. In Table 3, we report the statistics of the selected corpus derived from Sinorama.

Table 3. Statistics of the Sinorama corpus

| Dates | Aligned Sentences | English Words | Chinese Characters |
|---|---|---|---|
| 1995-2002 | 50,000 | 2,420,000 | 2,534,000 |

The NE alignment performance was evaluated according to the precision rate at the NE phrase level:

\[
\text{Phrase Precision} = \frac{\text{number of correctly aligned NE pairs}}{\text{number of correct NE pairs}}.
\]

To analyze the performance of the proposed methods for NE alignment, we randomly selected 500 aligned sentences from Sinorama and manually labeled the answer keys. Each chosen aligned sentence contained at least one NE pair. Currently, we restrict the lengths of English NEs to be less than 6 words. In total, 1432 pairs of NEs were labeled. The numbers of NE pairs for types PER, LOC, and ORG were 380, 522, and 530, respectively. Table 4 shows the statistics of these bilingual NE pairs.

Table 4. Occurrence statistics for bilingual NE pairs in the Sinorama test set

| NE Type | PER | LOC | ORG |
|---|---|---|---|
| Total Occurrences | 380 (26.54%) | 522 (36.45%) | 530 (37.01%) |
| Unique Occurrences | 352 (31.18%) | 327 (28.96%) | 450 (39.86%) |

4.2 Experimental Results and Discussion

Several experiments were subsequently conducted to analyze the performance enhancement achieved with the proposed methods. Moreover, for the purpose of comparison with established baselines, we evaluated NE alignment against IBM Model 4 [Brown et al. 1993] using the toolkit Giza++ (see footnote 2) [Och and Ney 2003], which is a publicly available implementation of the IBM models. The experimental results presented here are based on the evaluation criterion mentioned above and are shown in Table 5.

Table 5. Performance in bilingual NE alignment with the Sinorama test set

| Method | PER | LOC | ORG | Average |
|---|---|---|---|---|
| SPTM+TM (baseline) | 85.79% | 93.87% | 75.66% | 84.99% |
| SPTM+TM+AE | 85.79% | 93.87% | 78.30% | 85.96% |
| SPTM+TM+AE+AH | 85.49% | 96.17% | 84.34% | 88.97% |
| SPTM+TM+AE+CPNR | 93.42% | 93.87% | 78.30% | 87.99% |
| SPTM+TM+AE+AH+CPNR | 93.42% | 95.79% | 84.91% | 91.13% |
| IBM Model 4 | 29.47% | 56.51% | 37.92% | 42.46% |

From the above data, we can observe the following facts:
(1) AH and AE contribute remarkably to ORG, because they help to deal with ORG-type abbreviations occurring in both Chinese and English.
(2) CPNR contributes significantly to PER, since CPNR attempts to solve the problem of mapping a foreign name to its corresponding Chinese name.
(3) The performance of ORG is the worst of all. The major reasons are its highly complex structure and great variety. ORG-type NEs are also longer than the other types in both English and Chinese, which is also a risk factor for transforming ORG-type NEs. Table 6 shows the average lengths of the NE types for the answer set.
(4) Each individual method consistently helps to improve the baseline for all types of NEs. Moreover, the approach with all knowledge sources achieves much better results than any other system with partial knowledge sources.
(5) The proposed approach significantly outperforms the traditional approach of IBM Model 4.
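As a quick arithmetic check, the Average column of Table 5 appears to be a count-weighted (micro) average of the per-type precision rates, using the test-set counts of 380 PER, 522 LOC, and 530 ORG pairs. This reading is inferred from the reported numbers rather than stated in the text; the sketch below reproduces two of the rows under that assumption.

```python
# Numbers of labeled NE pairs per type in the Sinorama test set (Table 4).
COUNTS = {"PER": 380, "LOC": 522, "ORG": 530}

# Per-type precision rates in percent, copied from Table 5.
PRECISION = {
    "SPTM+TM (baseline)": {"PER": 85.79, "LOC": 93.87, "ORG": 75.66},
    "SPTM+TM+AE+AH+CPNR": {"PER": 93.42, "LOC": 95.79, "ORG": 84.91},
}


def micro_average(per_type, counts):
    """Weight each type's precision by its share of the test pairs."""
    total = sum(counts.values())
    return sum(per_type[t] * counts[t] for t in counts) / total


AVERAGES = {name: round(micro_average(rates, COUNTS), 2) for name, rates in PRECISION.items()}
print(AVERAGES)
```

Under this reading, ORG carries the largest weight (530 of 1432 pairs), so gains on ORG-type NEs move the average more than equally sized gains on PER.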

Table 6. Average lengths of the NE types for the answer set in the Sinorama test set

| NE Type | PER | LOC | ORG |
|---|---|---|---|
| Avg. Length in words (English NE) | 1.81 | 1.58 | 2.99 |
| Avg. Length in characters (Chinese NE) | 2.79 | 2.70 | 4.74 |

More specifically, Table 7 shows some examples that demonstrate the performance enhancement achieved by adding more knowledge sources. For simplicity, we will only focus on certain NEs that are underlined. In example (1), after the anchor point (Liu, 劉 “Liu”) is aligned, CPNR is activated to successfully detect the first name (Hamilton, 國芊 “Kou Chien”) even though “國芊” cannot be directly transformed from “Hamilton” via transliteration or translation. In example (2), the set of high ranking translations of “Taipei First Girls' High School” via SPTM is {“台北第一女孩高學校,” “台北第一女子高學校,” “台北第一女子高級學校,” …}. The correct complete translation is “台北第一女子高級中學.” Even though the complete translation is not generated due to the lack of lexicon coverage, the Chinese abbreviation “北一女” can still be well approximated via AH. Example (3) demonstrates the alignment between an acronym and its translation. “Council of Labor Affairs” is an expansion of “CLA,” and one of its translations, “勞工事務委員會,” can be aligned with “勞委會” via the proposed AH.

2 http://www.fjoch.com/GIZA++.html.
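The two AH matches discussed here share one observable property: each abbreviation keeps some characters of the full form in their original order. The sketch below tests only that property. It is a simplification made for illustration and is not the similarity measure that AH itself computes.

```python
def is_ordered_subsequence(abbreviation, full_form):
    """True if the abbreviation's characters occur in the full form in the same order."""
    remaining = iter(full_form)
    # `ch in remaining` consumes the iterator up to the first match,
    # so later characters must be found further to the right.
    return all(ch in remaining for ch in abbreviation)


CHECKS = {
    ("勞委會", "勞工事務委員會"): is_ordered_subsequence("勞委會", "勞工事務委員會"),
    ("北一女", "台北第一女子高級中學"): is_ordered_subsequence("北一女", "台北第一女子高級中學"),
    ("會委勞", "勞工事務委員會"): is_ordered_subsequence("會委勞", "勞工事務委員會"),
}
print(CHECKS)
```

A subsequence test alone would accept many spurious strings, which is presumably why AH scores candidates rather than simply accepting them.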

Table 7. Examples of possible Chinese NEs extracted using the proposed approach

| Example | Bilingual Aligned Texts | Result Obtained Using the Baseline Method | Improved Result | Method(s) Used |
|---|---|---|---|---|
| (1) | After numerous passenger protests, Hamilton Liu, Deputy Director of the China Airlines Public Relations Office, says,… 經歷了多次旅客抗議事件,華航公關室副主任劉國芊也說,… | 劉 | 劉國芊 | Improved by invoking CPNR |
| (2) | "When the travel agency notified us that we'd been accepted by the 'Royal Canadian College,' I was really proud…says Kung Hsi, who passed up 11th grade at the Taipei First Girls' High School evening class division to come to Vancouver to study with her younger brother. 「當初旅行社通知說申請到『皇家學院』,我好得意哦,還到學校吹噓;…原本就讀北一女夜間部,高二時和弟弟一起到溫哥華當小留學生的孔曦吐吐舌頭表示。 | 學校 | 北一女 | Improved by invoking AH |
| (3) | According to statistics of the CLA, nearly 30,000 local households have hired housekeepers, of which foreign nationals constitute two-thirds. That means that 20,000 households more or less now rely on foreign housemaids to look after the children and the home. 根據勞委會統計,目前國人家中雇有女傭的家庭將近三萬戶,其中外籍約佔三分之二,亦即二萬戶左右的家庭已在仰賴外籍女傭照顧幼兒及管家。 | | 勞委會 | Improved by invoking AH and AE |

To investigate whether the proposed approach is sensitive to a change of domain, we also conducted experiments on the LDC (see footnote 3) parallel corpus of the Hong Kong News Parallel Text (abbreviated as HKNPT). First of all, to analyze the performance of the proposed methods for NE alignment, we randomly selected 400 aligned sentences from HKNPT and manually labeled the answer keys. In total, 893 pairs of NEs were labeled. The numbers of NE pairs for types PER, LOC, and ORG were 168, 348, and 377, respectively. Tables 8 and 9 show the relevant statistics of these bilingual NE pairs.

Table 8. Occurrence statistics for bilingual NE pairs in the HKNPT test set

| NE Type | PER | LOC | ORG |
|---|---|---|---|
| Total Occurrences | 168 (18.81%) | 348 (38.97%) | 377 (42.22%) |
| Unique Occurrences | 149 (25.17%) | 190 (32.09%) | 253 (42.74%) |

Table 9. Average lengths of the NE types for the answer set in the HKNPT test set

| NE Type | PER | LOC | ORG |
|---|---|---|---|
| Avg. Length in words (English NE) | 1.99 | 2.17 | 2.73 |
| Avg. Length in characters (Chinese NE) | 2.81 | 2.95 | 4.84 |

Since Cantonese transliterations of Chinese characters are quite different from Mandarin transliterations of Chinese characters, two Cantonese romanization systems (see footnote 4) were adopted in the experiment. The experimental results are shown in Table 10.

Table 10. Performance in bilingual NE alignment with the HKNPT test set

| Method | PER | LOC | ORG | Average |
|---|---|---|---|---|
| SPTM+TM (baseline) | 51.79% | 76.72% | 54.91% | 62.82% |
| SPTM+TM+AE | 51.79% | 80.75% | 66.84% | 69.43% |
| SPTM+TM+AE+AH | 51.79% | 83.62% | 72.94% | 73.12% |
| SPTM+TM+AE+CPNR | 83.93% | 80.46% | 66.58% | 75.25% |
| SPTM+TM+AE+AH+CPNR | 84.52% | 83.91% | 74.80% | 80.18% |
| IBM Model 4 | 40.48% | 83.33% | 74.01% | 71.33% |

3 http://www.ldc.upenn.edu/.

As shown in Table 10, without fine-tuning the phrase translation model, our approach outperforms IBM Model 4 on average. The performance of the proposed approach is significantly better than that of IBM Model 4 on PER-type NEs and is competitive with IBM Model 4 on LOC-type and ORG-type NEs. Although most NE pairs were extracted correctly from the test corpora, some NE pairs were not, as shown in Table 11. For simplicity, only erroneously aligned NE pairs are underlined in the table. These errors are explained as follows.

In example (1), the proposed TM fails to extract “阿爾讓特 (A Erh Jang Te),” since the transliteration score of the pair (Argenteuil, 阿爾讓特) is too low to exceed the threshold. This is due to the fact that “Argenteuil” is French, not English.

In example (2), (Pan Viet, 越盛公司) cannot be identified correctly since “Viet” is much closer to “為 (Wei)” than to “盛 (Sheng),” based on the similarity at the grapheme level. In fact, “Pan Viet” is a foreign language name of “越盛公司 (Yueh Sheng Company).” Certainly, the proposed approach has difficulty solving this case, where the NE mapping is not obtained through a combination of transliteration and translation.

In examples (3) and (4), the errors are caused by the limited coverage of the lexicons we used. Currently, the employed bilingual dictionary does not have the translations (Pratas, 東沙) and (Vetting, 評審). Of course, if we can incrementally add vocabulary entries to the employed dictionary, the performance will be further improved.

In example (5), the correct Chinese NE “尤曾家麗 (Yau Jang Ga Lai)” consists of the surname “尤曾” and the given name “家麗.” (Carrie Yau, 尤曾家麗) was not extracted, because the anchor point “Yau” was first aligned with the Chinese surname “尤” instead of “尤曾.” Thus, CPNR was applied to detect the subsequent two characters “曾家” instead of “家麗.”
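The failure in example (5) can be reproduced with a toy sketch, together with one possible remedy: trying a composite surname made of two one-character surnames before falling back to a single character. The surname set and function name are hypothetical, and the sketch ignores the given-name verification that CPNR would still need to apply.

```python
def split_at_anchor(text, surnames, allow_composite=False):
    """Split the text at a surname anchor into (surname, two-character given name)."""
    surname_len = 1
    if (
        allow_composite
        and len(text) >= 2
        and text[0] in surnames
        and text[1] in surnames
    ):
        # Treat two consecutive surname characters as one composite surname.
        surname_len = 2
    surname = text[:surname_len]
    given = text[surname_len:surname_len + 2]
    return surname, given


SURNAMES = {"尤", "曾"}          # hypothetical one-character surname list
TEXT = "尤曾家麗的答覆"           # text starting at the anchor in example (5)

SINGLE = split_at_anchor(TEXT, SURNAMES)
COMPOSITE = split_at_anchor(TEXT, SURNAMES, allow_composite=True)
print(SINGLE, COMPOSITE)
```

Always preferring the composite reading would be too eager in practice, because many given names begin with a character that is also a common surname; a real system would presumably compare the given-name scores of both readings.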

4 Details about the Cantonese romanization schemes we used can be found at “http://www.info.gov.hk/digital21/eng/structure/jyutping.html” and “http://home.netvigator.com/~spikel/canton.txt.”

Another type of error occurs because some NEs are not translated literally. For instance, in example (6), (Osteogenesis Imperfecta Association, 玻璃娃娃協會) cannot be obtained by translating the individual words, since the Chinese name is composed of (glass, 玻璃), (doll, 娃娃), and (association, 協會). Obviously, the meanings of “osteogenesis” and “imperfecta” are quite different from the meanings of “glass” and “doll,” respectively, even though we have translations of “osteogenesis” and “imperfecta” in the employed dictionary.

Table 11. Examples of alignment errors made using the proposed approach

| Example | Type | Bilingual Sentences | Correct NE Pairs | Misaligned Chinese NEs | Corpus |
|---|---|---|---|---|---|
| (1) | LOC | At the end of the 19th century, French urbanites were well accustomed to the comforts of life in a modern city, and liked to make outings by train to destinations like the newly popular Seine town of Argenteuil. 在上一個世紀的交替,歐洲的法國市民已經過著工業化與現代城市的生活。人們坐著火車到近郊休閒,像是塞納河沿岸的阿爾讓特港就是新興的休閒市鎮。 | (Argenteuil, 阿爾讓特 “A Erh Jang Te”) | | Sinorama |
| (2) | ORG | Vietnam's land, ten times larger than Taiwan's, provides broad spaces to roam, and Pan Viet chairman C.F. Chang moved quickly to seize the investment opportunity. 越南十倍於台灣的土地,則提供了開闊的馳騁空間,越盛公司董事長張哲發投資的快馬,便趁勢奔騰其中。 | (Pan Viet, 越盛公司 “Yueh Sheng Company”) | 為 “Wei” | Sinorama |
| (3) | LOC | To survey the distribution and numbers of green turtles in the area, in July of last year Cheng Yichun went to the Pratas Islands,… 為了調查該地區綠蠵龜的分佈及族群狀況,去年七月,程一駿曾前往東沙群島做綠蠵龜分佈調查,… | (Pratas Islands, 東沙群島 “Tung Sha Islands”) | 群島 | Sinorama |
| (4) | ORG | The vetting process was not easy because of the large number and diverse nature of the applications. The main criterion adopted by the Vetting Committee was whether the application would contribute to the further development of the service sectors. 由於申請數量眾多,且來自不同的服務行業,評選工作並不容易,而評審委員會所採納的主要評選標準,是計劃是否有助各個服務行業進一步發展。 | (Vetting Committee, 評審委員會) | 委員會 | HKNPT |
| (5) | PER | Following is a question by the Hon Howard Young and a reply by the Acting Secretary for Security, Mrs Carrie Yau, in the Provisional Legislative Council today (Wednesday): 以下為今日(星期三)在臨時立法會會議上楊孝華議員的提問和署理保安局局長尤曾家麗的答覆: | (Carrie Yau, 尤曾家麗 “Yau Jang Ga Lai”) | 尤曾家 | HKNPT |
| (6) | ORG | Lin Yu-chih, founder of the Osteogenesis Imperfecta Association, says that the law should not only reduce the financial burden on patients' families, but also… 「玻璃娃娃協會」發起人林煜智指出,法案通過,一方面減輕病患家庭的負擔,也… | (Osteogenesis Imperfecta Association, 玻璃娃娃協會) | 協會 | Sinorama |

A comparison between Sinorama and HKNPT in terms of the performance achieved using language-specific knowledge sources, including SPTM, TM, AE, AH, and CPNR, is shown in Table 12. Detailed statistics on the numbers of translations and transliterations occurring in the two test corpora are given in Table 13.

Table 12. Performance for each language-specific knowledge source in the two corpora

| Corpus | SPTM | TM | AE | AH | CPNR |
|---|---|---|---|---|---|
| Sinorama | 92.74% | 93.37% | 88.46% | 87.01% | 88.24% |
| HKNPT | 83.41% | 84.01% | 90.41% | 87.50% | 92.45% |

Table 13. Detailed statistics on the numbers of translations and transliterations in the two corpora

| Corpus | Statistic | PER | LOC | ORG |
|---|---|---|---|---|
| Sinorama | Numbers of Translations | 37 | 469 | 512 |
| Sinorama | Numbers of Transliterations | 345 | 118 | 78 |
| HKNPT | Numbers of Translations | 5 | 263 | 377 |
| HKNPT | Numbers of Transliterations | 163 | 145 | 15 |

Notably, on average, the performance of the proposed approach in the Sinorama test was better than that in the HKNPT test. One major reason is that the employed translation lexicon was trained with a general dictionary, a list of NE pairs, and the Sinorama corpus. Thus, the employed lexicon, especially for the proper names of LOC-type NEs and ORG-type NEs, does not cover many word translations in HKNPT. For the LOC-type NEs, some examples that were not aligned correctly in HKNPT are given as follows: (Lantau, 大嶼山), (Castle Peak, 青山), (Stanley, 赤柱), (Repulse Bay, 淺水灣), (Queensway, 金鐘道), and (Trio, 三星灣). As for the ORG-type NEs, some erroneous examples are given as follows: (Marine and Land Enforcement Command, 海域巡邏組), (Treasury, 庫務局), (Geotechnical Engineering
Office, 土力工程處), (Department of Justice, 律政司), (Arch SD, 建築署), and (Correctional Services Department, 懲教署). However, the above problems can be alleviated by adding a small gazetteer of well-known names or by automatically learning word translations from domain-specific corpora.

The other reason for the poorer performance in the HKNPT test was transliteration failures. Some transliteration error cases, which show the difficulties that were encountered when NEs were transliterated in the HKNPT test, are listed as follows: (Mauritian, 毛里求斯 “Mao Li Chiu Ssu”), (Robert Ribeiro, 李義 “Li I”), (Tony Blair, 貝理雅 “Pei Li Ya”), (Burrell, 貝偉 “Pei Wei”), (Derek Roebuck, 羅德立 “Lo Te Li”), and (Felice Lieh Mak, 麥列菲菲 “Mai Lieh Fei Fei”). None of the above cases can be solved via transliteration or CPNR.

We also noticed that the performance of IBM Model 4 in the HKNPT test was much better than that in the Sinorama test. One possible reason may be the number of aligned sentences in the corpora used in our experiment, since IBM Model 4 was designed for word alignment based on sufficient co-occurrence statistics from large parallel corpora. In our experiment, the Sinorama corpus consisted of 50,000 aligned sentences, far fewer than the 600,000 aligned sentences used from the HKNPT corpus.

By incorporating the proposed baseline method with extra knowledge functions, we achieved significant improvement in NE alignment in our experiments on different test data. We believe that the proposed framework, achieved by integrating various language functions, can be further improved through the use of more refined models and language-specific functions. The proposed SPTM can be refined by introducing additional linguistic information. More specifically, we can train finer-grained SPTM parameters for each individual NE type instead of tying all types into a single SPTM. Moreover, if we have sufficient training data, we can train SPTM constrained by NE keywords. For example, the translation “處” for “department” almost always appears in the last position of Chinese ORG-type NEs. As for CPNR, we can enhance Chinese surname detection performance by considering composite surnames that are composed of two one-character surnames, such as “尤曾.” Furthermore, the proposed CPNR can be refined by introducing gender information for PER-type NEs. For instance, given the source NE “Amy Hung” with two candidate names “洪瑩芬 (Hung, Ying-Fen)” and “洪俊傑 (Hung, Chun-Chieh),” it is obvious that “洪瑩芬” should be selected since it is a female name. Contextual information, such as personal titles and speech-act verbs, can also be used to improve the NE alignment performance by analyzing the contextual structures of aligned sentences.

One major limitation of training statistical translation models using parallel corpora is the lack of large parallel corpora. A potentially effective way to alleviate this problem is to develop algorithms for automatically extracting parallel corpora from the Web [Nie et al. 1999; Kilgarriff and Grefenstette 2003; Resnik and Smith 2003; Yang and Li 2003]. We believe that, now that the Web has become a huge repository of resources, many bilingual corpora could be automatically mined from the Web. Such a trend will benefit efforts to efficiently develop bilingual text processing tools using parallel corpora.

We have proposed a unified framework for bilingual NE alignment in parallel corpora. Several experiments were conducted to demonstrate the better performance of the proposed methods when confronted with various types of NEs. We used AH to measure the similarity between a Chinese NE candidate and its abbreviation. AH is highly effective for ORG-type NEs, since Chinese ORG-type NEs frequently appear in abbreviated forms. CPNR was applied to enhance the association between pairs of PER-type NEs, especially for foreign names and their corresponding Chinese names. We have successfully expanded English acronyms via AE and aligned the expansions with their translations, which is a very effective approach to aligning ORG-type NEs. All of the proposed strategies for AH, CPNR, and AE appeared to substantially improve the performance in the experiments. As a consequence, by combining all the proposed knowledge sources, the average rate could be improved from 84.99% to 91.13% for Sinorama, and from 62.82% to 80.18% for HKNPT.

5 Conclusions and Future Work

As the need to acquire bilingual NE pairs is growing, we have presented a new approach that better achieves the goal of bilingual NE alignment. In this article, we have proposed two statistical models for phrase translation and transliteration, for the alignment of bilingual NEs in parallel corpora. Moreover, we have incorporated multiple knowledge sources based on abbreviation handling, Chinese person name recognition, and acronym expansion to form a complete framework for bilingual NE alignment in parallel corpora. The contributions of this paper can be summarized as follows. First of all, we have proposed a novel approach that can effectively handle many-to-many alignment within bilingual NE phrases. Secondly, the proposed approach is mainly implemented through statistical training and, thus, is easier to port to other language pairs as long as there is sufficient training data. Thirdly, the proposed framework achieves significant improvement in NE alignment by incorporating extra knowledge sources. Finally, experimental results also show that our integrated approach outperformed the baseline approach as well as the traditional IBM Model 4 approach when applied to corpora of various subject domains. We believe that the use of various kinds of linguistic information could lead to further improvement in the alignment of bilingual NE pairs. Additionally, since we used only a limited number of dictionaries, it is probably possible to enhance the performance by using more dictionaries for domain-specific NEs and common names.

Acknowledgements

We would like to thank Yuan Liou Publishing for providing data for this study. We gratefully acknowledge the support for this study provided by the National Science Council and Ministry of Education, Taiwan (grants NSC 92-2524-S007-002, NSC 91-2213-E-007-061 and MOE EX-91-E-FA06-4-4). The MOEA also provided support as part of the Software Technology for Advanced Network Application Project of the Institute for Information Industry.

References

Al-Onaizan, Yaser and Kevin Knight. 2002. Translating named entities using monolingual and bilingual resources. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 400-408. BCE. 2003. Britannica Concise Encyclopedia, http://wordpedia.britannica.com/concise/. BDC. 1992. The BDC Chinese-English electronic dictionary (version 2.0), Behavior Design Corporation, Taiwan. Bikel, Daniel M., Richard Schwartz, and Ralph M. Weischedel. 1999. An algorithm that learns what’s in a name. Machine Learning, 34(1/3). Black, William J. and Vasilakopoulos Argyrios. 2002. Language independent named entity classification by modified transformation-based learning and by decision tree induction. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), Taipei, Taiwan, pages 159-162. Borthwick, Andrew. 1999. A maximum entropy approach to named entity recognition. PhD Dissertation, New York University. Brown, P. F., Della Pietra S. A., Della Pietra V. J., and Mercer R. L. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19 (2): 263-311. Carreras, Xavier, Lluís Màrquez, and Lluís Padró. 2002. Named entity extraction using adaboost. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), pages 167-170, Taipei, Taiwan. Chang, Jason S., David Yu, and Chun-Jen Lee. 2001. Statistical translation model for phrases. Computational Linguistics and Chinese Language Processing, 6(2): 43-64. Chen, Hsin-Hsi, Yung-Wei Ding, Shih-Chung Tsai and Guo-Wei Bian. 1998. Description of the NTU system used for MET2. In Proceedings of 7th Message Understanding Conference (MUC-7).

Chen, Hsin-Hsi, Changhua Yang, and Ying Lin. 2003. Learning formulation and transformation rules for multilingual named entities. In Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity Recognition, pages 1-8.

Chen, Keh-Jiann and Shing-Huan Liu. 1992. Word identification for Mandarin Chinese sentences. In Proceedings of COLING, pages 101-107.

Cheng, Pu-Jen, Jei-Wen Teng, Ruei-Cheng Chen, Jenq-Haur Wang, Wen-Hsiang Lu, and Lee-Feng Chien. 2004. Translating unknown queries with Web corpora for cross-language information retrieval. In Proceedings of the 27th ACM International Conference on Research and Development in Information Retrieval (SIGIR).

Chien, Lee-Feng, Chun-Liang Chen, Wen-Hsiang Lu, and Yuan-Lu Chang. 1999. Recent results on domain-specific term extraction from online Chinese text resources. In Proceedings of ROCLING XII, Hsinchu, Taiwan, pages 203-218.

Chinchor, Nancy. 1997. MUC-7 named entity task definition. In Proceedings of the 7th Message Understanding Conference (MUC-7).

Chuang, Thomas C., Geeng Neng You, and Jason S. Chang. 2002. Adaptive bilingual sentence alignment. Lecture Notes in Artificial Intelligence, 2499: 21-30.

CNA. 2003. Central News Agency, http://client.cna.com.tw.

Damerau, F. 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3): 171-176.

Feng, Donghui, Yajuan Lv, and Ming Zhou. 2004. A new approach for English-Chinese named entity alignment. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), pages 372-379.

Huai, Lu. 1989. Handbook of English Name Knowledge, ISBN 7-5012-0144-7/Z.10, 1st edition.

Huang, Fei, Stephan Vogel, and A. Waibel. 2003. Automatic extraction of named entity translingual equivalence based on multi-feature cost minimization. In Proceedings of the ACL 2003 Workshop on Multilingual and Mixed-language Named Entity Recognition, Sapporo, Japan.

Kilgarriff, Adam and Gregory Grefenstette. 2003. Introduction to the special issue on the Web as corpus. Computational Linguistics, 29(3): 333-347.

Knight, Kevin and Jonathan Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4): 599-612.

Kraaij, Wessel, Jian-Yun Nie, and Michel Simard. 2003. Embedding Web-based statistical translation models in cross-language information retrieval. Computational Linguistics, 29(3): 381-419.

Kumano, Tadashi, Hideki Kashioka, Hideki Tanaka, and Takahiro Fukusima. 2004. Acquiring bilingual named entity translations from content-aligned corpora. In Proceedings of the First International Joint Conference on Natural Language Processing (IJCNLP-04), Hainan Island, China.

Lee, Chun-Jen and Jason S. Chang. 2003. Acquisition of English-Chinese transliterated word pairs from parallel-aligned texts using a statistical machine transliteration model. In Proceedings of HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, Edmonton, Canada, pages 96-103.

Lee, Chun-Jen, Jason S. Chang, and Jyh-Shing Roger Jang. 2003. A statistical approach to Chinese-to-English back-transliteration. In Proceedings of the 17th Pacific Asia Conference on Language, Information, and Computation (PACLIC), Singapore, pages 310-318.

Lee, Chun-Jen, Jason S. Chang, and Jyh-Shing Roger Jang. 2004a. Bilingual named-entity pairs extraction from parallel corpora. In Proceedings of IJCNLP-04 Workshop on Named Entity Recognition for Natural Language Processing Applications, Hainan Island, China, pages 9-16.

Lee, Chun-Jen, Jason S. Chang, and Thomas C. Chuang. 2004b. Alignment of bilingual named entities in parallel corpora using statistical model. Lecture Notes in Artificial Intelligence, 3265: 144-153.

Lee, Chun-Jen, Jason S. Chang, and Jyh-Shing Roger Jang. 2005. Extraction of transliteration pairs from parallel corpora using a statistical transliteration model. To appear in Information Sciences.

Lee, Jae Sung and Key-Sun Choi. 1997. A statistical method to generate various foreign word transliterations in multilingual information retrieval system. In Proceedings of the 2nd International Workshop on Information Retrieval with Asian Languages (IRAL), Tsukuba, Japan, pages 123-128.

Lin, Wei-Hao and Hsin-Hsi Chen. 2002. Backward transliteration by learning phonetic similarity. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), Taipei, Taiwan.

Lu, Wen-Hsiang, Lee-Feng Chien, and Hsi-Jian Lee. 2004. Anchor text mining for translation of Web queries: a transitive translation approach. ACM Transactions on Information Systems, 22(2): 242-269.

Mikheev, Andrei, Claire Grover, and Marc Moens. 1998. Description of the LTG system used for MUC-7. In Proceedings of the 7th Message Understanding Conference (MUC-7).

Moore, Robert C. 2003. Learning translations of named-entity phrases from parallel corpora. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, pages 259-266.

Nie, Jian-Yun, Michel Simard, Pierre Isabelle, and Richard Durand. 1999. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web. In Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR).

Och, Franz Josef and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1): 19-51.

Oh, Jong-Hoon and Key-Sun Choi. 2002. An English-Korean transliteration model using pronunciation and contextual rules. In Proceedings of the 19th International Conference on Computational Linguistics (COLING), Taipei, Taiwan, pages 758-764.

Resnik, Philip and Noah A. Smith. 2003. The Web as a parallel corpus. Computational Linguistics, 29(3): 349-380.

Sang, Erik F. Tjong Kim and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003), Edmonton, Canada, pages 142-147.

Schwartz, Ariel S. and Marti A. Hearst. 2003. A simple algorithm for identifying abbreviation definitions in biomedical text. In Proceedings of the Pacific Symposium on Biocomputing (PSB).

Sinorama. 2002. Sinorama Magazine. http://www.greatman.com.tw/sinorama.htm.

Stalls, Bonnie Glover and Kevin Knight. 1998. Translating names and technical terms in Arabic text. In Proceedings of the COLING/ACL Workshop on Computational Approaches to Semitic Languages.

Sun, Jian, Ming Zhou, and Jianfeng Gao. 2003. A class-based language model approach to Chinese named entity identification. Computational Linguistics and Chinese Language Processing, 8(2): 1-28.

Tsuji, Keita. 2002. Automatic extraction of translational Japanese-KATAKANA and English word pairs from bilingual corpora. International Journal of Computer Processing of Oriental Languages, 15(3): 261-279.

Wan, Stephen and Cornelia Maria Verspoor. 1998. Automatic English-Chinese name transliteration for development of multilingual resources. In Proceedings of 17th COLING and 36th ACL, pages 1352-1356.

Wu, Chien-Cheng and Jason S. Chang. 2004. Bilingual collocation extraction based on syntactic and statistical analyses. Computational Linguistics and Chinese Language Processing, 9(1): 1-20.

Wu, Dekai, Grace Ngai, Marine Carpuat, Jeppe Larsen, and Yongsheng Yang. 2002. Boosting for named entity recognition. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002), Taipei, Taiwan, pages 195-198.

Yang, Christopher C. and Kar Wing Li. 2003. Automatic construction of English/Chinese parallel corpora. Journal of the American Society for Information Science and Technology, 54(8): 730-742.

Zhang, Ying and Phil Vines. 2004. Using the Web for automated translation extraction in cross-language information retrieval. In Proceedings of the 27th ACM International Conference on Research and Development in Information Retrieval (SIGIR).

Zhou, GuoDong and Jian Su. 2002. Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, pages 473-480.
