KAIST Department of Computer Science

Overview of the KAIST EKMT System Progressive Report Ver1

Kong Joo Lee (1), Gil Chang Kim (2)

CS/TR-97-117, December 1997


(1) Korea Advanced Institute of Science and Technology, email: [email protected]
(2) Korea Advanced Institute of Science and Technology, email: [email protected]

Overview of the KAIST EKMT System: KAIST English-to-Korean Machine Translation System

1. Introduction

The goal of machine translation (MT) is to automate the translation of natural languages: English to Korean (E/K), Korean to English (K/E), English to Japanese (E/J), Korean to Japanese (K/J), and so forth. In the ideal case, translation would be fully automated, highly accurate, stylistically perfect, and applicable to any topic and any style of text. In present-day reality, MT only partially achieves these objectives, with trade-offs such as degree of automation versus accuracy, and breadth of coverage and text types versus stylistic appropriateness.

Development of the KAIST EKMT system began in April 1996 at the Korea Advanced Institute of Science and Technology (KAIST), and the system is still under development. The project is financed by the Republic of Korea Air Force. This paper presents the pilot release of the EKMT system. The domain covered by the system is mainly military intelligence text.

The EKMT system employs a simple and old technology (rule-based, transfer-based translation using bilingual dictionaries), and it adopts statistical techniques to resolve morphological and syntactic ambiguities. Lexicalized grammars are particularly well suited to the specification of natural language grammars. Also, lexicalization provides a clean interface for combining syntactic and semantic information in the lexicon.

The EKMT system is characterized by its lexicalized rules and by the layered structure of its dictionary. The lexicalized rules can easily handle idiomatic expressions and stereotyped phrases. The layered structure of the dictionary follows directly from the stages of translation processing. The language processing engine of the EKMT system consists of four major modules: morphological analysis, syntactic analysis, transfer, and generation. The dictionary therefore has four layers, one for each processing module.

2. System Characteristics

The characteristics of the KAIST EKMT system are as follows:

Domain covered: military intelligence text

Platform:
  H/W: HP workstation
  OS: HP-UX
  Programming Language: C, Motif

Size of Lexicon (pilot release):
  English lexicon: 24,500 words
  Korean lexicon: 35,000 words

Size of Grammar (lexical-independent rules):
  syntactic analysis rules (English): 271 rules
  transfer rules (English-to-Korean): 271 rules

[Figure 1: Overview of the EKMT system. Components shown: OCR, pre-editor, source text, translator engine, scoring mechanism, dictionary, dictionary editor, grammar rules, statistic database, translated text, post-editor, target text.]

3. Technical Infrastructure

The flow of control from input to output in the KAIST EKMT system is shown in Figure 1. As the figure shows, a text is scanned by OCR to enter the translator, and the user can pre-edit the input text. In the same manner, the user can post-edit the text translated by the machine.

Also, the EKMT system gathers the target texts post-edited by users in order to build the statistics database, which is essential for resolving various ambiguities. Currently, the EKMT system can resolve morphological and syntactic ambiguities.

In this system, linguistic processing (the translator engine) is based on the dictionary and the grammar rules, and it generally proceeds through four basic stages: (1) morphological analysis of the source language; (2) syntactic analysis of the source language (parsing); (3) transfer (mapping the phrase structure of English into the dependency structure of Korean); (4) syntactic and morphological generation of the target language.

The morphological analyzer handles the inflected words in a sentence and tries to recover the root form of each word. One of the difficulties in morphological analysis is processing symbols, proper names, abbreviations, compound nouns, and the like. The analyzer also resolves the part-of-speech ambiguity of each word, and then loads the information for each word from the dictionary.

The syntactic analyzer determines the syntactic structure of the input sentence using the syntactic rules of the system. The EKMT system adopts a scoring mechanism based on a simple probabilistic context-free grammar. The output of the syntactic analyzer is a phrase structure.

The transfer module transforms the phrase structure into the dependency structure of the target language. It also selects the target word according to collocational information. A collocation is a pair of words that tend to co-occur in a particular syntactic relationship, so we can expect better word selection than would be possible without collocational information.
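As a rough illustration of a scoring mechanism based on a simple probabilistic context-free grammar, the sketch below scores each candidate parse by the product of the probabilities of the rules it uses and keeps the best one. The rule probabilities, the example trees, and the names `RULE_PROB`, `score_tree`, and `best_parse` are hypothetical; they are not taken from the EKMT grammar or its statistics database.

```python
import math

# Hypothetical probabilities P(lhs -> rhs); not taken from the EKMT grammar.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V", "NP", "PP")): 0.3,
    ("NP", ("N",)): 0.6,
    ("NP", ("NP", "PP")): 0.4,
    ("PP", ("P", "NP")): 1.0,
}

def score_tree(tree):
    """Return the log-probability of a parse tree.

    A tree is (label, [children]); a preterminal is (label, word).
    Lexical probabilities are ignored to keep the sketch short.
    """
    label, children = tree
    if isinstance(children, str):  # preterminal: contributes nothing here
        return 0.0
    rhs = tuple(child[0] for child in children)
    logp = math.log(RULE_PROB[(label, rhs)])
    return logp + sum(score_tree(child) for child in children)

def best_parse(candidates):
    """Pick the candidate parse with the highest score."""
    return max(candidates, key=score_tree)

# Two analyses of "I saw men with telescopes" (prepositional phrase attachment).
pp = ("PP", [("P", "with"), ("NP", [("N", "telescopes")])])
subject = ("NP", [("N", "I")])
vp_attach = ("S", [subject,
                   ("VP", [("V", "saw"), ("NP", [("N", "men")]), pp])])
np_attach = ("S", [subject,
                   ("VP", [("V", "saw"),
                           ("NP", [("NP", [("N", "men")]), pp])])])

chosen = best_parse([vp_attach, np_attach])
```

With these made-up numbers the verb-phrase attachment wins; in a real system the probabilities would be estimated from data such as the post-edited texts mentioned above.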

3.1 Grammar

Two kinds of rules are used in the grammar of the EKMT system: syntactic analysis rules and transfer rules. Syntactic analysis rules are linked to transfer rules, and a syntactic analysis rule may have links to one or more transfer rules. The syntactic analysis rules are augmented context-free rules: each rule is augmented with conditions and actions that operate on feature structures. The form of a syntactic analysis rule is as follows:

S_ENTRY: id of a syntactic analysis rule (only if lexicalized rule)
RULE: syntactic analysis rule
CONDITION: checks constraints
ACTION: validate constraints
T_LINK: links to the transfer rule (id of T_ENTRY)

The transfer rules convert a phrase tree of the source language into a dependency tree of the target language. They can also select the target word according to the collocation of the word. The form of a transfer rule is as follows:

T_ENTRY: id of a transfer rule (only if lexicalized rule)
MAPPING: information that maps the phrase structure into the dependency structure
K_REL: relation between nodes of the dependency structure
COLLOCATION: collocational information of the source word
G_LINK: links to the target lexicon (only if lexicalized rule)

The rules in the grammar used in this system (syntactic analysis rules and transfer rules) can be classified into three classes according to their importance. The most important class is the lexicalized rule, which has a lexical item in its representation. The second most important class is the chunk rule. Because chunk rules cover only a small span of an input sentence, they rarely introduce ambiguity. The general rules attach the small phrases constructed by the lexicalized rules and chunk rules to one another. To process an input sentence, the EKMT system first activates the lexicalized rules and chunk rules, and finally activates the general rules.
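The activation order described above can be sketched as a two-phase reduction: lexicalized and chunk rules are applied until nothing more matches, and only then are the general rules tried. This is a minimal sketch under strong simplifying assumptions: the input is a flat list of words and category labels, rules are plain rewrite rules without conditions or actions, and the `Rule` class, the three example rules, and the function names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    kind: str   # "lexicalized", "chunk", or "general"
    lhs: str    # category produced by the rule
    rhs: tuple  # sequence of words or categories the rule consumes

def reduce_once(symbols, rules):
    """Rewrite the leftmost span that matches any rule; None if nothing matches."""
    for i in range(len(symbols)):
        for rule in rules:
            n = len(rule.rhs)
            if tuple(symbols[i:i + n]) == rule.rhs:
                return symbols[:i] + [rule.lhs] + symbols[i + n:]
    return None

def reduce_all(symbols, rules):
    """Apply the given rules repeatedly until no rule matches."""
    while True:
        reduced = reduce_once(symbols, rules)
        if reduced is None:
            return symbols
        symbols = reduced

def analyze(symbols, rules):
    """Phase 1: lexicalized and chunk rules. Phase 2: general rules."""
    early = [r for r in rules if r.kind in ("lexicalized", "chunk")]
    late = [r for r in rules if r.kind == "general"]
    symbols = reduce_all(list(symbols), early)
    return reduce_all(symbols, late)

# Hypothetical rules, one of each class.
RULES = [
    Rule("lexicalized", "VP", ("claim", "NP")),
    Rule("chunk", "NP", ("the", "N")),
    Rule("general", "S", ("NP", "VP")),
]

after_phase_one = reduce_all(["the", "N", "claim", "the", "N"], RULES[:2])
result = analyze(["the", "N", "claim", "the", "N"], RULES)
```

Here the first phase builds the two noun phrases and the `claim` verb phrase, and the general rule then joins them into a sentence. A real parser would keep alternative analyses in a chart rather than committing to the first match.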

lexicalized rules: Because they encode a lexical item in their representation, they can efficiently encode idiomatic expressions and stereotyped phrases that are fixed on a specific lexeme. While the chunk rules and general rules are included in the system grammar, the lexicalized rules are included in the dictionary, not in the system grammar. They can link the source lexicon with the target lexicon. The following is an example of a lexicalized rule.

#S_ENTRY
#RULE
#CONDITION
#ACTION
#T_LINK
#T_ENTRY
#E_COLL
#EK_MAP
#EK_MAP
#K_REL
#G_LINK

chunk rules: They can analyze a small span of the input sentence without ambiguity. Chunk rules are independent of lexical items, so they have no `#S_ENTRY' or `#G_LINK' fields, which link the source lexicon and the target lexicon. They are described not in the dictionary but in the grammar. The following is an example of a chunk rule.

#RULE
#ACTION
#ACTION <np the np1>
#T_LINK
#T_ENTRY <np the np1>
#E_COLL
#EK_MAP

general rules: They connect the small phrases constructed by the lexicalized rules and the chunk rules. General rules are also independent of lexical items, so they are described not in the dictionary but in the grammar.

#RULE SBCL:sbcl0 conj:conj1 NP:np1 VP:vp1
#CONDITION conj1.subcat [SUBO]
#ACTION sbcl0 conj1
#ACTION sbcl0.tense := vp1.tense ** [past pres futu]
#T_LINK <sbcl conj np vp1>
#T_ENTRY <sbcl conj np vp1>
#EK_MAP
#EK_MAP
#EK_MAP
#K_REL
#K_REL
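To make the condition and action parts of an augmented context-free rule more concrete, the sketch below applies the general rule for a subordinate clause (SBCL built from conj, NP, and VP) to feature structures represented as Python dictionaries. Several things here are assumptions: the dictionary representation, the reading of `**` as keeping only the tense values that appear in the bracketed list, and the value `COOR` used in the usage lines. The action `sbcl0 conj1` is not modelled because its full form is not given.

```python
ALLOWED_TENSES = {"past", "pres", "futu"}

def condition(conj1):
    # CONDITION conj1.subcat [SUBO]: the conjunction must be subordinating.
    return "SUBO" in conj1.get("subcat", ())

def action(vp1):
    # ACTION sbcl0.tense := vp1.tense ** [past pres futu]
    # Assumption: '**' keeps only the tense values that appear in the list.
    tense = set(vp1.get("tense", ())) & ALLOWED_TENSES
    return {"cat": "SBCL", "tense": tense}

def apply_sbcl_rule(conj1, np1, vp1):
    """Build the SBCL node, or return None when the rule does not apply."""
    categories = (conj1.get("cat"), np1.get("cat"), vp1.get("cat"))
    if categories != ("conj", "NP", "VP"):
        return None
    if not condition(conj1):
        return None
    sbcl0 = action(vp1)
    sbcl0["children"] = [conj1, np1, vp1]
    return sbcl0

# Usage with hypothetical feature structures.
np1 = {"cat": "NP"}
vp1 = {"cat": "VP", "tense": ["past"]}
accepted = apply_sbcl_rule({"cat": "conj", "subcat": ["SUBO"]}, np1, vp1)
rejected = apply_sbcl_rule({"cat": "conj", "subcat": ["COOR"]}, np1, vp1)
```

The context-free part of the rule decides which constituents may combine, the condition filters out combinations whose features do not fit, and the action fills in the features of the new node.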

3.2 Dictionary

The dictionary has four layers, following the procedure of translation. The rules described in the dictionary are all lexicalized rules. Basically, the dictionary includes the source lexicon and the target lexicon, as well as the lexicalized rules (syntactic analysis rules and transfer rules) that connect them to each other. Figure 2 shows the structure of the dictionary.

[Figure 2: 4-Layer of dictionary. The source text passes through morphological analysis, syntactic analysis, transfer, and generation to produce the target text. These stages consult the lexical entry, the syntactic entry, the transfer entry, and the generation entry of the dictionary, respectively; the syntactic and transfer entries are lexicalized rules.]

layer     processor                dictionary entry    form
1-layer   morphological analysis   lexical entry       source lexicon
2-layer   syntactic analysis       syntactic entry     syntactic rules
3-layer   transfer                 transfer entry      transfer rules
4-layer   generation               generation entry    target lexicon
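The four layers can be pictured as four tables connected by links, where each link may fan out to several entries in the next layer. The sketch below follows those links for one source word. All ids (`s1`, `t1`, `g_demand`, and so on) are hypothetical, the link field names are lower-case stand-ins for `#S_LINK`, `#T_LINK`, and `#G_LINK`, and English glosses stand in for the Korean target words.

```python
# Layer 1: lexical entries (source lexicon).
LEXICAL = {"claim": {"pos": "verb", "s_link": ["s1", "s2"]}}

# Layer 2: syntactic entries (lexicalized syntactic analysis rules).
SYNTACTIC = {
    "s1": {"rule": "VP -> claim NP", "t_link": ["t1", "t2"]},
    "s2": {"rule": "VP -> claim THCL", "t_link": ["t3"]},
}

# Layer 3: transfer entries (lexicalized transfer rules).
TRANSFER = {
    "t1": {"g_link": "g_demand"},
    "t2": {"g_link": "g_carry_off"},
    "t3": {"g_link": "g_declare"},
}

# Layer 4: generation entries (target lexicon); glosses replace Korean words.
GENERATION = {
    "g_demand": {"gloss": "demand"},
    "g_carry_off": {"gloss": "carry off"},
    "g_declare": {"gloss": "declare"},
}

def translation_paths(word):
    """List every (syntactic rule, target gloss) pair reachable from a word."""
    paths = []
    for s_id in LEXICAL[word]["s_link"]:
        s_entry = SYNTACTIC[s_id]
        for t_id in s_entry["t_link"]:
            g_entry = GENERATION[TRANSFER[t_id]["g_link"]]
            paths.append((s_entry["rule"], g_entry["gloss"]))
    return paths

paths = translation_paths("claim")
```

Each processing module only needs to read its own layer and follow the links to the next one, which is the sense in which the dictionary structure mirrors the translation procedure.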

The 1-layer and the 4-layer are the source lexicon and the target lexicon, respectively. To parse a sentence, the syntactic analyzer uses not only the 2-layer information of the dictionary but also the lexical-independent syntactic rules (syntactic chunk rules and syntactic general rules). A verb can have several structure patterns. For example, the verb `claim' can take a simple noun phrase as its direct object, or a that-clause as its argument. The meaning of `claim' also differs according to the form of its argument: `claim something' is translated into the Korean verb meaning `demand something', while `claim that-clause' is translated into the Korean verb meaning `declare that-clause'. Therefore, more than one syntactic rule is possible for a source lexical item. That is to say, one lexical entry can be associated with several syntactic entries.

The transfer module uses the 3-layer information of the dictionary and the lexical-independent rules (transfer chunk rules and transfer general rules). A word can be translated differently according to its context. For example, the verb `claim' is translated into the Korean verb meaning `ask for, demand' when it co-occurs with title, property, money, etc., but into the Korean verb meaning `carry off' when it co-occurs with catastrophe or disease. Therefore, more than one transfer rule is possible for a syntactic rule, according to the collocational information.

Figure 3 gives some examples of dictionary entries. In this figure, `#L_ENTRY', `#S_ENTRY', `#T_ENTRY', and `#G_ENTRY' mean the lexical entry, the syntactic entry, the transfer entry, and the generation entry, respectively.

#L_ENTRY
#LEX #ROOT #POS #INF_CODE #INF_3SP #INF_PAST #INF_PP #INF_ING #SUBCAT #WEIGHT #S_LINK

#S_ENTRY
#RULE claim:claim1 NP:np1> #ACTION #LEFT #RIGHT #CASEFRAME #K_PASS

#T_ENTRY
#E_COLL #EK_MAP #EK_MAP #HEAD #K_REL #V_NODE #E_COLL #EK_MAP #K_REL #G_LINK

#S_ENTRY
#RULE claim:claim1 INFP:infp1> #ACTION #LEFT #RIGHT #CASEFRAME #K_PASS

#T_ENTRY
#E_COLL #EK_MAP #EK_MAP #HEAD #K_REL #V_NODE #E_COLL #EK_MAP #K_REL #G_LINK

#G_ENTRY
#K_WORD #LEFT #RIGHT #CASEFRAME #K_PASS

Figure 3: Example of dictionary.
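Target word selection by collocation can be sketched as a lookup over the transfer entries of a word: the first entry whose collocate appears in the expected syntactic relation wins. This is a minimal sketch with assumptions labelled: the collocate words come from the `claim' example in the text, but the choice of the `object' and `subject' relations, the exact-match test, the English glosses used in place of Korean target words, and the names `TRANSFER_ENTRIES` and `select_target` are all hypothetical.

```python
# Hypothetical transfer entries for the verb 'claim'.
# Each entry names a syntactic relation, the words expected there,
# and an English gloss standing in for the Korean target word.
TRANSFER_ENTRIES = [
    {"relation": "object",
     "collocates": {"title", "property", "money"},
     "target": "demand"},
    {"relation": "subject",
     "collocates": {"catastrophe", "disease"},
     "target": "carry off"},
]

def select_target(dependents, entries=TRANSFER_ENTRIES, default=None):
    """Choose a target gloss from the words that co-occur with the verb.

    `dependents` maps a relation name ('subject', 'object') to the head
    word found in that relation. Returns `default` when nothing matches.
    """
    for entry in entries:
        word = dependents.get(entry["relation"])
        if word in entry["collocates"]:
            return entry["target"]
    return default

demand = select_target({"subject": "he", "object": "money"})
carry_off = select_target({"subject": "disease", "object": "lives"})
unknown = select_target({"subject": "he", "object": "victory"})
```

Exact matching fails for any word outside the listed collocates, which is why a softer comparison between collocations is a natural extension.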

[Dictionary description of the verb `claim' with a noun-phrase object; the Korean glosses and example translations are not recoverable.] Examples: Did you claim your insurance after your car accident? | The flood claimed hundreds of lives. The corresponding syntactic analysis rule is `VP -> claim NP'.

The following is another description of the verb `claim', whose meaning is `declare', not `demand'. We can build the syntactic analysis rules `VP -> claim INFP' and `VP -> claim THCL' from this entry.

2 [T+to-v / (that)] to declare to be true; state esp. in the face of opposition; MAINTAIN(4): He claimed to be rich / claims that he is rich but I don't believe him. [Korean glosses and translation not recoverable.]

5. Future Work and Conclusion

In this report, we have presented the pilot release of the EKMT system. The EKMT system is a rule-based, transfer-based machine translation system. In the bilingual dictionary, the source lexicon and the target lexicon are connected by the lexicalized rules. The syntactic analysis rules and transfer rules among the chunk and general rules are connected to each other in the same manner as in the dictionary.

This work is an ongoing project, and much work remains to be done on the EKMT system. The following future work is planned for the EKMT system:

Semi-automatic construction of broad-coverage translation lexicon.

Calculation of distance between lexical collocations using a thesaurus.

Robust processing

