
Proceedings of the 10th WSEAS International Conference on COMPUTERS, Vouliagmeni, Athens, Greece, July 13-15, 2006 (pp559-564)

Rule-based Translation of Quantifiers for Chinese-Japanese Machine Translation

DAPENG YIN 1, MIN SHAO 1, PEILIN JIANG 1, FUJI REN 1,2, SHINGO KUROIWA 1
1 Department of Information Science and Intelligent Systems, University of Tokushima, 2-1 Minamijosanjima-cho, Tokushima 770-8506, JAPAN
2 School of Information Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, CHINA
http://a1-www.is.tokushima-u.ac.jp/

Abstract: - Quantifiers and numerals often cause mistakes in Chinese-Japanese machine translation. This paper proposes an approach to quantifier translation based on the syntactic features of quantifiers after classification. First, morphological analysis is performed on sentences containing quantifiers and numerals, extracted from a Chinese-Japanese aligned corpus. Next, statistical information is gathered on the meaning of the nouns that accompany each quantifier. Quantifier translation rules were then acquired from the differences in quantifier type and position between Chinese and Japanese, and an evaluation was conducted using the acquired rules. Finally, the applicability of the rules to the experimental data was verified: the method achieved an accuracy of 90.75%, showing that it is effective for translating quantifiers and numerals.

Key-Words: - Machine Translation, Quantifier, Numerals, Quantifier type, Quantifier position

1 Introduction
Many practical machine translation systems have been developed rapidly in recent years [1] [2] [3]. Japanese-Chinese and English-Chinese machine translation systems have also been studied actively for several years. However, several unsolved problems in Chinese-Japanese machine translation still need to be explored. This paper presents a method for processing quantifiers. Because a Chinese quantifier depends on the word it modifies, Chinese has many different quantifiers, whereas Japanese has few. The following correspondences appear in Chinese-Japanese machine translation:
(1) Chinese has a quantifier and Japanese has a quantifier.
(2) Chinese has a quantifier but Japanese has none.
(3) Chinese has no quantifier but Japanese has one.
Because of these complicated correspondences, processing quantifiers correctly is a necessity for Chinese-Japanese machine translation systems.

We analyzed how two commonly used Chinese-Japanese machine translation software packages, [A] and [B], handle quantifiers. Their translations of three sentences that all mean “There is a ship here.” are as follows.
(1) 这里有一条船。(This has one ship.)
[A] これは一本の船がある。
[B] これは一隻の船がある。
(2) 这里有一只船。(This has one ship.)
[A] これは一匹船がある。
[B] これは一隻の船がある。
(3) 这里有一艘船。(This has one ship.)
[A] これは一艘の船がある。
[B] これは一隻の船がある。
Software [A] has not considered the semantic characteristics of the modified noun and simply used the most frequent translation of each quantifier. Software [B] has considered the semantic characteristics of the modified noun, but has not considered the concrete differences between the quantifiers. To express the meaning of such sentences correctly, this paper puts forward a method for translating Chinese quantifiers into Japanese quantifiers.

We collected a large number of aligned Chinese-Japanese sentences and many articles on quantifier grammar from various studies. In total, approximately 5000 aligned sentence pairs that include quantifiers were collected. We used an existing parsing tool to parse the Chinese and Japanese sentences, compared the parser results, and counted the position information in the sentences [4]. From this we obtained translation rules for corresponding Chinese and Japanese quantifiers. We applied these rules in an experimental Chinese-Japanese machine translation system and obtained improved results.

Below, section 2 introduces the translation model and the method of classifying quantifiers by their appearance frequency; section 3 introduces the kinds of quantifier expression, “One + quantifier”, “The number except 1 + quantifier” and “Indefinite quantifier”; section 4 explains the experiment and evaluation; finally, section 5 states conclusions and discusses future work.

2 Quantifier translation model
2.1 Chinese quantifier translation model
The source code of a Chinese morphological analysis system was built into our experimental system, so that Chinese sentences can be morphologically analyzed automatically. First, a Chinese sentence is analyzed into morphemes, and the translation pattern it belongs to (“One + quantifier”, “The number except 1 + quantifier” or “Indefinite quantifier”) is identified. Next, the position of the quantifier (“Appear at the beginning of the sentence”, “Appear in the middle of the sentence” or “Appear at the end of the sentence”) and the quantifier class are confirmed [5] [6] [7]. The classes are:
(1) Aggregate quantifiers, which themselves carry a quantity and are used for pairs or groups of people or things. For example, “班”, “帮”, “笔”, “对”, “队”, “份”, “份儿”, “副”, “股”, “伙”, “排”, “批”, “群”, “双”, “套”, “窝”, “系列”, “组”, “打” etc.
(2) Measure quantifiers/time quantifiers (“公斤”, “米”, “吨”, “里”, “年”, “月”, “星期” etc.)
(3) Overlap quantifiers (“一天一次”, “一天一天”, “一天天一天天”)
The system then decides whether or not the Chinese quantifier is to be translated into a Japanese quantifier. The translation form is decided at the end. Figure 1 shows a model of the Chinese-Japanese machine translation system for quantifiers. Details of the technique are described as follows.
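The class confirmation step of Section 2.1 can be pictured as a lexicon lookup. The following is only a minimal sketch under that assumption: the three sets are the example lists quoted in Section 2.1 (which end in "etc." and are therefore incomplete), while the names `QuantifierClass` and `quantifier_class` and the `UNLISTED` fallback are hypothetical and not part of the described system.

```python
from enum import Enum


class QuantifierClass(Enum):
    AGGREGATE = "aggregate"
    MEASURE_TIME = "measure/time"
    OVERLAP = "overlap"
    UNLISTED = "unlisted"  # hypothetical fallback: token is in none of the quoted lists


# Example lists copied from Section 2.1; the paper marks them as incomplete ("etc.").
AGGREGATE = {
    "班", "帮", "笔", "对", "队", "份", "份儿", "副", "股", "伙",
    "排", "批", "群", "双", "套", "窝", "系列", "组", "打",
}
MEASURE_TIME = {"公斤", "米", "吨", "里", "年", "月", "星期"}
OVERLAP = {"一天一次", "一天一天", "一天天一天天"}


def quantifier_class(token: str) -> QuantifierClass:
    """Look a quantifier token up in the three example lists."""
    if token in OVERLAP:
        return QuantifierClass.OVERLAP
    if token in AGGREGATE:
        return QuantifierClass.AGGREGATE
    if token in MEASURE_TIME:
        return QuantifierClass.MEASURE_TIME
    return QuantifierClass.UNLISTED
```

A real system would presumably take the class from the morphological analyzer or a full dictionary rather than from these short example lists.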

2.2 Classification of Chinese quantifiers
While collecting the corpus, we found that quantifier expressions can be divided into the following patterns:
(1) “One + quantifier”
(2) “The number except 1 + quantifier”
(3) “Indefinite quantifier”
The proportion of each type is shown in Table 1.

Table 1, Quantifier Frequency
             One + Quantifier   All numbers other than one + Quantifier   Indefinite + Quantifier
Number       3702               491                                       807
Proportion   74.04%             9.82%                                     16.14%

There are various types of quantifiers in Chinese, chiefly “Noun quantifiers” and “Verb quantifiers”. A noun quantifier expresses the measurement of things; the class comprises individual quantifiers, aggregate quantifiers, time quantifiers and measure quantifiers. A verb quantifier measures the frequency of a motion or change. Moreover, even the same quantifier has different features depending on where it appears in the sentence. For example, a quantifier is always translated when it emphasizes a subject that appears at the front of the sentence. We therefore further divide the quantifiers into three categories by their position in the sentence:
(1) Appear at the beginning of the sentence
(2) Appear in the middle of the sentence
(3) Appear at the end of the sentence
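The two classifications above (pattern and position) can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: the indefinite markers are the examples the paper gives in Section 3.3, the function and enum names are hypothetical, and treating the first and last token of a sentence as "beginning" and "end" is an assumption, since the paper does not define the boundaries.

```python
from enum import Enum
from typing import Optional


class Pattern(Enum):
    ONE = "One + quantifier"
    OTHER_NUMBER = "The number except 1 + quantifier"
    INDEFINITE = "Indefinite quantifier"


class Position(Enum):
    BEGINNING = "beginning"
    MIDDLE = "middle"
    END = "end"


# Examples quoted in Section 3.3 of the paper (incomplete lists).
INDEFINITE_QUANTIFIERS = {"些", "点儿"}
INDEFINITE_NUMERALS = {"几"}
APPROX_MARKERS = {"来", "多", "把", "左右", "上下", "成", "上", "近", "约"}
ONE = {"一", "1"}


def classify_pattern(numeral: str, quantifier: str,
                     marker: Optional[str] = None) -> Pattern:
    """Assign one of the three patterns to a numeral-quantifier expression."""
    if (quantifier in INDEFINITE_QUANTIFIERS
            or numeral in INDEFINITE_NUMERALS
            or marker in APPROX_MARKERS):
        return Pattern.INDEFINITE
    if numeral in ONE:
        return Pattern.ONE
    return Pattern.OTHER_NUMBER


def classify_position(index: int, sentence_length: int) -> Position:
    """Assumed boundary rule: first token = beginning, last token = end."""
    if not 0 <= index < sentence_length:
        raise ValueError("index outside the sentence")
    if index == 0:
        return Position.BEGINNING
    if index == sentence_length - 1:
        return Position.END
    return Position.MIDDLE
```

The indefinite check runs first so that “一些” is not mistaken for “One + quantifier”.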

3 Translation rules of quantifier expressions and confirmation of the translation form
3.1 Translation rules for “One + quantifier”
“One + quantifier” is divided into two kinds based on Chinese grammatical characteristics: “Noun quantifiers” and “Verb quantifiers”. Noun quantifiers can in turn be divided into individual quantifiers, aggregate quantifiers, measure quantifiers/time quantifiers and overlap quantifiers. A quantifier in a sentence is handled in one of three ways: translate, do not translate, or occasionally translate. Because the type “One + quantifier” is very large, it will be the only one processed in this paper.

[Figure 1: flowchart. Box labels: Morphological Analysis Result; Chinese Quantifier Pattern DB; Quantifier Pattern Identification; Quantifier Confirmation; Quantifier Translation Rule and Expression DB; Translate or not? (Yes / No); Translation Form Confirmation; Quantifier Translation Form DB; Translation Result]

Fig.1, Model of the Chinese-Japanese machine translation system for quantifiers

Individual quantifiers are used for individual things; usually all that is needed is an individual quantifier in front of the individual thing. Examples are “本”, “部”, “册”, “颗”, “间”, “节” etc. When “One + quantifier” appears at the beginning of the sentence, it must be translated. When “One + quantifier” appears in the middle of the sentence, there are three cases (translate, do not translate, and occasionally translate). When “One + quantifier” appears at the end of the sentence, the quantifier is either translated or not translated. The rules for individual quantifiers also apply to aggregate quantifiers.

Measure quantifiers/time quantifiers are units of time and measurement (“公斤”, “米”, “吨”, “里”, “年”, “月”, “星期” etc.). All measure quantifiers/time quantifiers should be translated.

No example was found in which a “Verb quantifier” comes at the beginning of the sentence. When a “Verb quantifier” appears in the middle or at the end of the sentence, it is not translated.

3.2 Translation rules for “The number except 1 + quantifier”
“The number except 1 + quantifier” means any number other than one followed by a quantifier. Regardless of its position, it is always translated.

3.3 Translation rules for the “Indefinite quantifier”
The “Indefinite quantifier” is also an important component of quantifiers, covering about 9% of all quantifiers (e.g. “几个”, “几条” etc.). The structure of the “Indefinite quantifier” is comparatively complex. The “Indefinite quantifier” is divided into two kinds as shown below.

3.3.1 “Approximate number + quantifier”
An “Approximate number” marker appears before or after a number and indicates an uncertain quantity. The markers that appear after numbers are mainly “来”, “多”, “把”, “左右”, “上下” etc. The markers that appear before numbers are mainly “成”, “上”, “近”, “约” etc.

3.3.2 “Indefinite quantifier”
An “Indefinite quantifier” expresses an indefinite measurement. Examples are “些” and “点儿”. Only the numeral “1” can be used before “些” and “点儿”, as in “一些” and “一点儿”.

Taking the features of Chinese and Japanese into account, we analyzed the collected bilingual corpora and compiled the translation rules for Chinese-Japanese quantifiers. Table 2 shows the distribution of quantifier translation rules.

Table 2, Number of rules
                                      Beginning of sentence   Middle of sentence   End of sentence
One + Quantifier                      4                       20                   6
The number except one + Quantifier    1                       8                    1
Indefinite + Quantifier               2                       2                    2
Total                                 46
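The position-dependent decisions stated in Sections 3.1 and 3.2 can be summarized as a small lookup. The sketch below is an assumption-laden illustration: it encodes only the decisions the text states explicitly, returns `None` where the paper gives no rule (for example for the indefinite pattern, or a verb quantifier at the beginning of a sentence), and does not reproduce the 46 actual rules counted in Table 2. The string labels and the name `translate_decision` are hypothetical.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    TRANSLATE = "translate"
    OMIT = "do not translate"
    CONTEXT = "decided case by case"


POSITIONS = ("beginning", "middle", "end")


def translate_decision(pattern: str, qclass: str,
                       position: str) -> Optional[Decision]:
    """Return the stated decision, or None when the text states no rule.

    pattern:  "one", "other_number" or "indefinite"
    qclass:   "individual", "aggregate", "measure_time" or "verb"
    position: "beginning", "middle" or "end"
    """
    if position not in POSITIONS:
        raise ValueError(f"unknown position: {position}")
    if pattern == "other_number":
        return Decision.TRANSLATE  # Section 3.2: always translated
    if pattern != "one":
        return None  # no explicit translate-or-not rule is stated
    if qclass == "measure_time":
        return Decision.TRANSLATE  # units are always translated
    if qclass == "verb":
        # Middle and end are not translated; no example at the beginning.
        return Decision.OMIT if position in ("middle", "end") else None
    if qclass in ("individual", "aggregate"):
        # Beginning must be translated; middle and end vary by sentence.
        return Decision.TRANSLATE if position == "beginning" else Decision.CONTEXT
    return None
```

The `CONTEXT` outcome marks exactly the cells where the system needs its finer-grained rules; the lookup alone cannot resolve them.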

3.4 Selection of Japanese Translation Form
There are about 150 common Chinese quantifiers, whereas there are about 50 Japanese quantifiers. Comparing the two sets gives three cases: 1) several Chinese quantifiers match one Japanese quantifier; 2) one Chinese quantifier matches one Japanese quantifier; 3) one Chinese quantifier matches several Japanese quantifiers. To solve this translation problem, the quantifiers were analyzed using a large corpus. The analysis showed that the choice of quantifier is constrained by the modified noun. We therefore analyzed the features of nouns and created translation rules for 150 Chinese quantifiers for Chinese-Japanese machine translation. In practice, because there are so many Chinese nouns, it is impossible to register them all in the translation table. This problem is solved with "Semantic Similarity" [8], which is based on "HowNet" [9]: the higher the similarity between two nouns, the more often they take the same quantifier. Table 3 shows samples from the “Chinese-Japanese quantifier translation table”. Next we explain the semantic similarity computational method in detail through one concrete example.

Table 3, Chinese-Japanese quantifier translation table
No.   Chinese quantifier   Type of Chinese noun set   Japanese translation   …
…                          N1                         T1                     …
4     条                   毛巾                       Q+枚                   …
12    只                   鸟类                       Q+羽                   …
…     …                    …                          …                      …

Similarity is calculated from the distance between nodes in the thesaurus, as expression (1) shows.

$$D = \mathrm{Dis}(W_1, W_2) \qquad (1)$$

In expression (1), W1 and W2 represent two nodes. The similarity of two words is represented by the distance between the two nodes. The similarity of words is converted to the similarity of sememes. The similarity Sim(S1, S2) of two sememes is determined by their distance d in the sememe thesaurus, as shown in expression (2), where a is the distance at which the similarity becomes 0.5.

$$\mathrm{Sim}(S_1, S_2) = \frac{a}{d + a} \qquad (2)$$
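Expression (2) is easy to check numerically. The sketch below is a direct transcription; the default value 1.6 is taken from the parameter list at the end of Section 3.4 on the assumption that the α listed there is the constant a of expression (2), and the function name is hypothetical.

```python
def sememe_similarity(d: float, a: float = 1.6) -> float:
    """Expression (2): similarity of two sememes at thesaurus distance d."""
    if d < 0 or a <= 0:
        raise ValueError("d must be non-negative and a must be positive")
    return a / (d + a)
```

With these definitions, identical sememes (d = 0) get similarity 1, the similarity is exactly 0.5 at d = a, and it decreases towards 0 as the distance grows.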

The definition of a word consists of four parts of sememes: the basic sememe (BS), other basic sememes (OS), related sememes (RS), and related sign sememes (RSG). A weighted sum over the similarities Simj(S1, S2) of the four parts gives the similarity of the words W1 and W2, as shown in (3).

$$\mathrm{Sim}(W_1, W_2) = \sum_{i=1}^{4} \beta_i \prod_{j=1}^{i} \mathrm{Sim}_j(S_1, S_2), \qquad \beta_1 + \beta_2 + \beta_3 + \beta_4 = 1,\ \ \beta_1 \ge \beta_2 \ge \beta_3 \ge \beta_4 \qquad (3)$$

The influence of each part on the overall similarity decreases from Sim1 to Sim4, as expressed in (3). The βi (1 ≤ i ≤ 4) are adjustable parameters, subject to β1 + β2 + β3 + β4 = 1 and β1 ≥ β2 ≥ β3 ≥ β4. Because the basic sememe plays the decisive role, β1 is set to 0.5 or more. With a plain weighted sum, the overall similarity would still become large when Sim1 is small and Sim3 or Sim4 is comparatively large. The product Π is introduced to make sure that such an unreasonable result does not occur.

The similarity of the other basic sememes (OS), Sim2(S1, S2), is a similarity between two sets, OS1 and OS2. The first element of OS1 is taken, its similarity with every element of OS2 is calculated, and the element with the highest value is paired with it one to one. The elements that have been paired are removed from their sets. When an element has no counterpart, it is paired with null and the similarity is set to a small constant δ. The process continues until all elements have been removed. The similarity between the sets is the arithmetic mean of the similarities of the one-to-one pairs, as shown in expression (4), where n and m are the numbers of elements of the two sets and x is the number of common elements.

$$\mathrm{Sim}_2(S_1, S_2) = \frac{\mathrm{Sim}(\rho_{n1}, \rho_{m1}) + \mathrm{Sim}(\rho_{n2}, \rho_{m2}) + \cdots}{n + m - x} \qquad (4)$$
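The greedy one-to-one pairing behind expression (4) can be sketched as below. Several points are assumptions rather than statements from the paper: x is read here as the number of elements that found a partner, so that n + m − x equals the number of pairs (including pairs with null); two empty sets are given similarity 1.0; and the element similarity `sim` is passed in as a hypothetical callable, where a real system would use the HowNet-based sememe similarity.

```python
from typing import Callable, List, Sequence


def set_similarity(os1: Sequence[str], os2: Sequence[str],
                   sim: Callable[[str, str], float],
                   delta: float = 0.2) -> float:
    """Greedy one-to-one pairing of two sememe sets, then the mean score."""
    if not os1 and not os2:
        return 1.0  # assumption: two empty sets count as identical
    remaining: List[str] = list(os2)
    scores: List[float] = []
    for s in os1:
        if remaining:
            # Pair s with its most similar remaining element of os2.
            best = max(remaining, key=lambda t: sim(s, t))
            scores.append(sim(s, best))
            remaining.remove(best)
        else:
            scores.append(delta)  # s is paired with null
    # Elements of os2 left without a partner are paired with null as well.
    scores.extend(delta for _ in remaining)
    return sum(scores) / len(scores)
```

Because each element of the first set takes the best partner still available, the result can depend on the order of the elements; the description in the text has the same property.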

The similarities of the related sememes (RS), related sign sememes (RSG), and special sign sememes (SSG) are calculated in the same way, following expression (3). The parameters were set to α = 1.6 (the constant written a in expression (2)), β1 = 0.5, β2 = 0.2, β3 = 0.17, β4 = 0.13 and δ = 0.2.
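To see why the product in expression (3) matters, the formula can be evaluated with the β values listed above. This is a minimal sketch: the four part similarities are supplied directly as numbers (computing them from HowNet is outside its scope), and the function name is hypothetical.

```python
from typing import Sequence

# Weights beta_1..beta_4 as listed in the text; they sum to 1.
BETAS = (0.5, 0.2, 0.17, 0.13)


def word_similarity(part_sims: Sequence[float],
                    betas: Sequence[float] = BETAS) -> float:
    """Expression (3): sum over i of beta_i * (Sim_1 * ... * Sim_i)."""
    if len(part_sims) != len(betas):
        raise ValueError("one similarity per weight is required")
    total = 0.0
    running_product = 1.0
    for beta, sim in zip(betas, part_sims):
        running_product *= sim  # product of Sim_1 .. Sim_i
        total += beta * running_product
    return total
```

For part similarities (0.1, 1, 1, 1) this gives 0.1, whereas a plain weighted sum would give 0.55: a poor match on the basic sememe caps the whole score, which is the behaviour the text asks of Π.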

4 Experiments and Evaluation
4.1 Evaluation experiment
In this chapter, experiments were conducted to evaluate the machine translation of the quantifiers based on these translation rules. The Chinese-Japanese bilingual corpus used for the experiment consists of four works: “坊ちゃん”, “鼻”, “斜陽” and “家”. In total, 5000 sentences that include quantifiers were extracted from them and used to test the translation system developed.

4.2 Experiment Result
In the evaluation experiment described above, only the number expression was used as the criterion, and the result in Table 4 was obtained.

Table 4, Experimental results
Sentence   Success        Fail
5000       4365 (87.3%)   635 (12.7%)

As a result, 4365 sentences were successfully translated and 635 sentences failed. Of the failed sentences, 190 failed at the Chinese morphological analysis stage. The results excluding these sentences are shown in Table 5.

Table 5, Experimental results
Sentence          Success         Fail
4810 (5000-190)   4365 (90.75%)   445 (635-190) (9.25%)

An open test evaluation was carried out using 3000 sentences from outside the corpus. Table 6 shows the result.

Table 6, Experimental results
Sentence   Success         Fail
3000       2560 (85.33%)   440 (14.67%)

4.3 Evaluation and consideration
Considering the causes of failure, the failed translations fall into the following cases. The first is the presence of two modified nouns: when there are two nouns that the quantifier could modify, it is necessary to judge which one it actually modifies. There were 130 such sentences. The second is ambiguity in the modified noun: the same Chinese word can have different meanings depending on the context, so even with the same noun and the same quantifier the Japanese translations differ. 54 sentences failed to be translated correctly for this reason. The last is the special modification form: the Chinese quantifier has no corresponding word in Japanese, so a different translation method has to be designed. This also greatly influences the experiment, because statistics on a corresponding Japanese translation form cannot be collected, and the algorithm cannot decide those translation forms given their distinctiveness and uncertainty.

5 Conclusion
In this paper, we proposed a technique for translating Chinese expressions of measure into Japanese, in which the quantifiers were classified into three kinds. Whether a quantifier is to be translated was examined according to its position in the sentence, and the translation rules and the Japanese translation forms were compiled. Experiments confirmed that the proposed technique is effective for processing quantifiers in Chinese-Japanese machine translation. According to the experiments, if the semantic features of the noun that a quantifier modifies can be identified, the accuracy of the quantifier processing can be improved.

Acknowledgment This research was partially supported by the Ministry of Education, Culture, Sports, Science and Technology of Japan under Grant-in-Aid for Scientific Research (B), 14380166, 17300065, Exploratory Research, 17656128, 2005 and the Outstanding Overseas Chinese Scholars Fund of the Chinese Academy of Sciences (No.2003-1-1).

References:
[1] Yujie Zhang, Kiyotaka Uchimoto, Qing Ma and Hitoshi Isahara, Building an Annotated Japanese-Chinese Parallel Corpus – A Part of NICT Multilingual Corpora, Second International Joint Conference on Natural Language Processing, Jeju Island, Republic of Korea, pp. 85-90, (2005-10)
[2] Eiji Aramaki, Sadao Kurohashi, Hideki Kashioka, Naoto Kato, Probabilistic Model for Example-based Machine Translation, Proc. of MT Summit X, pp. 219-226, (2005-9)
[3] Y. Zhang, Q. Ma, H. Isahara, Automatic Construction of Japanese-Chinese Translation Dictionary Using English as an Intermediary, Journal of Natural Language Processing, Vol. 12, No. 2, pp. 63-85, (2005)
[4] Huaping Zhang, HongKui Yu, Deyi Xiong, and Qun Liu, HHMM-based Chinese Lexical Analyzer ICTCLAS, Proceedings of 2nd SigHan Workshop, pp. 184-187, (2003-7)
[5] Zhu X.F, Yu S.W and Wang H, The Development of Contemporary Chinese Grammatical Knowledge Base and Its Applications, Communications of COLIPS, 5/1-2, (1995)
[6] Yuehua Liu, Wenyu Pan, Wei Gu, The Contemporary Chinese Grammatical Compendium, Kuroshio Pub, (1988)
[7] Asako IIDA, A descriptive study of Japanese major classifiers, Ph.D. Thesis, (1999)
[8] Qun LIU, Sujian LI, Word Similarity Computing Based on HowNet, Computational Linguistics and Chinese Language Processing, Vol. 7, No. 2, pp. 59-76, (2002-8)
[9] Zhengdong Dong, Qiang Dong, HowNet: http://www.keenage.com/