Semantic Categorization of Contextual Features Based on Wordnet for G-to-P Conversion of Arabic Numerals Combined with Homographic Classifiers

Youngim Jung1, Aesun Yoon2, and Hyuk-Chul Kwon1

1 Pusan National University, Department of Computer Science and Engineering, Jangjeon-dong Geumjeong-gu, 609-735 Busan, S. Korea
{acorn, hckwon}@pusan.ac.kr
2 Pusan National University, Department of French, Jangjeon-dong Geumjeong-gu, 609-735 Busan, S. Korea
[email protected]
Abstract. Arabic numerals occur frequently and carry significant meaning, especially in scientific or informative texts. Converting Arabic numerals to phonemes in Korean is difficult when the accompanying classifier is ambiguous. In this paper, the ambiguities of Arabic numerals combined with homographic classifiers are analyzed, and methods for their sense disambiguation based on KorLex (Korean Lexico-Semantic Network) are proposed. Words preceding or following the Arabic numerals are categorized into 54 semantic classes based on the lexical hierarchy in KorLex 1.0. A decision tree is trained on these semantic classes to classify the meaning and the reading of Arabic numerals. The proposed model shows 87.3% accuracy, which is 14.1% higher than the baseline.
1 Introduction

TTS technologies have recently improved dramatically in naturalness and have been applied to many unlimited-domain systems. However, improvement in the accuracy of TTS products has been relatively static. According to accuracy tests of 19 TTS products by Voice Information Associates, number processing is the weakest of the ambiguity-generating areas, with an average accuracy of 55.6% [7]. In modern Korean, numerals have three different origins (Korean, Chinese and English) and show a variety of variants. Their distribution largely depends on context. For example, the single numeral '3' can be read in five different ways depending on its following classifier or its preceding morpheme (Ex 1-a~e); an asterisk marks an unacceptable reading.

(Ex 1)
a. 3geuru¹ [se/*seog/*seo/*sam/*seuli]² “three stumps”
b. 3nyeon [*se/*seog/*seo/sam/*seuli] “three years”
c. big 3 [*se/*seog/*seo/*sam/seuli] “Big 3”
1 Geuru is “a unit of trees” and nyeon means “year”. Doe and mal are Korean units of volume for measuring liquid or grain; one doe is about 1.8ℓ, and one mal is about 18ℓ.
2 In this paper, letters in italics stand for the G-to-P conversion of Korean. Phrases in quotation marks or brackets are the interpretations of the example phrases.
G.G. Lee et al. (Eds.): AIRS 2005, LNCS 3689, pp. 595 – 600, 2005. © Springer-Verlag Berlin Heidelberg 2005
d. 3doe [*se/seog/*seo/*sam/*seuli] “5.4ℓ”
e. 3mal [*se/*seog/seo/*sam/*seuli] “54ℓ”
f. 3gu [se/*seog/*seo/sam/*seuli] “three bodies/three boroughs or the third ball”

In (Ex 1), the classifiers following the Arabic numerals play an important role in determining their reading. However, when a homographic classifier follows an Arabic numeral, multiple readings are acceptable, as shown in (Ex 1-f). Thus, contextual features or patterns are required to resolve the ambiguity in the reading of Arabic numerals, and they must be learned in order to cover new data.

The remainder of this paper is organized as follows. Section 2 reviews related work on word sense disambiguation (WSD). Section 3 proposes an approach to WSD that learns the semantic categories of contextual features extracted from corpora, illustrates the categorization of semantic classes based on the lexical relations in KorLex 1.0, and reports the experiments. Conclusions and future work follow.
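The dependence of the reading on the classifier in (Ex 1) can be pictured as a simple lookup. The sketch below is only an illustration built from the six examples above; the names `READINGS_OF_3`, `candidate_readings` and `is_ambiguous` are hypothetical and are not part of the system described in this paper.

```python
# Acceptable readings of the numeral '3', keyed by the following classifier.
# The entries are copied from (Ex 1); the dictionary itself is an illustrative assumption.
READINGS_OF_3 = {
    "geuru": {"se"},
    "nyeon": {"sam"},
    "doe": {"seog"},
    "mal": {"seo"},
    "gu": {"se", "sam"},  # homographic classifier: two readings remain acceptable
}


def candidate_readings(classifier):
    """Return the set of acceptable readings of '3' before the given classifier."""
    return READINGS_OF_3.get(classifier, set())


def is_ambiguous(classifier):
    """A classifier is ambiguous when it leaves more than one acceptable reading."""
    return len(candidate_readings(classifier)) > 1


ambiguous = [c for c in READINGS_OF_3 if is_ambiguous(c)]
```

In this toy table only 'gu' is ambiguous, which is the case that a lookup cannot settle and that requires contextual features.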
2 Related Studies on Word Sense Disambiguation

Because contextual features are strongly interdependent, a decision tree algorithm has been adopted; it is an efficient classifier for handling complex conditional dependencies and non-dependencies [2, 5]. Its efficiency deteriorates, however, when the classifier handles very large parameter spaces, such as highly lexicalized feature sets. For this reason, grouping similar individual words into semantic categories has been studied, based on the established semantic categories in Roget's thesaurus [8]. That method achieves high accuracy in disambiguating word senses when the thesaurus categories and senses align well with topics. Twenty-four semantic categories in WordNet 1.5³ have also been applied to WSD; in terms of sense granularity, the semantic categories in WordNet are finer than those of Roget's thesaurus [1].
3 Word Sense Disambiguation of Homographic Classifiers

Ambiguities in reading Arabic numerals can be resolved using context, as shown in (Ex 1-a~e). Homographic classifiers, however, cause ambiguities and require additional contextual features to determine the reading of an Arabic numeral combined with the classifier. To analyze the ambiguities caused by homographic classifiers and to resolve them by learning contextual features, the training data were randomly sampled from news articles published over two years (January 1st, 2000 to December 31st, 2001) by 10 major newspapers in Korea. The size of the corpus is 15,196 eo-jeols⁴. All instances of Arabic numerals combined with homographic classifiers were collected and then labeled with the correct RFA (Reading Formulae of Arabic numerals) tags⁵.
3 WordNet groups noun senses in 24 lexicographer's files [4].
4 An eo-jeol is a morpheme cluster of continuous alphanumeric characters and symbols with a space on either side in Korean. In general, symbols are placed between two parallel items without spacing. In most cases an eo-jeol is composed of several morphemes of different parts of speech [9].
5 The process of labeling RFA tags is semi-automated, using the rule-based transliteration system for Arabic Numeral Expressions (ANEs) developed by [9]. The authors then corrected the output by hand for accurate RFA tagging.
3.1 Ambiguities Caused by Homographic Classifiers

Since many Chinese homographic classifiers are combined with Arabic numerals, the senses of the classifiers must be analyzed beforehand in order to select the correct Reading Formulae of Arabic numerals (RFA). [Table 1] shows, as an example, each sense of a homographic classifier and its RFA.

Table 1. Senses of a homographic classifier ‘gu’

| Classifier | Sense | Example | Pron | RFA⁶ |
| --- | --- | --- | --- | --- |
| gu | 1: unit of a dead body | ideul-eun sache 3gu-e year uibog, cheolmo deung yupum-eul balgulhaessda. “They exhumed three bodies and then the relics such as clothes and helmets.” | [se] | Kca_b |
| gu | 2: borough | gangnam-gu(0.08%), songpa-gu(0.05%), seocho-gu (0.04%) deung ‘gangnam 3gu’neun pyeonggyun-eul mitdol-assda. “‘Gangnam 3gu’ such as Gangnam-gu (0.08%), Songpa-gu (0.05%) and Seocho-gu (0.04%) were below average.” | [sam] | Cor_b(+D) |
| gu | 3: pitch | imyeonghoneun seonballo naseon choesangdeog-ui 3gujjae jigguleul tongtahaessda. “Lee, Myoungho hit the third fastball thrown by a starting pitcher, Choi, Sangdeok.” | [sam] | Cor_b(+D) |

Other homographic classifiers such as ‘gi1 (unit of rockets, tombs), gi2 (unit of a stage, a session)’, ‘dae1 (unit of automobiles, machines or bicycles), dae2 (the biggest item), dae3 (the time of life or persons in the time of life)’, ‘dan1 (unit of bundled vegetables), dan2 (level)’, ‘dong1 (unit of container for liquid), dong2 (unit of village)’, ‘byeong1 (a bottle), byeong2 (level of a soldier)’, ‘chuk1 (unit of ships), chuk2 (Korean measurement of height)’, ‘bak1 (musical time), bak2 (unit of stay)’, ‘bun1 (honorific form of persons), bun2 (a minute)’, ‘su1 (a move in baduk game), su2 (a piece of poems), su3 (sou)’, ‘guan1 (Korean measurement of weight), guan2 (unit of halls)’, ‘jib1 (a series), jib2 (a house)’ have also been analyzed.

3.2 Homographic Sense Disambiguation Based on Corpus and KorLex

Since the RFA is determined by the sense of the homographic classifier, and the ambiguity of the homograph is in turn resolved by its semantic correlation with neighboring words, the words around the Arabic numeral and the homographic classifier can be used as distinctive features to predict the correct RFA. The steps for extracting contextual features are outlined below, using the homographic classifier ‘gu’ in [Table 1] as an example.
6 The sub-categorization of RFA suggested by [9] is adopted in this paper. The abbreviations representing the sub-categories of RFA are as follows: K=Korean, C=Chinese, E=English, ca=cardinal, or=ordinal, b=base form, v=variants, n=noun, D=DSM (Decimal Scale Marker). The abbreviations can be combined. For example, ‘Kca_b’ consists of ‘K’, ‘ca’, and ‘b’, which together mean ‘Korean cardinal adjective numeric in base form’. ‘Cor_b(+D)’ means ‘Chinese ordinal adjective numeric in base form with DSM’.
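To make the composition of RFA tags concrete, the following sketch decodes the two tag shapes that appear in this paper, ‘Kca_b’ and ‘Cor_b(+D)’. The tag grammar encoded in `TAG_RE` is an assumption inferred from those two examples only; the full sub-categorization in [9] may allow other combinations (for instance with ‘n’), which this sketch does not handle. The function name `decode_rfa` is hypothetical.

```python
import re

ORIGIN = {"K": "Korean", "C": "Chinese", "E": "English"}
NUMERAL_TYPE = {"ca": "cardinal", "or": "ordinal"}
FORM = {"b": "base form", "v": "variant"}

# Assumed tag shape: <origin><type>_<form>, optionally followed by (+D) or [+D].
TAG_RE = re.compile(r"^([KCE])(ca|or)_([bv])(\(\+D\)|\[\+D\])?$")


def decode_rfa(tag):
    """Split an RFA tag such as 'Kca_b' or 'Cor_b(+D)' into its labeled parts."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unsupported RFA tag: {tag!r}")
    origin, numeral_type, form, dsm = match.groups()
    return {
        "origin": ORIGIN[origin],
        "type": NUMERAL_TYPE[numeral_type],
        "form": FORM[form],
        "dsm": dsm is not None,  # True when a Decimal Scale Marker is attached
    }


korean_tag = decode_rfa("Kca_b")
chinese_tag = decode_rfa("Cor_b(+D)")
```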
Step 1: Morphological Analysis
In Korean, content words and function morphemes such as case markers, postpositions, or endings occur together in one eo-jeol. Content words are separated from function morphemes and lemmatized through morphological analysis.

Step 2: Contextual Feature Extraction and Semantic Categorization
Among the lemmatized content words, nouns are extracted as shown in [Table 2].

Table 2. Left and right contextual features of homographic classifier ‘gu’ (excerpted)

| Classifier | Sense | W[-3] | W[-2] | W[-1] | W[+1] | W[+2] | W[+3] |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gu | 1 | | | sache | uibog | cheolmo | balgul |
| gu | 2 | Songpagu | Seochogu | Gangnam | pyeonggyun | | |
| gu | 3 | seonbal | naseo- | choesangdeog | jiggu | tongtaha- | |
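The window of contextual features in [Table 2] can be sketched as follows. This is a minimal illustration that assumes the sentence has already been reduced to a list of lemmatized content words with the numeral-classifier combination as one token; the function name `context_window` and the token `"3gu"` are hypothetical.

```python
def context_window(words, target_index, size=3):
    """Collect the words at offsets -size..+size around the target token.

    Positions that fall outside the sentence are recorded as None,
    which corresponds to an empty cell in the feature table.
    """
    features = {}
    for offset in range(-size, size + 1):
        if offset == 0:
            continue  # skip the numeral-classifier token itself
        position = target_index + offset
        inside = 0 <= position < len(words)
        features[f"W[{offset:+d}]"] = words[position] if inside else None
    return features


# Lemmatized content words for sense 3 of 'gu', following the third row of Table 2.
words = ["seonbal", "naseo-", "choesangdeog", "3gu", "jiggu", "tongtaha-"]
features = context_window(words, words.index("3gu"))
```

Here `features["W[-1]"]` is `"choesangdeog"` and `features["W[+3]"]` is `None`, matching the empty W[+3] cell for sense 3.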
Words preceding (-3, -2, -1) or following (+1, +2, +3) the combination of an Arabic numeral and a homographic classifier are clustered into semantic categories based on the lexical hierarchy in KorLex 1.0 [3]. The process of semantically categorizing contextual features is as follows:

Step 2-1: Map the lemmatized words used as contextual features, extracted from the tagged corpus, to the KorLex hierarchy. For example, the words {cheinji-eob (change-up), bol (ball), jiggu (fastball), samjin (strikeout)} used as contextual features for disambiguating ‘gu3 (“pitch”)’ are mapped to the hierarchy in KorLex.

Step 2-2: List all common hypernyms of the synset nodes mapped from the contextual features. The common hypernyms are {haengwi (act)}, as shown in [Figure 1].
[Figure 1: part of the KorLex 1.0 hierarchy with the common hypernyms of the contextual features. Node labels: haengwi (“act”, marked as the Least Upper Bound), haengdong (“action”), gonghun (“deed”), dalseong (“achievement”), midalseong (“nonachievement”), silpae (“failure”), a-us (“out”), cheogsal (“putout”), samjin (“strikeout”), baldong (“propulsion”), tusa (“throw”), tugu (“pitch”), che-inji-eob (“change-up”), bol (“ball”), jiggu (“fastball”), binbol (“duster”).]

Fig. 1. Automatic selection of Least Upper Bound
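The selection of a Least Upper Bound illustrated in Fig. 1 amounts to finding the lowest node shared by the hypernym paths of all contextual features. The sketch below shows this on a toy hierarchy. The child-to-parent links in `HYPERNYM` are an assumption for illustration, loosely following the node labels of Fig. 1 rather than the actual KorLex 1.0 structure, and each node is given a single hypernym, which a real lexical network does not guarantee. The function names are hypothetical.

```python
# Hypothetical child -> hypernym links (single parent per node, for illustration only).
HYPERNYM = {
    "cheinji-eob": "tugu",
    "jiggu": "tugu",
    "bol": "tugu",
    "tugu": "tusa",
    "tusa": "baldong",
    "baldong": "haengwi",
    "samjin": "a-us",
    "a-us": "silpae",
    "silpae": "midalseong",
    "midalseong": "haengwi",
}


def hypernym_path(word):
    """Return the chain from a word up to the root of the toy hierarchy."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def least_upper_bound(words):
    """Return the lowest node that lies on the hypernym path of every word."""
    paths = [hypernym_path(word) for word in words]
    common = set(paths[0]).intersection(*(set(path) for path in paths[1:]))
    # Walk the first path bottom-up: the first shared node is the lowest one.
    for node in paths[0]:
        if node in common:
            return node
    return None


category = least_upper_bound(["cheinji-eob", "bol", "jiggu", "samjin"])
```

With these assumed links, `category` is `"haengwi"`, while a narrower set such as `["cheinji-eob", "jiggu"]` would generalize only to `"tugu"`.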
Step 2-3: Find the Least Upper Bound (LUB) of the synset nodes mapped from the contextual features. Here, {haengwi} is selected as the LUB.

Step 2-4: Select the LUB as the semantic category for the contextual features. The selected {haengwi} becomes the generalized semantic category of {cheinji-eob, bol, jiggu, samjin}.

Following the same procedure, {sache (dead_body)} has been assigned as the semantic category of {sache (dead_body), siche (corpse)}, which are contextual features for disambiguating ‘gu1 (unit of dead body)’, and {haengjeong_guyeog (administrative_district)} as that of {gangnam-gu (Gangnam-gu), songpa-gu (Songpa-gu), seocho-gu (Seocho-gu), seongeogu (election_district)}, which are contextual features for ‘gu2 (borough)’. By applying the procedure to the training corpus, 54 semantic categories have been obtained.

Step 3: Extraction of Pattern and Arithmetic Features
The other learning features are the combined patterns of Arabic numerals and text symbols, arithmetic features, and the individual homographic classifiers, as in [Table 3].

Table 3. Learning features and values

| Features | Subcategories | Value |
| --- | --- | --- |
| Pattern features | No. of groups in an ANE | NA1: one group of numerals in an ANE; NA2: more than 2 groups |
| Pattern features | No. of text symbols in an ANE | NT0: no text symbols in an ANE; NT1: one text symbol; NT2: more than two text symbols |
| Pattern features | Types of text symbols | T1: ‘-’, T2: ‘~’, T3: ‘.’, T4: ‘,’, T5: ‘:’, T6: ‘/’ |
Arithmetic Size of Arabic nu- S1: 1900