A Hybrid Approach to Lao Word Segmentation using Longest Syllable ...

A Hybrid Approach to Lao Word Segmentation using Longest Syllable Level Matching with Named Entities Recognition

Arounyadeth Srithirath, Pusadee Seresangtakul
Department of Computer Science, Khon Kaen University, Khon Kaen, Thailand, 40002
Email: [email protected], [email protected]

Abstract— Lao is written without word delimiters, which makes it difficult to process automatically. Automatic word segmentation is therefore an essential but challenging task in natural language processing for Lao. This paper proposes longest syllable level matching with named entities recognition for Lao word segmentation. Syllables are first extracted from the input text. Longest matching, one of the techniques in the Dictionary Based approach, is then applied together with named entities recognition to combine the syllables into words. The approach achieved a precision of 85.21% and a recall of 92.36%.

Keywords— Lao word segmentation, tokenization, syllable extraction, longest matching, dictionary based, named entities recognition.

I. INTRODUCTION

A Lao text is a string of symbols with no explicit word boundaries, as in other Asian languages such as Japanese, Chinese, and Thai. Spaces are rarely used between syllables in Lao. To perform Lao language processing, especially rule-based machine translation, text must first be segmented into individual terms or words.

Researchers have proposed several word segmentation techniques for many different languages. These techniques can be classified into two main approaches: Dictionary Based (DCB) and Machine Learning Based (MLB) [2]. The DCB approach is simple and straightforward: it looks up series of characters in a dictionary for matching terms. Its main problems are parsing words that are not in the dictionary and parsing ambiguity. The MLB approach aims to solve these problems by using a classification model learned from the character patterns in a tagged corpus.

In this paper, we present a word segmentation process for Lao that combines three main operations: syllable extraction; longest matching, which is one of the techniques in the DCB approach; and named entities recognition (NER) [4], which improves segmentation quality. Our dictionary contains approximately 52,000 words and provides a lexicon sufficient to cover daily use in the general domain.

978-1-4799-0545-4/13/$31.00 ©2013 IEEE

II. BACKGROUND AND RELATED WORKS

Sisouvanh Vanthanavong et al. (2011) proposed Lao word segmentation based on conditional random fields (CRF) [1] using a tagged corpus of approximately 100,000 words, obtaining precision and recall of 80.29% and 78.45%, respectively. As Lao and Thai are very similar in both their spoken and writing systems, work on Thai is also relevant. Choochart Haruechaiyasak et al. (2008) compared Thai word segmentation approaches [2] for both the DCB and MLB techniques. For DCB, they compared two algorithms: Longest Matching and Maximal Matching. For MLB, they compared four algorithms: Naive Bayes (NB), decision tree, support vector machine (SVM), and CRF. On the ORCHID corpus, which contains 113,404 words, the best performance was obtained by the CRF algorithm, with precision and recall of 95.79% and 94.98%, respectively.

Named entities recognition is an essential process widely used in natural language processing (NLP). Hutchatai Chanlekha et al. (2004) presented Thai named entity extraction by incorporating a maximum entropy model with simple heuristic information [3]. By combining the machine-learning and rule-based approaches, they obtained F-measures of 90.44%, 82.16%, and 89.87% for person, location, and organization names, respectively. Nattapong Tongtep et al. (2010) proposed pattern-based extraction of named entities in Thai news documents [4]. Their method focuses on a rule-based approach and combines techniques such as longest word matching, longest pattern matching, clue words, and word lists from dictionaries. It achieved approximately 68-100% correctness, depending on the named entity type.

Previous research has shown that the MLB approach performs slightly better than the DCB approach, especially with the CRF technique.
However, MLB performance depends mainly on the domain and size of the corpus [2]. Collecting a corpus that covers every domain and character pattern, so that the model can be trained effectively, takes considerable time and effort. On the other hand, although the DCB approach performs poorly on unknown words, this problem can be largely overcome by combining syllable extraction with Lao named entities recognition.

III. METHODOLOGY

Lao word segmentation in our approach consists of three basic operations, applied in order: pre-processing, syllable extraction, and longest syllable level matching with NER. Fig. 1 illustrates the overview of the system.

Fig. 1 Lao word segmentation system overview (Lao paragraphs or sentences → Pre-processing → Sentence → Syllable Extraction → List of Syllable → Longest Syllable Level Matching with Name Entity Recognition, supported by the Dictionary and the Lao Initial Name Entity list → Lao Segmented Words)

The pre-processing operation takes Lao paragraphs or sentences as input. Although Lao rarely uses spaces between syllables, it does use full stops (.) to mark the end of sentences. The paragraph is therefore split into sentences at each full stop (.), which helps the longest syllable level matching with NER operation parse words more accurately.

Syllables in Lao are built from consonants, vowels, and tone markers. Consonants occur on the baseline and fall into two categories: 27 single consonants and 6 double consonants. Vowels can occur between, before, above, or below consonantal characters. There are 28 vowels in Lao, divided by sound into 12 short vowels, 12 long vowels, and a set of 4 special vowels. There are 4 tone markers, which always occur above consonantal characters. There are also 3 special symbols: “ໆ”, indicating repetition of a syllable; “ຯ”, indicating "and others" (etc.); and “໌”, indicating that the final consonant of a word borrowed from another language is not pronounced. Tables I to III list the Lao consonants, vowels, and tone markers, respectively.

TABLE I LAO CONSONANTS

Single Consonants
Consonant  IPA     Consonant  IPA     Consonant  IPA
ກ          /k/     ຕ          /t/     ຟ          /f/
ຂ          /kʰ/    ຖ          /tʰ/    ມ          /m/
ຄ          /kʰ/    ທ          /tʰ/    ຢ          /j/
ງ          /ŋ/     ນ          /n/     ຣ          /r/
ຈ          /c/     ບ          /b/     ລ          /l/
ສ          /s/     ປ          /p/     ວ          /w/
ຊ          /s/     ຜ          /pʰ/    ຫ          /h/
ຍ          /ɲ/     ຝ          /f/     ອ          /ʔ/
ດ          /d/     ພ          /pʰ/    ຮ          /h/

Double Consonants
Consonant  IPA     Consonant  IPA     Consonant  IPA
ຫງ         /ŋ/     ຫນ/ໜ       /n/     ຫລ/ຫຼ      /r/
ຫຍ         /ɲ/     ຫມ/ໝ       /m/     ຫວ         /w/

TABLE II LAO VOWELS

Short Vowels
Vowel   IPA      Vowel   IPA      Vowel   IPA
◌ະ      /a/      ເ◌ະ     /e/      ເ◌ິ     /ɤ/
◌ິ      /i/      ແ◌ະ     /ɛ/      ◌ົວະ    /uə/
◌ຶ      /ɯ/      ໂ◌ະ     /o/      ເ◌ຶອ    /ɯə/
◌ຸ      /u/      ເ◌າະ    /ɔ/      ເ◌ັຍ    /iə/

Long Vowels
Vowel   IPA      Vowel   IPA      Vowel   IPA
◌າ      /aː/     ເ◌      /eː/     ເ◌ີ     /ɤː/
◌ີ      /iː/     ແ◌      /ɛː/     ◌ົວ     /uːə/
◌ື      /ɯː/     ໂ◌      /oː/     ເ◌ືອ    /ɯːə/
◌ູ      /uː/     ◌ໍ      /ɔː/     ເ◌ຍ     /iːə/

Special Vowels
Vowel   IPA      Vowel   IPA      Vowel   IPA      Vowel   IPA
ເ◌ົາ    /ao/     ໄ◌      /ai/     ໃ◌      /ai/     ◌ຳ      /am/

TABLE III LAO TONE MARKERS

Tone Marker   Tone      IPA
              low       /àː/
              high      /áː/
              falling   /âː/
              rising    /ǎː/
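As a rough illustration of the pre-processing step described earlier in this section, the sketch below splits a paragraph into sentences at full stops. The function name `split_sentences` is hypothetical, and the sketch assumes that every full stop ends a sentence, so abbreviations or decimal numbers containing a full stop would be split incorrectly.

```python
def split_sentences(paragraph):
    """Split a paragraph into sentences at full stops.

    Assumption: every "." marks a sentence end. Empty pieces and
    surrounding whitespace are dropped.
    """
    pieces = paragraph.split(".")
    return [piece.strip() for piece in pieces if piece.strip()]


if __name__ == "__main__":
    # Placeholder text; any Lao paragraph ending sentences with "." works the same way.
    for sentence in split_sentences("first sentence. second sentence.third one."):
        print(sentence)
```

Each returned sentence would then be passed on to syllable extraction.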

From the characteristics of the Lao writing system, it can be observed that word boundaries generally align with syllable boundaries. Working directly at the character level tends to segment a sentence incorrectly into single characters and small lexemes, so it is useful to perform syllable extraction before the other operations. The rules and the algorithm for syllable identification in Lao were proposed by Phonpasit Phissamay et al. (2004) [5].

For longest syllable level matching, series of syllables are looked up in a dictionary using the forward technique. Fig. 2 describes the algorithm for longest syllable level matching. Given a set of extracted syllables S and a set of words in a dictionary D, the algorithm outputs a set of words W, each formed from the longest matching run of syllables.

For example, suppose the Lao sentence after the syllable extraction operation is ‘ຂ້ອຍ | ໄປ | ຕະ | ຫຼາດ’ /kʰɔːy pay tá laːt/ (I go to the market). First, the algorithm sets the current position, denoted CP, to the first index of the syllable set and the last position, denoted LP, to the last index. It then checks the syllables from CP to LP, ‘ຂ້ອຍໄປຕະຫຼາດ’, against the dictionary. If there is no match, it decreases LP by one and checks the syllables from CP to LP, ‘ຂ້ອຍໄປຕະ’, against the dictionary again. This continues until a match is found or CP equals LP. The algorithm then forms a word from the syllables CP to LP, sets CP to LP+1, and resets LP to the last index of the syllable set. The process repeats until all syllables have been processed, that is, until CP is greater than LP. The word segmentation result for this example is ‘ຂ້ອຍ | ໄປ | ຕະຫຼາດ’.
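The forward longest matching procedure described above can be sketched in Python as follows. This is a minimal sketch written from the prose description, not the authors' implementation: the function name `longest_syllable_matching` and the three-word toy dictionary are assumptions made for illustration, and a syllable that matches nothing is simply emitted as a one-syllable word.

```python
def longest_syllable_matching(syllables, dictionary):
    """Greedy forward longest matching over a list of syllables.

    syllables:  list of syllable strings (output of syllable extraction)
    dictionary: set of known words
    Returns the list of segmented words.
    """
    words = []
    cp = 0                      # current position (CP)
    last = len(syllables) - 1   # last index of the syllable list
    while cp <= last:
        lp = last               # reset the last position (LP)
        # Shrink the span CP..LP until it is a dictionary word
        # or only a single syllable is left.
        while lp > cp and "".join(syllables[cp:lp + 1]) not in dictionary:
            lp -= 1
        words.append("".join(syllables[cp:lp + 1]))
        cp = lp + 1
    return words


# The example sentence from the text, already split into syllables.
syllables = ["ຂ້ອຍ", "ໄປ", "ຕະ", "ຫຼາດ"]
# Toy dictionary (assumption): "I", "go", and the two-syllable word "market".
toy_dictionary = {syllables[0], syllables[1], syllables[2] + syllables[3]}

if __name__ == "__main__":
    print(" | ".join(longest_syllable_matching(syllables, toy_dictionary)))
```

With this toy dictionary the last two syllables are joined into one word, which matches the segmentation result given in the example. In the worst case each starting position tries every possible end position, so the number of dictionary lookups grows quadratically with sentence length, which is one reason for splitting paragraphs into sentences first.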

Algorithm 1: Longest syllable level matching
S = {s0, s1, s2, …, sn}   # Set of extracted syllables
D = {d0, d1, d2, …, dn}   # Set of words in the dictionary
W = {w0, w1, w2, …, wn}   # Set of longest syllable
# Initialize variables
Let CP = 0                # Current Position Marker
Let LP = the length of set S
Let TmpLP = LP            # Last Position Marker
repeat
    …
If CP > LP return W
Fig. 2 Algorithm for longest syllable level matching

Lao personal names usually start with a title, which can be used as a clue. In general, native Lao personal names are composed of title + one space + first name + one space + [last name], where the last name is optional, as shown in Fig. 3 and Table IV.

Fig. 3 Regular grammar for personal name written in Lao language (nodes: Title, First Name, Last Name)

TABLE IV EXAMPLE OF PERSON NAME WRITTEN IN LAO LANGUAGE

No   Title   First Name   Last Name
1    ທ້າວ    ສກໃຈ         ລດຕະນະ
2    ທ່ານ    ອານສອນ       ໄພສານ
3    ນາງ     ສແນດຕາ       ພະວງສາ

Furthermore, location expressions such as institute, company, school, university, office, district, city, village, town, province, and country also have a title that can be used as a clue. Generally, a location expression is composed of title + one space + location name, as shown in Fig. 4 and Table V.

Fig. 4 Regular grammar for location name written in Lao language (nodes: Title, Location Name)

TABLE V EXAMPLE OF LOCATION NAME WRITTEN IN LAO LANGUAGE

No   Title     Location Name
1    ບໍລິສັດ   ໄຊຍະສດທປກສາໄອທ
2    ອົງການ    ການຄາໂລກ
3    ບ້ານ      ສສງວອນ
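The two regular grammars above (title + space + first name + space + [last name], and title + space + location name) can be expressed as regular expressions. The sketch below is only an illustration under stated assumptions: the helper names `build_pattern` and `find_entities` are hypothetical, only a handful of the clue titles are included, and a name part is assumed to be any run of non-space characters. It does not reproduce the authors' NER algorithm.

```python
import re

# A few clue titles (assumption: a small subset chosen for illustration).
PER_TITLES = ["ທ້າວ", "ນາງ", "ທ່ານ"]       # Mr., Ms., Mr.
LOC_TITLES = ["ບ້ານ", "ແຂວງ", "ໂຮງແຮມ"]    # Village, Province, Hotel


def build_pattern(titles, max_parts):
    """Title, one space, then 1..max_parts space-separated name parts."""
    # Longer titles first so that a longer title is preferred over its prefix.
    ordered = sorted(titles, key=len, reverse=True)
    alternation = "|".join(re.escape(title) for title in ordered)
    optional_parts = r"(?: \S+)?" * (max_parts - 1)
    return re.compile(rf"(?:{alternation}) \S+{optional_parts}")


PER_PATTERN = build_pattern(PER_TITLES, max_parts=2)  # first name + [last name]
LOC_PATTERN = build_pattern(LOC_TITLES, max_parts=1)  # location name only


def find_entities(sentence):
    """Return (type, text) pairs for every clue-based match in the sentence."""
    found = [("PER", m.group()) for m in PER_PATTERN.finditer(sentence)]
    found += [("LOC", m.group()) for m in LOC_PATTERN.finditer(sentence)]
    return found


if __name__ == "__main__":
    # Row 3 of Table IV: title, first name, last name separated by single spaces.
    print(find_entities("ນາງ ສແນດຕາ ພະວງສາ"))
```

Because the last name is optional, this pattern always absorbs a second space-separated chunk when one follows the first name, even if that chunk is ordinary text. It also accepts a title string that happens to occur inside a longer word. A real system would need additional checks for both cases.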

By determining the word boundaries of personal names and location names in Lao, a rule can be created to recognize named entities. Fig. 5 describes the modified algorithm for longest matching using the forward technique with NER.

IV. EXPERIMENT AND EVALUATION

Lao news documents were used to evaluate the performance of our approach. They were collected from the website of a Lao news publisher, Vientiane Mai [6], in three categories: General, Sport, and Education. Ten articles were randomly selected from each category, for a total of thirty articles. The dictionary used contains approximately 52,000 words. The named entities were divided into two types for recognition: person name (PER) and location (LOC). Table VI lists the Lao initial titles and location names that can be used as clues to detect named entities. Table VII shows the results of our approach before and after using NER. Word segmentation improves in every category when NER is used, especially in the General category, where named entities are most likely to occur and the F-measure rises from 81.12% to 87.16%.

Algorithm 2: Longest Syllable Level Matching with NER
S = {s0, s1, s2, …, sn}   # Set of extracted syllables
D = {d0, d1, d2, …, dn}   # Set of words in the dictionary
C = {c0, c1, c2, …, cn}   # Set of clue words
W = {w0, w1, w2, …, wn}   # Set of longest syllable
# Initialize variables
Let CP = 0                # Current Position Marker
Let LP = the length of set S
Let FlagClue = FALSE
Let TmpLP = LP            # Last Position Marker
Let NI = 0                # Next Index of Space
repeat
    …
If CP > LP return W
Fig. 5 Algorithm for longest matching with NER

TABLE VI LIST OF LAO INITIAL TITLE AND LOCATION NAME

Type   Lao Title       English Title
PER    ທ້າວ            Mr.
PER    ນາງ             Ms.
PER    ອຈ              Teacher
PER    ດຣ              Dr.
PER    ຮສ              Assoc. Prof.
PER    ຜຊສ             Asst. Prof.
PER    ສຈ              Prof.
PER    ຮສ ດຣ           Ph.D. Assoc. Prof.
PER    ຜຊສ ດຣ          Ph.D. Asst. Prof.
PER    ສຈ ດຣ           Ph.D. Prof.
PER    ທ່ານ            Mr.
PER    ສິບຕີ           Private 1st class
PER    ສິບໂທ           Corporal
PER    ສິບເອກ          Sergeant
PER    ຮ້ອຍຕີ          Second Lieutenant
PER    ຮ້ອຍໂທ          First Lieutenant
PER    ຮ້ອຍເອກ         Captain
PER    ພັນຕີ           Major
PER    ພັນໂທ           Lieutenant General
PER    ພັນເອກ          General
PER    ພະນະທ່ານ        Excellency
LOC    ອົງການ          Organization
LOC    ບໍລິສັດ         Company
LOC    ບ້ານ            Village
LOC    ບ້ານພັກ         Guest House
LOC    ເມືອງ           District
LOC    ແຂວງ            Province
LOC    ໂຮງແຮມ          Hotel
LOC    ຮ້ານ            Restaurant
LOC    ລັດວິສາຫະກິດ    State enterprise

TABLE VII EVALUATION RESULTS OF LONGEST SYLLABLE LEVEL MATCHING BEFORE AND AFTER USING NER

Approach          Category    Precision (%)   Recall (%)   F-Measure (%)
DCB Without NER   General     76.59           86.21        81.12
DCB Without NER   Sport       79.41           88.67        83.79
DCB Without NER   Education   82.58           90.31        86.27
DCB With NER      General     83.54           91.10        87.16
DCB With NER      Sport       83.69           91.55        87.44
DCB With NER      Education   87.43           93.79        90.50

Fig. 6 Comparison chart result of longest syllable level matching approach before and after using NER

TABLE VIII AVERAGE EVALUATION RESULTS OF LONGEST SYLLABLE LEVEL MATCHING BEFORE AND AFTER USING NER

Approach          Precision (%)   Recall (%)   F-Measure (%)
DCB Without NER   79.87           88.62        84.02
DCB With NER      85.21           92.36        88.64
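The paper does not state how the F-measure is computed. Assuming the usual balanced F-measure, the harmonic mean of precision and recall, F = 2PR / (P + R), the short check below reproduces the reported values. The function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumption: this is the formula behind the reported F-Measure column.
    """
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (precision, recall) pairs taken from Table VIII.
    reported = {
        "DCB Without NER": (79.87, 88.62),
        "DCB With NER": (85.21, 92.36),
    }
    for approach, (p, r) in reported.items():
        print(f"{approach}: F = {f_measure(p, r):.2f}")
```

Under this assumption the computed values round to 84.02 and 88.64, in agreement with Table VIII.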

The errors most likely to occur with our approach are word ambiguity and unknown words. For example, the Lao sentence ‘ຂ້ອຍໃຫ້ການສະໜັບສະໜູນເຈົ້າ’ /kʰɔːy hay kaːn sánáp sánuːn cao/ (I give you the support) parses into ‘ຂ້ອຍ (I) [Pronoun]’, ‘ໃຫ້ການ (give evidence) [Verb]’, ‘ສະໜັບສະໜູນ (to support) [Verb]’, ‘ເຈົ້າ (you) [Pronoun]’, instead of ‘ຂ້ອຍ (I) [Pronoun]’, ‘ໃຫ້ (give) [Verb]’, ‘ການສະໜັບສະໜູນ (support) [Noun]’, ‘ເຈົ້າ (you) [Pronoun]’. Forward longest matching prefers the longer dictionary entry ‘ໃຫ້ການ’ over ‘ໃຫ້’, which then prevents ‘ການສະໜັບສະໜູນ’ from being recognized.

V. CONCLUSION AND FUTURE WORK

This paper presented a Lao word segmentation approach using longest syllable level matching with NER. The approach first extracts syllables from the input text and then applies longest matching, one of the techniques in the DCB approach, to combine them into words. We also proposed a technique for Lao named entities recognition, covering person and location names, to improve word segmentation accuracy. In our experiments, the approach achieved precision and recall of 85.21% and 92.36%, respectively. In future work, we will try to integrate an MLB approach with our approach to improve word segmentation performance, especially when parsing words that are not in the dictionary and when resolving word ambiguity.

REFERENCES

[1] S. Vanthanavong and C. Haruechaiyasak, “LaoWS: Lao Word Segmentation Based on Conditional Random Fields,” in Conference on Human Language Technology for Development, 2011, pp. 21-26.
[2] C. Haruechaiyasak, S. Kongyoung, and M. Dailey, “A comparative study on Thai word segmentation approaches,” in 5th Int. Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008, pp. 125-128.
[3] H. Chanlekha and A. Kawtrakul, “Thai Named Entity Extraction by incorporating Maximum Entropy Model with Simple Heuristic Information,” in 1st Int. Joint Conference on NLP, 2004.
[4] N. Tongtep and T. Theeramunkong, “Pattern-based Extraction of Named Entities in Thai News Documents,” Thammasat Int. J. Sc. Tech., Vol. 15, No. 1, pp. 70-81, January-March 2010.
[5] P. Phissamay, et al., “Syllabification of Lao Script for Line Breaking,” Tech. Rep. of STEA, Lao PDR, 2004.
[6] (2013) Vientiane Mai website. [Online]. Available: http://www.vientianemai.net/
