A Hybrid Approach to Lao Word Segmentation using Longest Syllable Level Matching with Named Entities Recognition
Arounyadeth Srithirath#, Pusadee Seresangtakul#
# Department of Computer Science, Khon Kaen University, Khon Kaen, Thailand, 40002
Email: [email protected], [email protected]
Abstract— The Lao language is written without word delimiters, which makes it extremely difficult to process. Developing automatic word segmentation for Lao natural language processing is therefore an essential but challenging task. This paper proposes a longest syllable level matching approach with named entity recognition for Lao word segmentation. Syllables are first extracted from the input text. Longest matching, one of the techniques in the Dictionary Based approach, is then applied together with named entity recognition to combine the syllables into words. The precision and recall obtained with this approach were 85.21% and 92.36%, respectively.
Keywords— Lao word segmentation, tokenization, syllable extraction, longest matching, dictionary based, named entities recognition.
I. INTRODUCTION
A Lao text is a string of symbols with no explicit word boundaries, similar to other Asian languages such as Japanese, Chinese, and Thai. Spaces between syllables are rarely used in the Lao language. In order to perform Lao language processing, especially for rule-based machine translation, the text must first be segmented into individual terms or words.
Researchers have proposed several word segmentation techniques for many different languages. These techniques can be classified into two main approaches: Dictionary Based (DCB) and Machine Learning Based (MLB) [2]. The DCB approach is simple and straightforward: it looks up series of characters in a dictionary to find matching terms. Its main problems are parsing words that are not in the dictionary and parsing ambiguity. The MLB approach aims to solve these problems by using a machine learning classification model trained on the character patterns in a tagged corpus.
In this paper, we present a word segmentation process for the Lao language that combines three main operations: syllable extraction; longest matching, which is one of the techniques in the DCB approach; and named entity recognition (NER) [4], in order to improve segmentation quality. Our dictionary contains approximately 52,000 words
978-1-4799-0545-4/13/$31.00 ©2013 IEEE
and provides sufficient lexical coverage for everyday use in the general domain.
II. BACKGROUND AND RELATED WORKS
Recently, Sisouvanh Vanthanavong et al. (2011) proposed Lao word segmentation based on conditional random fields (CRF) [1], using a tagged corpus of approximately 100,000 words, and reported precision and recall of 80.29% and 78.45%, respectively. As Lao and Thai are very similar in both their spoken and written forms, work on Thai is also relevant. Choochart Haruechaiyasak et al. (2008) compared Thai word segmentation approaches [2] for both the DCB and MLB techniques. For DCB, they compared two algorithms: Longest Matching and Maximal Matching. For MLB, they compared four algorithms: Naive Bayes (NB), decision tree, support vector machine (SVM), and CRF. Using the ORCHID corpus, which contains 113,404 words, the best performance was obtained with the CRF algorithm, with precision and recall of 95.79% and 94.98%, respectively.
Named entity recognition is an essential process widely used in natural language processing (NLP). Hutchatai Chanlekha et al. (2004) presented Thai named entity extraction that incorporates a maximum entropy model with simple heuristic information [3]. By combining the machine learning and rule-based approaches, they obtained F-measures for person, location, and organization names of 90.44%, 82.16%, and 89.87%, respectively. Nattapong Tongtep et al. (2010) proposed a method for pattern-based extraction of named entities in Thai news documents [4]. It focuses on the rule-based approach and combines techniques such as longest word matching, longest pattern matching, clue words, and word lists from dictionaries. Their method achieved approximately 68-100% correctness, depending on the named entity type.
Previous research has shown that the MLB approach performs slightly better than the DCB approach, especially with the CRF technique.
However, MLB performance depends mainly on the domain and size of the corpus [2]. Collecting a corpus that covers every domain and character pattern, so that the model can be trained effectively, would take a lot of time and effort. On the other hand, even though the DCB approach performs poorly on unknown words, the problem can be mitigated significantly by combining syllable extraction with Lao named entity recognition.
III. METHODOLOGY
Lao word segmentation is built on three basic operations performed in sequence: pre-processing, syllable extraction, and longest syllable level matching with NER. Fig. 1 illustrates the overview of the system.
Fig. 1 Lao word segmentation system overview (stages: Pre-processing, Sentence, Syllable Extraction, List of Syllables, Longest Syllable Level Matching with Named Entity Recognition, Lao Segmented Words; resources: Dictionary, Lao Initial Named Entity)

The pre-processing operation takes Lao paragraphs or sentences as input. The Lao language rarely uses spaces between syllables, but it does use full stops (.) to mark the end of sentences. Therefore, each paragraph is split into sentences at full stops, which helps the longest syllable level matching with NER operation parse words more accurately.

The syllable structure of the Lao language consists of consonants, vowels, and tone markers. Consonants occur on the baseline and are divided into two categories: single consonants (27 characters) and double consonants (6 characters). Vowels can occur before, after, above, or below a consonant character, or in a combination of these positions. There are 28 vowels in the Lao language. They are divided by sound into short vowels (12 characters) and long vowels (12 characters), plus a set of special vowels (4 characters). There are 4 tone markers, which always occur above consonant characters. There are also 3 special symbols: “ໆ”, indicating repetition of a syllable; “ຯ”, indicating “and others” (etc.); and “໌”, indicating that the final consonant of a word borrowed from another language is silent. Tables I to III list the Lao consonants, vowels, and tone markers, respectively.

TABLE I LAO CONSONANTS
Single Consonants
Consonant | IPA | Consonant | IPA | Consonant | IPA
ກ | /k/ | ຕ | /t/ | ຟ | /f/
ຂ | /kʰ/ | ຖ | /tʰ/ | ມ | /m/
ຄ | /kʰ/ | ທ | /tʰ/ | ຢ | /j/
ງ | /ŋ/ | ນ | /n/ | ຣ | /r/
ຈ | /c/ | ບ | /b/ | ລ | /l/
ສ | /s/ | ປ | /p/ | ວ | /w/
ຊ | /s/ | ຜ | /pʰ/ | ຫ | /h/
ຍ | /ɲ/ | ຝ | /f/ | ອ | /ʔ/
ດ | /d/ | ພ | /pʰ/ | ຮ | /h/
Double Consonants
Consonant | IPA | Consonant | IPA | Consonant | IPA
ຫງ | /ŋ/ | ຫນ/ໜ | /n/ | ຫລ/ຫຼ | /r/
ຫຍ | /ɲ/ | ຫມ/ໝ | /m/ | ຫວ | /w/

TABLE II LAO VOWELS
Short Vowels
Vowel | IPA | Vowel | IPA | Vowel | IPA
◌ະ | /a/ | ເ◌ະ | /e/ | ເ◌ິ | /ɤ/
◌ິ | /i/ | ແ◌ະ | /ɛ/ | ◌ົວະ | /uə/
◌ຶ | /ɯ/ | ໂ◌ະ | /o/ | ເ◌ຶອ | /ɯə/
◌ຸ | /u/ | ເ◌າະ | /ɔ/ | ເ◌ັຍ | /iə/
Long Vowels
Vowel | IPA | Vowel | IPA | Vowel | IPA
◌າ | /aː/ | ເ◌ | /eː/ | ເ◌ີ | /ɤː/
◌ີ | /iː/ | ແ◌ | /ɛː/ | ◌ົວ | /uːə/
◌ື | /ɯː/ | ໂ◌ | /oː/ | ເ◌ືອ | /ɯːə/
◌ູ | /uː/ | ◌ໍ | /ɔː/ | ເ◌ຍ | /iːə/
Special Vowels
Vowel | IPA | Vowel | IPA | Vowel | IPA
ເ◌ົາ | /ao/ | ໄ◌ | /ai/ | ໃ◌ | /ai/
◌ຳ | /am/

TABLE III LAO TONE MARKERS
Tone Marker | Tone | IPA
 | low | /àː/
 | high | /áː/
 | falling | /âː/
 | rising | /ǎː/
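As an illustration of the pre-processing step, the sentence split at full stops might look like the following minimal sketch. The function name `split_sentences` is hypothetical, and the sketch assumes that every full stop ends a sentence; the paper does not say how full stops inside abbreviations or numbers are handled.

```python
def split_sentences(paragraph):
    """Split a paragraph into sentences at full stops.

    Assumption: every "." marks a sentence boundary. Empty pieces
    (for example after a trailing full stop) are dropped.
    """
    pieces = paragraph.split(".")
    return [piece.strip() for piece in pieces if piece.strip()]


# Placeholder text stands in for a Lao paragraph.
print(split_sentences("first sentence. second sentence."))
# prints: ['first sentence', 'second sentence']
```

Each returned sentence would then be passed on to syllable extraction.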
An examination of the Lao writing system shows that word boundaries generally align with syllable boundaries. Therefore, instead of working directly at the character level, which leads to incorrect segmentation of a sentence into single characters and small lexemes, it is useful to perform syllable extraction before the other operations. The rules and the algorithm for syllable identification in the Lao language were proposed by Phonpasit Phissamay et al. (2004) [5].

To perform longest syllable level matching, series of syllables are looked up in a dictionary using the forward technique. Fig. 2 describes the algorithm for longest syllable level matching. Given the set of extracted syllables S (in sentence order) and the set of words in the dictionary D, the algorithm outputs the set of longest-matching words W.

For example, suppose the Lao sentence after syllable extraction is ‘ຂ້ອຍ | ໄປ | ຕະ | ຫຼາດ’ /kʰɔːy pay tá laːt/ (I go to the market). First, the algorithm sets the current position CP and the last position LP to the first and last indices of the syllable set, respectively. It then checks the syllables from CP to LP, ‘ຂ້ອຍໄປຕະຫຼາດ’, against the dictionary. If there is no match, it decreases LP by one and checks the syllables from CP to LP, ‘ຂ້ອຍໄປຕະ’, against the dictionary again. It repeats this until it finds a match or CP equals LP. It then forms a word from the syllables from CP to LP, sets CP to LP+1, and resets LP to the last index of the syllable set. The algorithm continues until all syllables have been processed, that is, until CP is greater than LP. The word segmentation result for this example is ‘ຂ້ອຍ | ໄປ | ຕະຫຼາດ’.
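The walkthrough above can be sketched in Python as follows. This is a minimal illustration of forward longest matching over syllables, not the authors' implementation: the function name `longest_syllable_match` and the three-word toy dictionary are assumptions made for the example, and an unmatched single syllable is simply emitted as a word on its own.

```python
def longest_syllable_match(syllables, dictionary):
    """Greedy forward longest matching over a list of syllables."""
    words = []
    cp = 0                      # CP: current position
    last = len(syllables) - 1   # last valid index of the syllable list
    while cp <= last:
        lp = last               # LP: reset to the last index for each new word
        # Shrink the candidate span until it is a dictionary word
        # or only a single syllable remains (CP == LP).
        while lp > cp and "".join(syllables[cp:lp + 1]) not in dictionary:
            lp -= 1
        words.append("".join(syllables[cp:lp + 1]))
        cp = lp + 1
    return words


# Toy dictionary (hypothetical) containing only the words of the example.
toy_dictionary = {"ຂ້ອຍ", "ໄປ", "ຕະຫຼາດ"}
syllables = ["ຂ້ອຍ", "ໄປ", "ຕະ", "ຫຼາດ"]
print(" | ".join(longest_syllable_match(syllables, toy_dictionary)))
# prints: ຂ້ອຍ | ໄປ | ຕະຫຼາດ
```

Joining the candidate span on every check keeps the sketch short; a real dictionary of about 52,000 words would more likely be held in a set or a trie so that each lookup stays cheap.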
Algorithm 1: Longest syllable level matching
S = {s0, s1, s2, …, sn}  # Set of extracted syllables
D = {d0, d1, d2, …, dn}  # Set of words in the dictionary
W = {w0, w1, w2, …, wn}  # Set of longest syllable
# Initialize variables
Let CP = 0  # Current Position Marker
Let LP = the length of set S  # Last Position Marker
Let TmpLP = LP
repeat
    …
    If CP > LP return W
Fig. 2 Algorithm for longest syllable level matching

Lao personal names usually start with a title that can be used as a clue. In general, native Lao personal names are composed of title + one space + first name + one space + [last name] (the last name is optional), as shown in Fig. 3 and Table IV.

Fig. 3 Regular grammar for personal name written in Lao language (nodes: Title, First Name, Last Name)

TABLE IV EXAMPLE OF PERSON NAME WRITTEN IN LAO LANGUAGE
No | Title | First Name | Last Name
1 | ທ້າວ | ສກໃຈ | ລດຕະນະ
2 | ທ່ານ | ອານສອນ | ໄພສານ
3 | ນາງ | ສແນດຕາ | ພະວງສາ

Furthermore, location expressions such as institute, company, school, university, office, district, city, village, town, province, and country also have a title that can be used as a clue. Generally, a location expression is composed of title + one space + location name, as shown in Fig. 4 and Table V.

Fig. 4 Regular grammar for location name written in Lao language (nodes: Title, Location Name)

TABLE V EXAMPLE OF LOCATION NAME WRITTEN IN LAO LANGUAGE
No | Title | Location Name
1 | ບໍລິສັດ | ໄຊຍະສດທປກສາໄອທ
2 | ອົງການ | ການຄາໂລກ
3 | ບ້ານ | ສສງວອນ
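The title + one space + name pattern can be illustrated with a small Python sketch. It is only one possible reading of the grammar, under several assumptions that are not in the paper: the function name `find_named_entity` is hypothetical, spaces in the input are kept as literal " " tokens in the syllable list, clue words are stored as tuples of syllables, and the name is taken to run up to the next space. Detection of the optional last name is left out because the paper does not specify how it is decided.

```python
def find_named_entity(tokens, cp, clue_words):
    """Return ([title, name], next_cp) if a clue pattern starts at cp, else None.

    tokens: list of syllables in which a literal " " marks a space.
    clue_words: set of titles, each a tuple of syllables (assumed layout).
    """
    # Try longer clues first so that multi-syllable titles take priority.
    for clue in sorted(clue_words, key=len, reverse=True):
        n = len(clue)
        if tuple(tokens[cp:cp + n]) != clue:
            continue
        # The grammar requires exactly one space after the title.
        if cp + n >= len(tokens) or tokens[cp + n] != " ":
            continue
        start = cp + n + 1
        # NI: index of the next space, or the end of the sentence.
        ni = start
        while ni < len(tokens) and tokens[ni] != " ":
            ni += 1
        if ni == start:
            continue  # nothing follows the title
        return ["".join(clue), "".join(tokens[start:ni])], ni
    return None


# "ນາງ" (Ms.) is a title from the clue list; the name syllables are invented.
tokens = ["ນາງ", " ", "ສຸ", "ດາ", " ", "ໄປ"]
print(find_named_entity(tokens, 0, {("ນາງ",)}))
```

In a combined segmenter, a check like this would run at each position before the dictionary lookup, so that the syllables of a name are kept together instead of being matched against the dictionary.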
By determining the word boundaries of personal names and location names in the Lao language, rules can be created to recognize Lao named entities. Fig. 5 describes the modified algorithm for longest matching using the forward technique with NER.
IV. EXPERIMENT AND EVALUATION
Lao news documents were used to evaluate the performance of our approach. They were collected from the website of a Lao news publisher, Vientiane Mai [6], in the following categories: General, Sport, and Education. Ten articles were randomly selected from each category, totalling thirty articles. The dictionary used contains approximately 52,000 words. The named entities were divided into two types for recognition: person name (PER) and location (LOC). Table VI shows the list of Lao initial titles and location names that can be used as clues to detect named entities. Table VII shows the results of our approach before and after using NER. The F-measure improves in every category when NER is used, most of all in the General category (from 81.12% to 87.16%), where named entities are most likely to occur.
Algorithm 2: Longest Syllable Level Matching with NER
S = {s0, s1, s2, …, sn}  # Set of extracted syllables
D = {d0, d1, d2, …, dn}  # Set of words in the dictionary
C = {c0, c1, c2, …, cn}  # Set of clue words
W = {w0, w1, w2, …, wn}  # Set of longest syllable
# Initialize variables
Let CP = 0  # Current Position Marker
Let LP = the length of set S  # Last Position Marker
Let FlagClue = FALSE
Let TmpLP = LP
Let NI = 0  # Next Index of Space
repeat
    …
    If CP > LP return W
Fig. 5 Algorithm for longest matching with NER

TABLE VI LIST OF LAO INITIAL TITLE AND LOCATION NAME
Type | Lao Title | English Title
PER | ທ້າວ | Mr.
PER | ນາງ | Ms.
PER | ອຈ | Teacher
PER | ດຣ | Dr.
PER | ຮສ | Assoc. Prof.
PER | ຜຊສ | Asst. Prof.
PER | ສຈ | Prof.
PER | ຮສ ດຣ | Assoc. Prof., Ph.D.
PER | ຜຊສ ດຣ | Asst. Prof., Ph.D.
PER | ສຈ ດຣ | Prof., Ph.D.
PER | ທ່ານ | Mr.
PER | ສິບຕີ | Private 1st class
PER | ສິບໂທ | Corporal
PER | ສິບເອກ | Sergeant
PER | ຮ້ອຍຕີ | Second Lieutenant
PER | ຮ້ອຍໂທ | First Lieutenant
PER | ຮ້ອຍເອກ | Captain
PER | ພັນຕີ | Major
PER | ພັນໂທ | Lieutenant General
PER | ພັນເອກ | General
PER | ພະນະທ່ານ | Excellency
LOC | ອົງການ | Organization
LOC | ບໍລິສັດ | Company
LOC | ບ້ານ | Village
LOC | ບ້ານພັກ | Guest House
LOC | ເມືອງ | District
LOC | ແຂວງ | Province
LOC | ໂຮງແຮມ | Hotel
LOC | ຮ້ານ | Restaurant
LOC | ລັດວິສາຫະກິດ | State enterprise
Fig. 6 Comparison chart result of longest syllable level matching approach before and after using NER
TABLE VII EVALUATION RESULTS OF LONGEST SYLLABLE LEVEL MATCHING BEFORE AND AFTER USING NER
Approach | Category | Precision (%) | Recall (%) | F-Measure (%)
DCB Without NER | General | 76.59 | 86.21 | 81.12
DCB Without NER | Sport | 79.41 | 88.67 | 83.79
DCB Without NER | Education | 82.58 | 90.31 | 86.27
DCB With NER | General | 83.54 | 91.10 | 87.16
DCB With NER | Sport | 83.69 | 91.55 | 87.44
DCB With NER | Education | 87.43 | 93.79 | 90.50
TABLE VIII AVERAGE EVALUATION RESULTS OF LONGEST SYLLABLE LEVEL MATCHING BEFORE AND AFTER USING NER
Approach | Precision (%) | Recall (%) | F-Measure (%)
DCB Without NER | 79.87 | 88.62 | 84.02
DCB With NER | 85.21 | 92.36 | 88.64
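The paper does not state the formula behind the F-Measure column. Assuming it is the usual balanced F-measure, the harmonic mean of precision and recall, the reported averages can be reproduced with the short check below. The function name `f_measure` is chosen for this illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average precision and recall (in percent) taken from Table VIII.
for label, p, r in [("DCB Without NER", 79.87, 88.62),
                    ("DCB With NER", 85.21, 92.36)]:
    print(f"{label}: F = {f_measure(p, r):.2f}")
# prints:
# DCB Without NER: F = 84.02
# DCB With NER: F = 88.64
```

Both values agree with the F-Measure column, which is consistent with the balanced form having been used.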
The errors most likely to occur with our approach are word ambiguity and unknown words. For example, the Lao sentence ‘ຂ້ອຍໃຫ້ການສະໜັບສະໜູນເຈົ້າ’ /kʰɔːy hay kaːn sánáp sánuːn cao/ (I give you the support) is parsed into ‘ຂ້ອຍ (I) [Pronoun]’, ‘ໃຫ້ການ (give evidence) [Verb]’, ‘ສະໜັບສະໜູນ (to support) [Verb]’, ‘ເຈົ້າ (you) [Pronoun]’, instead of ‘ຂ້ອຍ (I) [Pronoun]’, ‘ໃຫ້ (give) [Verb]’, ‘ການສະໜັບສະໜູນ (support) [Noun]’, ‘ເຈົ້າ (you) [Pronoun]’. This happens because forward longest matching selects the longer dictionary entry ‘ໃຫ້ການ’ before it can consider ‘ໃຫ້’ followed by ‘ການສະໜັບສະໜູນ’.

V. CONCLUSION AND FUTURE WORK
This paper presented a Lao word segmentation approach using longest syllable level matching with NER. The approach first extracts syllables from the input text and then applies longest matching, one of the techniques in the DCB approach, to combine them into words. We also proposed a technique for Lao named entity recognition, covering person and location names, in order to improve word segmentation accuracy. Our approach achieved precision and recall of 85.21% and 92.36%, respectively. In future work, we will try to integrate an MLB approach with our approach in order to improve word segmentation performance, especially when parsing words that are not in the dictionary and resolving word ambiguity.

REFERENCES
[1] S. Vanthanavong and C. Haruechaiyasak, “LaoWS: Lao Word Segmentation Based on Conditional Random Fields,” in Conference on Human Language Technology for Development, 2011, pp. 21-26.
[2] C. Haruechaiyasak, S. Kongyoung, and M. Dailey, “A comparative study on Thai word segmentation approaches,” in 5th Int. Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008, pp. 125-128.
[3] H. Chanlekha and A. Kawtrakul, “Thai Named Entity Extraction by incorporating Maximum Entropy Model with Simple Heuristic Information,” in 1st Int. Joint Conference on NLP, 2004.
[4] N. Tongtep and T. Theeramunkong, “Pattern-based Extraction of Named Entities in Thai News Documents,” Thammasat Int. J. Sc. Tech., Vol. 15, No. 1, pp. 70-81, January-March 2010.
[5] P. Phissamay, et al., “Syllabification of Lao Script for Line Breaking,” Tech. Rep. of STEA, Lao PDR, 2004.
[6] (2013) Vientiane Mai website. [Online]. Available: http://www.vientianemai.net/