Proceedings of the International Conference on Artificial Intelligence and Pattern Recognition, Kuala Lumpur, Malaysia, 2014

Noisy Text Normalization Using an Enhanced Language Model

Mohammad Arshi Saloot 1,2, Norisma Idris 1, AiTi Aw 2
1 University of Malaya, 50603, Malaysia
2 Institute for Infocomm Research (I2R), A*STAR, Singapore
[email protected]

ABSTRACT
User-generated text in social network sites contains an enormous amount and a vast variety of out-of-vocabulary words, formed both deliberately and mistakenly by end-users. It is essential to normalize this noisy text before applying NLP tasks. This paper describes an unsupervised normalization system, which encompasses two phases: candidate generation and candidate selection. We generate candidates via six different methods: 1) one-edit-distance lexical generation, 2) phonemic generation, 3) a blend of the previous two methods, 4) two-edit-distance lexical generation, 5) dictionary translation, and 6) heuristic rules. Although candidate selection uses a trigram language model, we present a new method that selects candidates with respect to all other words in the sentence. Our experiments on a large dataset show promising results.

KEYWORDS
Natural Language Processing, Noisy Text, Twitter, Text Normalization, Language Model

1 INTRODUCTION
Social network sites (SNS) and microblogs were introduced a few years ago. The rapid expansion of services such as Facebook and Twitter has led to a swift increase in the need to comprehend noisy written English, which often does not obey the rules of punctuation, grammar, and spelling. This noisy text is often extremely divergent from the standard form of the language.


This deviation is due to three main factors: (1) the limited number of characters per text message, (2) the small phone keypads, and (3) the fact that communication between friends and relatives occurs in an informal manner [1], [2]. Twitter is a hybrid microblogging/SNS service where end-users can send and read messages from a variety of electronic channels, such as the Twitter website, text messages, or Android applications. Twitter is a popular medium for news dissemination, keeping in touch with friends, and sharing beliefs. Since its launch in 2006, it had gained more than 500 million users by 2012 [3]. Tweet is the term for the messages ("status messages") sent on Twitter, which consist of at most 140 characters. Due to the colloquial nature of Twitter, Tweets are remarkably noisy, containing many non-standard words, e.g., 2nght "tonight" and u "you". Twitter serves as an important resource for many natural language processing (NLP) tasks, such as sentiment analysis, information extraction (IE), summarization, information retrieval (IR), text-to-speech (TTS), etc. [4]. However, existing approaches often function below par in the Twittersphere because of the ample use of emoticons, non-standard words, and ungrammatical and incomplete sentences. Liu, Weng, and Jiang [5] report that the performance of the Stanford named entity recognizer (NER) falls from 90% to 45% on Tweets. Therefore, it is very fruitful to normalize Tweets before employing standard NLP techniques. There are many similarities between normalization and spell correction: both attempt to detect and correct out-of-vocabulary (OOV) words.



Spell correction techniques only attend to misspelled words, while normalization systems are concerned with all types of OOV words, such as shortened forms (e.g. "university" → "uni") and phonemic style (e.g. "see you" → "cu"). Therefore, normalization systems must cover more OOV words than spell correction systems, which makes the task more difficult. On the other hand, normalization is easier than spell correction in one respect: the context of the text and its features are known in advance. The unsupervised approach presented in this work is based on the hypothesis that a variety of information about the context of Tweets is available. For example, we know that Tweets are not expository text. Maximum entropy provides an infrastructure to combine different probabilities about the normalized candidates. The remainder of this paper is organized as follows: Section 2 discusses related work on normalizing noisy text. Section 3 presents our approach to detecting OOV words. Then, the normalized candidate generation stage is described in Section 4. Section 5 briefly illustrates the designed data structure. The proposed candidate selection stage is explained in Section 6. Finally, Section 7 concludes this paper with a brief summary and potential future work.

2 LITERATURE REVIEW
There are a large number of studies on normalizing noisy non-English text, including Chinese, Spanish, French, and Malay [6], [7], [8], [9], [10]. However, in this study, we focus on the normalization of English Tweets. It has been discovered that standard MT techniques work on normalizing short message service (SMS) messages with minor or no customization [11]. They present a phrase-level statistical machine translation (SMT) normalization method with an adapted phrase alignment system. Later on, Kaufmann and Kalita [12] prove that the same is true for Twitter messages.


The main problem of SMT approaches is the lack of training data. Constructing an annotated corpus that covers ill-formed words well enough is a time-consuming task. To build an automatic speech recognition (ASR)-like normalization system, Kobus, Yvon, and Damnati [13] first transform SMS text tokens into phonetic tokens and then convert them back into words via phonetic dictionary lookup. They also fuse the ASR-like system with the SMT metaphor to augment the accuracy of the results. To access extensive parallel data for the training of the SMT-like system, Gadde, Goutam, Shah, and Sagar [14] artificially generate parallel text in six controlled ways: character removal, phonetic swap, decapitalization/capitalization, word combination, typing mistakes, and word elimination. Additionally, the SMT approach is considered effective for strengthening text-to-speech (TTS) systems [15]. Furthermore, text normalization has been handled through another distinguished NLP metaphor: spelling correction [16]. The spelling correction metaphor accomplishes the normalization task on a word-by-word basis by employing a supervised noisy channel model. They built a hidden Markov model using manually annotated training data. However, unlike the SMT-like system, the model disregards the context around the token. Cook and Stevenson [17] extended the work by adding three word-formation probabilities (prefix clipping, stylistic variation, and subsequence abbreviation). An architecture has been introduced to normalize French text messages using weighted finite state machines [18]. The architecture is formed by three modules in a pipeline. The first and the last modules accomplish tokenization and de-tokenization, respectively, based on a set of hand-written rules. The second module normalizes the tokens based upon a trained phonetic model. Although they obtain significant accuracy in terms of BLEU and WER scores, the SER is disappointing because of the phonetic complexity of the French language.



Han and Baldwin [19] created linear SVM classifiers for discovering ill-formed words and generated corrections based on morphophonemic similarity. The most appropriate candidate is found using several measures: phonemic and lexical edit distance, longest common subsequence (LCS), affix substring, dependency-based frequency, and a language model (LM). Despite the high BLEU and F-scores obtained, their approach performs poorly on highly noisy Tweets. Using their method in a time-sensitive Twitter search yields increased accuracy [20]. In order to normalize casual English text, Clark and Araki [21] design a dictionary that keeps phrases. The dictionary includes 1043 entries and end-users can add more. Its trie data structure makes context-aware look-up and prefix search feasible. In 2012, Han, Cook, and Baldwin [22] generated possible normalization pairs to compile a dictionary. They prove that a dictionary-based approach can outperform sophisticated approaches in terms of F-score and word error rate. For each OOV word, the most morphophonemically similar in-vocabulary (IV) word was chosen from 10 million English Tweets. This inspired us to build a dictionary-based module in our candidate generation stage.

3 OOV WORD DETECTION
The first step of most text processing systems is tokenization. We evaluated most of the existing tokenization methods in our extremely noisy context, from the Stanford tokenizer to rule-based tokenizers. Nonetheless, we found that a simple word-splitting method that separates words on white-space characters is more suitable for this type of context. The first step in most normalization systems is to discriminate between OOV and IV words. Here, we detect OOV words by checking them against the Ispell dictionary (Note 1), which is a list of English words along with common proper names such as names of cities and countries.


The OOV detection module also includes the following heuristic rules:
1. It is an IV word if the word is only one character long and that character is one of the special characters.
2. It is an IV word if the last character of the word is a special character and the remaining characters can be found in Ispell.
3. It is an IV word if the initial letter of the word is uppercase and the remaining letters are lowercase.
4. It is an IV word if it is a URL or an email address (detected by a regular expression).
5. Twitter coined its own writing style for mentioning topics and users. It is an IV word if it is a Twitter topic or username.
6. Words with one or two digits can be Twitter-style words (e.g. 2night, 10x), but words that contain more than two digits are likely code words. It is an IV word if it contains more than two digits.
7. All words with only one character (e.g. a, 2, 4, and I) are considered OOV words.
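To make the rule set concrete, the following is a minimal C++ sketch of such a rule-based IV/OOV check. It is an illustration rather than the actual implementation: the helper names (isIV, inIspell, isSpecialChar), the regular expression for URLs and email addresses, and the early handling of single-character tokens are our own assumptions.

#include <cctype>
#include <regex>
#include <string>
#include <unordered_set>

// Hypothetical helper: the real system consults the Ispell word list.
static bool inIspell(const std::string& w, const std::unordered_set<std::string>& ispell) {
    return ispell.count(w) > 0;
}

static bool isSpecialChar(char c) {
    return std::ispunct(static_cast<unsigned char>(c));
}

// Returns true if a token should be treated as in-vocabulary (IV).
// Rules 1 and 7 are checked first so that single letters such as "a" or "I"
// come out as OOV, as the rule list above specifies.
bool isIV(const std::string& w, const std::unordered_set<std::string>& ispell) {
    static const std::regex urlOrEmail(R"((https?://\S+)|(\S+@\S+\.\S+))");
    if (w.empty()) return false;
    if (w.size() == 1) return isSpecialChar(w[0]);                  // rules 1 and 7
    if (isSpecialChar(w.back()) &&
        inIspell(w.substr(0, w.size() - 1), ispell)) return true;   // rule 2
    bool capitalised = std::isupper(static_cast<unsigned char>(w[0])) != 0;
    for (size_t i = 1; capitalised && i < w.size(); ++i)
        capitalised = std::islower(static_cast<unsigned char>(w[i])) != 0;
    if (capitalised) return true;                                   // rule 3
    if (std::regex_match(w, urlOrEmail)) return true;               // rule 4: URL or email
    if (w[0] == '#' || w[0] == '@') return true;                    // rule 5: topic or username
    int digits = 0;
    for (char c : w)
        if (std::isdigit(static_cast<unsigned char>(c))) ++digits;
    if (digits > 2) return true;                                    // rule 6: likely a code word
    return inIspell(w, ispell);                                     // ordinary dictionary check
}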

4 CANDIDATE GENERATION
For each detected OOV word, we generate six sets of corrected candidates using specific modules, as illustrated in Table 1. The lowest figure belongs to the number of candidates generated by the Malay dictionary method. Two-edit-distance lexical generation produces the highest number of candidates, giving the highest recall and the lowest precision. The combination of phoneme and lexicon, lexical (one-edit distance), phonemic, and heuristic-rule methods rank 2, 3, 4, and 5, respectively.


Table 1 Six different candidate generation schemes

No.  Type of generation                    Average number of candidates (for a 5-letter OOV word)
1.   Lexically (two-edit distance)         90
2.   Combination of phoneme and lexicon    70
3.   Lexically (one-edit distance)         40
4.   Phonemically                          30
5.   Heuristic rules                       20
6.   Malay dictionary                      3

The phoneme module generates potential candidates based on the word's phonemic sound. This module uses the Phonetisaurus application [23] to convert the lexical form into phonemes. After converting the lexical form into the 10 best nearby phoneme sequences, the module looks for matching words in the CMU dictionary, a pronouncing dictionary created by Carnegie Mellon University. Because the CMU dictionary contains an enormous number of OOV words, the module finally tests the generated candidates against the Ispell dictionary.
The one-distance lexical module generates 54n+25 compositions for a word of length n via four modifying strategies: 1) the deletion strategy drops characters in different positions (e.g. yuo → uo, yo, yu), making n compositions; 2) the transposition strategy swaps two adjacent characters (e.g. yuo → uyo, you), making n-1 compositions; 3) the alteration strategy changes each character to every alphabet letter (e.g. yuo → auo, buo, cuo, duo, euo, fuo, guo, etc.), making 26n compositions; 4) the insertion strategy assumes a character was dropped and thus injects every alphabet letter between characters (e.g. yuo → ayuo, byuo, cyuo, dyuo, etc.), making 26(n+1) compositions. Finally, all compositions are filtered through Ispell to obtain the set of candidates.
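As a concrete illustration of the four strategies, the C++ sketch below enumerates the 54n+25 compositions for a lowercase word of length n; the function name and surrounding plumbing are illustrative assumptions, not the system's actual code.

#include <string>
#include <vector>

// Enumerate all one-edit-distance compositions of a lowercase word:
// n deletions, n-1 transpositions, 26n alterations and 26(n+1) insertions,
// i.e. 54n+25 compositions in total (duplicates are removed later by the
// dictionary filter).
std::vector<std::string> oneEditCompositions(const std::string& w) {
    static const std::string alphabet = "abcdefghijklmnopqrstuvwxyz";
    std::vector<std::string> out;
    for (size_t i = 0; i < w.size(); ++i)                       // deletion
        out.push_back(w.substr(0, i) + w.substr(i + 1));
    for (size_t i = 0; i + 1 < w.size(); ++i) {                 // transposition
        std::string t = w;
        std::swap(t[i], t[i + 1]);
        out.push_back(t);
    }
    for (size_t i = 0; i < w.size(); ++i)                       // alteration
        for (char c : alphabet)
            out.push_back(w.substr(0, i) + c + w.substr(i + 1));
    for (size_t i = 0; i <= w.size(); ++i)                      // insertion
        for (char c : alphabet)
            out.push_back(w.substr(0, i) + c + w.substr(i));
    return out;
}

Each composition is then kept only if it appears in the Ispell dictionary, and two-edit-distance candidates are obtained by running the same function again over every one-edit composition.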


Edit distance is the number of edits it takes to convert one word into another. An edit can be an insertion, transposition, deletion, or alteration. The spelling correction literature shows that 80% to 95% of errors can be addressed with a one-edit distance and 98% of them with a two-edit distance. Therefore, we neglect edit distances greater than two. To obtain two-edit-distance candidates, we simply apply the one-edit-distance module to the compositions produced by the one-edit-distance module itself.
The Malay dictionary module is responsible for generating candidates for Malay words. Seeing that the proposed approach will be employed to normalize English Tweets created by Singaporeans, it is anticipated that there will be Malay words in our context. To obtain the Malay translation, we load our English-Malay dictionary (Note 2) into a C++ vector data structure.
The combination module is responsible for generating candidates that are modified both lexically and phonemically. The module sends the detected OOV word to the one-distance lexical component and applies the phoneme module to the results of the lexical module.
The heuristic module is designed to contain heuristic rules that can facilitate candidate generation. Currently, the module contains only one rule: the splitting rule. It is responsible for resolving combined OOV words. For example, it can convert "alot" to "a lot". Heuristically, it splits a word into two parts and checks whether both parts are IV words.

5 DATA STRUCTURE
In the process of normalization, there are two important factors to consider: speed and reliability. In order to speed up the process, we use a nested C++ vector data structure. As Figure 1 shows, the design of the data structure is straightforward in order to facilitate easy maintenance. Figure 1 refers to an example, "2day was e", a phrase with two OOV words and one IV word. Six different sets of candidates are stored for each OOV word along with their probabilities.
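A minimal sketch of such a nested layout is given below, assuming hypothetical type and field names; it mirrors Figure 1 rather than reproducing the system's exact classes.

#include <string>
#include <vector>

// One normalization candidate and its probability scores (e.g. the LM score,
// plus further scores that can be added later).
struct Candidate {
    std::string word;
    std::vector<double> probs;   // Prob 1 ... Prob n
};

// The six candidate sets produced by the generation modules.
struct CandidateSets {
    std::vector<Candidate> oneEditLexical;
    std::vector<Candidate> twoEditLexical;
    std::vector<Candidate> phoneme;
    std::vector<Candidate> heuristic;
    std::vector<Candidate> malayTranslation;
    std::vector<Candidate> combined;
};

// One token of the input sentence: "2day" (OOV), "was" (IV), "e" (OOV), ...
struct Token {
    std::string surface;
    bool isOOV = false;
    CandidateSets candidates;    // left empty for IV tokens
};

using Sentence = std::vector<Token>;   // the nested vector layout of Figure 1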



Although in this paper we describe only one probability generation method (the language model), the structure enables us to store additional probability scores.

[Figure 1 shows the nested structure for the example phrase "2day was e": each token is indexed (0: "2day", OOV; 1: "was", IV; 2: "e", OOV), and each OOV token holds candidate sets from the six generation modules (one-distance lexical, two-distance lexical, phoneme, heuristic rules, Malay translation, and combining), each candidate carrying probability scores Prob 1 … Prob n.]

Figure 1 Data structure of the normalization system

6 CANDIDATE SELECTION
The most significant contribution of this project is the proposed candidate selection method. In order to select the most suitable candidates, we calculate each candidate's probability score in its context using a language model (LM). A trigram language model is generated from the Gigaword corpus using the SRILM toolkit [24]. With a trigram language model, we can only produce a probability score for a candidate with respect to three words, which is not accurate enough.


Table 2 Weakness of the trigram language model

Word order:          1    2        3    4         5     6    7     8
Misspelled writing:  h    punched  the  stranger  with  her  tiny  fists
Candidate 1:         he   punched  the  stranger  with  her  tiny  fists
Candidate 2:         she  punched  the  stranger  with  her  tiny  fists

Table 2 presents an example that elaborates the weakness of the trigram language model. The word "she" has been mistakenly shortened to "h". Candidate one generates "he" by inserting "e" via the one-edit-distance lexical method. Candidate two generates "she" by inserting "e" and "s" via the two-edit-distance lexical method. Our trigram LM will assign a higher probability to "he punched the", while the correct selection would be "she punched the stranger with her tiny fists."



In order to calculate an LM probability that considers all words in a single view, we propose an algorithm named ChainedLM. Existing LM-building algorithms, such as the Whole Sentence Exponential Model, are satisfactory for building models only up to 4-grams in terms of memory, space, and accuracy. While generic path-finding algorithms such as A* exist, this particular problem can be handled more appropriately with its own specific solution. The ChainedLM algorithm includes n-2 stages, where n is the number of words in a sentence. Table 3 illustrates the first stage of ChainedLM when normalizing "a hav 2 dres quite $martly 4 work". For the sake of simplicity, it is assumed that each OOV word brings only four IV candidates. Only the first stage comprises all permutations of the first three words. Table 3 displays an example in which all of the first three words are OOV, causing 64 permutations. At each stage, the top three most probable phrases are selected if their probability scores are above a preset threshold. The most probable phrase is not inspected against the threshold, because we need at least one phrase.

If the second and third most likely phrases fall below the threshold, they are eliminated. Table 4 shows the second stage, where the LM probability of the second, third, and fourth words is calculated. Here we do not need to observe all permutations: only the candidates of the second word selected in the first stage are considered at this stage, causing a smaller number of permutations. In the second stage, again up to three phrases are selected. The process of candidate selection in stages three, four, and five is similar to stage two, as displayed in Table 5, Table 6, and Table 7, respectively. Table 8 shows our last stage, where only one phrase is selected according to its LM probability score. After accomplishing the last stage, a recursive algorithm is employed to determine the normalized sentence, as illustrated in Figure 2. Since the last stage generates only one phrase, we can follow a backward chain to reach the first word of the sentence. The implementation of our candidate selection algorithm in C++ shows that it is fast enough to be employed in real-time applications.
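The following C++ sketch outlines the staged selection described above. The trigram scorer is passed in as a function (e.g. a thin wrapper around the SRILM model); the type names, the pruning helper, and the bookkeeping details are illustrative assumptions, while the beam of at most three phrases per stage and the threshold rule follow the description above.

#include <algorithm>
#include <functional>
#include <string>
#include <vector>

struct Hypothesis {
    std::string w1, w2, w3;   // the trigram covered at this stage
    double logProb;           // LM score of this trigram
    int parent;               // index into the previous stage (-1 for stage one)
};

// Keep the single best hypothesis unconditionally and up to two more above the threshold.
static std::vector<Hypothesis> prune(std::vector<Hypothesis> stage, double threshold) {
    std::sort(stage.begin(), stage.end(),
              [](const Hypothesis& a, const Hypothesis& b) { return a.logProb > b.logProb; });
    std::vector<Hypothesis> kept;
    for (size_t k = 0; k < stage.size() && kept.size() < 3; ++k)
        if (k == 0 || stage[k].logProb > threshold)
            kept.push_back(stage[k]);
    return kept;
}

// candidates[i] lists the IV candidates of word i (an IV word lists only itself).
std::vector<std::string> chainedLM(
        const std::vector<std::vector<std::string>>& candidates,
        const std::function<double(const std::string&, const std::string&,
                                   const std::string&)>& scoreTrigram,
        double threshold) {
    const size_t n = candidates.size();               // n >= 3 words, hence n-2 stages
    std::vector<std::vector<Hypothesis>> stages;

    // Stage one: all permutations of the candidates of the first three words.
    std::vector<Hypothesis> first;
    for (const auto& a : candidates[0])
        for (const auto& b : candidates[1])
            for (const auto& c : candidates[2])
                first.push_back({a, b, c, scoreTrigram(a, b, c), -1});
    stages.push_back(prune(first, threshold));

    // Stages two .. n-2: extend only the surviving hypotheses by the next word.
    for (size_t i = 3; i < n; ++i) {
        std::vector<Hypothesis> next;
        const auto& prev = stages.back();
        for (size_t p = 0; p < prev.size(); ++p)
            for (const auto& c : candidates[i])
                next.push_back({prev[p].w2, prev[p].w3, c,
                                scoreTrigram(prev[p].w2, prev[p].w3, c),
                                static_cast<int>(p)});
        stages.push_back(prune(next, threshold));
    }

    // Last stage: only the single most probable phrase survives; follow the
    // backward chain of parents to recover the normalized sentence (Figure 2).
    std::vector<std::string> result;
    int idx = 0;                                      // best hypothesis of the final stage
    for (int s = static_cast<int>(stages.size()) - 1; s >= 0; --s) {
        const Hypothesis& h = stages[s][idx];
        result.insert(result.begin(), h.w3);
        if (s == 0) {                                 // stage one contributes its full trigram
            result.insert(result.begin(), h.w2);
            result.insert(result.begin(), h.w1);
        }
        idx = h.parent;
    }
    return result;
}

Because every stage keeps at most three hypotheses, the number of trigram evaluations grows only linearly with the sentence length, which is consistent with the observation that the C++ implementation is fast enough for real-time use.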

Table 3 Stage one

Original sentence:      a hav 2 dres quite $martly 4 work
Choices (64 cases):     all permutations of the candidates for the first three words,
                        e.g. "I has to", "I has 2nd", "I has two", "I has 2",
                        "I have to", "I had to", "I had two", "I hag 2",
                        "a has to", "a have to", "a have two", ...
Highest probabilities:  "I had to", "I have to", "I have two"



Table 4 Stage two

Original sentence:      I hav 2 dres quite $martly 4 work
Choices (32 cases):     the second-word candidates kept from stage one combined with
                        the candidates of the third and fourth words, e.g.
                        "have to press", "have to dress", "have to dries", "have to drew",
                        "have 2nd press", "have two dress", "had to press", "had to dress", ...
Highest probabilities:  "have to press", "have to drew", "have to dress"

Table 5 Stage three

Original sentence:      I hav 2 dres quite $martly 4 work
Choices (16 cases):     e.g. "to press quite", "to dress quite", "to dries quite",
                        "to drew quite", "2 press quite", "2 dress quite", ...
Highest probabilities:  "to dress quite", "to drew quite"

Table 6 Stage four

Original sentence:      I hav 2 dres quite $martly 4 work
Choices (16 cases):     e.g. "dress quite tartly", "dress quite partly", "dress quite mart",
                        "dress quite smartly", "drew quite tartly", "drew quite smartly", ...
Highest probabilities:  "dress quite smartly", "drew quite smartly"




Table 7 Stage five

Original sentence:      I hav 2 dres quite $martly 4 work
Choices (16 cases):     e.g. "quite tartly four", "quite partly four", "quite tartly for",
                        "quite partly for", "quite smartly for", "quite mart 4",
                        "quite smartly fourth", ...
Highest probabilities:  "quite tartly for", "quite partly for", "quite smartly for"

Table 8 Stage six

Original sentence:      I hav 2 dres quite $martly 4 work
Choices (16 cases):     e.g. "tartly four work", "tartly for work", "partly for work",
                        "mart 4 work", "smartly for work", "smartly fourth work", ...
Highest probability:    "smartly for work"




[Figure 2 illustrates the backward chain: starting from the single phrase selected in the last stage ("smartly for work"), the algorithm follows the trigram phrases chosen at the earlier stages ("quite smartly for", "dress quite smartly", "to dress quite", "have to dress", "I have to") to recover the normalized sentence "I have to dress quite smartly for work".]

Figure 2 Recursive selecting strategy

7 CONCLUSION AND FUTURE WORKS
In this paper, we presented a two-phase approach for noisy text normalization. The first phase is candidate generation, encompassing six methods: (1) the phonemic generation method converts the word into its most probable phoneme sequences and finds them in the CMU dictionary; (2) the one-distance lexical generation method performs insertion, deletion, transposition, and alteration upon a word; (3) the two-distance lexical generation method produces candidates at a two-edit distance by employing the previous method twice in a nested manner; (4) the Malay dictionary method translates Malay words into their English meanings via a simple dictionary look-up;


(5) the combination of the first and second methods helps to cover candidates unseen by either alone; and (6) the heuristic generation method is a placeholder for heuristic rules that generate candidates; currently there is only one heuristic rule, which splits up joined OOV words (e.g. "alot" → "a lot"). After placing the OOV words in the designed fusion data structure, our candidate selection method selects the most appropriate candidates according to their language model probability scores. We have illustrated a unique candidate selection method in which the selection of a candidate depends on all the other words of the sentence. The method includes n-2 stages, where n is the number of words in a sentence. In each stage, the LM probability scores of the candidates are calculated.



After performing the n-2 stages, a recursive algorithm is employed to obtain the final normalized sentence. Our preliminary experimental results on the Choudhury dataset [16] and a Singaporean English Twitter corpus show the significant accuracy of the approach. Since this approach is part of an ongoing project, a more detailed inspection of the experimental results will be published in extended versions of this paper. In our future work, we plan to appraise more probability scores, including lexical dependency, hierarchical distance [25], and positional indexing, and to combine them using Maximum Entropy.

8 NOTES

1. English dictionaries are distributed with Ispell, available from http://fmgwww.cs.ucla.edu/fmgmembers/geoff/ispell.html
2. English-Malay dictionary compiled by the Human Language Technology department of I2R.

9 ACKNOWLEDGEMENT
We gratefully acknowledge the University of Malaya for supporting this research through UMRG Grant (RG089/12ICT).

10 REFERENCES
[1] M. Bieswanger, "2 abbrevi8 or not 2 abbrevi8: A Contrastive Analysis of Different Space- and Time-Saving Strategies in English and German Text Messages," Texas Linguistic Forum, vol. 50, 2007.
[2] C. Thurlow and A. Brown, "Generation Txt? The sociolinguistics of young people's text-messaging," 2003.
[3] A. Mislove, S. Lehmann, Y.-Y. Ahn, J.-P. Onnela, and J. N. Rosenquist, "Understanding the Demographics of Twitter Users," in ICWSM, 2011.
[4] D. Lopresti, S. Roy, K. Schulz, and L. V. Subramaniam, "Special issue on noisy text analytics," Int. J. Doc. Anal. Recognit., vol. 14, no. 2, pp. 111–112, Apr. 2011.
[5] F. Liu, F. Weng, and X. Jiang, "A Broad-Coverage Normalization System for Social Media Language," in Proc. 50th Annu. Meet. Assoc. Comput. Linguist., Vol. 1: Long Papers, 2012, pp. 1035–1044.
[6] S. B. Basri, R. Alfred, and C. K. On, "Automatic spell checker for Malay blog," in Control System, Computing and Engineering (ICCSCE), 2012 IEEE International Conference on, 2012, pp. 506–510.
[7] N. Samsudin, M. Puteh, A. R. Hamdan, and M. Z. A. Nazri, "Normalization of Common Noisy Terms in Malaysian Online Media," in Proceedings of the Knowledge Management International Conference, 2012, pp. 515–520.
[8] M. A. Saloot, N. Idris, and R. Mahmud, "An architecture for Malay Tweet normalization," Inf. Process. Manag., vol. 50, no. 5, pp. 621–633, 2014.
[9] A. Wang, M.-Y. Kan, D. Andrade, T. Onishi, and K. Ishikawa, "Chinese Informal Word Normalization: an Experimental Study," in International Joint Conference on Natural Language Processing, 2013, pp. 127–135.
[10] M. Arshi Saloot and S. D. Bhavani, "Social Network Security Using Anomaly Detection," in 4th International Conference on Computer and Automation Engineering, 2012, pp. 43–49.
[11] A. Aw, M. Zhang, J. Xiao, and J. Su, "A Phrase-based Statistical Model for SMS Text Normalization," in Proceedings of the COLING/ACL on Main Conference Poster Sessions, 2006, pp. 33–40.
[12] M. Kaufmann and J. Kalita, "Syntactic normalization of Twitter messages," in Int. Conf. Nat. Lang. Process., Kharagpur, India, 2010.
[13] C. Kobus, F. Yvon, and G. Damnati, "Normalizing SMS: Are Two Metaphors Better Than One?," in Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, 2008, pp. 441–448.
[14] P. Gadde, R. Goutam, R. Shah, and H. Sagar, "Experiments with artificially generated noise for cleansing noisy text."
[15] V. Lopez Ludeña, R. San Segundo, J. M. Montero, R. Barra Chicote, and J. Lorenzo, "Architecture for Text Normalization using Statistical Machine Translation techniques," in IberSPEECH 2012, 2012, pp. 112–122.
[16] M. Choudhury, R. Saraf, V. Jain, S. Sarkar, and A. Basu, "Investigation and Modeling of the Structure of Texting Language," pp. 63–70, 2007.
[17] P. Cook and S. Stevenson, "An Unsupervised Model for Text Message Normalization," pp. 71–78, 2009.
[18] R. Beaufort, S. Roekhaut, L.-A. Cougnon, and C. Fairon, "A Hybrid Rule/Model-based Finite-state Framework for Normalizing SMS Messages," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010, pp. 770–779.
[19] B. Han and T. Baldwin, "Lexical Normalisation of Short Text Messages: Makn Sens a #Twitter," in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, 2011, pp. 368–378.
[20] Z. Wei, L. Zhou, B. Li, K.-F. Wong, and W. Gao, "Exploring Tweets Normalization and Query Time Sensitivity for Twitter Search," in TREC, 2011.
[21] E. Clark and K. Araki, "Text Normalization in Social Media: Progress, Problems and Applications for a Pre-Processing System of Casual English," Procedia - Soc. Behav. Sci., vol. 27, pp. 2–11, 2011.
[22] B. Han, P. Cook, and T. Baldwin, "Automatically Constructing a Normalisation Dictionary for Microblogs," in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 421–432.
[23] J. Novak, D. Yang, N. Minematsu, and K. Hirose, "Phonetisaurus: A WFST-driven phoneticizer," The University of Tokyo, Tokyo Institute of Technology, pp. 221–222, 2011.
[24] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, 2002, pp. 257–286.
[25] Y. Shang, "Phase Transition in Long-Range Percolation on Bipartite Hierarchical Lattices," Sci. World J., vol. 2013, pp. 1–5, 2013.

