
Table of Contents

Chapter 6. Applications to Text Mining ......................................... 245
  6.1. Centroid-based Text Classification ...................................... 247
    6.1.1. Formulation of centroid-based text classification ................... 248
    6.1.2. Effect of Term Distributions ........................................ 251
    6.1.3. Experimental Settings and Results ................................... 253
  6.2. Document Relation Extraction ............................................ 258
    6.2.1. Document Relation Discovery using Frequent Itemset Mining ........... 259
    6.2.2. Empirical Evaluation using Citation Information ..................... 259
    6.2.3. Experimental Settings and Results ................................... 264
  6.3. Application to Automatic Thai Unknown Word Detection .................... 269
    6.3.1. Thai Unknown Words as a Word Segmentation Problem ................... 271
    6.3.2. The Proposed Method ................................................. 271
    6.3.3. Experimental Settings and Results ................................... 280


Chapter 6. Applications to Text Mining

As one application of data mining, text mining is a knowledge-intensive process that deals with a document collection over time by means of a set of analysis and natural language processing tools. Text mining seeks to extract useful information from a large pile of textual data sources through the identification and exploration of interesting patterns. The data sources can be electronic documents, e-mail, web documents or any textual collections, and interesting patterns are found not in formalized database records but, instead, in the unstructured textual data in the documents of these collections. Text mining and data mining share many high-level architectural similarities, including preprocessing routines, pattern-discovery algorithms, and visualization tools for presenting mining results. While data mining assumes that data have already been stored in a structured format and its preprocessing focuses on data cleansing and transformation, preprocessing in text mining centers on feature extraction, i.e., usually the extraction of keywords from natural language documents. The number of features in text mining tends to be much larger than that in data mining, since the features in text mining are words, which are highly varied. Most text mining works exploit techniques and methodologies from the areas of information retrieval, information extraction, and corpus-based computational linguistics. This chapter presents three examples of text mining applications: text classification, document relation extraction and unknown word detection in the Thai language. The original literature related to these three applications can be found in (Lertnattee and Theeramunkong, 2004a), (Sriphaew and Theeramunkong, 2007a) and (TeCho et al., 2009b). Before explaining the applications, some basic concepts of text processing are provided as follows.

Towards text mining, several preprocessing techniques have been proposed to transform raw textual data into structured document representations. Most techniques aim to use and produce domain-independent linguistic features with natural language processing (NLP) techniques. There are also text categorization and information extraction (IE) techniques, which directly deal with domain-specific knowledge. Note that a document is an abstract object, so there is a variety of possible actual representations for it. To exploit the information in documents, we need a so-called document structuring process which transforms the raw representation into some kind of structured representation. To solve this task, at least three subtasks need to be solved: (1) the text preprocessing task, (2) problem-independent tasks, and (3) problem-dependent tasks. As the first subtask, text preprocessing converts the raw representation into a structure suitable for further linguistic processing. For example, when the raw input is a document image or recorded speech, preprocessing converts the raw input into a stream of text, sometimes with text structures such as paragraphs, columns and tables, as well as some document-level fields such as author, title, and abstract given by the visual presentation. To convert document images to text, optical character recognition (OCR) is used, while speech recognition can be applied to transform audio speech into text. As the second subtask, the problem-independent tasks process text documents using general knowledge of natural language. The tasks may include word segmentation or tokenization, morphological analysis, POS tagging, and syntactic parsing in either shallow or deep processing. The output of these tasks is not specific to any particular problem, but is typically employed for further problem-dependent processing. Domain-related knowledge, however, can often enhance the performance of general-purpose NLP tasks and is often used at different levels of processing. As the last step, the problem-dependent tasks attempt to output a final representation suitable for the task concerned, such as text categorization or information extraction. However, it has been shown that the different analysis levels, including the phonetic, morphological, syntactic, semantic and pragmatic levels, occur simultaneously and depend on each other, and even now how humans process a language remains unrevealed. Some works have tried to combine such levels into one single process but have not yet achieved a satisfactory level. Therefore most text understanding methods use the divide-and-conquer strategy, separating the whole problem into several subtasks and solving them independently as follows.

Tokenization and Word Segmentation
An important step towards text analysis is to break down a continuous character stream into meaningful constituents, such as chapters, sections, paragraphs, sentences, words, and even syllables or phonemes. Tokenization is the process of breaking the text into sentences and words. In English, the main challenge is to identify sentence boundaries, since a period can be used both as the end of a sentence and as part of a preceding token such as Dr., Mr., Ms., Prof., St., No. and so on. In general, a tokenizer may extract token features, such as the type of capitalization, inclusion of digits, punctuation, special characters, and so on. These features usually describe some superficial property of the sequence of characters that makes up the token. For languages without explicit word boundaries, such as Thai, Japanese, Korean and Chinese, word segmentation is necessary. This processing is very important for constructing the fundamental units for processing such languages. A small tokenization sketch is given after this overview of the preprocessing subtasks.

Part-of-Speech (POS) Tagging
POS tagging is the process of assigning a word type (category) to each word in a sentence with the appropriate POS tag based on the context in which it appears. The POS tag of a word specifies the role the word plays in the sentence where it appears. It also provides initial information related to the semantic content of the word. Among several works, the most common set of tags includes seven different tags, i.e., article, noun, verb, adjective, preposition, number, and proper noun. Some systems contain a much more elaborate set of tags; for instance, there are at least 87 basic tags in the complete Brown Corpus. More types of tags mean a more detailed analysis.

Syntactical Parsing
Syntactical parsing is a process that applies a grammar to detect the structure of a sentence. In the sentence structure, common constituents in grammars include noun phrases, verb phrases, prepositional phrases, adjective phrases, and subordinate clauses. Following grammar rules, each phrase or clause may consist of smaller phrases or words. For deeper analysis, the syntactical structure of sentences may also elaborate the roles of different phrases, such as a noun phrase acting as a subject, an object, or a complement. In the grammar, it is also possible to specify dependency among phrases or clauses at several different levels. After analyzing a sentence, the output can be represented as a sentence graph with connected components.

Shallow Parsing
In real situations, it is not easy to fully analyze the structure of a sentence, since language usage is sometimes complicated and flexible; it is almost impossible to construct a grammar that covers all cases. Moreover, when we try to revise a grammar to cover special cases, a lot of ambiguity is triggered in the grammar as a by-product. Such ambiguity needs to be resolved by higher processes, such as semantic or pragmatic processing. For this reason, traditional full-parsing algorithms are normally too computationally expensive to process a large number of sentences in a very large corpus, and they are also not robust enough. Instead of a full analysis, shallow parsing is a practical alternative: it does not perform a complete analysis of a whole sentence but only treats those parts of the sentence that are simple and unambiguous. For example, shallow parsing finds only small and simple noun and verb phrases, but not complex clauses. Therefore we can gain speed and robustness of processing by sacrificing depth of analysis. The most prominent dependencies might be formed, but unclear and ambiguous ones are left unresolved. For the purposes of information extraction, shallow parsing is usually sufficient and therefore preferable to full analysis because of its far greater speed and robustness.
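As a rough illustration of the tokenization step described above, the following Python sketch splits raw English text into sentences and word tokens while treating a few common abbreviations (Dr., Mr., Prof., etc.) as non-terminal periods, and records simple token features such as capitalization and digit inclusion. The abbreviation list and the regular expression are illustrative assumptions only, not a production tokenizer.

```python
import re

# A few abbreviations whose trailing period should not end a sentence
# (illustrative subset only; a real tokenizer would use a larger lexicon).
ABBREVIATIONS = {"Dr.", "Mr.", "Ms.", "Prof.", "St.", "No."}

def split_sentences(text):
    """Break text into sentences at '.', '?' or '!' unless the token is a known abbreviation."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok.endswith((".", "?", "!")) and tok not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def tokenize(sentence):
    """Extract word tokens together with simple superficial features."""
    words = re.findall(r"[A-Za-z0-9']+", sentence)
    return [(w, {"capitalized": w[0].isupper(),
                 "has_digit": any(c.isdigit() for c in w)}) for w in words]

if __name__ == "__main__":
    text = "Dr. Smith visited St. Louis in 2007. He met Prof. Jones there!"
    for s in split_sentences(text):
        print(s, "->", [w for w, _ in tokenize(s)])
```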


Problem-Dependent Tasks
After text preprocessing and problem-independent processing, the final stage is to create meaningful representations for later, more sophisticated processing. Normally, to process a text, documents are expected to be represented as sets of features. Two common applications for text mining are text categorization (TC) and information extraction (IE). Both of these applications need a tagging (and sometimes parsing) process. TC and IE enable users to move from a machine-readable representation of the documents to a machine-understandable form of the documents. Text categorization, or text classification, is the task of assigning a category (also called a class) to each document, such as giving the class 'political' to a political news article. The number of groups depends on user preference. The set of all possible categories is usually manually predefined beforehand, and the categories are usually unrelated. Recently, however, multi-dimensional text classification and multi-class text classification have been explored intensively. Information extraction (IE) is the task of discovering important constituents in a text, such as what, where, who, whom, and when (5W). Without IE techniques, we would have much more limited knowledge discovery capabilities. IE is different from information retrieval (IR), which performs search. Information retrieval just discovers documents relevant to a given query and lets the user read the whole document. IE, on the other hand, aims to extract the relevant information and present it in a structured format, such as a table. IE can save us the time of reading the whole document by providing the essential information in a structured form.

6.1. Centroid-based Text Classification

With the fast growth of online text information, there is a pressing need to find and organize relevant information in text documents. For this purpose, automatic text categorization (also known as text classification) has become a significant tool for utilizing text documents efficiently and effectively. As an application, it can improve text retrieval since it allows class-based retrieval instead of full retrieval. Given statistics acquired from a training set of labeled documents, text categorization uses these statistics to assign a class label to a new document. In the past, a variety of classification models were developed in different schemes, such as probabilistic models (i.e., Bayesian classification), decision trees and rules, regression models, example-based models (e.g., k-nearest neighbor or k-NN), linear models, support vector machines, neural networks and so on. Among these methods, a variant of linear models called a centroid-based or linear discriminant model is attractive since it requires relatively less computation than other methods in both the learning and classification stages. The traditional centroid-based method can be viewed as a specialization of the so-called Rocchio method proposed by Rocchio (1971) and used in several works on text categorization (Joachims, 1997). Based on the vector space model, a centroid-based method computes beforehand, for each class (category), an explicit profile (or class prototype), which is a centroid vector for all positive training documents of that category. The classification task is to find the class most similar to the vector of the document we would like to classify, for example by means of cosine similarity. Despite the lower computation time, centroid-based methods have been shown to achieve relatively high classification accuracy. In a centroid-based model, an individual class is modeled by weighting the terms appearing in the training documents assigned to the class. This makes the classification performance of the model strongly dependent on the weighting method applied in the model. Most previous works on centroid-based classification focused on weighting factors related to the frequency patterns of words or documents in the class. Moreover, they are often obtained from statistics within a class (i.e., positive examples of the class). The most popular factors are term frequency (tf) and inverse document frequency (idf).


Text categorization or text classification (TC) is the task of assigning a Boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$, where $D$ is a domain of documents and $C$ is a set of predefined categories. A value of T (i.e., true) is assigned to $\langle d_j, c_i \rangle$ when the document $d_j$ is determined to belong to the category $c_i$. On the other hand, a value of F (i.e., false) is assigned to $\langle d_j, c_i \rangle$ when the document $d_j$ is determined not to belong to the category $c_i$. In general, text classification is composed of two main phases, called the model training phase and the classification phase. In the training phase, the task is to approximate the unknown target function $\Phi: D \times C \rightarrow \{T, F\}$ that describes how documents should be classified. Based on a training set, a function $\hat{\Phi}$ called the classifier (also called a rule, hypothesis, or model) is acquired as the result of this approximation. A good classifier is a model that coincides with the target function as much as possible.

The TC task discussed above is general. There are, however, some additional factors or constraints possible for this task, including single-label vs. multi-label, category-pivoted vs. document-pivoted, and hard vs. ranking classification. Single-label classification assigns exactly one category to each $d_j \in D$, while multi-label classification may give more than one category to the same $d_j$. A special case of single-label TC is binary TC, where each $d_j$ must be assigned either to a category $c_i$ or to its complement $\bar{c_i}$. From the pivot aspect, there are two different ways of using a text classifier. Given $d_j \in D$, the task of finding all $c_i \in C$ that the document $d_j$ belongs to is called document-pivoted classification. Alternatively, given $c_i \in C$, the task of finding all $d_j \in D$ that belong to the category $c_i$ is named category-pivoted classification. This distinction is more pragmatic than conceptual, and it matters when the sets C and D are not available in their entirety right from the start. Lastly, hard categorization assigns a T or F decision to each pair $\langle d_j, c_i \rangle$, while ranking categorization ranks the categories in C according to their estimated appropriateness to $d_j$, without taking any hard decision on any of them. The task of ranking categorization is to approximate the unknown target function by generating a classifier that matches the target function as much as possible; the result is a number between 0 and 1 assigned to each pair $\langle d_j, c_i \rangle$, representing the likelihood that the document $d_j$ is classified into the category $c_i$. Finally, for each $d_j$, a ranked list of categories is obtained. This list would be of great help to a human expert making the final categorization decision. By these definitions, the task focused on in this work is a single-label, category-pivoted and hard classification.

6.1.1. Formulation of centroid-based text classification

In centroid-based text categorization, an explicit profile of a class (also called a class prototype) is calculated and used as the representative of all positive documents of the class. The classification task is to find the class most similar to the document we would like to classify, by comparing the document with the class prototype of the focused class. This approach is characterized by at least three factors: (1) representation basics, (2) class prototype construction: term weighting and normalization, and (3) classification execution: query weighting and similarity definition. Their details are described in the rest of this section.

Representation Basics
The most frequently used document representation in IR and TC is the so-called bag of words (BOW), where the words in a document are used as the basics for representing that document. There are also some works that use additional information, such as word position and word sequence, in the representation. In centroid-based text categorization, a document (or a class) is represented by a vector using the vector space model with BOW. In this representation, each element (or feature) in the vector is equivalent to a unique word with a weight. The method of giving a weight to a word varies from work to work, as described in the following section. In a more general framework, the concept of an n-gram can be applied: instead of a single isolated word, a sequence of n words is used as the representation basics. In several applications, not specific to classification, the most popular n-grams are the 1-gram (unigram), 2-gram (bigram) and 3-gram (trigram). Alternatively, the combination of different n-grams, for instance the combination of unigram and bigram, can also be applied. The n-grams or their combinations form a set of so-called terms that are used for representing a document. Although a higher-order n-gram provides more information, which may improve classification accuracy, more training data and computational power are required. A small sketch of unigram and bigram term extraction is given below.
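To make the n-gram discussion concrete, here is a minimal sketch that builds unigram, bigram and combined unigram+bigram terms from an already tokenized document. The underscore-joined term format and the toy document are illustrative assumptions.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token sequence, joined by '_' to form single terms."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_terms(tokens, orders=(1,)):
    """Build the bag of terms for the requested n-gram orders, e.g. (1, 2) for unigram + bigram."""
    terms = []
    for n in orders:
        terms.extend(ngrams(tokens, n))
    return terms

if __name__ == "__main__":
    doc = ["centroid", "based", "text", "classification"]
    print(extract_terms(doc, orders=(1,)))    # unigrams
    print(extract_terms(doc, orders=(2,)))    # bigrams
    print(extract_terms(doc, orders=(1, 2)))  # combined unigram + bigram
```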

Class Prototype Construction: Term Weighting and Normalization
Once we obtain the set of terms in a document, it is necessary to represent them numerically. Towards this, term weighting is applied to set the level of contribution of a term to a document. In the past, most existing works applied term frequency (tf) and inverse document frequency (idf) in the form of $tf \times idf$ for representing a document. In the vector space model, given a set of documents $D = \{d_1, d_2, \ldots, d_n\}$, a document $d_j$ is represented by a vector $\vec{d_j} = \langle w_{1j}, w_{2j}, \ldots, w_{mj} \rangle$, where $w_{ij}$ is the weight assigned to the term $t_i$ in the document $d_j$. Here, assume that there are m unique terms in the universe. The representation of the document $d_j$ is defined as follows.

$$\vec{d_j} = \langle tf_{1j} \times idf_1,\; tf_{2j} \times idf_2,\; \ldots,\; tf_{mj} \times idf_m \rangle$$

In this definition, $tf_{ij}$ is the term frequency of the term $t_i$ in the document $d_j$, and $idf_i$ is defined as $idf_i = \log \frac{N}{df_i}$. Here, $N$ is the total number of documents in the collection and $df_i$ is the number of documents which contain the term $t_i$. Three alternative types of term frequency are (1) occurrence frequency, (2) augmented normalized term frequency and (3) binary term frequency. The occurrence frequency, the simplest and most intuitive one, corresponds to the number of occurrences of the term in a document. The augmented normalized term frequency is defined by $0.5 + 0.5 \times \frac{tf}{tf_{max}}$, where $tf$ is the occurrence frequency and $tf_{max}$ is the maximum term frequency in the document. This compensates for relatively high term frequencies in the case of long documents. It works well when there are many technically meaningful terms in documents. The binary term frequency is nothing more than 1 for presence and 0 for absence of the term in the document. Term frequency alone may not be enough to represent the contribution of a term in a document. To achieve better performance, the well-known inverse document frequency can be applied to reduce the impact of frequent terms that exist in almost all documents.

Besides term weighting, normalization is another important factor in representing a document or a class. Without normalization, the classification result will strongly depend on the document length. A long document is more likely to be selected than a short document, since it usually has higher term frequencies and more unique terms in its representation. The higher term frequencies of a long document increase the average contribution of its terms to the similarity between the document and the query. More unique terms also increase the similarity and the chances of retrieval of longer documents in preference over shorter documents. To solve this issue, all relevant documents should normally be treated as equally important for classification or retrieval. Normalization by document length is incorporated into the term weighting formula to equalize the length of document vectors. Although there are several normalization techniques, including cosine normalization and byte length normalization, cosine normalization is the most commonly used. It can solve the problem of overweighting due to both higher term frequency and more unique terms. The cosine normalization is done by dividing all elements in a vector by the length of the vector, that is

$$\hat{w}_{ij} = \frac{w_{ij}}{\sqrt{\sum_{k=1}^{m} w_{kj}^2}}$$

where $w_{ij}$ is the weight of the term $t_i$ in the document $d_j$ before normalization.
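The sketch below implements the three term-frequency variants, the idf factor and cosine normalization described above. The natural logarithm and the exact idf form log(N/df) are common choices assumed here for illustration; the toy corpus is made up.

```python
import math
from collections import Counter

def term_frequency(doc_tokens, mode="occurrence"):
    """Return term -> tf using occurrence, binary or augmented normalized term frequency."""
    counts = Counter(doc_tokens)
    if mode == "occurrence":
        return dict(counts)
    if mode == "binary":
        return {t: 1.0 for t in counts}
    if mode == "augmented":
        max_tf = max(counts.values())
        return {t: 0.5 + 0.5 * c / max_tf for t, c in counts.items()}
    raise ValueError("unknown term frequency mode: " + mode)

def inverse_document_frequency(docs):
    """idf_i = log(N / df_i) computed over a list of token lists."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: math.log(n / d) for t, d in df.items()}

def cosine_normalize(vector):
    """Divide every weight by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / length for t, w in vector.items()} if length > 0 else vector

def tfidf_vector(doc_tokens, idf, mode="occurrence"):
    """Cosine-normalized tf x idf vector of one document."""
    tf = term_frequency(doc_tokens, mode)
    return cosine_normalize({t: f * idf.get(t, 0.0) for t, f in tf.items()})

if __name__ == "__main__":
    docs = [["drug", "overdose", "warning"],
            ["drug", "dosage", "dosage", "warning"],
            ["patient", "information", "drug"]]
    idf = inverse_document_frequency(docs)
    print(tfidf_vector(docs[1], idf, mode="augmented"))
```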

Given a class with a set of its assigned documents, there are two possible alternatives to create a class prototype. One is to normalize each document vector in the class before summing up all document vectors to form a class prototype vector (normalization then merging). The other is to sum up all document vectors before normalizing the resulting vector (merging then normalization). The latter is also called a prototype vector, which is invariant to the number of documents per class. Both methods obtain high classification accuracy with small time complexity. The class prototype can be derived as follows. Let $D_c = \{\vec{d} \mid \vec{d}$ is a document vector belonging to the class $c\}$ be the set of document vectors assigned to the class $c$. The class prototype $\vec{c}$ is obtained by summing up all document vectors in $D_c$ and then normalizing the result by its size as follows.

$$\vec{c} = \frac{\sum_{\vec{d} \in D_c} \vec{d}}{\left\| \sum_{\vec{d} \in D_c} \vec{d} \right\|}$$
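A minimal sketch of prototype construction under both orders of operation, assuming the document vectors have already been weighted (for instance with tf-idf as sketched above); the dictionary-based sparse vectors are an illustrative choice.

```python
import math
from collections import defaultdict

def cosine_normalize(vector):
    length = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / length for t, w in vector.items()} if length > 0 else vector

def class_prototype(doc_vectors, normalize_first=False):
    """Build a class prototype from the weighted document vectors of one class.

    normalize_first=False : merging then normalization (the prototype vector above)
    normalize_first=True  : normalization then merging
    """
    summed = defaultdict(float)
    for vec in doc_vectors:
        vec = cosine_normalize(vec) if normalize_first else vec
        for term, weight in vec.items():
            summed[term] += weight
    return cosine_normalize(dict(summed))

if __name__ == "__main__":
    politics_docs = [{"election": 2.0, "vote": 1.0},
                     {"election": 1.0, "party": 3.0}]
    print(class_prototype(politics_docs))                        # merging then normalization
    print(class_prototype(politics_docs, normalize_first=True))  # normalization then merging
```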

Classification Execution: Query Weighting and Similarity Definition
The last, but not least important, factors are query weighting and similarity definition. For query weighting, the term weighting described above can also be applied to a query or a test document (i.e., a document to be classified). The simple term weighting for a query is $tf \times idf$. In the same way as in class prototype construction, there are three possible types of term frequency: occurrence frequency, augmented normalized term frequency and binary term frequency. Once a class prototype vector and a query vector have been constructed, the similarity between these two vectors can be calculated. The most popular measure is the cosine similarity, which in this setting can be calculated by the dot product between the two vectors. Therefore, the test document will be assigned to the class whose class prototype vector $\vec{c}$ is the most similar to the query vector $\vec{q}$ of the test document.

Here, as stated before, the length of the class prototype vector $\vec{c}$ is equal to 1 since it has been normalized. Moreover, the normalization of the test document has no effect on the ranking. Therefore, the test document is assigned to the class whose class prototype vector gives the highest dot product with the test document vector.
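Classification then reduces to a dot product between a weighted query vector and each normalized class prototype, as in the hedged sketch below; the prototypes and query weights are toy values standing in for vectors built as described above.

```python
def dot(u, v):
    """Dot product of two sparse vectors stored as dictionaries."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def classify(query_vector, prototypes):
    """Assign the class whose normalized prototype has the largest dot product with the query."""
    return max(prototypes, key=lambda c: dot(query_vector, prototypes[c]))

if __name__ == "__main__":
    prototypes = {"politics": {"election": 0.8, "vote": 0.6},   # assumed cosine-normalized
                  "sports":   {"match": 0.7, "score": 0.7}}
    query = {"election": 1.0, "score": 0.2}                     # weighted test-document vector
    print(classify(query, prototypes))                          # -> politics
```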


6.1.2. Effect of Term Distributions

Lertnattee and Theeramunkong (2004a, 2006a, 2006b, 2007b) have carried out a series of research works to investigate the effect of term distributions on classification accuracy, and the reader can find the full description of this work in those publications. In this section, a summary of this work is given. Here, three types of term distributions, called inter-class, intra-class and in-collection distributions, are introduced. These distributions are expected to increase classification accuracy by exploiting information about (1) the term distribution among classes, (2) the term distribution within a class and (3) the term distribution in the whole collection of training data. They are used to represent the importance, or classification power, of a term and to weight that term in a document. Another objective of this work is to investigate the pattern of how these term distributions should contribute to weighting a term in documents, for example whether a high distribution value of a word (or term) should promote or demote the importance of that word. Here, it is also possible to consider unigrams or bigrams as the document representation.

Term Distributions
The first question is what the characteristics of terms that are significant for representing a document or a class are. In general, we can observe that (1) a significant term should appear frequently in a certain class and (2) it should appear in few documents. These two properties can be handled by the conventional term frequency and inverse document frequency, respectively. However, we can further observe that (1) a significant term should not distribute very differently among documents in the whole collection, (2) it should distribute very differently among classes, and (3) it should not distribute very differently among documents in a class. These three characteristics cannot be represented by the conventional tf and idf; it is necessary to use distribution (relative information) instead of frequency (absolute information). Distribution-related information that we can exploit includes the distributions of terms among classes, within a class and in the whole collection. Three kinds of this information can be defined: inter-class standard deviation (icsd), class standard deviation (csd) and standard deviation (sd). Let $tf_{wdc}$ be the term frequency of the term $w$ in the document $d$ of the class $c$. The formal definitions of icsd, csd and sd are given below.

$$icsd_w = \sqrt{\frac{1}{|C|} \sum_{c \in C} \left( \overline{tf}_{wc} - \overline{tf}_{w} \right)^2}$$

$$csd_{wc} = \sqrt{\frac{1}{|D_c|} \sum_{d \in c} \left( tf_{wdc} - \overline{tf}_{wc} \right)^2}$$

$$sd_w = \sqrt{\frac{1}{|D|} \sum_{d \in D} \left( tf_{wd} - \overline{tf}_{w} \right)^2}$$

where $\overline{tf}_{wc} = \frac{1}{|D_c|} \sum_{d \in c} tf_{wdc}$ is the average term frequency of the term $w$ in all documents within the class $c$, $\overline{tf}_{w}$ is the corresponding average over the whole collection, $|C|$ is the number of classes, $|D_c|$ is the number of documents in the class $c$, and $|D|$ is the number of documents in the collection.


1. Inter-class standard deviation: The inter-class standard deviation of a term is calculated from the set of average frequencies $\{\overline{tf}_{wc}\}$, each of which is gathered from one class $c$. This deviation is an inter-class factor; therefore, the icsd of a term is independent of classes. A term with a high icsd distributes differently among classes and should have higher discriminating power for classification than other terms. This factor promotes a term that exists in almost all classes but whose frequencies in those classes are quite different. In this situation, the conventional factors tf and idf are not helpful.

2. Class standard deviation: The class standard deviation of a term in a class $c$ is calculated from the set of term frequencies $\{tf_{wdc}\}$, each of which is the frequency of that term in one document of the class. This deviation is an intra-class factor; therefore, the csds of a term vary class by class. A term may appear with quite different frequencies among the documents in a class, and this difference can be captured by this deviation. A term with a high csd appears in most documents of the class with quite different frequencies and should not be a good representative term of the class. A low csd of a term may be triggered by either of two reasons: the occurrences of the term are nearly equal for all documents in the class, or the term rarely occurs in the class.

3. Standard deviation: The standard deviation of a term is calculated from the set of term frequencies $\{tf_{wd}\}$, each of which is the frequency of that term in one document of the collection. This deviation is a collection factor; therefore, the sd of a term is independent of classes. A term may appear with quite different frequencies among the documents in the collection, and this difference can also be captured by this deviation. A term with a high sd appears in most documents in the collection with quite different frequencies. A low sd of a term may be caused by either of two reasons: the occurrences of the term are nearly equal for all documents in the collection, or the term rarely occurs in the collection.

Enhancement of Term Weighting using Term Distributions
The second question is how the above-mentioned term distributions contribute to term weighting. The term distributions, i.e., icsd, csd and sd, can enhance the performance of a centroid-based classifier with the standard $tf \times idf$ weighting. Two issues to consider are whether these distributions should act as a promoter (multiplier) or a demoter (divisor), and how strongly they affect the weight. To capture these characteristics, term weighting can be designed using the following skeleton, where $w_{ic}$ is the weight given to the term $t_i$ of the class $c$.

$$w_{ic} = tf_{ic} \times idf_i \times icsd_i^{\alpha} \times csd_{ic}^{\beta} \times sd_i^{\gamma}$$

The skeleton includes the factors of the three term distributions. The parameters $\alpha$, $\beta$ and $\gamma$ are numeric values used for setting the contribution levels of icsd, csd and sd to the term weighting, respectively. For each parameter, a positive number means the factor acts as a promoter, while a negative one means the factor acts as a demoter. Moreover, the larger the magnitude of a parameter is, the more that parameter contributes to the term weighting as either a promoter or a demoter.
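The following sketch computes the three distributions for a single term and applies the weighting skeleton above. It follows the standard-deviation style definitions reconstructed in this section (the exact averaging and smoothing used in the original publication may differ), and a small epsilon is added as a guard against zero deviations raised to negative powers; the default powers 0.5, -0.5, -0.5 correspond to the best setting reported later (TCB1).

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def term_distributions(tf_by_class):
    """Compute icsd, per-class csd and sd for one term.

    tf_by_class maps class -> list of the term's frequencies, one entry per document.
    """
    all_tfs = [tf for tfs in tf_by_class.values() for tf in tfs]
    sd = std(all_tfs)                                        # deviation over the whole collection
    csd = {c: std(tfs) for c, tfs in tf_by_class.items()}    # deviation within each class
    icsd = std([mean(tfs) for tfs in tf_by_class.values()])  # deviation among class averages
    return icsd, csd, sd

def distribution_weight(tfidf, icsd, csd_c, sd, alpha=0.5, beta=-0.5, gamma=-0.5, eps=1e-9):
    """Weighting skeleton: tfidf * icsd^alpha * csd^beta * sd^gamma."""
    return tfidf * (icsd + eps) ** alpha * (csd_c + eps) ** beta * (sd + eps) ** gamma

if __name__ == "__main__":
    # Frequencies of one term in the documents of two classes (toy numbers).
    tf_by_class = {"c1": [3, 4, 3], "c2": [0, 1, 0]}
    icsd, csd, sd = term_distributions(tf_by_class)
    print(icsd, csd, sd)
    print(distribution_weight(tfidf=2.0, icsd=icsd, csd_c=csd["c1"], sd=sd))
```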


Data Sets and Experimental Settings
The following shows a set of experiments to investigate the effect of term distributions on classification accuracy. Four data sets are used in the experiments: (1) Drug Information (DI), (2) Newsgroups (News), (3) WebKB1 and (4) WebKB2. The first data set, DI, is a set of web pages collected from www.rxlist.com. It includes 4480 English web pages with seven classes: adverse drug reaction, clinical pharmacology, description, indications, overdose, patient information, and warning. Each web page in this data set consists of informative content with a few links, and its structure is well organized. The second data set, Newsgroups, contains 19,997 documents. The articles are grouped into 20 different UseNet discussion groups; some of these groups are very similar. The third and fourth data sets are constructed from WebKB, which contains 8145 web pages collected from the departments of computer science of four universities, with some additional pages from other universities. The collection can be arranged into seven classes. In our experiment, we use the four most popular classes, student, faculty, course and project, as our third data set, called WebKB1. The total number of web pages is 4199. Alternatively, this reduced collection can be rearranged into five classes by university (WebKB2): cornell, texas, washington, wisconsin and misc (collected from some other universities). The pages in WebKB vary in style, ranging from quite informative pages to link pages. Table 6-1 indicates the major characteristics of the data sets. More detail about the document distribution of each class in WebKB is shown in Table 6-2.

Table 6-1: Characteristics of the four data sets

  Data sets              DI      News         WebKB1    WebKB2
  1. Type of docs        HTML    Plain text   HTML      HTML
  2. No. of docs         4480    19,997       4199      4199
  3. No. of classes      7       20           4         5
  4. No. of docs/class   640     1000         Varied    Varied

Table 6-2: The distribution of the documents in WebKB1 and WebKB2

  WebKB1 \ WebKB2   Cornell   Texas   Washington   Wisconsin   Misc.   Subtotal
  Course            44        38      77           85          686     930
  Faculty           34        46      31           42          971     1124
  Project           20        20      21           25          418     504
  Student           128       148     126          156         1083    1641
  Subtotal          226       252     255          308         3158    4199

For the HTML-based data sets (i.e., DI and WebKB), all HTML tags are eliminated from the documents in order to make the classification process depend not on the tag sets but on the content of the web documents. For a similar reason, all headers are omitted from the Newsgroups documents, the e-mail-based data set. For all data sets, a stop word list is applied to remove common words, such as a, for, the and so on, from the documents. This means that when a unigram model is employed, a vector is constructed from all features (words) except stop words. In the case of a bigram model, after eliminating stop words, any two contiguous words are combined into a term as the representation basics. Moreover, terms occurring fewer than three times are ignored. A minimal sketch of this preprocessing is shown below.
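The sketch uses an illustrative stop word list: stop words are removed, contiguous words are optionally combined into bigram terms, and terms occurring fewer than three times in the collection are dropped.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "for", "the", "of", "and", "in", "to"}  # illustrative subset only

def preprocess(tokens, use_bigram=False):
    """Remove stop words and, optionally, combine contiguous words into bigram terms."""
    words = [w.lower() for w in tokens if w.lower() not in STOP_WORDS]
    if not use_bigram:
        return words
    return ["_".join(pair) for pair in zip(words, words[1:])]

def prune_rare_terms(docs_terms, min_count=3):
    """Ignore terms occurring fewer than min_count times in the whole collection."""
    counts = Counter(t for terms in docs_terms for t in terms)
    return [[t for t in terms if counts[t] >= min_count] for terms in docs_terms]

if __name__ == "__main__":
    raw = ["The", "patient", "information", "for", "the", "drug"]
    print(preprocess(raw))                   # unigram terms without stop words
    print(preprocess(raw, use_bigram=True))  # bigram terms
```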

6.1.3. Experimental Settings and Results

This section shows two experimental results as an investigation of the effect of term distributions. In the first experiment, term distribution factors are combined in different manners, and the efficiencies of these combinations are evaluated. From now on, let us call the classifiers that incorporate term distribution factors in their weighting term-distribution-based centroid-based classifiers (later called TCBs). In the second experiment, the top 10 TCBs obtained from the first experiment are selected for investigating the effect of term distribution factors under different types of frequency-based query weighting in both unigram and bigram models. Three types of query weighting are investigated: term frequency, binary and augmented normalized term frequency. The TCBs are compared to a number of well-known methods as baselines: a standard centroid-based classifier (for short, SCB), a centroid-based classifier whose term weighting is modified with information gain (for short, SCBIG), k-NN and naive Bayes (for short, NB). In both experiments, a data set is split into two parts: 90% for the training set and 10% for the test set. In the further experiment on training set size reported in the original work, the test set is fixed at 10% of the whole data set while the training set is varied from 10% to 90%. All experiments perform 10-fold cross validation.

One of the most important factors for a meaningful evaluation is the way the classifier parameters are set. The parameters applied to these classifiers were determined by some preliminary experiments. For SCB, we apply the standard term weighting $tf \times idf$. For SCBIG, a term goodness criterion called information gain (IG) is applied for adjusting the weight in SCB, resulting in $tf \times idf \times IG$. The k values in k-NN are set to 20 for DI, 30 for Newsgroups and 50 for WebKB1 and WebKB2. Moreover, the term weighting used in k-NN is the augmented normalized term frequency multiplied by idf, i.e., $(0.5 + 0.5 \times \frac{tf}{tf_{max}}) \times idf$, where $tf_{max}$ is the maximum term frequency in a document. These k values and this term weighting performed well in our pretests. For NB, two possible alternative methods to calculate the posterior probability are the binary frequency and the occurrence frequency; the occurrence frequency is selected for comparison since it outperforms the binary frequency. The query weighting for TCBs is $tf \times idf$ by default. As the performance indicator, classification accuracy is applied. It is defined as the ratio of the number of documents assigned their correct classes to the total number of documents in the test set.

Effect of Term Distribution Factors
This experiment investigates the combination of term distribution factors for improving classification accuracy. Although the previous observations suggest the role of each term distribution factor, all possible combinations are explored in this experiment. Two issues are taken into account: (1) which factors are suitable to work together and (2) what the appropriate combination of these factors is. To this end, we perform all combinations of icsd, csd and sd by varying the power of each factor between -1 and 1 with a step of 0.5 and using it to modify the standard weighting $tf \times idf$. A positive power means the factor acts as a promoter while a negative one means the factor acts as a demoter. The total number of combinations is 125 (= 5 x 5 x 5). These combinations include the plain $tf \times idf$ weighting itself and six single-factor term weightings. From the results, we find that there are only 19 patterns giving better performance than $tf \times idf$. The 20 best (top 20) and the 20 worst classifiers, according to average accuracy on the four data sets, are selected for evaluation. Table 6-3 shows the number of the best (worst) classifiers for each power of icsd, csd and sd. Moreover, the numbers in parentheses show the numbers of the top 10 (bottom 10) classifiers for each power.
For more detail, the characteristics and performances of the top 20 term weightings are shown in Table 6-4. Both results are originally provided in (Lertnattee and Theeramunkong, 2004a).


Table 6-3: Descriptive analysis of term distribution factors (TDF) with different powers of each factor. Part A: the best 20 and Part B: the worst 20 classifiers (top 10 and bottom 10 in parentheses) (source: Lertnattee and Theeramunkong, 2004a)

  Part A (the best 20)
  TDF    Power of the factor                              Number of methods
         -1      -0.5    0       0.5     1
  icsd   0(0)    0(0)    6(2)    9(5)    5(3)             20(10)
  csd    5(4)    7(4)    6(2)    2(0)    0(0)             20(10)
  sd     9(4)    7(4)    4(2)    0(0)    0(0)             20(10)

  Part B (the worst 20)
  TDF    Power of the factor                              Number of methods
         -1      -0.5    0       0.5     1
  icsd   6(1)    3(1)    3(2)    3(3)    5(3)             20(10)
  csd    4(0)    0(0)    1(0)    6(2)    9(8)             20(10)
  sd     1(0)    1(0)    1(0)    6(3)    11(7)            20(10)

Table 6-3 (Part A) provides the same conclusion as the result obtained from the first experiment: sd and csd are suitable to act as demoters rather than promoters, while icsd behaves in the opposite way. There are almost no negative results, except for csd, and this is more obvious in the case of the top 10. On the other hand, Table 6-3 (Part B) shows that the performance is low if sd and csd are applied as promoters. However, it is not clear whether using icsd as a demoter harms the performance.

Table 6-4: Classification accuracy of the 20 best term weightings (source: Lertnattee and Theeramunkong, 2004a)

  (The icsd, csd and sd columns give the power of each term distribution factor.)

  Methods   icsd   csd    sd     DI      News    WebKB1   WebKB2   Avg.
  TCB1*     0.5    -0.5   -0.5   96.81   79.52   82.45    92.67    87.86
  TCB2*     0.5    -1     0      95.16   79.73   81.90    93.17    87.49
  TCB3*     0.5    -1     -0.5   92.25   83.17   78.88    93.71    87.00
  TCB4*     1      -0.5   -1     96.65   77.70   82.90    90.21    86.87
  TCB5*     0.5    0      -1     96.14   77.67   81.50    91.24    86.63
  TCB6*     0.5    -0.5   -1     92.57   83.13   78.64    91.62    86.49
  TCB7      1      -1     -1     91.07   82.17   80.09    92.28    86.40
  TCB8*     1      -1     -0.5   94.80   78.79   80.16    91.14    86.22
  TCB9*     0      -0.5   0      93.75   80.70   80.90    89.19    86.13
  TCB10*    0      0      -0.5   92.90   78.97   79.11    92.86    85.96
  TCB11     0.5    -0.5   0      96.45   74.40   80.28    92.02    85.79
  TCB12     0      -0.5   -0.5   90.56   83.08   76.28    92.40    85.58
  TCB13     1      -0.5   -0.5   96.18   71.49   80.81    89.76    84.56
  TCB14     0.5    0      -0.5   95.58   72.69   78.64    90.93    84.46
  TCB15     0      0.5    -1     90.92   78.21   77.02    91.40    84.39
  TCB16     0      0      -1     88.55   82.70   73.45    90.71    83.85
  TCB17     1      0      -1     96.00   69.76   79.95    89.50    83.80
  TCB18     0.5    0.5    -1     93.95   71.73   78.45    90.24    83.59
  TCB19     0.5    -1     -1     90.67   82.09   70.64    90.64    83.51
  SCB       0      0      0      91.67   74.76   77.71    88.76    83.23

Table 6-4 also marks with an asterisk (*) the classifiers that outperform the standard $tf \times idf$ in all four data sets; there are nine such classifiers. This fact shows that there are some term distribution settings that are generally useful in all data sets. The best term weighting in this experiment is $tf \times idf \times icsd^{0.5} \times csd^{-0.5} \times sd^{-0.5}$; that is, the powers are 0.5 for icsd and -0.5 for both csd and sd. However, it is observed that the appropriate powers of the term distribution factors depend on some characteristics of the data sets.


For instance, when the power of csd changes from -0.5 to -1.0 (TCB1 to TCB3 in Table 6-4), the performances for DI and WebKB1 decrease but those for Newsgroups and WebKB2 increase. This suggests that csd, a class dependency factor, is more important in Newsgroups and WebKB2 than in DI and WebKB1.

Experiments with Different Query Weightings and Unigram/Bigram Models
In this experiment, the top 10 TCBs obtained from the previous experiment are selected for exploring the effect of term distribution factors with different types of query weighting in both unigram and bigram models. The TCBs are compared to SCB, SCBIG, k-NN and NB. Three types of query weighting are investigated: term frequency (n), binary (b) and augmented normalized term frequency (a). The simple query weighting (n) sets the term frequency (or occurrence frequency) tf as the weight for a term in a query. The binary query weighting (b) sets either 0 or 1 for terms in a query. The augmented normalized term frequency (a) defines $0.5 + 0.5 \times \frac{tf}{tf_{max}}$ as the weight for a term in a query. This query term weighting is applied for all centroid-based classifiers, i.e., TCBs, SCB and SCBIG. Furthermore, the query term weighting is modified by multiplying the original weight with the inverse document frequency (idf). The results for the unigram and bigram models are shown in Table 6-5 (Parts A and B), respectively.

Table 6-5: Accuracy of the top 10 TCBs with different types of query weighting compared to SCB, SCBIG, k-NN and NB for unigram and bigram models (source: Lertnattee and Theeramunkong, 2004a)

  (n = term frequency, b = binary, a = augmented normalized term frequency query weighting;
  k-NN and NB are reported with a single value per data set.)

  Part A (Unigram)
  Method    DI                      News                    WebKB1                  WebKB2
            n      b      a        n      b      a        n      b      a        n      b      a
  TCB1      96.81  97.86  97.81    79.52  79.66  79.78    82.45  84.66  84.59    92.67  90.83  91.43
  TCB2      95.16  95.96  95.87    79.73  80.93  80.95    81.90  85.33  85.12    93.17  93.05  93.21
  TCB3      92.25  92.90  92.90    83.17  83.44  83.64    78.88  82.62  82.14    93.71  92.47  92.95
  TCB4      96.65  97.46  97.39    77.70  77.72  77.88    82.90  85.12  85.02    90.21  88.02  88.83
  TCB5      96.14  97.25  97.21    77.67  78.16  78.14    81.50  83.54  83.26    91.24  89.16  89.76
  TCB6      92.57  93.06  93.01    83.13  83.30  83.46    78.64  81.07  80.61    91.62  89.19  89.76
  TCB7      91.07  92.10  92.08    82.17  82.95  82.95    80.09  83.83  83.54    92.28  90.52  90.93
  TCB8      94.80  96.36  96.32    78.79  79.53  79.62    80.16  84.12  84.02    91.14  89.69  90.12
  TCB9      93.75  94.62  94.55    80.70  80.83  80.91    80.90  83.19  82.95    89.19  91.21  91.07
  TCB10     92.90  94.51  94.38    78.97  79.38  79.57    79.11  81.47  81.21    92.86  91.71  92.19
  SCB       91.67  92.99  93.01    74.76  75.29  75.37    77.71  78.66  78.73    88.76  91.12  91.07
  SCBIG     96.19  97.43  97.39    60.83  59.31  60.40    75.02  78.78  78.26    90.26  89.59  89.95
  k-NN      94.60  -      -        82.69  -      -        68.33  -      -        89.16  -      -
  NB        95.00  -      -        80.82  -      -        81.40  -      -        87.45  -      -

  Part B (Bigram)
  Method    DI                      News                    WebKB1                  WebKB2
            n      b      a        n      b      a        n      b      a        n      b      a
  TCB1      98.73  99.35  99.35    81.83  82.37  82.36    84.19  86.71  86.35    93.88  93.36  93.43
  TCB2      90.33  94.75  94.53    82.27  83.00  83.04    83.66  87.88  87.47    94.67  94.71  94.81
  TCB3      97.90  99.24  99.22    85.15  85.20  85.24    82.47  85.69  85.26    95.52  94.98  95.19
  TCB4      98.64  99.38  99.33    80.32  80.94  80.88    84.57  87.43  87.09    92.02  91.19  91.45
  TCB5      98.04  98.68  98.68    81.22  81.94  81.91    83.40  85.76  85.43    92.74  92.17  92.36
  TCB6      98.95  99.31  99.31    85.58  85.66  85.71    82.78  85.45  85.12    94.14  93.36  93.52
  TCB7      85.25  90.13  89.71    84.80  84.98  84.92    82.88  86.78  86.31    94.05  93.47  93.57
  TCB8      80.36  85.98  85.60    81.43  82.31  82.25    81.88  86.88  86.19    92.76  92.33  92.45
  TCB9      98.42  98.93  98.91    82.77  83.05  83.01    82.47  85.16  84.76    93.81  94.88  94.88
  TCB10     97.54  98.17  98.15    82.37  82.71  82.75    81.09  84.14  83.71    94.43  94.17  94.26
  SCB       96.07  97.41  97.37    77.40  78.44  78.37    79.14  81.71  81.31    92.31  93.62  93.74
  SCBIG     97.83  99.00  98.88    62.50  61.82  62.50    76.07  80.50  80.23    92.24  92.97  93.02
  k-NN      97.48  -      -        82.75  -      -        70.16  -      -        91.62  -      -
  NB        96.76  -      -        82.83  -      -        82.21  -      -        94.02  -      -


According to the results, the TCBs outperformed SCB, SCBIG, k-NN and NB in almost all cases, in both unigram and bigram models, independently of the query weighting. The bigram model generally gains better performance than the unigram model, and in the bigram model the term distributions are still useful for improving classification accuracy. It is hard to determine which query weighting performs better than the others, but term distributions are helpful for all types of query weighting. For SCBIG, the accuracy on DI significantly improves; however, its performance is a little lower than SCB on average. TCB1, TCB2 and TCB3 seem to achieve higher accuracy than the others, even though TCB4 and TCB6 perform better in the bigram model for DI and News, respectively.

Related Works on Centroid-based Classification
Term weighting plays an important role in achieving high performance in text classification. In the past, most approaches (Salton and Buckley, 1988; Skalak, 1994; Ittner et al., 1995; Chuang et al., 2000; Singhal et al., 1996; Sebastiani, 2002) were proposed using frequency-based factors, such as term frequency and inverse document frequency, for setting the weights of terms. In these approaches, the way to solve the problem that a long document may suppress a short document is to perform normalization on document vectors or class prototype vectors. That is, a vector representing a document or a class is transformed into a unit vector whose length equals 1. In spite of this, it is doubtful whether such frequency-based term weighting is enough to reflect the importance of terms in the representation of a document or a class. There were also some works on adjusting weights using the relevance feedback approach. Among them, two popular schemes are the vector space model and probabilistic networks. For the vector space model, the Rocchio feedback model (Rocchio, 1971; Salton, 1989; Joachims, 1997) is the most commonly used method. The method attempts to use both positive and negative instances in term weighting, and one can expect a more effective profile representation generated from relevance feedback. For the probabilistic network approach, a query can be modified by the addition of the first m terms taken from a list in which all terms present in documents deemed relevant are ranked (Robertson and Sparck-Jones, 1976). The probabilistic indexing technique was suggested by Fuhr (1989), and Joachims (1997) analysed a probabilistic variant of this technique applied to the Rocchio classifier with term weighting. Deng et al. (2002) introduced an approach that uses statistics within a class, called the ``category relevance factor'', to improve classification accuracy. Debole and Sebastiani (2003) evaluated several feature selection methods, such as chi-square, information gain and gain ratio. These feature selection methods were applied to term weighting, substituting for idf, on three classifiers: k-NN, NB and Rocchio. From the results, these methods might be useful for k-NN and support vector machines but seem useless for Rocchio. Recently, centroid-based classifiers with the consideration of term distributions have been explored by Han and Karypis (2000), Lertnattee and Theeramunkong (2004a; 2004b; 2005; 2006b; 2007a; 2007b; 2009) and Theeramunkong and Lertnattee (2007). As a kind of term distribution, normalization is also an important factor towards better accuracy, as investigated by Singhal et al. (1995; 1996) and Lertnattee and Theeramunkong (2003; 2006a). A survey of statistical approaches for text categorization was done by Yang (1999) and Yang and Liu (1999). Text classification with semi-supervised learning can be found in (Nigam et al., 2000).


Conclusions
Section 6.1 has shown that term distributions are useful for improving accuracy in centroid-based classification. Three types of term distributions, inter-class standard deviation (icsd), class standard deviation (csd) and standard deviation (sd), were introduced to exploit information outside a class, inside a class and in the whole collection. The distributions were used to represent the discriminating power of each term and then to weight that term. To investigate the pattern of how these term distributions contribute to weighting each term in documents, we varied the contributions of the term distributions to the term weighting and then constructed a number of centroid-based classifiers with different term weightings. The effectiveness of term distributions was explored using various data sets. As baselines, a standard centroid-based classifier with $tf \times idf$, a centroid-based classifier with $tf \times idf \times IG$, and two well-known methods, k-NN and naive Bayes, were employed. Furthermore, both unigram and bigram models were investigated. The experimental results showed the benefit of term distributions in classification, and revealed a certain pattern in how term distributions contribute to term weighting. It can be claimed that terms with a low sd and a low csd should be emphasized, while terms with a high icsd should also get more importance. For more detail, the reader can refer to (Lertnattee and Theeramunkong, 2004a).

6.2. Document Relation Extraction

Nowadays, it has become difficult for researchers to follow the state of the art in their area of interest, since the number of research publications has been increasing continuously and quickly. Such a large volume of information is a serious hindrance for researchers in positioning their own works against existing works, or in finding useful relations between them. Some research works, including (Kessler, 1963; Small, 1973; Ganiz, 2006), have been done towards a solution. Although the publication of each work may include a list of related articles (documents) as its references, it is still impossible to include all related works, due either to intentional reasons (e.g., limitation of paper length) or to unintentional reasons (e.g., the related work is simply unknown). Enormous numbers of meaningful connections that permeate the literature may remain hidden. Growing from a different field, the approach known as literature-based discovery, led by Swanson (1986; 1990), which discovers hidden and significant relations within a bibliographic database, has become popular in medical-related fields. As a content-based approach with manual and/or semi-automatic processes, a set of topical words or terms is extracted as concepts and then utilized to find connections between two literatures. Due to the simplicity and practicality of this approach, it was used in several areas by succeeding works (Gordon and Dumais, 1998; Lindsay and Gordon, 1999; Pratt et al., 1999). Some works proposed citation analysis based on so-called bibliographic coupling (Kessler, 1963) and co-citation (Small, 1973). While these were successfully applied in several works (Nanba et al., 2000; White and McCain, 1989; Rousseau and Zuccala, 2004) to obtain topically related documents, they are not fully automated and involve a lot of labor-intensive tasks. Based on association rule mining, an automated approach to discover relations among documents in a research publication database was introduced in Sriphaew and Theeramunkong (2005; 2007a; 2007b). By mapping a term (a word or a pair of words) to a transaction in a transactional database, topic-based relations among scientific publications are revealed under various document representations. Although that work was the first attempt to find document relations automatically by exploiting the terms in documents, it utilized only a simple evaluation without elaborate consideration.

There has been little exploration of how to evaluate document relations discovered from text collections. Most works in text mining utilize a dataset which includes both queries and their corresponding correct answers as a test collection. They usually define certain measures and use them for performance assessment on the test collection.


For instance, classification accuracy is applied for assessing the class to which a document is assigned in text categorization (TC) (Rosch, 1978), while recall and precision are used to evaluate the retrieved documents with regard to given query keywords in information retrieval (IR) (Salton and McGill, 1983). As a more naive evaluation method, human judgment has been used in more recent works on mining web documents, such as HITS (Kleinberg, 1999) and PageRank (Page et al., 1998), where there is no standard dataset. However, this manual evaluation is a labor-intensive task and quite subjective. Moreover, there is a lack of standard criteria for evaluating document relations. So far, while there have been several benchmark datasets, e.g., the UCI Repository (www.ics.uci.edu/~mlearn/MLRepository.html), WebKB (www.webkb.org) and TREC data (trec.nist.gov/data.html), for TC and IR tasks, there is no standard dataset for the task of document relation discovery. Towards resolving these issues, this section gives a brief introduction to a research work that uses the citation information in research publications as a source for evaluating the discovered document relations. The full description of this work can be found in (Sriphaew and Theeramunkong, 2007a). Conceptually, the relations among documents can be formulated as a subgraph where each node represents a document and each arc represents a relation between two documents. Based on this formulation, a number of scoring methods are introduced for evaluating the discovered document relations in order to reflect their quality. Moreover, this work also introduces a generative probability, derived from probability theory, and uses it to compute an expected score that captures objectively how good the evaluation results are.

6.2.1. Document Relation Discovery using Frequent Itemset Mining

A formulation of the ARM task for document relation discovery can be summarized as follows. Let $D = \{d_1, d_2, \ldots, d_n\}$ be a set of documents (items) and $T = \{t_1, t_2, \ldots, t_m\}$ be a set of terms (transactions). Also let $\delta(t_j, d_i)$ represent the existence (0 or 1) of a term $t_j$ in a document $d_i$. A subset of $D$ is called a docset whereas a subset of $T$ is called a termset. Furthermore, a docset with k documents is called a k-docset (or a docset with the length of k). The support of a docset $X \subseteq D$ is defined as follows.

$$supp(X) = \frac{|\{t_j \in T \mid \forall d_i \in X,\ \delta(t_j, d_i) = 1\}|}{|T|}$$

Here, a k-docset with a support greater than a predefined minimum support is called a frequent k-docset. We will use the term ``docset'' in the meaning of ``frequent docset'' and ``document relation'' interchangeably. Some kind of evaluation is then needed to assess which of the discovered document relations are better, as shown below.
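As an illustration of the formulation above, the hedged sketch below treats each document as an item and each term as a transaction: the support of a docset is the fraction of terms that occur in every document of the docset, and docsets above a minimum support are kept. It is a plain enumeration over small docsets, not the optimized mining procedure used in the original work; all names and the toy data are assumptions.

```python
from itertools import combinations

def docset_support(docset, term_sets, n_terms):
    """Fraction of terms (transactions) that occur in every document (item) of the docset."""
    common = set.intersection(*(term_sets[d] for d in docset))
    return len(common) / n_terms

def frequent_docsets(doc_terms, min_support=0.2, max_size=3):
    """Enumerate docsets of 2..max_size documents whose support exceeds min_support."""
    n_terms = len({t for terms in doc_terms.values() for t in terms})
    term_sets = {d: set(terms) for d, terms in doc_terms.items()}
    docs = sorted(doc_terms)
    result = {}
    for k in range(2, max_size + 1):
        for docset in combinations(docs, k):
            support = docset_support(docset, term_sets, n_terms)
            if support > min_support:
                result[docset] = support
    return result

if __name__ == "__main__":
    doc_terms = {"d1": {"mining", "text", "classification"},
                 "d2": {"mining", "text", "clustering"},
                 "d3": {"speech", "recognition"}}
    print(frequent_docsets(doc_terms, min_support=0.2))  # {('d1', 'd2'): 0.333...}
```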

6.2.2. Empirical Evaluation using Citation Information

This subsection presents a method that uses the citations (references) among technical documents in a scientific publication collection to evaluate the quality of the discovered document relations. Intuitively, two documents are expected to be related under one of three basic situations: (1) one document cites the other (direct citation), (2) both documents cite the same document (bibliographic coupling) (Kessler, 1963) and (3) both documents are cited by the same document (co-citation) (Small, 1973). Citation analysis has been applied in several interesting applications (Nanba et al., 2000; White and McCain, 1989; Rousseau and Zuccala, 2004). Besides these basic situations, two documents may be related to each other via a more complicated concept called transitivity. For example, if a document A cites a document B, and transitively the document B cites a document C, then one could assume a relation between A and C.


In this work, using the transitivity property, the concept of order citation is originally proposed to express an indirect connection between two documents. With the assumption that a direct or indirect connection between two documents implies a topical relation between them, such connections can be used for evaluating the results of document relation discovery. In the rest of this section, the u-th order citation and the v-th order accumulative citation matrix are introduced. Then, the so-called validity is proposed as a measure for evaluating discovered docsets using the information in the citation matrix. Finally, the expected validity is mathematically defined by exploiting the concept of generative probability and estimation.

The Citation Graph and Its Matrix Representation
Conceptually, the citations among documents in a scientific publication collection form a citation graph, where a node corresponds to a document and an arc corresponds to a direct citation of one document to another. Based on this citation graph, an indirect citation can be defined using the concept of transitivity. The formulation of direct and indirect citations can be given in terms of the u-th order citation and the v-th order accumulative citation matrix as follows.

Definition 1 (the u-th order citation): Let $D$ be a set of documents (items) in the database. For $x, y \in D$, $y$ is the u-th order citation of $x$ iff the number of arcs in the shortest path between $x$ and $y$ in the citation graph is $u$ ($u \geq 1$). Conversely, $x$ is also called the u-th order citation of $y$.

Figure 6-1: An example of a citation graph. (source: Sriphaew and Theeramunkong, 2007a)

For example, given a set of six documents $d_1, \ldots, d_6$ and a set of six citations linking the pairs $d_1$-$d_2$, $d_2$-$d_3$, $d_2$-$d_5$, $d_3$-$d_4$, $d_3$-$d_5$ and $d_4$-$d_6$, the citation graph can be depicted as in Figure 6-1. In the figure, $d_2$ is the first, $d_3$ and $d_5$ are the second, and $d_4$ is the third order citation of the document $d_1$. Note that although there is a direction for each citation, it is not taken into account since the task is to detect a document relation, where the citation direction is not concerned. Moreover, using only textual information without explicit citation or temporal information, it is difficult to find the direction of the citation between any two documents. Based on the concept of the u-th order citation, the v-th order accumulative citation matrix is introduced to express a set of citation relations stating whether any two documents can be transitively reached by a shortest path shorter than v+1.

Definition 2 (the v-th order accumulative citation matrix): Given a set of n distinct documents, the v-th order accumulative citation matrix (for short, v-OACM) is an $n \times n$ matrix, each element $r^v_{xy}$ of which represents the citation relation between two documents x, y, where $r^v_{xy} = 1$ when x is the u-th order citation of y and $u \leq v$, otherwise $r^v_{xy} = 0$. Note that $r^v_{xx} = 1$ and $r^v_{xy} = r^v_{yx}$.

For the previous example, the 1-, 2- and 3-OACMs can be created as shown in Table 6-6.


Here, the 1-, 2- and 3-OACMs are represented by a set of values $[r^1_{xy}, r^2_{xy}, r^3_{xy}]$ for each pair of documents.

Table 6-6: The 1-, 2- and 3-OACMs, represented by a set of values $[r^1_{xy}, r^2_{xy}, r^3_{xy}]$

Document   d1        d2        d3        d4        d5        d6
d1         [1,1,1]   [1,1,1]   [0,1,1]   [0,0,1]   [0,1,1]   [0,0,0]
d2         [1,1,1]   [1,1,1]   [1,1,1]   [0,1,1]   [1,1,1]   [0,0,1]
d3         [0,1,1]   [1,1,1]   [1,1,1]   [1,1,1]   [1,1,1]   [0,1,1]
d4         [0,0,1]   [0,1,1]   [1,1,1]   [1,1,1]   [0,1,1]   [1,1,1]
d5         [0,1,1]   [1,1,1]   [1,1,1]   [0,1,1]   [1,1,1]   [0,0,1]
d6         [0,0,0]   [0,0,1]   [0,1,1]   [1,1,1]   [0,0,1]   [1,1,1]

The 1-OACM can be straightforwardly constructed from the set of first-order citations (direct citations). The (v+1)-OACM (mathematically denoted by a matrix $R^{v+1}$) can be recursively created from the operation between the v-OACM ($R^{v}$) and the 1-OACM ($R^{1}$) according to the following formula.

$r^{v+1}_{ij} = \bigvee_{k=1}^{n} \left( r^{v}_{ik} \wedge r^{1}_{kj} \right)$

where $\vee$ is an OR operator, $\wedge$ is an AND operator, $r^{v}_{ik}$ is the element at the i-th row and the k-th column of the matrix $R^{v}$, and $r^{1}_{kj}$ is the element at the k-th row and the j-th column of the matrix $R^{1}$. Note that any v-OACM is a symmetric matrix.
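The recursion above is simply a boolean matrix product (the diagonal of the 1-OACM is 1, so relations already found at lower orders are preserved). The sketch below, assuming the 1-OACM is given as a list of lists of 0/1 values, builds the higher-order OACMs for the six-document example; the function names are illustrative only.

```python
def boolean_product(a, b):
    """(A x B)[i][j] = OR_k (A[i][k] AND B[k][j]) for 0/1 matrices."""
    n = len(a)
    return [[int(any(a[i][k] and b[k][j] for k in range(n)))
             for j in range(n)] for i in range(n)]

def build_oacms(oacm1, max_order):
    """Return {v: v-OACM} for v = 1..max_order, following R^{v+1} = R^v x R^1."""
    oacms = {1: oacm1}
    for v in range(1, max_order):
        oacms[v + 1] = boolean_product(oacms[v], oacm1)
    return oacms

# 1-OACM of the six-document example (diagonal elements are 1 by definition).
oacm1 = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 1],
]
oacms = build_oacms(oacm1, max_order=3)
```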

Validity: Quality of Document Relation

This section defines the validity, which is used as a measure for evaluating the quality of the discovered docsets. The concept of validity calculation is to investigate how the documents in a discovered docset are related to each other according to the citation graph. Based on this concept, the most preferable situation is that all documents in a docset directly cite and/or are cited by at least one document in that docset, so that they form one connected group. Since in practice only a few references are given in a document, it is quite rare and unrealistic that all related documents cite each other. As a generalization, we can assume that all documents in a docset should cite and/or be cited by each other within a specific range in the citation graph. Here, the shorter the specific range is, the more restrictive the evaluation is. With the concept of the v-OACM stated in the previous section, we can realize this generalized evaluation by a so-called v-th order validity (for short, v-validity), where v corresponds to the range mentioned above. Regarding the criteria of evaluation, two alternative scoring methods can be employed for defining the validity of a docset. As the first method, a score is computed as the ratio of the number of citation relations that the most popular document in a docset has to its maximum. The most popular document is the document that has the most relations with the other documents in the docset. Note that it is possible to have more than one popular document in a docset. The score calculated by this method is called soft validity. In the second method, a stricter criterion for scoring is applied: the score is set to 1 only when the most popular document connects to all documents in the docset; otherwise, the score is set to 0. This score is called hard validity. The soft v-validity and hard v-validity of a docset X ($|X| \geq 2$), denoted by $V^v_S(X)$ and $V^v_H(X)$ respectively, are defined as follows.

$V^v_S(X) = \frac{\max_{x \in X} \sum_{y \in X, y \neq x} r^v_{xy}}{|X| - 1}$

For simplicity, we denote the numerator in the above equation with $m^v(X)$. Then,

$V^v_H(X) = \begin{cases} 1 & \text{if } m^v(X) = |X| - 1 \\ 0 & \text{otherwise} \end{cases}$


Here, $r^v_{xy}$ is the citation relation defined by Definition 2. It can be observed that the soft v-validity of a docset ranges from 0 to 1, i.e., $0 \leq V^v_S(X) \leq 1$, while the hard v-validity is a binary value of 0 or 1, i.e., $V^v_H(X) \in \{0, 1\}$. In both cases, the v-validity achieves the minimum (i.e., 0) when there is no citation relation among any documents in the docset. On the other hand, it achieves the maximum (i.e., 1) when there is at least one document that has a citation relation with all documents in the docset. Intuitively, the validity of a bigger docset tends to be lower than that of a smaller docset, since the probability that one document will cite and/or be cited by all other documents in the same docset becomes lower. In practice, instead of an individual docset, the whole set of discovered docsets needs to be evaluated. The easiest method is to use an arithmetic mean. However, it is not fair to use the arithmetic mean directly, since a bigger docset tends to have lower validity than a smaller one. We need an aggregation method that reflects docset size in the summation of validities. One reasonable method is the concept of a weighted mean, where each weight reflects the docset size. Therefore, the soft v-validity and hard v-validity for a set of discovered docsets F, denoted by $V^v_S(F)$ and $V^v_H(F)$ respectively, can be defined as follows.

$V^v_S(F) = \frac{\sum_{X \in F} w_X \, V^v_S(X)}{\sum_{X \in F} w_X} \qquad\qquad V^v_H(F) = \frac{\sum_{X \in F} w_X \, V^v_H(X)}{\sum_{X \in F} w_X}$

where $w_X$ is the weight of a docset X. In this work, $w_X$ is set to $|X| - 1$, the maximum value that the numerator $m^v(X)$ of the validity of a docset X can gain. As an example calculation, given the 1-OACM in Table 6-6 and a set F of discovered docsets, the set soft 1-validity $V^1_S(F)$ and the set hard 1-validity $V^1_H(F)$ can be computed directly from these definitions.
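A small sketch of these scores, assuming the v-OACM is available as a symmetric 0/1 matrix, docsets are given as index tuples, and the weight of a docset is $|X| - 1$ as described above, may help make the definitions concrete (function and variable names here are illustrative, not from the original work):

```python
def soft_validity(docset, oacm):
    """Soft v-validity: relations of the most popular document / (|X| - 1)."""
    best = max(sum(oacm[x][y] for y in docset if y != x) for x in docset)
    return best / (len(docset) - 1)

def hard_validity(docset, oacm):
    """Hard v-validity: 1 iff one document is related to all the others."""
    return 1.0 if soft_validity(docset, oacm) == 1.0 else 0.0

def set_validity(docsets, oacm, score):
    """Weighted mean of docset validities, weighted by |X| - 1."""
    weights = [len(x) - 1 for x in docsets]
    return sum(w * score(x, oacm) for w, x in zip(weights, docsets)) / sum(weights)

# Example with the 1-OACM of Table 6-6 (documents indexed 0..5 for d1..d6).
oacm1 = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 1],
]
docsets = [(0, 1, 2), (0, 1, 3)]        # {d1,d2,d3} and {d1,d2,d4}
print(set_validity(docsets, oacm1, soft_validity))   # 0.75
print(set_validity(docsets, oacm1, hard_validity))   # 0.5
```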

The Expected Validity

The evaluation of discovered docsets depends on the citation relation $r^v_{xy}$, which is represented by v-OACMs. As stated in the previous section, the lower v is, the more restrictive the evaluation becomes. Therefore, to compare evaluations based on different v-OACMs, we need a value that, regardless of the restrictiveness of the evaluation, represents the expected validity of a given set of docsets under each individual v-OACM. This section describes the method to estimate the theoretical validity of a set of docsets based on probability theory. Towards this estimation, the probability that two documents are related to each other under a v-OACM (later called the base probability $p^v$) needs to be calculated. This probability is the ratio of the number of existing citation relations to the number of all possible citation relations (i.e., $n(n-1)$), as shown in the following equation.

$p^v = \frac{\sum_{x \neq y} r^v_{xy}}{n(n-1)}$

For example, using the citation relations in Table 6-6, the base probabilities for the 1-, 2-, and 3-OACMs are 0.40 (12/30), 0.73 (22/30) and 0.93 (28/30), respectively. Note that the base probability of a higher-OACM is always higher than or equal to that of a lower-OACM. Using the concept of expectation, the expected set v-validity $E^v[V(F)]$ can be formulated as follows.

$E^v[V(F)] = \frac{\sum_{X \in F} w_X \, E^v[V(X)]}{\sum_{X \in F} w_X}, \qquad E^v[V(X)] = \sum_{g \in G(X)} V(g)\, P^v(g)$


where $E^v[V(X)]$ is the expected v-validity of a docset X, $G(X)$ is the set of all possible citation patterns for X, $V(g)$ is the invariant validity of a pattern $g$, and $P^v(g)$ is the generative probability of the pattern $g$ estimated from the base probability under the v-OACM ($p^v$). Theoretically, finding the possible patterns of a docset can be transformed to the set enumeration problem. Given a docset with the length of k (a k-docset), there are $2^{k(k-1)/2}$ possible citation patterns. With the two scoring methods, an invariant validity is defined for each criterion regardless of the v-OACM. To distinguish them, the notation is replaced by $V_S(g)$ and $V_H(g)$ for the invariant validity calculated from soft validity and hard validity, respectively. Similar to the soft v-validity, the invariant validity of a pattern $g$ for soft validity is defined as follows:

$V_S(g) = \frac{\max_{x \in X} \sum_{y \in X, y \neq x} e_{xy}(g)}{|X| - 1}$

For simplicity, we denote the numerator in the above equation by $m(g)$. The invariant validity of $g$ based on hard validity is given by:

$V_H(g) = \begin{cases} 1 & \text{if } m(g) = |X| - 1 \\ 0 & \text{otherwise} \end{cases}$

In the above equations, $e_{xy}(g)$ is the citation relation between two documents x, y in the citation pattern $g$, where $e_{xy}(g) = 1$ when the citation relation exists and $e_{xy}(g) = 0$ otherwise. Note that all patterns in $G(X)$ concern the same docset but represent different citation patterns. The following shows two examples of how to calculate the expected v-validity for 2-docsets and 3-docsets. For simplicity, the expected v-validity based on soft validity is described first, and the one based on hard validity is discussed later. In the simplest case, there are only two possible citation patterns for a 2-docset. Therefore, the expected v-validity based on soft validity of any 2-docset X can be calculated as follows.

$E^v[V_S(X)] = 1 \cdot p^v + 0 \cdot (1 - p^v) = p^v$

Figure 6-2: All possible citation patterns for a 3-docset. (source: Sriphaew and Theeramunkong, 2007a)


In the case of a 3-docset, there are eight possible patterns, as shown in Figure 6-2. Here, we can calculate the invariant validity based on soft validity ($V_S(g)$) of each pattern as follows. The first to fourth patterns have an invariant validity of 1 (i.e., $V_S(g) = 1$). The fifth to seventh patterns gain an invariant validity of 0.5 (i.e., $V_S(g) = 0.5$), while the last pattern has an invariant validity of 0 (i.e., $V_S(g) = 0$). The generative probability of the first pattern is $(p^v)^3$ since there are three citation relations, and that of the second to the fourth patterns equals $(p^v)^2(1 - p^v)$ since there are two citation relations and one missing citation relation. Regarding the citation pattern, the generative probabilities of the other patterns can be calculated in the same manner. From the generative probabilities shown in Figure 6-2, the expected v-validity based on soft validity can be calculated as follows.

$E^v[V_S(X)] = 1 \cdot (p^v)^3 + 1 \cdot 3 (p^v)^2 (1 - p^v) + 0.5 \cdot 3\, p^v (1 - p^v)^2 + 0 \cdot (1 - p^v)^3$

Here, the first term comes from the first pattern, the second term is derived from the second to the fourth patterns, the third term is obtained from the fifth to the seventh patterns, and the last term is for the eighth pattern. With the other criterion, hard validity, the expected v-validity for a 2-docset is still the same, but a difference occurs for a 3-docset. The invariant validity based on hard validity ($V_H(g)$) equals 1 for the first to fourth patterns and becomes 0 for the other patterns. The expected v-validity for a 3-docset based on hard validity is then reduced to

$E^v[V_H(X)] = (p^v)^3 + 3 (p^v)^2 (1 - p^v)$

All the above examples illustrate the calculation of the expected validity of only one docset. To calculate the expected v-validity of several docsets in a given set, the weighted mean of their expected validities can be derived. The outcome is used as the expected value for evaluating the results obtained from our method for discovering document relations.
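The expected v-validity of a docset can also be computed by enumerating the $2^{k(k-1)/2}$ citation patterns directly, which is feasible for the small docset sizes that frequent itemset mining typically produces. The following sketch (with illustrative names, not from the original work) reproduces the closed forms above for 2- and 3-docsets:

```python
from itertools import combinations, product

def expected_validity(k, p, hard=False):
    """Expected v-validity of a k-docset under base probability p.

    Enumerates every citation pattern over the k*(k-1)/2 document pairs,
    weighting each pattern's (soft or hard) validity by its generative
    probability p^(#edges) * (1-p)^(#missing edges).
    """
    pairs = list(combinations(range(k), 2))
    expected = 0.0
    for edges in product([0, 1], repeat=len(pairs)):
        # Validity of this pattern: relations of the most popular document.
        degree = [0] * k
        for (x, y), e in zip(pairs, edges):
            degree[x] += e
            degree[y] += e
        soft = max(degree) / (k - 1)
        validity = 1.0 if (hard and soft == 1.0) else (0.0 if hard else soft)
        n_edges = sum(edges)
        prob = (p ** n_edges) * ((1 - p) ** (len(pairs) - n_edges))
        expected += validity * prob
    return expected

p = 0.40                                     # base probability of the 1-OACM
print(expected_validity(2, p))               # = p
print(expected_validity(3, p))               # = p^3 + 3p^2(1-p) + 1.5p(1-p)^2
print(expected_validity(3, p, hard=True))    # = p^3 + 3p^2(1-p)
```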

6.2.3. Experimental Settings and Results

This subsection presents three experimental results in which the quality of discovered docsets is investigated under several empirical evaluation criteria. The three experiments are (1) to investigate the characteristics of the evaluation by soft validity and hard validity on docsets discovered from different document representations, including their minimum support thresholds and mining time, (2) to study the quality of discovered relations when using either direct citation or indirect citation as the evaluation criteria, and (3) to examine the significance of the discovered docsets by comparing their validity with the expected validity. More complete results can be found in (Sriphaew and Theeramunkong, 2007a). Towards the first objective, several term definitions are explored in the process of encoding the documents. To define terms in a document, techniques of n-gram construction, stemming and stopword removal can be applied. The discovered docsets are ranked by their supports, and then the top-N ranked relations are evaluated using both soft validity and hard validity. Here, the value of N can be varied to observe the characteristics of the discovered docsets. For the second objective, the evaluation is performed based on various v-OACMs, where the 1-OACM considers only direct citation while a higher-OACM also includes indirect citation. Intuitively, the evaluation becomes less restrictive when a higher-OACM is applied as the criterion. To fulfill the third objective, the expected set validity for each set of discovered relations is calculated. Compared to this expected validity, the significance of the discovered docsets is investigated. To implement a mining engine for document relation discovery, the FP-tree algorithm, originally introduced by Han et al. (2000), is modified to mine docsets in a document-term


database. In this work, instead of association rules, frequent itemsets are considered. Since a 1-docset contains no relation, it is negligible and is omitted from our evaluation. That is, only the discovered docsets with at least two documents are considered. The experiments were performed on a Pentium IV 2.4GHz Hyper-Threading machine with 1GB physical memory and 2GB virtual memory running Linux TLE 5.0 as the operating system. The preprocessing steps, i.e., n-gram construction, stemming and stopword removal, consume trivial computational time.

Evaluation Material

There is no gold standard dataset that can be used for evaluating the results of document relation discovery. To solve this problem, an evaluation material is constructed from the scientific research publications in the ACM Digital Library (www.portal.acm.org). As a seed for constructing the citation graph, 200 publications are retrieved from each of three computer-related classes, coded B (Hardware), E (Data) and J (Computer). In the PDF format, each publication is attached with an information page in which citation (i.e., reference) information is provided. The reference publications appearing in these 600 publications are further collected and added into the evaluation dataset. In the same way, the publications referred to by these newly collected publications are also gathered and appended into the dataset. Finally, in total 10,817 research publications are collected as the evaluation material. After converting these collected publications to ASCII text format, the reference section (normally found at the end of each publication text) is removed by a semi-automatic process, using clue words such as 'References' and 'Bibliography'. With the use of the information page attached to each publication, the 1-OACMs can be constructed and used for evaluating the discovered docsets. The v-OACM can be constructed from the (v-1)-OACM and the 1-OACM. In our dataset, the average number of citation relations per document is 8 for the 1-OACM, 148 for the 2-OACM, and 1008 for the 3-OACM. It takes 1.14 seconds to generate the 2-OACM from the 1-OACM, while it takes 15.83 seconds to generate the 3-OACM from the 2-OACM. For text preprocessing, the BOW library by McCallum (1996) is used as a tool for constructing a document-term database. Using a list of 524 stopwords provided by Salton and McGill (1986), common words, such as 'a,' 'an,' 'is,' and 'for', are discarded. Besides these stopwords, terms with very low frequency are also omitted. These terms are numerous and usually negligible.

Experimental Results

As stated at the beginning of this section, several term definitions can be used as factors to obtain various patterns of document representation. In our experiment, eight distinct patterns are explored. Each pattern is denoted by a 3-digit code. The first digit represents the n-gram type, where `U' stands for unigram and `B' means bigram. The second digit has a value of either `O' or `X', expressing whether the stemming scheme is applied or not. The last digit is either `O' or `X', telling us whether the stopword removal scheme is applied or not. For example, `UXO' means a document representation generated by unigram, non-stemming and stopword removal. Table 6-7 and Table 6-8 express the set 1-validity (soft validity/hard validity) of the discovered docsets when various document representations are applied, for the bigram and unigram cases, respectively. The minimum support and the execution time of mining for each document representation to discover a specified number of top-N ranked docsets are also given in the tables.
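To make the 3-digit representation codes concrete, the sketch below encodes a tokenized document under a chosen code; the crude suffix-stripping stemmer and the tiny stopword list are placeholders for the Porter-style stemmer and the 524-word list used in the actual experiments.

```python
STOPWORDS = {"a", "an", "is", "for", "the", "of"}   # placeholder list

def crude_stem(token):
    """Very rough suffix stripping; stands in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def encode(tokens, code="UXO"):
    """Encode tokens under a 3-digit code: n-gram (U/B), stemming (O/X), stopword removal (O/X)."""
    ngram, stem, stop = code[0], code[1], code[2]
    if stop == "O":
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if stem == "O":
        tokens = [crude_stem(t) for t in tokens]
    if ngram == "B":
        return [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens

tokens = "mining citation graphs for the discovery of document relations".split()
print(encode(tokens, "UXO"))   # unigram, non-stemming, stopword removal
print(encode(tokens, "BOO"))   # bigram, stemming, stopword removal
```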


Table 6-7: Set 1-validity for various top-N rankings of discovered docsets, together with their supports and mining times, for the case of bigram. Each cell shows soft validity/hard validity (%), with the minimum support (minsup) and the mining time in seconds given in parentheses. (source: Sriphaew and Theeramunkong, 2007a)

N        BXO                            BOO                            BXX                           BOX
1000     45.47/43.95 (0.53, 174.49)     46.14/44.33 (0.67, 155.92)     6.29/6.29 (3.94, 442.95)      7.09/7.09 (4.76, 402.14)
5000     29.31/23.88 (0.35, 188.88)     29.13/27.24 (0.47, 166.96)     3.83/3.33 (3.15, 612.82)      3.88/3.59 (3.79, 570.65)
10000    24.49/19.33 (0.32, 189.52)     24.40/20.50 (0.39, 170.17)     3.13/2.33 (2.84, 681.40)      3.20/2.63 (3.42, 627.61)
50000    19.29/6.36 (0.25, 195.39)      18.88/8.62 (0.29, 176.48)      2.46/0.98 (2.31, 816.43)      2.36/1.19 (2.71, 767.25)
100000   19.51/3.67 (0.21, 212.14)      18.40/4.11 (0.28, 176.57)      2.30/0.63 (2.13, 862.84)      2.18/0.77 (2.48, 832.77)
Average  27.61/19.64 (0.33, 192.08)     27.39/20.96 (0.42, 169.22)     3.60/2.71 (2.87, 683.29)      3.74/3.05 (3.43, 640.08)

Table 6-8: Set 1-validity for various top-N rankings of discovered docsets, together with their supports and mining times, for the case of unigram. Each cell shows soft validity/hard validity (%), with the minimum support (minsup) and the mining time in seconds given in parentheses. (source: Sriphaew and Theeramunkong, 2007a)

N        UXO                            UOO                            UXX                            UOX
1000     3.88/3.78 (32.72, 122.49)      2.36/2.26 (46.35, 74.77)       2.79/2.79 (55.61, 160.98)      1.76/1.76 (74.78, 89.39)
5000     3.77/3.35 (26.98, 240.57)      2.38/1.99 (40.04, 175.72)      2.37/2.28 (48.46, 359.18)      1.55/1.48 (66.84, 198.16)
10000    3.47/2.63 (24.68, 312.69)      2.16/1.53 (37.63, 231.41)      2.09/1.75 (45.66, 466.00)      1.35/1.11 (63.76, 277.67)
50000    2.78/1.44 (19.95, 478.97)      1.75/0.74 (32.26, 412.79)      1.68/0.84 (39.64, 808.61)      1.12/0.49 (57.08, 539.55)
100000   2.71/1.02 (18.37, 564.65)      1.68/0.48 (30.40, 531.10)      1.66/0.57 (37.40, 1008.38)     1.14/0.32 (54.55, 691.02)
Average  3.32/2.44 (24.54, 343.87)      2.06/1.40 (37.34, 285.16)      2.12/1.64 (45.35, 560.63)      1.38/1.03 (63.40, 359.16)

From the tables, some interesting observations can be made as follows. First, with the same document representation, soft validity is always higher than or equal to hard validity, since the former is obtained by a less restrictive evaluation than the latter. Both validities involve valid relations between pairs of documents in a discovered docset. A relation between two documents is called valid when there is a link between those two documents under the v-OACM (v=1 in this experiment). The evaluation based on soft validity focuses on the probability that any two documents in a docset have a valid relation. On the other hand, the evaluation based on hard validity concentrates on the probability that at least one document in a docset has valid relations with all of the other documents. For example, in the case of the top-100000 ranking with the `BXO' representation (as shown in Table 6-7), 19.51% of the relations in the discovered docsets are valid while only 3.67% of the discovered docsets are perfect, i.e., there is at least one document that has valid relations with all of the other documents in that docset. Second, in every document representation, both soft validity and hard validity become lower when more ranks (i.e., top-N ranking with a larger N) are considered. As an implication of this result, our proposed evaluation method indicates that better docsets are located at higher ranks. Third, given two representations, say A and B, if the soft validity of A is better than that of B, then the hard validity of A tends to be higher than that of B. Fourth, the results of the bigram cases (`B**') are much better than those of the unigram cases (`U**'). One reason is that bigrams are quite superior to unigrams in representing the content of a document. Fifth, in the cases


of bigram, the stopword removal process is helpful while the stemming process does not help much. Sixth, in the cases of unigram, non-stemming is preferable while the stopword removal process is not very useful. Finally, the performance of `BXO' and `BOO' is comparable and much higher than `BOX' and `BXX', while the performance of `UXO' is much higher than the other unigram cases. However, on average, `UXX' seems to be the second best case for the unigram. Since the soft validity is more flexible than the hard validity, a higher soft validity is preferable. Although the performance of `BOO' seems to be slightly better than `BXO' at the higher ranks, `BXO' performs better on average. In our task, the performance ranking for bigram is `BXO' > `BOO' > `BOX' > `BXX' and the performance ranking for unigram is `UXO' > `UXX' > `UOO' > `UOX'. In terms of minimum support and computation time, we can conclude as follows. First, since a docset discovered from the bigram cases tends to have a lower support than in the unigram cases, it is necessary to set a small minimum support in order to obtain the same number of docsets. Second, the cases with stopword removal run faster than the ones without stopword removal since they consider fewer words. Moreover, they tend to have a lower minimum support. Besides the 1-OACM, the discovered docsets can be evaluated with the criteria of the 2-OACM and 3-OACM. In this assessment, only the four best representations, two from the unigram cases (`UXO' and `UXX') and two from the bigram cases (`BXO' and `BOO'), are taken into consideration. Figure 6-3 displays the soft validity (the left graph) and the hard validity (the right graph) under the 1-, 2-, and 3-OACMs. Since the minimum support and mining time in each case are the same as shown in Table 6-7 and Table 6-8, they are omitted from the figure. In the figure, we use a notation of the form OACM-order:representation to denote the evaluation of docsets under the specified OACM, where those docsets are discovered from a specific document representation. For example, `3:BXO' means the evaluation of docsets under the 3-OACM, where the docsets are discovered by encoding the document representation using the BXO scheme (bigram, non-stemming and stopword removal). Consistently for both soft validity and hard validity, the set 3-validity (the one calculated under the 3-OACM) of discovered docsets is higher than the set 2-validity (the one calculated under the 2-OACM), and in the same way the set 2-validity is much higher than the set 1-validity (the one calculated under the 1-OACM). Compared to the evaluation using only direct citation (1-OACM), more relations in the discovered docsets are valid when both direct and indirect citations (2- and 3-OACMs) are taken into consideration. Similar to the 1-OACM, `BXO' and `BOO' are comparable and perform as the best cases for both soft validity and hard validity under the same OACM. Moreover, in the cases of bigram evaluated under the 1- and 2-OACMs, the set validity drops remarkably when top-N rankings with a larger N are considered. The quality of docsets at higher ranks (smaller N) outperforms that at lower ranks. This outcome implies that our evaluation based on direct/indirect citations is a reasonable method for assessing docsets. For all types of document representation, the bigram cases perform better than the unigram cases when they are evaluated under the same v-OACM. Especially under the 3-OACM, the two bigram cases (`3:BXO' and `3:BOO') are almost 100% valid while the two unigram cases (`3:UXO' and `3:UXX') are approximately 50% valid.
This phenomenon shows the advantage of the bigram as a document representation for document relation discovery: the documents in each docset cite each other within the specified range in the citation graph. Furthermore, the performance gap between bigram and unigram becomes smaller when top-N rankings with a larger N are considered. For a top-N ranking with a larger N, the bigram cases tend to have bigger docsets than the unigram cases and thus obtain lower validity, since a bigger docset is naturally likely to have lower validity.


Figure 6-3: Set validity based on the 1-, 2- and 3-OACMs when various top-N rankings of discovered docsets are considered: soft validity (left) and hard validity (right). (source: Sriphaew and Theeramunkong, 2007a)

Conclusions

Section 6.2 shows a method to use citation information in research publications as a source for evaluating the discovered document relations. Three main contributions of this work are as follows. First, soft validity and hard validity are developed to express the quality of docsets (document relations), where the former focuses on the probability that any two documents in a docset have a valid relation while the latter concentrates on the probability that at least one document in a docset has valid relations with all of the other documents in that docset. Second, a method to use direct and indirect citations as comparison criteria is proposed to assess the quality of docsets. Third, the so-called expected validity is introduced, using probability theory, to evaluate the relative quality of discovered docsets. By comparing the result to the expected validity, the evaluation becomes impartial, even under different comparison criteria. Manual evaluation was also done for performance comparison. Using more than 10,000 documents obtained from a research publication database and frequent itemset mining as a process to discover document relations, the proposed method was shown to be a powerful way to evaluate the relations in several aspects: soft/hard scoring, direct/indirect citation, and relative quality over the expected validity. More details can be found in (Sriphaew and Theeramunkong, 2007a).


6.3. Application to Automatic Thai Unknown Detection

Unknown word recognition plays an important role in natural language processing (NLP) since words, the fundamental units of a language, may be newly developed and invented. Most NLP applications need to identify words in sentences before further manipulation. Word recognition can basically be done by using a predefined lexicon, designed to include as many words as possible. However, in practice, it is impossible to have a complete lexicon that includes all words in a language. Therefore, it is necessary to develop techniques to handle words not present in the lexicon, so-called unknown words. In languages with explicit word boundaries, it is straightforward to identify an unknown word and its boundary. This simplicity does not hold for languages without word boundaries (later called unsegmented languages, such as Thai, Japanese and Chinese), where words run on without any explicit space or punctuation mark (Cheng et al., 1999; Charoenpornsawat et al., 1998; Ling et al., 2003; Ando and Lee, 2000). Whereas analyzing such languages requires word segmentation, the existence of unknown words makes segmentation (or word recognition) accuracy lower (Theeramunkong and Tanhermhong, 2004; Asahara and Matsumoto, 2004; Jung-Shin and Su, 1997). Accurate detection of unknown words and their boundaries is mandatory for high-performance word segmentation. As a similar task, word extraction in unsegmented languages has also been explored in several studies (Su et al., 1994; Chang and Su, 1995; Ge et al., 1999; Zhang et al., 2000; Zhang et al., 2008). Instead of segmenting a running text into words, word extraction methods directly detect a set of unknown words from the text without determining the boundaries of all words in the text. In Thai, our target language, the major sources of unknown words are (1) Thai transliteration of foreign words, (2) invention of new Thai technical words, and (3) emergence of Thai proper names. For example, Thai medical texts often abound in transliterated or technical words/terms, related to diseases, organs, medicines, instruments or herbs, which may not be in any dictionary. Thai news articles usually include a lot of proper names related to persons, organizations, locations and so forth. Indirectly related to unknown word recognition, Thai compound word extraction and word segmentation without dictionaries were explored in (Sornlertlamvanich and Tanaka, 1996; Theeramunkong and Tanhermhong, 2004; Sornil and Chaiwanarom, 2004). Without any dictionary, these methods applied pure statistics with machine learning techniques to detect compound words by observing frequently occurring substrings in texts. However, it seems natural to utilize a dictionary for segmentation and simultaneously recognize unknown words when substrings do not exist in the dictionary. In the past, several works (Kawtrakul et al., 1997; Charoenpornsawat et al., 1998) have been proposed to recognize both explicit and implicit unknown words. Formed from multiple contiguous words, an implicit unknown word can be detected by observing its co-occurrence frequency. On the other hand, an explicit unknown word is triggered by an undefined substring, and its boundary can be found by first generating boundary candidates with respect to a set of predefined rules and then applying statistical techniques to select the most probable one.
However, one of the shortcomings of most previous approaches is that they require a set of manually constructed rules to restrict the generation of unknown word boundary candidates. To remove this limitation, this work proposes a method to generate the set of all possible candidates without constraining it by any handcrafted rule. However, with this relaxation, a large set of candidates may be generated, inducing the problem of unbalanced class sizes, where the number of positive unknown word candidates is much smaller than that of negative candidates. To solve this problem, a technique called group-based ranking evaluation (GRE) is incorporated into ensemble learning, namely boosting, in order to generate a sequence of classification models that later collaborate to select the most probable unknown word from multiple candidates. As the boosting


step, given a classification model, the GRE technique is applied to build a dataset for training the succeeding model by weighing each of its candidates according to their ranks and correctness when the candidates of an unknown word are considered as one group. In the experiments, the proposed method, namely V-GRE, is evaluated using a large Thai medical text corpus. Although research on unknown word recognition in the Thai language has not been as widely conducted as in other languages, two approaches have been proposed for detecting unknown words from a large corpus of Thai texts, later called the machine learning-based (ML-based) approach and the dictionary-based approach (Theeramunkong et al., 2000; Theeramunkong and Tanhermhong, 2004). In the ML-based approach, unknown word recognition can be viewed as a process to detect new compound words in a text without using a dictionary to segment the text into words. The dictionary-based approach attempts to identify the boundary of an unknown word when a system encounters a character sequence which is not registered in the dictionary while segmenting a text into a sequence of words. As an early work of the first approach, Sornlertlamvanich and Tanaka (1996) presented a method that uses the frequency difference between the occurrences of two adjoining sorted n-grams (a special case of sorted sistrings) to extract open compounds (uninterrupted sequences of words) from text corpora. Moreover, competitive and unified selections are applied to discriminate between an illegible string and a potential unknown word. By specifying different thresholds of frequency differences, the method can detect varying numbers of extracted strings (unknown words), with an inherent trade-off between the quantity and the quality of the extracted strings. As two limitations, the method requires manual setting of the threshold, and it applies only the frequency difference, which may not be enough to express the distinction between an unknown word and a common prefix of words. To solve these shortcomings, some works (Kawtrakul et al., 1997; Theeramunkong et al., 2000; Sornlertlamvanich et al., 2000; Theeramunkong and Tanhermhong, 2004; Sornil and Chaiwanarom, 2004) applied machine learning (ML) techniques to detect an unknown word by using statistical information of the contexts surrounding that potential unknown word. Sornlertlamvanich et al. (2000) presented a corpus-based method to learn a decision tree for the purpose of extracting compound words from corpora. In the same period, a similar approach was proposed in (Theeramunkong et al., 2000; Theeramunkong and Tanhermhong, 2004) to construct a decision tree that enables us to segment a text without making use of a dictionary. It was shown that, even without a dictionary, the ML-based methods could achieve up to 85%-95% word segmentation accuracy or word extraction rate. As the second approach, Kawtrakul et al. (1997) used the combination of a statistical semantic segmentation model and a set of context-sensitive rules to detect unknown words in the context of a running text. The context-sensitive rules were applied to extract information related to such an unknown word, mostly representing the name of an entity, such as a person, animal, plant, place, document, disease, organization, equipment, or activity. Charoenpornsawat et al. (1998) considered unknown word recognition as a classification problem and proposed a feature-based approach to identify Thai unknown word boundaries.
Features used in the approach are built from the specific information in the context surrounding the target unknown words. Winnow, as used by Blum (1997), is an ML algorithm applied to automatically extract features from the training corpus. As a more recent work, Haruechaiyasak et al. (2006) proposed a semi-automated framework that utilized statistical and corpus-based concepts for detecting unknown words and then introduced a collaborative framework among a group of corpus builders to refine the obtained results. In the automated process, unknown word boundaries are identified using the frequencies of strings. In (Haruechaiyasak et al., 2008), a comparison of the dictionary-based approach and the ML-based approach for word segmentation was presented, where unknown word detection is implicitly handled. Since each of the dictionary-based and ML-based approaches has its


advantages, most previous works (Kawtrakul et al., 1997; Charoenpornsawat et al., 1998; Theeramunkong et al., 2000) combined them to handle unknown words. Although several works have been done in both approaches, they have some shortcomings: 1) most works strictly separated the learning process from the word segmentation process; 2) they used only local information to learn a set of rules for word segmentation/unknown word detection by a single-level learning process (a single classifier); and 3) they required a set of handcrafted rules to restrict the generation of unknown word boundary candidates. To overcome these disadvantages, this work provides a framework that combines the word segmentation process with a learning process that utilizes long-distance context in learning a set of rules for unknown word detection during word segmentation, where no manual rules are required. Moreover, our learning process also employs boosting techniques to improve classification accuracy.

6.3.1. Thai Unknown Words as Word Segmentation Problem

Most word segmentation algorithms use a lexicon (or a dictionary) to parse a text at the character level. In general, when a system meets an unknown word, three possible segmented results can be expected as an output. The first one is to obtain one or more sequences of known words from an unknown (out-of-dictionary) word, especially in the case of a compound word. For example, มะม่วงอกร่อง (meaning: a kind of mango) can be segmented into มะม่วง (meaning: mango), อก (meaning: breast), and ร่อง (meaning: crack). All of these sub-words are found in the lexicon. The second one is to obtain a sequence of unknown segments which are undefined in the lexicon. For example, we cannot detect any sub-word from the out-of-dictionary word วิสัญญี (meaning: anesthetic) since none of its substrings exist in the dictionary. The last one is to get a sequence of known words mixed with unknown segments. For instance, the unknown word ลูคีเมีย (meaning: leukemia) can be segmented into two portions: an unknown segment (ลูคี, with no meaning) and a known word (เมีย, meaning: wife). In terms of processing, these three different results can be interpreted as follows. When we get a result of the first type, it is hard for us to know whether the result is an unknown word since it may be misunderstood as multiple words existing in the dictionary. This type of unknown word is known as a hidden unknown word. Called an explicit unknown word, the second type is easily recognized since the whole word is composed of only unknown segments. Named a mixed unknown word, a third-type unknown word is also hard to recognize since the boundary of the unknown word is unclear. Furthermore, it is also difficult to distinguish between the second and the third type. However, the second and third types contain unknown segments, later called unregistered portions, that signal the existence of an unknown word. This work focuses on the recognition of unknown words of the second and third types, i.e., detectable unknown words.

6.3.2. The Proposed Method

This section describes the proposed method briefly. The reader can find the full description in (TeCho et al., 2009b). The proposed method consists of three processes: (1) unregistered portion detection, (2) unknown word candidate generation and reduction, and (3) unknown word identification, as shown in Figure 6-4.


Figure 6-4: Overview of the proposed method (source: TeCho et al., 2009b)

Unregistered Portion Detection

Normally, when we apply word segmentation to a Thai running text with some unknown words, we may face a number of unrecognizable units due to out-of-vocabulary words. Moreover, without any additional constraints, an existing algorithm may place segmentation boundaries at obviously incorrect positions. For example, the system may place an impossible word boundary between a consonant and a vowel. To resolve such obvious mistakes, several recent works (Sornil and Chaiwanarom, 2004; Haruechaiyasak et al., 2006; Theeramunkong and Usanavasin, 2001; Viriyayudhakorn et al., 2007; Limcharoen, 2008) have applied a useful concept, namely the Thai Character Cluster (TCC) (Theeramunkong et al., 2000; Theeramunkong and Tanhermhong, 2004), which is defined as an inseparable group of Thai characters based on the Thai writing system. Unlike word segmentation, segmenting a text into TCCs can be done completely without error or ambiguity by a small set of predefined rules. The result of TCC segmentation can be used to guide word segmentation not to segment at unallowable positions. To detect unknown words, TCCs can be used as the basic units of processing. Using techniques originally proposed by TeCho et al. (2008a; 2008b; 2009a; 2009b), this work employs the combination of TCCs and the LEXiTRON dictionary (2008) to facilitate word segmentation. In this work, longest-matching word segmentation (Poowarawan, 1986) is applied to segment the text in either a left-to-right (LtoR) or right-to-left (RtoL) manner, and the results are then compared to select the one with the minimum number of unregistered portions. If the number of unregistered portions from LtoR longest matching equals that of RtoL, the result of the LtoR longest matching is selected.
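A minimal sketch of this bidirectional longest-matching step is shown below, using an ordinary Python set as a stand-in for the LEXiTRON dictionary and operating on a pre-computed list of TCCs; the helper names are illustrative only.

```python
def longest_match(tccs, lexicon, reverse=False):
    """Greedy longest matching over a TCC list; unmatched TCCs become
    unregistered portions (tagged '?'). reverse=True scans right-to-left."""
    units = list(reversed(tccs)) if reverse else list(tccs)
    out, i = [], 0
    while i < len(units):
        match = None
        for j in range(len(units), i, -1):           # try the longest span first
            span = units[i:j]
            word = "".join(reversed(span)) if reverse else "".join(span)
            if word in lexicon:
                match, i = word, j
                break
        if match is None:
            out.append(("?", units[i]))               # unregistered portion
            i += 1
        else:
            out.append(("w", match))
    return list(reversed(out)) if reverse else out

def segment(tccs, lexicon):
    """Pick the direction that yields fewer unregistered portions (ties -> LtoR)."""
    ltor = longest_match(tccs, lexicon, reverse=False)
    rtol = longest_match(tccs, lexicon, reverse=True)
    count = lambda seg: sum(1 for tag, _ in seg if tag == "?")
    return ltor if count(ltor) <= count(rtol) else rtol
```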


Unknown Word Candidate Generation and Reduction

For candidate generation, up to ±h TCCs surrounding an unregistered portion are merged with it to form an unknown word candidate. With this setting, $(h+1)^2$ possible candidates can be generated for each unregistered portion. Since a word in Thai cannot contain any special characters, it is possible to reduce the number of candidates using surface constraints, such as spaces or punctuation. To filter out unrealistic candidates, two sets of separation markers are considered. The first set contains four types of marker words: (1) conjunctive words, e.g., ก็ต่อเมื่อ (meaning: when), นอกจากนี้ (meaning: besides this), etc.; (2) preposition words, e.g., ตั้งแต่ (meaning: since), สำหรับ (meaning: for), etc.; (3) adverb words, e.g., เดี๋ยวนี้ (meaning: at this moment), มากกว่า (meaning: more than), etc.; and (4) special verbal words, e.g., หมายความว่า (meaning: means), ประกอบด้วย (meaning: consists of), etc. The second set includes five types of special characters as follows: (1) inter-word separations, i.e., a white space; (2) punctuation marks, e.g., ?, -, (…), etc.; (3) general typography signs, e.g., %, ฿, etc.; (4) numbers, including both Arabic (0, …, 9) and Thai (๐, …, ๙) numerals; and (5) foreign characters, i.e., English alphabets (including capital letters).
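The sketch below illustrates this candidate generation step: it expands an unregistered portion by up to h TCCs on each side and drops any candidate that crosses a separation marker. The marker test and the data layout are simplifications of what the original system uses, and the names are illustrative only.

```python
def generate_candidates(tccs, start, end, h, is_marker):
    """Candidates for the unregistered portion tccs[start:end].

    Each candidate extends the portion by l TCCs to the left and r TCCs to the
    right (0 <= l, r <= h), giving (h+1)^2 raw candidates; candidates that
    would include a separation marker are filtered out.
    """
    candidates = []
    for l in range(h + 1):
        for r in range(h + 1):
            lo, hi = start - l, end + r
            if lo < 0 or hi > len(tccs):
                continue
            span = tccs[lo:hi]
            if any(is_marker(t) for t in span):        # crosses a delimiter
                continue
            candidates.append("".join(span))
    return candidates

# Toy usage with single characters standing in for TCCs and '|' as a marker.
tccs = list("ab|cdXYef|gh")            # the 'XY' region pretends to be unregistered
cands = generate_candidates(tccs, start=5, end=7, h=3,
                            is_marker=lambda t: t == "|")
print(cands)
```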

Unknown Word Identification

In the past, most previous works on Thai unknown word recognition (Sornlertlamvanich and Tanaka, 1996; Theeramunkong and Tanhermhong, 2004; Sornil and Chaiwanarom, 2004; Kawtrakul et al., 1997; Charoenpornsawat et al., 1998) treated unknown word candidates independently. However, in a real situation, the set of candidates generated from an unregistered portion should be considered dependently and treated as a group. In the learning process, each candidate in a group is labeled as a positive or negative instance. Although many candidates can be generated from an unregistered portion, typically only a few (just one or two) candidates are the potential unknown words. This phenomenon forms an unbalanced dataset problem. For example, Table 6-9 shows the rank of each candidate in a group, where only two out of forty-two candidates are eligible unknown words, i.e., ranks 1 and 32. After the ranking process, the most probable candidate is selected as the suggested unknown word.

Table 6-9: Example output of predicted unknown word candidates ranked in a group by the proposed method (source: TeCho et al., 2009b)

Rank   Unknown Word Candidate (c)   P(+|c)         Actual Class   Predicted Class
1      คีโตโคนาโซล                     9.99885e-01    Y              Y
2      ใช้แชมพูคี                       9.98995e-01    N              Y
3      ใช้แชมพูคีโต                      9.95521e-01    N              Y
…      …                            …              …              …
30     มพูคีโตโคนาโซล                   4.33515e-04    N              N
31     แชมพูคีโตโคนาโซ                  1.06612e-04    N              N
32     แชมพูคีโตโคนาโซล                 8.53279e-05    Y              N
…      …                            …              …              …
40     คีโตโค                           2.63289e-22    N              N
41     คีโต                             8.88017e-53    N              N
42     คี                               4.59288e-97    N              N

Feature Extraction

As stated in the previous section, TCCs are used as processing units. We therefore use a sequence of TCCs instead of a sequence of characters to denote an unknown word candidate. To specify whether a candidate is the most probable unknown word or not, a set of suitable features needs to be considered. In this work, several statistics collected from the context around an unknown word are used as features. In order to speed up the process of collecting statistics from a text, we apply the algorithm proposed by Nagao and Mori (1994), which utilizes sorted sistrings. For each sistring (i.e., unknown word candidate), eight types of features, (f1)-(f8), are extracted. To explain these features, the following description is first given. Let A be a set of possible Thai characters, B be a set of possible TCCs, E be a set of possible special characters, and C = c1c2c3 . . . c|C| (ci ∈ A ∪ E) be a corpus. Let Di ∈ D be the i-th document, a substring of C, Di = C[bi:ei], where bi is the position of the first character in the document Di, ei-1 is the position of the last character in the document Di, C[ei:ei] is a special character specifying the end of the document Di, and bi = ei-1+1. Let T = t1t2t3 . . . t|T| (ti ∈ B) be the segmented corpus of C as a TCC sequence, with t1 = C[1:u], ti = C[v:w], ti+1 = C[(w+1):x], and t|T| = C[y:|C|], and let W be the set of all possible words in the dictionary. An unknown word candidate S can be defined by a substring of T, ST = T[p:q] (= tp . . . tq), where p and q are the starting and ending TCC positions of S, respectively. The candidate S can also be expressed by a substring of C, SC = C[r:s] (= cr . . . cs), where r and s are the starting and ending character positions of S, respectively. As one restriction, no special character is allowed in S. With the above description, the eight features, (f1)-(f8), can be formally defined as follows.

(f1) Number of TCCs (Nt): The number of TCCs can be used as a clue to detect unknown words. Intuitively, many unknown words are technical words, each of which is a transliteration of an English technical term, and many of them are very long. Formally, the number of TCCs in an unknown word candidate S is defined as Nt(S) = |ST|.

(f2) Number of characters (Nc): Similar to the number of TCCs, the number of characters in a sequence is another factor in determining whether the sequence is a potential word or not. Concretely, an unknown word tends to be long. The number of characters in an unknown word candidate S is defined as Nc(S) = |SC|.

(f3) Number of known words (Nw): As in several languages, some unknown words in Thai can be viewed as compound words that contain a number of known words. Therefore, when we consider a sistring as an unknown word candidate, the number of known words in that sistring can be used as a clue to identify whether the sistring is an unknown word. The number of known words is defined as Nw(S) = |{w | w = S[a:b] ∧ w ∈ W}|, where S[a:b] is a substring of S from position a to position b.


(f4) Sistring frequency (Nf): The sistring frequency is useful information for determining whether the sistring is a word. The number of occurrences of a sistring which is an unknown word tends to be higher than that of a sistring which cannot be a word. The sistring frequency is defined as Nf(S) = |{C[c:d] | C[c:d] = S ∧ 1 ≤ c ≤ d ≤ |C|}|, where C[c:d] is a substring of C from position c to position d, and c, d range from 1 to |C|.

(f5) Left and right TCC variety (Lv, Rv): The variety expresses the potential TCCs which come before or after a string. It implies the impurity or uncertainty. The left (right) variety is defined as the number of distinct TCCs actually occurring before (after) an unknown word candidate. A high variety of distinct TCCs on the left-hand side (right-hand side) is one indicator for guessing whether the candidate should be detected as an unknown word. We therefore use the numbers of distinct TCCs on the left- and right-hand sides as features. The definitions of the left and right TCC variety are

Lv(S) = |d({T[a:a] | T[a+1:b] = S ∧ 1 ≤ a < b ≤ |T|})|
Rv(S) = |d({T[b:b] | T[a:b-1] = S ∧ 1 ≤ a < b ≤ |T|})|

where d(L) returns the set of distinct elements in L, T[a:b] is a substring of T from position a to position b, and a, b range from 1 to |T|. T[a:a] and T[b:b] are the TCCs that co-occur on the left-hand side and the right-hand side of S in the corpus, respectively.

(f6) Probability of a special character on the left and right (Ls, Rs): The probability that a special character co-occurs on the left-hand side or the right-hand side of the candidate under consideration indicates that the candidate is located near delimiters and should be detected as an unknown word. We therefore use these probabilities as features. They are the fractions of occurrences of S in the corpus that are immediately preceded (followed) by a special character:

Ls(S) = |{C[d:e] | C[d:e] = S ∧ C[d-1:d-1] ∈ E}| / Nf(S)
Rs(S) = |{C[d:e] | C[d:e] = S ∧ C[e+1:e+1] ∈ E}| / Nf(S)

where C[d:e] is a substring of C from position d to position e, d and e range from 1 to |C|, and Nf(S) is the number of occurrences of the unknown word candidate S in the corpus.

(f7) Inverse document frequency (IDF): The inverse document frequency is a good measure for specifying the importance of a sistring. An unknown word tends not to occur in many documents but to appear frequently in a few specific documents; thus, a high IDF means the sistring is likely to be an unknown word. It is obtained by dividing the number of all documents by the number of documents containing the sistring, and then taking the logarithm of that quotient. The formal definition of IDF(S) (the inverse document frequency of S) is

IDF(S) = log(|D| / |DS|)


where log is the natural logarithm, |D| is the total number of documents in the corpus, and |DS| is the number of documents in which S appears.

(f8) Term frequency with inverse document frequency (TFIDF): The TFIDF is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally with the number of times a word appears in the document but is offset by the frequency of the word in the corpus. The definition of TFIDF(S) is

TFIDF(S) = TF(S) × IDF(S)

where TF(S) is the term frequency of S.
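Several of these features can be computed directly from the TCC-segmented corpus. The sketch below, with illustrative names and a toy corpus, shows how the frequency, left/right variety, and left/right special-character features could be gathered for a candidate given as a tuple of TCCs:

```python
def candidate_features(cand, tccs, special):
    """Compute Nf, Lv, Rv, Ls, Rs for a candidate (a tuple of TCCs)."""
    n, k = len(tccs), len(cand)
    left, right, freq = set(), set(), 0
    ls = rs = 0
    for i in range(n - k + 1):
        if tuple(tccs[i:i + k]) != cand:
            continue
        freq += 1
        if i > 0:
            left.add(tccs[i - 1])
            ls += tccs[i - 1] in special
        if i + k < n:
            right.add(tccs[i + k])
            rs += tccs[i + k] in special
    return {
        "Nf": freq,
        "Lv": len(left), "Rv": len(right),
        "Ls": ls / freq if freq else 0.0,
        "Rs": rs / freq if freq else 0.0,
    }

# Toy TCC sequence; ' ' acts as the only special character here.
tccs = ["กา", "แฟ", " ", "ร้าน", "กา", "แฟ", "เปิด", " ", "กา", "แฟ"]
print(candidate_features(("กา", "แฟ"), tccs, special={" "}))
```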

Ensemble Classification with the Group-based Ranking Evaluation Technique

This section describes four main aspects of our proposed approach in learning an ensemble classifier for identifying unknown words. As the first aspect, exploiting the features extracted from the training corpus, naïve Bayesian learning is applied to build a base classifier that assigns a probability to each unknown word candidate, representing how likely the candidate is to be a suitable unknown word for an unregistered portion. Second, a mechanism named Group-based Ranking Evaluation (GRE) is introduced to select the most probable unknown word for an unregistered portion, taking into account the ranking within the group of unknown word candidates generated from the same unregistered portion at a specific location. Third, GRE-based boosting is employed to generate a sequence of classifiers, where each consecutive classifier in the sequence works as an expert in classifying instances that were not classified correctly by its preceding classifier, and a confidence weight is given to each generated classifier based on its GRE-based performance. Fourth, a so-called Voting Group-based Ranking Evaluation (V-GRE) technique is implemented to combine the results obtained from the sequence of classifiers in classifying a test instance, with consideration of the confidence weight of each classifier. The details of these aspects are illustrated in order as follows.

Naïve Bayesian Classification

Based on the naïve Bayesian method, the probability that a generated candidate c (characterized by a set of features F = {f1, f2, . . . , f|F|}) is an unknown word can be defined as follows.

$P(+\,|\,c) = \frac{\hat{P}(+\,|\,c)}{\hat{P}(+\,|\,c) + \hat{P}(-\,|\,c)}, \qquad \hat{P}(+\,|\,c) = P(+) \prod_{i=1}^{|F|} P(f_i\,|\,+), \qquad \hat{P}(-\,|\,c) = P(-) \prod_{i=1}^{|F|} P(f_i\,|\,-)$

where $P(+|c)$ is the probability that the candidate c is an unknown word, $\hat{P}(+|c)$ is the unnormalized probability that the candidate c is an unknown word (positive class), $\hat{P}(-|c)$ is the unnormalized probability that the candidate c is not an unknown word (negative class), $P(+)$ (or $P(-)$) is the prior probability that the class is positive (or negative), and $P(f_i|+)$ (or $P(f_i|-)$) is the probability of the feature $f_i$ when the class is positive (or negative). Here,


both $\hat{P}(+|c)$ and $\hat{P}(-|c)$ are derived from the independence assumption of naïve Bayes. For a continuous attribute $f_i$, a Gaussian distribution with smoothing can be applied as follows.

$P(f_i = v \,|\, +) = \frac{1}{\sqrt{2\pi}\,(\sigma_i^{+} + \epsilon)} \exp\!\left( -\frac{(v - \mu_i^{+})^2}{2\,(\sigma_i^{+} + \epsilon)^2} \right)$

where $\mu_i^{+}$ (or $\mu_i^{-}$) is the mean of the positive (or negative) class, $\sigma_i^{+}$ (or $\sigma_i^{-}$) is the standard deviation of the positive (or negative) class, and $\epsilon$ is a small positive constant used for smoothing to resolve the sparseness problem. It is set to 0.000001 in our experiments.

Group-based Ranking Evaluation

Unlike the evaluation model of a traditional classifier, our proposed Group-based Ranking Evaluation (GRE) technique categorizes all candidates produced at the same unregistered portion location into the same group. This technique ranks all candidates within their group based on their probabilities of being an unknown word, and then selects the candidate with the highest probability within that group as the potential prediction for the unknown word.

$\hat{c}_i = \arg\max_{c \in G_i} P(+\,|\,c)$

where $\hat{c}_i$ is the most probable candidate of the i-th group, $G_i$ is the group of candidates generated from the i-th unregistered portion, and $P(+|c)$ is the probability that c is an unknown word. To be more flexible, it is also possible to relax this and accept the top-t candidates as the potential unknown words.
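A compact sketch of this scoring and group-wise selection step is given below. It assumes the Gaussian class-conditional model sketched above; the class and function names (GaussianNB, gre_select) are illustrative and not part of the original system.

```python
import math

class GaussianNB:
    """Tiny Gaussian naive Bayes for continuous features with smoothing."""
    def __init__(self, eps=1e-6):
        self.eps = eps

    def fit(self, X, y):
        self.params = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.params[label] = (
                len(rows) / len(X),                                     # prior
                [self._stats([r[i] for r in rows]) for i in range(len(X[0]))],
            )
        return self

    def _stats(self, values):
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        return mu, math.sqrt(var)

    def unnormalized(self, x, label):
        prior, stats = self.params[label]
        p = prior
        for v, (mu, sigma) in zip(x, stats):
            s = sigma + self.eps
            p *= math.exp(-((v - mu) ** 2) / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)
        return p

    def prob_positive(self, x):
        pos, neg = self.unnormalized(x, +1), self.unnormalized(x, -1)
        return pos / (pos + neg) if pos + neg > 0 else 0.0

def gre_select(groups, model):
    """GRE: for each candidate group, pick the candidate with the highest P(+|c)."""
    return [max(group, key=lambda cand: model.prob_positive(cand)) for group in groups]
```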

GRE-based Boosting

AdaBoost (Freund and Schapire, 1999) is a technique that repeatedly constructs a sequence of classifiers based on a base learning method. In this technique, each instance in the training set is attached with a weight (initially set to 1.0). In each iteration, the base learning method constructs a classifier using all instances in the training set, with their weights indicating their importance. After evaluating the obtained classifier, the weights of the misclassified examples are increased to make the learning method focus more on these examples in the next iteration. Originally, AdaBoost evaluates each instance and updates its weight individually. This is not suitable for the unknown word data, which we treat as groups of unknown word candidates. We therefore propose a new technique called GRE-based Boosting to apply the AdaBoost idea efficiently to the unknown word data. In this technique, a weight is assigned to each group of candidates. After constructing a base classifier, each group is evaluated based on the GRE technique explained in the previous section. The classifier is considered to misclassify a group when the top-ranked candidate in the group is not a correct unknown word. The weight of that group is then increased to make the group more focused in the next iteration. Figure 6-5 shows the overall process of the proposed GRE-based boosting technique. Initially, a training set with all groups weighted by 1.0 is fed to INDUCER, a base learning method, in order to generate a classifier. The obtained model is passed to GRE-INCOR to evaluate it and obtain the misclassified groups. Then a confidence weight of the classifier and a ratio of the success rate to the failure rate are calculated from the misclassification rate (as explained in Algorithm 1).


The confidence weight represents the performance of the classifier. It is later used to represent the strength of the classifier when the results from several classifiers are combined in the evaluation step. The ratio of the success rate to the failure rate is used as the new weight of the misclassified groups in the next iteration. Basically, this ratio is larger than 1. Hence, the classifier constructed in the next iteration will be specialized to the previously misclassified instances.

Figure 6-5: GRE-based Boosting (source: TeCho et al., 2009b)

Algorithm 1: GRE-based Boosting
Input: an initial training set of candidate groups, with all group weights set to 1.0, and K, the number of iterations.
Output: a sequence of K base classifiers, each attached with a confidence weight.
1: for k = 1 to K do
2:     construct a base classifier from the current weighted training set using INDUCER;
3:     evaluate the classifier with GRE-INCOR to obtain the set of misclassified groups;
4:     compute the error rate of the classifier from the weights of the misclassified groups;
5:     compute the confidence weight of the classifier and the success-to-failure ratio from the error rate;
6:     for each group in the training set do
7:         if the group is misclassified then set its weight to the success-to-failure ratio;
8:         else set its weight to 1;
9:     end
10: end
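The following sketch, under the assumption that a base learner exposes the fit/prob_positive interface of the earlier naive Bayes example, implements the group-weighted boosting loop; the AdaBoost-style confidence weight 0.5·ln((1-error)/error) is a standard choice and is an assumption here, not a formula quoted from the original work.

```python
import math

def gre_boost(groups, labels, induce, K=5):
    """groups[i]: list of feature vectors for the i-th unregistered portion.
    labels[i]:   list of +1/-1 labels aligned with groups[i].
    induce(X, y, w): returns a model with prob_positive(x); w are group weights
    expanded to instances. Returns [(model, confidence_weight), ...]."""
    weights = [1.0] * len(groups)
    ensemble = []
    for _ in range(K):
        X = [x for g in groups for x in g]
        y = [t for lab in labels for t in lab]
        w = [weights[i] for i, g in enumerate(groups) for _ in g]
        model = induce(X, y, w)
        # GRE evaluation: a group is misclassified if its top-ranked candidate
        # is not a true unknown word.
        miss = []
        for i, (g, lab) in enumerate(zip(groups, labels)):
            top = max(range(len(g)), key=lambda j: model.prob_positive(g[j]))
            if lab[top] != +1:
                miss.append(i)
        error = sum(weights[i] for i in miss) / sum(weights)
        error = min(max(error, 1e-6), 1 - 1e-6)           # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)        # confidence weight
        beta = (1 - error) / error                         # success-to-failure ratio
        ensemble.append((model, alpha))
        weights = [beta if i in miss else 1.0 for i in range(len(groups))]
    return ensemble
```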


Algorithm 1 shows the GRE-based boosting technique in detail. The algorithm starts with an initial training set in which each group of unknown word candidates generated for an unregistered portion is given an initial weight of 1, and each candidate in a group is represented by its feature values together with a target attribute (the class label) stating whether it is the correct unknown word (+1) or not (-1). K iterations are conducted to construct a sequence of base classifiers. At the k-th iteration, the weighted training set is fed to INDUCER to construct a base classifier. The classifier is then evaluated by GRE-INCOR, yielding a set of misclassified groups, from which the error rate of the classifier can be calculated. The error rate is used to calculate the confidence weight of the classifier and the weight to be used in the next iteration. Finally, the weight of each misclassified group is set to the success-to-failure ratio, while the weight of every other group is set to 1.

Voting Group-based Ranking Evaluation

From the previous step, we obtain a sequence of base classifiers, each attached with its confidence weight. In this section, we propose a technique called Voting Group-based Ranking Evaluation (V-GRE) to evaluate a group of unknown word candidates and predict the unknown word by combining votes from all base classifiers. Figure 6-6 shows the process of evaluating a given group of unknown word candidates. Each candidate in the group is fed to all the classifiers to obtain the probabilities that the candidate is a correct unknown word. Each probability is weighted by the confidence weight of the corresponding classifier. These weighted probabilities are summed for each candidate. Finally, the candidate with the highest summed probability value is chosen as the unknown word.

Figure 6-6: Voting Group-based Ranking Evaluation (source: TeCho et al., 2009b)


Algorithm 2: Voting GRE (V-GRE)
Input:
  M = {(m_1, α_1), ..., (m_K, α_K)}, a set of base classifiers with their confidence weights
  U = {G_1, ..., G_n}, a set of unknown word groups
Output:
  R, a set whose i-th member R_i is the set of the top-t suggested unknown words for the i-th unregistered portion
 1:  R <- {};
 2:  foreach G_i in U do
 3:    S <- {};
 4:    foreach c_ij in G_i do
 5:      s_ij <- 0;
 6:      foreach (m_k, α_k) in M do
 7:        p <- CLASSIFIER(m_k, c_ij);
 8:        s_ij <- s_ij + α_k · p;
 9:      end
10:      S <- S ∪ {(c_ij, s_ij)};
11:    end
12:    R_i <- TOP-t-CANDIDATE(S);
13:    R <- R ∪ {R_i};
14:  end

Algorithm 2 shows the evaluation process in detail. This algorithm uses as inputs a set of classification models M = {(m_k, α_k)} and a testing set U = {G_i}, where m_k is the model generated at the k-th iteration, α_k is the confidence weight of m_k, G_i is the group of unknown word candidates generated for the i-th unregistered portion, c_ij is the j-th candidate of the i-th unregistered portion, n is the number of unregistered portions, and q_i is the number of unknown word candidates generated for the i-th unregistered portion. Then, each base classifier m_k and each candidate c_ij are fed to the function CLASSIFIER to obtain p, the probability that the candidate is an unknown word according to the model. This probability is weighted by α_k and added into the corresponding summation s_ij. Finally, the top-t candidates are chosen and returned by TOP-t-CANDIDATE as the set of predicted unknown words.
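The voting step can be sketched in the same setting. The sketch below assumes the ensemble format produced by the previous sketch (a list of (model, α_k) pairs, each model returning the probability that a candidate is a correct unknown word) and a small top_t helper standing in for TOP-t-CANDIDATE.

from typing import Callable, Dict, List, Tuple

Candidate = Dict[str, float]                      # feature vector of one candidate
Model = Callable[[Candidate], float]              # P(candidate is a correct unknown word)
Ensemble = List[Tuple[Model, float]]              # (model m_k, confidence weight alpha_k)

def top_t(scored: List[Tuple[int, float]], t: int) -> List[int]:
    # TOP-t-CANDIDATE: indices of the t candidates with the largest summed scores.
    return [j for j, _ in sorted(scored, key=lambda x: x[1], reverse=True)[:t]]

def voting_gre(ensemble: Ensemble,
               groups: List[List[Candidate]],
               t: int = 1) -> List[List[int]]:
    # V-GRE: for every group, sum the alpha_k-weighted probabilities over all
    # base classifiers and return the indices of the top-t candidates.
    results = []
    for group in groups:
        scored = []
        for j, candidate in enumerate(group):
            s = sum(alpha * model(candidate) for model, alpha in ensemble)
            scored.append((j, s))
        results.append(top_t(scored, t))
    return results

With t = 1, this reduces to choosing the candidate with the highest summed probability for each unregistered portion, as in Figure 6-6.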

6.3.3. Experimental Settings and Results

In the experiment, we used a corpus of 16,703 medical-related documents (8.4 MB) gathered from the WWW, taken from (Theeramunkong et al., 2007), for evaluation. The corpus was first preprocessed by removing HTML tags and all undesirable punctuation. To construct the set of features, we applied TCCs and the sorted sistring technique. After applying word segmentation to the running text, we detected 55,158 unregistered portions. Based on these unregistered portions, 3,209,306 unknown word candidates were generated according to the process described previously. Moreover, these 55,158 unregistered portions came from only 3,763 distinct words. In practice, each group of candidates may contain one or two positive labels. Therefore, 62,489 candidates were assigned as positive and 3,146,819 as negative. The average number of candidates in a group is around 58.


Based on a preliminary statistical analysis of the Thai lexicon, we found that the average number of TCCs in a word is around 4.5. In this work, to limit the number of generated unknown word candidates, the maximum number of TCCs surrounding an unregistered portion (h) is set to nine, which is roughly twice the average number of TCCs in a word. With h = 9, at most 100 unknown word candidates are generated for each unregistered portion. Moreover, it is possible to use two sets of separation markers (Sect. 4.2 in TeCho et al., 2009b) to reduce the number of candidates. Table 6-10 shows the numbers of candidates generated with and without applying the two sets of separation markers. The second and fourth columns indicate the distinct and total numbers of candidates, respectively. The third and fifth columns show the ratio of these numbers to the numbers of candidates generated without considering any separation markers, for the distinct and total cases, respectively. A minimal sketch of this bounded candidate generation is given after the table.

Table 6-10: Numbers of candidates generated with/without applying two sets of separation markers and their proportions compared to 'None' (source: TeCho et al., 2009b)

Marker Set           # Distinct   % Proportion   # Total      % Proportion
None                 2,567,463        100.00     7,632,300        100.00
First Set            2,363,829         92.07     7,158,875         93.80
Second Set           1,295,737         50.47     4,241,097         55.57
First + Second Set   1,153,867         44.94     3,891,845         50.99
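The bounded candidate generation referred to above can be sketched as follows. This is a simplified view under the assumption that a candidate is formed by concatenating up to h TCC units immediately to the left of the unregistered portion, the portion itself, and up to h TCC units immediately to its right (which yields at most (h+1)^2 = 100 candidates for h = 9); the separation-marker pruning summarized in Table 6-10 is not shown.

from typing import List

def generate_candidates(left_tccs: List[str],
                        unregistered: str,
                        right_tccs: List[str],
                        h: int = 9) -> List[str]:
    # Enumerate unknown-word candidates for one unregistered portion by
    # extending it with 0..h TCC units on each side.
    candidates = []
    max_left = min(h, len(left_tccs))
    max_right = min(h, len(right_tccs))
    for nl in range(max_left + 1):                        # TCC units taken on the left
        left = "".join(left_tccs[len(left_tccs) - nl:])   # the nl units closest to the portion
        for nr in range(max_right + 1):                   # TCC units taken on the right
            right = "".join(right_tccs[:nr])
            candidates.append(left + unregistered + right)
    return candidates

# Hypothetical usage with placeholder strings standing in for real Thai TCC units:
# generate_candidates(["ab", "cd"], "XY", ["ef", "gh"]) yields 9 candidates,
# from "XY" up to "abcdXYefgh".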

Exploiting a naïve Bayes classifier as the base classifier, the proposed methods, GRE-based boosting (hereafter, GRE) and V-GRE, are used to learn ensemble classifiers and to identify unknown words. For V-GRE, the number of boosting iterations is set to ten; that is, ten classifiers are generated sequentially and used as a classification committee. Moreover, to evaluate the proposed method in detail, we conducted experiments to examine the effect of the eight features, (f1)-(f8), on the classification result by comparing the performance of every possible feature combination. In the experiments, 10-fold cross validation is employed to compare the proposed methods (GRE and V-GRE) with the record-based naïve Bayesian method (R-NB). R-NB is a traditional naïve Bayesian method in which all instances in the training/testing set are assumed to be independent of each other. We investigate the performance of GRE, V-GRE, and R-NB when the top-t candidates, with t ranging from 1 to 10, are considered as correct answers. Table 6-11 displays the performance of the two group-based evaluations, GRE and V-GRE, as well as R-NB, for the all-feature set (f1-f8) and the best five feature sets ((f3,f4,f7), (f3,f4,f5), (f3,f4,f8), (f2,f4,f6,f7), (f4,f6,f8)). More precisely, the all-feature set ranks 12th among all possible feature combinations (255 combinations). According to the results, a number of conclusions can be made. Firstly, V-GRE outperformed GRE for both the all-feature set and the best five feature sets at all top-t ranks. For the top-1 rank of the all-feature case, V-GRE achieved an accuracy of 90.93%±0.50 while GRE gained 84.14%±0.19. For higher ranks, V-GRE still outperformed GRE even though the gap becomes smaller, e.g., at rank 10 V-GRE gains 97.90%±0.26 while GRE gains 97.25%±0.17. V-GRE outperforms GRE with a gap of 6.79 (90.93%-84.14%) for the top-1 rank; this gap is very small for the top-10 rank, i.e., 0.01 (97.26%-97.25%). In the case of the best feature set (f3,f4,f7), V-GRE achieves up to 93.93%±0.22 and 98.85%±0.15 accuracy for the top-1 and top-10 ranks, respectively, while GRE obtains 84.15%±0.64 and 97.24%±0.27, respectively. The result indicates that V-GRE is superior to GRE with gaps of 9.78 and 1.61 for the top-1 and top-10 ranks, respectively.


Secondly, V-GRE obtains higher accuracy than the record-based naïve Bayesian method (R-NB) in most cases. GRE, however, may not be superior to R-NB at the top-1 rank, although it outperforms R-NB at the top-2 rank. Thirdly, the proposed V-GRE and GRE can find the correct unknown words within the top-10 rank with a relatively high accuracy of 97%-98%.

Table 6-11: Accuracy comparison among GRE, V-GRE and a naïve Bayes classifier (R-NB). Here, h is set to nine. (source: TeCho et al., 2009b)

[Table 6-11 reports, for each of the six feature sets ((f3,f4,f7), (f3,f4,f5), (f3,f4,f8), (f2,f4,f6,f7), (f4,f6,f8) and (f1-f8)), the accuracy (mean ± standard deviation, %) of GRE, V-GRE and R-NB at the top-t ranks t = 1, ..., 10; the key values are quoted in the text above, and the full table is given in TeCho et al. (2009b).]

Conclusions

Section 6.3 presented an automated method to recognize unknown words in Thai running text. We described how to map the problem to a classification task. A naïve Bayes classifier with a smoothing technique is investigated using eight features: the number of TCCs, the number of known words, the string length, the variety of left and right TCCs, the probability of special characters occurring on the left and right, the number of documents found, the term frequency, and the TF-IDF score. In practice, the unknown word candidates have relationships among them. To reduce the complexity of unknown word boundary identification, reduction approaches are employed to decrease the number of generated unknown word candidates by approximately 49%. This section also presented the group-based ranking evaluation (GRE) technique, which treats the unknown word candidates as groups and thereby alleviates the unbalanced dataset problem. To further improve prediction, we apply a boosting technique with voting under group-based ranking evaluation (V-GRE). We conducted a set of experiments on real-world data to evaluate the performance of the proposed approach. From the experimental results, the proposed technique achieves accuracies of 90.93%±0.50 and 97.90%±0.26 at the first and tenth ranks, respectively. The proposed ensemble method achieves an increase in classification accuracy of 6.79% and 8.45% at the first rank when compared to the group-based ranking evaluation (GRE) technique and the ordinary record-based evaluation, respectively. More details can be found in (TeCho et al., 2009b).
