Automatic Classification of Swedish Email Messages

17th August 2005

A thesis submitted for the degree of Bachelor of Arts in Computational Linguistics by

Kathrin Eichler
[email protected]
Seminar für Sprachwissenschaft
Eberhard-Karls-Universität Tübingen

in connection with the seminar Machine Learning in Computational Linguistics taught by Sandra Kübler

Contents

1 Introduction
  1.1 Task
  1.2 Algorithms
    1.2.1 Naive Bayes
    1.2.2 Support Vector Machines
    1.2.3 k-nearest Neighbor
    1.2.4 Rule-learning
    1.2.5 Boosting
  1.3 Evaluation measures
  1.4 Data
  1.5 Related work

2 Preparatory steps
  2.1 Data preprocessing
  2.2 Identification of categories
  2.3 Feature extraction

3 First Results

4 Feature reduction
  4.1 Lower-case conversion
  4.2 Feature merging
  4.3 Dimensionality reduction
    4.3.1 Stop words
    4.3.2 Low-frequency words

5 Balancing the corpus

6 Term weighting

7 Definition of a threshold

8 Summary and future work

References

Appendix: Implementing an email classification system

1 Introduction

In the last decade, the importance of email as a means of communication has increased dramatically, and, due to this development, the need to organize these emails has grown as well. For companies receiving huge numbers of emails to a single email address, e.g. the address of the company's customer service, it is particularly important to make sure that all emails are responded to within a reasonable amount of time, by someone with the knowledge required to reply to the email. In order to meet this challenge, automatic categorization of all incoming emails can be extremely useful because it can help route an email to the right person.

In this paper, I will investigate how well machine learning techniques can be applied to the task of email categorization (or email classification), conducting my experiments on a relatively small corpus of Swedish email messages addressed to the customer service of a Swedish home furnishing retailer. The idea is to find out whether this classification can be used to automatically route incoming emails to the person in charge.

The paper is organized as follows: The remainder of this section introduces the task, algorithms commonly used for similar tasks, the measures applied in evaluating classification effectiveness, the data on which my experiments will be based, and literature related to this project. Section 2 describes the steps necessary to prepare the experiments and section 3 presents the results of the first experiments. In section 4, several methods to reduce the number of features are investigated, section 5 investigates corpus balancing, section 6 presents the results when using continuous instead of binary feature weights, and section 7 shows how the definition of a threshold could affect the results. Section 8 summarizes the results and suggests future work. The appendix focuses on practical issues by describing the steps necessary when implementing an email classification system.

1.1 Task

The task of classifying emails is strongly related to the more general task of automatic text categorization (TC), which aims at categorizing text documents according to their topic by assigning tags from a pre-defined set of categories. The categorization decision is based on the documents' contents and usually involves some analysis of the words in the documents. In principle, there are several ways to do TC and hence several decisions to be made before starting (cf. Sebastiani[25], ch. 2): One important decision is whether to assign exactly one category to each document (single-label TC) or to allow any number of categories to be assigned to the same document (multi-label TC). A special case of single-label TC is binary TC, in which each document is assigned to either one of two categories (e.g. spam, non-spam). Another decision to make is whether to enforce a classification for every document or to classify only those that can be classified at a certain confidence level, i.e. with a probability above a certain threshold that needs to be determined in advance. A third alternative is to build the system in such a way that for each document d, it ranks the categories according to their appropriateness to d. Such a ranked list would then help a human expert to take the categorization decision.

When building a text classification system, two basic approaches are possible (cf. Sebastiani[25], ch. 4). In the 1980s, the most popular approach was to use knowledge engineering, i.e. to manually define a set of logical rules. However, this approach is costly because it involves enormous labor of not only the knowledge engineer but also a domain expert knowing which documents belong to which of the possible categories. Therefore, even though some real-world applications (e.g. Sophia by the Italian company Celi) still use human-encoded linguistic rules for text classification, machine learning has become the dominant approach for this task in the last 15 years. Here, a so-called learner automatically builds a classifier for each category by observing the characteristics of the documents that have been manually assigned this category by a domain expert. As manually tagged data in the form of pre-classified documents is provided by the domain expert, this kind of text classification is an example of supervised learning. Each new document is then classified under the category whose characteristics best match the characteristics of the document. Even though a knowledge engineer and a domain expert are still needed, their tasks have become easier. Instead of describing the concept of each category in words, the domain expert now only has to select instances of it. The knowledge engineer, who had to build a classifier by defining rules manually in the knowledge engineering approach, now has to build a system that learns rules, i.e. builds classifiers, automatically. Since these systems work domain-independently, they are also highly portable to different domains.

Using machine learning, the dominant approach for text classification tasks is to use the presence or absence of different words as so-called features, i.e. to classify each email based on the words that it does and does not contain. In the most common document representation in text categorization, each document is seen as a so-called bag-of-words, i.e. the set of all words appearing in the document; context and word order are ignored. Each document is then converted into a vector mapping every feature to a specific value. Feature values can be either binary (e.g. 1 if the feature is present, 0 otherwise) or continuous (e.g. a weight assigned to each feature based on the importance of the feature).
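As a concrete illustration of this representation (not taken from the thesis; the vocabulary and example sentence are invented), the following Java sketch turns a whitespace-tokenized text into a binary feature vector over a fixed vocabulary:

import java.util.*;

// Minimal sketch: binary bag-of-words representation of an email.
// The vocabulary and the example text are invented for illustration.
public class BagOfWords {
    // Returns a 0/1 vector: position i is 1 iff vocabulary word i occurs in the text.
    static int[] toBinaryVector(String text, List<String> vocabulary) {
        Set<String> tokens = new HashSet<>(
            Arrays.asList(text.toLowerCase().split("\\s+")));
        int[] vector = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            vector[i] = tokens.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("pris", "lager", "varuhus", "soffa");
        int[] v = toBinaryVector("Finns soffan i lager i ert varuhus", vocabulary);
        System.out.println(Arrays.toString(v)); // [0, 1, 1, 0]
    }
}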

1.2 Algorithms

Various methods have been used for text classification tasks, of which the most common ones, namely Naive Bayes, Support Vector Machines (SVM), k-nearest neighbor (kNN), rule-learning (in particular decision trees), and boosting, will be presented in the following. Another method that should be mentioned is based on neural networks (cf. Quinlan[20]), but will not be discussed further in this paper.

1.2.1 Naive Bayes

Bayesian classifiers build their classification models based on Bayes' theorem. For a training sample (here: an email) E, the classifier calculates for each category the probability that the email should be classified under C_i, where C_i is the i-th class, making use of the law of conditional probability:

  P(C_i | E) = \frac{P(C_i) \, P(E | C_i)}{P(E)}    (1)

Making the naive Bayes assumption that the probability of each word in a document is independent of the word's context and position in the document (even though this assumption is often violated in practice, Bayesian classifiers still perform quite well; cf. Domingos & Pazzani[7] for discussion), P(E | C_i) can be calculated by multiplying the probabilities of each individual word w_j appearing in the category (w_j being the j-th of l words in the email):

  P(E | C_i) = \prod_{1 \le j \le l} P(w_j | C_i)    (2)

Equation 1 can then be reformulated as:

  P(C_i | E) = \frac{P(C_i)}{P(E)} \prod_{1 \le j \le l} P(w_j | C_i)    (3)

The category maximizing P(C_i | E) is predicted by the classifier. Naive Bayes has been successfully applied by several authors (e.g. by Baker & McCallum[2], Domingos & Pazzani[7], Joachims[11], Koller & Sahami[14], and Lewis & Ringuette[15]) but has also been claimed to be inferior to other methods such as SVM, C4.5, and kNN algorithms (cf. Joachims[12]).
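A minimal sketch of this decision rule (not part of the thesis; the smoothing value for unseen words is an arbitrary illustrative choice) might look as follows. The constant denominator P(E) is dropped and log probabilities are summed instead of multiplying probabilities, which avoids numerical underflow but selects the same category:

import java.util.*;

// Sketch of the Naive Bayes decision rule from equations (1)-(3).
// P(E) is constant across classes and can be dropped; logs avoid underflow.
public class NaiveBayesSketch {
    // Both maps are assumed to have been filled from the training data.
    Map<String, Double> classPrior = new HashMap<>();            // P(C_i)
    Map<String, Map<String, Double>> wordProb = new HashMap<>(); // P(w_j | C_i)

    String classify(List<String> emailWords) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : classPrior.keySet()) {
            double score = Math.log(classPrior.get(c));
            for (String w : emailWords) {
                // Unseen words get a small smoothing probability (illustrative choice).
                double p = wordProb.get(c).getOrDefault(w, 1e-6);
                score += Math.log(p);
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best; // the category maximizing P(C_i | E)
    }
}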

1.2.2 Support Vector Machines

The SVM method has been used by Joachims[12] and is based on the idea that, in n-dimensional space, the positive training examples may be separated from the negative ones by a surface of dimension n - 1 (cf. Sebastiani[25], ch. 6). If this is possible, the positives and negatives are called linearly separable. The following graph illustrates the idea for n = 2 (i.e. the two-dimensional case), where the surface is of dimension 1 (i.e. a line):

(Figure: positive (+) and negative (o) training examples in two dimensions, separated by a set of parallel lines l1, lm, l2.)

Of all possible lines, the SVM method then chooses the middle element lm of that set s of parallel lines in which the maximum distance between two lines l1, l2 ∈ s is largest.

1.2.3 k-nearest Neighbor

kNN (e.g. Yang[29], Guo et al.[10]) is a similarity-based approach, which transforms the documents in the training data into feature vectors and computes their similarity. To classify a new document, the k most similar training examples (i.e. the k nearest neighbors) are retrieved and the classification decision is made based on majority voting. Classification effectiveness is strongly dependent on the value chosen for k.
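A minimal sketch of this procedure (not from the thesis; cosine similarity is used here as one common choice of similarity measure):

import java.util.*;

// Sketch of k-nearest-neighbor classification over feature vectors:
// retrieve the k most similar training examples and take a majority vote.
public class KnnSketch {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static String classify(double[] query, double[][] train, String[] labels, int k) {
        // Sort training indices by decreasing similarity to the query document.
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (i, j) -> Double.compare(cosine(query, train[j]), cosine(query, train[i])));
        // Majority vote among the k nearest neighbors.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < idx.length; i++)
            votes.merge(labels[idx[i]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}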

1.2.4 Rule-learning

A commonly used rule-learning algorithm, C4.5, is based on decision trees (cf. Quinlan[20]):

(Figure: a small decision tree with root node w1, inner nodes w2 to w5 reached via yes/no branches, and leaves c1 and c2.)

Here, a document is classified by starting at the root node of the tree (here: w1 for word 1) and moving through it until a leaf (here: c1 or c2 for class 1 or 2, respectively) is reached. At each non-leaf node, the outcome is determined (here: y if the word is contained in the document, n if not) and the classifier moves on to the node corresponding to the outcome. When a leaf is reached, the document is classified according to the class at the leaf.

A similar approach applied by Cohen[6] is the rule-learning algorithm RIPPER. Instead of building a decision tree, RIPPER starts with an empty ruleset and keeps adding rules until all positive examples are covered. For each rule, the algorithm starts with an empty antecedent and adds conditions until no negative examples are covered.

1.2.5 Boosting

Boosting algorithms combine the individual judgements of several classifiers (so-called weak learners). However, unlike other approaches combining the outcome of several classifiers, e.g. majority voting, the classifiers are trained sequentially. In this way, when training a classifier, the system can concentrate on the training examples which the previous classifiers have performed worst on (cf. Sebastiani[25], ch. 6). One boosting algorithm that has been used for classification tasks (e.g. by Michelakis et al.[17]) is the LogitBoost algorithm introduced by Friedman et al.[8]. LogitBoost is based on regression stumps, a form of decision trees of unary depth (Androutsopoulos et al.[1]), as weak learners.
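A minimal sketch of how a boosted classifier of this kind can be set up with the Weka toolkit used later in this paper (section 3); the ARFF file name is hypothetical, the number of boosting iterations is an arbitrary illustrative choice, and the decision stump corresponds to the one-level trees mentioned above:

import weka.classifiers.meta.LogitBoost;
import weka.classifiers.trees.DecisionStump;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: LogitBoost with decision stumps as weak learners, trained on an ARFF file.
public class LogitBoostSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("emails.arff");      // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);          // last attribute = class
        LogitBoost lb = new LogitBoost();
        lb.setClassifier(new DecisionStump()); // weak learner: a one-level decision tree
        lb.setNumIterations(10);               // number of boosting rounds (illustrative)
        lb.buildClassifier(data);
    }
}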

1.3 Evaluation measures

Classification effectiveness is usually measured in terms of accuracy, precision, recall, and F-measure (cf. Sebastiani[25], ch. 7). These measures can be calculated using a so-called contingency table (Table 1).

  category c_i       correct: YES    correct: NO
  predicted: YES     TP_i            FP_i
  predicted: NO      FN_i            TN_i

  TP = true positives, TN = true negatives, FP = false positives, FN = false negatives

Table 1: Contingency table

Accuracy is calculated by dividing the number of correctly classified instances (here: emails) by the total number of instances, i.e. \frac{TP+TN}{TP+TN+FP+FN}. Precision is defined as the probability that if a random document is classified under c_i, this decision is correct. Precision is calculated by dividing the number of true positives, i.e. those instances correctly classified under c_i, by the number of all instances classified under c_i, i.e. \frac{TP}{TP+FP}. Analogously, recall is the probability that a random instance that would be correctly classified under c_i is assigned to that class. Recall is calculated by dividing the number of true positives by the number of those instances that would have been correctly classified under c_i, i.e. \frac{TP}{TP+FN}. As neither precision nor recall makes sense in isolation from the other (high levels of precision can be obtained at the price of low levels of recall and vice versa), both measures should be combined. A common measure used for this purpose is the so-called F-measure or F_\beta function:

  F_\beta = \frac{(\beta^2 + 1) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}

If β = 1, the function attributes equal importance to precision and recall. If β > 1, precision is stressed; analogously, if β < 1, recall is attributed greater importance. For the experiments described in the following sections, a β value of 1 was used.
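A minimal sketch of these measures in code (the counts in the example are invented for illustration):

// Sketch: accuracy, precision, recall and F_beta from the contingency table counts.
public class EvalMeasures {
    static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
    static double precision(int tp, int fp) { return (double) tp / (tp + fp); }
    static double recall(int tp, int fn)    { return (double) tp / (tp + fn); }
    static double fMeasure(double beta, double p, double r) {
        return (beta * beta + 1) * p * r / (beta * beta * p + r);
    }

    public static void main(String[] args) {
        // Invented counts for illustration only.
        int tp = 67, fp = 11, fn = 2, tn = 25;
        double p = precision(tp, fp), r = recall(tp, fn);
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", p, r, fMeasure(1.0, p, r));
    }
}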

1.4 Data

The following experiments are based on a corpus of 115 email messages sent to the customer service of a large Swedish home furnishing retailer via a contact form on the company's website. (The original corpus contained 132 messages, of which 17 were follow-up emails. These were taken out of the corpus because follow-up emails are always directed to a specific person at the customer service and are thus not relevant for the task of automatically routing all emails sent to the customer service's email address.) The messages range in size from a single sentence to several paragraphs and mainly contain questions about the products offered by the company (e.g. color, size, weight, or price of the product), the stock available at a particular store, and customer complaints.

Since this email corpus is rather small compared to corpora usually used for text classification tasks, e.g. the Reuters collection (used by e.g. Calvo & Ceccatto[5], Guo et al.[10], Joachims[12]), the expectation is that the results will in no way be comparable to results of projects where a lot more training data was available. However, the project can still be of great use because even a small gain in efficiency brought by the automatic routing of email may already speed up the processing of the incoming email.

1.5 Related work

To my knowledge, this is the first attempt to apply machine learning techniques to the task of classifying email messages written in Swedish. The literature used for this paper is thus mainly on text classification experiments carried out on English data.

An interesting paper dealing with text classification is by Joachims[12]. He compares SVM to four other common methods, namely Naive Bayes, Rocchio (cf. Rocchio[23]), kNN and C4.5. Features are selected using the information gain criterion as presented by Yang & Pedersen[30], which measures the amount of information obtained by knowing the presence or absence of a word and ranks all words accordingly. The "best" features, i.e. those words with the highest information gain, are kept; the other features are removed. Joachims conducts his experiments on the Reuters corpus and reports that SVM shows the best results. Of the other methods, kNN outperforms the decision tree method and the Rocchio algorithm; Naive Bayes performs worse.

Another paper is by Baker & McCallum[2], who investigate how document classification can benefit from word clustering, i.e. forming feature groups by joining similar words. They use a Bayesian classifier and base their experiments on three real-world corpora, namely the 20 Newsgroups data set (UseNet articles), the Reuters-21578 data set (newswire stories), and the Yahoo! data set (web pages). Comparing distributional clustering (a clustering method measuring the similarity between two words as the similarity between their class distributions) to other feature clustering and feature selection algorithms, they report significantly higher accuracy, losing only about 2% while reducing feature dimensionality by three orders of magnitude. On the Yahoo! data set, distributional clustering even increases classification accuracy.

Most relevant to my work, however, are experiments conducted on email corpora. What is special about emails compared to other text documents is that they are typically short and contain a limited amount of structure due to the division into a body and several headers. Several authors have investigated email classification; many of them, however, concentrate on binary classification, where emails are classified under either one of two categories, e.g. spam filtering. A paper dealing with two-class problems is by Kiritchenko & Matwin[13], who explore co-training, i.e. the use of unlabeled data to boost classification performance. Using 1500 emails grouped into three folders, they form three two-class problems by defining the emails from each group in turn as positive examples and the emails from the other two groups as negative ones. To build the feature list, stop words and words appearing in only one email are removed, and the remaining words are stemmed. For each feature, the frequency of the word in the email is used as the feature value. Using implementations of Naive Bayes and SVM, they report that SVM benefits from co-training whereas the performance of the Naive Bayesian classifier degrades.

Experiments on email classification using more than two classes have been described by, for example, Provost[19]. He compares a Naive Bayesian algorithm against RIPPER on a data set of 2051 emails hand-sorted into eight different folders. The most important steps he applies in selecting features are to tokenize the messages, convert all words to a single case, and remove high-frequency content words as well as single-character words; stemming is not performed. In his experiments, the Bayesian classifier outperforms RIPPER, achieving 80% accuracy after 175 examples and 87% classification accuracy after 400 training examples.

In another interesting paper, Rennie[21] presents a tool called ifile, which applies a Naive Bayesian algorithm to the task of what Rennie calls mail filtering, i.e. the categorization of incoming email into different folders. The number of features was reduced by tracking word frequency statistics along with the word's age (i.e. the number of emails added since the first appearance of the word) and discarding old, infrequent words. Conducting his experiments on his own email corpus with 4447 messages and 33 folders, he reports an accuracy of around 86%.

A similar tool, MailCat (Segal & Kephart[26]), constructs an email classifier by analyzing the user's existing folders and then uses this classifier to predict the three most likely destination folders. Unlike ifile, it is thus designed as a categorization aid for users with many folders rather than a tool that categorizes an email directly into a specific folder. The algorithm used is similarity-based and classifies a new message by computing the similarity between the message's word-frequency vector and the weighted folder vectors. The obtained accuracy for this task ranges between 80% and 90% for six different users.

Summarizing the results reported in the papers on email classification tasks, the accuracy achieved is usually between 80% and 90%. However, these results are hardly comparable to the results expected for the task at hand, due to the differences in amount of training data, number of categories, email contents, and type of categorization.

2 Preparatory steps

2.1 Data preprocessing

In the original corpus, the actual messages were embedded in all kinds of header information, e.g. name and address of the sender. Although information about the email sender may be useful for other email classification tasks, it is likely to harm performance in this case because most senders will only occur once in the corpus. It was therefore removed before processing the data. In addition, the file containing the message usually also contained the reply sent to the customer by the customer service representative. This reply was deleted as well because it will not be available once the system has been put into use and would make the results less representative. After removing these additional data, the email bodies were saved in a database to facilitate further processing. When classifying real-world email, this pre-processing will not be necessary because the email bodies can be extracted automatically, e.g. using the JavaMail API (cf. http://java.sun.com/products/javamail/javadocs), which provides a framework to build email applications.
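A minimal sketch of this kind of extraction with the JavaMail API (the file name is hypothetical and multipart handling is reduced to the simplest case; a production system would also deal with other MIME types and character encodings):

import javax.mail.BodyPart;
import javax.mail.Multipart;
import javax.mail.Session;
import javax.mail.internet.MimeMessage;
import java.io.FileInputStream;
import java.util.Properties;

// Sketch: extracting the plain-text body of an email with the JavaMail API.
public class ExtractBody {
    public static void main(String[] args) throws Exception {
        Session session = Session.getDefaultInstance(new Properties());
        MimeMessage msg = new MimeMessage(session, new FileInputStream("message.eml"));
        Object content = msg.getContent();
        if (content instanceof String) {               // simple text/plain message
            System.out.println(content);
        } else if (content instanceof Multipart) {     // pick the first text/plain part
            Multipart mp = (Multipart) content;
            for (int i = 0; i < mp.getCount(); i++) {
                BodyPart part = mp.getBodyPart(i);
                if (part.isMimeType("text/plain")) {
                    System.out.println(part.getContent());
                    break;
                }
            }
        }
    }
}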

2.2 Identification of categories

Since this project is a preliminary study and not directly connected to a current project with the furnishing company, I could not make use of categories provided by the company but had to come up with appropriate categories myself. Having examined all 115 emails, I decided on the following three categories and assignment criteria:

  sortiment      questions about the company's products, e.g. price, weight, size, packing, and delivery time
  lager          questions concerning the stock
  reklamation    customer complaints

These categories were chosen for several reasons. First, they allow a very straightforward manual annotation because most of the emails can be classified as belonging to precisely one of the possible categories. Second, they seem to be in line with the typical departments of a company. For this company, one might for example imagine a product department, which can answer questions concerning the company's products, a purchasing department for questions concerning the products in stock, and a complaints department, which can deal with customer complaints. Moreover, three seems to be a reasonable number of categories for the restricted amount of data available. Given the criteria specified above, 105 of the 115 emails can be classified as belonging to exactly one of the categories:

  category       # of emails
  sortiment      69
  lager          20
  reklamation    16

For the remaining 10 emails, it is impossible to assign exactly one category, even for a human annotator, either because the email content would fit into more than one category or because none of the categories fits. My first thought was to add a fourth category that would collect all those emails that do not fit in anywhere else. However, I discarded the idea because the contents of the 10 affected emails are much too diverse to be merged into a single category. Such a category would not be reasonable; it would only confuse the classifier and make its results look worse than they really are. Instead, I have decided to take the 10 emails out of the training data. This will presumably make the results of my experiments slightly better than if I used all emails for training. I will, however, deal with these 10 emails in section 7. Given the three categories above, the baseline accuracy is 65.71%, which is the accuracy obtained when assigning the most probable category sortiment to all emails.

2.3 Feature extraction

I adopted the most common document representation in text categorization, where each document is represented as the set of all words appearing in the document. In a first attempt, I use all words (i.e. tokens separated by blank space) that appear in the email corpus as features. Basic punctuation marks, i.e. [,.!?;:"()], are removed beforehand. This results in 1603 features, a number I will later reduce using feature selection methods. To build the training corpus, each email is transformed into a list of feature values by setting, for each word in the feature list, the feature value (or weight) to 1 if the email contains the word, and to 0 otherwise. (These binary weights will be converted into continuous weights in section 6.)

The training data on which the classification model will be built is provided in the attribute-relation file format (ARFF), a file format commonly used for machine learning tasks. ARFF files consist of two sections, a header section and a data section. The header section contains the name of the relation (an arbitrary name) and a list of all features (also called attributes) together with their types (i.e. the possible values they can hold). The following lines are extracted from the header section used in one of the experiments described in section 4:

@relation xxx
@attribute 1 {0, 1}
@attribute 2 {0, 1}
@attribute 3 {0, 1}
@attribute 3-sits {0, 1}
...
@attribute är {0, 1}
@attribute åka {0, 1}
@attribute år {0, 1}
@attribute överdrag {0, 1}
@attribute class {sortiment, lager, reklamation}

The data section contains a line for each instance (here: email) in the training data, consisting of the values of all features for the respective instance, separated by commas. In the following lines, extracted from the data section used in one of the experiments above, the value 1 at position n indicates that the n-th feature was found in the email, 0 indicates that it was not found. The last feature represents the class (or category) that the instance has been assigned to.

@data
...
0,1,0,1,(...),0,0,0,0,lager
0,0,0,0,(...),1,0,0,0,sortiment
0,1,0,1,(...),1,1,0,0,reklamation
...

In the above example, the sequence of feature values in the third row represents an email of category reklamation that contains, among others, the words 2, 3-sits, är, and åka, but does not contain 1, 3, år, or överdrag.
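A minimal sketch of how such a file can be generated (not part of the thesis; the vocabulary, example emails, and file name are invented for illustration):

import java.io.PrintWriter;
import java.util.*;

// Sketch: writing an ARFF file with one binary attribute per vocabulary word
// plus a nominal class attribute, as described above.
public class ArffWriter {
    public static void main(String[] args) throws Exception {
        List<String> vocabulary = Arrays.asList("pris", "lager", "varuhus"); // illustrative
        String[][] emails = { {"finns den i lager", "lager"},
                              {"vad kostar soffan, pris?", "sortiment"} };   // (text, class)
        try (PrintWriter out = new PrintWriter("emails.arff", "UTF-8")) {
            out.println("@relation emails");
            for (String w : vocabulary) out.println("@attribute " + w + " {0, 1}");
            out.println("@attribute class {sortiment, lager, reklamation}");
            out.println("@data");
            for (String[] e : emails) {
                Set<String> tokens = new HashSet<>(
                    Arrays.asList(e[0].toLowerCase().replaceAll("[,.!?;:\"()]", "").split("\\s+")));
                StringBuilder row = new StringBuilder();
                for (String w : vocabulary) row.append(tokens.contains(w) ? "1," : "0,");
                out.println(row + e[1]);
            }
        }
    }
}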

3 First Results

For running my experiments, I used the Weka toolkit (cf. Witten & Frank[28]), an open-source tool providing a large collection of machine learning algorithms. To start with, I tried Weka implementations of the algorithms described in section 1.2, namely Naive Bayes, SMO (an SVM-based algorithm), IBk (a kNN algorithm), J48 (Weka's C4.5 implementation), JRip (Weka's implementation of the RIPPER algorithm), and LogitBoost.

Due to the small amount of data available, I used all 105 emails for training instead of splitting the corpus into a training set and a validation set, and estimated the performance of the trained models using leave-one-out (LOO) cross-validation (cf. Witten & Frank[28]). LOO cross-validation measures classification performance by averaging over the results of several runs. In the first run, the first instance (here: email) is left out, the learner is trained on the remaining instances and then tested on the instance that was left out. In the following runs, each instance in turn is left out, until, in the last run, all instances but the last one are used as training data and the last instance is used for testing. In the case at hand, LOO cross-validation consists of 105 runs (also called folds). Due to its high computational cost, LOO cross-validation is usually infeasible for large datasets but is very useful for small datasets because it provides the best possible estimate of performance. Table 2 shows the accuracy achieved applying the different algorithms.

  algorithm               correctly classified    accuracy
  LogitBoost (Boosting)   78                      74.29%
  SMO (SVM)               73                      69.52%
  J48 (C4.5)              70                      66.67%
  Naive Bayes             68                      64.76%
  JRip (RIPPER)           65                      61.90%
  IBk (kNN)               62                      59.05%

Table 2: Comparing Weka algorithms

Feeding my data into the various algorithms, it turned out that LogitBoost worked significantly better than all the others, achieving 74.29% accuracy on the spot. Of the others, SMO, an SVM-based approach, performed best, reaching an accuracy of 69.52%. J48, Weka's implementation of the C4.5 algorithm, achieved 66.67%; all the others performed below the baseline level of 65.71%. A possible explanation for the superior performance of the LogitBoost algorithm is that it uses boosting, i.e. combines the individual judgements of several classifiers. Boosting algorithms are rather slow compared to individual classifiers but still fast enough for the small amount of data I am using. I thus decided to use LogitBoost for the following experiments.
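A minimal sketch of how the leave-one-out evaluation described above can be run with the Weka API (the ARFF file name is hypothetical; setting the number of folds to the number of instances is what makes the cross-validation leave-one-out):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.LogitBoost;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: leave-one-out cross-validation with Weka
// (number of folds = number of instances).
public class LooEvaluation {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("emails.arff");   // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);
        LogitBoost classifier = new LogitBoost();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, data.numInstances(), new Random(1));
        System.out.printf("Accuracy: %.2f%%%n", eval.pctCorrect());
        System.out.println(eval.toMatrixString()); // confusion matrix over the three classes
    }
}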

4 Feature reduction

Reducing the number of features not only reduces the computational complexity of the learning algorithm, but can even improve classification quality: first, because features may be too specific or too general and harm the classification rather than improve it; second, because subsuming several features into one can reduce the problem of data sparseness. One technique is to perform stemming, i.e. to reduce all inflected forms of a word to the stem. Even though it is commonly used in text classification tasks, stemming will not be applied in this paper because it has occasionally been shown to be ineffective or even to degrade performance (cf. Gaustad & Bouma[9]), and, as Swedish is a language with a relatively simple morphology, the expectation is that stemming would have little if any effect on performance for the data at hand. Other ways of reducing the number of features are to convert all words to lower case, to merge several features into one, and to remove stop words or words with a low frequency. All of these methods will be applied in the following.


4.1 Lower-case conversion

While in other languages, e.g. German, converting all words to lower case might actually decrease performance by removing potentially crucial information about the word class (e.g. "Buche" (beech) vs. "buche" ((to) book, 1st-sg.)), this is unlikely to happen with Swedish data because capitalization is only used sentence-initially and for proper nouns. The conversion to lower case turned out to be a very good idea for the data at hand. It decreased the number of features from 1603 to 1436 and increased accuracy from 74.29% to 79.05%. This improvement is plausible if we consider that, due to the small amount of data, data sparseness plays a significant role. Table 3 shows the results after lower-case conversion, broken down into the three categories:

  class          precision    recall    F-measure
  sortiment      75.8%        100%      0.862
  lager          100%         30%       0.462
  reklamation    100%         50%       0.667

                         predicted
  actual         sortiment    lager    reklamation
  sortiment      69           0        0
  lager          14           6        0
  reklamation    8            0        8

Table 3: Lower-case conversion

The results show that the algorithm is biased towards the most probable class, namely sortiment, and reaches 100% recall for it. The precision score for this category is also fairly good, resulting in a comparably high F-measure score. For the other two categories, the precision score is 100%, but the recall score is fairly bad, especially for the lager class.

4.2 Feature merging

Taking a closer look at the emails of the lager class, a potential difficulty arises: Since most of the emails contain questions asking whether a certain product is available at a certain store location, a considerable part of the words in the emails constitutes store location names. Even though there exist only about a dozen store locations of the company in Sweden, most of them occur only once or twice in the data. So even though they would probably be good evidence of the email belonging to the lager class, the algorithm has no chance of finding any regularities given the small number of emails. My idea was thus to group all store location names into the single feature STORE. This is similar to clustering (cf. Baker & McCallum[2]) in that it groups several words into a single feature, the difference being that the features to be grouped are specified in advance rather than clustered automatically. The list of store locations was extracted from the company's web page. It contains ten single-token town names in addition to four location names consisting of the name of the closest city followed by a more specific description of the place, e.g. the name of a smaller town close to the store. In the latter case, all tokens were considered store locations individually because, in the training data, the individual tokens are sometimes used instead of the whole store location name, and none of them clashes with any of the other features. Merging all store locations into a single feature involves the following steps:

1. deletion of all store location names from the list of features
2. insertion of the feature STORE
3. replacement of all store locations in the emails by the variable STORE

When creating the training data, the value of the feature STORE is then set to 1 for each email that contains at least one store name. Table 4 shows the results after introducing the variable STORE:

  class          precision    recall    F-measure
  sortiment      78.8%        97.1%     0.87
  lager          83.3%        50%       0.625
  reklamation    100%         50%       0.667

                         predicted
  actual         sortiment    lager    reklamation
  sortiment      67           2        0
  lager          10           10       0
  reklamation    8            0        8

Table 4: Introducing the variable STORE

Due to the variable, the number of features is reduced to 1422, and the number of correctly classified instances rises from 83 to 85 (79.05% to 80.95%). Even more striking is the recall score for the class lager. Before introducing the variable, the algorithm had classified only 6 of the 20 lager emails as belonging to the lager class; afterwards, this number rises to 10. Even though a recall score of 50% is still quite low, the results show that the variable improves the score significantly, the reason being that it addresses the sparse data problem by merging several semantically similar low-frequency words into a single feature.
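A minimal sketch of the replacement step (the store names below are invented placeholders, not the company's actual store locations):

import java.util.*;

// Sketch: replacing every store location name in an email by the single token STORE
// before the feature list is built. The location names below are invented placeholders.
public class StoreMerger {
    static final Set<String> STORE_NAMES = new HashSet<>(
        Arrays.asList("stockholm", "goteborg", "malmo"));   // placeholders only

    static String mergeStores(String email) {
        StringBuilder out = new StringBuilder();
        for (String token : email.toLowerCase().split("\\s+")) {
            out.append(STORE_NAMES.contains(token) ? "STORE" : token).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(mergeStores("Finns soffan i lager i Malmo"));
        // -> "finns soffan i lager i STORE"
    }
}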

4.3 Dimensionality reduction

Since 1422 features are still quite many, the question is whether a considerably smaller number of features could reach the same or even better performance. The basic idea here is to eliminate all those words that are not indicative of any of the categories. Common techniques are stop word filtering and filtering of low-frequency words; both will be investigated in the following.

4.3.1 Stop words

One possible approach is to make use of a stop word list and remove all those terms from the list of features that appear in the list of stop words. For this purpose, I downloaded a list of Swedish stop words from the internet (http://snowball.tartarus.org/algorithms/swedish/stop.txt), which contained 114 Swedish function words. Unfortunately, deleting all stop words in the list from my feature list (leaving 1329 features) decreased the scores, resulting in only 82 correctly classified emails compared to 85 before. Assuming that this outcome is not due to chance, the only possible explanation is that the learner actually makes use of stop words when classifying the emails. This is not as implausible as it may sound if we consider that in the corpus used for training, many emails belonging to the same category contain questions (and sentences) formulated in a similar way, i.e. making use of identical words, including function words. (Riloff[22] showed that function words can provide important clues about the information surrounding them. Since my list of features contains individual words, my results cannot be explained in the same way, but rather suggest that function words can also represent concepts on their own.)

One example of a function word being indicative is the relative pronoun som. It appears very commonly in emails of the sortiment class, e.g. in sentences like:

  För en tid sedan köpte vi en pall som man kan klä om själv.
  (A while ago, we bought a stool that you can reupholster yourself.)

Other examples as well show that som is often used to describe a product that one has bought or wants to buy and asks information about. Som is also contained in some reklamation emails but, surprisingly, although it appears in a total of 44 out of the 105 emails, it is contained in none of the emails of category lager. This may be explained by looking at a typical lager email:

  Jag undrar när [produktnamn] finns i lager på något av Göteborgs varuhus.
  (I wonder when you will have [product name] in stock at one of your stores in Gothenburg.)

The example suggests that people asking whether a certain product is available at a certain store usually know what the product is called and do not describe it using relative clauses. Of course, this is only a vague attempt at explaining the fact that som does not appear in any lager email in the corpus at hand. In fact, chances are that we would find som in some email of that category if we looked at more data. It does, however, show that stop words can be useful in finding regularities across categories. As stop word filtering decreased the performance of the learner, it was not applied in the following experiments.

4.3.2 Low-frequency words

Error analysis shows that a potential problem for the learner is the large amount of uninformative data (e.g. person names) and noise (e.g. misspelled words). Assuming that these words tend to have a low frequency in the corpus, a promising approach seems to be to use a feature selection method based on word count, i.e. to eliminate all those words that appear only very few times in the corpus. This variant of term space reduction has been applied by several authors, e.g. Bekkerman et al.[4] and Baker & McCallum[2]. Table 5 shows the results after deleting words occurring at most x times:

  x    # of features    correctly classified    accuracy
  0    1422             85                      80.95%
  1    491              85                      80.95%
  2    319              86                      81.90%
  3    220              89                      84.76%
  4    171              86                      81.90%
  5    130              77                      73.33%

Table 5: Deletion of low-frequency words

As we can see, deleting only those features that occur exactly once does not affect the total score of correctly classified emails but reduces the number of features by a factor of about 3. Extending the list of words to be deleted with those words that occur twice in the data brings even better results, namely 86 correctly classified emails. The highest number of correctly classified emails is reached when eliminating all words from the feature list that occur at most three times. Increasing the threshold to four occurrences affects the score slightly negatively; using an x value of five decreases the performance dramatically. Table 6 shows the detailed results after deleting all words that occur up to three times.

  class          precision    recall    F-measure
  sortiment      85.9%        97.1%     0.912
  lager          81.3%        65%       0.722
  reklamation    81.8%        56.3%     0.667

                         predicted
  actual         sortiment    lager    reklamation
  sortiment      67           1        1
  lager          6            13       1
  reklamation    5            2        9

Table 6: Deletion of words occurring up to 3 times

The results show that the amount of available training data affects the performance tremendously. The class sortiment, which 69 of the 105 emails are assigned to, achieves an F-measure of 0.912, a score that is far higher than the scores of the other two classes. The high recall score for the sortiment class can be attributed to the fact that the algorithm tends to assign the most probable tag in case of doubt. However, this does not affect the precision score, which is also significantly higher for the sortiment class than for the other two classes. This suggests that classification performance could be improved considerably using, for each class, as much training data as was available for the sortiment class, i.e. around 70 emails.
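A minimal sketch of this word-count-based selection (not from the thesis; the code keeps only words occurring more than maxCount times in the whole corpus):

import java.util.*;

// Sketch: frequency-based feature selection -- keep only words that occur
// more than maxCount times in the whole corpus.
public class FrequencyFilter {
    static List<String> selectFeatures(List<List<String>> tokenizedEmails, int maxCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> email : tokenizedEmails)
            for (String w : email)
                counts.merge(w, 1, Integer::sum);
        List<String> features = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() > maxCount)   // e.g. maxCount = 3 keeps words occurring 4+ times
                features.add(e.getKey());
        return features;
    }
}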

5 Balancing the corpus

As can be seen in Table 6, the algorithm is clearly biased towards the most probable class sortiment, resulting in a recall score of 97.1% compared to 65% for the lager class and 56.3% for the reklamation class. This result may be due to the imbalanced data set used for training, i.e. the fact that the training examples were not equally distributed among the three categories. Several attempts at dealing with the class imbalance problem have been proposed (cf. Batista et al.[3]), one of them being random over-sampling, which balances the class distribution by randomly replicating minority class examples.

To investigate the effect of over-sampling on the data at hand, I split the corpus into two data sets: 80% (84 emails) were used for training, the remaining 20% (21 emails) served as test data. (The data split was necessary because leave-one-out cross-validation, as used in the previous experiments, could no longer be applied. As over-sampling leads to identical training examples appearing several times, leave-one-out cross-validation would have caused a situation in which the system is asked to classify an example that it has already seen in the training data. As this is unlikely to happen with real-world data, the results would have been optimistic and not representative.) As the test corpus should reflect the class distribution in the training data, it contains 14 emails of the sortiment class, 4 lager emails, and 3 of category reklamation. Table 7 shows the results for different degrees of over-sampling the training data. n_i represents the number of times the training examples of category i are presented to the learner. The percentage in parentheses indicates the resulting fraction of training examples of the respective category.

  Experiment #   n_sortiment    n_lager       n_reklamation    accuracy
  1              1 (66.67%)     1 (19.05%)    1 (14.29%)       85.71%
  2              1 (45.16%)     2 (25.81%)    3 (29.03%)       85.71%
  3              1 (36.84%)     3 (31.58%)    4 (31.58%)       76.19%
  4              2 (33.73%)     7 (33.73%)    9 (32.53%)       90.48%

Table 7: Over-sampling

The results show that over-sampling can affect the performance of the classifier, though in an unclear way. While doubling the lager emails and tripling the reklamation emails (experiment 2) results in the same accuracy score that was achieved when all training examples were used only once (the baseline), accuracy decreases dramatically when lager emails are used three times and reklamation emails are presented four times (experiment 3). This result is particularly surprising looking at the accuracy achieved when presenting nearly equal fractions of training data for all categories (experiment 4). Even though the proportions have hardly changed compared to experiment 3, the accuracy score is much higher, exceeding the baseline score for this experiment. As this difference is likely due to the fact that the training corpus was even smaller than the one used in the experiments in the previous sections (due to the split into training data and test data) and is very sensitive to small changes in the training data, it seems impossible to draw any conclusions concerning the effectiveness of over-sampling for the data at hand.
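A minimal sketch of the over-sampling setup behind Table 7, where every training example of category i is simply presented n_i times (the data and counts below are invented for illustration):

import java.util.*;

// Sketch: over-sampling as in Table 7 -- every training example of category i
// is presented n_i times. String[] pairs of (emailText, label) keep it simple.
public class OverSampling {
    static List<String[]> overSample(List<String[]> trainingData, Map<String, Integer> timesPresented) {
        List<String[]> result = new ArrayList<>();
        for (String[] example : trainingData) {
            int n = timesPresented.getOrDefault(example[1], 1); // n_i from Table 7
            for (int i = 0; i < n; i++) result.add(example);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> n = new HashMap<>();
        n.put("sortiment", 1); n.put("lager", 2); n.put("reklamation", 3); // cf. experiment 2
        List<String[]> data = new ArrayList<>();
        data.add(new String[]{"finns den i lager i STORE", "lager"}); // invented example
        System.out.println(overSample(data, n).size()); // prints 2
    }
}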

6 Term weighting

Having investigated the performance of the LogitBoost algorithm with binary feature weights, I will now assign continuous weights in order to see whether this can improve the results further. As the results depend crucially on the choice of term-weighting system, I will apply several term-weighting formulas taken from a paper by Salton & Buckley[24] and compare their performance for the data at hand. Each term-weighting formula is built up as a combination of three components: term frequency (measuring the number of occurrences of a term (a word) in a document (here: an email)), collection frequency (measuring the number of occurrences of a term in all documents), and normalization. The possible component values are listed in Table 8. The symbols used are tf for term frequency, N for the total number of documents, and n for the number of documents containing the term.

  Term frequency component
  b    1.0                          binary weights (full weights)
  t    tf                           term frequency
  n    0.5 + (0.5 * tf) / max tf    normalized term frequency

  Collection frequency component
  x    1.0                          no change in weight
  f    log(N/n)                     inverse collection frequency factor

  Normalization component
  x    1.0                                      no normalization
  c    1 / sqrt(sum over vector of (tf_i)^2)    cosine normalization

Table 8: Term-weighting components

Table 9 lists the different weighting systems used in my experiments and the obtained results. The triple in the first column reveals the respective values of the three term-weighting components.

  system    document term weight                                           accuracy
  bxx       1                                                              84.76%
  bfx       log(N/n)                                                       84.76%
  nxx       0.5 + (0.5 * tf) / max tf                                      82.86%
  txc       tf / sqrt(sum over vector of (tf_i)^2)                         78.1%
  tfc       tf * log(N/n) / sqrt(sum over vector of (tf_i * log(N/n))^2)   74.29%

Table 9: Term-weighting

The results show that weighting does not improve performance. One of the weighting systems, namely bfx, performs comparably to the system using binary weights (bxx); all the others perform worse. The system performing worst is the one using the most complex term weight, namely the tfc system, which corresponds to what is usually referred to as cosine-normalized tf-idf. As most of the documents in my corpus are rather short, containing between one and ten sentences, these results are in line with the findings of Salton & Buckley[24] that, for short document vectors, it is better to use fully weighted terms and no normalization, i.e. binary weights.
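A minimal sketch of the tfc weight from Table 9 (term counts, document frequencies, and the corpus size in the example are invented for illustration):

import java.util.*;

// Sketch: the tfc weighting scheme from Table 9 --
// w_i = tf_i * log(N / n_i), followed by cosine normalization of the document vector.
public class TfcWeighting {
    static double[] tfcWeights(int[] tf, int[] df, int numDocuments) {
        double[] w = new double[tf.length];
        double norm = 0;
        for (int i = 0; i < tf.length; i++) {
            w[i] = tf[i] * Math.log((double) numDocuments / df[i]); // tf * idf
            norm += w[i] * w[i];
        }
        norm = Math.sqrt(norm);
        if (norm > 0)
            for (int i = 0; i < w.length; i++) w[i] /= norm;        // cosine normalization
        return w;
    }

    public static void main(String[] args) {
        // Invented counts: term frequencies in one email, document frequencies, corpus size.
        int[] tf = {2, 1, 0};
        int[] df = {10, 50, 30};
        System.out.println(Arrays.toString(tfcWeights(tf, df, 105)));
    }
}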

7 Definition of a threshold

Up to now, I have investigated the performance in the case where a classification decision was enforced for each email. An alternative path is to define a confidence threshold t such that an email e is classified under class c if and only if the probability of e belonging to c exceeds t. If the probability is below t, the email may either be classified manually or put into a pre-defined default category. As many classification errors are due to a low confidence level, defining a threshold can be useful to avoid misclassification. In the case at hand it might for example be reasonable to define the category sortiment as the default category because people working in the sortiment department might also be able to answer lager emails but not the other way around. If we then receive an email that contains lager as well as sortiment questions and that could be classified under either of them with a low confidence level, the system would decide on the sortiment class.

The threshold can be derived in two different ways (cf. Sebastiani[25], ch. 6): analytically (i.e. based on a theoretical method indicating how to compute the best threshold value) or experimentally. I decided on the experimental policy and tested different values for t in order to find out which of them would maximize effectiveness. Table 10 shows the results for threshold values of 50%, 55%, 60%, 65%, and 70%. For the data at hand and the features used at the end of section 4, a threshold value below 43.9% has no effect because all decisions are made with a confidence level of at least 43.9%.

  threshold t    accuracy            coverage
  < 43.9%        84.76%              100%
  50%            87% (+2.64%)        95.24% (-4.76%)
  55%            87.5% (+0.57%)      91.43% (-4.0%)
  60%            88.17% (+0.77%)     88.57% (-3.13%)
  65%            88.89% (+0.81%)     85.71% (-3.22%)
  70%            89.02% (+0.15%)     78.1% (-9.11%)

Table 10: Defining a threshold (data: 105 classifiable emails)

Evaluating the different threshold values as to accuracy and percentage of classifiable emails, there seems to be no clear winner. Several candidates may be best depending on whether one stresses accuracy (the fraction of correctly classified emails) or coverage (the fraction of classifiable instances). 50% may be a good choice because it brings a considerable improvement in accuracy (from 84.76% to 87%) and still classifies most of the emails (95.24%). A threshold of 70% may be too high because it brings little gain in accuracy (only 0.15% more than a threshold of 65%) and decreases coverage tremendously (from 85.71% to 78.1%). All values in between may be chosen over the 50% threshold depending on how much importance is placed on accuracy over coverage.

To decide on the best threshold value, let's also take into account the 10 emails that have not been classified so far. Since none of them could be classified manually, it would probably be a good idea to put a good deal of them into the default category. Table 11 shows the percentage of classifiable emails, given the respective threshold value.

  threshold t    coverage
  < 47.4%        100%
  50%            90%
  55%            60%
  60%            60%
  65%            50%
  70%            50%

Table 11: Defining a threshold (data: 10 unclassifiable emails)

Most remarkable is the percentage of unclassifiable emails with a threshold of 55%. While a threshold of 50% would still classify 90% of the emails into one of the categories, this number goes down to 60% with a threshold of 55%, i.e. at least 4 of the 10 emails would be put into the default category. The number of emails classified under the default category might even be a lot higher, depending on which of the categories is selected as the default. For example, if we assume the sortiment class to be the default, 9 of the 10 emails would be classified under sortiment, 4 of them because of the low confidence level, the other 5 because they can be classified under sortiment with a confidence level above the threshold. Summarizing the results from both tables, the best thresholds seem to be 50% and 55%. 50% because it results in a high level of accuracy with little loss in coverage, 55% because while classifying most of the classifiable emails, the system puts a good deal of the unclassifiable emails into the default category.
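A minimal sketch of how such a threshold can be applied at classification time, here using Weka's class probability distribution (the threshold value and the choice of sortiment as the default class follow the discussion above, but the code itself is only an illustration):

import weka.classifiers.Classifier;
import weka.core.Instance;

// Sketch: classify an email only if the most probable class exceeds a confidence
// threshold t; otherwise fall back to a default category (here: sortiment).
public class ThresholdClassifier {
    static String classifyWithThreshold(Classifier model, Instance email,
                                        double threshold, String defaultClass) throws Exception {
        double[] dist = model.distributionForInstance(email); // class probabilities
        int best = 0;
        for (int i = 1; i < dist.length; i++)
            if (dist[i] > dist[best]) best = i;
        if (dist[best] >= threshold)
            return email.classAttribute().value(best);         // confident enough
        return defaultClass;                                    // e.g. "sortiment"
    }
}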


8 Summary and future work

In this paper, I investigated the application of machine learning techniques to the task of automatic email classification, conducting my experiments on a small Swedish email corpus. All in all, the results were better than expected, reaching, with the best feature list, more than 84% accuracy for the 105 manually classifiable emails, and 87% with a confidence threshold of 50% (resulting in a coverage of 95.24%). Even if we take into account that 10 of the 115 original emails were taken out of the training corpus because they could not be assigned to exactly one category, the accuracy is still above 77% if we assume that all of the 10 emails would be assigned the wrong category, a result significantly above baseline (66%). These encouraging figures are especially surprising because so little training data was used, and are probably due to the fact that the domain was restricted, which kept down the problems caused by the fact that words can have different meanings depending on context. However, using more training data could presumably improve the results further. The results for the category sortiment, for which most training data was available, suggest that with about 70 training examples for each category, an accuracy approaching 90% might be achievable.

As to feature reduction, the experiments showed that conversion to lower case, feature merging, and the removal of low-frequency words can be very effective, while stop word removal and term weighting tend to harm the results for the data at hand. The fact that the best results were obtained with only 15% of the features used in the first experiments suggests that a small set of features suffices for distinguishing among different categories.

An issue for future work could be the investigation of incremental learning, which would involve the implementation of a classifier that adapts its model to new information. Such a learner can be very effective because it does not rely on the initial training corpus only but constantly updates its classification model using new emails as additional training data. Additional labeled training data could be obtained by monitoring the user's behavior; co-training (cf. Kiritchenko & Matwin[13]) could additionally exploit unlabeled training data. In particular, it would be interesting to see whether classification performance could be improved by taking the email's age into account, as proposed by Rennie[21], e.g. by giving more weight to information obtained from newer emails. Such an approach would give consideration to the fact that the contents of the messages within a category may change over time, e.g. due to changes in the company's range of products.

Acknowledgements

Thanks a lot to Sandra Kübler for her inspiring comments, Andreas Wieweg for suggesting the idea for this project, Artificial Solutions for making the dataset available, Christoph Zwirello for making software available, Diana Razera for providing code, Zhang Yajing for proofreading, and Christopher Eichler for commenting on the graphical presentation of my results.


References

[1] Androutsopoulos, Ion; Georgios Paliouras; Eirinaios Michelakis (2004): "Learning to Filter Unsolicited Commercial E-mail." Technical Report, National Centre for Scientific Research "Demokritos".
[2] Baker, L. Douglas & Andrew Kachites McCallum (1998): "Distributional Clustering of Words for Text Classification." In: Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval.
[3] Batista, Gustavo E. A. P. A.; Ronaldo C. Prati; Maria Carolina Monard (2004): "A study of the behavior of several methods for balancing machine learning training data." In: SIGKDD Explorations.
[4] Bekkerman, Ron; Ran El-Yaniv; Naftali Tishby; Yoad Winter (2003): "Distributional Word Clusters vs. Words for Text Categorization." In: JMLR Special Issue on Variable and Feature Selection.
[5] Calvo, Rafael A. & H. Alejandro Ceccatto (2000): "Intelligent document classification." In: Intelligent Data Analysis.
[6] Cohen, William W. (1996): "Learning Rules that Classify E-Mail." In: AAAI Spring Symposium on Machine Learning in Information Access.
[7] Domingos, Pedro & Michael J. Pazzani (1997): "Beyond independence: Conditions for the Optimality of the Simple Bayesian Classifier." In: Lorenza Saitta (ed.): Machine Learning: Proceedings of the Thirteenth International Conference.
[8] Friedman, Jerome; Trevor Hastie; Robert Tibshirani (1998): "Additive logistic regression: A statistical view of boosting." Technical report, Statistics Department, Stanford University.
[9] Gaustad, Tanja & Gosse Bouma (2002): "Accurate stemming of Dutch for Text Classification." In: Theune, Marit & Anton Nijholt (eds.): Computational Linguistics in the Netherlands (CLIN) 2001.
[10] Guo, Gongde; Hui Wang; David Bell; Yaxin Bi; Kieran Greer (2004): "An kNN Model-based Approach and Its Application in Text Categorization." In: The 5th International Conference on Computational Linguistics and Intelligent Text Processing, LNCS 2945.
[11] Joachims, Thorsten (1997): "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization." In: International Conference on Machine Learning (ICML).
[12] Joachims, Thorsten (1998): "Text Categorization with Support Vector Machines: Learning with Many Relevant Features." In: Proceedings of the European Conference on Machine Learning (ECML).
[13] Kiritchenko, Svetlana & Stan Matwin (2001): "Email Classification with Co-Training." In: Proceedings of the 2001 Conference of the Centre for Advanced Studies on Collaborative Research, pages 192-201.
[14] Koller, Daphne & Mehran Sahami (1997): "Hierarchically classifying documents using very few words." In: Proceedings of ICML-97, 14th International Conference on Machine Learning.
[15] Lewis, David D. & Marc Ringuette (1994): "A comparison of two learning algorithms for text categorization." In: Third Annual Symposium on Document Analysis and Information Retrieval.
[16] Lewis, David D.; Robert E. Schapire; James P. Callan; Ron Papka (1996): "Training Algorithms for Linear Text Classifiers." In: Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval.
[17] Michelakis, Eirinaios; Ion Androutsopoulos; Georgios Paliouras; George Sakkis; Panagiotis Stamatopoulos (2004): "Filtron: A Learning-Based Anti-Spam Filter." In: Proceedings of the First Conference on Email and Anti-Spam (CEAS).
[18] Mitchell, Tom (1997): Machine Learning. Boston, MA: McGraw-Hill.
[19] Provost, Jefferson (1999): "Naive-Bayes vs. Rule-Learning in Classification of Email." Technical Report AI-TR-99-284, University of Texas at Austin, Artificial Intelligence Lab.
[20] Quinlan, J. Ross (1994): C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
[21] Rennie, Jason (2000): "ifile: An Application of Machine Learning to E-Mail Filtering." In: Proceedings of the KDD-2000 Workshop on Text Mining.
[22] Riloff, Ellen (1995): "Little words can make a big difference for text classification." In: Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval.
[23] Rocchio, Joseph J. (1971): "Relevance feedback in information retrieval." In: The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313-323. Englewood Cliffs, NJ: Prentice Hall.
[24] Salton, Gerard & Christopher Buckley (1988): "Term-weighting Approaches in Automatic Text Retrieval." In: Karen Sparck Jones & Peter Willett (eds.): Readings in Information Retrieval.
[25] Sebastiani, Fabrizio (2002): "Machine Learning in Automated Text Categorization." In: ACM Computing Surveys, Vol. 34, No. 1.
[26] Segal, Richard B. & Jeffrey O. Kephart (1999): "MailCat: An Intelligent Assistant for Organizing E-Mail." In: Proceedings of the Third International Conference on Autonomous Agents.
[27] Williams, Ken (2003): "A Framework for Text Categorization." Master Thesis, University of Sydney (School of Electrical and Information Engineering).
[28] Witten, Ian H. & Eibe Frank (2000): Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco, CA: Morgan Kaufmann.
[29] Yang, Yiming (1994): "Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval." In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[30] Yang, Yiming & Jan O. Pedersen (1997): "A Comparative Study on Feature Selection in Text Categorization." In: Proceedings of ICML-97, 14th International Conference on Machine Learning.


Appendix: Implementing an email classification system

Customer requirements

Building a system that automatically routes incoming email to the right person or department requires some decisions to be made by the customer. The most essential issue is the number of categories (e.g. departments) under which incoming email is to be classified (section 2.2). Another important decision is whether to classify all emails or to define a default category for all those emails for which the classification probability is below a certain confidence level (section 7). For each category, the customer then has to provide training data, i.e. emails that have been manually classified by a domain expert.

Technical environment

Implementations of machine learning algorithms, which form the core of any classification system, are provided by, for example, Weka, an open-source software package that can be downloaded freely from the internet (http://www.cs.waikato.ac.nz/ml/weka/); installation files are available for Windows and Linux. The installation process is quick and self-explanatory. The Weka API containing the Weka Java classes is included in the distribution and can be used for implementation (see http://alex.seewald.at/WEKA/ for instructions on how to call Weka from Java).

Feature selection

In order to apply machine learning techniques to the training data, the data needs to be converted into a list of features. Each email is represented in terms of these features and the classification model will be based on generalizations over these representations. The training data is provided in the file format ARFF (section 2.3). A good starting point for the list of features is to extract all words (tokens separated by blank space) from all emails in the data, convert them to lower case and remove all those that appear only once. The experiments above have shown that it may also help to merge several features into one (section 4.2) and to remove words that appear only twice or three times in the data (section 4.3.2), not only because it speeds up the classification process but also because it can improve the results.

Training

Once the data file has been built, it serves as input to build the classification model according to which new email will be classified. In order for the model to adapt to new information, it should be re-trained when new training data has arrived. This could for example be done in a periodic fashion (e.g. daily). Re-training usually requires monitoring the manual correction of classification errors, e.g. by storing each email that is forwarded to a different department and adding it to the training data along with the correct class (inferable from the department that it was forwarded to).
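A minimal sketch of such a periodic re-training step with the Weka API (the file names are hypothetical; the choice of classifier and storage format are open design decisions, not prescribed by the thesis):

import weka.classifiers.Classifier;
import weka.classifiers.meta.LogitBoost;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: periodic re-training -- rebuild the classifier from the current training
// file (to which corrected emails have been appended) and store it for later use.
public class Retrain {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("training-data.arff"); // hypothetical file name
        data.setClassIndex(data.numAttributes() - 1);
        Classifier model = new LogitBoost();
        model.buildClassifier(data);
        SerializationHelper.write("email-classifier.model", model); // reload with read()
    }
}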

