Email classification for automated service handling

Ross Tailby¹, Richard Dean¹, Ben Milner² and Dan Smith²

¹ Antech Engineering Ltd, Hewett Road, Great Yarmouth, NR31 0NN, UK {rd, ross}@antech.org.uk
² School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, UK {b.milner, dan.smith}@uea.ac.uk
ABSTRACT
We describe the experience and lessons learned from developing a range of electronic services for a specialist engineering company. We are using a custom workflow management system as the base for a range of services which are offered via a multimodal portal, using a language-based approach to extracting information from HTML forms, email, and SMS. We describe the email classification experiments we have carried out and discuss the development of customer services based on automatic email classification.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Clustering, information filtering.
General Terms
Algorithms, Measurement, Experimentation.

Keywords
Email, vector methods, T-Route, T-Trans, Naïve Bayes.
1. INTRODUCTION
In this paper we describe our experiences and lessons learned in developing an email classifier to support automatic handling and routing of routine enquiries for a specialist engineering company. Our work highlights a number of issues of interest to both the research and industrial communities.
The company has two business areas: certified calibration, and pressure valve testing and repair. The work of the company is very heterogeneous, as a result of the many possible outcomes at the completion of each task; several hundred types of equipment, with a correspondingly large number of inspection, calibration and repair tasks covering electrical, pressure, temperature and dimensional parameters; the need to integrate with many different customer workflow systems and business practices; and different levels of automation for low-level measurement processes. There are several distinct approaches, each with several variants, used by customers to order and authorize work: individual purchase order, bulk purchase order and credit card.
The remainder of this section describes the company's electronic services. Section 2 contains the description of the email database. The experiments and evaluation of email classification techniques using vector-based and probabilistic approaches are described in sections 3 and 4 respectively. In section 5 we briefly describe related work. Our conclusions are presented in section 6.

1.1 Automated services
The range of e-services falls into three groups: notification services, calibration management services, and information services. The first group of services is directed towards giving real-time status information and quotation information. The second group allows customers to outsource all or parts of their calibration management and related quality assurance functions. The third group is a set of push services to which customers subscribe for news and technical information. The logical architecture is shown in Figure 1.

Figure 1. E-services architecture.

Calibration certificates and job information can be retrieved using several criteria, including certificate number, job reference, purchase order reference, manufacturer and model number. Because of the wide range of identifier types we use an inverted index, implemented as a view over the workflow database and restricted by customer, to identify matches for the customer's search terms and query type. The portal also supports "push" services, where subscribing customers are automatically contacted through their chosen modality in response to events including purchase order and confirmation messaging, notification of faults needing repair and customer responses, job tracking, and news and technical information.

Customer queries submitted through the website are easy to process, as the fields necessary to construct the database request are all present. When customers send email or SMS messages, however, the information must be extracted from the message, as these are free-form modalities. The messages are classified and the relevant parameters extracted; messages which cannot be classified with sufficient confidence, and those which require human action, are passed to the appropriate people or departments in the company. A minimal sketch of this message-handling flow is shown below.
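The following Python sketch illustrates how such a pipeline might fit together. It is illustrative only: the class names, helper functions, stand-in classifier and the confidence threshold are our assumptions, not the production system.

from dataclasses import dataclass

@dataclass
class Message:
    sender: str  # email address or phone number taken from the header
    body: str    # free-form text of the email or SMS

def classify(body: str) -> tuple[str, float]:
    """Stand-in for the email classifier; returns (class, confidence)."""
    return ("job status", 0.8)

def extract_parameters(body: str) -> dict:
    """Stand-in for parameter extraction (e.g. regexes for job numbers)."""
    return {}

def format_response(result, customer) -> str:
    """Stand-in for rendering the reply in the customer's chosen modality."""
    return str(result)

def handle_message(msg: Message, db, threshold: float = 0.6):
    """db is any object providing lookup_customer() and query()."""
    label, confidence = classify(msg.body)
    if confidence < threshold:
        return ("escalate", msg)           # uncertain: route to staff
    params = extract_parameters(msg.body)  # e.g. job/certificate numbers
    customer = db.lookup_customer(msg.sender)
    result = db.query(label, customer, params)
    return ("respond", format_response(result, customer))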
A database query is formed based on the message class and extracted parameters. For example, identifying a customer through header information, in either the email or SMS, restricts the query to that customer's equipment. Additional information may further restrict the database query or allow the enhancement or correction of incomplete query parameters found in the message. The information exchange interfaces are XML structures, which enable the information to be transformed and presented in the appropriate output modality and address specified in the customer profile. This prevents inappropriate delivery choices, such as attempting to send a PDF copy of a calibration certificate to a PDA. An example dialogue is shown in Figure 2.

Email Request:
Thurs 15 February 2005 20:34:56 GMT
From: [email protected]
To: [email protected]
Subject: Return of gas density transducer 7812
Please can you let me know the expected delivery of gas density transducer GDT-200394842/3? We are currently renovating our rig and need it back as soon as possible. Many thanks, Graham

SMS Response:
Thurs 16 February 2005 21:04:42 GMT
From: Antech Sales (+44 7713 355276)
Gas Density Transducer GDT-200394842/3. Received: 11/02/05. Awaiting labour. Completion due: 17/02/05. Delivery: express courier 18/02/05.

Email Response:
Thurs 16 February 2005 21:04:42 GMT
From: [email protected]
To: [email protected]
Subject: Re: Return of gas density transducer 7812
Graham, We confirm receipt of your equipment, delivered to us on 11/02/05 via courier. Your gas density transducer GDT-200394842/3 has been inspected and is currently awaiting labour. We expect this to be completed by 17/02/05 and delivered back to you via courier, arriving on 18/02/05. Thank you for your patience, Antech Calibration Services

Figure 2. Example of an email-based job status request.
2. DATABASE CREATION
To train and test the email classification system it was first necessary to create a database of emails. Although many public email databases can be used for testing email classification systems, the system under construction here needs to be specific to the task of classifying Antech's own customer enquiries, so collecting our own database was necessary. This section describes the collection of such a database and the subsequent labeling of each individual email as belonging to a particular category. Finally, the emails are converted from text to vector form. A collection of 1437 emails was categorized and labeled into the classes shown in Table 1. The primary aim was to accurately identify emails as documents belonging to one of the six classes. Data from December, February and April form the training data set, with the remainder of the data used for testing. The data was prepared by removing HTML markup, numeric tokens, non-word characters and stopwords before being lightly stemmed and vectorized. The prepared collection contained 1053 emails comprising 23,118 terms with a vocabulary of 6,500 words.
Table 1. Initial training data set

Class          Dec  Jan  Feb  Mar  Apr  May
1 Quote         68   68   92  147  174  132
2 Certificate    8   12    6   34   30    9
3 Collection    30   32   40   28   63   66
4 Job status    18   32   22   29   57   48
5 Account       12   24    8   33   69   39
6 Technical     28   42   34   46   72  117
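As an illustration of this preparation step, a minimal Python sketch follows; the stopword list and the light stemming rule shown are simplified assumptions, not the exact pipeline used.

import re
from collections import Counter

# Simplified stand-in for the stoplist used in our experiments.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "to", "is", "we", "please"}

def prepare(text: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", text)             # strip HTML markup
    tokens = re.findall(r"[a-zA-Z]+", text.lower())  # drop numerics and non-words
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Very light stemming: trim a trailing "s" (illustrative only).
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def vectorize(tokens: list[str]) -> Counter:
    return Counter(tokens)  # raw term counts; weighting is applied later

print(vectorize(prepare("Please send a certificate for the Avo meter")))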
3. VECTOR-BASED CLASSIFICATION
Based on an analysis of published results, we used the T-Route and T-Trans vector comparison algorithms [3], experimenting with several term weighting schemes, a Latent Semantic Analysis (LSA) transformation and some post-processing to improve the classification accuracy. In the T-Route method, an average document vector is computed for each of the K = 6 routes. The T-Route term-document matrix, W_TR, is therefore of size M × K, where M is the vocabulary size and each element W_ij represents the number of times term t_i occurs for route r_j. The T-Trans method differs in that, instead of averaging the document vectors, every unique training document vector is retained and stored in the term-document matrix W_TT, which is of size M × N; each column represents an email, and each element W_ij represents the frequency of term t_i in email d_j. The input email is assigned the same class as the column vector in W_TT with the smallest Euclidean distance to the input email. Our base accuracy was 33.7% for T-Route and 40.35% for T-Trans; results in the remainder of this section are for T-Trans.
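A sketch of the two comparison methods as described above (illustrative: vectors are plain term-count arrays here, and the helper names are our own):

import numpy as np

def t_route(train_vecs, train_labels, query):
    """One averaged document vector per class/route; nearest centroid wins."""
    classes = sorted(set(train_labels))
    centroids = {c: np.mean([v for v, l in zip(train_vecs, train_labels) if l == c],
                            axis=0) for c in classes}
    return min(classes, key=lambda c: np.linalg.norm(query - centroids[c]))

def t_trans(train_vecs, train_labels, query):
    """Every training email retained; nearest single email by Euclidean distance."""
    dists = [np.linalg.norm(query - v) for v in train_vecs]
    return train_labels[int(np.argmin(dists))]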
Figure 3. Illustration of the vectorization of an email.

The raw term count represented by each element W_ij of the term-document matrix W is unsuitable for direct matching against an input document vector. Several term weighting methods were investigated, including Inverse Document Frequency (IDF), Carpenter weighting [2], Mutual Information scoring and Bellegarda weighting [1]. Bellegarda weighting gave the best improvement in accuracy. It combines a global weighting for the overall importance of each word with a localized weighting for its importance in each document. The global weighting, G_i, is defined as

$$G_i = 1 - E_i = 1 + \frac{1}{\log N} \sum_{j=1}^{N} f_{ij} \log f_{ij}$$

where $f_{ij} = c_{ij} / t_i$, $c_{ij}$ is the number of times word $w_i$ appears in document $d_j$, and $t_i$ is the total number of times $w_i$ appears in the collection. The local weighting is defined as

$$L_{ij} = \log_2 \left( 1 + \frac{c_{ij}}{n_j} \right)$$

where $n_j$ is the number of words in document $d_j$. The full Bellegarda weighting is therefore

$$W_{ij} = G_i L_{ij}$$

which forms the new word-document matrix W for each word $w_i$ in each document $d_j$ in the training set. The overall classification accuracy improved to 52.1% with weighted term occurrences.
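A direct transcription of these formulas into Python (a sketch; assumes a dense counts matrix and guards against empty rows and columns):

import numpy as np

def bellegarda_weights(C: np.ndarray) -> np.ndarray:
    """C is an M x N matrix of raw term counts c_ij (terms x documents)."""
    C = C.astype(float)
    M, N = C.shape
    t = np.maximum(C.sum(axis=1, keepdims=True), 1.0)  # t_i: total count of term i
    f = C / t                                          # f_ij = c_ij / t_i
    logf = np.log(f, out=np.zeros_like(f), where=f > 0)
    G = 1.0 + (f * logf).sum(axis=1) / np.log(N)       # global weight G_i = 1 - E_i
    n = np.maximum(C.sum(axis=0, keepdims=True), 1.0)  # n_j: words in document j
    L = np.log2(1.0 + C / n)                           # local weight L_ij
    return G[:, None] * L                              # W_ij = G_i * L_ij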
LSA uses singular value decomposition (SVD) to compact the original term-document matrix, W, into a much smaller number of dimensions. SVD is first applied to the term-document matrix to give $W = U S V^T$, where U and V are orthonormal matrices of size M × N and N × N, and S is a diagonal matrix comprising the singular values of W. The terms forming W can then be compacted by reducing the dimensionality of S such that only the top R singular values are retained, with the remaining N − R values set to zero to give the matrix S'. The resulting compacted term-document matrix is computed as

$$W' = U S' V^T$$

An input document vector, q, is projected into the reduced feature space by

$$q' = q^T U (S')^{-1}$$

This initially offered minimal improvement but, with the addition of more training data, the overall accuracy rose to 53.2%.

To improve email classification accuracy further, an uncertainty measure was derived from the Euclidean distance between the input email and the closest email in the term-document matrix. The optimal uncertainty threshold is 0.6 in this application.
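A sketch of this rejection rule (illustrative; it assumes the vectors already live in the weighted or LSA-projected space):

import numpy as np

def classify_with_rejection(train_vecs, train_labels, query, threshold=0.6):
    """Return the nearest-neighbour class, or None if the match is too distant."""
    dists = np.array([np.linalg.norm(query - v) for v in train_vecs])
    best = int(np.argmin(dists))
    if dists[best] > threshold:
        return None  # too uncertain: forward the email to a human agent
    return train_labels[best]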
Figure 4. Classifications by distance.

To further improve the classification accuracy of the remaining emails, information in the company's workflow database was used. First, any input email whose sender could not be matched to a contact instance in the database was removed and forwarded to a human agent; this accounted for approximately 15% of emails. Second, customer information stored in the workflow database was used to restrict the number of feasible classes for an input email. For example, a customer with no current jobs cannot be asking for the status of an item being calibrated. A sketch of this class-restriction step follows.
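In this sketch the job-status rule comes from the example in the text; the invoice rule and the Customer fields are our own illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class Customer:
    current_jobs: list = field(default_factory=list)
    open_invoices: list = field(default_factory=list)  # assumed field

ALL_CLASSES = ["quote", "certificate", "collection", "job status",
               "account", "technical"]

def feasible_classes(customer: Customer) -> list[str]:
    classes = list(ALL_CLASSES)
    if not customer.current_jobs:     # nothing currently in the lab...
        classes.remove("job status")  # ...so a status enquiry is infeasible
    if not customer.open_invoices:    # assumed rule, for illustration
        classes.remove("account")
    return classes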
3.1 Experimental results
One of our first decisions was whether to use the T-Route or T-Trans algorithm for email classification. In the initial training phase we used data from three months of company emails, approximately 400 documents. With no term weighting, omitting singletons, using a stoplist and applying absolute distance measures, T-Route gave 33.7% accuracy and T-Trans gave 40.35%. This difference was consistent across experiments. The second aspect to determine was the term weighting to be used by the classifier. We experimented with a wide range of weighting algorithms; the results can be seen in Table 2.
Table 2. Results from term weighting experiments

Weighting                Accuracy
TF-IDF                   48.5%
IDF                      38.6%
Carpenter                36.8%
Mutual Information       49.1%
Bellegarda local         48.5%
Bellegarda full          49.1%
Bellegarda full and MI   52.1%

We saw very similar results from three of the weighting algorithms: TF-IDF, Mutual Information and the full Bellegarda weighting. Combining the full Bellegarda and Mutual Information weightings improved the accuracy, but at the cost of processing time: the MI approach is processor-intensive and adds significantly to training time for comparatively little gain. These results show that more training data was needed for the classifier to reach an acceptable level of accuracy. One problem with giving the classifier more data was that the spread of data across the classes was so uneven that the quotations class would increase dramatically while the other classes often grew only by minor increments. In the worst case, the quotes doubled from 231 to 428 mails while the certificate requests increased only from 51 to 61.

Figure 5. Increasing training data against accuracy.

Due to the uneven spread of data, the classification accuracy decreases as the training set grows beyond 600 emails (Figure 5). To overcome this problem we introduced a number of new techniques. The first was an alternative testing method in which only one email at a time is tested, with all other documents left in the training set for comparison. This avoids the need for a dedicated test set and allows the training set to increase the proportion of emails in the smaller classes; it increased the overall accuracy to 61.4%. As more data should improve the classifier, we then increased the training dataset to nine months of company emails, but restricted quotes to the number of emails in the next largest class. This data set has 6166 dimensions and gave an overall accuracy of 63.2%. We took 63.2% as the baseline accuracy for these experiments, achieved using the full Bellegarda weighting, omitting singletons and using the T-Trans approach with the extra months of company data in the training set. As we believed that the LSA technique would improve the classification accuracy, it was implemented and tested across a varying number of dimensions; a sketch of the projection step follows.
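A minimal sketch of the LSA fit and projection described earlier, using numpy's SVD (the dimensionality R is chosen experimentally, as in Figure 6):

import numpy as np

def lsa_fit(W: np.ndarray, R: int):
    """Truncated SVD of the M x N term-document matrix W, keeping R dimensions."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :R], s[:R]

def lsa_project(q: np.ndarray, U_r: np.ndarray, s_r: np.ndarray) -> np.ndarray:
    """Project a term-count query vector q: q' = q^T U (S')^-1."""
    return (q @ U_r) / s_r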
Figure 6. Accuracy vs. dimensionality reduction.

Table 4. Results for first pass of database information

         Quote  Cert.  Collect  Status  Acc.  Tech.  Total
Quote     83.3    0.0      3.3     3.3   0.0   10.0    448
Cert.     23.8   47.6      4.8     0.0  14.3    9.5     92
Collect   40.0    3.3     23.3     6.7   6.7   20.0    167
Status    60.0    0.0     10.0    23.3   3.3    3.3    171
Acc.      40.0    0.0      3.3     0.0  53.3    3.3    160
Tech.     33.3    0.0      3.3     3.3   6.7   53.3    224

The overall accuracy of the results shown in Table 4 is 47.37%, a significant fall from our baseline. In cases where there is substantial uncertainty in the classification of an email (i.e. the uncertainty measure is above the threshold), a query is issued to the workflow database to see whether the number of feasible classes can be reduced. Since the database restriction query is complex, there is a noticeable performance penalty if it is run for every email. If the number of feasible classes was reduced, the email was reclassified using the smaller set of classes. This increased the overall accuracy to 68.5% (Table 5).

Table 5. Results using threshold and database information

         Quote  Cert.  Collect  Status  Acc.  Tech.  Total
Quote     78.3    0.0      4.3    13.0   0.0    4.3    448
Cert.      5.9   64.7      5.9     0.0  11.8   11.8     92
Collect    0.0    5.3     63.2    10.5  10.5   10.5    167
Status    12.5    0.0     18.8    62.5   6.3    0.0    171
Acc.       5.3    0.0      5.3     0.0  84.2    5.3    160
Tech.     23.5    0.0      5.9     5.9  11.8   52.9    224

This gives an improvement of approximately 5% over our baseline accuracy. The improvement is relatively consistent across all configurations of the classifier, and is most marked in the smaller classes, where the email content or request is closely related to database content.

We attempted to increase the classification accuracy of this classifier further by adding more data from the Antech email accounts. However, the class distributions skewed the accuracy: adding more training data increases the gulf between the number of quotation emails and most of the other categories, which in turn biases the classifier towards predicting quotations too often. We therefore decided to keep the best-performing levels of data found so far and to attempt alternative techniques.
3.2 Discussion
The vector comparison methods gave unacceptably poor results. Further analysis of the dataset highlighted four issues:
1. Irrelevance: requests tacked onto long correspondences originally discussing different matters, familiarity with staff members producing incomplete requests, slang words, and personal enquiries.
2. Over-representation: a few customers making many similar requests in a few categories. Their names and language style become synonymous with those categories, and their requests are very terse.
3. Overfitting: standardized message wording from commercial services bureaux (e.g. forwarded quotation requests).
4. Non-words: personal names, item identifiers, figures and model numbers.

4. PROBABILISTIC CLASSIFICATION
As the vector-based classifiers were insufficiently accurate, we conducted a series of experiments on the same collection using a Naïve Bayes (NB) classifier [9][10].
To classify an email E we need to find the class B_i that maximizes

$$P(B_i \mid E) = \frac{P(E \mid B_i) P(B_i)}{P(E)}$$

where i ranges over the classes. We ignore P(E), as we assume all documents are equally likely. P(B_i) is calculated for each class by dividing the number of words in class B_i by the total number of words occurring throughout the entire dataset. The likelihood is

$$P(E \mid B_i) = P(W_1 \wedge W_2 \wedge \ldots \wedge W_k \mid B_i)$$

where W_1, ..., W_k are the words observed in email E. This equation is impractical to calculate directly, so we assume that the words in email E are independent of each other, giving

$$P(E \mid B_i) \approx \prod_{j=1}^{k} P(W_j \mid B_i)$$

where k is the number of distinct words observed in email E. To deal with zero-probability multiplications (if an observed word in a test email matched a probability of zero in the equivalent class matrix, the entire probability prediction for that class would be wiped out), zero figures were replaced with extremely small numbers (experimentally estimated as 1.0E-8 or less). The calculations were carried out with logarithms to avoid the numeric instability that occurs when multiplying long sequences of very small values. Using this technique we achieved an accuracy of 89.5%. To further improve this performance we used the Bellegarda weighting algorithm, which gave an accuracy of 91.8% and the confusion matrix shown in Table 6. A minimal sketch of this classifier follows the table.

Table 6. Naïve Bayes with Bellegarda weighting

         Quote  Cert.  Collect  Status  Acc.  Tech.  Total
Quote     80.0    0.0      0.0    10.0   3.3    6.7    448
Cert.      0.0   90.5      0.0     9.5   0.0    0.0     92
Collect    0.0    0.0     90.0     6.7   3.3    0.0    167
Status     0.0    0.0      0.0   100.0   0.0    0.0    171
Acc.       3.3    0.0      0.0     3.3  93.3    0.0    160
Tech.      0.0    0.0      0.0     3.3   0.0   96.7    224
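The sketch below uses log-domain scoring with the 1.0E-8 floor, as described above; the data layout (token lists per class) is our assumption.

import math
from collections import Counter

def train_nb(docs_by_class: dict[str, list[list[str]]]):
    """docs_by_class maps a class name to its training emails (token lists)."""
    total = sum(len(ws) for docs in docs_by_class.values() for ws in docs)
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for ws in docs for w in ws)
        n_words = sum(counts.values())
        model[c] = (n_words / total, counts, n_words)  # prior, counts, class size
    return model

def classify_nb(model, words: list[str], floor: float = 1e-8) -> str:
    """Log-domain Naive Bayes score; the floor replaces zero probabilities."""
    def score(c):
        prior, counts, n = model[c]
        return math.log(prior) + sum(
            math.log(max(counts[w] / n, floor)) for w in words)
    return max(model, key=score)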
4.1 Applying thresholds
After studying the probability scores of correct and incorrect predictions, we chose as our uncertainty measure the difference between the top prediction and the second-place prediction, normalized with respect to the test mail's word count. Results using a cut-off of 40 are shown in Table 7; a sketch of the rule follows the table. Using this technique we can classify with 98.7% accuracy, while forwarding only 13% of mails to human agents.
Table 7. Adjusted Naïve Bayes with threshold levels

         Quote  Cert.  Collect  Status  Acc.  Tech.  Total
Quote    100.0    0.0      0.0     0.0   0.0    0.0    448
Cert.      0.0   94.4      0.0     5.6   0.0    0.0     92
Collect    0.0    0.0    100.0     0.0   0.0    0.0    167
Status     0.0    0.0      0.0   100.0   0.0    0.0    171
Acc.       3.6    0.0      0.0     0.0  96.4    0.0    160
Tech.      0.0    0.0      0.0     0.0   0.0  100.0    224
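A sketch of this rejection rule: the scores dict would come from a scorer such as the NB sketch above, and the scale of the cut-off depends on the log base and weighting used, so treat 40 as application-specific.

def accept_prediction(scores: dict[str, float], n_words: int,
                      cutoff: float = 40.0):
    """Return the top class, or None (forward to a human) if the margin
    between the two best scores, normalized by word count, is too small."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    margin = (ranked[0][1] - ranked[1][1]) / max(n_words, 1)
    return ranked[0][0] if margin >= cutoff else None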
4.2 Discussion
We have identified four reasons why NB outperforms the vector approaches for this application. First, the vector approach relies strongly on the vocabulary data from which each vector is created. We have already outlined several problems with the vocabulary of the dataset (see section 3.2), all of which are reflected in the word frequency vectors of the training and test emails, and which are at least partially responsible for pushing vectors of differing classes together in the vocabulary space. The T-Trans algorithm compares the distance between a training email and a new email using the distances between all words in the vocabulary. Since there is substantial overlap in the vocabulary of our classes, highly weighted words in a training email will often lead to misclassification, as all vector positions are always compared. The NB classifier only considers the probabilities of words that occur in the new email and so avoids this problem.
NB also addresses the problem of biased class distributions by normalizing each of the class word probabilities by the number of words in the class. The vector approaches normalized for document length, but did not address the problem of very different class sizes, particularly the quotation class. With the NB classifier, the probabilities for the classes sum to 1.0, which avoids this problem.

Most term weighting techniques do not weight words with regard to the class they appear in. The Bellegarda weighting algorithm calculates a global weighting for the word in the whole dataset and a local weighting for the word in the specific document it appears in; it does not calculate weightings for the words within classes to show which words are indicative of that class. The NB classifier weights words by dividing the frequency of the word occurring in a class by the frequency of all words in the class, heavily weighting words that are important to the class.

Using database information to remove infeasible classifications has given a consistent, modest improvement in classification accuracy, improving it by 5 percentage points. We anticipate that this gain will grow as we add more classes, since the number of potential confusions increases.

5. RELATED WORK
There has been substantial research into email classification for automated response systems, spam filtering and email forwarding [2][3][4][7][8], and NB methods have been used for spam filtering [10][11][12][13]. Document comparison in the zip or gzip domain is explored in [6]. Using database information to restrict classifiers is less common: [5] uses customer preferences for product recommendations and improves accuracy with feature-weighting algorithms (Mutual Information, Inverse User Frequency) and by limiting selection to very relevant instances.

6. CONCLUSIONS
We have described the process of constructing an automatic email classification system in a corporate environment where database information regarding the client is available. We have shown that a Naïve Bayes classifier can perform well with badly skewed training data where vector-based approaches perform poorly. We have also shown that the information available in customer databases can be used to restrict the classifier and eliminate infeasible classifications. Overall, we have succeeded in creating an automatic email classification system that is sufficiently accurate for deployment in a multimodal automated response system.

ACKNOWLEDGEMENTS
We thank the UK Knowledge Transfer Partnership programme for its support of this project (KTP 4317), and Jim Gunn and Ron McDonald for their vision, support and encouragement.

REFERENCES
[1] J. R. Bellegarda. A multispan language modelling framework for large vocabulary speech recognition. IEEE Trans. on Speech and Audio Processing, 6(5), 456-467, 1998.
[2] J. Chu-Carroll and R. Carpenter. Vector-based natural language call-routing. Computational Linguistics, 25(3), 361-388, 1999.
[3] S. J. Cox and B. Shahshahani. A comparison of some different techniques for vector based call-routing. Proc. Eurospeech, 2001.
[4] H. K. J. Kuo and C. Lee. Discriminative training in natural language call-routing. Proc. Int. Conf. on Spoken Language Processing, October 2000.
[5] K. Yu, X. Xu, M. Ester and H.-P. Kriegel. Feature weighting and instance selection for collaborative filtering. Knowledge and Information Systems, 5(2), 201-224, 2003.
[6] R. Cilibrasi and P. Vitanyi. Clustering by compression. IEEE Trans. on Information Theory, 51(4), 2005.
[7] B. Klimt and Y. Yang. The Enron corpus: A new dataset for email classification research. ECML, Pisa, 2004.
[8] T. A. Meyer and B. Whateley. SpamBayes: Effective open-source, Bayesian based, email classification system. CEAS, 1-8, 2004.
[9] A. McCallum and K. Nigam. A comparison of event models for Naïve Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization, 1998.
[10] M. Sahami et al. A Bayesian approach to filtering junk email. AAAI-98 Workshop on Learning for Text Categorization, 1998.
[11] I. Androutsopoulos et al. Learning to filter spam email: A comparison of a Naïve Bayesian and a memory-based approach. PKDD-2000, September 2000.
[12] P. Langley, W. Iba and K. Thompson. An analysis of Bayesian classifiers. National Conference on Artificial Intelligence, 223-228, 1992.
[13] R. Calvo, J.-M. Lee and X. Li. Managing content with automatic document classification. Journal of Digital Information, 5(2), 2004.