
Towards An Adaptive Mail Classifier

Giuseppe Manco1, Elio Masciari1, Massimo Ruffolo1, Andrea Tagarelli2

1 ISI-CNR – Institute of the Italian National Research Council
Via Bucci 41c, 87036 Rende (CS), Italy
E-mail: {manco,masciari,ruffolo}@isi.cs.cnr.it

2 DEIS-UNICAL
Via Bucci 41c, 87036 Rende (CS), Italy
E-mail: [email protected]

Abstract. We introduce a technique based on data mining algorithms for classifying incoming messages, as the basis of an overall architecture for the maintenance and management of e-mail messages. We exploit clustering techniques for grouping structured and unstructured information extracted from e-mail messages in an unsupervised way, and use the resulting algorithm in the process of folder creation (and maintenance) and e-mail redirection. Initial experimental results show the effectiveness of the technique, from both an efficiency and a quality-of-results viewpoint.

1  Introduction

The process of analyzing and organizing e-mail messages is a challenging application of Web and Text mining techniques. In recent years, the increasing popularity of the Web as a means of sharing information has generated a huge traffic of e-mail messages in various forms. The problems of detecting unsolicited (spam) e-mails, of analyzing newsletter and mailing-list messages, and of rapidly detecting important messages (and separating them from unimportant ones) have become pressing. The activity of organizing e-mails keeps users busy for a considerable amount of their time, and it may also cost money in dial-up connections or bandwidth consumption. Many tools have been developed with the aim of helping users organize their mailbox. For example, the Microsoft Outlook mail reader offers the user the possibility of defining filters, i.e., rules that allow the identification of the content of an e-mail message and, consequently, the classification and organization of the e-mail within the mailbox. Such tools, however, are mainly human-centered, in the sense that the user is required to manually describe rules and keyword lists that can be used to recognize the relevant features of messages. Such an approach has the following disadvantages:

• it is inadequate for a large volume of e-mails, which may contain too many distinct and possibly overlapping concepts. As a consequence, the process of manually detecting the main contents of the mail messages occurring in a mailbox can be time-consuming;

• it does not accommodate requirement changes, which usually occur when users need to periodically re-organize their mailbox. The contents of e-mail messages may change with a given frequency, and consequently the user is required to periodically re-organize their filters.

The current literature has paid much attention to the problem of rapidly detecting spam (i.e., unsolicited) e-mail messages. The problem of detecting spam messages can be approached by building a classifier system capable of minimizing three main measures: error rate, false-positive rate, and false-negative rate [27]. Among the techniques used for classifying spam messages, we mention the Rocchio approach (and similar approaches based on Support Vector Machines [7]). Papers [4] and [13] describe two rule-based systems exploiting text mining techniques for the classification of e-mail correspondence. These approaches differ mainly in the preprocessing phase: the first approach uses a simple boolean vector model, whereas [13] proposes a frequency-based vector model. Other approaches [21, 17, 2, 22] are mainly based on Bayesian classification. Although very sensitive to the inherent preprocessing steps, such as lemmatization and stop-lists, Bayesian approaches have been shown to be more accurate than rule-based approaches [17, 16, 8, 18]. Unlike the above approaches, which perform a binary classification (spam/no spam), the MailCat system [23] uses an instance-based text classifier to predict the most likely destination folder for each incoming message.

The above approaches provide a supervised classification framework: message folders pre-exist, and the main objective is to detect the most likely folder for an incoming message. To our knowledge, few approaches deal with unsupervised classification, i.e. the automatic construction of subject-based folders starting from a set of incoming messages. Among these we mention Scatter/Gather [5], which uses a complex intermixing of iterative partitional clustering and an agglomerative scheme. The Athena system [1] provides a clustering algorithm that produces only cluster digests for topic discovery, and performs a message partitioning on the basis of such digests using a Bayesian classifier. Other approaches to the organization of text messages, mainly based on collaborative filtering, have been adopted in the definition of agents for accessing Usenet news [19, 11]. The main problem with such approaches is that they require interaction with the user in order to learn a suitable organization of messages.

The objective of this paper is the definition of a mail classification system capable of automatically organizing the e-mail messages stored in a mail server, and of providing a web-based e-mail client capable of both suggesting the discovered classification rules and automatically organizing the incoming messages. Our aim is the development of a web-based service, which has the advantage of providing the organization service as a platform-independent, non-invasive middleware between the mail server and the various mail clients. The organization service is mainly based on clustering algorithms. We study how to extract structured and unstructured information from e-mail messages, and adopt a representation model for such information. Next, we propose a clustering algorithm to accomplish the task of organizing such information, and study the corresponding accuracy of the proposed system.

The paper is organized as follows. Section 2 describes the basic text preprocessing steps required to represent the message collection in a suitable form for the clustering phase. In Section 3 we define the adaptive message classification problem, whereas Section 4 describes a methodology for organizing collections of incoming messages into folders by means of a clustering framework. The section ends with some experiments demonstrating the effectiveness of the approach. Finally, Section 5 contains concluding remarks and some pointers to future developments of the framework.

2  Preliminaries: Text Processing

In this section, we briefly review how to represent a message collection by a vector space model. Details of such a representation can be found in [3]. The main issue is the identification of the set of relevant terms (index terms) likely to appear in the documents under consideration. To this end, some standard operations are generally acknowledged. We now introduce each operation in turn.

1. Lexical analysis. In this phase, the string of text representing a given document is tokenized in order to identify the candidate words to be adopted as index terms.

2. Removal of explicit stopwords. Stopwords are words occurring with high frequency in the text of the messages. Such terms are not relevant, as they do not discriminate among contents. Removal of stopwords has the advantage of making the selection of the candidate index terms more efficient and of considerably reducing the size of the index structure.

3. Stemming. This is the process of reducing the syntactical variants of words to their root, mainly by eliminating plurals, tenses, gerund forms, prefixes and suffixes. Stemming aims at improving the match between a given index term and a term appearing in a document, and as a consequence contributes to reducing the index structure.

To obtain a suitable representation of the message collection, we need to provide information about the frequencies of index terms in the documents. In particular, we need to compute two basic frequency measures: the term frequency and the inverse document frequency. To evaluate the relevance of a term for content representation, term frequency measures the number of occurrences of an index term. In our particular case, we define the Message frequency, denoted by ϕ(w_j, m_i), as the number of occurrences of term w_j within message m_i. Index terms with high frequency best represent the message content, especially in long messages. On the other hand, for short messages, the Message frequency information is likely to be negligible, or even misleading. In order to significantly represent message contents, it is useful to compare the message frequency of an index term with the number of occurrences within the overall set of messages. Indeed, a term occurring in many different messages is not as discriminant as a term appearing in only a few messages. This suggests defining the relevance of a term by a combination of its message frequency and an inverse function of the number of messages in which the term occurs. More precisely, we define the Collection frequency, denoted by ψ(w_j, C), as the number of messages within a given collection C = {m_i1, . . . , m_ih} containing w_j. In correspondence to collection frequencies, we can associate a further preprocessing step, namely the removal of implicit stopwords. By implicit stopword we mean a term w exhibiting a collection frequency not included in a fixed interval [l, u] (i.e., either ψ(w, C) > u or ψ(w, C) < l). As a result, we can define the (normalized) term-message matrix, i.e. the set {w_1, . . . , w_N} of the feature vectors, defined as follows:

    w_ji = ϕ(w_j, m_i) · log(N/ψ(w_j, C)) / √( Σ_p [ϕ(w_p, m_i) · log(N/ψ(w_p, C))]² )   if l ≤ ψ(w_j, C) ≤ u
    w_ji = 0                                                                             otherwise
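To make the weighting scheme concrete, the following is a minimal Java sketch of how these normalized weights could be computed; it is not taken from the AMCo implementation described in Appendix A, and all class and method names are illustrative. Note that the thresholds l and u are expressed here as absolute message counts, whereas the experiments of Section 4.3 state them as percentages of N.

```java
import java.util.*;

// Illustrative sketch of the term-message weighting of Section 2.
public class TermMessageMatrix {

    // phi(w, m): occurrences of term w in the tokenized message m.
    static int messageFrequency(String term, List<String> message) {
        int count = 0;
        for (String w : message) if (w.equals(term)) count++;
        return count;
    }

    // psi(w, C): number of messages in the collection containing w.
    static int collectionFrequency(String term, List<List<String>> collection) {
        int count = 0;
        for (List<String> m : collection) if (m.contains(term)) count++;
        return count;
    }

    // Normalized weight vector of one message; terms whose collection
    // frequency falls outside [l, u] are implicit stopwords (weight 0).
    static Map<String, Double> weightVector(List<String> message,
                                            List<List<String>> collection,
                                            int l, int u) {
        int n = collection.size();
        Map<String, Double> weights = new HashMap<>();
        double norm = 0.0;
        for (String w : new HashSet<>(message)) {
            int psi = collectionFrequency(w, collection);
            if (psi < l || psi > u) continue;           // implicit stopword
            double x = messageFrequency(w, message) * Math.log((double) n / psi);
            weights.put(w, x);
            norm += x * x;
        }
        norm = Math.sqrt(norm);
        if (norm > 0)
            for (Map.Entry<String, Double> e : weights.entrySet())
                e.setValue(e.getValue() / norm);
        return weights;
    }
}
```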

3  Message Classification: Problem Statement and Decomposition

Computer-mediated communication systems are composed of two main modules: User Agents (UA) and Message Transfer Agents (MTA). The main purpose of UA tools (such as Eudora) is to provide assistance to users in writing and delivering messages, while MTAs (such as Sendmail) cope mainly with message transfer, according to protocols such as POP3, IMAP4 or SMTP [26]. Usually, messages are formatted according to the RFC 822 standard, or its extension, MIME [26]. RFC 822 requires that messages, mainly in ASCII text, contain both formatting instructions and structured information (such as the To, From and Received fields). The MIME standard introduces some extensions to RFC 822 (such as, for example, the Content-Type and Content-Transfer-Encoding fields) for dealing with messages that in principle may contain more than simple ASCII text. In this paper we do not deal with MIME extensions: as a matter of fact, adapting the general results described in this paper to MIME messages can be considered a conservative extension, which can be treated separately.

The problem of message classification can hence be stated as follows. Given an MTA, define an adaptive and autonomous User Agent, i.e., a UA capable of:

1. finding a suitable way of organizing text messages coming from the MTA and directed to a given user into homogeneous groups/hierarchies;
2. redirecting future incoming messages from the MTA according to such an organization;
3. incrementally refining the organization according to requirement changes, such as user needs and traffic.

Point 2 can be seen as a generalization of the approaches analyzed in Section 1, and hence can be dealt with according to such approaches (e.g., Bayesian approaches). Points 1 and 3 can be referred to simply as the problems of i) automatically defining content-based folders, and ii) changing the contents/organization of such folders over time. It is clear that points 2 and 3 are strictly correlated: new messages may contain unclassified information, and consequently require a revision of the created folders. For the purpose of this paper, in the following we only deal with point 1. A more detailed treatment of each of the three aspects can be found in [14].

4  Clustering of Messages

For our purposes, the most interesting problem is the automatic identification of interesting and important messages, and their organization into homogeneous groups [12]. The most common technique to define interestingness and importance, as well as homogeneity, can be formalized as follows: i) build lists of relevant keywords or phrases from messages; ii) define matching criteria for messages based on such keywords; iii) exploit suitable clustering and categorization schemes to detect and affix labels to each message. The above observations imply the definition of a knowledge extraction process composed of several steps, necessary to prepare and mine raw messages. The next sections are devoted to the analysis of the two main steps of the process, namely preparation and mining.

4.1  Information preprocessing

The first step in the process of knowledge extraction is data consolidation and preprocessing. In the case of message classification, the information sources are the messages stored by the MTA. For each message we extract and store information concerning Sender, Recipients, Date/Time, Subject, Attachment filename, and Content of the message. From the above fields we can derive a feature vector x = (y, w) that is composed of structured and unstructured information. Structured information (denoted by y) is obtained from the four main fields, and comprises the features listed in Table 1.

Categorical                                                  Numeric
Sender domain (e.g., yahoo.com)                              Message length
Most frequent recipient radix domain (e.g., gov, com, edu)   Nr. of recipients
Weekday                                                      Nr. of messages received from the same sender
Time period (e.g., early morning, afternoon, evening)
Attachment file extension (e.g., jpg, ps, xls)

Table 1: Structured features
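As an indication of how such structured features could be extracted from message headers, here is a hedged sketch based on the JavaMail API used by the AMCo prototype (Appendix A); the feature keys, the coarse time-period encoding, and the omission of most error handling are our own simplifications.

```java
import javax.mail.*;
import javax.mail.internet.InternetAddress;
import java.util.*;

// Illustrative extraction of (part of) the structured features of Table 1.
public class FeatureExtractor {

    static Map<String, Object> structuredFeatures(Message msg) throws MessagingException {
        Map<String, Object> y = new HashMap<>();
        InternetAddress from = (InternetAddress) msg.getFrom()[0];
        String addr = from.getAddress();                        // e.g. "user@yahoo.com"
        y.put("senderDomain", addr.substring(addr.indexOf('@') + 1));
        Address[] rcpts = msg.getAllRecipients();
        y.put("nrRecipients", rcpts == null ? 0 : rcpts.length);
        y.put("messageLength", msg.getSize());                  // size in bytes, -1 if unknown
        Date sent = msg.getSentDate();
        if (sent != null) {
            Calendar cal = Calendar.getInstance();
            cal.setTime(sent);
            y.put("weekday", cal.get(Calendar.DAY_OF_WEEK));
            // Coarse time period: 0 = night, 1 = morning, 2 = afternoon, 3 = evening
            y.put("timePeriod", cal.get(Calendar.HOUR_OF_DAY) / 6);
        }
        return y;
    }
}
```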

Unstructured information (denoted by w) is mainly obtained from the Subject and Content fields. In principle, both fields contain unstructured text, and hence need to be transformed into a more suitable form. Moreover, the two fields contain information that, in principle, can be of different importance (as remarked in [12, sect. 3.4]): usually, the Content field may contain a lot of redundant information that the Subject field does not contain (i.e., it is more likely that the subject contains the keywords of the message). Hence, it is more appropriate to further split the feature vector w into two subvectors w^s and w^c. The two feature vectors are obtained by applying standard information retrieval processing steps [3]. In particular, by considering a set M = {m_1, . . . , m_N} of messages, we apply the steps illustrated in Section 2.

4.2  Knowledge extraction

The main objective of the knowledge extraction phase is the identification of homogeneous groups that shall represent folders in the re-organized mailbox. Practically speaking, we aim at finding a categorization of the feature vectors obtained by the preprocessing phase. Formally, we can state the problem as follows: given a set M = {m_1, . . . , m_N} of messages coming from an MTA, find a suitable partition P = {C_1, . . . , C_k} of M into k groups (where k is a parameter to be determined), such that each group contains a homogeneous subset of messages and has an associated label that characterizes the subset. The statement requires a rigorous definition of the notions of homogeneity and labelling. The notion of homogeneity can be measured by exploiting the feature vectors defined in the previous section. In practice, we define a similarity measure s(x_i, x_j) as a real number, as follows:

    s(x_i, x_j) = α·s1(y_i^n, y_j^n) + η·s2(y_i^c, y_j^c) + γ·s3(w_i, w_j)        (1)

where s1 defines the similarity of the structured parts of the messages composed of numerical features, s2 defines the similarity of the structured parts composed of categorical features, and s3 takes into account the unstructured part of the message. More precisely:

• s1(y_i^n, y_j^n) and s2(y_i^c, y_j^c) are defined in terms of the dissimilarity among objects. For each y, we can separate the categorical features y^c from the numeric features y^n, and define a dissimilarity measure for each of them as follows:

  – d1(y_i^n, y_j^n) is the standard Euclidean distance;
  – d2(y_i^c, y_j^c) = Σ_i δ_i(y_i^c, y_j^c), where δ_i(u, v) is the Dirichlet function over the i-th attribute [9].

  By exploiting the above dissimilarities, we can define s1 and s2 as follows:

      s1(y_i^n, y_j^n) = e^(−d1(y_i^n, y_j^n))        s2(y_i^c, y_j^c) = e^(−d2(y_i^c, y_j^c))

• s3 can be chosen among the similarity measures particularly suitable for documents [25, 3]. In particular, we exploit the cosine similarity measure, defined as follows:

      s3(u, v) = uᵀv / (‖u‖ · ‖v‖)
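A direct transcription of Eq. (1) and of the component measures above might look as follows. This is a sketch under the assumption that the TF-IDF vectors are stored as sparse maps; the per-attribute δ_i of [9] is implemented here as simple matching (one unit of dissimilarity per mismatching attribute), and all names are illustrative rather than taken from the actual implementation.

```java
import java.util.*;

// Illustrative transcription of the combined similarity of Eq. (1).
public class Similarity {

    static double similarity(double[] yn1, double[] yn2,          // numeric features
                             String[] yc1, String[] yc2,          // categorical features
                             Map<String, Double> w1,
                             Map<String, Double> w2,              // TF-IDF vectors
                             double alpha, double eta, double gamma) {
        // d1: standard Euclidean distance on the numeric features
        double d1 = 0.0;
        for (int i = 0; i < yn1.length; i++)
            d1 += (yn1[i] - yn2[i]) * (yn1[i] - yn2[i]);
        d1 = Math.sqrt(d1);
        // d2: one unit of dissimilarity per mismatching categorical attribute
        double d2 = 0.0;
        for (int i = 0; i < yc1.length; i++)
            if (!yc1[i].equals(yc2[i])) d2 += 1.0;
        // s3: cosine similarity on the unstructured part
        double dot = 0.0, n1 = 0.0, n2 = 0.0;
        for (Map.Entry<String, Double> e : w1.entrySet()) {
            n1 += e.getValue() * e.getValue();
            Double v = w2.get(e.getKey());
            if (v != null) dot += e.getValue() * v;
        }
        for (double v : w2.values()) n2 += v * v;
        double s3 = (n1 > 0 && n2 > 0) ? dot / (Math.sqrt(n1) * Math.sqrt(n2)) : 0.0;
        return alpha * Math.exp(-d1) + eta * Math.exp(-d2) + gamma * s3;
    }
}
```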

The values α, η and γ (ranging within the interval [0, 1]) are used to tune the influence of each part of the feature vectors on the overall similarity. The assessment of convenient values for such parameters is particularly important, and is treated in more detail later in this paper.

Concerning labels, we have to find a suitable way of associating a significant label with a given group. The structured part is simple to deal with, since we can associate the mode vector [9] with the categorical part y^c, and the mean vector with the numerical part y^n. The unstructured part requires some attention. In a sense, we are looking for a label reflecting the content of the messages within the cluster, and at the same time capable of distinguishing between different clusters. The most intuitive notion is a list of characteristic terms. More precisely, a term w is called characteristic (w.r.t. a given threshold λ) for a cluster C_j if

1. ψ(w, C_j) ≥ λ, and
2. ψ(w, C_i) < λ, for i ≠ j.

The intuition behind the above definition is the following. First of all, the removal of implicit stopwords, as described in Section 2, ensures that the only candidate representatives are terms with significant frequencies. However, in order to correctly characterize the cluster, a term must be peculiar to that cluster: that is, the term must occur with a high frequency within the cluster, and is likely to occur with a lower frequency in the other clusters [15]. In principle, the above conditions can be difficult to achieve for a given value of λ. To this purpose, a possible simplification is to choose the three most frequent terms in the cluster that do not appear in other clusters; a sketch of the threshold-based variant follows.
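The following sketch implements the threshold-based definition of characteristic terms, assuming that the per-cluster collection frequencies ψ(w, C_j) have already been computed; the data layout and names are assumptions of ours.

```java
import java.util.*;

// Illustrative selection of characteristic terms for cluster j.
public class ClusterLabels {

    // psiPerCluster.get(w)[i] = psi(w, C_i), the collection frequency of
    // term w within cluster C_i; lambda is the characterization threshold.
    static List<String> characteristicTerms(int j,
                                            Map<String, int[]> psiPerCluster,
                                            int lambda) {
        List<String> label = new ArrayList<>();
        for (Map.Entry<String, int[]> e : psiPerCluster.entrySet()) {
            int[] psi = e.getValue();
            if (psi[j] < lambda) continue;            // condition 1 fails
            boolean peculiar = true;
            for (int i = 0; i < psi.length; i++)
                if (i != j && psi[i] >= lambda) {     // condition 2 fails
                    peculiar = false;
                    break;
                }
            if (peculiar) label.add(e.getKey());
        }
        return label;
    }
}
```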

Algorithm Integrate(M, δ)
Input: a small subset M0 = {m_1, . . . , m_P} of messages randomly chosen from the dataset M = {m_1, . . . , m_N}; a stopping threshold δ.
Output: a partition C = {C_1, . . . , C_k} of M.
Method:
  • obtain the feature vectors {x_1, . . . , x_P} from M0;
  • apply the hierarchical algorithm to these feature vectors and compute the concept vectors;
  • supply the K-means algorithm with these concept vectors.

Figure 1: The Agglomerative Hierarchical and K-Means integration algorithm

Faced with the above notions, we are now ready to define a suitable clustering algorithm capable of finding the desired partition. Many different clustering algorithms can be exploited [10]. The nature of the data sets to be mined, however, suggests the adoption of techniques directly adapted from Information Retrieval, such as hierarchical clustering equipped with the cosine measure [3, 24]. Hierarchical methods are widely known to provide clusters of better quality. Moreover, for our purposes it is particularly convenient to adopt a hierarchical approach, since it allows the generation of a hierarchy of labels (and hence of message folders). However, such methods have a serious efficiency drawback: each iteration requires O(N²) comparisons. When N is large, these approaches fail to suitably support online processing, as we instead expect. For this reason we adopt a particularly suitable partitional technique, namely Spherical K-means [6], which has the main advantage of requiring O(N) comparisons while guaranteeing a good quality of clusters. To improve the quality of clustering, we integrate hierarchical agglomeration and iterative relocation by first applying the hierarchical agglomerative algorithm over a small arbitrary subset of P messages, and then refining the results using K-means (Figure 1). In practice, P ≪ N, e.g. 30% or 40% of N. This scheme avoids a problem accompanying the use of a partitional algorithm like K-means, i.e. the choice of the number of desired output clusters; moreover, by the initial execution of the hierarchical algorithm we attempt to select a better starting partition, so that K-means is more likely to find the optimal value of the stopping criterion. A sketch of this integration follows.
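The following is a hedged Java sketch of the Integrate procedure of Figure 1: centroid-based agglomeration over the sample, stopped by the threshold δ, followed by spherical K-means over the whole collection seeded with the resulting concept vectors. It assumes unit-normalized feature vectors; the linkage criterion and all names are our simplifications, not necessarily the exact choices of the original implementation.

```java
import java.util.*;

// Illustrative sketch of the Integrate procedure (Figure 1).
public class Integrate {

    // Cosine similarity; vectors are assumed unit-normalized.
    static double cosine(double[] u, double[] v) {
        double dot = 0.0;
        for (int i = 0; i < u.length; i++) dot += u[i] * v[i];
        return dot;
    }

    // Normalized mean of a group: its "concept vector" [6].
    static double[] centroid(List<double[]> group) {
        double[] c = new double[group.get(0).length];
        for (double[] v : group)
            for (int i = 0; i < c.length; i++) c[i] += v[i];
        double n = 0.0;
        for (double x : c) n += x * x;
        n = Math.sqrt(n);
        if (n > 0) for (int i = 0; i < c.length; i++) c[i] /= n;
        return c;
    }

    // Agglomerate the sample until no pair of groups is more similar than delta.
    static List<List<double[]>> agglomerate(List<double[]> sample, double delta) {
        List<List<double[]>> groups = new ArrayList<>();
        for (double[] v : sample) {
            List<double[]> g = new ArrayList<>();
            g.add(v);
            groups.add(g);
        }
        while (groups.size() > 1) {
            int bi = 0, bj = 1;
            double best = -2.0;
            for (int i = 0; i < groups.size(); i++)
                for (int j = i + 1; j < groups.size(); j++) {
                    double s = cosine(centroid(groups.get(i)), centroid(groups.get(j)));
                    if (s > best) { best = s; bi = i; bj = j; }
                }
            if (best < delta) break;              // stopping threshold reached
            groups.get(bi).addAll(groups.remove(bj));
        }
        return groups;
    }

    // Seed spherical K-means over the full collection with the concept vectors.
    static int[] run(List<double[]> all, List<double[]> sample,
                     double delta, int maxIter) {
        List<double[]> concepts = new ArrayList<>();
        for (List<double[]> g : agglomerate(sample, delta)) concepts.add(centroid(g));
        int k = concepts.size();                  // k comes from the hierarchical phase
        int[] assign = new int[all.size()];
        for (int it = 0; it < maxIter; it++) {
            for (int m = 0; m < all.size(); m++) {            // assignment step
                int arg = 0;
                double best = -2.0;
                for (int c = 0; c < k; c++) {
                    double s = cosine(all.get(m), concepts.get(c));
                    if (s > best) { best = s; arg = c; }
                }
                assign[m] = arg;
            }
            List<List<double[]>> parts = new ArrayList<>();
            for (int c = 0; c < k; c++) parts.add(new ArrayList<>());
            for (int m = 0; m < all.size(); m++) parts.get(assign[m]).add(all.get(m));
            for (int c = 0; c < k; c++)                       // re-estimation step
                if (!parts.get(c).isEmpty()) concepts.set(c, centroid(parts.get(c)));
        }
        return assign;
    }
}
```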

4.3  Experimental results

To assess the quality of the implemented filtering methodology, it is important to understand whether the filtering processes described in the previous sections meet the user's requirements. In particular, it is important to check whether the resulting folders (the results of the clustering algorithms) can be considered of good quality: a comparison with an ideal categorization of messages has to be provided. In a sense, it is particularly easy here to define an ideal partition, by relating it to the contents of the messages. A human expert, in fact, can easily provide a user-defined "optimal" partition, which can be exploited to verify the resulting clusters. To this purpose, our aim is to compare our message categorization framework against some sets of messages previously filtered by a given user on the basis of their content.

To test our technique we performed several experiments with different collections of messages. In this section we show some experiments performed on a data set containing 251 messages related to different topics, such as Decision Trees, Neural Networks, General Classifier, Filtering Instances, Web Classification, etc. The data set contains 93567 different terms, with an average message length of 200 terms per message. The ideal partition we used to measure the quality of the results is composed of 34 clusters. We applied the described filtering methodology to obtain a message reorganization into suitable folders. We performed several experiments, trying different combinations of the Integrate algorithm with the parameters defined in the previous sections:

• explicit stopwords were removed by exploiting a publicly available list of 582 stopwords;

• implicit stopwords were removed by trying different combinations of the l and u parameters; in our experiments, the optimal values turned out to be l = 2% and u = 30–40%;

• the similarity parameters are by far the most important ones. The best results were obtained by giving higher importance to the subject of the message (SW), as somehow expected, and lower importance to the structured part of the message. Significant intervals were α ∈ [0, 0.1], η ∈ [0, 0.2] and γ ∈ [0.7, 1].

The results obtained were evaluated by both measuring the standard precision and recall measures, and comparing the clustering results for different values of the weight β of the standard F-measure (FM in the following) [3, 20].
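For reference, the weighted F-measure used in the comparison can be computed per (computed cluster, ideal class) pair as follows; this is a sketch of the standard definition F_β = (β² + 1)·P·R / (β²·P + R), with the cluster-to-class matching assumed to be given.

```java
// Illustrative computation of the weighted F-measure for one
// (computed cluster, ideal class) pair.
public class Evaluation {

    // truePositives: messages of the ideal class that fall in the cluster.
    static double fMeasure(int truePositives, int clusterSize,
                           int classSize, double beta) {
        if (truePositives == 0) return 0.0;
        double p = (double) truePositives / clusterSize;   // precision
        double r = (double) truePositives / classSize;     // recall
        double b2 = beta * beta;
        return (b2 + 1.0) * p * r / (b2 * p + r);
    }
}
```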

Figure 2: Experimental Results (without structured information)

Figure 2 shows the results of some experiments performed exploiting only unstructured information. Using the hierarchical algorithm over a portion of the original dataset to compute the initial centroids for the K-means algorithm yields a significant gain in accuracy. Moreover, using a partitional algorithm allows a good compromise between accuracy and efficiency.

Figure 3: Experimental Results (with structured information)

Figure 3 shows the results of some experiments that also exploit structured information. In particular, we used the following features: Sender domain (SD), Recipient domain (RD), Message length (ML), Number of recipients (NR), Weekday (WD), and Time period (TP). In the experiments, we can observe that the usage of numerical features provides no significant benefit. Categorical features can influence the result, provided that a suitable weight (e.g., 15–20%) is assigned to such features. More detailed experimental results are reported in [14].

5  Conclusions and Future Work

In this paper we described a technique for the automatic filtering of messages, and the architecture of a system that implements it. Our technique, based on clustering algorithms for data classification, was tested in several experiments, showing a high degree of flexibility, efficiency and effectiveness in the message classification context. The current implementation has to be considered only as a starting point of the general framework of message filtering. Indeed, several issues still have to be addressed.

• The first issue concerns the classification of incoming messages. Currently, the tool works on a large amount of mail messages at once: that is, we assume a model of interaction in which the user periodically re-organizes all of his e-mail messages. This type of interaction does not take into account the problem of updating the model, as described by points 2 and 3 in Section 3. Such problems have to be addressed by combining the proposed approach with supervised learning techniques. In particular, the problem described in point 3 is challenging.

• We only concentrated on text contents. However, the MIME format also allows multimedia contents. For example, a message may contain attachments of different types (such as PDF documents, JPEG images, etc.). It is important to study the impact of such attachments on the message classification issue. Such a study, however, requires the combination of several content analysis and information retrieval techniques.

References

[1] R. Agrawal, R. Bayardo, and R. Srikant. ATHENA: Mining-based interactive management of text databases. In Proc. of the 7th Int. Conf. on Extending Database Technology (EDBT 2000), pages 365–379, 2000.
[2] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos. An Evaluation of Naive Bayesian Anti-Spam Filtering. In Proc. of the Workshop on Machine Learning in the New Information Age, 2000.
[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press Books. Addison Wesley, 1999.
[4] W. W. Cohen. Learning Rules that Classify E-mail. In Proc. of the 1996 AAAI Spring Symposium on Information Access, 1996.
[5] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. In Proc. of the 15th Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1992.
[6] I. Dhillon and D. Modha. Concept Decompositions for Large Sparse Text Data Using Clustering. Machine Learning, 42:143–175, 2001.

[7] H. Drucker, D. Wu, and V. N. Vapnik. Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks, 10(5), 1999.
[8] J. M. G. Hidalgo, M. M. López, and E. P. Sanz. Combining Text and Heuristics for Cost-Sensitive Spam Filtering. In Proc. of the 4th Computational Natural Language Learning Workshop (CoNLL-2000), 2000.
[9] Z. Huang. A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. Data Mining and Knowledge Discovery, 1997.
[10] A. K. Jain, M. N. Murty, and P. J. Flynn. Data Clustering: A Survey. ACM Computing Surveys, 31(3):264–323, 1999.
[11] A. Jennings and H. Higuchi. A Personal News Service Based on a User Model Neural Network. IEICE Transactions on Information and Systems, 1992.
[12] F. Kilander, E. Fahraeus, and J. Palme. Intelligent Information Filtering. Technical report, Department of Computer and Systems Sciences, Stockholm University, 1997. Available at http://www.dsv.su.se/~fk/if_Doc/IntFilter.html.
[13] D. Lewis and W. Gale. A Comparison of Two Learning Algorithms for Text Categorization. In Proc. of the Symposium on Document Analysis and Information Retrieval, 1994.
[14] G. Manco, E. Masciari, and A. Tagarelli. A Framework for Adaptive Mail Classification. Technical report, ISI-CNR, 2002. Available at http://www.isi.cs.cnr.it/isi/manco/AMCo.
[15] M. F. Moens. Automatic Indexing and Abstracting of Document Texts. Kluwer Academic Publishers, 2000.
[16] P. Pantel and D. Lin. SpamCop: A Spam Classification and Organization Program. In Learning for Text Categorization: Papers from the 1998 Workshop, 1998.
[17] J. Provost. Naive-Bayes vs. Rule-Learning in Classification of Email. Technical report, Department of Computer Sciences, University of Texas at Austin, 1999.
[18] J. Rennie. ifile: An Application of Machine Learning to E-Mail Filtering. In Proc. of the KDD-2000 Workshop on Text Mining, Boston, 2000.
[19] P. Resnick et al. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In Proc. of the ACM SIGIR Conference on Information Retrieval, pages 272–281, 1994.
[20] M. Ruffolo and S. Iiritano. Managing the Knowledge Contained in Electronic Documents: a Clustering Method for Text Mining. Technical report, Laboratorio per l'Innovazione dell'Azione Progettuale Ricerca del Piano Telematico Calabria, 2001.
[21] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian Approach to Filtering Junk E-Mail. In Learning for Text Categorization: Papers from the 1998 Workshop, 1998.
[22] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. D. Spyropoulos, and P. Stamatopoulos. Stacking Classifiers for Anti-Spam Filtering of E-Mail. In Proc. of the 6th Conference on Empirical Methods in Natural Language Processing, 2001.
[23] R. B. Segal and J. O. Kephart. MailCat: An Intelligent Assistant for Organizing E-Mail. In Proc. of the 3rd International Conference on Autonomous Agents, 1999.
[24] M. Steinbach, G. Karypis, and V. Kumar. A Comparison of Document Clustering Techniques. In Proc. of the ACM SIGKDD Workshop on Text Mining, 2000.
[25] A. Strehl, J. Ghosh, and R. Mooney. Impact of Similarity Measures on Web-Page Clustering. In Proc. of the AAAI Workshop on Artificial Intelligence for Web Search, pages 58–64, July 2000.
[26] A. Tanenbaum. Computer Networks. Prentice Hall, 1996.
[27] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.

Appendix A: A System for Adaptive Mail Classification

In this section we describe the implementation of the message classification system, called AMCo (Automatic Mail Category Organizer). AMCo was developed in Java and is composed of the modules shown in Figure 4.

Figure 4: The Mail Classification System

The system architecture has the following features involving the various modules:

1. messages stored in the MTA are made available to the Preprocessing Module, which extracts the corresponding feature vectors;
2. suitable clusters are computed by the Clustering Module;
3. messages in the MTA are reorganized into folders according to the clusters computed by the Clustering Module.

Each module is implemented as a separate process, thus allowing a separation of concerns between the computation of the filtering strategies (i.e., cluster definitions) and the presentation of the filtered messages. In particular, the User Interface and the Mining Module are separate from the MTA, thus resulting in a non-invasive service. The interactions between these modules are implemented using the JavaMail 1.2 package and the POP3 and IMAP protocols. The default protocol provided by JavaMail 1.2 is IMAP, so we had to implement a suitable module to also use POP3. We chose POP3 because it is widely used and it avoids interaction with the mail server during the classification phase (see the sketch below).

The User Interface is mainly implemented as a web-based service. The main functionalities are similar to those of a common mail client (e.g., Microsoft Outlook), but with the typical Java Beans "Look & Feel" (Fig. 5). A tree structure is used to organize the folders obtained from the clustering phase. As shown in Fig. 6, each internal node is identified by a cluster identifier (the set of characteristic terms) and contains a leaf for each message, labelled by its Subject. The interaction with the Mining Module (Fig. 7) can be tuned by setting the values of some given parameters, such as the α, η and γ similarity weights, the clustering algorithm, and the structured features.
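As an indication of how the POP3 interaction could look with JavaMail, here is a minimal sketch; the host and credentials are placeholders, error handling is omitted, and it assumes a POP3 provider is registered with the session (as noted above, JavaMail 1.2 required a dedicated module for this).

```java
import javax.mail.*;
import java.util.Properties;

// Illustrative retrieval of the mailbox over POP3 with JavaMail.
public class FetchMailbox {

    public static Message[] fetch(String host, String user, String password)
            throws MessagingException {
        Session session = Session.getInstance(new Properties());
        Store store = session.getStore("pop3");       // assumes a POP3 provider
        store.connect(host, user, password);
        Folder inbox = store.getFolder("INBOX");
        inbox.open(Folder.READ_ONLY);                 // read-only: non-invasive
        return inbox.getMessages();                   // handed to the Preprocessing Module
    }
}
```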

Figure 5: Initial form of AMCo application

Figure 6: Tree view of the folders

Figure 7: Configuration Setting
