2008 Second International Conference on Future Generation Communication and Networking Symposia

A New Approach to Email Classification Using Concept Vector Space Model

Chao Zeng, Junzhong Gu, Zhao Lu
Institute of Computer Applications, East China Normal University, Shanghai, China

Abstract

Email classification methods based on content generally use the Vector Space Model (VSM). The model is constructed from the frequency of each independent word appearing in Email content. Frequency-based VSM does not take the context of a word into account, so the feature vectors cannot accurately represent Email content, which leads to inaccurate classification. This paper presents a new approach to Email classification based on a Concept Vector Space Model built with WordNet. In our approach, we extract high-level category information during training by replacing the terms in the feature vector with WordNet synonym sets and considering the hypernymy-hyponymy relations between those sets. We design an Email classification system based on the concept VSM and carry out a series of experiments. The results show that our approach improves the accuracy of Email classification, especially when the training set is small.

1. Introduction

Email has become an efficient and popular communication mechanism as the number of Internet users increases. However, the existence and spread of unwanted Email interfere with users even as they enjoy its convenience. Email often reflects current hot social issues and public sentiment, but its proliferation also hampers the collation and acquisition of information. If Email can be automatically classified, people can access the content relevant to them accurately and quickly, which greatly improves efficiency and reduces losses in manpower, finance and material resources. Email classification is therefore of great significance and value.

Email classification has become a new academic subject and has drawn attention from many circles in recent years. Commercial software for Email classification emerges continually, while academia has seen an upsurge of research on the topic, and improving classification accuracy continues to motivate researchers. Email filtering technology develops continuously while the means of abuse change constantly. Nowadays anti-spam products usually do not rely on a single technique but on a synthesis of several. The main techniques used by current products include: black lists, white lists, DNS identification, rate control, OCR recognition and analysis, virus scanning, comprehensive reputation systems, rule-based scoring systems, data mining, and so on. In recent years, many researchers have studied Email classification based on data mining techniques such as Bayesian methods, artificial intelligence, text clustering and decision trees.

978-0-7695-3546-3/08 $25.00 © 2008 IEEE DOI 10.1109/FGCNS.2008.7

2. Related work

Typically, Email classification has three main steps: pre-processing, feature selection and classifier construction. Pre-processing is composed of word segmentation, feature representation and feature extraction. Feature representation, according to the level of semantic understanding, can be divided into two categories: the keyword-based expression model, i.e. the Vector Space Model (VSM), and the notional expression model based on sense understanding. Although the VSM does not consider semantic information and loses some word-to-word relationships, it is simpler and easier to handle, and text processing (mainly classification) with it can be more effective than with the latter. The VSM is therefore the most commonly used method.

Existing feature selection methods fall into two categories: filtering methods and wrapper methods. The former treats feature selection as a pre-processing step: it weights the features through a series of rules and then constructs a reduced-dimensional vector space from the first k features ranked by weight; examples are Document Frequency, Mutual Information and the χ² statistic. The flaw of filtering methods is that they treat the feature dimensions as independent of each other, which lowers classification accuracy. Many improvements have been made to Document Frequency and to the TF*IDF weighting proposed by Salton in 1973, such as the probabilistic TF*IDF algorithm proposed by Thorsten Joachims [1] and the TF*IWF*IWF algorithm proposed by Roberto Basili [2]. Wrapper methods treat the classifier as a black box during feature selection. Such methods have been verified to be more effective than filtering methods, but their computation is excessive, especially when the number of features is large, so their practicality is limited.

Classifier construction can be broadly categorized into statistics-based classifiers [3], connection-based classifiers [4] and rule-based classifiers [5]. Naïve Bayes [6], KNN [7] and SVM [8] are statistics-based methods; neural networks are connection-based; decision trees are rule-based. Researchers have verified the validity of these algorithms: among some 14 classification algorithms, including KNN, decision trees, Naïve Bayes and neural networks, classification accuracy is satisfactory after training on a large training set. However, these algorithms share a common problem: they do not consider the semantic relationships between words, which often leads to a high-dimensional vector space and greatly degrades classification performance. Moreover, with a limited training set the category information is too sparse and too low-level, so their accuracy decreases greatly. Simple vector classification suits simple text and low vector dimensions; because Emails are mostly short texts and the vectors are small, it is appropriate for Email classification with respect to both complexity and effectiveness.

After analyzing the present techniques, this paper presents a new approach to feature selection. In our approach, based on WordNet [9], we describe a text Email by establishing a concept vector space model: we first extract high-level category information during training by replacing terms with WordNet synonym sets and considering the hypernymy-hyponymy relations between those sets; secondly, we use the TF*IWF*IWF method to revise the weights of the concept vector; finally, we determine the type of a text Email using a simple vector classification method.
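As a concrete illustration of the two term-weighting schemes mentioned above, the following sketch contrasts TF*IDF with a squared inverse-word-frequency weight in the spirit of TF*IWF*IWF. The exact normalizations used in [1] and [2] differ, so treat this as an approximation, not the cited algorithms themselves.

```python
import math

def tf_idf(term, doc, docs):
    """Classic TF*IDF: term frequency times log inverse document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)      # documents containing the term
    return tf * math.log(len(docs) / df) if df else 0.0

def tf_iwf_iwf(term, doc, docs):
    """TF*IWF*IWF-style weight: the inverse frequency is computed over token
    occurrences in the whole corpus, and the log factor is squared."""
    tf = doc.count(term)
    total = sum(len(d) for d in docs)           # all token occurrences
    occ = sum(d.count(term) for d in docs)      # occurrences of this term
    return tf * math.log(total / occ) ** 2 if occ else 0.0
```

For example, on docs = [["car", "engine", "car"], ["image", "render"]], tf_idf("car", docs[0], docs) is 2·log 2, while tf_iwf_iwf squares the (token-level) log factor, rewarding terms that are rare across the whole corpus more strongly.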

3. Email classification using concept VSM

Email classification consists of two stages: a training stage and a classification stage. In the training stage we train the classifier with type-labeled Emails and obtain the feature vector space of each type; in the classification stage we take an unclassified Email as input and output its type, as shown in Figure 1.

Figure 1. Email classification based on concept VSM (pipeline: incoming Email → pre-processing → concept list formulation → weight revision → classifier → Email folders)
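The pipeline of Figure 1 can be sketched as follows. This is an illustrative skeleton only: the helper bodies are simplified stand-ins (for instance, build_concept_vector merely counts words, whereas the paper's version maps words to WordNet synonym sets), not the authors' implementation.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and"}

def preprocess(text):
    """Word segmentation and stop-word removal (stemming omitted here)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def build_concept_vector(words):
    """Stand-in for concept list formulation (Section 3.1.2): a real version
    maps words to WordNet synonym sets; here we just count the words."""
    return Counter(words)

def cosine(u, v):
    """Similarity of two sparse vectors stored as dicts."""
    dot = sum(n * v.get(k, 0) for k, n in u.items())
    nu = sum(n * n for n in u.values()) ** 0.5
    nv = sum(n * n for n in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def classify_email(raw_email, category_vectors):
    """Assign the Email to the category with the most similar concept vector."""
    vec = build_concept_vector(preprocess(raw_email))
    return max(category_vectors, key=lambda c: cosine(vec, category_vectors[c]))
```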

3.1. Training

The training process is as follows:
1. Pre-processing
2. Concept list formulation
3. Weight revision


3.1.1. Pre-processing. The process of pre-processing is as follows:
1) Segment the words in Email d_j; remove punctuation and high-frequency (stop) words; strip roots and affixes. Represent d_j as {w_1, w_2, ..., w_i, ..., w_k}, and count the number of appearances of w_i in d_j as N(w_ij).
2) Count the number of appearances of w_i in Emails of type C_k as N(w_i : C_k) = Σ_{j=1..n_k} N(w_ij), where C_k = {d_1, d_2, ..., d_j, ..., d_{n_k}}, i.e. there are n_k training Emails of type C_k. Set up a word set Q_L(C_k) to hold the words that appear in Emails of this type.
3) Count the number of appearances of w_i in the whole training set as N(w_i) = Σ_{k=1..n} N(w_i : C_k), where n is the number of types. Set up the word set of the whole training set, denoted Q_J.

3.1.2. Concept List Formulation. The process of concept list formulation is as follows.
Initialization: Q_L'(C_k) = Q_L(C_k); Q_J' = Q_J.
1) If w_i does not appear in WordNet, then Q_L'(C_k) ← Q_L'(C_k) − w_i and Q_J' ← Q_J' − w_i.
2) For word w_i in Q_L'(C_k), get the concept list of w_i by searching WordNet.
3) Save the concept list to VectorId sequentially, and at the same time save N(w_i : C_k) to VectorCValue as the weight of the synonym set of w_i (i = 1, 2, ..., m, where m is the number of words in Q_L'(C_k)) in the concept list, and also as the weight of the synonym set of w_i's direct hyponym in the concept list. If an entry already exists in VectorId, there is no need to save it again.
4) For each w_i (i = 2, 3, ..., m) in Q_L'(C_k), search WordNet to get the concept list of w_i. If the synonym sets of the concept list are already in VectorId, change the corresponding value V_i in VectorCValue, V_i ← V_i + N(w_i : C_k), and change the weight of the direct hyponyms of these synonym sets as well. If not, return to 3).
5) When all the words in Q_L'(C_k) have been processed, the concept vector space of the training set of type C_k has been constructed.
6) Join the VectorId and VectorCValue of each type to form the VectorId and VectorSValue of the whole training set.

3.1.3. Weight Revision. The process of weight revision is as follows:
1) Count the Inverse Document Frequency of V_i:
   IDF(V_i) = [log(N / N(V_i))]^2,
where N is the number of Emails in the whole training set and N(V_i) is the number of Emails in which V_i appears.
2) Count the Inverse Category Frequency of V_i:
   FICF(V_i) = Σ_j (U_ij − V̄_i)^2 / Σ_j U_ij^2,
where U_ij is the weight of V_i in type C_j, the sums run over the types C_k in which V_i appears, and V̄_i is the weight of V_i in the whole training set divided by the number of types.
3) Revise the weight of the concept:
   VectorCValue(V_i) = VectorCValue(V_i) * IDF(V_i) * FICF(V_i)
4) Normalize the VectorCValues:
   VectorCValue(V_i) ← VectorCValue(V_i) / Σ_{i=1..n} (VectorCValue(V_i))^2
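The concept list formulation of Section 3.1.2 can be sketched minimally as follows. Since shipping WordNet itself is out of scope here, a toy word→synset table and a hand-written direct-hyponym table stand in for the WordNet lookups; every entry below is invented for illustration only.

```python
from collections import Counter

# Toy stand-ins for WordNet lookups (invented entries, illustration only):
# SYNSET maps a word to its synonym-set id; HYPONYMS maps a synset id to
# the ids of its direct hyponym synsets.
SYNSET = {"car": "auto.n.01", "automobile": "auto.n.01", "engine": "engine.n.01"}
HYPONYMS = {"auto.n.01": ["coupe.n.01"], "engine.n.01": []}

def concept_vector(word_counts):
    """Build the concept weights for one category: drop words absent from
    the thesaurus (step 1), accumulate N(w_i : C_k) on each word's synonym
    set (steps 3-4), and credit the same weight to its direct hyponyms."""
    vec = Counter()
    for word, n in word_counts.items():
        if word not in SYNSET:
            continue                      # step 1: word not in WordNet
        syn = SYNSET[word]
        vec[syn] += n                     # weight of the synonym set
        for hypo in HYPONYMS.get(syn, ()):
            vec[hypo] += n                # weight of its direct hyponyms
    return vec
```

Note how concept_vector({"car": 2, "automobile": 1, ...}) merges "car" and "automobile" into a single concept, which is exactly the effect that lets the concept VSM generalize from a small training set.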

3.2. Classification

The process of classification is as follows:
1. Pre-process the test Email, construct its concept list, and obtain the VectorId and VectorCValue of the Email.
2. Revise the weights to get the final VectorId and VectorCValue of the Email.
3. Calculate the similarity between the concept vector of the Email, X = (x_1, x_2, ..., x_m), and the concept vector of each type, Y = (y_1, y_2, ..., y_m):
   Fsim(X, Y) = Σ_{k=1..m} (x_k * y_k) / sqrt(Σ_{k=1..m} x_k^2 * Σ_{k=1..m} y_k^2)
4. The test Email belongs to the type corresponding to the largest value of Fsim(X, Y).

4. Experiments

In this section, experiments were performed to gauge many aspects of the proposed Email classification method. We introduce the experimental data set, the evaluation metrics and the experimental results.

4.1. Data Set

The data set is the prerequisite and foundation for Email classification, and also an essential basis for objective evaluation of classification performance. In this paper we use the 20_newsgroups documents as the data set, a standard document collection. The documents are organized into 20 directories; each directory is a category of the newsgroup, and each category generally includes 1,000 articles. We select some categories from 20_newsgroups for the experiments, take one part of the data as the training set and the other as the test set, and run the experiments with the traditional vector space model and with the concept-based vector space model proposed in this paper separately.

4.2. Performance Evaluation

We select three commonly used metrics to evaluate the performance of the classification system: Precision, Recall and the F1 value. Precision is the ratio of the number of correctly classified Emails to the number of Emails assigned to the type; it mainly reflects the classifier's ability to search accurately. Recall is the ratio of the number of correctly classified Emails to the number of Emails that actually belong to the type; it mainly reflects the classifier's coverage. F1 is an evaluation measure that considers both Precision and Recall. With N_actual the number of Emails assigned to the type and N_total the number of Emails belonging to it, we compute:
   P (Precision) = N_correct / N_actual
   R (Recall) = N_correct / N_total
   F1 = 2 * P * R / (P + R)
The average F1 score is used as the evaluation measure in all the following evaluations.

4.3. Experiment Results

We made two experiments: (1) fixing the size of the training set and test set, we compare the performance of the traditional vector space model and the concept vector space model using WordNet proposed in this paper; (2) with different sizes of the training set, we compare the performance of the traditional VSM and the concept VSM.

In the first experiment three categories were selected from 20_newsgroups: alt.atheism, comp.graphics and rec.autos. We selected 300 Emails from each category as the training set (900 in total) and 100 Emails from each category as the test set.

Table 1. Traditional VSM
Type        rec.autos   alt.atheism   comp.graphics
Precision   0.78        0.70          0.73
Recall      0.70        0.70          0.80
F1          0.74        0.70          0.76

Table 2. Concept VSM
Type        rec.autos   alt.atheism   comp.graphics
Precision   0.86        0.93          0.90
Recall      0.82        0.88          0.96
F1          0.84        0.90          0.93

Tables 1 and 2 list the results of the traditional method and the improved method respectively. As can be seen from the two tables, the concept VSM based on WordNet is superior to the traditional vector space model.
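The metrics of Section 4.2, together with the macro-averaged F1 used to summarize each run, can be computed as follows (a direct transcription of the formulas; the variable names are ours, not the paper's).

```python
def precision_recall_f1(n_correct, n_assigned, n_in_type):
    """Per-category metrics from Section 4.2: n_correct Emails were put into
    the category correctly, n_assigned were assigned to it in total, and
    n_in_type actually belong to it."""
    p = n_correct / n_assigned if n_assigned else 0.0
    r = n_correct / n_in_type if n_in_type else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def macro_f1(per_category_counts):
    """Macro-average: the mean of the per-category F1 scores."""
    f1s = [precision_recall_f1(*c)[2] for c in per_category_counts]
    return sum(f1s) / len(f1s)
```

For instance, a category with 70 of its 100 Emails found, out of 90 assigned to it, matches the rec.autos row of Table 1 (P ≈ 0.78, R = 0.70, F1 ≈ 0.74 after rounding).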

Figure 2. Performance comparison with different sizes of training set (classification accuracy of the concept VSM vs. the traditional VSM for training sets of 30, 100, 300, 600, 700 and 900 Emails)

In experiment 2, we selected subsets of 30, 100, 300, 600, 700 and 900 Emails from the training set of experiment 1 and kept the test set the same as in experiment 1. Here the evaluation index is the macro-averaged F1. As Figure 2 shows, the scale of the training set has an impact on performance. As the size of the training set increases, classification accuracy increases for both the traditional VSM and the concept VSM: when the training set is small it is difficult to select good features to represent the Email vector, so the classifier does not perform well, but with more training samples better features can be chosen. The concept VSM performs better than the traditional VSM, especially when the training set is small.

5. Conclusion and future work

This paper presents an approach to feature selection. In our approach, based on WordNet, we describe a text Email by establishing a concept vector space model: we first extract high-level category information during training by replacing terms with WordNet synonym sets and considering the hypernymy-hyponymy relations between those sets; secondly, we use the TF*IWF*IWF method to revise the weights of the concept vector; finally, we determine the type of a text Email using a simple vector classification method. We carried out a series of experiments to compare our approach with the term-based VSM approach. The results show that our approach improves the accuracy of text Email classification, especially when the training set is small. Our future research is to use the concept vector obtained by the proposed method for hierarchical classification. At the same time, we will attempt to further improve classification accuracy and reduce the dimension of the feature vector when the training set is large.

6. References

[1] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of ICML '97, pages 143-151.
[2] R. Basili, A. Moschitti, M. Pazienza. A text classifier based on linguistic processing. In Proceedings of IJCAI-99, Machine Learning for Information Filtering.
[3] M. Aery, S. Chakravarthy. eMailSift: mining-based approaches to Email classification. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004, pages 580-581.
[4] J. Clark, I. Koprinska, J. Poon. A neural network based approach to automated e-mail classification. In Proceedings of the IEEE/WIC International Conference on Web Intelligence, 2003, pages 702-705.
[5] I. Koprinska, F. Trieu, J. Poon, J. Clark. E-mail classification by decision forests. In Proceedings of the 8th Australasian Document Computing Symposium (ADCS), 2003.
[6] T. A. Meyer, B. Whateley. SpamBayes: effective open-source, Bayesian based, Email classification system. In First Conference on Email and Anti-Spam (CEAS), 2004, pages 1-8.
[7] L. Baoli, L. Qin, Y. Shiwen. An adaptive k-nearest neighbor text categorization strategy. ACM Transactions on Asian Language Information Processing (TALIP), 2004.
[8] A. Farrugia. Investigation of Support Vector Machines for Email Classification. 2004.
[9] C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts, 1998.
