ARABIC ALERT E-MAIL DETECTION USING RULE BASED FILTER
Qasem A. Al-Radaideh (1) and Ahmed F. AlEroud (2)
(1) Department of Computer Information Systems, Faculty of Information Technology and Computer Science, Yarmouk University, Irbid 21163, Jordan. {Email: [email protected]}
(2) Department of Information Systems, College of Engineering and Information Technology, University of Maryland, Baltimore County, USA. {Email: [email protected]}
Abstract
This paper evaluates the performance of a rule-based filter for detecting Arabic alert e-mails. Alert e-mails are e-mails related to criminal or terrorist activities, which are of great interest to both security agencies and the public. A set of Arabic e-mails was collected, pre-processed, and normalized. Useful features were extracted from the collected e-mails using categorical proportional difference (CPD) and term frequency variance (TFV) as feature weighting methods for the rule-based filter. The rule-based filter achieved good accuracy: it was able to detect about 85% of the alert e-mails used in the experiments.
Keywords: Data Mining, Threatening E-Mail, Alert E-Mail, Statistical Filters, Rule based Filters, and Arabic Language Processing.
1. INTRODUCTION
E-mail has become the most popular and efficient means of communication for users due to its convenience, immediacy, and low cost. It is used to facilitate contact between individuals and to enhance the productivity of organizations. Commonly, e-mail systems allow end users to manually construct keyword-based rules and filters that automatically direct e-mails into specific folders in order to facilitate efficient retrieval. This includes filtering undesired e-mails (called spam) by deleting them or moving them into a specific folder (called the Junk mail folder). As the popularity of e-mail increased, e-mail management became an important and growing problem for individuals and organizations. As a result, there has been growing interest in building systems that can learn automatically to filter and move e-mails into specific folders. In practice, e-mail filtering is the automatic organization of incoming e-mails into folders, where each folder represents a category to which an e-mail may belong.
Most research in the area of e-mail filtering has focused on classifying e-mails into spam and non-spam messages. However, the massive number of e-mails exchanged daily, and the need for personalization and organizational data analysis, have opened new and critical research areas in the field of e-mail filtering. An example is the filtering of threatening e-mails related to terrorist activities. A threatening e-mail is one that carries information about future criminal activities that may have serious consequences (including the loss of life). The first effort in this area was made by Appavu and Rajaram [1] and Appavu et al. [2], who used machine learning concepts, such as rule based filtering, for detecting and filtering threatening e-mails.
E-mail users may receive alert e-mails sent either by security agencies or by online news providers, where subscribers may request e-mail alerts on breaking news. For instance, several countries that have suffered from terrorist attacks issue instant e-mail warnings, sent by alert systems, about expected future terrorist attacks. The central Europe terrorism news website offers a free terrorism alert news service and sends such news by e-mail. E-mail alerts are received daily by millions of subscribers around the world from online news providers; one-quarter of them address government and politics, followed by crime and international news [3]. Detecting alerts about terrorist attacks embedded in e-mail messages is very important for users and analysts who are interested in this kind of data. In practice, two types of e-mails carry information related to terrorist attacks: alert e-mails, which warn users about attacks expected in the future, and informative e-mails, which inform users about recent terrorist attacks. It has been noticed that alert e-mails are usually written in the future tense. The massive number of e-mails received makes it difficult for users to navigate through the Inbox in order to find alert e-mails. This motivates researchers to develop new automated techniques for detecting such alert e-mails and storing them in a separate folder labeled "alert folder".
Previous studies in the area of e-mail filtering did not take into account e-mails that contain threats and alerts, regardless of the senders of such e-mails. This type of e-mail should be treated with high priority, which requires accurate analysis and filtering, and appropriate mechanisms should be used to notify users about such e-mails. Appavu and Rajaram [1] and Appavu et al. [2] have proposed a rule based technique, built on classification and association, for detecting threatening e-mails by analyzing the text embedded in English e-mails. The basic idea of rule based classifiers is to use a set of rules generated in the training phase of a rule based classification algorithm (e.g. ID3, OneR). Two basic techniques were used for feature extraction from the text: information gain (IG) and term frequency variance (TFV). The proposed method involved verb tense as an important factor and showed that the associative classifier provided competitive detection rates compared to common classification methods.
have combined support vector machine (SVM) and k-nearest neighbor (k-NN) classifiers to provide a hybrid model for spam e-mail filtering. As for alert e-mail filtering, Appavu and Rajaram [1] and Appavu et al. [2] have used machine learning and data mining concepts and proposed two techniques, based on classification and association, for detecting threatening e-mails related to terrorist attacks by analyzing the text embedded in English e-mails. To measure the performance of association rule mining (the Apriori algorithm) in alert e-mail detection, Appavu and Rajaram used a mixture of 1000 informative e-mails related to terrorist attacks that had occurred and 1000 alert e-mails related to terrorist attacks expected to happen in the future. The proposed method involved verb tense as an important factor and showed that the associative classifier provided competitive detection rates compared with common classification methods. In terms of classifier accuracy, Appavu and Rajaram [1] have shown that the decision tree classifier outperformed other classifiers, such as support vector machine and Naïve Bayesian.
To the best of the authors' knowledge, no research has been reported in the literature that captures and analyzes alert e-mails written in the Arabic language. This paper evaluates rule based filtering using the ID3 classifier [4] for Arabic alert e-mails. We believe that this work is the first to measure the performance of a rule based filter on e-mails that carry security threats.
2. RELATED WORK
In recent years, several studies have successfully addressed the issue of spam filtering with the aid of machine learning methods such as Naïve Bayes [5], Support Vector Machine [6], and rule based filters. A comparative study was performed by Youn and McLeod [6] to measure the performance of different e-mail filters. The study was performed using four filtering techniques: Decision tree, Naïve Bayesian, Support Vector Machine (SVM), and Artificial Neural Networks (ANN).
3. THE METHODOLOGY
The methodology adopted in this research to evaluate the filter is divided into four phases. The first phase is e-mail text pre-processing, which involves tokenization, the removal of Arabic stop words and special characters, and the normalization of letters by replacing some character shapes with a unified letter. The second phase is feature selection; in this phase two feature selection measures were used: the Categorical Proportional Difference (CPD) measure proposed by Simeon and Robert [15] and the Term Frequency Variance (TFV) measure, which was used by Appavu and Rajaram [1] and Appavu et al. [2].
The main problem with rule based filters is the difficulty of keeping them up to date with the ability of spammers to produce different variations of the same spam message. This problem has led several researchers to employ statistical approaches in the e-mail filtering process; an example is the Bayesian statistical filter proposed by Gajewski [7]. Sheu [8], on the other hand, proposed a mathematical modelling process in order to demonstrate the importance of using message content in filtering.
The third phase is the filter construction phase, which involves building the rule based filter using the ID3 algorithm. The evaluation of the filter is performed in the fourth phase, which involves evaluating the performance of the rule based filter for alert e-mail detection. The four phases are explained in detail in the next sections.
Several other studies have proposed approaches for spam e-mail filtering. Chen et al. [9] have evaluated and compared three novel naïve Bayesian filters with other common filters for e-mail filtering. In the same direction, Nazir et al. [10] have used a two-pass statistical approach for automatic personalized spam filtering, whereas Hershkop and Stolfo [11] have used the decision forest for e-mail spam filtering. Chiemeke and Longe [12] have used a decision tree based model to find a set of association rules that could identify spam e-mails. Liu et al. [13] proposed a hybrid approach for e-mail filtering. The hybrid filter combined the N-Gram filter, the Naïve Bayesian filter, the content based Naïve Bayesian filter, the term frequency-inverse document frequency (TF-IDF) filter, and the limited N-Gram based filter.
Phase 1: E-mail Text Pre-processing
This phase involves three main steps:
Aris and Georgios [5] have proposed a stacking filter model for spam filtering and Blanzieri and Anton [14]
1. Tokenization: Before an e-mail text message can be processed, it is first split into units called tokens; this process is called tokenization. As the text is read, it is split into tokens (words) using a blank space or another chosen character as a delimiter.
2. Stop Words Removal: The next step of e-mail text pre-processing is the removal of Arabic stop words, that is, words that carry little meaning and usually appear frequently in text documents. Stop words have only a syntactic role in text documents and usually have no relation to the subject of the text, as in the words "or" (أو), "whose" (لمن), "on" (على), "where" (حيث), "in" (في), "from" (من), "beyond" (غير), and "all" (كل). Removing stop words leads to shorter documents, which results in more efficient processing and improves the efficiency of the term indexing process [16].
3.
other categories. When the value of CPD is 1, the word occurs in the documents of one category and does not occur in the other. The maximum ratio associated with a category Ci for a word w is the final value of CPD for that word; this is given in the next formula.
$$\mathrm{CPD}(w) = \max_{i} \{ \mathrm{CPD}(w, c_i) \}$$
The second feature selection method used is the term frequency variance (TFV), a category dependent feature selection technique. TFV was used by Appavu et al. [2] for alert e-mail detection and has shown competitive performance when compared with information gain (IG). TFV is an indication of the importance of a term in each category: higher values indicate that the term t occurs predominantly in one category rather than in the others, which means that the term t is more important and could be useful for filtering purposes.
Text Normalization: This step replaces variants of a character's shape, such as (أ, إ, آ) and (ه, ة), with a unified character. It also removes numbers and special characters such as "1", "2", "@", "[", "(", "?", and "/".
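The three pre-processing steps can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the stop-word set contains only the examples quoted above, the choice of bare alef and heh as the unified characters is an assumption (the text does not say which shape is kept), and the function names are hypothetical.

```python
import re

# Stop words quoted as examples in the text (a real filter would use a fuller list).
STOP_WORDS = {"أو", "لمن", "على", "حيث", "في", "من", "غير", "كل"}

ALEF_VARIANTS = re.compile("[أإآ]")        # alef shapes unified to bare alef (assumed target)
NOT_A_LETTER = re.compile(r"[^\w]|[\d_]")   # digits, punctuation and other symbols


def normalize(token: str) -> str:
    """Unify character variants and strip numbers and special characters."""
    token = ALEF_VARIANTS.sub("ا", token)
    token = token.replace("ة", "ه")         # teh marbuta unified with heh (assumed target)
    return NOT_A_LETTER.sub("", token)


def preprocess(text: str) -> list[str]:
    """Tokenize on whitespace, drop stop words, then normalize each token."""
    tokens = text.split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    normalized = [normalize(t) for t in kept]
    return [t for t in normalized if t]      # discard tokens emptied by normalization
```

The steps run in the order listed above; a practical variant might normalize before matching stop words so that punctuation attached to a stop word does not hide it.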
Phase 2: Feature Selection
To compute the TFV value, the term frequency tf in both categories is initially found followed by finding the mean value of the term occurrence in both categories (mean_tf). The variation between the term frequency in each category and the mean of term frequency in all categories was evaluated and squared. Finally, the sum of this variation represents the value of TFV for that term. Formula 3 is used to find the TFV for each term. In the evaluation phase, the values for both TFV and CPD need to be normalized first since the range of values of TFV is greater than those of the CPD.
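The TFV computation described above can be sketched as follows. The function names are hypothetical, and dividing by 100 for normalization is an assumption taken from the footnote of the feature table (normalized TFV = TFV/100); the input is simply a list with one term frequency per category.

```python
def tfv(term_frequencies):
    """Term frequency variance: sum of squared deviations of the per-category
    term frequency from its mean over all categories."""
    mean_tf = sum(term_frequencies) / len(term_frequencies)
    return sum((tf - mean_tf) ** 2 for tf in term_frequencies)


def normalized_tfv(term_frequencies, scale=100.0):
    """TFV divided by a constant; scale=100 is an assumption taken from the table footnote."""
    return tfv(term_frequencies) / scale


# Hypothetical term occurring 20 times in alert and 10 times in informative e-mails.
print(tfv([20, 10]))             # (20 - 15)^2 + (10 - 15)^2
print(normalized_tfv([20, 10]))
```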
The next step involves finding relevant keywords (tokens) that could help in identifying alert e-mails. In Arabic, it has been noticed that alert e-mails are written in the future tense; therefore, the most common verb prefixes and particles used in Arabic to send terrorist alerts are the tokens "will" (سـ), "may" (قد), "will" (سوف) and "with" (بـ). On the other hand, informative e-mails are commonly written using simple present or simple past action verbs such as "happened" (وقع), "shocked" (هز), "hit" (ضرب), "kill" (قتل), "targeted" (استهدف) and "burst" (انفجر). Common examples of nouns that also indicate that an incoming e-mail is informative are "explosion" (تفجير), "murdering" (اغتيال), "killed" (مقتل), "death" (مصرع), "target" (استهداف) and "hijack" (اختطاف). There are also several words that may appear in either an alert or an informative e-mail. Some action verbs and gerunds such as "attack" (هجوم), "target" (استهداف), "bomb" (مفخخة), "car" (سيارة) and "assault" (اعتداء) could also appear in an Arabic alert e-mail message.
$$\mathrm{TFV}(t) = \sum_{i=1}^{k} \left[ tf(t, c_i) - mean\_tf(t) \right]^2 \tag{3}$$
The set of documents and the selected features are represented using the term document matrix format [17], where the intersection of a row and a column takes the value of (either 1 or 0) to indicate the existence (1) or absence (0) of a particular term in a selected document. An example of this matrix is found in the evaluation section of this paper (Section 4, Table 1).
For the purpose of feature selection, two methods were used; the first is the categorical proportional difference (CPD) [15], which measures the degree to which a word w contributes to differentiating a particular category from other categories. The CPD value for a word w in a category Ci is calculated using the next formula where an e-mail E belongs to one of two categories (C1, C2).
$$\mathrm{CPD}(w, c) = \frac{A - I}{A + I}$$
Where A is the number of e-mails where the word w and the category C1 occurs together and I is the number of e-mails where word w and category C2 occurs together. The total Number of e-mails in which the word w occurs is N = A + I.
Phase 3: Filter Construction
The third phase of the process is the filters construction. This phase involves building the rule based filter using the ID3 algorithm [4]. The filter is discussed in details in the next section.
Building the Rule Based Filter
The evaluated filter is the rule based alert e-mail filter, whose rules are constructed using the ID3 algorithm [4]. ID3 is a well-known decision tree algorithm that uses the information gain measure to select the best feature and then creates a branch for each known value of that feature; recursive partitioning is then applied to each partition.
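The split criterion that ID3 applies at each node can be illustrated with a short Python sketch. It shows only the information gain computation and the choice of the best feature, not the recursive tree construction, and it is not the KNIME implementation used in the experiments; all function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by partitioning the e-mails on one feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder


def best_feature(columns, labels):
    """Name of the feature with the highest information gain (the ID3 split choice)."""
    return max(columns, key=lambda name: information_gain(columns[name], labels))
```

Here `columns` maps each term to its 0/1 column of the term-document matrix, and `labels` holds the class of each e-mail.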
The possible values for CPD are within the interval (−1, 1], where as the value of CPD increases, it contributes to differentiating a particular category from
Different numbers of features were used with the two feature selection techniques (CPD and TFV). Varying the number of features across the experiments made it possible to measure its effect on the performance of the rule based filter. The accuracy of the rules in the training sample is selected as a rule-quality measure.
e-mail corpus that could be used for alert e-mail filtering. Alert e-mail detection is a relatively new research direction in e-mail filtering. Appavu and Rajaram (2008) have also mentioned that the data collection step was a major challenge in their work and caused some bias and noisy results. In their research, the two datasets used for the experiments were constructed in a brainstorming session.
Phase 4: Filter Evaluation
To measure the effectiveness of the two filters, several metrics were used. The metrics include: accuracy, true positives, false positives, and false negative rates. These metrics were also used to find the recall and precision rates for each filter. The F-measure was also used to find the best combination between precision and recall.
In this research, to build the Arabic alert e-mail corpus, the e-mails were collected from news websites. These websites, such as the Al-Jazeera website (www.aljazeera.net) and the BBC Arabic news website (http://news.bbc.co.uk/hi/arabic/news), are a medium for terrorist groups to spread alerts. The corpus consists of 1500 e-mails: 825 of them were alerts, representing about 55% of the collected e-mails, and 675 were informative e-mails, representing about 45% of the collected e-mails.
The true positive (TP) rate is measured by the number of actual alert e-mails that are classified as alert e-mails. The true negative (TN) rate is measured by the number of actual informative e-mails that are classified as informative e-mails. The false negative (FN) rate is also measured as the number of the actual alert e-mails that are incorrectly classified as informative e-mails. The false positive (FP) rate is measured as the number of informative e-mails that were classified as alert e-mails.
Data Preparation for the Rule based Filter
The e-mail corpus was processed in order to generate the term-document matrix. A sample of the term-document matrix is shown in Table 1, where every cell indicates the existence (1) or absence (0) of a term in an e-mail text. The terms are found in the header of the matrix along with their English translation.
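Generating such a binary term-document matrix from tokenized e-mails is straightforward; the sketch below is illustrative only. The function name, the feature list, and the two tiny e-mails are hypothetical and are not taken from the corpus.

```python
def term_document_matrix(emails, features):
    """One row per e-mail, one column per feature: 1 if the term occurs, else 0."""
    matrix = []
    for tokens in emails:
        present = set(tokens)
        matrix.append([1 if term in present else 0 for term in features])
    return matrix


# Hypothetical tokenized e-mails and feature list, for illustration only.
features = ["تستهدف", "فجر", "انتحارية"]
emails = [["تستهدف", "السفارة"], ["فجر", "انتحارية"]]
for row in term_document_matrix(emails, features):
    print(row)
```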
The accuracy is the ratio of correctly filtered e-mails to the total number of e-mails used in the evaluation phase. The accuracy is evaluated using the next formula.
$$\mathrm{Accuracy} = \frac{\mathrm{Number}(\mathrm{Alert} \rightarrow \mathrm{Alert}) + \mathrm{Number}(\mathrm{Informative} \rightarrow \mathrm{Informative})}{\mathrm{Number}(\mathrm{Alert}) + \mathrm{Number}(\mathrm{Informative})}$$
For the experiments, the corpus was divided into two folds using the 80/20 rule: 80% of the collected e-mails (about 1200 e-mails) were used in the training phase, and 20% of the collected e-mails (about 300 e-mails) were used in the evaluation phase. In each fold, about 55% of the e-mails were alerts and about 45% were informative e-mails.
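One way to obtain an 80/20 split that keeps the class proportions in both folds is a stratified split, sketched below. This is an assumption about how such a split could be produced, not a description of the procedure actually followed; the function name, the seed, and the placeholder e-mail identifiers are hypothetical, and an exactly stratified split will not necessarily reproduce the per-class counts of the evaluation fold reported in the experiments.

```python
import random


def stratified_split(emails, labels, train_fraction=0.8, seed=0):
    """Split (e-mail, label) pairs so each class keeps the same share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for email, label in zip(emails, labels):
        by_class.setdefault(label, []).append((email, label))
    train, test = [], []
    for items in by_class.values():
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test


# Placeholder identifiers standing in for the 825 alert and 675 informative e-mails.
emails = list(range(1500))
labels = ["alert"] * 825 + ["informative"] * 675
train, test = stratified_split(emails, labels)
print(len(train), len(test))
```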
Recall (R) is the percentage of alert e-mails that have been filtered as alerts out of the overall number of alert e-mails in the whole testing collection. The precision measure (P), the percentage of e-mails filtered as alerts that are actually alerts, is also used as a quality measure to evaluate the filter. The F-Measure (F) is the harmonic mean of precision and recall, which aims to find the best combination of precision and recall.
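The four measures can be computed directly from the confusion counts, as in the sketch below. The standard definitions of precision and recall are assumed, with "alert" as the positive class; the function name is hypothetical, and the example counts are those reported in Table 4 for CPD with 15 features.

```python
def evaluate(alert_as_alert, info_as_info, alert_as_info, info_as_alert):
    """Accuracy, precision, recall and F-measure with 'alert' as the positive class."""
    tp, tn, fn, fp = alert_as_alert, info_as_info, alert_as_info, info_as_alert
    accuracy = (tp + tn) / (tp + tn + fn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure


# Counts reported for CPD with 15 features in Table 4.
acc, p, r, f = evaluate(118, 122, 42, 18)
print(f"accuracy={acc:.2f} precision={p:.2f} recall={r:.2f} F={f:.2f}")
```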
Comparing TFV and CPD
As mentioned earlier, the term frequency variance (TFV) and the categorical proportional difference (CPD) are category dependent. TFV was used by Appavu et al. [2] for alert e-mail detection, while CPD was proposed by Simeon and Robert [15].
4. EXPERIMENTS AND RESULTS DISCUSSION
Data Collection
Collecting Arabic alert e-mails was a major issue in this research because there was no standard Arabic alert
Table 1: Sample of Term-Document Matrix.
E-Mail # | لاستهداف (To Target) | رعاياها (National) | تستهدف (Target) | يهدد (Threatening) | تهديدات (Threats) | يشن (Wage) | فجر (Dawn) | انتحارية (Suicide) | بقتل (Killing) | Alert
0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1
1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1
2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1
3 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1
4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0
6 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0
The major problem of the TFV feature weighting scheme is its inability to single out a term that exists only in the documents of one category. Table 2 shows that the TFV method assigned the weight 0.5 to both terms T1 and T2, even though T2 does not occur in the documents of the informative category; CPD, on the other hand, assigns a higher weight to T2 because of its absence from the informative category. The second example shows a variation between the two weighting schemes: TFV ranks term T3 as more important than term T4, whereas CPD ranks T4 as more important than T3.
the e-mail client application, followed by evaluating the set of rules on a total of 300 e-mails: 160 were alert e-mails and the remaining 140 were informative e-mails. The e-mail client application implementation was modified to include automatic calculation of the performance measures: accuracy, precision, recall, and F-measure.
Table 3: The Best Textual E-Mail Features Using CPD and TFV.
Table 2: The Problems of TFV.
Term | Alert | Informative | TFV (Normalized) | CPD
Term frequency T1 | 20 | 10 | (50/100) = 0.5 | 0.66
Term frequency T2 | 10 | 0 | (50/100) = 0.5 | 1
Term frequency T3 | 10 | 16 | (18/100) = 0.18 | 0.61
Term frequency T4 | 2 | 6 | (8/100) = 0.08 | 0.75
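The contrast between the two weightings can be reproduced from the frequencies in Table 2 with a small sketch. It assumes the CPD definition (A - I)/(A + I) maximized over the two categories and treats the table's frequencies as the counts A and I; the function names are hypothetical. The normalized TFV values it prints match the table, while the absolute CPD numbers follow the formula as stated and need not coincide with the CPD column; only the orderings discussed in the text are illustrated.

```python
def cpd(alert_count, informative_count):
    """CPD(w) = max over both categories of (A - I) / (A + I)."""
    total = alert_count + informative_count
    return abs(alert_count - informative_count) / total


def tfv(alert_count, informative_count):
    """Sum of squared deviations from the mean term frequency."""
    mean_tf = (alert_count + informative_count) / 2
    return (alert_count - mean_tf) ** 2 + (informative_count - mean_tf) ** 2


# (alert, informative) frequencies of the four terms in Table 2.
terms = {"T1": (20, 10), "T2": (10, 0), "T3": (10, 16), "T4": (2, 6)}
for name, (a, i) in terms.items():
    print(name, "TFV/100 =", tfv(a, i) / 100, "CPD =", round(cpd(a, i), 2))
```

T1 and T2 receive the same TFV although only T2 is exclusive to one category, and T3 outranks T4 under TFV while the order is reversed under CPD.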
This variation can be explained by the fact that TFV is more sensitive to the deviation of the term frequency in each category from the mean frequency of that term over all categories. To show this variation between the two feature weighting schemes, we extracted the 38 most important features using both CPD and TFV; the order of the features is assigned using CPD, which was used as our reference, and TFV is used for comparison. Table 3 shows the best 38 features selected using CPD; the TFV value is also included with each feature. The TFV was normalized to be consistent with the CPD values. The CPD values range from 0 to 1, where a higher value is better. For TFV, a higher value means a larger variation, which improves the discriminating ability of the feature. When the features were ordered according to their CPD and then re-ordered using TFV, the variation between the CPD and TFV rankings was very clear: about 92% of the features have a different position in the two feature weighting schemes. This variation in feature importance leads to some differences in the performance of the two feature selection techniques, mainly for the rule based filter; this variation is discussed in the next section.
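One simple way to quantify how far two rankings disagree is to count the features that occupy a different position in the two orderings. The text does not state how the reported percentage was computed, so the sketch below is only one plausible reading; the function name and the four example weights are hypothetical.

```python
def rank_disagreement(scores_a, scores_b):
    """Fraction of features whose rank position differs between two weightings."""
    order_a = sorted(scores_a, key=scores_a.get, reverse=True)
    order_b = sorted(scores_b, key=scores_b.get, reverse=True)
    differing = sum(1 for x, y in zip(order_a, order_b) if x != y)
    return differing / len(order_a)


# Hypothetical weights for four features under two schemes.
cpd_scores = {"t1": 1.0, "t2": 0.9, "t3": 0.8, "t4": 0.5}
tfv_scores = {"t1": 3.0, "t2": 0.7, "t3": 1.2, "t4": 0.1}
print(rank_disagreement(cpd_scores, tfv_scores))
```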
Term | CPD | TFV*
بضرب | 1 | 3.38
انتحاري | 1 | 2.88
تستهدف | 0.94 | .2.2
مفخخة | 0.85 | 22..
تغتال | 0.72 | 0.77
بانفجار | 0.83 | 1.28
معلومات | 0.83 | 0.72
رعاياها | 0.73 | 220
بقتل | 0.79 | 22..
اغتيال | 0.72 | .222
بسيارة | 0.50 | 3.12
باستهداف | 0.92 | .2.2
قد | 0.50 | 02.2
تفجيرات | 0.92 | .222
اعتداء | 0.86 | 22..
هجمات | 0.76 | .222
انتحارية | 0.93 | 220
السفارة | 0.76 | 2
مقتل | 0.95 | 22..
خطف | 0.56 | 22.2
بجروح | 0.82 | 22..
تنظيم | 0.77 | 2.20.
القاعدة | 0.56 | 16.60
يشن | 0.78 | 22.2
مصرع | 0.82 | 22.2
تفجير | 0.86 | 0.77
استهدف | 222. | 0.5
ٌانبيا | 22.2 | 1.80
مصادر | 0.59 | 0.32
شبكة | 0.86 | 0.32
هجوم | 222. | 2
عناصر | 0.58 | 0.40
سلسلة | 0.65 | 0.98
انفجارات | 0... | 2200
هز | 0.82 | 0.66
ضد | 22.2 | 4.20
هدد | 0.86 | 0.98
جماعة | 0.81 | 0.86
شبكة | 0.86 | 0.97
* This is the normalized TFV (i.e., TFV/100)
The results of the first experiment performed on the ID3 filter, in which different numbers of features were used, are shown in Table 4. The first finding was that the number of features used has a clear effect on filtering performance for both feature selection techniques. When the feature size was increased from 10 to 15, the filtering accuracy rose from 0.69 to 0.70 for TFV and from 0.75 to 0.80 for CPD. The F-measure improved by approximately 2% for TFV and by 5% for CPD.
Rule-Based Filter Evaluation
To evaluate the rule based e-mail filter, the ID3 implementation in the KNIME information miner tool [18] was used. For the rule generator settings (shown in Table 3), neither average split points nor binary nominal splits were chosen, and the quality measure used was the gain ratio. No pruning was used. To address the feature order variation problem mentioned in the previous section, different numbers of features were used to measure the performance of the ID3 filter in the e-mail client application [19]. From Table 3, the best 10, 15, and 20 features using CPD and TFV, respectively, were selected and used in the rule generation step. The set of generated rules was exported manually to
The second finding was that CPD outperformed TFV in terms of accuracy and detection rates. To compare the performance of TFV and CPD, a feature size of 15 was used as the reference value, due to its ability to achieve the best results in both TFV and CPD.
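Table 4: ID3 Experimental Results Using Different Feature Size.
FSM | F-Size | # E-Mails | #A | #I | A-A | I-I | A-I | I-A | A% | P% | R% | F%
TFV | 10 | 300 | 160 | 140 | 103 | 104 | 57 | 36 | 0.69 | 0.74 | 0.64 | 0.69
TFV | 15 | 300 | 160 | 140 | 107 | 104 | 53 | 36 | 0.70 | 0.75 | 0.67 | 0.71
TFV | 20 | 300 | 160 | 140 | 107 | 107 | 53 | 33 | 0.71 | 0.76 | 0.67 | 0.71
CPD | 10 | 300 | 160 | 140 | 114 | 110 | 46 | 30 | 0.75 | 0.79 | 0.71 | 0.75
CPD | 15 | 300 | 160 | 140 | 118 | 122 | 42 | 18 | 0.80 | 0.87 | 0.74 | 0.80
CPD | 20 | 300 | 160 | 140 | 112 | 116 | 48 | 24 | 0.76 | 0.82 | 0.70 | 0.76
Note: FSM = Feature Selection Method, F-Size = Feature Set Size, A = Alert e-mail, I = Informative e-mail, A-A = alert e-mails filtered as alert, I-I = informative e-mails filtered as informative, A-I = alert e-mails filtered as informative, I-A = informative e-mails filtered as alert, A% = Accuracy, P% = Precision, R% = Recall, F% = F-Measure (the last four are reported as fractions of 1).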
The accuracy achieved by CPD using 15 features was 80%, which outperformed the accuracy achieved by TFV, which is about 70%. The F-measure indicates that CPD also outperformed TFV by about 9%. Such results can be explained by the ability of CPD to give higher weights to terms that exist only in the documents of one category and not in the others, a mechanism that is entirely absent from the term frequency variance (TFV) feature selection method. It was clear from these results that CPD with the reference feature size of 15 achieved competitive results. The best 15 features extracted using CPD were used to build the decision tree used in the testing phase. The tree was built by the KNIME information miner tool, in which the gain ratio measure was used to select the best features upon which the splitting points were determined. With each rule, its accuracy value was used to indicate the rule strength. The set of generated rules extracted using the
decision tree learner was transformed into a set of If-then rules. Table 5 shows the set of generated rules and the accuracy of each rule within the training samples. Each rule antecedent indicates a set of terms; those are the features that decide whether the e-mail is alert or informative. If the rule antecedent is triggered, the rule is then applied to filter the e-mail of interest. The rule consequent which has the class label ("yes") indicates that the e-mail carries alert terms; such e-mails are then transferred directly into the alert e-mails folder. On the other hand, if the e-mail body text does not contain such terms and most of its text is in the past tense, it will be directly filtered as informative e-mail and hence, transferred into the informative e-mails folder.
Table 5: Sample of the Decision tree rules using feature size of 20.
R# 1
Rule Antecedent: IF "بضرب" = Yes & "انتحاري" = No
English Translation: IF "will strike" = Yes & "suicide" = No
Rule Consequent: Alert = Yes
Rule Accuracy: 100%
R# 2
Rule Antecedent: IF "بضرب" = No & "تستهدف" = Yes & "هجمات" = No
English Translation: IF "will strike" = No & "target" = Yes & "attacks" = No
Rule Consequent: Alert = No
Rule Accuracy: 95.5%
R# 3
Rule Antecedent: IF "بضرب" = No & "تستهدف" = No & "رعاياها" = No & "بقتل" = No & "تغتال" = No & "معلومات" = Yes & "انفجار" = No
English Translation: IF "will strike" = No & "target" = No & "nationals" = No & "kill" = No & "assassinate" = No & "Information" = Yes & "explosion" = No
Rule Consequent: Alert = Yes
Rule Accuracy: 100%
R# 4
Rule Antecedent: IF "بضرب" = No & "تستهدف" = No & "رعاياها" = No & "بقتل" = No & "تغتال" = No & "معلومات" = Yes & "انفجار" = Yes
English Translation: IF "will strike" = No & "target" = No & "nationals" = No & "will kill" = No & "assassinate" = No & "Information" = Yes & "explosion" = Yes
Rule Consequent: Alert = No
Rule Accuracy: 100%
R# 5
Rule Antecedent: IF "بضرب" = No & "تستهدف" = No & "رعاياها" = No & "بقتل" = No & "تغتال" = No & "معلومات" = No
English Translation: IF "will strike" = No & "target" = No & "nationals" = No & "will kill" = No & "assassinate" = No & "Information" = No
Rule Consequent: Alert = No
Rule Accuracy: 74%
R# 6
Rule Antecedent: IF "بضرب" = No & "تستهدف" = No & "رعاياها" = No & "بقتل" = Yes & "فجر" = No & "ضرب" = No
English Translation: IF "will strike" = No & "target" = No & "nationals" = No & "will kill" = Yes & "Dawn" = No & "hit" = No
Rule Consequent: Alert = Yes
Rule Accuracy: 79%
R# 7
Rule Antecedent: IF "بضرب" = No & "تستهدف" = Yes & "رعاياها" = Yes
English Translation: IF "will strike" = No & "target" = Yes & "nationals" = Yes
Rule Consequent: Alert = Yes
Rule Accuracy: 100%
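Applying If-then rules of this form to an incoming e-mail can be sketched as follows. Only rules 1, 3 and 4 of the sample are encoded; the Arabic spellings of the terms are assumed readings of the table, the function and constant names are hypothetical, and falling back to "informative" when no rule fires is an assumption rather than documented behaviour of the e-mail client application.

```python
# Terms as read from the rule table; the Arabic spellings are assumed readings.
STRIKE, SUICIDE, TARGET, NATIONALS = "بضرب", "انتحاري", "تستهدف", "رعاياها"
KILL, ASSASSINATE, INFORMATION, EXPLOSION = "بقتل", "تغتال", "معلومات", "انفجار"

NONE_OF = {STRIKE: False, TARGET: False, NATIONALS: False, KILL: False, ASSASSINATE: False}

# (antecedent, is_alert) pairs for rules 1, 3 and 4 of the sample.
RULES = [
    ({STRIKE: True, SUICIDE: False}, True),
    ({**NONE_OF, INFORMATION: True, EXPLOSION: False}, True),
    ({**NONE_OF, INFORMATION: True, EXPLOSION: True}, False),
]


def classify(tokens, rules=RULES, default=False):
    """Return the consequent of the first rule whose antecedent holds.

    Falling back to `default` (informative) when no rule fires is an assumption.
    """
    present = set(tokens)
    for antecedent, is_alert in rules:
        if all((term in present) == required for term, required in antecedent.items()):
            return is_alert
    return default
```

An e-mail for which `classify` returns True would be moved to the alert folder, and any other e-mail to the informative folder.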
5. CONCLUSION AND FUTURE WORK
In this paper, the rule based filter for alert e-mail detection in the Arabic language has been evaluated using two feature selection methods. Two Arabic e-mail corpora were collected and pre-processed. The two corpora consisted of 1500 e-mails, of which about 55% were alert e-mails and the rest were informative e-mails. The first feature selection method was the categorical proportional difference (CPD) and the second was the term frequency variance (TFV); both are category dependent. Finally, a set of decision rules was generated using TFV and CPD.
CPD achieved better accuracy than TFV, especially when the best 15 features were used. The F-measure achieved by the rules generated using CPD was about 80%, compared with only 71% when using TFV. The better performance of CPD over TFV can be explained by the inability of TFV to handle the problem of giving the same weight to terms that have the same variance value without taking into account the absence of such terms from the documents of one category. Two experiments were performed on the ID3 filter using the e-mail client application and CPD as a feature weighting method. As this research is the first attempt to build a filter for Arabic alert e-mails, several other issues can be addressed in the future, such as studying the effects of stemming on the performance of the filter. A second issue is to use the boosted decision tree implementation in the e-mail client application in order to re-generate the set of decision rules periodically.
REFERENCES
[1] Appavu S. and Rajaram R., (2007), "Association rule mining for Suspicious Email Detection: A Data mining approach", Proceedings of the IEEE International Conference on Intelligence and Security Informatics, New Jersey, US, pp. 316-323.
[2] Appavu S., Rajaram R., Muthupandian M. and Athippan G., (2008), "Automatic mining of Threatening e-mail using Ad Infinitum algorithm", International Journal of Information Technology, Vol. 14, No. 2, pp. 81-108.
[3] Imothy B. and Jessica S., (2007), "When the Inbox Breaks: An Exploratory Analysis of Online Network Breaking News E-mail Alerts", Journal of Electronic News, Vol. 1, No. 4, pp. 197-210.
[4] Quinlan R., (1986), "Induction of Decision Trees", Machine Learning, Vol. 1, No. 1, pp. 81-106.
[5] Aris K. and Georgios P., (2008), "Adaptive Spam Filtering Using Only Naive Bayes Text Classifiers", CEAS Fifth Conference on Email and AntiSpam, Mountain View, California, USA, pp. 355-362.
[6] Youn S. and McLeod D., (2007), "A Comparative Study for Email Classification", in Elleithy K., editor, Advances and Innovations in Systems, Computer Sciences and Software Engineering, Berlin: Springer Verlag, pp. 387-391.
[7] Gajewski P., (2006), "Adaptive Naïve Bayesian Anti-Spam Engine", International Journal of Information Technology, Vol. 3, No. 3, pp. 153-159.
[8] Sheu J., (2009), "An Efficient Two-phase Spam Filtering Method Based on E-mails Categorization", International Journal of Network Security, Vol. 9, No. 1, pp. 34-43.
[9] Chen C., Tian Y. and Zhang C., (2008), "Spam Filtering with Several Novel Bayesian Classifiers", IEEE 19th International Conference on Pattern Recognition (ICPR), pp. 1-4.
[10] Nazir K., Mirza Y. and Asim Y., (2006), "A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering", ECML/PKDD Discovery Challenge Workshop, pp. 125-132.
[11] Hershkop S. and Stolfo SJ., (2005), "Combining Email Models for False Positive Reduction", The Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, Illinois, pp. 98-107.
[12] Chiemeke SC. and Longe OB., (2008), "Probability Modeling for Improving Spam Filtering Parameters", Journal of Information Technology Impact, Vol. 8, No. 1, pp. 1-10.
[13] Liu P., Zhang W. and Zhu F., (2009), "Research on E-mail Filtering Based On Improved Bayesian", Journal of Computers, Vol. 4, No. 3, pp. 271-276.
[14] Blanzieri E. and Bryl A., (2007), "Highest Probability SVM Nearest Neighbor Classifier for Spam Filtering", Fourth Conference on Email and AntiSpam, Mountain View, California, USA.
[15] Simeon M. and Hilderman RJ., (2008), "Categorical Proportional Difference: A Feature Selection Method for Text Categorization", The Seventh Australasian Data Mining Conference (AusDM), pp. 201-208.
[16] Abu El-Khair I., (2006), "Effects of Stop Words Elimination for Arabic Information Retrieval: A Comparative Study", International Journal of Computing & Information Sciences, Vol. 4, No. 5, pp. 119-133.
[17] Salton G. and McGill M. J., (1983), Introduction to Modern Information Retrieval, McGraw Hill, ISBN 9780070544840.
[18] KNIME, (2009), Information Miner Tool, University of Konstanz, Germany.
[19] Ahmed R., (2009), "E-mail Client Application: An Implementation of SMTP and POP3 Protocols Using C#", http://www.codeproject.com/KB/cs/email_client_application.aspx