Term Discrimination Based Robust Text Classification with Application to E-mail Spam Filtering

PhD Thesis

Khurram Nazir Junejo 2004-03-0018

Advisor: Dr. Asim Karim

Department of Computer Science
Syed Babar Ali School of Science and Engineering
Lahore University of Management Sciences

Dedicated to my beloved family

Lahore University of Management Sciences

School of Science and Engineering

CERTIFICATE

I hereby recommend that the thesis prepared under my supervision by Khurram Nazir Junejo, titled Term Discrimination Based Robust Text Classification with Application to E-mail Spam Filtering, be accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science.

Dr. Asim Karim (Advisor)

Recommendation of Examiners' Committee:

Name                        Signature

Dr. Mian Muhammad Awais     ———————————
Dr. Shahid Masud            ———————————
Dr. Hamid Abdul Basit       ———————————
Dr. Haroon Atique Babri     ———————————

Acknowledgements

I am highly grateful to Allah Almighty Who enabled me to achieve this milestone of my PhD work. I would like to acknowledge the kind help, encouragement and valuable guidance provided by my advisor Dr. Asim Karim. I also wish to thank my whole family for their continuous patience, encouragement and support to accomplish this task. Support of fellow PhD students is also greatly acknowledged. I am also thankful to the committee members and the reviewers for their precious time and suggestions.

This research was funded by my parents, the Higher Education Commission, Islamabad, Pakistan, and the Lahore University of Management Sciences, Lahore, Pakistan. Their support is gratefully acknowledged.

Abstract

The Internet has touched every part of our lives, including our interactions and communications. Printed books are being replaced by electronic books (e-books), personal and official correspondence has shifted to electronic mail (e-mail), and news is now read online. This generates huge volumes of unstructured textual data that must be analyzed, filtered, and organized automatically in order to harness its wealth of information for profitable gains. By 2013, the worldwide volume of e-mails is projected to reach 507 billion per day, of which 89% will be spam [Radicati (2009)]. In 2008, the cost of spam to businesses in terms of hardware, software, and human resources was around $140 billion [Research (2008)].

Content-based text classification can automatically organize text documents into predefined thematic categories. However, text classification is challenging in the modern Internet environment. Firstly, text documents are sparsely represented in a very high dimensional feature space (easily in the hundreds of thousands of dimensions), making learning and generalization difficult. Secondly, due to the high cost of labeling documents, researchers are forced to collect training data from sources different from the target domain, which results in a distribution shift between training and test data. Thirdly, although unlabeled data is easily available, its utilization in practical text classification for improved performance remains a challenge. One important domain for text classification, which embodies these challenges, is e-mail spam filtering. A typical e-mail service provider (ESP) caters to thousands to millions of users, each of whom can have his own topical interests and his own preferences regarding spam and non-spam e-mails. Personalized service-side spam filtering provides a solution to this problem; however, for such solutions to be practically usable they must be efficient, scalable, and robust to distribution shifts.

In this thesis, we propose a robust text classification technique that combines local generative models and global discriminative classifiers through the use of discriminative term weighting and linear opinion pooling. Terms in the documents are assigned weights that quantify the discrimination information they provide for one category over the others. These weights, called discriminative term weights (DTW), also serve to partition the terms into two sets. An opinion pooling strategy consolidates the discrimination information of the terms in each set to yield a two-dimensional feature space, in which a discriminant function is learned to categorize the documents. In addition to a supervised technique, we also develop two semi-supervised variants for personalizing the local and global models using unlabeled data. We then generalize our technique into a classifier framework that integrates different feature selection criteria, discriminative term weighting schemes, information pooling strategies, and discriminative classifiers. We provide a theoretical comparison of our proposed framework with existing generative, discriminative, and hybrid classifiers.

Our text classification framework is evaluated with five discriminative term weighting strategies, six opinion consolidation techniques, and four discriminative classifiers. We employ nine real-world datasets from different domains in our experimental evaluation, and the results are compared with four benchmark text classification algorithms via accuracy and AUC values. Our framework is also evaluated under varying distribution shift, on gray e-mails, on unseen e-mails, and under varying classifier size. The scalability of our spam filter is also demonstrated for personalized service-side spam filtering. Statistical significance tests confirm that our technique performs significantly better than the compared techniques in both supervised and semi-supervised settings, and in both global and personalized spam filtering. In particular, it performs remarkably well when the distribution shift between training and test data is high, a phenomenon common in e-mail systems.

Additional contributions of this thesis include a systematic analysis of the spam filtering problem and of the challenges to effective global and personalized spam filtering at the service side. We formally define key characteristics of e-mail classification, such as distribution shift and gray e-mails, and relate them to machine learning problem settings. The concept of term discrimination introduced in this work has also found applications in text clustering, visualization, and feature extraction, and it can be extended to keyword extraction and topic identification from textual documents.


Contents

1 Introduction
  1.1 Text Classification
    1.1.1 Spam Filtering and Message Management
    1.1.2 Opinion Mining and Sentiment Analysis
    1.1.3 Targeted Advertisement
    1.1.4 Product Categorization
    1.1.5 Summarization
    1.1.6 Information Retrieval
    1.1.7 Authorship Detection
  1.2 Motivation of This Research
    1.2.1 Different Training and Test Data Distribution
    1.2.2 Evolving Distributions
    1.2.3 Availability of Unlabeled Data
    1.2.4 Personalized Spam Filtering
  1.3 Our Contribution
    1.3.1 Thesis Contributions
    1.3.2 Related Publications
  1.4 Thesis Outline

2 Background and Related Work
  2.1 Text Classification
    2.1.1 Document Representation
    2.1.2 Term Weights
    2.1.3 Discriminative Term Weights
    2.1.4 Performance Measures
  2.2 Spam Filtering
    2.2.1 Network Level Approach
    2.2.2 Postage
    2.2.3 Disposable e-mail addresses
    2.2.4 Collaborative filtering
    2.2.5 Honeypotting and e-mail traps
    2.2.6 Content based classification
    2.2.7 Personalized Spam Filtering
    2.2.8 Concept Drift
    2.2.9 Domain Adaptation
  2.3 Data Mining and Machine Learning
    2.3.1 Global Models from Local Patterns (LeGo)
    2.3.2 Generative, Discriminative and Hybrid Methods
    2.3.3 Semi-Supervised Methods

3 Discovery Challenge Problem
  3.1 Introduction
  3.2 The Spam Challenge
    3.2.1 Evaluation Measure
    3.2.2 Datasets
  3.3 Participation and Results
    3.3.1 Techniques Used
  3.4 Our Approach
    3.4.1 Results
  3.5 Conclusion

4 Personalized Service Side Spam Filtering
  4.1 Introduction
  4.2 The Nature of the Spam Filtering Problem
    4.2.1 Distribution Shift and Gray E-mails
  4.3 Global Versus Personalized Spam Filtering
    4.3.1 Motivation and Definition
    4.3.2 Semi-Supervised Global Versus Personalized Spam Filtering
  4.4 DTWC/PSSF: A Robust and Personalizable Spam Filter
    4.4.1 Local Patterns of Spam and Non-Spam E-mails
    4.4.2 Global Discriminative Model of Spam and Non-Spam E-mails
    4.4.3 Personalization
    4.4.4 Interpretations and Comparisons
  4.5 Evaluation Setup: Datasets and Algorithms
    4.5.1 Datasets
    4.5.2 Algorithms
  4.6 Results and Discussion
    4.6.1 Global Spam Filtering
    4.6.2 Personalized Spam Filtering
    4.6.3 Varying Distribution Shift
    4.6.4 Gray E-mails
    4.6.5 Effect of Multiple Passes
    4.6.6 Generalization to Unseen Data
  4.7 Conclusion

5 Discriminative Term Weighting Based Text Classification
  5.1 Introduction
  5.2 The Nature of the Text Classification Problem
  5.3 From Two-Class to Multi-Class
    5.3.1 Discriminative Term Weighting
    5.3.2 Term Space Partitioning and Term Selection
    5.3.3 Linear Opinion Pool and Linear Discrimination in Feature Space
  5.4 Evaluation Setup
    5.4.1 Datasets
  5.5 Results and Discussion
    5.5.1 Classifier Performance
    5.5.2 Parameter Estimation
  5.6 Conclusion

6 Classifier Properties and Generalizations
  6.1 Introduction
  6.2 Scalability
    6.2.1 Feasibility for Service Side Personalized Spam Filtering
    6.2.2 Term Selection for Supervised Model
  6.3 Statistical Significance Testing
    6.3.1 Multiple Comparison on Single Datasets vs. Comparison on Multiple Datasets
    6.3.2 Comparisons of Two Classifiers
    6.3.3 Multiple Classifier Comparison
    6.3.4 Post-Hoc Analysis
    6.3.5 Significance Test Results
  6.4 DTWC as a Framework
    6.4.1 Feature Weighting Schemes
    6.4.2 Feature Selection and Partitioning
    6.4.3 Opinion Consolidation
    6.4.4 Discriminative Model
  6.5 Relation to LeGo Framework
  6.6 Conclusion

7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Extensions of our Work
    7.2.1 Discriminative Document Clustering
    7.2.2 Dimensionality Reduction/Feature Extraction
    7.2.3 Feature Weighting Classifier
    7.2.4 Others
  7.3 Future Work

List of Figures

4.1 Shift in p(x|y) between training and test data (individual user's e-mails) (ECML-A data)
4.2 Difference in p(x|y) for e-mails from two different time periods (ECUE-1 data)
4.3 Shift in p(x|y) between training and test data (Movie Review)
4.4 The two-dimensional feature space and the linear discriminant function for e-mail classification
4.5 Depiction of Algorithm 2
4.6 Depiction of Algorithm 3
4.7 Generalization performance on ECML-A dataset
5.1 Linear classifiers for a three-class problem in two dimensions when 1-vs-All (a) and 1-vs-1 (b) settings are used
5.2 The two-dimensional feature space and the linear discriminant function for a spam classification problem
5.3 Performance in terms of area under the curve (AUC)
5.4 Comparison between different weighting schemes
6.1 Number of significant terms versus term selection parameter t for PSSF1/PSSF2 on ECML-A dataset
6.2 Number of terms selected versus threshold t for Spam dataset (DTWC-RR)
6.3 Average accuracy versus threshold t for Spam dataset (DTWC-RR)
6.4 Histogram of the differences between the percentage accuracy of the DTWC/PSSF and ME/ME-SSL classifiers on the 44 training-testing pair sets. The mean of the distribution is 9.16, with standard deviation of 7.28
6.5 Performance of discriminative classifiers on the transformed two-dimensional feature space

List of Tables

3.1 Top teams of ECML Task A. The values are in percentages.
3.2 Top teams of ECML Task B. The values are in percentages.
3.3 Performance results of our algorithm with parameters t = 8 and s = 13 tuned on the tuning datasets. The values are in percentages.
3.4 Performance of our algorithm with various parameter combinations. Average AUC of the submitted filter (with t = 400 and s = 8) is 95.07%. Values are in percentages.
3.5 Performance comparison of the algorithm using word frequencies and occurrences.
4.1 Global and personalized spam filtering options
4.2 Evaluation datasets and their characteristics
4.3 Global spam filtering results for ECML-A dataset
4.4 Global spam filtering results for ECML-B dataset
4.5 Global spam filtering results for ECUE-1 and ECUE-2 datasets
4.6 Global spam filtering results for PU1 and PU2 datasets
4.7 Personalized spam filtering results for ECML-A dataset
4.8 Personalized spam filtering results for ECML-B dataset
4.9 Personalized spam filtering results for ECUE-1 and ECUE-2 datasets
4.10 Personalized spam filtering results for PU1 and PU2 datasets
4.11 Comparison with other published personalized spam filtering results (all numbers are average percent AUC values)
4.12 Performance under varying distribution shift. Average percent accuracy and AUC values are given for ECML-A dataset.
4.13 Performance on gray e-mails identified from user 1 and user 2 e-mails in ECML-A dataset. For DTWC and PSSF2, the table gives the number of gray e-mails that are misclassified. s = similarity threshold; GE = gray e-mails.
4.14 Effects of multiple passes of PSSF1 on ECML-B. The values are AUC values in percentages.
5.1 Classification errors (in %)
5.2 Accuracy results for Movie dataset. Means plus/minus standard deviations are computed from 5 runs with randomly drawn training sets of sizes specified in the first column and randomly selected test sets of size 800.
5.3 Accuracy results for SRAA dataset. Means plus/minus standard deviations are computed from 5 runs with randomly drawn training sets of sizes specified in the first column and randomly selected test sets of size 4000.
5.4 Accuracy results for the ECUE, PU and 20 Newsgroups datasets
5.5 Accuracy results for Spam dataset. The training set and each user's inbox contain 4000 and 2500 e-mails, respectively.
6.1 Scalability of PSSF: impact of filter size on performance. The results are averages for the three test sets in ECML-A dataset.
6.2 Selected terms and accuracy at different values of threshold t for Spam dataset (DTWC-RR)
6.3 Post-Hoc Analysis of Friedman's test for the accuracy measure. Homogeneous subsets are based on asymptotic differences. The significance level is 0.05.
6.4 Post-Hoc Analysis of Friedman's test for the AUC measure. Homogeneous subsets are based on asymptotic differences. The significance level is 0.05.
6.5 Classification results for local model. Winning performance for each dataset is highlighted in bold. Values are percentage accuracies.
6.6 Combining experts. All values are percentage accuracies. 1, 2 and 3 in the first column refer to the three users of ECML-A dataset.

Nomenclature

α+       Discriminative Model Parameter
α0       Discriminative Model Parameter
Φ̄        Learned Target Function
Φ        True Target Function
f()      Discriminant Function
G        Holdout Set
g()      Discriminative Term Weight Function
L        Training Data
L_y      Set of E-mails in L Belonging to Class y
M        Number of Users
N        Number of Examples/Records
T        Dictionary Size
t        Modified Threshold Parameter
t        Threshold Parameter
U        Test Data
Ū        U Labeled with DTWC
U_i      Test Set of the ith User
w_j      Weight of the jth Term
X        Set of All Possible Examples/Records
x_i      ith Document
Y        Set of Labels
y_i      Label of the ith Document
Z+       Score of the Spam Class
Z−       Score of the Non-spam Class

20 NG    20 Newsgroups
Acc      Accuracy
ANOVA    Analysis of Variance
AUC      Area Under the Curve
Avg      Average
BW       Balanced Winnow
CDIM     Clustering via Discrimination Information Maximization
CPC      Category Pivoted Categorization
DPC      Document Pivoted Categorization
DTW      Discriminative Term Weights
DTWC     Discriminative Term Weight Classifier
ECML     European Conference on Machine Learning
EM       Expectation Maximization
ESP      E-mail Service Provider
FAvg     Frequency Average
FLD      Fisher Linear Discriminant
FLOP     Frequency Linear Opinion Pooling
FP       False Positive
FPR      False Positive Rate
FSUM     Frequency Sum
FWC      Feature Weighting Classifier
GE       Gray E-mail
IG       Information Gain
k-NN     k Nearest Neighbor
KL       Kullback-Leibler
KLD      Kullback-Leibler Divergence
KM       K-Means
LDA      Linear Discriminant Analysis
LeGo     Global Models from Local Patterns
LOP      Linear Opinion Pooling
LOR      Log Odds Ratio
LR       Logistic Regression
LRR      Log Relative Risk
ME       Maximum Entropy
ML       Machine Learning
NB       Naive Bayes
OR       Odds Ratio
PSSF     Personalized Service-Side Spam Filter
ROC      Receiver Operating Characteristic
RR       Relative Risk
SMS      Short Messaging Service
SMTP     Simple Mail Transfer Protocol
SRAA     Simulated/Real/Auto/Aviation
SSL      Semi-Supervised Learning
SVC      Support Vector Classifier
SVM      Support Vector Machines
TC       Text Classification
TF       Term Frequency
TF-IDF   Term Frequency Inverse Document Frequency
TP       True Positive
TPR      True Positive Rate
TVD      Total Variation Distance
TW       Term Weights
UCE      Unsolicited Commercial E-mail

Chapter 1

Introduction

1.1 Text Classification

Suppose you are maintaining an online digital library that adds many papers, articles, or books to its collection every day. A user of this library might find new documents relevant to his area by browsing through them, but this could take a lot of time and effort, because the user would have to look at every document to sort out the ones that interest him. It would therefore be quite helpful to have a system that automatically examines each document and places it into a folder according to its topic, or an automatic system that notifies the user whenever a document of his interest is added to the library. The automatic process of organizing natural language texts (based only on their textual content) into predefined thematic categories is known as content-based text classification, or more commonly just text classification (TC) [Borko and Bernick (1963)].

Text classification has witnessed booming interest in the past decade, and its practical importance increases with each passing day because of the availability of massive volumes of electronic text through the World Wide Web, electronic mail (e-mail), web blogs, Internet news feeds, digital libraries, social networking websites, online advertisements, corporate databases, product reviews, and much more [Gupta and Lehal (2009)]. Many applications based on these different data sources can be posed as TC problems. Because of its enormous reach, this medium is also exploited through unsolicited e-mails, messages, blogs, websites, etc., commonly referred to as spam. E-mail, now the most used medium of online communication, suffers from this problem the most. The identification and quarantining of this spam is referred to as e-mail spam filtering.

Earlier approaches to text classification and spam filtering were based on knowledge engineering [Hayes et al. (1990), Goodman (1991)], i.e., defining logical rules such that a document is assigned to a category if it satisfies the rules for that category. This requires the rules to be defined manually by a knowledge engineer and a domain expert. If the domain changes or the categories in the domain evolve, then new rules have to be defined. These problems gave way to the popularity of Machine Learning (ML) approaches, which are now dominant in this field [Sebastiani (2002)]. In the ML framework, a classifier or learner is built from a set of documents that have previously been categorized, or labeled. The model learned by the classifier is then used to assign new, unseen documents to their most relevant categories. This technique therefore builds a classifier automatically, without the intervention of a domain expert, capturing characteristics of the data that may be hidden even from the domain expert. If the categories in the domain evolve, or even if the whole domain changes, the classifier can be automatically updated from the labeled data. The set of pre-classified documents, known as the training data, is therefore very important for learning the model in ML.

Text classification has found many applications in various fields and for various types of data. From this large pool of applications, the following have gained special interest in the community:

1.1.1 Spam Filtering and Message Management

Human communication over the WWW mostly takes the form of text messages such as e-mails, instant messages, tweets, short messaging service (SMS) messages, and news feeds. This medium is plagued by the phenomenon of unsolicited messages, known as spam. Of these, e-mail spam filtering is one of the most challenging instances, mainly because of its adversarial nature. It is the process of assigning incoming e-mail to one of two predefined categories: junk mail (spam) and legitimate mail (non-spam). According to Radicati (2009), by 2013 the worldwide volume of e-mails will reach 507 billion per day. They estimate the spam rate to be around 84%; Symantec (2010) puts the global spam rate at more than 89%, while Microsoft (2009) states that 97% of e-mails are unwanted, and 94% of the e-mails intercepted by Microsoft's Forefront Online Protection for Exchange (FOPE) in January 2011 were spam [Microsoft (2012)]. Spam attacks both the user and the computer through viruses, keyloggers, phishing attacks, and malware. Ipsos (2010) reports that half of e-mail users in North America and Western Europe have opened spam e-mails; more than 62% of these users have also clicked on a link, opened an attachment, replied to the sender, or even forwarded the e-mail to their family and friends. Spam results in losses of billions of dollars: Research (2008) estimated the cost at $140 billion for the year 2008, and since then the volume of spam has more than doubled [Radicati (2009)]. Some spammers (those who send spam) are known to earn millions of dollars a year [Sanz et al. (2008)].

A label or a set of labels can be suggested for a new e-mail (or instant message, or SMS) from a set of predefined labels, such as family, friends, spam, etc., or messages can be classified into activities [Dredze et al. (2006)]. Similarly, news feeds can be filtered according to their genre [Gabrilovich et al. (2004)], and tweets can be divided into important and unimportant tweets to save time [Genc et al. (2011)].

In addition to content-based filtering, e-mails can also be filtered at the network level. In this approach, meta-information of the messages, such as the routing path, the sender's IP address, spammer behavior, and details of the e-mail header, is used to maintain white and black lists. A white list is a list of senders that are considered safe because they have not generated spam before. A black list, on the other hand, is a list of senders who are known for generating spam, and therefore all e-mails generated by them are considered spam [Boykin and Roychowdhury (2005)]. These lists may contain e-mail addresses, IP addresses, or domain names, and can be deployed at the client side as well as the server side. The lists are updated manually, so they are cumbersome to maintain. This approach suffers mainly because of the ephemeral nature of IP addresses, owing to the ease of dynamic renumbering of IP addresses, hijacking of IP addresses and IP address space, and compromised machines (botnets). More than half of sender IP addresses appear less than twice, and as much as 35% of spam was sent from IP addresses that were not listed by either SpamCop or Spamhaus, the two most reputable blacklists [Ramachandran et al. (2007)]. These approaches also exhibit a high false positive rate, which is especially costly in spam filtering [Esquivel et al. (2009)], because of which the classification threshold is lowered. As a result, a large number of spam e-mails are missed by these filters and reach the user's e-mail account. At this point, as a last line of defense, content-based e-mail filters effectively redirect the spam messages to the junk (or spam) mail folder while exhibiting a very low false positive rate.

1.1.2 Opinion Mining and Sentiment Analysis

What are other people thinking? What are they saying about a particular thing or person, and how do they feel about it? With the availability of opinion-rich online resources such as personal blogs, review sites, and social networks, some of these questions can be answered through opinion mining and sentiment analysis [Pang and Lee (2008)]. Many users find reviews helpful in deciding whether to purchase a product online. An automatic ranking of these reviews based on their favorability helps both the company and the user to better assess the products on sale. Similarly, a user can skip reading reviews altogether if a sentiment analyzer has assigned them a favorability score on some scale. Some readers are interested in knowing the attitude of the writer toward some topic before purchasing a book. Systems for these types of analysis employ TC, as these blogs (books or reviews) mostly contain text in addition to a few emoticons or pictures.

1.1.3 Targeted Advertisement

Chat sessions, e-mails, blogs, and tweets play a vital role in web-based targeted advertisement, which generates most of the revenue for search engines [Ford et al. (2003)]. Subscribers of a telecom service can be categorized into different interest groups based on the text of their SMS messages, to be targeted with promotions and other special offers. Similarly, users are targeted with advertisements related to the text in the e-mails or web pages that they are viewing [Broder et al. (2007)].

1.1.4 Product Categorization

Large online shopping portals such as Amazon and eBay automatically arrange their products into categories for easy browsing. Different types of features, including the title and description of the product, are used for this purpose [Cortez et al. (2011)]. Yahoo maintains a large topic hierarchy of web pages to organize its overwhelming amount of data, which not only enables easy document browsing but also increases search accuracy [Zhu (2009)].

1.1.5 Summarization

With the phenomenon of information overload, interest in automatically shortened versions of a document or a set of documents has increased [Barzilay et al. (1999)]. The summary can vary from a recommendation of keywords, to the topic of a document, to a full summary. Documents are also clustered for better understanding and summarization of large document collections. Tweets, SMS messages, and e-mails can be summarized to identify events or activities [Chakrabarti and Punera (2011)].

1.1.6 Information Retrieval

The demand for fast searching of documents, or of some information within a document, grows daily with the growth of the WWW. Google Scholar1 categorizes research articles into different subject areas to facilitate the document searching process. Similarly, a search engine can cluster similar documents into categories for easier browsing, quicker retrieval, and higher accuracy [Manning et al. (2008)]. Instead of serving a text query, a search engine could take a document you have and find similar web pages or documents. To deliver search personalization, search engines keep track of queries and user browsing activities [Noll and Meinel (2007)]. They may also try to determine the sense of polysemous or homonymous words in queries to improve the search experience.

1.1.7 Authorship Detection

The process of identifying the most likely author of an anonymous or disputed document is known as authorship detection. Classically it was employed for books and reports; more recently it has been applied to online text documents such as blogs, e-books, and e-mails, to such an extent that it has served as evidence in several digital crime and fraud detection cases [Tan and Tsai (2010)].

1 http://scholar.google.com/

1.2 Motivation of This Research

With human communication shifting more and more toward the cyber world, the volume of textual content in the form of e-mails, chat sessions, tweets, instant messages, short messaging service (SMS) messages, blogs, e-books, research articles, web pages, etc. has become enormous and is rapidly growing. These documents contain a huge wealth of information that companies are harnessing for profitable gains. This task faces quite a few challenges, including the high dimensionality and sparsity of documents in the feature space, the need for user-oriented or personalized solutions, the use of informal speech, the semi-structured and unstructured nature of text, the large volume of data and its evolving nature, and the high cost of labeling documents. Some of the challenges that we address in this work are as follows:

1.2.1 Different Training and Test Data Distribution

The training data is the foundation on which the classifier is built. Two of its most desired characteristics are: (a) it should be sufficiently large, and (b) it should be a good representative of the distribution of the test data. In other words, both high quantity and good quality of the training data are necessary. If the former requirement is not satisfied, the classifier gives a high error on both the training and unseen data; this situation can be remedied by accumulating more data. If the latter requirement is not met, the classifier gives a low training error but a much higher test error, the reason being that the distributions of the two datasets differ. This situation is difficult to remedy, as the distribution of the test data cannot be tampered with to make it conform to the distribution of the training model.

In most cases it is not possible to gather labeled documents representing the required distribution before the system is deployed, or it is quite expensive to get the data labeled a priori. In such scenarios, the training data is constructed by gathering labeled documents from previously available public data for either the same domain or a related domain. This gives us the privilege of re-using training data to learn a classifier for a related problem. For example, suppose we want to build a classifier to judge movie reviews but have no labeled data for it, while we do have the training data of a classifier built for classifying reviews of sitcoms or documentary feature films; we can use this data as the training data for learning our classifier for movie reviews. This inadvertently results in a difference between the distributions of the training and the test data [Quionero-Candela et al. (2009)]. Other factors that exacerbate this difference are label noise, biased sampling, etc. Therefore, a classifier that is robust to these differences in distribution can reduce the need for labeled data, which is a much desired characteristic.
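One simple way to make the notion of a distribution difference concrete is to compare the term distributions of the training and test corpora with an information-theoretic measure. The sketch below is illustrative only: the toy counts are made up, the function name is ours, and it uses a smoothed Kullback-Leibler divergence, whereas the exact measures used later in this thesis may differ.

```python
import numpy as np

def kl_divergence(p_counts, q_counts, smooth=1.0):
    """Smoothed KL divergence D(P || Q) between two term-count vectors,
    one aggregated per corpus, as a rough measure of distribution shift."""
    p = (p_counts + smooth) / (p_counts.sum() + smooth * len(p_counts))
    q = (q_counts + smooth) / (q_counts.sum() + smooth * len(q_counts))
    return float(np.sum(p * np.log(p / q)))

# Hypothetical term counts over a shared vocabulary: one vector for the
# training corpus, one for a user's inbox (the test data).
train_counts = np.array([120, 30, 5, 0, 80], dtype=float)
test_counts = np.array([10, 45, 60, 25, 4], dtype=float)
print(kl_divergence(train_counts, test_counts))  # larger value => greater shift
```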

1.2.2 Evolving Distributions

At times, even though the model is learned on good representative data and the classifier's performance is initially satisfactory, after a while its performance starts to degrade. This happens because the distribution of the test data changes with time, so that no training data could remain good enough [Moreno-Torres et al. (2012)]. For example, with time, the interests of a digital library user may change from document clustering toward document classification, or he might develop an interest in natural language processing. The learned classifier should be able to easily adapt to these evolving distributions (interests).

1.2.3 Availability of Unlabeled Data

Usually, the more training data you have, the better your learner is suited for unseen or new data, but this gives rise to the problem of labeling the documents manually. This process is time consuming and requires a domain expert. To make things worse, the training data requirement of many algorithms is very large and sometimes even prohibitive. The need for large quantities of training data and the difficulty of obtaining such data have therefore led to the use of unlabeled data in the classifier learning phase. Even though it is not obvious, the use of unlabeled data in training has shown significant performance gains for TC problems and other problems as well, and it has thus emerged as a separate learning paradigm usually referred to as Semi-Supervised Learning (SSL) in the machine learning literature [Chapelle et al. (2010)]. SSL contrasts with the previously mentioned supervised learning approach, in which only labeled documents are available for training. Collections of unlabeled documents are easily and inexpensively available, especially for domains involving online resources, where the number of unlabeled documents can easily reach the hundreds of thousands; e.g., there are millions of personal blogs that are unlabeled but publicly accessible and can thus be used in learning classification models for blogs.

1.2.4 Personalized Spam Filtering

Spam filtering is a special instance of text classification with many more challenges and much more diversity. In electronic mail (e-mail) spam filtering, the incoming e-mail is assigned to one of two predefined categories: junk mail (spam) and legitimate mail (non-spam). It differs from general TC in that the people who send out spam (called spammers) constantly try to evade the spam classifier, which makes it necessary for the classifier to adjust itself to the evolving distribution of spam e-mails. The distribution of users' legitimate e-mails also changes with time and from user to user. A spam filter needs to cater to millions of users, each having his own interests and his own understanding of what constitutes a legitimate e-mail. Oftentimes an e-mail that is spam for one user might be legitimate for another user (a so-called gray e-mail). It is thus quite difficult to correctly model the understanding of each user in a single filter, which opens the door to user personalization, i.e., tuning the filter for each user according to his interests and understanding. This raises issues of scalability, as it might be quite expensive to maintain a separate filter for each of millions of users. Furthermore, training data for spam filtering is very hard to get, because the relevance and legitimacy of an e-mail can be judged only by the recipient of that particular e-mail, who does not like to label e-mails manually, in contrast to other domains, such as a corporation or a news agency, where a domain expert identifies the relevance of a document to a category. Apart from that, the cost of false positives in the spam filtering domain is much higher than in other domains, mainly because it is highly undesirable for a legitimate e-mail to be put in the junk mail folder, which is seldom checked by a user. The size of an e-mail is also small compared to the size of research articles, corporate documents, news feeds, etc. The topic and language of discussion in e-mails can vary far more than in news feeds, research papers, or corporate documents. The frequency of arrival of e-mails is also much higher than in other domains, so much so that a typical e-mail server has to process millions of e-mails a day. The aforementioned differences not only constrain the e-mail classifier to be highly scalable and efficient but also require that the e-mail filter adapt to the changing styles of different e-mail users. At the same time, it should identify the e-mails sent by spammers, who are constantly trying to evade the classifier. As a result, spam filtering is a much more difficult problem than conventional document classification [Fawcett (2003)].

1.3 Our Contribution

In this thesis we propose a text classification algorithm that is robust to the distribution shift between the training and test data, is highly scalable, and demonstrates the ability to harness the information available in unlabeled data as well. To this end, we develop, evaluate, and compare a learning approach based on local and global discrimination modeling techniques for the problem of text classification in general and spam filtering in particular. The output of the local model serves as the input feature space for the global model, which is responsible for the final classification. The technique is very robust to the distribution shift between the training and the test data and is also extendable to an SSL approach. We also propose and evaluate a general classifier framework that can be used to develop a series of classifiers by combining different feature selection criteria, generative weighting schemes, information pooling strategies, and global discriminative classifiers.

The proposed algorithm combines a local generative model with a global discriminative classifier. The terms in the dataset are partitioned based on a selection criterion (e.g., class conditional probabilities) into sets, with one set for each class. These terms are assigned weights based on statistical and information-theoretic measures such as odds, relative risk, Kullback-Leibler divergence, etc. The discrimination information provided by these terms is consolidated through an information pooling strategy that not only transforms the feature space but also reduces its dimensionality from the number of terms to the number of classes. This space is more robust to distribution shift between the training and test sets. A discriminative classifier is then learned in this space to obtain the final labels. In the case of a large distribution shift between the training and the test data, we extend the algorithm to the semi-supervised domain using a naive SSL approach that learns a model on the combined training and test data (without the original labels). It handles the problem of distribution shift exceptionally well by using a linear transformation of the input space to a new feature space where the documents are more easily discriminable by a global discriminative classifier.

We introduce the concept of discriminative term weights (DTW), which quantify the discrimination information that terms provide for classification, in contrast to the usual term weights (TWs) (such as tf-idf) that quantify the significance of a term within a document. Second, DTWs are defined globally for every term in the vocabulary, while TWs are defined locally for every term in a document. Third, DTWs are computed in a supervised fashion, rather than through the usual unsupervised computation of TWs. DTWs are not a substitute for TWs; they are defined for the different purpose of classification rather than representation. We experiment with different weighting schemes, information pooling strategies, and discriminative classifiers. The results of the algorithm are quite promising: it has significantly outperformed contemporary text classification and spam filtering techniques in both supervised and semi-supervised settings on most of the benchmark datasets. The algorithm is not only linear in time and space but also demonstrates a very low false positive rate, and we demonstrate its effectiveness in handling gray e-mails. Even with very small feature vectors it shows remarkable performance, which makes the technique highly scalable and convenient for user personalization. We not only develop the theoretical foundations of the algorithm and compare it with different generative and discriminative techniques, but we also look at the optimization of the parameters of our model so as to provide an efficient technique that is scalable to millions of users for personalized service-side spam filtering.
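To make the pipeline concrete, the following is a minimal sketch of the core idea in Python. It assumes binary classes, relative-risk discriminative weights, and frequency-weighted linear opinion pooling over the two term partitions; the function names and toy data are ours, and the actual DTWC/PSSF formulation developed in Chapters 4 and 5 differs in its details (e.g., term selection, smoothing, and the choice of weighting scheme).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_dtw(X, y, smooth=1.0):
    """Relative-risk style discriminative term weights from a binary-labeled
    term-count matrix X (documents x terms) with labels y in {0, 1}."""
    pos, neg = X[y == 1], X[y == 0]
    # Smoothed class-conditional term probabilities p(term | class)
    p_pos = (pos.sum(axis=0) + smooth) / (pos.sum() + smooth * X.shape[1])
    p_neg = (neg.sum(axis=0) + smooth) / (neg.sum() + smooth * X.shape[1])
    rr = p_pos / p_neg                       # relative risk of each term
    dtw = np.where(rr >= 1.0, rr, 1.0 / rr)  # discrimination magnitude >= 1
    pos_terms = rr >= 1.0                    # partition: terms favoring class 1
    return dtw, pos_terms

def pool(X, dtw, pos_terms):
    """Linear opinion pool: consolidate per-term discrimination scores into a
    two-dimensional feature (Z+, Z-) per document, one score per partition."""
    freq = X / np.maximum(X.sum(axis=1, keepdims=True), 1.0)  # term frequencies
    z_pos = (freq * dtw * pos_terms).sum(axis=1)
    z_neg = (freq * dtw * ~pos_terms).sum(axis=1)
    return np.column_stack([z_pos, z_neg])

# Toy usage: a global linear discriminant learned in the pooled 2-D space.
rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(200, 500)).astype(float)  # fake term counts
y = np.asarray(rng.integers(0, 2, size=200))
dtw, pos_terms = fit_dtw(X, y)
clf = LogisticRegression().fit(pool(X, dtw, pos_terms), y)
print(clf.predict(pool(X[:5], dtw, pos_terms)))      # labels for five documents
```

The key property is that the global discriminant operates in the pooled two-dimensional space (Z+, Z−) rather than in the original term space, which is what keeps the model small and fast and makes it comparatively robust to shifts in individual term distributions.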

1.3.1 Thesis Contributions

The following paragraphs highlight the major contributions of this thesis.

We develop a new text classification technique based on the local and global discrimination modeling (LeGo) approach that performs exceptionally well on problems with distribution shift. We introduce the concept of discriminative term weights (DTW). DTWs are defined globally in a supervised fashion and quantify the discrimination information that terms provide for classification. In our approach, the different operations are decoupled to facilitate partial updating of the model (in case of distribution shift).

We demonstrate the effectiveness of our algorithm for the problem of content-based e-mail spam filtering. We analyze and evaluate various issues of spam filtering, such as gray e-mails and global and personalized filtering. Furthermore, our spam filtering is fully automatic, requiring no user feedback. We develop two semi-supervised versions of our filter for effective personalization. Even with very small filter sizes, our algorithm performs better than the benchmark filters on some datasets.

Based on our approach, we also develop and evaluate a framework that combines different selection criteria, local generative weighting schemes, opinion pooling strategies, and global discriminative classifiers. This framework is demonstrated to yield efficient text classifiers. We compare our approach theoretically and empirically with popular text classification algorithms on various benchmark datasets in both supervised and semi-supervised settings. We also demonstrate the scalability, robustness, and generalization aspects of our algorithm. We provide a detailed discussion of the distribution shift problem, with an evaluation of various algorithms under varying distribution shift. We perform various statistical significance tests to verify our results and show that our approach is statistically significantly better than most of the benchmark approaches.

Apart from the aforementioned major contributions, we give a discussion, formulation, and results for the multi-class formulation of our algorithm. We also quantify the distribution shift between e-mail inboxes through information-theoretic measures to study the impact of distribution on the performance of the algorithm. We also perform a feasibility and scalability analysis of our approach for service-side personalized spam filtering. A comparison of various discriminative models based on our two-dimensional local model is also performed. Lastly, a comprehensive survey of related text classification, spam filtering, data mining, and machine learning techniques is presented.

1.3.2 Related Publications

Part of the work presented in this thesis has been published in the following research articles:

1. Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim, "A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering", Discovery Challenge Workshop, 17th European Conference on Machine Learning, 2006, Berlin, Germany.

2. Khurum Nazir Junejo and Asim Karim, "Automatic Personalized Spam Filtering through Significant Word Modeling", In Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2007, Greece.

3. Khurum Nazir Junejo and Asim Karim, "PSSF: A Novel Statistical Approach for Personalized Service-side Spam Filtering", In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI), 2007, California, USA.

4. Khurum Nazir Junejo and Asim Karim, "A Robust Discriminative Term Weighting based Linear Discriminant Method for Text Classification", In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2008, Italy.

5. Khurum Nazir Junejo and Asim Karim, "Robust Personalizable Spam Filtering via Local and Global Discrimination Modeling", Knowledge and Information Systems, pages 1-36, Springer-Verlag, 2012. DOI 10.1007/s10115-012-0477-x, ISSN 0219-1377.

1.4 Thesis Outline

The rest of the thesis is organized as follows. Chapter 2 develops the background for this thesis and surveys the related work in the text classification, spam filtering, and machine learning domains, to put our technique in perspective with the work already accomplished. In Chapter 3 we discuss our contribution to the Discovery Challenge 2006, which laid the foundation of this research work. Chapter 4 builds on this by formally defining the problems of spam filtering and personalized spam filtering and introducing the various problems and issues within this application area. We refine the model and present it in more detail, along with theoretical and empirical comparisons with various well-known algorithms on benchmark spam filtering datasets. Results for various scenarios within the scope of spam filtering, such as distribution shift and gray e-mails, are also presented in this chapter. In Chapter 5 we discuss how spam filtering differs from general text classification and what generalizations our algorithm requires to perform well on various text classification tasks. We report results on text classification benchmark datasets in this chapter as well. Chapter 6 discusses the scalability of our approach and its suitability for text classification in general and personalized spam filtering in particular. We also perform statistical significance tests comparing our approach with other well-known algorithms. Furthermore, we introduce a generic framework that arises from our approach, and we explore and experiment with its different aspects in this chapter as well. Finally, in Chapter 7 we conclude the dissertation and propose several avenues for future research suggested by our work.


Chapter 2

Background and Related Work

In this chapter, we build the contextual background by discussing the related work in the literature. For presentational convenience, the discussion is divided into three sections: the first covers work that focuses on text classification, the second covers work on spam filtering, and the third covers relevant data mining and machine learning approaches. In the text classification section we discuss issues related to TC such as document representation, feature selection and weighting, and discriminative term weights. In the spam filtering section we relate our work to existing spam filtering approaches, both global and personalized. In the last section we relate our approach to various machine learning approaches, such as generative, discriminative, and hybrid approaches, as well as the LeGo framework. We also relate it to existing supervised and semi-supervised techniques.

2.1 Text Classification

A text classification (TC) problem can be stated as the grouping of documents into categories or classes based on their textual content, given a set of labeled documents. The idea is to learn a classifier or filter from a sample of labeled documents which, when applied to unlabeled documents, assigns the correct labels to them. The categories are just symbolic labels, which could very well be replaced by numbers.

Formally, let X be the set of all possible documents and Y = {1, 2, ..., k} be the set of possible labels, where k = |Y| is the total number of categories. The label of a document x ∈ X is given by the (unknown) target function Φ(x) : X → Y. The problem of supervised text classification can then be defined as follows: given a set of training documents {(x_i, y_i)}_{i=1}^{N}, where x_i ∈ L ⊂ X and y_i = Φ(x_i), and a set of test documents U ⊂ X, learn the function Φ̄(x) : U → Y such that Φ̄(x) = Φ(x) for all x ∈ U. The sets L and U correspond to the documents in the training (labeled) and test (unlabeled) data, respectively, and it is generally understood that U ∩ L is a small or null set.

TC as defined above is referred to as single-label text classification, as opposed to multi-label classification, where more than one label can be assigned to a document. Binary TC is a special case of single-label TC in which |Y| = 2. Document pivoted categorization (DPC) is defined as: given a document, find all the categories (or labels) that should be assigned to it. Category pivoted categorization (CPC) is defined as: given a category, find all the documents that should be assigned to it [Sebastiani (1999)]. DPC is suitable when documents become available at different moments in time, as in e-mail filtering. CPC is suitable when a new category is created and the documents already classified must be considered for it. Oftentimes both techniques can be applied for TC, but in general DPC is more common, and therefore in this thesis we define and assume DPC.

As defined above, classifiers assigning a label to a document are said to be doing hard categorization. If, for a given document x, the classifier outputs a real (or natural) value for each category according to its estimated appropriateness for x, then this is referred to as category ranking. The output can be sorted to suggest to a user the most relevant labels for the document or, in the case of CPC, to suggest the top-ranked documents, e.g., the most relevant articles for a user under the sports category. Ranking is therefore semi-automatic, requiring a human expert to take the final decision. Some techniques do category ranking or document ranking followed by a hard categorization; e.g., Junejo and Karim (2008) rank the categories according to their estimated relevance to a document and then select the category with the maximum score (rank).


Text classification has been studied extensively in the literature. A comprehensive review of text classification methods is given in Sebastiani (2002). Here we focus on information extraction, document representation, term weights and performance measures.

Information Extraction and Preprocessing

Most of the available data for TC is in the form of unstructured or semi-structured documents. The feature space can easily grow to a million dimensions. The Oxford English Dictionary alone defines more than 600,000 words, and it does not include most names of people, locations, products, or scientific entities. According to Heaps' law (from information retrieval), the size of the vocabulary grows as a sublinear power of the size of the data set [Manning et al. (2008)]. Furthermore, documents are represented sparsely in this space. Therefore, in addition to the traditional dimensionality reduction techniques, various natural language pre-processing steps are employed to reduce the feature space, not only by removing redundant and irrelevant features but also by mapping multiple features to a single feature. These techniques are also known to increase the accuracy of the system significantly. Some of these techniques are mentioned below.

Stop Words

Words that are too common in the language are of very little help in classification because they occur in almost all documents, irrespective of the category. These words are called stop words and generally include articles, prepositions, etc., such as "is", "the", "a" and so on. Removing these words rarely decreases classification performance and at times may even increase it slightly [Fox (1989)]. According to Zipf's law these words constitute the bulk of the dataset, so removing them only slightly decreases the dimensionality of the feature space but reduces the size of the dataset significantly [Manning et al. (2008)].

Stemming

Stemming is the process of reducing derived or inflected words to their base or root form, called the stem, by removal of suffixes [Frakes (1992)]. This helps in mapping multiple words to a single word. For example, bank, banker, banking and banks are all mapped to the single stem bank. Stemming is helpful in information retrieval, and in classification when the data size is small.

Lemmatization

Lemmatization is similar to stemming except that stemming operates on individual words without looking at the context, whereas lemmatization analyzes a word in its context and semantic meaning before mapping it to its base form, called the lemma [Alkula (2001)]. For example, a stemmer may stem "better", "best" and "good" to "bett", "bes" and "go", respectively, whereas a lemmatizer would map all three words to "good".

Case Folding

Case folding is the process of reducing all letters to lower case. Often this is a good idea because a word such as "classification", occurring at the start of a sentence, would have a capital first letter. When calculating probabilities, "classification" will not match "Classification" and the two will be treated as separate features. The drawback of this approach is that words like "Windows" and "Bush" will be transformed to "windows" and "bush" respectively, which can have a different meaning. Nonetheless it is successfully employed in TC and information retrieval [Shen et al. (2005), Ogilvie and Callan (2001)].

Token Normalization

Some tokens, despite meaning the same thing, do not match because of superficial changes in the character sequence, e.g. "anti-discriminatory" and "antidiscriminatory". These changes can arise from spelling differences (color vs. colour), diacritics on characters (naive vs. naïve), and punctuation marks (U.S.A will not match USA). The process of standardizing or canonicalizing these tokens is referred to as token normalization [Corston-Oliver and Gamon (2004)]. There are many ways to accomplish this; the most used approach is equivalence classes, which maintains a list of tokens that are mapped to a unique token.


In addition to the above mentioned pre-processing, HTML or XML tags also need to be removed from documents such as e-mails, blogs and web pages. Other artifacts that may require cleaning are images, audio and video clips. Header and routing information of e-mails is also removed. A pipeline combining these steps is sketched below.
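As an illustration of how these steps fit together, the following is a minimal sketch in Python. The stop-word list, equivalence map and suffix stripping are tiny illustrative stand-ins; a real system would use fuller resources such as a proper stemmer and stop list (e.g. those shipped with NLTK).

```python
import re

STOP_WORDS = {"is", "the", "a", "an", "of", "to", "and", "at"}  # tiny illustrative list
EQUIVALENCE = {"colour": "color"}                               # toy normalization map

def crude_stem(token):
    # Crude suffix stripping for illustration only; a real system would
    # use a proper stemmer such as Porter's algorithm.
    for suffix in ("ing", "ers", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)                 # strip HTML/XML tags
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenize with case folding
    tokens = [EQUIVALENCE.get(t, t) for t in tokens]     # token normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [crude_stem(t) for t in tokens]               # stemming

print(preprocess("The banker <b>is</b> banking at banks"))
# -> ['bank', 'bank', 'bank']
```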

2.1.1 Document Representation

Since text cannot be directly interpreted by a classifier, an indexing procedure that maps a document onto a compact representation of its contents needs to be uniformly applied. This raises the question of what constitutes the meaningful units of text, known as the problem of lexical semantics, and of the meaningful natural language rules that combine these units, known as the problem of compositional semantics. Sebastiani (2002) and Forman and Kirshenbaum (2008) provide very comprehensive treatments of this problem of document representation. Most commonly a document is represented as a vector of term weights in a vector space, the dimensionality of which is the number of unique terms that occur at least once in the corpus after the information extraction step. The ith document is defined by the vector x_i = (x_i1, x_i2, . . . , x_iT), where T is the size of the dictionary (and hence the dimensionality of the vector space) and x_ij ≥ 0, ∀(i, j), is the value of the jth attribute. Typically, each attribute is a distinct term or token in the documents' contents, and its value is defined by a term weighting strategy [Sebastiani (2002)]. Approaches to document representation differ from each other in their interpretation of the terms and in how term weights are assessed. Some researchers have identified the terms with phrases, on syntactic or statistical grounds [Fuhr and Buckley (1991), Tzeras and Hartmann (1993), Schütze et al. (1995)]. In the syntactic view advocated by Lewis (1992), a phrase follows the grammar of the language; Caropreso et al. (2001) argue that a phrase need not be grammatically such, but is composed of a set or sequence of words whose patterns of contiguous occurrence in the collection are statistically significant. Taking each single word as a feature and ignoring the relative position or ordering of the words leads to the well known "bag of words" representation in the TC domain. The vector space model is a form of bag of words representation, one that neither captures the structure of the textual content nor tries to capture the semantic meaning of terms. The two phrases "James is stronger than John" and "John is stronger than James" are therefore equivalent in this representation. Lewis (1992), Apté et al. (1994) and Dumais et al. (1998) have found that representations more sophisticated than this do not yield significantly better effectiveness; their results have also been confirmed by Salton and Buckley (1988) in the information retrieval domain. Lewis (1992) tries to explain this discouraging result: although phrase representations of documents have superior semantic qualities, they have inferior statistical qualities, since a phrase-only representation has "more terms, more synonyms or nearly synonymous terms, lower consistency of assignment (since synonymous terms are not assigned to the same documents) and lower document frequency for terms". Sebastiani (2002) argues that although these remarks of Lewis concern syntactically motivated phrases, they also apply to statistically motivated phrases, though to a lesser degree. Tzeras and Hartmann (1993) obtained significant improvements by combining the two approaches. In probabilistic terms, a word-only representation can be viewed as a unigram model, while a phrase-only representation with phrases of exactly n words can be viewed as an n-gram model. Russell and Norvig (1995, p. 835), for the application of approximating the subject matter of a document, show that the trigram model performs better than the bigram model, which in turn performs better than the unigram model. They note that despite the better results of higher-order n-gram models, such models require a prohibitive amount of documents and time; they illustrate this with the example of a book having a dictionary size of 15,000 words, for which a bigram model alone requires 15000² = 225 million word pairs, about 99.8% of which will have a count of zero. Therefore some approaches have resorted to taking three (or more) consecutive characters as a term.
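To make the order-free property and the n-gram tradeoff concrete, here is a small sketch using scikit-learn's CountVectorizer (an illustrative library choice) on the two example sentences above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["James is stronger than John", "John is stronger than James"]

# Unigram bag of words: both sentences map to exactly the same vector.
unigrams = CountVectorizer()
X1 = unigrams.fit_transform(docs)
print((X1[0] != X1[1]).nnz == 0)        # True: indistinguishable

# Adding bigrams restores some word order information,
# at the cost of a larger feature space.
bigrams = CountVectorizer(ngram_range=(1, 2))
X2 = bigrams.fit_transform(docs)
print(len(unigrams.vocabulary_), len(bigrams.vocabulary_))  # 5 vs. 11
print((X2[0] != X2[1]).nnz == 0)        # False: now distinguishable
```

Even these two five-word sentences more than double the feature count once bigrams are added, which is the blow-up Russell and Norvig describe.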

2.1.2 Term Weights

Three approaches are most commonly used in the literature for term weights. The first is binary weighting, where 1 denotes the presence and 0 the absence of the term, i.e. x_ij ∈ {0, 1}. The second approach is the term frequency (TF) approach, where the weight of a term is the number of times it occurs in the document. Since TF values carry more information than boolean values, one is tempted to think that classifiers using TF values perform better than those using boolean values, but this is not always the case. McCallum and Nigam (1998) have shown experimentally that the multinomial NB performs, in general, better than the multivariate Bernoulli NB in text classification. Their finding has been verified by Schneider (2003), Hovold (2005) and Junejo and Karim (2007a) with experiments in the spam filtering domain. Schneider (2003) found that the multinomial NB surprisingly performed better with boolean values than with TF values. Eyheramendy et al. (2003) have shown that the multinomial NB with TF weights is equivalent to a NB version with terms modeled as following Poisson distributions in each category, assuming that the document length is independent of the category. Therefore multinomial NB may perform better with boolean weights if the terms in the TF version do not follow the Poisson distribution. These two weighting techniques also preserve the compositional semantics of the document. Another technique that is widely used in TC but does not preserve the compositional semantics is the standard term frequency inverse document frequency function (tfidf) [Salton and Buckley (1987), Montanes et al. (2005)]. Intuitively, this function states that the more documents a term occurs in, the less discriminating it is, and the more often a term occurs in a document, the higher its weight for that document. Our method can work with any of the aforementioned weighting techniques as long as a term vector representation is used. In this thesis, however, we restrict ourselves to the binary term vector representation, which has been shown to produce more accurate classifiers in many settings [Bickel (2006)].
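For reference, the standard weighting can be written as tfidf(t_j, d_i) = tf(t_j, d_i) · log(N/df(t_j)), where N is the number of documents and df(t_j) is the number of documents containing term t_j (variants with different normalizations exist). The following sketch computes all three weighting schemes on an invented toy corpus:

```python
import math

docs = [["cheap", "pills", "cheap"], ["meeting", "agenda"], ["cheap", "meeting"]]
N = len(docs)
vocab = sorted({t for d in docs for t in d})
df = {t: sum(t in d for d in docs) for t in vocab}   # document frequency

def term_weights(doc):
    tf = {t: doc.count(t) for t in vocab}             # term frequency weights
    binary = {t: int(tf[t] > 0) for t in vocab}       # binary weights
    tfidf = {t: tf[t] * math.log(N / df[t]) for t in vocab}
    return binary, tf, tfidf

binary, tf, tfidf = term_weights(docs[0])
# "cheap" occurs in 2 of 3 documents, so its idf is low despite tf = 2;
# "pills" occurs in only 1 document, so it gets the highest tfidf here.
print(binary, tf, tfidf, sep="\n")
```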

2.1.3 Discriminative Term Weights

Regardless of the information extraction and document representation approach, the dimensionality of the term space is very large for text classification problems. Therefore feature selection and dimensionality reduction techniques have been extensively studied for text classification [Blum and Langley (1997), Forman (2003b), Dasgupta et al. (2007)]. Techniques for feature selection and dimensionality reduction can be supervised or unsupervised, depending on whether they require class information. For the text classification problem setting discussed in this thesis, supervised techniques are more commonly used [Baker and McCallum (1998), Kullback and Leibler (1951), Dhillon et al. (2003)]. These techniques rely on class information and information theoretic measures, such as entropy, to identify highly relevant terms. They can be classified into wrapper approaches and filter approaches. Wrapper techniques use the classification accuracy of a learning algorithm as their evaluation function and try to maximize it, thus requiring a classifier to be trained for each feature subset evaluated. This makes them computationally expensive, especially for high dimensional problems such as text classification. Filter approaches perform feature selection independently of the learning algorithm, making them more suitable for text classification. Our term weighting and selection technique belongs to this latter category. In particular, we weigh each term by the discrimination information it provides for discriminating one category from the rest; we call these weights discriminative term weights (DTWs). The weights also serve to partition the terms into two sets (for a two class problem), and they can be thresholded for term selection and dimensionality reduction. A novel information pooling technique is adopted to aggregate the discrimination information of each set to form a two-dimensional feature space in which a linear discriminant function is learned. It is worth pointing out the distinction between our proposed discriminative term weights (DTWs) and the (document) term weights (TWs) described in the previous subsection. First, DTWs quantify the discrimination information that terms provide for classification, while TWs quantify the significance of a term within a document. Second, DTWs are defined globally for every term in the vocabulary, while TWs are defined locally for every term in a document. Third, DTWs are computed in a supervised fashion rather than the usual unsupervised computation of TWs. DTWs are not a substitute for TWs; they are defined for the different purpose of classification rather than representation. Feature selection can be performed based on these DTWs, which are calculated from the relative risk (or odds ratio) of each term. Relative risk (RR) and odds ratio (OR) have been used extensively for disease diagnosis in clinical trials [LeBlanc and Crowley (1992), Hsieh et al. (1985)]. RR is the risk of developing a disease relative to exposure; mathematically, RR is the ratio of the probability of the event occurring in an exposed group versus a non-exposed group, whereas OR is the ratio of the odds of an event occurring in one group to the odds of it occurring in another group. In medical research, RR is favored for cohort studies and randomized controlled trials, whereas OR is used in retrospective and case-control studies. Many interesting properties of OR have made it appealing to machine learning and data mining, such as its symmetry under variable permutation, row/column scaling invariance, inversion invariance, null invariance and more [P. N. Tan and Srivastava (2004)]. OR has usually been used for feature selection [Forman (2003b), Z. Zheng and Srihari (2004)] and only occasionally for classification [Turney (2002)], the reason being that at times its performance degrades significantly (as we show in Fig. 6.4.1). Recently, there has been growing reliance on such statistically sound measures for quantifying the relevance of patterns [Li et al. (2005)]. Efficient algorithms for discovering risk patterns, defined as itemsets with high relative risk, are discussed by Li et al. (2007), while direct discovery of statistically sound association rules is presented by Hämäläinen (2010). These measures have also been used in the language processing literature for quantifying term association [Chung and Lee (2001)]. Even though RR is a more intuitive measure, it has mostly been neglected in feature selection and classification approaches because it lacks some of the interesting properties that OR enjoys. But we show in Sect. 6.4.1 that RR is either equal to or better than OR for text classification. Like OR and RR, techniques such as KL divergence, the chi-square statistic, information gain, mutual information, Hellinger distance and many more have been used for feature selection, and some have even been used for classification in the text classification literature [Turney (2002), Forman (2003b), Pang et al. (2002)], but none has given consistent results. We overcome this deficiency by using them as term weights for two partitioned term sets, on which we perform information pooling followed by a discriminative classifier. To the best of our knowledge, RR, OR, KL divergence and the other discussed measures have not been used for a feature transformation that serves as input to a discriminant classifier. By doing so we overcome the problems of using these measures for building competitive and robust text classification methods.
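To make the two measures concrete: for a term t and categories c1 (say, spam) and c2, RR(t) = p(t|c1)/p(t|c2), while OR(t) = [p(t|c1)/(1 − p(t|c1))] / [p(t|c2)/(1 − p(t|c2))], where p(t|c) is the fraction of category-c documents containing t. The sketch below computes both over binary document vectors and partitions the vocabulary by RR, in the spirit of the DTW approach; the toy data and the smoothing constant are illustrative assumptions, not our exact estimator.

```python
import numpy as np

# Binary document-term matrix (rows are documents) and labels (1 = spam).
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]])
y = np.array([1, 1, 0, 0])

eps = 0.5  # additive smoothing to avoid zero probabilities (illustrative)
p1 = (X[y == 1].sum(axis=0) + eps) / ((y == 1).sum() + 2 * eps)  # p(t|c1)
p2 = (X[y == 0].sum(axis=0) + eps) / ((y == 0).sum() + 2 * eps)  # p(t|c2)

rr = p1 / p2                                # relative risk of each term
odds = (p1 / (1 - p1)) / (p2 / (1 - p2))    # odds ratio of each term

# Terms with rr > 1 indicate the positive category; the two index sets
# partition the vocabulary, as done with DTWs for a two class problem.
positive_terms = np.where(rr > 1)[0]
negative_terms = np.where(rr <= 1)[0]
print(rr, odds, positive_terms, negative_terms)
```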


2.1.4 Performance Measures

Accuracy measures how free a classifier is from mistakes in predicting labels. It is the fraction of documents that are correctly classified as positive plus those correctly classified as negative, out of the total number of documents in the test data. In other words, it measures how closely the predicted labels match the original labels. A receiver operating characteristic (ROC) curve is a graphical plot for evaluating the performance of a binary classifier as its threshold is varied. It is created by plotting the true positive rate (TPR) vs. the false positive rate (FPR) at various threshold settings [Bradley (1997)]. TPR is the fraction of true positives (documents correctly labeled as positive) out of the total number of positively labeled documents in the test data, whereas FPR is the fraction of false positives (documents incorrectly labeled as positive) out of the total number of negatively labeled documents in the test data. The assumption is that the classifier can set different thresholds on its final decision to trade TPR off against FPR. If a classifier has a ROC curve that is highly skewed towards the top left corner, it achieves a high TPR at very little expense in FPR, which is a desirable property: the greater the skew, the better the classifier. The area under the curve (AUC) is a summary statistic of the ROC curve and is used for model comparison. It is a more robust measure of classifier performance than accuracy, especially for imbalanced class distributions. The value of AUC ranges from 0 to 1, where 1 is the maximum (100%) score. AUC, when computed with normalized units, is equal to the probability that a classifier will rank a randomly chosen positive document higher than a randomly chosen negative one. The AUC value is considered a robust measure of classifier performance and is often utilized for evaluating and comparing spam filters [Cortes and Mohri (2004), Bickel (2006)].
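A minimal sketch of these measures using scikit-learn; the scores and labels below are invented for illustration.

```python
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

y_true = [1, 1, 1, 0, 1, 0, 0, 0]                   # 1 = positive class
scores = [0.9, 0.8, 0.7, 0.6, 0.35, 0.3, 0.2, 0.1]  # classifier outputs

# Hard categorization at one fixed threshold yields accuracy.
y_pred = [int(s >= 0.5) for s in scores]
print("accuracy:", accuracy_score(y_true, y_pred))

# Sweeping the threshold traces the ROC curve; AUC summarizes the whole
# curve and is therefore threshold independent, which makes it better
# suited to imbalanced class distributions.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC:", roc_auc_score(y_true, scores))
```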

2.2 Spam Filtering

Spam is basically unwanted, unsolicited e-mail sent directly or indirectly to a recipient. In the literature it is also known as junk mail, bulk mail and unsolicited commercial e-mail (UCE). E-mails that are not spam are referred to as legitimate, non-spam or ham e-mails. Spam filtering is the process of classifying incoming e-mail into one of two predefined categories, spam and non-spam. Apart from being unsolicited, spam is sent by an unknown sender to a very large number of users. E-mail spam continues to be a menace that costs users, businesses, and service providers billions of dollars annually [Goodman et al. (2007), Atkins (2003)]. The problem of e-mail spam has engendered an industry sector that provides anti-spam products and services [Leavitt (2007)]. Some of its challenges are highlighted by Fawcett (2003) as: (1) changing proportions of spam and non-spam e-mails, with time and with usage context (e.g. specific user); (2) unequal and uncertain costs of misclassification, making it difficult to evaluate the utility of spam filtering solutions; (3) differing concepts of spam and non-spam e-mails among users; and (4) the adversarial nature of spam, where spammers are continuously trying to evade spam filters. Challenges 1, 3, and 4 can be grouped under the general challenge of handling distribution shift in e-mail systems, which is discussed later in this section. Challenge 2 refers to the high cost of a false positive (FP) in spam filtering. The cost of an FP is higher because having a spam e-mail land in your inbox is less disturbing than having an important e-mail classified as junk, which undermines the trust of the user. We do not address this challenge explicitly, but we do use the AUC (area under the ROC curve) value, in addition to filtering accuracy, for evaluating filtering performance. The AUC value is more sensitive to false positives and is considered a robust measure of classifier performance often used for evaluating and comparing spam filters [Cortes and Mohri (2004), Bickel (2006)]. The spam generating community is also vibrant, developing new strategies and tools to distribute spam [Stern (2008)]. Many technological and non-technological measures have been developed to combat spam [Cormack (2007a)]. Most spam can be divided into four types. The most common form is UCE: commercial advertisements trying to sell products such as medical drugs, software, mortgages and pornography. Some spammers are known to be earning millions of dollars a year through UCE [Sanz et al. (2008)]. A second type of spam is pyramid schemes, which ask the user to give a small amount now to get a huge return in a short duration. The third type of spam is advance fee fraud, also known as Nigerian fraud, in which the spammer poses as a businessman or a government official who is not able to get his money out of his country and offers to deposit all of it in your account; if the user shows interest, the spammer asks for some money upfront to make the transaction, and once you deposit money into their account, they disappear. The fourth type is chain letters and internet hoaxes, which try to emotionally manipulate you into forwarding them to everyone you know. Various measures are employed at various levels to check for spam. Some of them are as follows:

2.2.1 Network Level Approach

This approach relies on the network level information of messages to identify spam, such as the routing path, IP addresses, spammer behavior, details of the e-mail header, etc. It suffers because of the ephemeral nature of IP addresses. More than half of sender IP addresses appear less than twice, and around 10-35% of spam is unlisted at the time of receipt [Ramachandran et al. (2007)]. Every day, 10% of senders are from previously unseen IP addresses [Ramachandran et al. (2007)]. This is due to the ease of dynamic renumbering of IP addresses, hijacking of IP addresses and IP address space, and compromised machines (botnets). Furthermore, these approaches also exhibit a high false positive rate, which is costlier for spam filtering [Esquivel et al. (2009)]. These drawbacks make content based filtering more effective and easier to maintain. A few of the most widely used network level approaches are mentioned below.

White and Black Listings

A white list is a list of senders that are safe, whose e-mail is not to be considered spam. Conversely, a black list is a list of senders who are known for generating spam, so all e-mails generated by them are considered spam [Boykin and Roychowdhury (2005)]. These lists may contain e-mail addresses, IP addresses or domain names, and can be deployed at the client side as well as the server side. The lists are updated manually, so they are cumbersome to maintain. Ramachandran et al. (2007) show that as much as 35% of spam was sent from IP addresses that were not listed by either SpamCop or Spamhaus, both very reputable blacklists. Secondly, these lists are based on the assumption that e-mails from white listed users are always safe, which is not always true because trojans or viruses could send e-mails without the knowledge of the user. Similarly, a genuine user or IP address may end up black listed by a server at the internet service provider (ISP) if spam is sent from a compromised account.

Greylisting

Greylisting is used in addition to white and black lists. The term grey arises because the e-mail is not permanently considered safe or spam; rather, an e-mail received from an unknown sender is temporarily rejected, i.e. greylisted. The assumption is that temporary failures are defined in SMTP, so a legitimate server will re-send the e-mail, at which time it is cleared for the inbox [Levine (2005)]. This works because spammers do not follow the protocol and standards properly, so their rejected e-mails are not re-sent. Some spammers evade this by sending the e-mail a second time to all users after a short while, but by that time many of these e-mails have been identified as spam and hence are not forwarded to the inbox. This technique is effective but delays the delivery of e-mail to the inbox by fifteen minutes to up to four hours. Secondly, some SMTP clients may consider a temporary rejection a permanent failure. Lastly, it may double the traffic. The retry logic is sketched below.
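The core retry logic can be stated in a few lines. The sketch below keys on a (sender IP, envelope sender, recipient) triplet with an illustrative five-minute window; real implementations add entry expiry, persistence and interplay with white lists.

```python
import time

greylist = {}          # triplet -> time of first delivery attempt
RETRY_DELAY = 300      # seconds; an illustrative minimum wait

def greylist_check(sender_ip, mail_from, rcpt_to):
    """Return 'accept' or 'temp-reject' (SMTP 4xx) for an incoming e-mail."""
    triplet = (sender_ip, mail_from, rcpt_to)
    now = time.time()
    first_seen = greylist.get(triplet)
    if first_seen is None:
        greylist[triplet] = now
        return "temp-reject"     # unknown sender: defer and greylist
    if now - first_seen >= RETRY_DELAY:
        return "accept"          # a compliant server retried: deliver
    return "temp-reject"         # retried too soon: keep deferring
```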

Digital Signatures

Digital signatures on e-mail using public key cryptography techniques allow servers to filter out unsigned e-mails [Tompkins and Handley (2003)]. If an e-mail is signed but sent by an un-trusted sender, it is rejected. A user can trust the signer of an e-mail if the sender is also trusted by contacts of the user. The disadvantage of this approach is that signatures are rarely used in legitimate e-mails.

2.2.2 Postage

Spam is not a problem in snail mail because the cost of printing and delivery is high; comparatively, for e-mail it is inexpensive to send millions of messages in a short duration, so if the spammer is able to strike a few sales against millions of e-mails he can still make a profit. The main idea of postage is therefore to assign some cost to an e-mail, be it a monetary cost, a computational delay, a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) that takes up some human time, or anything else that makes sending spam cumbersome or non-profitable [Kraut et al. (2002)]. The major problem with this approach is that enforcing postage requires a major change in the SMTP protocol.

2.2.3 Disposable E-mail Addresses

The idea is to have a separate, automatically generated e-mail address for each sender, which routes e-mails to your permanent address [Yegenian (2008)]. If a sender sends a spam e-mail, the address assigned to him is deleted. This approach has problems of its own: for example, how to give new addresses to new senders, and the risk of deleting the addresses of legitimate senders if spam is sent from their compromised systems.

2.2.4 Collaborative Filtering

Collaborative filtering is based on the principle of a community of users that trust each other: if one (or more) user tags an e-mail as spam, then all the other users are notified that the e-mail is spam [Gray and Haahr (2004)]. The problem with this approach is that what is spam for some users may be legitimate for others. Secondly, spammers bypass it by sending a slightly modified e-mail to each user.

2.2.5 Honeypotting and E-mail Traps

According to Spitzner (2003), a honeypot is a security resource whose value lies in being probed, attacked or compromised. Honeypots in the e-mail domain are known as e-mail traps or spam traps; essentially they are open e-mail relays or public open proxies intentionally made vulnerable so that they can be used to discover spam activities in advance [Provos (2004)]. Unfortunately, spammers have found ways to detect honeypots in advance and avoid them [Zou and Cunningham (2006), Krawetz (2004)].


2.2.6 Content Based Classification

The most successful form of e-mail spam filtering is through content based classifiers. Two types of approaches are used for content based filtering: knowledge engineering and machine learning. Knowledge engineering [Hayes et al. (1990), Goodman (1991)] is the building of logical rules by domain experts or knowledge engineers to classify incoming e-mail. This approach requires the manual construction of rules, which must be changed continuously to adapt to the changing tactics of spammers. These problems gave way to the popularity of machine learning (ML) approaches, which are now dominant in this field [Sebastiani (2002)]. In the ML framework, a classifier or learner is built from a set of e-mails previously labeled through spam traps or by users. The model learned by the classifier is then used to assign new, unseen e-mails to their most relevant category. This technique therefore builds a classifier automatically, without the intervention of a knowledge engineer or a domain expert, capturing characteristics of the data hidden even from the domain expert. If the domain evolves, a new classifier can be learned automatically.

2.2.7 Personalized Spam Filtering

In recent years, there has been growing interest in personalized spam filtering, where spam filters are adapted to individual users' preferences [Gray and Haahr (2004), Junejo et al. (2006), Cheng and Li (2006), Junejo and Karim (2007b), Segal (2007), Chang et al. (2008)]. The adaptation is done in an effort to handle distribution shift between training e-mails and individual users' e-mails, which can cause a single global filter for all users to perform poorly. Personalized spam filtering solutions can be deployed at the client side (user) or at the service side (e-mail service provider or ESP). Effective service-side personalized spam filtering requires that the personalized filter be lightweight and accurate for it to be practically implementable for the millions and billions of users served by ESPs [Kolcz et al. (2006)]. Some solutions for personalized spam filtering rely upon user feedback regarding the labels of their e-mails [Gray and Haahr (2004), Segal et al. (2004), Segal (2007), Chang et al. (2008)]. This strategy burdens the e-mail user with the additional task of aiding the adaptation of the spam filter. Semi-supervised approaches that do not require user feedback have also been presented [Junejo et al. (2006), Cheng and Li (2006), Junejo and Karim (2007b), Bickel and Scheffer (2006)]. Junejo et al. (2006) and Junejo and Karim (2007b) describe a statistical approach, called PSSF, that transforms the input space into a two-dimensional feature space in which a linear discriminant is learned. Personalization is done by updating the transformation and the discriminant function on the (unlabeled) user's e-mails. Extremely lightweight spam filters for large-scale service-side deployment are investigated by Sculley and Cormack (2009). They report good personalization performance when lightweight filters are combined with a global filter (a hybrid approach). However, their approach is supervised and requires user feedback for personalization.

2.2.8 Concept Drift

The notion of concept drift is closely related to distribution shift, with the implicit understanding that distribution shift occurs continuously over time. Delany et al. (2005b) propose a case-based reasoning approach for handling concept drift in e-mail systems; their strategy for maintaining the case base requires knowledge of the correct labels. Ensemble approaches are known to be robust to drifting data. The ensemble update process can be supervised [Delany et al. (2006)] or semi-supervised [Cheng and Li (2006), Katakis et al. (2010)]. Cheng and Li (2006) present a graph-based semi-supervised approach combining a support vector machine (SVM), a naive Bayes classifier, and rare word distributions for personalized spam filtering. Katakis et al. (2010) address the problem of recurring contexts in e-mail systems by presenting an ensemble method in which each classifier in the ensemble models a particular context. Semi-supervised SVM (or transductive SVM) is adopted by Mavroeidis et al. (2006) for personalized spam filtering. Our local model uses an ensemble approach for constructing features from discriminating terms, while we adopt the naive semi-supervised learning approach for adaptation [Xue and Weiss (2009)].

2.2.9 Domain Adaptation

The notion of distribution shift between training and test data has also been studied in the literature as a domain adaptation problem. Recently, there has been much interest in the domain adaptation problem, where a classifier is adapted from a source domain to a target domain [Jiang (2007)]. The target domain data may be labeled or unlabeled. Blitzer et al. (2006) present a domain adaptation approach based on learning frequently occurring features (pivot features) in the source and target domains. The weights of the learned classifiers are then used to transform the input representation into a low dimensional feature space in which the task classifier is built. The importance of having a feature space in which the two domains are less different is highlighted by Ben-David et al. (2007). Glorot et al. (2011) perform domain adaptation for sentiment classification; they propose a deep learning approach (discovering intermediate representations) which learns to extract a meaningful representation for each review in an unsupervised fashion. Zhang and Zhou (2012) use domain adaptation for multi-task clustering, and Lai and Fox (2010) employ it for object recognition in 3D point clouds. It has also been applied to summarization, entity recognition, dialog act tagging, parsing, etc. [Daumé et al. (2010)].

2.3 Data Mining and Machine Learning

The data mining and machine learning literature has extensive coverage of learning paradigms and problem settings appropriate for spam filtering. A comprehensive discussion of all related areas is beyond the scope of this thesis; we therefore focus on key contributions and areas that have direct relevance to our work. Since we propose an ensemble to construct local features, we begin with work on building global models from local patterns.

2.3.1 Global Models from Local Patterns (LeGo)

Building global models from local patterns is a promising approach to classification [Knobbe et al. (2008), Bringmann et al. (2009), Knobbe and Valkonet (2009)]. This approach, often called the LeGo (from local patterns to global models) framework for data mining, focuses on finding relevant patterns in data that are then used as features in global models of classification. Local patterns can be model-independent or model-dependent [Bringmann et al. (2009)]. It is also desirable that these patterns form a non-redundant and optimal set, or pattern team [Knobbe and Valkonet (2009)]. Recently, the LeGo approach has been used in a wide variety of classification and clustering approaches [Atzmuller et al. (2009), Dembczynski et al. (2010), Azevedo and Jorge (2010), Malik et al. (2010b), Nijssen and Fromont (2010)]. Our local patterns are model-dependent, as they are based on discrimination information, and they form pattern teams based on the relative risk statistical measure. Our global model is a linear classifier that operates on features derived from the pattern teams. We also relate our local and global approach to hybrid generative/discriminative models for classification.

2.3.2 Generative, Discriminative and Hybrid Methods

Supervised text classifiers can be based on generative or discriminative learning. The generative approach to supervised learning produces a probability model over all variables of the training data (both input and label variables) and manipulates it to compute classifications or regressions for unseen data. This is done by estimating the class priors and class-conditional distributions. It is called generative because these two distributions can be used to randomly generate the training data. In the discriminative approach to supervised learning, on the other hand, the posterior distribution of the class given the document is estimated directly. The class conditional and prior probabilities are not necessarily computed; therefore the training data cannot be generated, only the labels. Thus a generative model is a full probabilistic model of all input variables and labels, whereas a discriminative model provides a model only for the target (label) variable conditional on the observed input variables. The former approach is more flexible, elegant and explanatory, whereas the latter can yield better results and can be used as a black box. Generative models tend to hold up comparatively well when training data is scarce, but discriminative models typically surpass them as the training data grows.

Generative Methods

The most common generative classifier is naive Bayes [Seewald (2007), Kolcz and Yih (2007)]. This classifier results from the application of Bayes' rule with the assumption that each variable is independent of the others given the class label. In the literature there are three types of naive Bayes classifiers used for text classification: multivariate Bernoulli, multinomial with term frequency (TF) weights, and multinomial with Boolean weights. Since multinomial NB with TF weights contains more information than the one with only Boolean weights, it is reasonable to think that it would perform better, but results show otherwise. McCallum and Nigam (1998) have shown experimentally that the multinomial NB performs in general better than the multivariate Bernoulli NB in text classification. Their finding has been verified by Schneider (2003), Hovold (2005) and Junejo and Karim (2007a) with experiments in the spam filtering domain. Schneider (2004) found that the multinomial NB surprisingly performed better with Boolean values than with TF values. Eyheramendy et al. (2003) have shown that the multinomial NB with TF weights is equivalent to a NB version with terms modeled as following Poisson distributions in each category, assuming that the document length is independent of the category. Therefore multinomial NB may perform better with Boolean weights if the terms in the TF version do not follow the Poisson distribution. Metsis et al. (2006) give a good comparison of the above mentioned versions of naive Bayes together with two more models, namely multivariate Gauss NB and flexible Bayes, both of which assume an underlying probability distribution for the terms in the documents. They conclude that multinomial NB with Boolean weights and flexible Bayes are the better of the lot, and prefer the former for textual content based spam filtering; we compare our results to this version of naive Bayes. Another successful probabilistic classifier, which has similarities to naive Bayes [Juan et al. (2007)], is maximum entropy [Nigam et al. (1999)]. The maximum entropy (ME) classifier estimates the joint probability distribution by maximizing its entropy, constrained by the empirical distribution. It is commonly used as an alternative to the naive Bayes classifier because it does not assume statistical independence of the features. However, it assumes collinearity to be relatively low, because it becomes difficult to differentiate between the impacts of several features if they are highly correlated. Learning a maximum entropy model is slower than learning a naive Bayes classifier, but ME has been shown to sometimes perform significantly better than naive Bayes [Nigam et al. (1999)].
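The three naive Bayes variants map directly onto standard implementations; the sketch below contrasts them on an invented toy dataset (on real corpora, the studies cited above report that the multinomial model with Boolean weights is often the strongest of the three).

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Term frequency document vectors and labels (1 = spam); toy data.
X_tf = np.array([[3, 0, 1], [2, 1, 0], [0, 2, 0], [0, 1, 2]])
y = np.array([1, 1, 0, 0])
X_bool = (X_tf > 0).astype(int)      # Boolean weights: presence/absence

models = {
    "multinomial, TF weights": MultinomialNB().fit(X_tf, y),
    "multinomial, Boolean weights": MultinomialNB().fit(X_bool, y),
    "multivariate Bernoulli": BernoulliNB().fit(X_bool, y),
}

x_new = np.array([[1, 0, 1]])        # a new document to classify
for name, clf in models.items():
    print(name, clf.predict_proba(x_new)[0])
```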


Discriminative Methods

The decision tree induction algorithm and the rule induction algorithm are simple to understand and interpret; however, they do not work well when the number of distinguishing features between documents is large, as is the case in text classification. The k-NN algorithm, on the other hand, is easy to implement and shows its effectiveness in a variety of problem domains, but its major drawback is that it is computationally expensive. The most popular discriminative text classifier in the literature is therefore the SVM [Joachims (1998), Peng et al. (2008)]. The SVM, which is based on statistical learning theory and structural risk minimization, learns a maximum margin linear discriminant in a (possibly) high dimensional feature space. According to Joachims (1998), feature selection is often not needed for SVMs as they tend to be robust to overfitting and can scale up to considerable dimensionalities, making them suitable for textual data. He further emphasizes that no human or machine effort in parameter tuning on a validation set is needed, because there is a theoretically motivated default choice of parameter settings which has also been shown to provide the best effectiveness. Even so, we experiment to find the best parameter values when comparing against our technique. One well known discriminative classifier for computing good class boundaries is Rocchio classification [C. D. Manning (2009)]. The centroid of a class is computed as the center of mass of its members, and the boundary between two classes is the set of points with equal distance from the two centroids; the boundaries of the class regions are thus hyperplanes. The classification rule is quite simple: a point is classified in accordance with the region it falls into. This algorithm is efficient in computation, taking the same time as a naive Bayes classifier, and has a relevance feedback mechanism, but it has low classification accuracy. Furthermore, the classes must be approximate spheres with similar radii, a condition rarely met in TC, and it performs poorly for multi-modal classes. A better discriminative classifier than Rocchio is the kNN classifier; it deals better with classes that have non-spherical, disconnected or otherwise irregular shapes because it determines the decision boundary locally. It is a memory based learner and therefore not very effective where there is a distribution shift between the training and test data or where the distribution changes with time (concept drift). Being a lazy learning method without pre-modeling, it has a high cost for classifying new documents when the training set is large [Miao et al. (2009)]. Balanced winnow is another example of a discriminative classifier; it learns a linear discriminant in the input space by minimizing the mistakes made by the classifier [Dagan et al. (1997)]. It is very similar to the perceptron algorithm; however, the perceptron uses an additive weight-update scheme while winnow uses a multiplicative one, which allows it to home in quickly on the desired weight vector in spite of the astronomical number of available features. Carvalho and Cohen (2006) show that its performance is comparable to, and sometimes even better than, a linear SVM. The two update rules are contrasted in the sketch below.
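This is a minimal sketch of a perceptron-style additive update and a winnow-style multiplicative update on a binary feature vector; the threshold and promotion factor are illustrative textbook choices, not the exact Balanced Winnow of Dagan et al.

```python
import numpy as np

def perceptron_update(w, x, y, lr=1.0):
    # Additive: on a mistake, add or subtract the feature vector.
    if y * np.dot(w, x) <= 0:
        w = w + lr * y * x
    return w

def winnow_update(w, x, y, alpha=2.0):
    # Multiplicative: on a mistake, promote or demote the active
    # features by a factor of alpha, which converges quickly even
    # in very high dimensional feature spaces.
    theta = len(x)                       # a common threshold choice
    if y * (np.dot(w, x) - theta) <= 0:
        w = w * np.power(alpha, y * x)
    return w

x = np.array([1.0, 0.0, 1.0])            # binary document vector
print(perceptron_update(np.zeros(3), x, y=+1))   # -> [1. 0. 1.]
print(winnow_update(np.ones(3), x, y=+1))        # -> [2. 1. 2.]
```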

Hybrid Methods

Multiple classifiers can be combined to form a single classifier. This technique is referred to as classifier ensembles, combiners, or committees. They aggregate several classifiers by combining their individual predictions through voting or averaging [Oza and Tumer (2008)]. They generally provide better performance than their constituent models (also known as base models) [Ho et al. (1994), Tumer and Oza (2003)], but are generally slower because multiple models are learned. There are four ways to build classifier ensembles [Ikonomakis et al. (2005)]: i) partitioning the training data based on examples (with or without replacement) and learning a separate model for each partition; ii) partitioning the training data based on features (with or without replacement) and learning a separate model for each partition; iii) without partitioning, using different training parameters with a single training method; and iv) without partitioning, using different learning methods. This last type is commonly referred to as hybrid methods, and our approach belongs to this category. The most common form of hybrid method combines generative and discriminative models for classification [Jaakkola and Haussler (1998), Raina et al. (2003), McCallum et al. (2006), Liu et al. (2007), Junejo and Karim (2008)]. For example, the output of a naive Bayes classifier may be used as input to a support vector machine (SVM) that makes the final decision [Isa et al. (2008)]. These classifiers try to exploit the strengths of generative and discriminative learning by first learning the data distribution and then building a discriminative classifier using the learned distribution.


Several variants of this general concept have been explored with promising results; one such stacking is sketched below. Although the algorithms proposed in this thesis are not truly hybrid generative/discriminative in nature, they have a close correspondence to such algorithms, as discussed later in this thesis.
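As an illustration of such a hybrid, the sketch below stacks a generative naive Bayes model under a discriminative SVM: the class log-probabilities produced by naive Bayes become the low-dimensional feature space in which the SVM learns the final boundary. The toy data is invented, and this is one illustrative stacking rather than the exact method of Isa et al.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy binary term vectors and labels (1 = spam).
X = np.array([[1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 0, 1], [0, 0, 1, 1]])
y = np.array([1, 1, 0, 0])

# Stage 1 (generative): naive Bayes models the data distribution.
nb = MultinomialNB().fit(X, y)
Z = nb.predict_log_proba(X)     # low dimensional generative features

# Stage 2 (discriminative): an SVM learns the final decision boundary
# in the feature space induced by the generative model.
svm = LinearSVC().fit(Z, y)
print(svm.predict(Z))           # predictions for the training documents
```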

2.3.3 Semi-Supervised Methods

Supervised learning is the traditional form of machine learning, in which a model is learned on a set of labeled data known as the training data. The model is then evaluated on unlabeled data, known as the test data; during learning the model is unaware and independent of the test data. Semi-supervised learning (SSL) is a class of machine learning algorithms that makes use of both labeled and unlabeled data at the time of learning, with the learned model then evaluated on the unlabeled data. The SSL paradigm was originally proposed for improving classifier performance on a task when labeled data are limited but unlabeled data are plentiful, though oftentimes a large amount of training data is also used. SSL can be transductive or inductive. Transductive learning refers to the scenario where the goal is to learn the correct labels for the unlabeled data, whereas the goal of inductive learning is to learn a model that can map a document x_i onto Y; this makes inductive SSL more general than transductive SSL. Our technique belongs to the inductive SSL approach. Several SSL approaches have been proposed, such as generative mixture models, self-training, co-training, and graph-based propagation [Zhu (2005)]. In recent years, however, all approaches that rely upon labeled and unlabeled data have come to be considered semi-supervised, regardless of whether their problem settings and motivations are identical to those originally proposed for the paradigm. As such, semi-supervised learning has been applied to problems involving distribution shift. Xue and Weiss (2009) investigate quantification and semi-supervised learning approaches for handling shift in class distribution. Bickel et al. (2009) present a discriminative learning approach for handling covariate shift; note that for handling covariate shift quantification is not necessary, as p(x) can be estimated from the unlabeled data. Another machine learning paradigm with relevance to personalized spam filtering is transfer learning, which involves the adaptation of a classifier for one task to a related task, given data for both tasks. Many transfer learning approaches are semi-supervised, requiring only unlabeled data for the new task [Xing et al. (2007), Raina et al. (2007)]. Other semi-supervised learning methods proposed include co-training [Blum and Mitchell (1998)], the EM method [Nigam et al. (2000)], the bootstrap method [Collins and Singer (1999)], SVM and TSVM [Joachims (1999b), Collobert et al. (2006)], information-based regularization [Szummer and Jaakkola (2003)], Bayesian networks [Cohen et al. (2004)], Gaussian random fields [Zhu et al. (2003)], manifold regularization [Belkin et al. (2006)], discriminative-generative models [Ando and Zhang (2005), Bouchard and Triggs (2004), Kang and Tian (2006)], Dirichlet-enhanced NB [Bickel and Scheffer (2006)], multi-conditional learning [Kelm et al. (2006)], multi-instance learning [Jorgensen et al. (2008)], self-taught learning [Raina et al. (2007)], etc. Our technique uses the naive semi-supervised learning (SSL) approach used in Xue and Weiss (2009). The motivations for using naive SSL instead of more sophisticated approaches (e.g. self-training SSL and others mentioned above) are two-fold. First, most of the other approaches are proposed for situations where the training and test data follow the same distribution. In e-mail systems, on the other hand, the distribution of e-mails in the training data can be significantly different from that in the test data, and this difference is in general not easily quantifiable. Second, in e-mail systems distribution shift occurs continuously over time, and it is better to personalize a filter from one time period to another rather than from the training data to the current time period. The naive SSL approach decouples the adaptation from the training data and as such can be performed on the client side in addition to the service side.
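The naive SSL procedure we adopt can be stated in a few lines: train a filter on the labeled training data, pseudo-label the user's unlabeled e-mails with it, then retrain on those pseudo-labels alone, decoupling adaptation from the original training data. The sketch below is a schematic with an arbitrary scikit-learn-style base classifier; it is not the full PSSF algorithm.

```python
from sklearn.naive_bayes import MultinomialNB

def naive_ssl(X_train, y_train, X_user, base_classifier=MultinomialNB):
    """One round of naive semi-supervised adaptation.

    A global filter is learned from the labeled training data and used
    to pseudo-label the user's (unlabeled) e-mails; the personalized
    filter is then retrained purely on those pseudo-labels, so the
    final model reflects the user's own distribution.
    """
    global_model = base_classifier().fit(X_train, y_train)
    pseudo_labels = global_model.predict(X_user)
    personalized_model = base_classifier().fit(X_user, pseudo_labels)
    return personalized_model
```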


Chapter 3

Discovery Challenge Problem

3.1 Introduction

Recently there have been many e-mail spam competitions and challenges. The Text REtrieval Conference (TREC) held a spam track for three consecutive years [Cormack and Lynam (2005), Cormack (2006a; 2007b)], the Conference on Email and Anti-Spam (CEAS) held two spam filter challenges [Segal and Cormack (2007), Segal et al. (2008)], and the European Conference on Machine Learning (ECML) conducted one such competition as part of its Discovery Challenge Workshop (DCW) [Bickel (2006)]. All of these challenges addressed the content based e-mail spam filtering problem, but under different scenarios. The discovery challenge spam filtering problem was unique in that it provided a standard pre-processed dataset: stemming, lemmatization, stop word removal, word-phrase structure techniques and many other natural language processing (NLP) techniques had been applied beforehand. This standardized the feature set so that participants could focus solely on the machine learning (ML) aspects of spam filtering. The DCW therefore offered, for the first time, a direct comparison of different learning techniques on an e-mail spam dataset. Secondly, the previous challenges were based on the classical machine learning assumption that the test data is drawn from the same distribution as the training data. This assumption is rarely true in spam filtering for three reasons: a) user e-mails are not easily available because of privacy concerns, so training data is consolidated from various different sources; b) spammers continuously change their e-mails to bypass the filters; and c) the interests and behaviors of users change over time. Doing away with this classical assumption encouraged techniques that not only accommodate this distribution shift but are also robust to evolving distributions. The challenge focused for the first time on personalized spam filtering, i.e. a filter tuned for each individual user. Personalized filters are the only solution for gray e-mails, i.e. e-mails that are spam for some users while they are not spam for others. Since the number of users of an e-mail service provider runs into the millions, the filter should be scalable and efficient. 57 teams from 19 countries participated in the competition. Most of the techniques were semi-supervised learning (SSL) techniques and their variants, including graph-based algorithms, self-training approaches, large-margin-based methods, multi-view learning methods, and positive-only learning methods. We participated in one of the two tasks and won the first position (Performance Award) with the highest AUC (area under the ROC curve) value; our technique achieved this with a small running time and memory footprint. The DCW spam filtering challenge has since been of central importance in the area of e-mail spam filtering, with its datasets serving as a benchmark for the evaluation of spam filters to this date, and its learning techniques being cited in text classification and spam filtering work, especially personalized spam filtering. The rest of this chapter is broken into four sections. Section 3.2 looks at the details of the DCW spam challenge: the two tasks that were posed, their evaluation criteria and the datasets that were provided. This is followed by the details of the results and a brief look at the successful techniques in Sect. 3.3. In Sect. 3.4 we discuss our solution in detail, with mention of improvements and additional results. We conclude the chapter in Sect. 3.5.

3.2 The Spam Challenge

Every year billions of dollars are lost by companies because their employees and servers are bombarded by spam e-mails; many employees are forced to read spam e-mails to make sure they are not deleting a legitimate e-mail, while others read them to label them appropriately so that their local spam filters get trained. In this competitive market, e-mail service providers (ESPs) spend a lot of time and money to lessen this burden on users and try to come up with better and more efficient server side spam filters. But these filters face a daunting task: they cannot rely on labels from the users. They have to circumvent this limitation, and they do so by gathering labeled e-mails from publicly available sources such as spam traps and newsgroups to form a huge training dataset. This labeled data poses new challenges of its own. Primarily, it is not a good representative of the e-mails received by individual users, mainly because the interests of each user vary. Secondly, what is spam for one user might be legitimate for another. Lastly, the training data was collected some time ago, while the distribution of e-mails changes every day. This gives rise to the need for a mechanism to cope with the change in distribution, so that the learned filters can adapt to each individual user's e-mail characteristics. Conventional service side e-mail filters do not take these issues into consideration, even though ESPs have just the right resource for it: the unlabeled data of each particular user. Each user has a huge number of unlabeled e-mails which the ESP can utilize to adapt the filter for him, and if the number of e-mails is not sufficient, a generalization can be made over many users. The DCW covered this real world problem: labeled training data gathered from publicly available sources and unlabeled user inboxes as test data. The inboxes differ in distribution from the training data and from each other as well. The goal of the challenge is to learn a filter for each user such that it classifies his e-mails correctly as spam and non-spam. This setting puts certain constraints on the learning algorithm: first, it cannot rely on the training data alone, because the difference in distribution between the training data and the user inboxes will prevent it from learning the correct model; second, it should provide a mechanism to make use of the e-mails in the unlabeled inboxes to learn the true model for each user. Even though every user has different interests, the inboxes are not likely to be completely independent, because the same spam e-mails are sent to many users. So the inboxes are neither identical nor independent, a property that could also be exploited by the learning algorithm. Many machine learning techniques use cross validation for parameter tuning. This is not possible in this scenario, as no labels are provided for the inboxes. The DC challenge inherently motivates the use of SSL techniques, but it differs from conventional SSL scenarios in three ways: i) the distribution of the training data is significantly different from the distribution of the test data; ii) more than one distinct unlabeled test dataset is given, so a transfer learning or multi-task learning approach can exploit the similarity between the inboxes; and iii) the number of labeled e-mails in the training data is greater than the number of unlabeled e-mails.

3.2.1 Evaluation Measure

The predicted labels for all the inboxes were submitted to the challenge chair, who calculated AUC values against the true labels, which only he held. The winners for each task were decided on the basis of the average AUC over all the users for that task. The AUC value is defined as the area under the receiver operating characteristic (ROC) curve, where the ROC is a plot of true positive rate vs. false positive rate while sweeping through all possible threshold values on the classifier's output. The value of AUC ranges from 0 to 1, where 1 is the maximum (100%) score. The choice of the AUC value as the evaluation measure is due to the cost sensitive nature of the e-mail spam prediction problem, as a false positive error is costlier than a false negative. Furthermore, AUC is considered a more robust measure for evaluating the performance of a filter [Cortes and Mohri (2004)].

3.2.2 Datasets

There are separate datasets for the two tasks, each having one training set and several user inboxes for evaluation. The training set sizes for task A and task B are 4000 and 100 e-mails respectively, while each individual user inbox contains 2500 e-mails for task A and 400 for task B. The number of inboxes for evaluation is 3 for task A and 15 for task B. Each training set and user inbox consists of 50% spam and 50% non-spam e-mails. Both training sets have the same composition: the spam e-mails come from blacklisted servers of the Spamhaus project (www.spamhaus.org), while the non-spam e-mails are consolidated from two different sources, 40% from the SpamAssassin corpus and 10% from e-mails sent by around 100 different subscribed German and English newsletters. Henceforth, the datasets for task A and task B are referred to as ECML-A and ECML-B, respectively. Correct classification in task B is more difficult than in task A because its training set is very small in terms of both the labeled and the unlabeled e-mails. The inboxes for evaluation have been constructed from real users whose e-mails had been made public. The non-spam portion is selected from e-mails received by distinct Enron employees from the Enron dataset [Klimt and Yang (2004)]. The spam portion of these inboxes is also collected from different spam sources; some inboxes share the same source, but special care was taken not to overlap the e-mails. For details of the source of each user inbox, the interested reader is referred to the proceedings of the DCW [Bickel (2006)]. E-mails in these datasets are represented by a list of tokens and their counts (bag-of-words representation) within the e-mail content (the Subject field of the e-mail headers is also included). Each word in an e-mail is therefore represented by an id-value pair, i.e. the id of the word and the number of times that word occurs in that e-mail. Each e-mail is thus represented as a feature vector of term frequencies. This representation is based on a common dictionary (vocabulary) for all e-mails in ECML-A and a common dictionary for all e-mails in ECML-B. Stop words and HTML tags have been removed. Tokenization was done with the X-tokenizer proposed by Siefkes et al. (2004).
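For concreteness, an e-mail in this id:count format maps naturally to a sparse term-frequency vector. The sketch below assumes a hypothetical line layout (a label followed by space-separated id:count pairs), which approximates but may not exactly match the challenge's file format.

    from collections import Counter

    def parse_email_line(line: str):
        """Parse one e-mail given as '<label> id:count id:count ...'.

        Returns (label, Counter mapping word id -> term frequency). The exact
        layout of the challenge files may differ; this only illustrates the
        bag-of-words id:count representation described above.
        """
        parts = line.split()
        label = int(parts[0])          # e.g. 1 = spam, -1 = non-spam
        tf = Counter()
        for pair in parts[1:]:
            word_id, count = pair.split(":")
            tf[int(word_id)] = int(count)
        return label, tf

    label, tf = parse_email_line("1 4:2 17:1 103:5")
    print(label, dict(tf))             # -> 1 {4: 2, 17: 1, 103: 5}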

3.3 Participation and Results

A total of 57 teams from 19 different countries participated in the competition, but only 26 successfully submitted results, and only a few teams submitted results for both tasks. Twenty of the teams were from academia while six were from commercial companies. We submitted results for task A only, and topped the ranking with an average AUC value of 95.07%. The top entries, with their average AUC values in decreasing order, are listed in Table 3.1 for task A and Table 3.2 for task B. For complete results the reader is referred to Bickel (2006). Some participants have reported a higher AUC in their workshop and conference papers; that is because they improved their results after the submission deadline of the challenge. In the subsequent subsection we take a brief look at the winning techniques.

3.3.1 Techniques Used

Most of the submitted techniques were based on SSL and its variants, such as graph-based algorithms [Pfahringer (2006)], self-training approaches [Junejo et al. (2006), Cormack (2006b)], large-margin-based methods [Kyriakopoulou and Kalamboukis (2006), Mavroeidis et al. (2006), Trogkanis and Paliouras (2006)], multi-view learning methods [Mavroeidis et al. (2006)], and a positive-only learning approach [Trogkanis and Paliouras (2006)]. Most of the submissions relied on the classical machine learning assumption that the unlabeled and labeled data share the same distribution. Even though this assumption is violated in this problem setting, the SSL techniques nevertheless reduced the error compared to methods that did not utilize the unlabeled data. Pfahringer (2006) accounts for the bias between training and evaluation data quite interestingly. To predict an e-mail from a user inbox, he transforms the whole training set by selecting only those features that are actually present in that particular e-mail (a sketch of this transform is given after Table 3.1 below). This transformed training set is then used to learn the classification model that predicts the label for the e-mail message in question, forcing the learner to concentrate only on the features that are actually present in the e-mail. Furthermore, he uses the learning algorithm proposed by Zhou et al. (2003), which is one of the best known graph-based SSL algorithms. Cormack (2006b) uses dynamic Markov modeling as a sequential bit-wise prediction technique to label e-mails. He induces an initial maximum likelihood classifier by applying logistic regression to the training data, and then iteratively applies dynamic Markov modeling to calculate successive log likelihood estimates on the combined set of user inboxes, ordered in decreasing magnitude of the log likelihood ratio. Successive estimates are averaged to form an overall estimate. Its performance lagged severely on task A. Trogkanis and Paliouras (2006) used a new technique called TPN2 (they do not mention what TPN2 stands for). TPN2 is a four-stage approach that combines fully-supervised and positive-only learning methods. It works on the underlying assumption that the positive examples in the labeled


and unlabeled data are more similar than the negative ones. Based on this assumption, TPN2 iteratively selects the positive examples that are most confidently identified from the unlabeled data by the classifier trained on the labeled data. These positive examples iteratively extend the positive set. Finally, they refine their classifier by using a different positive-only learner that finds strong negative examples. Gupta et al. (2006), who achieved the third highest AUC in both tasks, and the remaining teams did not submit details of their solutions.

Table 3.1: Top teams of ECML Task A. The values are in percentages.

Teams                                                                      Avg. AUC
Khurram Nazir Junejo, Mirza Muhammad Yousaf and Asim Karim                 95.07
    Lahore University of Management Sciences, Pakistan
Bernhard Pfahringer                                                        94.91
    University of Waikato, New Zealand
Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah and Chirag Taneja         94.87
    Inductis India Pvt Ltd
Nikolaos Trogkanis and Georgios Paliouras                                  93.65
    National Technical University of Athens, Greece;
    National Center of Scientific Research "Demokritos", Greece, resp.
Chao Xu and Yiming Zhou                                                    92.78
    School of Computer Science and Engineering, Beijing University, China
Lalit Wangikar, Mansi Khanna, Ankush Talwar, Nikhil Marwah and             92.77
Chirag Taneja
    Inductis India Pvt Ltd
Dimitrios Mavroeidis, Konstantinos Chaidos, Stefanos Pirillos,             91.44
Dimosthenis Christopoulos and Michalis Vazirgiannis
    DB-NET Lab, Informatics Dept., Athens University EB, Greece
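As a rough illustration of Pfahringer's per-message feature restriction described above, the sketch below projects the training matrix onto the features present in a single test e-mail before fitting a model. The classifier choice and all names are our own assumptions; the original used the graph-based learner of Zhou et al. (2003), not logistic regression.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def predict_with_feature_restriction(X_train, y_train, x_test):
        """Train on only the features that occur in the test e-mail.

        X_train: (n_docs, n_terms) term matrix; y_train: labels;
        x_test: (n_terms,) vector of one e-mail. Illustrative only.
        """
        present = np.flatnonzero(x_test > 0)     # features in this e-mail
        clf = LogisticRegression(max_iter=1000)  # stand-in for the SSL learner
        clf.fit(X_train[:, present], y_train)
        return clf.predict(x_test[present].reshape(1, -1))[0]

    # Tiny synthetic example
    X = np.array([[2, 0, 1, 0], [0, 3, 0, 1], [1, 0, 2, 0], [0, 1, 0, 2]])
    y = np.array([1, 0, 1, 0])                   # 1 = spam, 0 = non-spam
    print(predict_with_feature_restriction(X, y, np.array([1, 0, 3, 0])))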

3.4 Our Approach

Our personalized spam filtering algorithm consists of two phases of processing. In the first phase, called the training phase, the algorithm learns a statistical model of spam and non-spam words in a single pass over the training set. The second phase, called the specialization phase, adapts the general statistical model to the characteristics of the individual user's inbox. It consists of two or more passes over the user's inbox. In the first pass, the statistical model developed in the training phase is used for the initial classification of the user's e-mails. Subsequently, the

Algorithm 1 Our automatic personalized spam filtering algorithm
N = total number of e-mails
NS = number of spam e-mails
NN = number of non-spam e-mails
T = number of words in dictionary (indexed from 1 to T)
CSi = count of word i in all spam e-mails
CNi = count of word i in all non-spam e-mails
ZS = set of significant spam words
ZN = set of significant non-spam words
t = threshold
s = scale factor
WSi = weight associated with significant spam word i
WNi = weight associated with significant non-spam word i
Ti = word i

Training Phase (Phase 1 on Training Set)
Build Significant Word Model procedure:
-For each distinct word i in the dataset find CSi and CNi
-Find the significant spam words ZS such that for each word Ti in ZS, CSi/NS - CNi/NN > t
-Find the significant non-spam words ZN such that for each word Ti in ZN, CNi/NN - CSi/NS > t
-For each significant spam and non-spam word find its weight as follows:
    WSi = [CSi/CNi] * [NN/NS], for all words in ZS
    WNi = [CNi/CSi] * [NS/NN], for all words in ZN

Specialization Phase (Phase 2 on Evaluation Set)
Initial Passes
Score Emails procedure:
-For each e-mail in the evaluation dataset
    spam score = sum of WSi (over all significant spam words in the e-mail)
    nonspam score = sum of WNi (over all significant non-spam words in the e-mail)
    If (s * spam score > nonspam score) then classify as spam; otherwise classify as non-spam
    output s * spam score - nonspam score
-Build the statistical model concurrently with scoring e-mails (using the Build Significant Word Model procedure given above)

Last Pass
-Score and classify e-mails using the updated statistical model (procedure identical to the Score Emails procedure given above)
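A minimal Python sketch of Algorithm 1 follows, assuming e-mails are given as (label, word-count dictionary) pairs. Variable names mirror the pseudo-code; the small constant guarding against division by zero is our own assumption.

    def build_significant_word_model(emails, t):
        """emails: list of (label, {word_id: count}); label is 'spam'/'nonspam'."""
        NS = sum(1 for lab, _ in emails if lab == "spam")
        NN = len(emails) - NS
        CS, CN = {}, {}
        for lab, counts in emails:
            target = CS if lab == "spam" else CN
            for w, c in counts.items():
                target[w] = target.get(w, 0) + c
        WS, WN = {}, {}
        for w in set(CS) | set(CN):
            cs, cn = CS.get(w, 0), CN.get(w, 0)
            if cs / NS - cn / NN > t:            # significant spam word
                WS[w] = (cs / max(cn, 1e-6)) * (NN / NS)
            elif cn / NN - cs / NS > t:          # significant non-spam word
                WN[w] = (cn / max(cs, 1e-6)) * (NS / NN)
        return WS, WN

    def score_email(counts, WS, WN, s):
        spam = sum(WS[w] for w in counts if w in WS)
        nonspam = sum(WN[w] for w in counts if w in WN)
        return "spam" if s * spam > nonspam else "nonspam"

    def specialize(model, inbox, t, s, passes=2):
        """Specialization phase: relabel the inbox and rebuild the model."""
        WS, WN = model
        for _ in range(passes):
            labeled = [(score_email(c, WS, WN, s), c) for c in inbox]
            WS, WN = build_significant_word_model(labeled, t)
        return [score_email(c, WS, WN, s) for c in inbox]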


Table 3.2: Top teams of ECML Task B. The values are in percentages.

Teams                                                                      Avg. AUC
Gordon Cormack                                                             94.65
    University of Waterloo, Canada
Nikolaos Trogkanis and Georgios Paliouras                                  91.83
    National Technical University of Athens, Greece;
    National Center of Scientific Research "Demokritos", Greece, resp.
Kushagra Gupta, Vikrant Chaudhary, Nikhil Marwah and Chirag Taneja         90.74
    Inductis India Pvt Ltd
Dyakonov Alexander                                                         89.92
    Moscow State University, Russia
Wenyuan Dai                                                                89.33
    Apex Data and Knowledge Management Lab, Shanghai Jiao Tong University

Table 3.3: Performance results of our algorithm with parameters t = 8 and s = 13 tuned on the tuning datasets. The values are in percentages.

Inbox     AUC      Precision   Recall   Accuracy   FP      TP
1         0.9832   85.53       98.88    91.08      16.72   98.88
2         0.9896   91.87       98.56    94.92      7.76    98.08
3         0.9898   98.68       90.24    94.52      1.20    90.24
Average   0.9875   92.02       95.89    93.50      8.56    95.73

statistical model is updated to incorporate the characteristics of the user's inbox. In the last pass, the updated statistical model is used to score and classify the e-mails in the individual user's inbox. The pseudo-code of our algorithm is given in Algorithm 1. The statistical model is developed as follows. For each distinct word in the labeled set (i.e. the training set, or the initial passes over the evaluation set), determine its estimated probability in spam and non-spam e-mails. Then, find the difference of these two values for each word. Now choose the significant words by selecting only those words for which the absolute difference between their spam and non-spam probabilities is greater than some threshold t. This partitions the significant words into spam words and non-spam words. Each spam and non-spam word is assigned a weight based on the ratio of its probability in the spam and non-spam e-mails. This statistical model of words is used to compute spam score and non-spam score values, where the spam score (non-spam score) of an e-mail is the weighted sum of the words of that e-mail that belong to the significant spam (non-spam) word set. If the spam score multiplied by a scaling factor (s) is greater than the non-spam score then the e-mail is labeled as spam; otherwise, it is labeled as non-spam. This statistical model

Table 3.4: Performance of our algorithm with various parameter combinations. Average AUC of the submitted filter (with t = 400 and s = 8) is 95.07%. Values are in percentages.

Inbox   (t)   (s)    AUC     Precision   Recall   Accuracy   FP      TP
1       0     9      98.44   97.42       94.00    95.76      2.48    94.00
1       10    9.5    98.45   96.80       94.64    95.76      3.12    94.64
1       11    9.5    98.48   96.81       94.88    95.88      3.12    94.88
1       100   7.5    98.16   95.55       91.20    93.44      4.24    91.12
1       400   8      95.50   86.94       93.20    89.60      14.00   93.20
2       0     11     99.30   94.96       96.56    95.72      5.12    96.56
2       4     11.5   99.30   94.76       97.04    95.84      5.36    2.96
2       10    11.5   99.29   95.51       95.44    95.48      3.92    94.72
2       100   9      98.21   95.60       90.48    93.16      4.16    90.48
2       400   8      97.17   91.36       88.88    90.24      8.40    88.88
3       0     14.5   99.24   97.02       96.56    96.80      2.96    96.56
3       2     16     99.28   96.29       97.68    96.96      3.76    97.68
3       10    16.5   99.18   96.19       97.12    96.64      3.84    97.12
3       100   14     97.51   89.73       94.40    91.80      10.18   94.40
3       400   8      93.50   85.86       83.12    84.72      13.68   83.12

Table 3.5: Performance comparison of the algorithm using word frequencies and occurrences.

Inbox     Based on Frequency Count   Based on Occurrence Count
1         0.9848                     0.9920
2         0.9930                     0.9941
3         0.9928                     0.9936
Average   0.9902                     0.9932

is developed in the training phase as well as in the initial passes of the specialization phase. In the final pass of the specialization phase, the final scores and classifications of the e-mails are output. The motivation for using significant words is twofold: (1) a word that occurs much more frequently in spam e-mails (or non-spam e-mails) is a better feature for distinguishing spam from non-spam than a word that occurs frequently in the dataset but whose occurrence within spam and non-spam e-mails is almost equal, and (2) this approach greatly reduces the number of words that are of interest, simplifying the model and its computation. It is worth noting that this approach to significant word selection is related to the information theoretic measure of information gain. The scale factor caters for the fact that the number of non-spam words, and their weighted sum in a given e-mail, is usually greater than the number of spam words and their weighted sum. The purpose of the weighting scheme for the significant words is to give an advantage to words for

which either the spam probability or the non-spam probability is proportionally much greater than the other. For example, if the word with ID 10 has spam and non-spam counts of 0 and 50, respectively, and the word with ID 11 has spam and non-spam counts of 950 and 1000, respectively, then even though their difference in counts is the same (50), the word with ID 10 gives more information regarding the classification of the e-mail than the word with ID 11.

3.4.1 Results

Our results reported in Table 3.1 were for t = 400 and s = 8. These parameter values were highly skewed, primarily because we wanted the filter size to be small. However, with further experimentation we found that for small values of t the performance of the classifier increased, but for large values it dropped. The reason is that for small values of t, only terms with a similar chance of occurring in both the spam and non-spam classes were filtered out, whereas when the value grew very large, highly discriminatory terms also started to get removed. Experimentation on the tuning data led us to t = 8 and s = 13, which gave an AUC value 3.5% higher than our previous best result in the challenge; see Table 3.3. To determine the optimal values of the parameters, we exhaustively enumerated values of t and s. The performance of some selected parameter combinations is given in Table 3.4. The highest AUC value (the optimal filter) for each evaluation dataset is in bold. The average AUC of these three optimal combinations is 99.02%, which indicates that the parameter values learned during the training phase do not converge to their optimal settings. We also experimented to see which feature representation was more suitable: instead of maintaining the frequencies of words in e-mails we used only their occurrences (i.e. count 1 if a word occurs in an e-mail and 0 otherwise). This variation resulted in a slight increase in the AUC value on the datasets, as shown in Table 3.5. This observation is consistent with results from Forman (2003b) and Montanes et al. (2005).
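The exhaustive enumeration over t and s can be sketched as a plain grid search. Here build_significant_word_model refers to the sketch given after Algorithm 1, and the grids shown are illustrative choices, not the ones actually used in the experiments.

    from sklearn.metrics import roc_auc_score

    def grid_search(train, inbox, true_labels, t_grid, s_grid):
        """Pick (t, s) maximizing AUC on one labeled (tuning) inbox."""
        best = (None, None, -1.0)
        for t in t_grid:
            WS, WN = build_significant_word_model(train, t)
            for s in s_grid:
                # real-valued filter output: s * spam_score - nonspam_score
                outputs = [s * sum(WS.get(w, 0.0) for w in c)
                           - sum(WN.get(w, 0.0) for w in c) for c in inbox]
                auc = roc_auc_score(true_labels, outputs)
                if auc > best[2]:
                    best = (t, s, auc)
        return best

    # Illustrative grids only:
    # t_best, s_best, auc = grid_search(train, inbox, labels,
    #                                   t_grid=[0, 2, 4, 8, 16],
    #                                   s_grid=[8, 9.5, 11, 13, 16])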


3.5 Conclusion

The Discovery Challenge provided a direct comparison of SSL techniques for personalized spam filtering. A large number of solutions based on SSL and its variants, such as graph-based algorithms, self-training approaches, large-margin-based methods, multi-view learning methods, and positive-only learning methods, performed significantly better than conventional supervised spam filtering techniques such as naive Bayes and SVMs. We briefly discussed all the winning techniques and compared them. We then discussed the details of our technique, which won the performance award at the challenge. It is a statistical algorithm for automatic personalized spam filtering that is highly robust and scalable and does not require any user feedback. It builds an adaptable significant word model to capture the differing distributions of e-mails in users' inboxes. This model is first built from the training set and is subsequently adapted to the unlabeled e-mails in individual users' inboxes. This adaptation is done in one or more passes over the users' inboxes. Personalization had widely been disregarded in server-side spam filtering, but the DCW opened a promising door for the development of personalized spam filters. Our results, and those of others at the workshop, confirmed the benefits of personalization, with significant performance gains over a filter that assumes the same distribution in the evaluation sets (users' inboxes) as in the training data. The workshop also provided a rich cache of semi-supervised approaches for future investigations in this area. The workshop did not, however, address the additional challenges faced by a real application in a service-side setting, such as the scalability of the methods, but we demonstrate later in Chapter 4 that our technique can generate extremely lightweight filters with a minor tradeoff in performance.


Chapter 4

Personalized Service Side Spam Filtering

4.1 Introduction

The problem of e-mail spam filtering has continued to challenge researchers because of its unique characteristics. A key characteristic is distribution shift, in which the joint distribution of e-mails and their labels (spam or non-spam) changes from user to user, and over time for all users. Conventional machine learning approaches to content-based spam filtering are based on the assumption that this joint distribution is identical in the training and test data. However, this is not true in practice because of the adversarial nature of e-mail spam, differing concepts of spam and non-spam among users, differing compositions of e-mails among users, and temporal drift in the distribution. Typically, a single spam filter is developed from generic training data and is then applied to the e-mails of all users. This global approach cannot handle the distribution shift problem. Specifically, the generic training data do not represent the e-mails received by individual users accurately, as they are collected from multiple sources. A global filter is unlikely to provide accurate filtering for all users unless it is robust to distribution shift. Furthermore, users do not like to manually label e-mails as spam, or to share their personal e-mails, because of privacy issues, thus constraining


the solution to be fully automatic without requiring any user feedback. Recently, there has been significant interest in personalized spam filtering for handling the distribution shift problem (see Cormack (2007a)). In this approach, local or personalized filters are built for users from generic training data and their own e-mails. For personalized spam filtering to be successful it must (1) provide higher filtering performance than global filtering, (2) be automatic in the sense that users' feedback on the labels of their e-mails is not required, and (3) have a small memory footprint. The last requirement is critical for e-mail service providers (ESPs) who serve thousands to millions of users. In such a setting, implementing personalized spam filters at the service side is constrained by the fact that all filters must reside in memory for real-time filtering. In the previous chapter we looked at the discovery challenge and the approaches of the participants, followed by a detailed discussion of our approach. In this chapter we formally define the spam filtering problem and personalized spam filtering, together with the theoretical and practical issues that arise from these topics. We improve our previous algorithm in various respects, including: formulating the algorithm in terms of probabilities instead of word counts, using a linear discriminant instead of a scale factor, replacing the previous score aggregation by a well-known expert aggregation technique (linear opinion pooling), using term occurrence as opposed to term frequency, and changing the learning process of the parameters. We also introduce two variants of the algorithm for the semi-supervised setting, namely PSSF1 and PSSF2. Further, we elaborate on how our filter satisfies the three desirable characteristics mentioned above, and on the key ideas of the filter that contribute to its robustness and personalizability, including: (1) supervised discriminative term weighting, which quantifies the discrimination information that a term provides for one class over the other; these weights are used to discover significant sets of terms for each class. (2) A linear opinion pool or ensemble for aggregating the discrimination information provided by terms for spam and non-spam classification; this allows a natural transformation from the input term space to a two-dimensional feature space. (3) A linear discriminant to classify the e-mails in the two-dimensional feature space. Items (1) and (2) represent a local discrimination model defined by two features while item (3) represents a global discrimination model of e-mails. The performance of


our filter is evaluated here on six datasets and compared with four popular classifiers. Extensive results are presented demonstrating the robustness, generalization, and personalizability of the filter. In particular, our filter performs consistently better than the other classifiers in situations involving distribution shift. It is also shown to be scalable with respect to filter size and robust to gray e-mails. We add to the previous chapters in the following respects: (1) we define and discuss the challenges in spam filtering from statistical and probabilistic points of view, highlighting issues like distribution shift and the presence of gray e-mails (Sect. 4.2); (2) we replace the scale factor parameter with a linear discriminant model; (3) instead of adding the weights to find spam and non-spam scores, we use a well-established expert opinion aggregation technique known as linear opinion pooling; (4) we describe global and personalized spam filtering from an ESP's perspective, and relate these filtering options to supervised and semi-supervised learning (Sect. 4.3); (5) we present DTWC, the supervised version of our filter for global spam filtering; (6) we develop a variant of our previous algorithm, referred to as PSSF2, that is better suited to personalized spam filtering and distribution shift; (7) we develop the theoretical foundation of our filter for global and personalized spam filtering based on local and global discrimination modeling (LeGo), and compare it with popular generative, discriminative, and hybrid classifiers (Sect. 4.4.4); (8) we evaluate and compare our filter's performance with others in global and personalized spam filtering settings (Sect. 4.6); (9) we evaluate the performance of our filter under varying distribution shift and test its robustness; (10) we identify and resolve gray e-mails; (11) we provide results for additional e-mail spam datasets side by side with four benchmark text classification algorithms, namely naive Bayes, support vector machines, maximum entropy, and balanced winnow; (12) we study the effect of multiple passes on performance; (13) we study the generalization of the model to unseen data; and (14) we change how the parameter values are learned.

4.2 The Nature of the Spam Filtering Problem

The classification of e-mails into spam or non-spam based on their textual content, given a labeled set of e-mails, represents a prototypical supervised text classification problem. The idea

is to learn a classifier or a filter from a sample of labeled e-mails which, when applied to unlabeled e-mails, assigns the correct labels to them. This high-level description of a spam filter, however, hides several issues that make spam filtering challenging. In this section, we define and describe the nature of e-mail classification, highlighting the key challenges encountered in the development and application of content-based spam filters. Let X be the set of all possible e-mails and Y = {+, -} be the set of possible labels, with the understanding that spam is identified by the label +. The problem of supervised e-mail classification can then be defined as follows:

Definition 1 (Supervised Spam E-mail Classification). Given a set of training e-mails $L = \{(x_i, y_i)\}_{i=1}^{N}$ and a set of test e-mails $U$, both drawn from $X \times Y$ according to an unknown probability distribution $p(x, y)$, learn the target function $\bar{\Phi}(x) : X \rightarrow Y$ that maximizes a performance score computed from all $(x, y) \in U$.

The joint probability distribution of e-mails and their labels, p(x, y), completely defines the e-mail classification problem. It captures any selection bias and all uncertainties (e.g. label noise) in the concept of spam and non-spam in L and U. Thus, the e-mail classification problem can be solved by estimating the joint distribution from the training data. The learned target function can then be defined as

\[ \bar{\Phi}(x) = y = \operatorname*{argmax}_{v \in \{+,-\}} \bar{p}(x, v) \]

where $\bar{p}(x, v)$ denotes the estimate of the joint probability $p(x, v)$. Since $p(x, y) = p(x|y)p(y)$ and $p(x, y) = p(y|x)p(x)$, it is customary and easier to estimate the component distributions on the right hand sides rather than the full joint distribution directly. Given these decompositions of the joint distribution, the learned target function can be written in one of the following ways:

\[ \bar{\Phi}(x) = y = \operatorname*{argmax}_{v \in \{+,-\}} \bar{p}(x|v)\,\bar{p}(v) \tag{4.1} \]

\[ \bar{\Phi}(x) = y = \operatorname*{argmax}_{v \in \{+,-\}} \bar{p}(v|x) \tag{4.2} \]

Equation 4.1 represents a generative approach to supervised learning where the class prior and


class-conditional distributions are estimated. It is called generative because these two distributions can be used to generate the training data. Equation 4.2 represents a discriminative approach to supervised learning where the posterior distribution of the class given the e-mail is estimated directly. Notice that in the discriminative approach, estimating the prior distribution of e-mails $\bar{p}(x)$ is not necessary because the classification of a given e-mail x depends only on the posterior probability of the class given the e-mail. The performance score quantifies the utility of the learned target function or classifier. Typically, when evaluating spam filters, the performance score is taken to be the accuracy and/or AUC value (area under the ROC curve) of the classifier. The AUC value is considered a more robust measure of classifier performance since it is not based on a single decision boundary [Cortes and Mohri (2004), Bickel (2006)]. In supervised learning, only the labeled e-mails in L are available to the learner, and the learned target function is evaluated on the e-mails in U.
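To make the two decompositions concrete, the following minimal sketch contrasts a generative classifier (Eq. 4.1; a Bernoulli naive Bayes is used as one possible instance) with a discriminative one (Eq. 4.2; logistic regression). The toy data and model choices are our own assumptions.

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB           # estimates p(x|y) and p(y)
    from sklearn.linear_model import LogisticRegression   # estimates p(y|x) directly

    X = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 1]])  # term occurrences
    y = np.array([1, 0, 1, 0])                                  # 1 = spam

    generative = BernoulliNB().fit(X, y)              # Eq. 4.1: argmax_v p(x|v) p(v)
    discriminative = LogisticRegression().fit(X, y)   # Eq. 4.2: argmax_v p(v|x)

    x_new = np.array([[1, 0, 0]])
    print(generative.predict(x_new), discriminative.predict(x_new))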

4.2.1 Distribution Shift and Gray E-mails

In the previous subsection, it was assumed that the training and test data, L and U, follow the same probability distribution p(x, y). This assumption is not valid in many practical settings, where the training data come from publicly available repositories and the test data represent e-mails belonging to individual users. More specifically, the joint distribution of e-mails and their labels in L, p_L(x, y), is not likely to be identical to that in U, p_U(x, y). This arises from the different contexts (e.g. topics, languages, concepts of spam and non-spam, preferences, etc.) of the two datasets. Similarly, if U_i and U_j are the test sets belonging to users i and j, respectively, then we cannot expect the joint distribution of e-mails and their labels in these two sets to be identical.

Definition 2 (Distribution Shift). Given sets of labeled e-mails L, U_i, and U_j, there exists a distribution shift between any two sets if any of the probability distributions p(x, y), p(x|y), p(y|x), p(x), and p(y) are not identical for the two sets.

The extent of distribution shift can be quantified in practice by information divergence measures such as Kullback-Leibler divergence (KLD) and total variation distance (TVD). Because of distribution shift, it is likely that the learned target function will misclassify some

e-mails, especially when the change in distribution occurs close to the learned decision boundary. In any case, a distribution shift will impact performance scores like the AUC value that are computed by sweeping through all decision boundaries.
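Both divergence measures can be estimated directly from per-class term distributions. A minimal sketch follows; the two input vectors are hypothetical term-probability estimates from, say, the training data and a user's inbox.

    import numpy as np

    def kld(p, q, eps=1e-12):
        """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
        p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    def tvd(p, q):
        """Total variation distance: half the L1 distance."""
        return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

    # Hypothetical term distributions p(x_j | y = +): training vs. one user inbox
    p_train = [0.5, 0.3, 0.2]
    p_user = [0.2, 0.3, 0.5]
    print(kld(p_train, p_user), tvd(p_train, p_user))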

[Figure 4.1 appears here: six panels plotting term probability against Term Id. Top row: Training Data Spam Distribution and Training Data Non-Spam Distribution; middle row: Testing Data Spam Distribution and Testing Data Non-Spam Distribution; bottom row: Difference of Spam Probabilities and Difference of Non-Spam Probabilities.]

Figure 4.1: Shift in p(x|y) between training and test data (individual user's e-mails) (ECML-A data)

Distribution shift can be quantified by KLD [Bigi (2003), Kullback and Leibler (1951)] and TVD [Kennedy and Quine (1989)], which are defined and used in Sect. 4.6.3. The extent of distribution shift can also be visualized from simple frequency graphs. Figure 4.1 illustrates shift in the distribution p(x|y) between the training data and a specific user's e-mails (test data). The top two plots show the probability of terms in spam (left plot) and non-spam (right plot) e-mails in the training data, while the middle two plots show the same distributions for an individual user's e-mails. The bottom two plots show the corresponding difference in distributions between the training data and the individual user's e-mails. Figure 4.2 shows the difference in distributions of p(x|y)

[Figure 4.2 appears here: two panels, Spam Probabilities and Non-Spam Probabilities, plotting probability difference against Term Id.]

Figure 4.2: Difference in p(x|y) for e-mails from two different time periods (ECUE-1 data)

over two time periods. These figures illustrate the presence of distribution shift in practical e-mail systems. Figure 4.3 shows the difference in distributions of p(x|y) for a sentiment mining dataset in which the training and test data were randomly sampled. This difference is significantly smaller than that of Figure 4.1, indicating very little or no distribution shift. A specific consequence of distribution shift is that of gray e-mail. If the distribution p(y|x) in U_i and U_j is different to such an extent that the labels of some e-mails are reversed in the two sets, then these e-mails are called gray e-mails. They are referred to as gray because they are considered non-spam by some users and spam by others.

Definition 3 (Gray E-mail). An e-mail x is called a gray e-mail if its label in U_i and U_j is different.

Gray e-mails can be identified in practice by finding highly similar e-mails in U_i and U_j that have different labels. Typical examples of gray e-mails include mass distribution e-mails (e.g. newsletters, promo-

[Figure 4.3 appears here: two panels, Positive Sentiments and Negative Sentiments, plotting probability difference against Term Id.]

Figure 4.3: Shift in p(x|y) between training and test data (Movie Review)

tional e-mails, etc.) that are considered spam by some users and non-spam by others.

4.3 Global Versus Personalized Spam Filtering

4.3.1 Motivation and Definition

The issue of global versus personalized spam filtering is important for e-mail service providers (ESPs). ESPs can serve thousands to millions of users, and they seek a practical trade-off between filter accuracy and filter efficiency (primarily related to filter size). A personalized spam filtering solution, for example, may not be practical if its implementation takes up too much memory for storing and executing the filters of all users. A global spam filtering solution, on the other hand, may not be practical if it does not provide accurate filtering for all users. Thus, robustness and scalability are two significant characteristics desirable in a spam filter or classifier. If a filter is robust then a global solution may provide acceptable filtering; otherwise, a personalized solution may be needed, which must then be scalable to be practically implementable. To gain a better understanding of the suitability of personalized and global spam filtering solutions it is important to know the settings under which each can be applied. Table 4.1 shows

the learning settings under which global and personalized spam filtering can be applied. The first column indicates that a supervised global filter is possible whereas supervised personalized filtering is not. This is because only the training data (L) are available to the learner, while the labels of the e-mails belonging to individual users are not. Semi-supervised learning can be adopted for both global and personalized filtering solutions. Under this learning setting, the e-mails belonging to the users (without their labels) are available during the learning process in addition to the training data L. The actual way in which the unlabeled e-mails are utilized can vary from algorithm to algorithm. The third column indicates that both personalized and global filtering can be applied when users are requested to label some e-mails that belong to them. These labeled e-mails are then used during the learning process to build global or personalized filtering solutions. This strategy, however, is not automatic and places an additional burden on users in providing feedback on the received e-mails. Certain hybrid global/personalized and supervised/semi-supervised filtering solutions are also possible.

Table 4.1: Global and personalized spam filtering options

               Supervised   Semi-Supervised   Semi-Supervised + Feedback   Hybrid
Global         ✓            ✓                 ✓                            ✓
Personalized   ✗            ✓                 ✓                            ✓

In this work, we focus on automatic supervised and semi-supervised learning for global and personalized spam filtering. This is the most desirable setting for an ESP. Semi-supervised global and personalized spam filtering can be defined as follows:

Definition 4 (Semi-Supervised Global and Personalized E-mail Classification). Given a set of labeled e-mails L (training data) and M >= 1 sets of labeled e-mails U_i (e-mails belonging to user i = 1, 2, ..., M) drawn from X x Y according to (unknown) probability distributions p_L(x, y) and p_{U_i}(x, y), respectively, then

(a) a semi-supervised global filter learns the single target function $\bar{\Phi}(x) : X \rightarrow Y$ from L and the U_i (without their labels) that maximizes a performance score computed from all $(x, y) \in \bigcup_i U_i$;

(b) a semi-supervised personalized filter learns M target functions $\bar{\Phi}_i(x) : X \rightarrow Y$ from L and U_i (without their labels) that respectively maximize a performance score computed from all $(x, y) \in U_i$.

The joint probability distributions p_L(x, y) and p_{U_i}(x, y) are likely to differ from one another because of the differing contexts of the training data and the users' e-mails. Given this reality, it is unlikely that a single global filter trained on L only (supervised learning) will perform well for all users, unless it is robust to shifts in distribution. Similarly, a semi-supervised approach, whether global or personalized, is likely to do better than a supervised approach, as it has the opportunity to adapt to the users' (unlabeled) e-mails. However, such an approach needs to be scalable and effective for implementation at the ESP's end. For semi-supervised learning, in addition to the training data L, the e-mails in U (without their labels) are also considered by the learner. In such a setting, the learned target function can be evaluated on the e-mails in U, as in the supervised learning setting. A better evaluation strategy, especially considering that a spam filter may not be updated continuously, is to evaluate on a randomly sampled hold-out set from U, G ⊂ U, with the learner seeing the unlabeled e-mails U′ = U \ G and the learned target function tested on the e-mails in G, which can be referred to as the generalization data.

4.3.2 Semi-Supervised Global Versus Personalized Spam Filtering

Which semi-supervised filtering solution is better: global or personalized? This question is difficult to answer in general as it depends on the algorithm and its robustness and scalability characteristics. It also depends upon the extent of the distribution shifts among L and the U_i. In this section, we try to get a better understanding of this question under certain extreme distribution shift scenarios. Let p_{U_i}(x) and p_{U_i}(y|x) be the probability distributions in the set U_i belonging to user i. Given this probabilistic viewpoint, four different scenarios can be defined: (1) p_{U_i}(x) ≇ p_{U_j}(x), ∀i ≠ j, and p_{U_i}(y|x) ≇ p_{U_j}(y|x), ∀i ≠ j; (2) p_{U_i}(x) ≅ p_{U_j}(x), ∀i, j, and p_{U_i}(y|x) ≇ p_{U_j}(y|x), ∀i ≠ j; (3) p_{U_i}(x) ≇ p_{U_j}(x), ∀i ≠ j, and p_{U_i}(y|x) ≅ p_{U_j}(y|x), ∀i, j; and (4) p_{U_i}(x) ≅ p_{U_j}(x), ∀i, j, and p_{U_i}(y|x) ≅ p_{U_j}(y|x), ∀i, j. The binary operator ≅ indicates that the shift between the two probability distributions is minor (i.e. the two distributions are approximately identical). Similarly, the binary operator ≇ indicates that the shift between the two probability distributions is significant. For presentation convenience, the operators ≅ and ≇ are read as "identical" ("no distribution shift") and "not


identical" ("distribution shift exists"), respectively. Whenever a distribution is identical among the users it is assumed that it is the same distribution as that in the training data. Whenever a distribution is not identical among the users it is assumed that all are different from that in the training data. Scenario 1 is a difficult setting in which the users' e-mails and the labels given the e-mails both follow different distributions. For this scenario, it is expected that a personalized spam filtering solution will perform better, as learning multiple joint distributions is difficult for a single algorithm, as would be required in a global filtering solution. An extreme case of this scenario is when the vocabularies of e-mails for all the users are disjoint (e.g. when different users use different languages). In this setting, the size of the global filter will be identical to the sum of the sizes of the personalized filters. A similar statement can be made about scenario 2 regarding filtering performance. However, for this scenario, the size of each personalized filter will be identical to the size of the global filter. The problem of gray e-mails is likely to be present in scenarios 1 and 2 since in these scenarios the distribution p_{U_i}(y|x) is different among the users. The prevalence of gray e-mails suggests a preference for a personalized filtering solution. Scenario 3 does not involve concept shift, while the distribution of e-mails is not identical across all users. For this scenario, which manifests as covariate shift in vector space representations, a global filter (built using all unlabeled e-mails) is expected to perform better, especially when the distribution of e-mails is not very different. Scenario 4 represents a conventional machine learning problem with no distribution shift among the training and test data. Here, a global filter will be better in terms of both performance and space. The scenarios described in the previous paragraph are based on a discriminative viewpoint of the e-mail classification problem. Similar scenarios can also be defined with a generative viewpoint using the probability distributions p(x|y) and p(y). Equations 4.1 and 4.2 define generative and discriminative probabilistic classifiers, respectively, for spam filtering. When there is no distribution shift, the probability estimates on the right hand sides are based on the training data. On the other hand, when distribution shift exists the probability estimates must be based on the test data (U_i). However, the labels in the test data are not available to the learner, and it provides information about p(x) only. When there is no shift in


p(y|x) and training data is abundant, a discriminant function learned from the training data will perform well on the test data, irrespective of the shift in p(x). However, when the distribution p(y|x) changes from training to test data, the discriminant function will require adaptation. Similarly, when the data generation process changes from training to test data, a generative classifier will require adaptation. Our approach (described in detail in the next section) uses local and global discrimination models with semi-supervised adaptation of the models on the test data. The local model takes advantage of generative probabilities to discover discriminating patterns. The global model then performs pattern-based classification. This approach is flexible and is better able to adapt to distribution shift between training and test data.

4.4 DTWC/PSSF: A Robust and Personalizable Spam Filter

In this section, we describe a robust and personalizable content-based spam filter suitable for global and personalized filtering of e-mails at the ESP's end. The filter is robust in the sense that its performance degrades gracefully with increasing distribution shift and decreasing filter size. The filter can be used for global as well as personalized filtering, and can take advantage of each user's e-mails in a semi-supervised fashion without requiring feedback from the user. The filter exploits local and global discrimination modeling for robust and scalable supervised and semi-supervised e-mail classification. The key ideas of the supervised filter are: (1) identification or discovery of significant content terms based on the discrimination information they provide, quantified by their relative risk in spam and non-spam e-mails (called their discriminative term weights), (2) discrimination information pooling for the construction of a two-dimensional feature space, and (3) a discriminant function in the feature space. In a semi-supervised setting, either or both of the local model (the discriminative term weights and the two features) and the global model (the discriminant function) are updated using the e-mails of each user to personalize the filter. We name the supervised filter DTWC (discriminative term weighting based classifier) and its semi-supervised extension PSSF (personalized service-side spam filter), in accordance with our previous work Junejo and Karim (2008; 2007b).

The approach of the discovery challenge workshop discussed in the previous chapter had a few shortcomings. First, the value of the scale factor parameter was determined exhaustively, which took too much time and converged badly. Second, the spam and non-spam scores were calculated by simply adding the term weights. Third, the model was not probabilistic; it used only word counts, and the prior probabilities of the classes were not taken into account. This simplistic model worked well, as can be seen from the results of the previous chapter, but there are two problems with not including the prior. One is obvious: it would not work for imbalanced classes. The second is subtler and occurs with balanced classes as well, in the semi-supervised pass. In this pass we learn the model on the test data that has previously been labeled by the model learned on the training data, and this labeled test data has no guarantee of an equal class ratio. This is because the model learned on the training data is not 100% accurate; it may be, for example, that 60% of the e-mails in the test data are classified as spam while only 40% are classified as non-spam. So, if we learn the model on this labeled test data without the priors, the weights might get skewed in favor of one class over the other. Switching to probabilities not only defines the upper range of t (i.e. up to 1) but also makes comparison across different corpora possible. Instead of learning the parameter values on tuning data, which may not always be available and might be very different from the evaluation data, we now select the parameter values that give the minimum error on the training data. The local and global models of DTWC/PSSF are presented next. These models are discussed for a supervised setting first, followed by their extension to a semi-supervised setting for building personalized filters. Finally, we interpret and compare our algorithms with popular generative, discriminative, and hybrid classifiers.

4.4.1 Local Patterns of Spam and Non-Spam E-mails

DTWC/PSSF is based on a novel local model of spam and non-spam e-mails. The terms in the vocabulary are partitioned into significant spam and non-spam terms depending on their prevalence in spam and non-spam e-mails, respectively. The e-mails are considered to be made up of significant spam and non-spam terms, with each term expressing an opinion regarding the classification of the


e-mail in which it occurs. The overall classification of an e-mail is based on the aggregated opinion expressed by the significant terms in it. The discriminative term weight quantifies the discrimination information (or opinion) of a term, while a linear opinion pool is formed to aggregate the opinions expressed by the terms. The aggregated opinions (one for spam and one for non-spam classification) represent local features or patterns of an e-mail, learned from the training data, that are input to a global classification model (described in the next subsection).

Significant Spam and Non-Spam Terms

A term j in the vocabulary is likely to be a spam term if its probability in spam e-mails, p(x_j|y = +), is greater than its probability in non-spam e-mails, p(x_j|y = -). A term j is a significant spam (non-spam) term if p(x_j|y = +)/p(x_j|y = -) > t (respectively p(x_j|y = -)/p(x_j|y = +) > t), where t >= 1 is a term selection parameter. Given the above, the index sets of significant spam and non-spam terms (Z+ and Z-) can be defined as follows:

\[ Z^{+} = \left\{ j : \frac{p(x_j|y=+)}{p(x_j|y=-)} > t \right\}, \quad \text{and} \tag{4.3} \]

\[ Z^{-} = \left\{ j : \frac{p(x_j|y=-)}{p(x_j|y=+)} > t \right\} \tag{4.4} \]

where the index j varies from 1 to T. Note that Z+ ∩ Z- = ∅, indicating that a hard partitioning of terms is done. However, |Z+ ∪ Z-| is generally not equal to T. The probability ratios in Eqs. 4.3 and 4.4 are the relative risks of term j in spam and non-spam e-mails, respectively. We discuss this aspect in more detail in the following subsection. The parameter t serves as a term selection parameter and can be used to tune the size of the filter. If t = 1, then all spam and non-spam terms are retained in the significant term model. As the value of t is increased, less significant (or less discriminating, as explained in the next subsection) terms are removed from the model. This is a supervised and more direct approach to term selection compared to common techniques used in practice like information gain and principal component analysis. Effective term selection is important for creating lightweight personalized filters for large-scale service-side deployment.

Alternatively, significant spam and non-spam terms can be selected by the conditions p(x_j|y = +) - p(x_j|y = -) > t′ and p(x_j|y = -) - p(x_j|y = +) > t′, respectively (cf. Eqs. 4.3 and 4.4). Here, t′ >= 0 is a term selection parameter with the same semantics as the parameter t discussed above. The probabilities p(x_j|y) are estimated from the training data (the set L of labeled e-mails) as the fraction of e-mails in which term j occurs:

\[ \bar{p}(x_j|y) = \frac{\sum_{i \in L_y} x_{ij}}{|L_y|} \]

where L_y denotes the set of e-mails in L belonging to class y and |L_y| is the number of e-mails in the set. To avoid division by zero, we assign a small value to the probabilities that come out to be zero (Laplacian smoothing).

Discriminative Term Weighting

E-mails are composed of significant spam and non-spam terms. Each term in an e-mail is assumed to express an opinion about the label of the e-mail – spam or non-spam. This opinion can be quantified by discrimination information measures that are based on the distribution of the term in spam and non-spam e-mails. If an e-mail x contains a term j (i.e. x_j = 1) then it is more likely to be a spam e-mail if p(x_j|y = +) is greater than p(x_j|y = -). Equivalently, an e-mail x is likely to be a spam e-mail if its relative risk for spam compared to non-spam is greater than one. This can be expressed mathematically as

\[ \frac{p(y=+|x_j)}{p(y=-|x_j)} \propto \frac{p(x_j|y=+)}{p(x_j|y=-)} > 1 \tag{4.5} \]

Given this observation, we define the discriminative term weight w_j for terms j = 1, 2, ..., T as

\[ w_j = \begin{cases} p(x_j|y=+)/p(x_j|y=-) & \forall j \in Z^{+} \\ p(x_j|y=-)/p(x_j|y=+) & \forall j \in Z^{-} \end{cases} \tag{4.6} \]

The discriminative term weights are always greater than or equal to 1. The larger the value of w_j, the

higher is the discrimination information provided by term j. The inclination of the discrimination information is determined by whether the term is a significant spam term or a significant non-spam term. It is worth pointing out the distinction between discriminative term weights (DTWs) and (document) term weights (TWs). First, DTWs quantify the discrimination information that terms provide for classification, while TWs quantify the significance of a term within a document. Second, DTWs are defined globally for every term in the vocabulary, while TWs are defined locally for every term in a document. Third, DTWs are computed in a supervised fashion rather than the usual unsupervised computation of TWs. DTWs are not a substitute for TWs; they are defined for the different purpose of classification rather than representation. Relative risk, or risk ratio, has been used in medical domains for analyzing the risk of a specific factor in causing a disease [Hsieh et al. (1985), Li et al. (2005)]. This is done as part of prospective cohort studies in which two groups of individuals, one exposed and the other unexposed to the factor, are observed for development of the disease. The relative risk of the factor is the ratio of the proportion of exposed individuals developing the disease to the proportion of unexposed individuals developing the disease. In data mining and machine learning research, relative risk has been investigated for feature selection [Forman (2003a)] and pattern discovery [Li et al. (2005; 2007)]. In this work, we adopt relative risk for feature selection and quantification of discrimination information. Our feature selection procedure (refer to Eqs. 4.3 and 4.4) finds both positively and negatively correlated features/terms but keeps them separated for global discrimination.
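A minimal sketch of Eqs. 4.3 to 4.6 follows, assuming a binary document-term matrix; the smoothing constant stands in for the Laplacian smoothing mentioned earlier and is our own choice.

    import numpy as np

    def local_model(X, y, t=1.5, eps=1e-6):
        """X: (n_docs, T) binary term-occurrence matrix; y: +1 spam, -1 non-spam.

        Returns the significant term index sets (Eqs. 4.3 and 4.4) and the
        discriminative term weights (Eq. 4.6).
        """
        p_spam = np.clip(X[y == +1].mean(axis=0), eps, None)  # p(x_j | y = +)
        p_ham = np.clip(X[y == -1].mean(axis=0), eps, None)   # p(x_j | y = -)
        rr_spam = p_spam / p_ham                              # relative risk, spam
        rr_ham = p_ham / p_spam                               # relative risk, non-spam
        Z_plus = np.flatnonzero(rr_spam > t)                  # Eq. 4.3
        Z_minus = np.flatnonzero(rr_ham > t)                  # Eq. 4.4
        w = np.where(rr_spam > rr_ham, rr_spam, rr_ham)       # Eq. 4.6, per term
        return Z_plus, Z_minus, w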

Linear Opinion Pooling

The classification of an e-mail depends on the prevalence of significant spam and non-spam terms and their discriminative term weights. Each term j ∈ Z+ in e-mail x expresses an opinion regarding the spam classification of the e-mail. This opinion is quantified by the discriminative term weight w_j. The aggregated opinion of all these terms is obtained as the linear combination of individual terms' opinions:

\[ \mathrm{Score}^{+}(x) = \sum_{j \in Z^{+}} w_j \, \frac{x_j}{\sum_j x_j} \tag{4.7} \]

This equation follows from a linear opinion pool or an ensemble average, which is a statistical technique for combining experts' opinions [Jacobs (1995), Alpaydin (2004)]. Each opinion (w_j) is weighted by the normalized term weight (x_j / Σ_j x_j), and all weighted opinions are summed, yielding an aggregated spam score (Score+(x)) for the e-mail. If a term i does not occur in the e-mail (i.e. x_i = 0) then it does not contribute to the pool. Also, terms that do not belong to the set Z+ do not contribute to the pool. Similarly, an aggregated non-spam score can be computed over all terms j ∈ Z- as

\[ \mathrm{Score}^{-}(x) = \sum_{j \in Z^{-}} w_j \, \frac{x_j}{\sum_j x_j}. \tag{4.8} \]

It is interesting to note that the sets of significant spam and non-spam terms (Z+ and Z-) represent pattern teams [Knobbe and Valkonet (2009)] for a given value of t. In other words, these sets are optimal and non-redundant when considering the sum of their discriminative term weights as the quality measure, given the parameter t. Nevertheless, we construct new continuous-valued features from the two sets rather than using the terms in the sets as binary features. This is done to improve the robustness of the classifier in addition to reducing the complexity of the global model. Based on the local model described above, the classification (or decision) function is

\[ y(x) = \operatorname*{argmax}_{v \in \{+,-\}} \mathrm{Score}^{v}(x). \tag{4.9} \]

In words, the classification label of a new e-mail x is determined by the greater of the two scores, spam score or non-spam score.
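Continuing the sketch above, the pooled scores of Eqs. 4.7 to 4.9 can be computed as follows; this is again a hedged illustration reusing the hypothetical local_model output.

    def pooled_scores(x, Z_plus, Z_minus, w):
        """Linear opinion pooling (Eqs. 4.7 and 4.8) for one e-mail vector x."""
        total = max(x.sum(), 1)                          # normalizer: sum_j x_j
        score_plus = (w[Z_plus] * x[Z_plus]).sum() / total
        score_minus = (w[Z_minus] * x[Z_minus]).sum() / total
        return score_plus, score_minus

    def classify_local(x, Z_plus, Z_minus, w):
        """Decision function of Eq. 4.9: the larger pooled score wins."""
        sp, sn = pooled_scores(x, Z_plus, Z_minus, w)
        return "+" if sp > sn else "-"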

4.4.2 Global Discriminative Model of Spam and Non-Spam E-mails

This section describes a discriminative model that is built using the features found by the local model. The use of both local and global learning allows greater flexibility in updating model parameters in the face of distribution shift. It also contributes to the robustness of the classifier

towards distribution shift.

[Figure 4.4 appears here: a scatter plot of e-mails in the (Score-(x), Score+(x)) plane; spam and non-spam e-mails form separate clusters, and the line α+ · Score+(x) - Score-(x) + α0 = 0 separates them.]

Figure 4.4: The two-dimensional feature space and the linear discriminant function for e-mail classification

The local model yields a two-dimensional feature space defined by the scores Score+(x) and Score-(x). In this feature space, e-mails are well separated and discriminated, as illustrated in Fig. 4.4. The documents of the two classes align themselves with the x and y axes respectively, and appear to form clusters. Each point in this space corresponds to a document, while each dimension corresponds to the consolidated opinion of the discriminative terms of the corresponding class. Documents in this space can be assigned the class for which they have the highest score (Eq. 4.9). At times this strategy works well, giving comparable results, but at other times it fails miserably. This is why relative risk and odds ratio have not found much success in the text classification literature. The class distributions within a dataset can be very different from each other, as a result of which the number of features, and hence the computed class scores, might get skewed in favor of one class. For example, spammers distort the spellings of words to bypass word-based filters, because of which the number of features selected for the spam class


Algorithm 2 DTWC – Supervised Spam Filter
Input: L (training data – labeled e-mails), U (test data – unlabeled e-mails)
Output: labels for e-mails in U
On training data L
-Build local model
    form index sets Z+ and Z- (Eqs. 4.3 and 4.4)
    compute wj for all j in Z+ ∪ Z- (Eq. 4.6)
    transform x -> [Score+(x) Score-(x)]T for all x in L (Eqs. 4.7 and 4.8)
-Build global model
    learn parameters α+ and α0 (see Eq. 4.10)
On test data U
-Apply learned models
    for x in U do
        compute Score+(x) and Score-(x) (Eqs. 4.7 and 4.8)
        compute f(x) (Eq. 4.10)
        output Φ̄(x) (Eq. 4.11)
    end for

might be very high compared to that for non-spam, which can skew decisions in favor of spam e-mails. In some problems the documents of one class tend to be very short or terse compared to the other classes, resulting in a small number of features for that class. We overcome this deficiency by learning a linear discriminant function in the transformed feature space, which weighs the two opinions accordingly. DTWC/PSSF therefore classifies e-mails in this space by the following linear discriminant

[Figure 4.5 appears here: a block diagram of Algorithm 2. The training data L feeds "Build Local Model" (calculate wj, Z+, Z-, Score+(x), Score-(x)), followed by "Build Global Model" (calculate α+ and α0); the learned models are then applied to the test data U to output the labels.]

Figure 4.5: Depiction of Algorithm 2

function:

\[ f(x) = \alpha^{+} \cdot \mathrm{Score}^{+}(x) - \mathrm{Score}^{-}(x) + \alpha_0 \tag{4.10} \]

where α+ and α0 are the slope and bias parameters, respectively. The discriminating line is defined by f(·) = 0. If f(·) > 0 then the e-mail is likely to be spam (Fig. 4.4). The global discrimination model parameters are learned by minimizing the classification error over the training data. This represents a straightforward optimization problem that can be solved by any iterative optimization technique [Luenberger (1984)]. DTWC/PSSF's learned target function is defined as

\[ \bar{\Phi}(x) = y = \begin{cases} + & f(x) > 0 \\ - & \text{otherwise} \end{cases} \tag{4.11} \]

The supervised spam filter DTWC is shown in Algorithm 2 and Figure 4.5.
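The global model of Eq. 4.10 has only two free parameters, so almost any optimizer will do. The sketch below minimizes the 0/1 training error over a coarse grid, purely to illustrate one possible choice rather than the optimization actually used.

    import numpy as np

    def learn_global_model(scores_plus, scores_minus, y, alpha_grid, bias_grid):
        """Learn (alpha_plus, alpha_0) of Eq. 4.10 by minimizing 0/1 error.

        scores_plus/scores_minus: per-e-mail pooled scores; y: +1/-1 labels.
        The grid search here is an assumed stand-in for an iterative optimizer.
        """
        best, best_err = (1.0, 0.0), np.inf
        for a in alpha_grid:
            for b in bias_grid:
                f = a * scores_plus - scores_minus + b   # Eq. 4.10
                err = np.mean(np.sign(f) != y)           # Eq. 4.11 decision
                if err < best_err:
                    best, best_err = (a, b), err
        return best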

4.4.3 Personalization

The previous subsections described a supervised algorithm for spam filtering. This can be used to build a single global filter for all users given generic training data (the e-mails in L labeled by $\hat{\Phi}(x)$). A global filtering solution, however, is not appropriate when there is a distribution shift between the training data and an individual user's e-mails. In such a situation, which is common in e-mail systems, personalization of the filter is required for each user. Since the e-mails of users are unlabeled, a semi-supervised approach is needed, as discussed in Sect. 4.3. DTWC is adapted to an individual user's e-mails (U_i) as follows. Use the target function learned from the training data to label the e-mails in U_i. Since labels are now available, update the local and global models to form a new target function. Use this updated target function to finally label the e-mails in U_i. This approach corresponds to the naive semi-supervised learning (SSL) used in Xue and Weiss (2009). The motivation for using naive SSL instead of more sophisticated approaches (e.g. self-training SSL) is two-fold. First, most of the other approaches are proposed for situations where the training and test data follow the same distribution. In e-mail systems, on the other hand, the distribution of e-mails in the training data can be significantly different from that in the test data, and this difference is in general not easily quantifiable. Second, in e-mail systems distribution

shift occurs continuously over time, and it is better to personalize a filter from one time period to another rather than from the training data to the current time period. The naive SSL approach decouples the adaptation from the training data and as such can be performed on the client side in addition to the service side.

Algorithm 3 PSSF1/PSSF2 – Personalized Spam Filter
Input: L (training data – labeled e-mails), U (test data – unlabeled e-mails)
Output: labels for e-mails in U
U′ ← U labeled with DTWC(L, U)
On labeled e-mails U′:
  - Build local model
  - Build global model [PSSF2 only]
On test data U:
  - Apply learned models

The local model can be updated incrementally as new e-mails are seen by the filter, capturing the changing distribution of e-mails received by the user. The global model can be rebuilt at periodic intervals (e.g. every week) to cater for significant changes in the distribution of e-mails. The semi-supervised version of our filter is called PSSF (Personalized Service-side Spam Filter). We further differentiate between two variants of PSSF, PSSF1 and PSSF2. In PSSF1 only the local model (the spam and non-spam scores) is updated over the user's e-mails (Ui), while the global model learned over the training data is kept unchanged. In PSSF2 both the local and global models are updated over the user's e-mails (Ui). PSSF2 provides better personalization because the global model is also adjusted according to the user's distribution, but on the downside it takes more time. PSSF1 and PSSF2 are described in Algorithm 3 and Figure 4.6.
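A sketch of the naive-SSL personalization loop follows, reusing the illustrative train_dtwc, dtwc_scores, and fit_global functions from the DTWC sketch above (again an approximation of the algorithm, not a reference implementation):

    import numpy as np

    def pssf(L_X, L_y, U_X, update_global=True):
        # PSSF1: update_global=False; PSSF2: update_global=True.
        # Step 1: label the user's e-mails with the filter trained on L.
        w_pos, w_neg = train_dtwc(L_X, L_y)
        s_pos, s_neg = dtwc_scores(L_X, w_pos, w_neg)
        a_plus, a0 = fit_global(s_pos, s_neg, L_y)
        u_pos, u_neg = dtwc_scores(U_X, w_pos, w_neg)
        pseudo = np.where(a_plus * u_pos - u_neg + a0 > 0, 1, -1)
        # Step 2: rebuild the local model (and, for PSSF2, the global
        # one) on the now-labeled user e-mails.
        w_pos, w_neg = train_dtwc(U_X, pseudo)
        u_pos, u_neg = dtwc_scores(U_X, w_pos, w_neg)
        if update_global:
            a_plus, a0 = fit_global(u_pos, u_neg, pseudo)
        # Step 3: output the final labels for the user's e-mails.
        return np.where(a_plus * u_pos - u_neg + a0 > 0, 1, -1)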

[Figure 4.6: Depiction of Algorithm 3]

4.4.4 Interpretations and Comparisons

In this section, we provide a broader interpretation of our filter by comparing it with generative, discriminative, and hybrid classifiers.

Naive Bayes Classifier

The naive Bayes classifier is popularly used for text and content-based e-mail classification. Using Bayes' rule, the odds that an e-mail x is spam rather than non-spam can be written as

$$\frac{p(y=+ \mid x)}{p(y=- \mid x)} = \frac{p(x \mid y=+)\,p(y=+)}{p(x \mid y=-)\,p(y=-)}$$

Assuming that the presence of each term is independent of the others given the class, the e-mail risk on the right-hand side becomes a product of the terms' risks. The naive Bayes classification of the e-mail x is spam (+) when

$$\frac{p(y=+)}{p(y=-)} \prod_{j} \left[ \frac{p(x_j \mid y=+)}{p(x_j \mid y=-)} \right]^{x_j} > 1 \qquad (4.12)$$

Equivalently, taking the log of both sides, the above expression can be written as

$$\log\frac{p(y=+)}{p(y=-)} + \sum_{j} x_j \log\frac{p(x_j \mid y=+)}{p(x_j \mid y=-)} > 0 \qquad (4.13)$$

This equation computes an e-mail score, and when this score is greater than zero the naive Bayes classification of the e-mail is spam. Notice that only those terms for which xj > 0 are included in the summation. Comparing the naive Bayes classifier, as expressed by Eq. 4.13, with DTWC/PSSF yields some interesting observations. The global discriminative model of DTWC/PSSF is similar to Eq. 4.13 in that the structure of the spam and non-spam score computations (Eqs. 4.7 and 4.8) resembles the summation in Eq. 4.13, and the bias parameter α0 corresponds to the first term in Eq. 4.13. However, there are also significant differences between DTWC/PSSF and naive Bayes. (1) DTWC/PSSF partitions the summation into two parts based on discrimination information, and then learns a linear discriminative model for the classification; naive Bayes, on the other hand, is a purely generative model with no discriminative learning of parameters. (2) DTWC/PSSF, as presented in this work, involves a summation of the terms' relative risks rather than the terms' log relative risks as in naive Bayes; however, as discussed in Chapter 5, other measures of discrimination information can be used instead of relative risk. (3) DTWC/PSSF does not require the naive Bayes assumption of conditional independence of the terms given the class label. (4) The spam and non-spam scores in DTWC/PSSF are normalized (for each e-mail) using the L1 norm. This normalization arises naturally from linear opinion pooling. Document length normalization is typically not done in naive Bayes classification, and when it is done, the L2 norm is used. It has been shown that performing L1 document length normalization improves the precision of naive Bayes for text classification [Kolcz and Yih (2007)].
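As a point of reference, the naive Bayes score of Eq. 4.13 can be computed as follows (a minimal sketch; p_pos and p_neg are assumed to be the smoothed term likelihoods p(x_j = 1 | y = ±)):

    import numpy as np

    def nb_score(x, p_pos, p_neg, prior_pos, prior_neg):
        # Eq. 4.13: log prior odds plus the sum of log likelihood ratios
        # over the terms present in the e-mail; positive score => spam.
        j = x > 0
        return np.log(prior_pos / prior_neg) + np.sum(
            x[j] * np.log(p_pos[j] / p_neg[j]))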


Discriminative Classifiers

Popular discriminative classifiers learn a hyperplane or linear discriminant in the space representing the objects to be classified (e-mails in our case). Let φ : X → V be the function that maps an e-mail x from the T-dimensional input space to v in a d-dimensional feature space. Then, a hyperplane in the feature space is defined by

$$\sum_{j=1}^{d} \alpha_j v_j + \alpha_0 = 0 \qquad (4.14)$$

where αj (j = 1, 2, . . . , d) are the parameters of the hyperplane. DTWC/PSSF's global model is also a linear discriminant. However, this discriminant function is learned in a two-dimensional feature space defined by the spam and non-spam scores and has only two parameters. Input-to-feature space transformation is typically not done for discriminative classifiers like balanced winnow/perceptron and logistic regression. In SVM, this transformation is done implicitly through the inner product kernel k(x, x′) = ⟨φ(x), φ(x′)⟩, where φ(·) is the function that maps from input to feature space. The input-to-feature space transformation in DTWC/PSSF can be written as

$$\phi(x) = [\phi_1(x)\ \ \phi_2(x)]^T = [\mathrm{Score}^{+}(x)\ \ \mathrm{Score}^{-}(x)]^T \qquad (4.15)$$

where the scores are defined in Eqs. 4.7 and 4.8. This represents a linear mapping from a T-dimensional input space to a two-dimensional feature space. The kernel is then defined as follows (after substitution and using vector notation):

$$k(x, x') = \phi^T(x)\,\phi(x') = \bar{x}^T W^{+} \bar{x}' + \bar{x}^T W^{-} \bar{x}' \qquad (4.16)$$

where W+ = w+w+ᵀ and W− = w−w−ᵀ are T × T-dimensional matrices, x̄ = x/‖x‖L1, and x̄′ = x′/‖x′‖L1. The elements of vector w+ (vector w−) are equal to wj (Eq. 4.6) when j ∈ Z+ (j ∈ Z−) and zero otherwise. Noting that the terms in the vocabulary are hard-partitioned, we can write

$$k(x, x') = \bar{x}^T W \bar{x}' \qquad (4.17)$$

where W = W+ + W−. The following observations can be made from the above discussion. (1) DTWC/PSSF performs a linear transformation from the input space to a lower-dimensional feature space. This feature space is formed in such a way that the discrimination between spam and non-spam e-mails is enhanced. Recently, it has been shown that feature space representations are critical to making classifiers robust for domain adaptation [Ben-David et al. (2007), Agirre and de Lacalle (2008)]. (2) The matrices W, W+, and W−, which are symmetric, are smoothing matrices, and the associated kernel is positive semi-definite. (3) The transformation is supervised, requiring information about class labels.
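The kernel identity of Eqs. 4.16 and 4.17 is easy to verify numerically; the following sketch (with made-up weights and a hard term partition) checks that the inner product of the score features equals x̄ᵀWx̄′:

    import numpy as np

    rng = np.random.default_rng(0)
    T = 20
    w_pos = np.where(rng.random(T) > 0.5, rng.random(T), 0.0)
    w_neg = np.where(w_pos == 0, rng.random(T), 0.0)   # hard partition Z+/Z-
    x = rng.integers(0, 2, T).astype(float)
    xp = rng.integers(0, 2, T).astype(float)
    x[0] = xp[0] = 1.0                                 # ensure non-empty e-mails
    xb, xpb = x / x.sum(), xp / xp.sum()               # L1-normalized vectors

    phi = lambda v: np.array([v @ w_pos, v @ w_neg])   # [Score+, Score-]
    W = np.outer(w_pos, w_pos) + np.outer(w_neg, w_neg)
    assert np.isclose(phi(xb) @ phi(xpb), xb @ W @ xpb)  # Eq. 4.17 holds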

Hybrid Classifiers

Jaakkola and Haussler (1998) discuss a hybrid classifier in which the kernel function is derived from a generative model. The input-to-feature space transformation is based on the Fisher score and the resulting kernel is a Fisher kernel. Our input-to-feature space transformation, in contrast, is based on discrimination scores computed from the discrimination information provided by the terms in the e-mails. Raina et al. (2003) present a model for document classification in which documents are split into multiple regions. For a newsgroup message, regions might include the header and the body. In this model, each region has its own set of parameters that are trained generatively, while the parameters that weight the importance of each region in the final classification are trained discriminatively. Kang and Tian (2006) extend naive Bayes by splitting the features into two sets, one of which is handled through a discriminative model and the other through a generative model. Bouchard and Triggs (2004) propose a method to trade off generative and discriminative modeling that is similar to multi-conditional learning because it maximizes a weighted combination of two likelihood terms using one set of parameters; they learn a discriminative model and a generative model and present a combined objective function. Other related works include Kelm et al. (2006) (multi-conditional learning), Andrew (2006) (hybrid Markov/semi-Markov conditional random fields), Suzuki et al. (2007) (structured output learning (SOL)), and Liu et al. (2007) (Bayes perceptron model). Isa et al. (2008) use a naive Bayes approach to vectorize a document according to a probability distribution reflecting the probable categories that the document may belong to; an SVM is then used to classify the documents in this vector space. Their results show that this hybrid approach performs better than naive Bayes on most of the datasets. Raina et al. (2003) present another hybrid classifier in which the input space (defined by the terms) is partitioned into two sets (based on domain knowledge) and the weights for each set of class conditional distributions are learned discriminatively. DTWC/PSSF's global model parameters are similar in purpose; however, the two sets of class conditional distributions in DTWC/PSSF correspond to the sets of significant spam and non-spam terms, which are determined from labeled e-mails.

4.5 Evaluation Setup: Datasets and Algorithms

We present extensive evaluations of DTWC, PSSF1, and PSSF2 on six spam filtering datasets and compare their performances with four other classifiers. We present results for a single global filter trained in a supervised fashion on the training data, as well as for personalized filters trained in a semi-supervised fashion. The robustness and scalability of our algorithms are evaluated in several ways: (1) by varying distribution shift, (2) by evaluating performance on gray e-mails, (3) by testing on unseen e-mails, and (4) by varying filter size. For all the algorithms we report filtering performance with percentage accuracy and AUC values. In the previous chapter the term weights were calculated using term frequencies; our experiments showed that using term occurrences instead slightly improved performance (by more than 0.3% in AUC value). This observation is consistent with results from Forman (2003b) and Montanes et al. (2005); therefore, the results in this chapter and onwards are calculated using boolean feature vectors. In this section, we describe the datasets and the setup for the comparison algorithms. The details of the various evaluations and their results are presented in the next section.


4.5.1 Datasets

Our evaluations are performed on six commonly-used e-mail datasets: ECML-A, ECML-B, ECUE-1, ECUE-2, PU1, and PU2. Some characteristics of these datasets, as used in our evaluations, are given in Table 4.2. The selected datasets vary widely in their characteristics. Two datasets have distribution shift between training and test sets and among different test sets, two datasets have concept drift from training to test sets, and the remaining two datasets have no distribution shift between training and test sets. The number of e-mails in the training and test sets also varies greatly, from 100 to 4,000 in the training sets and from 213 to 2,500 in the test sets. In some datasets the training set is larger than the test set, while in others it is the opposite. The ECML-A and ECML-B datasets correspond to the datasets for task A and task B of the 2006 ECML-PKDD Discovery Challenge [Bickel (2006)], which have been discussed in detail in the previous chapter; we describe the remaining datasets subsequently.

ECUE-1 and ECUE-2 Datasets

The ECUE-1 and ECUE-2 datasets are derived from the ECUE concept drift 1 and 2 datasets, respectively [Delany et al. (2005a)]. Each dataset is a collection of e-mails received by one specific user over the period of one year. The order in which the e-mails are received is preserved in these datasets. The training sets contain 1,000 e-mails (500 spam and 500 non-spam) received during the first three months. The test sets contain 2,000 e-mails (1,000 spam and 1,000 non-spam) randomly sampled from the e-mails received during the last nine months. As such, concept drift exists from training to test sets in these datasets. These datasets are not preprocessed for stop word removal, stemming, or lemmatization.

Table 4.2: Evaluation datasets and their characteristics

                                   ECML-A   ECML-B   ECUE-1   ECUE-2   PU1   PU2
No. of training e-mails              4000      100     1000     1000   672   497
No. of users/test sets                  3       15        1        1     1     1
No. of e-mails per user/test set     2500      400     2000     2000   290   213
Distribution shift                    Yes      Yes    Yes(t)   Yes(t)    No    No

Yes(t) denotes a temporal shift (concept drift).

E-mail attachments are removed before parsing, but any HTML text present in the e-mails is included in the tokenization. A selection of header fields, including Subject, To, and From, is also included in the tokenization. These datasets contain three types of features: (a) word features, (b) letter or single-character features, and (c) structural features, e.g., the proportion of uppercase or lowercase characters. For further details, refer to Delany et al. (2005a).

PU1 and PU2 Datasets

The PU1 and PU2 datasets contain e-mails received by a particular user [Androutsopoulos et al. (2000)]. The order in which the e-mails are received is not preserved in these datasets. Moreover, only the earliest five non-spam e-mails from each sender are retained in the datasets. Attachments, HTML tags, and duplicate spam e-mails received on the same day are removed before preprocessing. The PU1 dataset is available in four versions depending on the preprocessing performed; we use the version with stop words removed. The PU2 dataset is available in the bare form only, i.e., without stop word removal and lemmatization. The PU1 dataset contains 481 spam and 618 non-spam e-mails available in 10 partitions or folders. We select the first 7 folders for training and the last 3 for testing. Within the training and test sets we retain 362 and 145 e-mails, respectively, of each class for our evaluation. The PU2 dataset is also available in 10 folders, with the first 7 folders selected for training and the last 3 for testing. There are 497 training e-mails (399 of which are non-spam) and 213 test e-mails (171 of which are non-spam). For this dataset, we do not sample the e-mails to achieve even proportions of spam and non-spam e-mails because doing so produces very small training and test sets.

4.5.2 Algorithms

We compare the performance of our algorithms with naive Bayes (NB), maximum entropy (ME), balanced winnow (BW), and support vector machine (SVM). One of these is generative (NB) and three are discriminative (ME, SVM, and BW) in nature. For NB, ME, and BW we use the implementations provided by the Mallet toolkit [McCallum (2002)]. For SVM, we use the implementation provided by SVMlight [Joachims (1999a)].


E-mails are represented as term count vectors for the NB, ME, BW, and SVM classifiers. The default algorithm settings provided by Mallet are adopted for NB, ME, and BW. The SVM (using SVMlight) is tuned for each dataset by evaluating its performance on a validation set that is a 25% holdout of the training set. The SVMlight parameter C, which controls the trade-off between classification error and margin width, is tuned for each dataset and each evaluation. Similarly, we evaluate the performance of SVM with both linear and nonlinear kernels and find the linear kernel to be superior. This observation is consistent with that reported in the literature [Lewis et al. (2004), Druck et al. (2007), Zhang et al. (2004)]. We perform e-mail length normalization using the L2 (Euclidean) norm; this improves performance slightly over the unnormalized case, as observed by others as well [Pang et al. (2002), Druck et al. (2007), Zhang et al. (2004)]. We keep the remaining parameters of SVMlight at their default values. Semi-supervised versions of the comparison algorithms are obtained by adopting the naive SSL approach (see Sect. 4.4.3). The semi-supervised algorithms are identified as NB-SSL, ME-SSL, BW-SSL, and SVM-SSL.
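The tuning protocol for the SVM baseline can be summarized in code. The sketch below uses scikit-learn as a stand-in for SVMlight (an assumption made purely for illustration; the experiments reported here used the actual SVMlight implementation):

    from sklearn.svm import LinearSVC
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import normalize

    def tune_svm(X, y, Cs=(0.01, 0.1, 1.0, 10.0, 100.0)):
        # L2 e-mail length normalization, a 25% validation holdout, and a
        # sweep over the error/margin trade-off parameter C.
        Xn = normalize(X, norm='l2')
        Xtr, Xval, ytr, yval = train_test_split(Xn, y, test_size=0.25)
        best_C = max(Cs, key=lambda C: LinearSVC(C=C).fit(Xtr, ytr)
                                                     .score(Xval, yval))
        return LinearSVC(C=best_C).fit(Xn, y)   # linear kernel throughout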

4.6 Results and Discussion

We evaluate DTWC, PSSF1, and PSSF2 under several settings: global spam filtering, personalized spam filtering, varying distribution shift, gray e-mails, generalization to unseen data, and scalability.

4.6.1 Global Spam Filtering

In this section, we evaluate the performance of DTWC, NB, ME, BW, and SVM for global spam filtering. Each algorithm is trained on the training data and evaluated on the test set(s). As such, this represents a supervised learning setting where test data are not available to the algorithms during training. For this and subsequent evaluations (unless specified otherwise), the term selection parameter t of DTWC/PSSF is kept equal to zero (or, equivalently, t is kept equal to one). The effect of varying the term selection parameter is discussed in Sect. 6.2. Tables 4.3, 4.4, 4.5, and 4.6 show the percent accuracy and AUC values of the algorithms on the ECML-A, ECML-B, ECUE (ECUE-1 and ECUE-2), and PU (PU1 and PU2) datasets, respectively.

Table 4.3: Global spam filtering results for ECML-A dataset

        DTWC           NB             ME             BW             SVM
User  Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC
1    91.00  96.35  81.24  82.71  62.20  77.32  61.00  66.01  64.56  72.55
2    92.36  97.37  83.80  86.40  68.16  81.19  64.76  69.46  70.08  79.24
3    87.52  94.59  87.88  94.40  78.92  91.13  73.44  78.52  80.44  90.59
Avg  90.29  96.10  84.30  87.83  69.76  83.21  66.40  71.33  71.69  80.79

Table 4.4: Global spam filtering results for ECML-B dataset

        DTWC           NB             ME             BW             SVM
User  Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC
1    76.00  72.16  57.00  50.24  67.75  87.88  39.75  35.63  49.75  57.80
2    74.50  73.52  54.50  45.22  67.75  82.69  26.50  21.14  44.25  51.73
3    84.50  91.24  67.00  69.77  72.50  81.41  49.00  46.95  63.75  69.20
4    93.50  98.14  73.75  74.52  67.25  81.26  46.50  52.39  53.00  65.33
5    74.25  82.11  68.75  79.26  71.25  83.39  60.00  67.90  64.75  86.28
6    72.25  80.71  63.25  64.49  74.50  81.83  62.25  69.89  68.00  81.78
7    75.25  72.42  62.25  56.43  59.00  73.01  60.00  60.89  54.25  64.11
8    74.50  86.78  58.50  65.27  75.75  86.98  40.75  37.87  65.50  70.53
9    78.75  79.62  61.50  58.97  79.25  89.98  36.00  33.46  62.75  68.28
10   80.00  75.20  58.25  52.56  67.25  87.18  36.00  32.80  46.75  56.57
11   80.25  85.82  62.75  70.71  70.00  83.26  57.75  65.65  65.75  77.29
12   79.75  86.69  69.25  74.99  67.25  81.13  60.25  67.75  64.75  77.94
13   88.75  91.28  70.25  68.70  67.50  80.70  55.00  54.13  66.25  80.19
14   64.75  83.12  58.50  69.62  78.25  86.16  59.75  66.58  71.75  82.19
15   73.25  75.49  65.75  62.64  62.00  81.92  57.50  60.42  54.25  67.93
Avg  78.01  82.29  63.41  64.22  69.81  83.52  49.80  51.56  59.70  70.47

Table 4.5: Global spam filtering results for ECUE-1 and ECUE-2 datasets

           DTWC           NB             ME             BW             SVM
Dataset  Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC
ECUE-1  92.20  96.88  50.05  50.05  78.30  86.63  83.05  90.74  83.30  89.84
ECUE-2  83.45  98.24  50.00  50.00  79.50  84.54  77.50  83.95  76.95  85.62

Table 4.6: Global spam filtering results for PU1 and PU2 datasets

           DTWC           NB             ME             BW             SVM
Dataset  Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC
PU1     98.27  99.65  96.55  97.70  96.89  99.48  97.24  99.27  96.21  99.51
PU2     97.18  97.06  87.32  71.24  94.36  96.65  90.61  89.34  88.26  93.60

Out of the 44 results (percent accuracy and AUC values), DTWC outperforms the rest of the algorithms in 29. The next best algorithm is ME with 13 out of 44 winning results, all of them on the ECML-B dataset. DTWC's performance is significantly better than the other algorithms on all the datasets except ECML-B. The ECML-A dataset involves a distribution shift from training to test sets, where the test sets correspond to e-mails received by different users. For this dataset, the average percent accuracy and average AUC value of DTWC are 5.99% and 8.27% higher, respectively, than those of the next best filter (NB). The ECUE-1 and ECUE-2 datasets also involve a distribution shift from training to test sets in the form of concept drift. For these datasets, the percent accuracy and AUC values of DTWC are at least 3.95% and 7.04% higher, respectively, than the next best results. The PU1 and PU2 datasets involve no distribution shift, as their training and test sets are drawn randomly from e-mails received by a single user. For these datasets too, DTWC outperforms the other classifiers on both accuracy and AUC measures. Usually, in classification settings with differing distributions, techniques that make use of the users' inboxes (unlabeled data) during learning, i.e., transductive or semi-supervised learning, achieve better results [Junejo and Karim (2007b)]. Nonetheless, DTWC, which uses supervised learning, appears to be little affected by this change in distribution between the two sets. The naive Bayes classifier appears to be the second least affected by this change, while SVM performs poorly. The superior performance of DTWC can be attributed to its discriminative term-based models of spam and non-spam and its simple and generalized discriminative model. The ECML-B dataset represents a more challenging spam filtering problem where (1) distribution shift exists between training and test sets and (2) the training and test sets are much smaller in size (100 and 400 e-mails, respectively). On this dataset, DTWC comes out on top in 16 out of 30 results; the next best algorithm is ME with 13 out of 30 winning results. This spam filtering setting, however, is unlikely to occur in practice because large quantities of generic labeled e-mails are readily available for training purposes. When the size of the training set is larger, as for the ECML-A dataset, DTWC outperforms the other algorithms by a wide margin. It is worth noting that SVM performs poorly on datasets involving distribution shift. We study


the impact of varying distribution shift on algorithm performance in a subsequent section. These results demonstrate the robustness of our algorithm and its suitability as a global spam filter at the service side. An extensive evaluation of DTWC for supervised text classification is given in Chapter 5 and in Junejo and Karim (2008), where it is shown that DTWC performs superbly on other text datasets as well.

4.6.2 Personalized Spam Filtering

In this section, we evaluate the performance of PSSF, NB-SSL, ME-SSL, BW-SSL, and SVM-SSL for personalized spam filtering. In this setting, the filter that is learned on the training data is adapted for each test set in a semi-supervised fashion. Performance results are reported for the personalized filters on their respective test sets. Tables 4.7, 4.8, 4.9, and 4.10 show the personalized spam filtering results of the algorithms on the ECML-A, ECML-B, ECUE (ECUE-1 and ECUE-2), and PU (PU1 and PU2) datasets, respectively. The results demonstrate the effectiveness of personalization for datasets with distribution shift (ECML-A, ECML-B, and ECUE). For the ECML-A dataset, the performance of both PSSF1 and PSSF2 improves over that of the global spam filter (DTWC), with PSSF2 outperforming the others in four results and PSSF1 in the remaining two. The performance of the other algorithms (except SVM-SSL) also improves over that of their supervised versions. For the ECML-B dataset, PSSF1 has the highest average percent AUC value, while BW-SSL tops the others in average percent accuracy. For the ECUE-1 dataset, PSSF1 has the best results, while ME-SSL outperforms the others on the ECUE-2 dataset. ME-SSL and BW-SSL have the winning results on the PU1 and PU2 datasets, respectively. In all, PSSF1/PSSF2 outperforms the others in 22 out of 44 results. In most cases where our algorithms do not win, the difference in performance from the winner is minor. The most surprising results are those of BW-SSL on the ECML-B dataset: the performance of BW jumped by about 40% (in both average accuracy and AUC value) after semi-supervised learning. Balanced winnow learns a hyperplane by updating its parameters whenever mistakes in classification are made.


Table 4.7: Personalized spam filtering results for ECML-A dataset

        PSSF1          PSSF2          NB-SSL         ME-SSL         BW-SSL         SVM-SSL
User  Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC
1    87.84  98.60  96.68  98.96  81.84  83.71  81.48  81.06  76.28  81.22  64.56  70.66
2    89.92  98.78  97.24  99.60  80.96  82.54  85.76  86.65  77.80  82.71  70.08  79.30
3    97.00  99.43  93.36  98.92  86.36  88.14  85.16  86.02  81.24  85.42  80.44  90.78
Avg  91.58  98.94  95.80  99.16  83.05  84.79  84.13  84.57  78.44  83.11  71.69  80.24

Table 4.8: Personalized spam filtering results for ECML-B dataset

        PSSF1          PSSF2          NB-SSL         ME-SSL         BW-SSL         SVM-SSL
User  Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC
1    91.25  94.15  76.75  77.79  57.00  56.28  77.75  85.94  83.75  90.88  49.75  53.42
2    93.75  96.50  78.50  79.54  53.00  54.31  82.75  91.53  93.50  97.31  44.25  50.83
3    93.50  96.29  86.00  93.73  64.00  64.22  92.00  95.83  94.75  98.21  63.75  70.59
4    98.50  99.22  98.25  98.91  69.50  70.23  80.25  87.27  90.75  93.74  53.00  60.37
5    82.75  93.79  74.75  93.12  67.75  68.85  79.25  83.66  75.75  71.11  64.75  81.29
6    76.00  78.65  71.25  78.39  63.50  64.54  92.00  95.05  87.75  86.30  68.00  77.81
7    80.25  92.72  76.25  70.79  63.50  63.74  83.50  91.46  91.00  88.78  54.25  60.66
8    91.50  95.46  80.50  91.62  59.00  60.23  92.25  97.52  87.50  91.31  65.50  72.16
9    92.00  99.32  80.00  92.74  60.75  61.48  89.75  94.29  92.25  94.37  62.75  71.15
10   82.75  98.12  80.00  83.55  56.75  57.39  77.25  86.42  94.25  97.84  46.75  49.20
11   90.75  94.08  80.50  88.91  62.50  64.31  88.00  93.80  94.75  96.19  65.75  74.80
12   85.75  91.26  79.25  86.82  66.75  66.89  82.25  86.65  92.75  94.63  64.75  73.65
13   97.00  98.85  90.50  94.98  68.50  68.54  89.50  96.12  92.25  96.55  66.25  79.00
14   75.50  88.21  65.00  84.46  57.75  58.79  92.25  95.86  86.50  86.06  71.75  82.56
15   89.00  90.52  77.00  79.31  66.50  67.96  74.50  82.36  84.75  78.53  54.25  58.17
Avg  88.01  93.81  79.63  86.31  62.45  63.18  84.88  90.91  89.48  90.78  59.70  67.71

Table 4.9: Personalized spam filtering results for ECUE-1 and ECUE-2 datasets

          PSSF1          PSSF2          NB-SSL         ME-SSL         BW-SSL         SVM-SSL
Dataset  Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC
ECUE-1  93.60  97.15  93.35  97.12  50.00  50.00  89.25  88.52  81.00  85.80  83.30  89.36
ECUE-2  84.70  96.55  83.45  96.68  50.00  50.00  95.20  98.61  64.25  92.18  76.95  85.20

Table 4.10: Personalized spam filtering results for PU1 and PU2 datasets

          PSSF1          PSSF2          NB-SSL         ME-SSL         BW-SSL         SVM-SSL
Dataset  Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC    Acc    AUC
PU1     98.62  99.28  98.27  99.28  97.93  97.87  99.65  99.89  97.58  97.37  96.21  98.66
PU2     96.24  97.43  97.18  96.46  82.62  60.48  99.06  99.63  99.53  99.88  88.26  93.00

In theory, the hyperplane learned by BW from the training data should not change during naive SSL. However, in practice, a significant improvement is seen, which can be attributed to the poor convergence characteristics (and lack of robustness) of the BW learning algorithm. This is supported by the observation that SVM and SVM-SSL perform almost identically.

Table 4.11: Comparison with other published personalized spam filtering results (all numbers are average percent AUC values)

Algorithm                                ECML-A   ECML-B
PSSF1                                     98.94    94.57
PSSF2                                     99.16    86.31
Junejo et al. (2006)                      95.07      —
Pfahringer (2006)                         94.91      —
Gupta et al. (2006)                       94.87    90.74
Quionero-Candela et al. (2009)            94.03      —
Meng et al. (2010)                        86.38      —
Trogkanis and Paliouras (2006)            93.65    91.83
Cormack (2006b)                           89.10    94.65
Kyriakopoulou and Kalamboukis (2006)      97.31    95.08
Cheng and Li (2006)                       93.33      —

We compare PSSF's performance with published results on the ECML-A and ECML-B datasets in Table 4.11. Junejo et al. (2006), Pfahringer (2006), and Gupta et al. (2006) are the three top performances, in decreasing order, of task A of the 2006 ECML-PKDD Discovery Challenge [Bickel (2006)]; similarly, Cormack (2006b), Trogkanis and Paliouras (2006), and Gupta et al. (2006) are the three top performances for task B. PSSF outperforms all algorithms on the ECML-A dataset and lags behind the winner by only 0.08% on the ECML-B dataset, which is statistically insignificant. The value of 94.57% for PSSF1 on ECML-B refers to its value in Table 4.14. The reason DTWC/PSSF was not able to take a significant lead over the technique of Cormack (2006b) and the ME classifier on ECML-B, while it has a significant lead on all the other datasets, is that for very small training data such as ECML-B the parameter values have trouble converging; if the parameter values were known, the average AUC would reach 99.89% (last column of Table 4.14). This not only shows the potential of our technique but also emphasizes the need for better ways of finding the parameter values. Details of most of these approaches have been given in Chapter 3; we give details of the remaining approaches here. Kyriakopoulou and Kalamboukis (2006) preprocess the dataset by


clustering the training data with each test set. The combined set is augmented with additional meta-features derived from the clustering and is then learned using a transductive SVM. This approach is computationally expensive and non-adaptive; furthermore, they use a different learning setting, so their results are not directly comparable. Quionero-Candela et al. (2009) evaluate their minimax approach, named FDROP, only on the ECML-A dataset. FDROP employs a feature elimination strategy that eliminates features for every sample individually. Meng et al. (2010) propose TRSVD, a transfer learning approach based on singular value decomposition; their result on ECML-A lags behind every other approach by a big margin, and they do not report results for the ECML-B dataset. Cheng and Li (2006) present a semi-supervised classifier ensemble approach for the personalized spam filtering problem. Their approach is also computationally expensive compared to PSSF, and it lags in performance by more than 5% on the ECML-A dataset (they do not report results for the ECML-B dataset).

4.6.3 Varying Distribution Shift

In this section, we evaluate the performance of DTWC/PSSF, NB, ME, BW, and SVM under varying distribution shift between training and test data. This evaluation is performed on the ECML-A dataset by swapping varying numbers of e-mails between the training and test (user) data. As the number of swapped e-mails increases, the distribution shift between training and test data reduces. To illustrate the evaluation procedure, suppose 100 randomly selected e-mails from user 1 are moved to the training data and 100 randomly selected e-mails from the training data are moved to user 1's e-mails. The filters are then trained and tested using the modified training and test data.
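The swap protocol is simple to express in code; a sketch (array names are illustrative):

    import numpy as np

    def swap_emails(L_X, L_y, U_X, U_y, n, seed=0):
        # Move n random e-mails each way between the training data (L)
        # and a user's test data (U); larger n => smaller distribution shift.
        rng = np.random.default_rng(seed)
        i = rng.choice(len(L_X), n, replace=False)
        j = rng.choice(len(U_X), n, replace=False)
        L_X[i], U_X[j] = U_X[j].copy(), L_X[i].copy()
        L_y[i], U_y[j] = U_y[j].copy(), L_y[i].copy()
        return L_X, L_y, U_X, U_y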

Table 4.12: Performance under varying distribution shift. Average percent accuracy and AUC values are given for the ECML-A dataset.

   #   DKL(×10⁻⁷)  DTV(×10⁻⁷)   DTWC          PSSF1         PSSF2         NB            ME            BW            SVM
                                Acc   AUC     Acc   AUC     Acc   AUC     Acc   AUC     Acc   AUC     Acc   AUC     Acc   AUC
   0      717         444      90.29 96.10   91.58 98.94   95.80 99.16   84.30 87.83   69.76 83.21   66.40 71.33   71.69 80.79
 100      517         384      93.30 98.04   92.86 98.77   96.01 99.11   92.08 95.84   87.90 94.53   75.53 82.73   88.12 94.96
 250      477         336      94.68 98.72   92.40 98.39   96.44 99.02   93.70 97.05   92.80 97.36   82.13 88.87   92.08 97.70
 500      346         259      96.60 99.38   94.92 98.41   96.56 99.14   95.57 98.01   96.18 98.68   87.29 92.91   95.70 99.17
1500      157         127      97.63 99.55   96.41 98.56   96.94 98.97   96.39 97.82   97.97 99.01   95.08 98.11   97.30 99.68
 Avg                           94.50 98.35   93.63 98.61   96.35 99.08   92.40 95.31   88.92 94.55   81.28 86.79   88.97 94.46

This procedure is repeated for each user and for different numbers of e-mails swapped. To quantify the distribution shift between training and test data we adapt the KL-divergence and total variation distance as follows:

$$D_{KL}(L, U) = \frac{1}{2T}\left[\sum_j p_L(x_j|+)\log\frac{p_L(x_j|+)}{p_U(x_j|+)} + \sum_j p_L(x_j|-)\log\frac{p_L(x_j|-)}{p_U(x_j|-)}\right]$$

$$D_{TV}(L, U) = \frac{1}{2T}\left[\sum_j \left|p_L(x_j|+) - p_U(x_j|+)\right| + \sum_j \left|p_L(x_j|-) - p_U(x_j|-)\right|\right]$$

where DKL(·,·) and DTV(·,·) denote the adapted KL-divergence and total variation distance, respectively, L and U identify the training and test data, and T is the total number of distinct terms in the training and test data. The quantity DKL (DTV) is computed as the average of the KL-divergence (total variation distance) for the spam and non-spam conditional distributions, normalized by T. The normalization ensures that these quantities range from 0 to 1 for all training-test data pairs irrespective of the number of terms in them. Table 4.12 shows the average performance over the three test sets in the ECML-A dataset. It is seen from this table that as the number of e-mails swapped between training and test sets (given in the first column of Table 4.12) increases, the distribution shift between the sets decreases, as quantified by the values of DKL and DTV. More interestingly, it is observed that as the distribution shift decreases, the performance gap between DTWC/PSSF and the other algorithms narrows. The performance of all the algorithms improves with the decrease in distribution shift, especially for ME, BW, and SVM; for example, the average accuracy of ME jumps by 28.21% from the case when no e-mails are swapped to the case when 1,500 e-mails are swapped. Our supervised spam filter, DTWC, comprehensively outperforms the other algorithms when the distribution shift is large, while its performance compares well with the others at low distribution shift. Another observation from this evaluation is that as the distribution shift decreases, the benefit of semi-supervised learning diminishes.
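The adapted divergences are straightforward to compute from the class-conditional term probabilities; a sketch (assumes smoothed, strictly positive probability vectors):

    import numpy as np

    def adapted_divergences(pL_pos, pL_neg, pU_pos, pU_neg):
        # pX_c[j] = p_X(x_j = 1 | c) for X in {L, U} and c in {+, -}.
        T = len(pL_pos)
        d_kl = (np.sum(pL_pos * np.log(pL_pos / pU_pos)) +
                np.sum(pL_neg * np.log(pL_neg / pU_neg))) / (2 * T)
        d_tv = (np.sum(np.abs(pL_pos - pU_pos)) +
                np.sum(np.abs(pL_neg - pU_neg))) / (2 * T)
        return d_kl, d_tv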


Table 4.13: Performance on gray e-mails identified from user 1 and user 2 e-mails in the ECML-A dataset. For DTWC and PSSF2, the table gives the number of gray e-mails that are misclassified. s = similarity threshold; GE = gray e-mails.

              User 1                      User 2
  s    No. of GE  DTWC  PSSF2    No. of GE  DTWC  PSSF2
 0.5      1471     130    65        1443     132    46
 0.6      1016      79    39        1109      97    40
 0.7       622      46    34         588      45    22
 0.8       174      12    10         216      17     6
 0.9        34       3     1          71       6     1

4.6.4 Gray E-mails

Gray e-mails were introduced and defined in Sect. 4.2. They present a particularly challenging problem for spam filters because similar e-mails are labeled differently by different users. To evaluate the performance of DTWC/PSSF on gray e-mails, we devise an experiment based on the e-mails of user 1 and user 2 in the ECML-A dataset. Consider a graph consisting of two sets of vertices corresponding to the e-mails of user 1 and user 2. Edges in the graph exist between vertices in the two sets if the corresponding e-mails have a cosine similarity greater than a specified threshold, s. The cosine similarity between two e-mails xi and xj is defined as

$$\mathrm{sim}(x_i, x_j) = \frac{x_i^T x_j}{\|x_i\|\,\|x_j\|}$$

where ‖x‖ = √(xᵀx). Given this graphical representation and the true labels of the e-mails, an e-mail belonging to user 1 (user 2) is a gray e-mail if an edge exists from its vertex to one or more vertices belonging to user 2 (user 1) and the labels of the two e-mails are different. This procedure identifies the set of gray e-mails for user 1 and the set of gray e-mails for user 2. We evaluate the performance of DTWC/PSSF on these sets of e-mails by reporting the number of e-mails that are incorrectly labeled by the algorithms. The performance of DTWC and PSSF2 on gray e-mails is reported in Table 4.13. For each user, the table shows the total number of gray e-mails according to the specified similarity threshold s and the number of gray e-mails incorrectly classified by DTWC and PSSF2. The results show that the personalization performed by PSSF2 results in a significant reduction of errors on gray e-mails. This

is attributable to the adaptation of the local and global models on the user e-mails. The most significant improvement in classification on gray e-mails occurs when the discriminative model is adapted (therefore results of PSSF1 are not shown in the table). This observation is consistent with the fact that gray e-mails result from target shift which is better tracked by adapting the discriminative model.
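The gray e-mail identification procedure described above can be sketched as follows (X1, X2 are the term matrices of the two users' e-mails and y1, y2 their true labels; names are illustrative):

    import numpy as np

    def gray_emails(X1, y1, X2, y2, s=0.7):
        # An e-mail of user 1 is gray if some user-2 e-mail exceeds cosine
        # similarity s while carrying a different label.
        n1 = X1 / np.maximum(np.linalg.norm(X1, axis=1, keepdims=True), 1e-12)
        n2 = X2 / np.maximum(np.linalg.norm(X2, axis=1, keepdims=True), 1e-12)
        sim = n1 @ n2.T                         # pairwise cosine similarities
        conflict = y1[:, None] != y2[None, :]   # label disagreement
        return np.where(((sim > s) & conflict).any(axis=1))[0]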

4.6.5 Effect of Multiple Passes

We explored the effect of multiple passes of the semi-supervised step on the results. Table 4.14 shows the results of multiple passes of PSSF1 on the ECML-B dataset. There is a slight increase in performance from pass 1 to pass 2, and then it starts to decrease slightly from pass 3. Most of the inboxes benefit from the increase in passes, but some user inboxes witness a drastic decrease, bringing down the average; e.g., the average AUC value is pulled down by one user (inbox 6) whose AUC value actually decreases with the number of passes. The last column is the result that PSSF1 could achieve if the labels of the test data were available, which is not the case in real problems; this shows the potential of our technique and emphasizes the need for better ways of finding the parameter values.
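A sketch of the multi-pass variant of PSSF1 (reusing the illustrative functions from the DTWC sketch earlier; the global model learned on L stays fixed, matching PSSF1):

    import numpy as np

    def pssf1_multipass(L_X, L_y, U_X, passes=2):
        w_pos, w_neg = train_dtwc(L_X, L_y)
        s_pos, s_neg = dtwc_scores(L_X, w_pos, w_neg)
        a_plus, a0 = fit_global(s_pos, s_neg, L_y)
        for _ in range(passes):
            u_pos, u_neg = dtwc_scores(U_X, w_pos, w_neg)
            labels = np.where(a_plus * u_pos - u_neg + a0 > 0, 1, -1)
            w_pos, w_neg = train_dtwc(U_X, labels)   # rebuild local model
        u_pos, u_neg = dtwc_scores(U_X, w_pos, w_neg)
        return np.where(a_plus * u_pos - u_neg + a0 > 0, 1, -1)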

4.6.6 Generalization to Unseen Data

In the previous subsections, we presented results of PSSF1 and PSSF2 on the test sets where the unlabeled e-mails in the test sets were utilized during learning. However, in practice, once a personalized spam filter is learned using labeled and unlabeled e-mails, it is applied to unseen e-mails for a while before it is adapted again. The performance over these unseen e-mails represents the generalization performance of the filter. We evaluate the generalization property of PSSF by splitting the test set into two: split 1 is used during semi-supervised learning and split 2 contains the unseen e-mails for testing. The generalization performance of PSSF1 and PSSF2 on the ECML-A dataset is shown in Fig. 4.7. In general, the difference between the average percent AUC values for split 1 and split 2 is typically less than 1% for PSSF1 and less than 0.5% for PSSF2. Furthermore, the decrease in average AUC value with decreasing size of split 1 (increasing size of split 2) is graceful.


Table 4.14: Effects of multiple passes of PSSF1 on ECML-B. The values are AUC values in percentages.

Inbox   DTWC   PSSF1 Pass 1   PSSF1 Pass 2   PSSF1 Pass 3   Optimal
1      72.16      94.15          96.89          96.72        99.93
2      73.52      96.50          96.98          96.61        99.99
3      91.24      96.29          96.72          96.75        99.99
4      98.14      99.22          99.12          99.12        99.98
5      82.11      93.79          94.70          95.05        99.66
6      80.71      78.65          74.11          69.90        99.96
7      72.42      92.72          91.81          90.79       100.00
8      86.78      95.46          95.96          96.16        99.70
9      79.62      99.32          99.39          99.24       100.00
10     75.20      98.12          99.19          98.05       100.00
11     85.82      94.08          95.88          96.24        99.97
12     86.69      91.26          92.54          92.36        99.61
13     91.28      98.85          99.60          99.51        99.88
14     83.12      88.21          90.23          90.74        99.84
15     75.49      90.52          95.50          97.92        99.92
Avg    82.29      93.81          94.57          94.38        99.89

[Figure 4.7: Generalization performance on ECML-A dataset (average AUC of PSSF1 and PSSF2 on splits 1 and 2 versus the size of split 1 as a percentage of the test set)]

PSSF2, in particular, exhibits excellent generalization performance. It is able to learn the personalized filter for each user from a small number of the user's e-mails. This characteristic of PSSF2 stems from the realignment of the decision line after considering the user's e-mails. These results also demonstrate the robustness of PSSF.

4.7 Conclusion

In this chapter, we presented and evaluated a new technique for robust text classification that combines local generative models and global discriminative classifiers through the use of discriminative term weighting and linear opinion pooling. Terms in the documents are assigned weights based on relative risk measures that quantify the discrimination information a term provides for one category over the others. These weights, named discriminative term weights (DTW), also serve to partition the terms into two sets. A linear opinion pooling strategy consolidates the discrimination information of these sets, yielding a new two-dimensional feature space that serves as the basis for a discriminant function. A strategy for personalizing the local and global models through two semi-supervised variants of our technique is also presented; the de-coupling of the local and global models allows partial updating of the model. In addition to the supervised version, named DTWC, we present two semi-supervised versions, named PSSF1 and PSSF2, that are suitable for personalized spam filtering. PSSF1/PSSF2 can be adapted to the distribution of e-mails received by individual users to improve filtering performance. We evaluated DTWC/PSSF on six e-mail datasets and compared its performance with four classifiers. DTWC/PSSF outperforms the other algorithms in 51 out of 88 results (accuracy and AUC) in global and personalized spam filtering. In particular, DTWC/PSSF performs remarkably well when distribution shift exists between training and test data, which is common in e-mail systems. We also evaluated our algorithms under varying distribution shift, on gray e-mails, on unseen e-mails, and under varying filter size (in Chapter 6). Our personalized spam filter, PSSF, is shown to scale well for personalized service-side spam filtering. A theoretical comparison of the proposed technique with popular generative, discriminative, and hybrid classifiers is also provided. The nature of text spam filtering and the challenges to

effective global and personalized spam filtering are also discussed. We define key characteristics of e-mail classification, such as distribution shift and gray e-mails, and relate them to machine learning problem settings. The results demonstrate the robustness, scalability, adaptability, and efficiency of our technique and its suitability for global and personalized spam filtering at the service side. In the next chapter we evaluate the viability of DTWC/PSSF for the general problem of text classification, as TC is different from spam filtering in terms of distribution shift, dictionary size, datasets, test set size, and many other aspects.


Chapter 5

Discriminative Term Weighting Based Text Classification

5.1 Introduction

Content-based e-mail spam filtering, despite its importance, remains just one of many active application areas of text classification research. Content-based spam filtering, whether for e-mails, instant messages, short messaging service (SMS), web pages, or blogs, comprises different problems in their own right, yet these share a fair number of similar challenges, such as the adversarial behavior of spammers, the uneven cost of errors, and the lack of labeled data. On the other hand, typical applications of TC, such as the categorization or arrangement of documents in a database, the filtering of news articles, and the classification of reviews, differ from spam filtering problems for a number of reasons. These differences are due to the intrinsic characteristics of the data as well as external influences from the problem domain. Therefore, each application area of TC has different characteristics and dimensions that need to be addressed separately. Consequently, an approach that exhibits good performance for one problem might not fare well for other text classification problems. In the previous chapter we demonstrated the performance of DTWC/PSSF for the spam filtering problem, but its effectiveness remains to be demonstrated for general TC applications. For these


applications, several questions arise: how should DTWC/PSSF be used in a multi-class setting? Will it retain its performance edge, or will it be overkill? Will the algorithm need modification, or will it work as-is? Will it retain its scalability and robustness? Will the current weighting scheme suffice for all problems? Addressing these issues, in this chapter we start by defining the text classification problem, followed by its differences from spam filtering. We then look at the modifications to DTWC/PSSF for addressing the aforementioned issues. We provide the multi-class formulation, followed by an evaluation with five term weighting strategies, inspired by information theory and the medical domain, for discriminating and partitioning the terms: relative risk, log-relative risk, odds, log-odds, and Kullback-Leibler divergence. These strategies weight the terms based on the discrimination information they provide for one category over the others. A two-dimensional one-category-versus-others feature space is constructed through a linear opinion pool. We show the limitation of this transformed space and then learn the classification function through a simple discriminant function. In addition to the datasets of the previous chapters, we evaluate our method on three benchmark text classification datasets, each belonging to a different application area, namely 20 Newsgroups, Movie Review, and Simulated/Real/Aviation/Auto (SRAA). Here also, the results are compared with four common text classification methods, demonstrating the overall effectiveness of our method for general text classification.

5.2 The Nature of the Text Classification Problem

A prototypical supervised text classification problem can be stated as the classification of documents into categories or classes based on their textual content, given a set of labeled documents. The idea is to learn a classifier or filter from a sample of labeled documents which, when applied to unlabeled documents, assigns the correct labels to them. This description of a text classifier holds for problems with more than two classes; hence DTWC/PSSF, described in Chapter 4, needs to be defined for multiple classes. Let X be the set of all possible documents and Y = {1, 2, . . . , |Y|} be the set of possible labels, where |Y| is the total number of categories. The true label of a document x ∈ X is given by the

target function Φ(x) : X → Y, which is unknown. The problem of supervised text classification can then be defined as follows:

Definition 5 (Text Classification). Given a set of training documents {(xi, yi)}ᴺᵢ₌₁, where xi ∈ L ⊂ X and yi = Φ̂(xi), and a set of test documents U ⊂ X, learn the function Φ̄(x) : U → Y such that Φ̄(x) = Φ(x), ∀x ∈ U.

The sets of documents L and U correspond to the documents in the training and test data, respectively, and it is generally understood that U ∩ L is a small or null set. A document is represented as a binary vector xi = ⟨xi1, xi2, . . . , xi|T|⟩, where xij ∈ {0, 1} indicates whether term j (typically a word) exists in document i or not. The integer |T| is the number of terms in the dictionary of L and U (after standard preprocessing of stop word removal and stemming). Just as in the case of spam filtering, it is reasonable to expect that the function Φ̂(x) by which the training data are labeled is different from the true target function Φ(x). This is because, even though the adversarial element is missing in TC, label noise is introduced during the process of producing the labeled training data. Additionally, the distinction between the classes is not always strong, and a document can easily be assigned to more than one category because of subjectivity. In general, the nature of this noise is unknown and cannot be modeled accurately, and hence it is ignored for most TC problems. For our discussion in this chapter, it is assumed that each document is assigned to one class only, that noise is not present in the document's content, that the terms and categories are just symbolic labels without semantics, and that no additional knowledge of a procedural or declarative nature is available. Text classification, as defined above, is different from spam filtering for a number of reasons. Most TC problems are multi-class problems, with many having a hierarchy among the topics; e.g., a sports category may comprise tennis, football, American football, and cricket subcategories. Therefore the difference between the classes is not as prominent as for spam filtering, where each e-mail (or document) is either spam or not; in fact, in some cases there might even be an overlap between the classes. This is one reason the performance on general TC problems is usually lower than that of spam filters. The generalization of a two-class approach to a multi-class problem is, in some cases, not straightforward either.

There is either no or very little distribution shift between the training and test data in general TC problems, because in most cases the test data is drawn from the same distribution as the training data. The chronological order of the documents is not as important either. Since there is no adversarial element, the distribution also does not change as frequently over time. For example, in movie review classification, the reviewer is not trying to fool the reader into inferring the wrong label, nor does he intentionally misspell words or play other tricks to fool the classifier. It does happen that certain words and expressions go out of fashion while new slang words get introduced, but this phenomenon is infrequent and takes place over such a long period that learning algorithms tend not to worry about it; it can be handled by re-building the model every year or two. The length of a text document in general is much larger than that of an e-mail. Therefore, by Heaps' law [C. D. Manning (2009)], the set of documents L is more sparsely represented in the term space. The dimensionality of the term space is much larger than that of the spam filtering problem because spam e-mails lack diversity of topics, most of them being concentrated around very few topics/products, whereas in a digital repository the documents not only belong to various topics but are linguistically richer as well. Multiple categories also increase the dimensionality of the term space substantially. Typically the misclassification cost is the same for every class in TC problems, while in spam filtering the cost of false positives is very high; therefore, a good spam filter might not be suitable for a general TC problem. Even though we provide the AUC values, we primarily compare the approaches in this chapter based on accuracy. Despite these challenges, text classification methods must be efficient in order to handle large volumes of data in various applications in real time. In view of the above differences, we look at the following issues in this chapter:
-Evaluate DTWC/PSSF on general text classification datasets from domains other than spam.
-The relative risk measure used as a weighting measure handled distribution shift nicely, but there might be other measures that perform better for problems with small or no distribution shift (such as typical TC problems). Since different weighting measures will perform differently for


different problems, we generalize our previous weighting scheme to a function that assigns weights to the terms. We define the constraints on this function and evaluate four new weighting functions.
-We provide the multi-class formulation of DTWC/PSSF. We look at a number of ways to generalize the two-class approach to the multi-class setting and weigh their pros and cons before choosing one.
-Due to the lack of distribution shift between the training and test data, we try to determine whether semi-supervised learning helps or not.

5.3 From Two-Class to Multi-Class

Usually classifiers are developed to distinguish between only two classes or categories. Mostly, a discriminant function is optimized such that examples with classification values greater than a certain threshold are assigned to one class while the remaining are assigned to the other class. For example, in the Bayes classifier, the discriminant function is the difference of the posterior probabilities of the two classes; in other words, the example is assigned to the class with the highest posterior probability. A benefit of using the posterior probabilities is that they generalize easily to multiple classes: after estimating the probability densities of all the classes, a new example is (just as in the two-class problem) assigned to the class with the largest posterior probability. If the discriminant functions are not based on estimated probability densities (i.e. are not generative) but on a discriminative approach, then the generalization to a multi-class problem is not so straightforward. There are many ways to convert a two-class approach to a multi-class approach, but two are commonly used: 1 vs 1 and 1 vs All. In the first approach, a classifier is trained between each pair of classes. This requires a total of |Y|(|Y| − 1)/2 models to be learned. Each pair, represented with the double indices (ya, yb), where ya, yb ∈ Y, outputs the label ya or yb depending on the proximity of document xj to the discriminant function. The discriminant between (ya, yb) is learned on the subset S of the training data L such that {(xi, yi)}ᴺᵢ₌₁, where xi ∈ S ⊂ L and yi ∈ {ya, yb}. Intuitively, this approach should result in poor performance because when we learn

a classifier for class yi, there will be (|Y| − 1)(|Y| − 2)/2 classifiers that have never seen documents from class yi. Deciding on the votes of these unrelated classifiers may result in an almost random classification, but we will see shortly that this is not the case. This problem grows with the number of classes. In the second approach, a classifier between one class (ya) and the rest of the classes (Y \ ya) is learned, yielding a total of |Y| classifiers. The discriminant function for class ya then decides whether xj belongs to class ya or not. This is done by setting the labels of all classes other than ya to ya−, i.e., not belonging to ya. For the above two scenarios, the classifier output can be determined in two ways. The first is to apply a voting mechanism, where each classifier votes either in favor of or against a certain class. In this scenario the classifiers only have to output a binary answer, and the decisions of these classifiers are combined using some combining rule, commonly voting. In voting, the document xi is assigned to the class that gets the highest number of votes. For the 1 vs 1 approach, this results in the casting of |Y|(|Y| − 1)/2 votes, and a class can secure anywhere from 0 to |Y| − 1 votes, while in the 1 vs All case a class can get a maximum of one vote only. The second way is for each classifier to output a real number: the proximity to the discriminant, the estimated class probability, or something related to them; the document xi is then assigned to the classifier with the maximal classification value or output. Both of these cases have a few problems. Voting can result in a tie between two or more classes, or in a situation where a document does not get any class vote at all. This is elaborated in Fig. 5.1: region R1 is territory that two classifiers claim as theirs, while region R2 is left undefined, with every classifier rejecting an instance that lies there. R3 corresponds to the region where each instance is claimed by all three classifiers. Conflicts like these can be resolved by assigning a random class, or the class with the highest prior probability, to such a document, but in a multi-class setting this results in substantial error. In the second case, where classifiers output a probability/distance/confidence, the problem of ties and rejections does not arise to begin with.

[Figure 5.1: Linear classifiers for a three-class problem in two dimensions under the 1 vs All (a) and 1 vs 1 (b) settings]

The output of these classifiers could be combined using the maximum rule, i.e. the document xi is assigned to the class with the maximal output. This case also gives us the flexibility to identify outliers and to reject the least confident outputs; a different criterion, or even a different classifier, could be used to assign labels to such instances. Tax and Duin (2002), in their study, along with the above example, also give an empirical comparison of the preceding four scenarios, i.e. 1 vs 1 voting, 1 vs 1 confidence output, 1 vs All voting, and 1 vs All confidence output. They evaluated the performance of three linear classifiers, namely linear discriminant analysis (LDA), the Fisher linear discriminant (FLD), and the linear support vector classifier (SVC), on five datasets (vowel, vehicle, crabs, waveform, and digits, all available in the UCI ML Repository). We summarize their results in Table 5.1, where each value is an average of the percentage errors over the five datasets. For detailed results on each dataset, the reader is referred to Tax and Duin (2002).

Table 5.1: Classification errors (in %), averaged over the five datasets.

Classifier | Voting, 1 vs All | Voting, 1 vs 1 | Confidence, 1 vs All | Confidence, 1 vs 1
LDA        | 33.50            | 29.30          | 22.78                | 23.32
FLD        | 36.24            | 19.18          | 22.72                | 57.28
SVC        | 31.30            | 14.82          | 25.76                | 60.98


For decisions based on voting, 1 vs All classifies the instances more accurately than 1 vs 1, but due to a very high rejection rate, the overall error after random assignment of labels to the rejected instances is much higher for 1 vs All. When a maximal real-valued classifier output is used for classification, on the other hand, there is no rejection or tie on any instance, and 1 vs All fares significantly better than 1 vs 1. 1 vs 1 with voting outperforms 1 vs All with confidence for the FLD and SVC techniques and claims the top spot, but for text classification problems this observation may not hold. If we look at the finer details, 1 vs 1 with voting consistently performs worse on the digits dataset; with 256 dimensions (attributes) it is the biggest of the 5 corpora used, the second biggest having only 21 dimensions. The dimensionality of a typical text classification problem is usually a hundred times larger than that of the digits dataset, and at times much more. Secondly, for the above mentioned five datasets the training and test data are drawn from the same distribution, whereas in text classification the distributions of the training and test data may differ, either because of their temporal nature or depending on the sources from which they were collected. This would lead to a greater number of rejects, a phenomenon that the voting mechanism handles poorly. This analysis leads us to the conclusion that 1 vs All with confidence is the best choice for transforming a two-class technique into a multi-class one for text classification problems. In the next section we work out the multi-class formulation with the provision of selecting different weighting schemes.
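To make the contrast concrete, the following is a minimal sketch (not from the thesis) of the two decision rules for the 1 vs All setting; the function names and the scores array are illustrative assumptions.

import numpy as np

def one_vs_all_voting(scores):
    """Binary voting: classifier k votes for its class iff its output > 0.
    Can reject a document (no classifier claims it) or leave a tie
    (several classifiers claim it)."""
    votes = np.asarray(scores) > 0
    if votes.sum() == 0:
        return None                                   # rejection
    winners = np.flatnonzero(votes)
    return int(winners[0]) if len(winners) == 1 else None   # tie

def one_vs_all_confidence(scores):
    """Maximum rule: pick the class whose classifier is most confident;
    ties and rejections do not arise."""
    return int(np.argmax(scores))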

5.3.1 Discriminative Term Weighting

A document is represented as a binary term occurrence vector $\mathbf{x} = \langle x_1, \ldots, x_{|T|} \rangle$, and term weighting is viewed as a measure of the relevance of a term for the classification problem. In Sect. 4.4.1, we introduced our discriminative term weights (DTW), where each term is weighted by the discrimination information it provides for a specific category over the others. They are derived from the whole of the training data L in a supervised fashion and help in term space partitioning and term selection. In this chapter, in addition to the relative risk weighting measure, we present four more discriminative term weighting strategies, log-relative risk, odds, log-odds, and KL divergence, while keeping in view the multi-class nature of text classification.


If a document x contains a term j (i.e. xj = 1) then it is more likely to belong to category y = k if the likelihood of xj occurring in k is greater than its likelihood of occurring in the other categories, i.e. p(xj = 1|y = k, L) is greater than p(xj = 1|y = Y\k, L), where the notation Y\k denotes all categories but k. Equivalently, a document x is likely to belong to category k if this ratio for category k is greater than one:

\[
\frac{p(y = k \mid x_j = 1)}{p(y = Y{\setminus}k \mid x_j = 1)} \;\propto\; \frac{p(x_j = 1 \mid y = k)}{p(x_j = 1 \mid y = Y{\setminus}k)} > 1 \tag{5.1}
\]

In the above and subsequent equations, we omit the conditioning on the labeled set L for brevity. We quantified the discrimination information that a term j provides regarding category k over categories Y\k by calculating a weight using relative risk. From here onwards we consider the weighting strategy as a function on the terms of the training data L. The weight of a term j for class k is therefore defined as

\[
w_j^{k} = g(j)^{k} \tag{5.2}
\]

where j varies from 1 to |T| and k ∈ Y. The function g(j) can be any measure that captures the discrimination information or distance between the two distributions given a term j; it does not necessarily have to be a metric. There are certain properties that g(j) should satisfy to work in this framework:

1. The function always assigns a non-negative value, i.e. g(j)^k ≥ 0.

2. The smallest value (either 0 or 1) is assigned only if the distribution of the term is the same in both classes y = k and y = Y\k, i.e. g(j)^k = g(j)^{Y\k}.

3. The weight of the term j is proportional to the difference between p(xj = 1|y = k) and p(xj = 1|y = Y\k), with the maximum value reaching infinity, which is avoided using smoothing.

One way of defining g(j)^k is to weigh the term by its likelihood of occurring in documents belonging to category k over documents of categories Y\k:

\[
g(j)^{k} =
\begin{cases}
a_j / b_j & \text{when } a_j > b_j \\
b_j / a_j & \text{otherwise}
\end{cases}
\tag{5.3}
\]

where aj = p(xj = 1|y = k) and bj = p(xj = 1|y = Y\k). Notice that the discrimination information the term j provides for categories Y\k over category k is bj/aj. Thus, the smallest weight assigned by Eq. 5.3 is one, while the highest could be infinity, which is avoided by smoothing, discussed later. This corresponds to the relative risk (RR) measure that was used in Sect. 4.4.1. Another strategy for discriminative term weighting is log-relative risk (LRR). Using this strategy, the weight for term j is defined as

\[
g(j)^{k} =
\begin{cases}
\log(a_j / b_j) & \text{when } a_j > b_j \\
\log(b_j / a_j) & \text{otherwise}
\end{cases}
\tag{5.4}
\]

Following this strategy, the smallest weight is zero, which is consistent with no discrimination information. Eqs. 5.3 and 5.4 are monotonically related, with the former always giving a larger value than the latter; this difference in values becomes greater with increasing difference between aj and bj. Similarly, the third and fourth strategies are to use the odds ratio (OR) and log-odds ratio (LOR) for DTWs. In statistics, the odds ratio is the ratio of the odds of an event occurring in one group to the odds of it occurring in another group, where the odds are the likelihood of the event happening relative to the likelihood of it not happening, expressed as a multiple (if equal to or greater than one) or a fraction (if less than one). The weights for term j using OR and LOR are therefore defined as

\[
g(j)^{k} =
\begin{cases}
\dfrac{a_j}{1-a_j} \Big/ \dfrac{b_j}{1-b_j} & \text{when } a_j > b_j \\[2mm]
\dfrac{b_j}{1-b_j} \Big/ \dfrac{a_j}{1-a_j} & \text{otherwise}
\end{cases}
\tag{5.5}
\]

and

\[
g(j)^{k} =
\begin{cases}
\log\left(\dfrac{a_j}{1-a_j} \Big/ \dfrac{b_j}{1-b_j}\right) & \text{when } a_j > b_j \\[2mm]
\log\left(\dfrac{b_j}{1-b_j} \Big/ \dfrac{a_j}{1-a_j}\right) & \text{otherwise}
\end{cases}
\tag{5.6}
\]

Like Eq. 5.3, the minimum weight possible for Eq. 5.5 is one, and similarly Eq. 5.6, like Eq. 5.4, can have zero as the minimum weight. Eqs. 5.3 and 5.5 are monotonically related to their logarithmic versions 5.4 and 5.6, respectively. These two equations always give a larger value than their logarithmic versions, and this difference becomes prominent as the difference between the values of aj and bj increases. The fifth and final weighting strategy introduced in this chapter for discriminative term weighting is an information-theoretic measure known as the Kullback-Leibler (KL) divergence. The KL divergence of probability distribution p(x) from q(x) is defined as

\[
D_{KL}\left(p(x) \,\|\, q(x)\right) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
\]

The KL divergence can also be interpreted as the expected discrimination information for p(x) over q(x). In our context, the two probability distributions are p(xj |y = k) and p(xj |y = Y \k) where xj can take on values of zero and one. Then, the expected discrimination information provided by knowledge of term j for category k over other categories is given by the KL divergence as

\[
g(j)^{k} = D_{KL}\left(p(x_j \mid y = k) \,\|\, p(x_j \mid y = Y{\setminus}k)\right) \tag{5.7}
\]
\[
\phantom{g(j)^{k}} = a_j \log \frac{a_j}{b_j} + (1 - a_j) \log \frac{1 - a_j}{1 - b_j} \tag{5.8}
\]

Except for RR and log-RR, the remaining three DTW strategies consider both the occurrence and the absence of a term. Equation 5.7 is also monotonically related to the other four but is not symmetric. Even though the RR measure asymptotically approaches OR for small probabilities, the two are quite different: if the values of aj and bj are sufficiently small, then Eqs. 5.5 and 5.6 approximate Eqs. 5.3 and 5.4, respectively, but the distinction becomes prominent for medium to high probabilities. For example, if a word occurs in class k with probability 0.999 and in Y\k with probability 0.99, then the relative risk is just over 1, while the odds ratio is more than 10 times higher. In statistical modeling, approaches like Poisson regression have relative risk interpretations, where the estimated effect of an explanatory variable is considered multiplicative on the rate, which leads to relative risk; if we instead treat the effect of an explanatory variable as multiplicative on the odds, then logistic regression can be interpreted in terms of the odds ratio. In medical research, OR is favored for case-control and retrospective studies while RR is used in randomized controlled trials and cohort studies [Lu and Tilley (2001)].

In any case, all five equations quantify the discrimination information provided by term j for discriminating between category k and categories Y\k, with larger weights signifying larger discrimination information. The probabilities aj and bj are estimated from the training data L by maximum likelihood estimation. A Laplacian prior is used for each event for smoothing (add-one smoothing); this prevents the weight of a term from becoming infinite, a situation that would otherwise arise from division by zero when computing the weights.
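For concreteness, the following is a minimal sketch of how the smoothed probabilities and the five weighting functions of Eqs. 5.3-5.8 could be computed; the helper names (term_probs, g_rr, and so on) are illustrative assumptions, not the thesis implementation.

import math

def term_probs(docs_k, docs_rest, j, smooth=1.0):
    """Laplace-smoothed a_j = p(x_j=1|y=k) and b_j = p(x_j=1|y=Y\\k),
    estimated from binary term-occurrence vectors."""
    a = (sum(d[j] for d in docs_k) + smooth) / (len(docs_k) + 2 * smooth)
    b = (sum(d[j] for d in docs_rest) + smooth) / (len(docs_rest) + 2 * smooth)
    return a, b

def g_rr(a, b):                      # relative risk, Eq. 5.3
    return max(a, b) / min(a, b)

def g_lrr(a, b):                     # log-relative risk, Eq. 5.4
    return abs(math.log(a / b))

def g_or(a, b):                      # odds ratio, Eq. 5.5
    oa, ob = a / (1 - a), b / (1 - b)
    return max(oa, ob) / min(oa, ob)

def g_lor(a, b):                     # log-odds ratio, Eq. 5.6
    return abs(math.log((a / (1 - a)) / (b / (1 - b))))

def g_kl(a, b):                      # KL divergence, Eqs. 5.7-5.8
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))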

5.3.2 Term Space Partitioning and Term Selection

Terms are partitioned into two sets: the first, identified by the index set Z^k, contains the terms for which aj > bj, while the other, Z^{Y\k}, contains the remaining terms. All terms j ∈ Z^k provide evidence for category k over the rest, and this evidence is quantified by their weights in the form of discrimination information. Previously the sets Z^k and Z^{Y\k} corresponded to the spam and non-spam words, respectively; here Z^k corresponds to the terms of the kth class, while the terms in Z^{Y\k} correspond to the significant words of all the other classes, i.e. words that do not represent the kth class. Also, we have to learn |Y| - 1 such pairs of partitioned sets.

Our technique provides a natural way of selecting highly discriminating and relevant terms. A term j is selected as relevant for the k versus Y\k classification problem only if |aj - bj| ≥ t, where j ∈ Z^k and t is a positive-valued threshold. For the spam filtering problem a single threshold was used to select the highly discriminating terms, but for a multi-class problem one threshold would not be enough; in fact, it could deteriorate the performance drastically. The reason is that in multi-class text classification some classes are more similar to each other than others, so a selection threshold t that discards only a few terms of a class that is very different from the rest will discard many terms of a class that is less different from the rest. To make this point clearer, we give an example from the 20 Newsgroups dataset. When learning the classifier for the category atheism versus the rest, the difference between the two document sets is smaller than for the category medical versus the rest, because in the former case the "rest" set contains the categories of religion and christianity, while in the latter case the "rest" set does not have any category remotely related to medicine. Due to this phenomenon, the difference between the two probabilities of a term is going to be larger on average in the latter case than in the former; therefore the threshold for the former should be smaller.

Using a separate threshold for each classifier leads to |Y| - 1 thresholds. These thresholds can be obtained either by performing cross validation over the training data or by simply choosing manual values that are an acceptable tradeoff between performance and efficiency; they can be avoided altogether by using a value of zero if time efficiency is not a concern. In all our experiments we have used a threshold of zero, except where we explicitly demonstrate the effect of thresholding and mention the values chosen. In contrast to the thresholding on the user inboxes (in the previous chapter), this is a supervised setting and a more direct approach to term selection. Theoretically the performance should be best when all of the features are selected, i.e. t = 0, but in the results sections we will see that sometimes the accuracy improves with a reduced feature set. Effective term selection is important for creating lightweight filters for large-scale service-side text classifiers.
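A small sketch of the partitioning and threshold-based selection just described, under the assumption that the smoothed probabilities have already been computed (the function name is hypothetical):

def partition_and_select(a, b, t=0.0):
    """Split term indices into Z_k (a_j > b_j) and Z_rest, keeping only
    terms with |a_j - b_j| >= t. a and b are the per-term smoothed
    probabilities for class k and for the remaining classes."""
    Z_k, Z_rest = [], []
    for j, (aj, bj) in enumerate(zip(a, b)):
        if abs(aj - bj) < t:
            continue                 # weakly discriminating term, dropped
        (Z_k if aj > bj else Z_rest).append(j)
    return Z_k, Z_rest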

5.3.3 Linear Opinion Pool and Linear Discrimination in Feature Space

The two-set partitioning of the term space, i.e. Z^k and Z^{Y\k}, is used to form a new two-dimensional feature space. Each term j in the document x expresses an opinion regarding the document's categorization. This opinion is captured by the discriminative term weights w_j^k and w_j^{Y\k}. The terms in the set Z^k give their opinion regarding the document's membership in category k, while the terms in the set Z^{Y\k} give their opinion regarding the document's membership in category Y\k. The aggregated opinion of all these terms is obtained as the linear combination of the individual opinions:

\[
\mathrm{Score}^{k}(\mathbf{x}) = \sum_{j \in Z^{k}} \frac{x_j \, w_j^{k}}{\sum_{j} x_j} \tag{5.9}
\]

This equation follows from the linear opinion pool or ensemble average method that was used in Sect. 4.4.1; it is a statistical technique for combining experts' opinions [Jacobs (1995), Alpaydin (2004)]. Each opinion (w_j^k) is weighted by the normalized term occurrence (x_j / Σ_j x_j) and all weighted

opinions are summed, yielding an aggregated discrimination score for category k, Score^k(x), of the document. If a term i does not occur in the document (i.e. xi = 0) then it does not contribute to the pool. Also, terms that do not belong to the set Z^k do not contribute to the pool. Similarly, an aggregated discrimination score can be computed for all terms j ∈ Z^{Y\k} as

\[
\mathrm{Score}^{Y \setminus k}(\mathbf{x}) = \sum_{j \in Z^{Y \setminus k}} \frac{x_j \, w_j^{Y \setminus k}}{\sum_{j} x_j}. \tag{5.10}
\]

These two scores define the two-dimensional feature space and correspond to the Score+(x) and Score-(x) of the previous chapter. A linear discriminant function in this new space is then given as

\[
f^{k}(\mathbf{x}) = \alpha^{k} \cdot \mathrm{Score}^{k}(\mathbf{x}) - \mathrm{Score}^{Y \setminus k}(\mathbf{x}) + \alpha_0 \tag{5.11}
\]

where α^k and α_0 are the slope and bias parameters, respectively. The discriminative model parameters are learned by minimizing the classification error over the labeled training set L; this represents a straightforward optimization problem that can be solved by any iterative optimization technique [Luenberger (1984)]. The discriminating line is defined by f^k(·) = 0; if f^k(·) > 0 then the document x is likely to belong to category k (Fig. 5.2). For a |Y|-category classification problem, we learn |Y| - 1 such discriminant functions, each with two parameters. In practice, however, setting the bias parameter to zero often yields better results, leaving only the slope parameter to be learned. In accordance with the conclusion of the discussion in Sect. 5.3, the document is assigned the category corresponding to the classifier with the highest score. This avoids ties and conflicts between the |Y| - 1 classifiers; the overall multi-class classifier function of DTWC is therefore defined as

\[
\Phi(\mathbf{x}) = \operatorname*{argmax}_{k} f^{k}(\mathbf{x}). \tag{5.12}
\]

DTWC derives its strength from the wide range of weighting methods (only 5 of which are discussed here) for quantifying discrimination information, from discrimination information pooling to form a two-dimensional feature space, and from a simple linear discriminative model for classification. These characteristics make DTWC efficient, in terms of both time and space, and robust to noise and changing data distributions.


Figure 5.2: The two-dimensional feature space and the linear discriminant function for a spam classification problem

The multi-class algorithm of DTWC is given in Algorithm 4.

5.4 Evaluation Setup

Here we compare results on six commonly used text classification datasets: three from general text classification, namely 20 Newsgroups, Movie Review and SRAA, and the three spam filtering datasets that were used in Chapter 4, namely the ECML-A, ECUE and PU e-mail datasets, whose results we replicate here for convenience of comparison with the other three datasets. These datasets are not only from varied domains but also have different underlying characteristics: some have a distribution shift between training and test data while others don't, some are two-class problems while others are multi-class, some have small amounts of training data while others have large amounts, and so on. We contrast the performance of DTWC on these datasets with four benchmark text classifiers: naive Bayes (NB), balanced winnow (BW), maximum entropy (ME), and support vector machine (SVM). DTWC's performance with the relative risk (DTWC-RR), log-relative risk (DTWC-LRR), odds (DTWC-OR), log-odds (DTWC-LOR), and KL divergence (DTWC-KL) discriminative term weighting strategies is reported. For naive Bayes, maximum entropy, and balanced winnow we use the implementations provided by the mallet toolkit [McCallum (2002)]; for SVM, we use the implementation provided by SVMlight [Joachims (1999a)]. We report the classification accuracy for all the datasets, and for the Movie and SRAA datasets the mean and standard deviation of classification accuracy calculated over 5 runs of the algorithms.


Algorithm 4 DTWC
Input: set of labeled documents L, set of unlabeled documents U
Output: labels for documents in U

On training data L:
  for k = 1 to |Y| - 1 do
    for j = 1 to |T| do
      compute w_j^k and w_j^{Y\k} (Eq. 5.3, 5.4, 5.5, 5.6 or 5.7)
    end for
    compute Score^k(x) and Score^{Y\k}(x) (Eqs. 5.9 and 5.10)
    learn parameters α^k and α_0
  end for

On test data U:
  for k = 1 to |Y| - 1 do
    compute Score^k(x) and Score^{Y\k}(x) (Eqs. 5.9 and 5.10)
    compute f^k(x) (Eq. 5.11)
  end for
  output k = argmax_k f^k(x) (Eq. 5.12)
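The following is a compact Python sketch of Algorithm 4 under some simplifying assumptions: binary term vectors, log-relative-risk weights (Eq. 5.4), the bias α_0 fixed at zero, the slope fitted by a simple grid search, and one discriminant per class rather than the |Y| - 1 of the thesis. It is illustrative only, not the thesis implementation.

import numpy as np

class DTWC:
    """Minimal sketch of Algorithm 4 (illustrative assumptions only)."""

    def __init__(self, n_classes, t=0.0):
        self.K, self.t, self.models = n_classes, t, {}

    def _pool(self, X, w, Z):
        # Linear opinion pool (Eqs. 5.9/5.10) for every document in X.
        n = np.maximum(X.sum(axis=1), 1)
        return (X[:, Z] * w[Z]).sum(axis=1) / n

    def fit(self, X, y):
        for k in range(self.K):
            a = (X[y == k].sum(0) + 1.0) / ((y == k).sum() + 2.0)  # p(x_j=1|k)
            b = (X[y != k].sum(0) + 1.0) / ((y != k).sum() + 2.0)  # p(x_j=1|Y\k)
            w = np.abs(np.log(a / b))                  # LRR weights (Eq. 5.4)
            keep = np.abs(a - b) >= self.t             # term selection
            Zk = np.flatnonzero((a > b) & keep)        # term space partitioning
            Zr = np.flatnonzero((a <= b) & keep)
            sk, sr = self._pool(X, w, Zk), self._pool(X, w, Zr)
            grid = np.linspace(0.05, 20.0, 400)        # slope via grid search
            errs = [np.mean(((al * sk - sr) > 0) != (y == k)) for al in grid]
            self.models[k] = (w, Zk, Zr, grid[int(np.argmin(errs))])

    def predict(self, X):
        F = np.column_stack([al * self._pool(X, w, Zk) - self._pool(X, w, Zr)
                             for (w, Zk, Zr, al) in self.models.values()])
        return F.argmax(axis=1)                        # Eqs. 5.11-5.12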

5.4.1 Datasets

In all six datasets, documents are represented as bags of words/terms. We convert them to formats in which documents are represented by term frequency vectors and term occurrence vectors. Where applicable, stop words, HTML tags, and message headers are removed from the datasets. The details of the ECML-A, PU and ECUE datasets have been discussed in Chapter 4.


The 20 Newsgroups (20 NG) dataset collected by Ken Lang [Lang (2006)] is a very popular dataset for text classification in the data mining and machine learning community. It is a collection of about 20,000 newsgroup documents from around 20 newsgroups, each corresponding to a different topic. Some of these topics are related and can be loosely grouped into six broader topic categories, e.g. autos and motorcycles are related topics. There are three different versions of the 20 Newsgroups dataset available1: the first is the original dataset with 19,997 documents; the second, called the "bydate" version, is sorted by date into training (60%) and test (40%) sets with duplicates and header information removed; the third, called "18828", does not include duplicates but includes the "From" and "Subject" headers. We chose the "bydate" version for three reasons: first, the duplicates are removed; second, newsgroup-identifying information (header information) is left out; and lastly, there is no randomness in the selection of the training and test sets, which makes it more realistic.

The movie review dataset2, henceforth identified as the Movie dataset, captures the sentiment classification problem in which movie reviews from IMDB (Internet Movie Database) are labeled as either positive or negative (2 categories). It consists of 2000 positive and 2000 negative reviews. We remove the stop words/terms using the mallet toolkit [McCallum (2002)]. We hold out 400 examples of each class for testing and randomly select different numbers of examples for training.

The SRAA (Simulated/Real/Aviation/Auto) dataset3 is a collection of 73,218 documents from four newsgroups (simulated-aviation, simulated-auto, real-aviation, and real-auto), representing a 4-category classification problem. We remove the HTML headers and the stop words using the mallet toolkit. We hold out 1000 examples of each class for testing and randomly select different numbers of examples for training.

1 http://people.csail.mit.edu/jrennie/20Newsgroups/
2 http://www.cs.cornell.edu/people/pabo/movie-review-data
3 http://www.cs.umass.edu/~mccallum/code-data.html


5.5 Results and Discussion

5.5.1 Classifier Performance

Tables 5.2, 5.3 and 5.5 show the classification accuracies of DTWC, naive Bayes (NB), maximum entropy (ME), balanced winnow (BW), and SVM on the Movie, SRAA and Spam datasets, respectively, while Table 5.4 shows the results for the 20 NG, ECUE and PU datasets. The results for DTWC with the relative risk, log-relative risk, odds, log-odds, and KL divergence discriminative term weighting strategies are identified by DTWC-RR, DTWC-LRR, DTWC-OR, DTWC-LOR and DTWC-KL, respectively. Since the tables were becoming too wide, we omit the results of DTWC-OR and DTWC-LOR from these tables and compare them with DTWC-RR, DTWC-LRR and DTWC-KL in Fig. 5.3; a more detailed comparison of the five weighting measures is presented in Sect. 6.4.1. For the Movie and SRAA datasets, we give the mean and standard deviation of the classification accuracies over five runs of the classifiers, with each run using randomly chosen examples for training and testing; for the rest of the datasets, we give classification accuracies for each user inbox. DTWC performs exceptionally well on datasets that have a distribution shift between the training and test data, such as the ECML dataset. The distributions of the training and test sets are similar for the Movie and SRAA datasets; on these datasets DTWC also outperforms all the other algorithms, although by a smaller margin than for the ECML dataset. DTWC-KL is the best performer on the Movie dataset, while DTWC-RR and its log variant are the best performers on the rest of the datasets. The results obtained by NB, ME, and SVM are comparable to those reported in Druck et al. (2007), McCallum et al. (2006).

Table 5.2: Accuracy results for the Movie dataset. Means plus/minus standard deviations are computed from 5 runs with randomly drawn training sets of the sizes specified in the first column and randomly selected test sets of size 800.

Ex.  | DTWC-RR      | DTWC-LRR     | DTWC-KL      | NB           | ME           | BW           | SVM
600  | 80.90 ± 1.13 | 81.35 ± 0.83 | 82.32 ± 0.83 | 79.25 ± 1.15 | 82.14 ± 0.50 | 78.89 ± 0.86 | 81.85 ± 0.91
500  | 82.47 ± 1.50 | 82.40 ± 1.21 | 83.24 ± 1.14 | 80.74 ± 0.37 | 81.32 ± 0.50 | 77.92 ± 2.31 | 81.35 ± 1.78
400  | 79.84 ± 1.94 | 81.42 ± 0.80 | 81.52 ± 0.79 | 79.17 ± 1.08 | 79.62 ± 1.28 | 78.34 ± 1.58 | 79.65 ± 1.07
300  | 79.64 ± 1.00 | 80.07 ± 0.91 | 82.27 ± 1.36 | 77.57 ± 1.01 | 77.97 ± 1.56 | 76.09 ± 1.36 | 78.52 ± 1.33
200  | 78.02 ± 1.46 | 79.30 ± 1.90 | 80.87 ± 1.28 | 76.42 ± 1.57 | 76.32 ± 0.92 | 74.12 ± 1.86 | 76.10 ± 1.22
Avg  | 80.17        | 80.91        | 82.04        | 78.63        | 79.47        | 77.07        | 79.49


Table 5.3: Accuracy results for the SRAA dataset. Means plus/minus standard deviations are computed from 5 runs with randomly drawn training sets of the sizes specified in the first column and randomly selected test sets of size 4000.

Ex.  | DTWC-RR      | DTWC-LRR     | DTWC-KL      | NB           | ME           | BW           | SVM
1500 | 93.41 ± 0.30 | 91.93 ± 0.10 | 88.61 ± 0.66 | 92.72 ± 0.31 | 90.53 ± 0.58 | 88.23 ± 0.46 | 91.54 ± 0.36
1000 | 92.94 ± 0.14 | 91.14 ± 0.37 | 88.50 ± 0.37 | 92.10 ± 0.67 | 89.12 ± 0.26 | 87.54 ± 0.37 | 89.34 ± 0.30
500  | 91.26 ± 0.51 | 88.48 ± 0.61 | 87.50 ± 0.81 | 90.59 ± 0.67 | 86.75 ± 0.60 | 85.01 ± 0.82 | 86.73 ± 1.36
250  | 88.88 ± 0.42 | 83.60 ± 1.41 | 85.52 ± 0.72 | 88.05 ± 0.92 | 83.28 ± 0.17 | 81.94 ± 0.54 | 84.52 ± 0.37
150  | 86.63 ± 0.22 | 78.12 ± 1.31 | 83.74 ± 0.96 | 85.69 ± 0.69 | 81.87 ± 1.02 | 79.97 ± 1.01 | 83.58 ± 1.17
Avg  | 90.62        | 86.65        | 86.77        | 89.83        | 86.31        | 84.54        | 87.14

Table 5.4: Accuracy results for the ECUE, PU and 20 Newsgroups datasets.

Dataset | DTWC-RR | DTWC-LRR | DTWC-KL | NB    | ME    | BW    | SVM
20 NG   | 78.73   | 35.13    | 66.59   | 73.67 | 71.20 | 60.32 | 77.32
ECUE-1  | 92.20   | 91.95    | 83.05   | 50.05 | 78.30 | 83.05 | 83.30
ECUE-2  | 83.45   | 82.65    | 89.25   | 50.00 | 79.50 | 77.50 | 76.95
PU1     | 98.27   | 98.62    | 97.58   | 96.55 | 96.89 | 97.24 | 96.21
PU2     | 97.18   | 94.83    | 94.36   | 87.32 | 94.36 | 90.61 | 88.26

Table 5.5: Accuracy results for the Spam dataset. The training set and each user's inbox contain 4000 and 2500 e-mails, respectively.

Inbox   | DTWC-RR | DTWC-LRR | DTWC-KL | NB    | ME    | BW    | SVM
Inbox 1 | 91.00   | 91.12    | 79.88   | 81.24 | 62.20 | 61.00 | 64.40
Inbox 2 | 92.36   | 91.96    | 82.24   | 83.80 | 68.16 | 64.76 | 69.56
Inbox 3 | 87.52   | 88.60    | 68.88   | 87.88 | 78.92 | 73.44 | 80.24
Avg     | 90.29   | 90.56    | 77.00   | 84.30 | 69.76 | 66.40 | 71.40

Figure 5.3: Performance in terms of area under the curve (AUC) of DTWC-LOR, NB, ME, BW and SVM on the three user inboxes, ECUE-1, ECUE-2, PU1, PU2 and Movies 600.

DTWC's performance appears slightly lower than that of the multi-conditional learning reported in McCallum et al. (2006); however, their exact evaluation and dataset are not known, so a direct comparison is not possible. Even though the results of NB, ME, BW and SVM are comparable, each of them performs significantly poorly on at least one of the datasets, e.g. NB on the ECUE-1 and ECUE-2 datasets, BW on the 20 NG and ECML datasets, ME on the ECUE-1 and ECML datasets, and SVM on the ECML dataset. DTWC fares as the most consistent classifier of the lot, and we show in Chapter 6 that its performance is statistically better than that of the above mentioned classifiers. One surprising result is that DTWC-LRR performs very poorly on 20 NG as compared to its non-log variant; the same is true for DTWC-LOR. We were unable to determine the reason for this. The AUC values for the above datasets are reported in Figs. 5.3 and 5.4; neither figure shows results for the SRAA and 20 NG datasets because the AUC measure is defined for two-class problems only. Fig. 5.3 shows that in terms of AUC, DTWC consistently outperforms all the other algorithms by a big margin. Fig. 5.4 compares the performance of the aforementioned


weighting strategies. Generally DTWC-LRR and DTWC-LOR perform better than their non-logarithmic versions. The worst performer is DTWC-KL, but it catches up with the rest on the PU and Movies datasets. This characteristic of DTWC-KL can be credited to the Kullback-Leibler divergence measure penalizing the absence of a term: since the distribution shift is largest for the ECML-A dataset, many terms that occur in the training set are absent from the test set, and as a result the performance of DTWC-KL is worst on this dataset. Because of the smaller distribution shift in the ECUE dataset (only a temporal shift), DTWC-KL is not far behind the rest of the weighting schemes there, and on PU and Movies 600, which do not have this distribution shift, DTWC-KL performs on par with the rest of the schemes.

Figure 5.4: Comparison between the different weighting schemes (RR, LRR, OR, LOR and KL) in terms of percentage AUC.

In Chapter 4 we saw that PSSF significantly improved the performance over DTWC, but this observation does not hold for the SRAA, 20 NG and Movie datasets. PSSF achieves accuracies of 80.24% and 90.47% as opposed to the 80.17% and 90.62% of DTWC on the Movie and SRAA datasets, respectively, and on 20 NG the performance of PSSF deteriorated as well. This is because the training


and test datasets are randomly sampled from the same distribution; hence there is no distribution shift. For general text classification problems PSSF seems to be overkill in terms of resources and performance.

5.5.2 Parameter Estimation

DTWC uses a set of generative model parameters, the discriminative term weights, and the discriminative model parameters, a slope α^k and bias α_0 for each of the |Y| - 1 discriminant functions. The weights are computed from probabilities estimated on the labeled training set; this is a straightforward computation requiring a single pass over the training set. The discriminative model parameters are learned by minimizing the classification error over the labeled training set. This is a convex optimization problem, as empirically verified from the error versus slope parameter graph (not shown here). The bias parameter, which is usually close to zero in our evaluations, can be determined after learning the slope parameter. The optimization problems can be solved efficiently by an iterative optimization technique or by grid search.
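As an illustration, a 1-D grid search for the slope could look like the following sketch (a hypothetical helper, assuming the bias is fixed at zero); plotting the returned errors against the grid is one way to inspect the error versus slope curve mentioned above.

import numpy as np

def fit_slope(score_k, score_rest, is_k, grid=np.linspace(0.01, 20.0, 500)):
    """Search for the slope alpha minimizing the 0/1 training error of
    Eq. 5.11 with the bias fixed at zero. score_k and score_rest are the
    pooled scores of Eqs. 5.9/5.10; is_k marks documents of class k."""
    errors = [np.mean(((a * score_k - score_rest) > 0) != is_k) for a in grid]
    best = int(np.argmin(errors))
    return grid[best], errors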

5.6 Conclusion

In this chapter we explored the characteristics that distinguish general text classification problems from spam filtering. We then discussed different techniques for developing a multi-class version of DTWC and found the 1 vs All strategy with confidence outputs to be the most suitable. We also provided a formulation for incorporating different weighting measures into DTWC and evaluated its performance with five such measures: relative risk, log relative risk, odds ratio, log odds ratio, and Kullback-Leibler divergence. For these five measures we demonstrated the performance and effectiveness of DTWC on the 20 Newsgroups, Movie Review and Simulated/Real/Aviation/Auto datasets and compared the results with supervised naive Bayes, maximum entropy, balanced winnow and support vector machine classifiers. DTWC outperformed all the other techniques in all scenarios, and its performance is substantially better in situations with a distribution shift between the training and test sets. For general TC, PSSF fails to significantly improve on the performance of DTWC and just adds a performance overhead. Of the five weighting techniques evaluated, relative risk and odds ratio are the most consistent; the results suggest that the KL divergence weighting measure does not perform well on datasets with a large distribution shift. All these characteristics make DTWC suitable for all kinds of text classification problems, including problems with many classes where scalability and robustness are essential. The scalability analysis of DTWC/PSSF, significance tests of the comparisons, the generalization of a framework from the DTWC/PSSF algorithm, and the relationship of DTWC/PSSF with the LeGo framework are discussed in the next chapter.


Chapter 6

Classifier Properties and Generalizations

6.1 Introduction

In the preceding chapters we introduced DTWC/PSSF and laid its theoretical foundations, supported by empirical results. In this chapter we verify the importance of these results through statistical significance tests, so as to ascertain whether the performance gains of DTWC/PSSF are statistically significant. Furthermore, we analyze the time and space complexity of DTWC/PSSF to determine whether it is feasible for a practical personalized spam filtering solution where billions of e-mails are exchanged daily between millions of users. We then derive a framework from DTWC/PSSF and discuss how different weighting measures, feature selection, opinion consolidation and discriminative models fit into this framework to produce a variety of classifiers; we also demonstrate this empirically by showing results for five weighting measures, six different opinion consolidation techniques and four discriminative classifiers. The DTWC framework is compared and related to the LeGo (from local patterns to global models) data mining framework as well. LeGo focuses on finding individually promising (highly informative) patterns in data that are then used as features in global models of classification; the framework is broad enough to cover and leverage frequent pattern mining, multiview learning, subgroup discovery, pattern teams,


and several other popular algorithms [Knobbe et al. (2008)].

6.2 Scalability

DTWC/PSSF is highly efficient in terms of time and space requirements. It requires only a single pass over the labeled data to compute the discriminative term weights (the local model parameters), and its global model parameters are obtained by solving a straightforward optimization problem. No further references to the training data are required once the model is learned. The time taken to train the local model of DTWC is O(|L|a), where |L| is the number of documents, |T| is the size of the dictionary and a is the average length of a document, with a ≪ |T|. This complexity is linear in the size of the dataset. The discriminant model is learned on top of the local model, i.e. in a two-dimensional space; depending on which discriminant model we use, it can also be learned in linear time. This is as fast as it gets for any text classification algorithm. The time taken by DTWC/PSSF to classify a document is O(|T|), because we have to calculate the |Y| scores, which in turn are sums of the weights of O(|T|) terms that have been partitioned into |Y| mutually exclusive sets. This is faster than NB, which takes O(|Y||T|) time to classify a single mail: to calculate the class conditional probabilities, NB computes the product of the class conditional probabilities of each of the |T| terms, which takes O(|T|) time per class, and there are |Y| classes. Similarly, the local model of DTWC/PSSF requires only O(|T|) space, as compared to the O(|Y||T|) space required by the well known NB classifier. DTWC owes this advantage to its partitioning of the terms into |Y| mutually exclusive sets, whereas NB stores the probabilities of all the terms for each class. SVM tends to be quite effective for text classification, but at times its training model size becomes prohibitive; even though it can be trained in linear time [Joachims (2006)], it tends to be slower than NB [Malik et al. (2010a)]. For the 20 Newsgroups dataset, the size of the SVM model reached more than 1.5 gigabytes, with the regularization parameter converging at 2.5 million. For DTWC, in contrast, the training model has a total of 95,663 features, and for each of them we only store an id (integer), a weight (float) and the class number (integer); for the discriminative model we only store two parameters per class. If the integer and float are taken to be 4 bytes each, then the total size of our model is less than 1.1 megabytes, and still DTWC performs better than SVM.

Generally |Y| ≪ |T| and |Y| ≪ |L|, but for problems such as Wikipedia article categorization, 2.9 million articles are assigned to 1.5 million categories [Dekel and Shamir (2010)], and in problems such as targeted advertisement the number of classes (advertisements) could be asymptotically equal to the number of examples (users). In problems like personalized spam filtering, where millions of users receive more than 100 e-mails per day (both spam and non-spam), the filter size and response time are crucial; the saving of a factor of |Y| in space per user filter and of |Y| steps per classification of each e-mail is a huge saving that can save e-mail service providers millions of dollars in hardware costs.

In TC, documents are sparsely represented in a |T|-dimensional space. Approaches dependent on a document incidence matrix require an |L| × |T| matrix, which results in a huge memory and computation cost. DTWC/PSSF can be implemented over a document incidence matrix as well as over a hash table data structure, for which it takes only O(|L|a) time for training. Using the hash table, the object corresponding to each term (containing id, weight, etc.) can be retrieved and stored in constant time. The hash table is the natural choice for DTWC/PSSF because a) search and update, the most frequently performed operations, take only constant time, and b) DTWC/PSSF does not require access to terms in any specific order, e.g. sorting of terms by term ids (or weights) or finding the term with the minimum or maximum term id (or weight). This computational complexity is the best that can be achieved for text classification problems.
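The following sketch illustrates the hash-table layout assumed above: one table mapping each term id to its class and weight, so that classifying a document touches only the terms it actually contains. The layout and names are assumptions for illustration, not the thesis code.

def classify(doc_terms, table, alphas):
    """table: dict mapping term id -> (class id, weight); alphas: per-class
    slopes. One lookup per document term, so time is O(a) per document,
    and the table stores a single weight per term, i.e. O(|T|) space."""
    K = len(alphas)
    pos = [0.0] * K                      # pooled evidence for each class k
    total, n = 0.0, max(len(doc_terms), 1)
    for t in doc_terms:
        if t in table:
            c, w = table[t]
            pos[c] += w / n
            total += w / n
    # evidence for Y\k is the evidence claimed by all the other classes
    f = [alphas[k] * pos[k] - (total - pos[k]) for k in range(K)]
    return max(range(K), key=f.__getitem__)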

6.2.1 Feasibility for Service Side Personalized Spam Filtering

Robustness and scalability are essential for personalized service-side spam filtering [Kolcz et al. (2006)]. To implement PSSF on the service side for a given user, the local and the global model parameters must be resident in memory. There are only two global model parameters, the slope and bias of the line, while the local model parameters are the discriminative term weights for the significant spam and non-spam terms. The number of significant spam and non-spam terms, which defines the size of the filter, is controlled by the term selection parameter t′. Figure 6.1 shows that with a slight increase in the value of t′ (starting from 0), the number of significant terms drops sharply. This reduction in filter size does not degrade filtering performance significantly (and sometimes it even improves performance), as only the less discriminating terms are removed from the model.


Figure 6.1: Number of significant terms versus term selection parameter t′ for PSSF1/PSSF2 on the ECML-A dataset

Table 6.1 shows the term selection parameter, the average filter size (as the average number of significant terms), and the average percent accuracy and AUC values of PSSF1 and PSSF2 on the ECML-A dataset. When the average filter size is reduced by about a factor of 165 (from 24,777 to 146.66 terms), the average AUC value for PSSF1 decreases very slightly (less than 0.25%), and this value is still greater than those reported by Junejo et al. (2006), Pfahringer (2006), Gupta et al. (2006), Cormack (2006b) (see Table 4.11). Moreover, with an average filter size of only 66 terms, PSSF performs remarkably well, with average AUC values greater than 97%. It is worth pointing out that even when t′ = 0, the average number of significant terms is much less than the number of significant terms of the global model (40,516 terms). In fact, the actual number of terms in the training data is 41,675 (1,159 terms, about 2%, are automatically dropped from the global model at t′ = 0). The average filter size is directly related to the scalability of the filter: the smaller the size, the greater the number of users that can be served with the same computing resources. For example, when the average filter size is 8 terms, PSSF1 can serve approximately 30,000 users with 1 MB of memory (assuming 4 bytes per discriminative term weight). Interestingly, the average performance of PSSF1 with this filter size is similar to that of a single global NB filter (see Table 4.3); however, the size of the NB filter will be over 83,350 (41,675 per class) conditional probabilities, as compared to only 24 (8 for each of the three user inboxes) weights for PSSF1. This makes PSSF one of the most efficient and fastest algorithms for spam filtering; it is capable of producing very lightweight filters [Sculley and Cormack (2009)] at very little cost in performance. To adapt the filter to changes in distribution, the filter can be adapted continuously or updated periodically; similarly, the global model parameters can be updated periodically. Furthermore, the naive SSL approach that we use for personalization decouples the adaptation from the training data and as such can be performed on the client side in addition to the service side.
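The capacity figure quoted above follows from simple arithmetic, sketched below under the stated assumption of 4 bytes per weight.

terms_per_filter = 8
bytes_per_user = terms_per_filter * 4          # 32 bytes per personalized filter
users_per_mb = (1 << 20) // bytes_per_user     # 32768, i.e. roughly 30,000 users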

6.2.2 Term Selection for Supervised Model

In the previous subsection we looked at how the efficiency and performance of PSSF were affected by varying the threshold parameter t′, and saw that even very small personalized filters yielded good performance. In this subsection we look at the effect of the threshold t on the discriminative term model learned in the supervised setting. Figure 6.2 shows the variation of the number of significant terms with threshold t for the ECML-A dataset (using DTWC-RR). As expected, like Fig. 6.1, the number of selected terms drops sharply with only a small increase in t. For a given threshold, the training model retains more terms than a personalized filter, because the training data contain more terms to begin with. As for accuracy versus the number of terms (Fig. 6.3), the accuracy drops relatively sharply when thresholding on the training data as compared to when we decrease the number of terms in the personalized filter; in fact, in some instances the accuracy of PSSF1 increased with a slight increase in threshold. Even though the training model has 45 words (at t = 0.3) as compared to the 26 words of the personalized filter, it still lags behind both PSSF1 and PSSF2 by more than 26.5% in performance, thus making the personalized filter more robust to term selection. Table 6.2 shows this in detail.

Table 6.1: Scalability of PSSF: impact of filter size on the three test sets in the ECML-A dataset. The results are averages over the three test sets.

t′   | Terms  | PSSF1 Acc | PSSF1 AUC | PSSF2 Acc | PSSF2 AUC
0.00 | 24777  | 91.58     | 98.94     | 95.80     | 99.16
0.05 | 484    | 93.90     | 98.80     | 93.88     | 98.81
0.12 | 146.66 | 94.00     | 98.75     | 96.06     | 98.88
0.20 | 66     | 92.17     | 97.77     | 93.80     | 98.03
0.30 | 26.33  | 88.88     | 96.00     | 89.44     | 96.84
0.35 | 14.66  | 85.92     | 93.75     | 84.33     | 94.38
0.40 | 9.33   | 82.85     | 90.96     | 81.42     | 91.69
0.41 | 8      | 82.68     | 87.45     | 81.11     | 88.07

Table 6.2: Selected terms and accuracy at different values of threshold t for the Spam dataset (DTWC-RR).

Threshold | Terms | Accuracy
0         | 40516 | 90.29
0.0025    | 16666 | 88.48
0.005     | 9333  | 86.85
0.0075    | 6608  | 85.86
0.05      | 860   | 77.68
0.12      | 247   | 69.25
0.20      | 111   | 67.78
0.3       | 45    | 61.94

Figure 6.2: Number of terms selected versus threshold t for Spam dataset (DTWC-RR)

Figure 6.3: Average accuracy versus threshold t for Spam dataset (DTWC-RR)

6.3 Statistical Significance Testing

Mere reporting of the absolute differences in accuracy or AUC values is not sufficient for an empirical comparison of classifiers. The hypothesis that our algorithm performs significantly better than the other techniques (SVM, NB, ME, BW and their semi-supervised versions on the datasets in Chapters 4 and 5) requires statistical verification: statistical tests that quantify the consistency of the observed differences should be performed so as to ascertain that the results achieved are not due to mere chance. Such significance tests are used in a variety of ways to compare the performance of two or more classifiers. In the literature there are many settings and tests used to achieve this task; in this section we mention the most relevant ones and weigh their trade-offs, so as to select tests based on their statistical soundness, their acceptance within the machine learning community, and their applicability to our problem setting.

6.3.1 Multiple Comparison on Single Datasets vs. Comparison on Multiple Datasets

We want to compare two or more classifiers that have been run on more than one dataset and were evaluated using classification accuracy and AUC. Due to the specific nature of our datasets (except the PU dataset), the training and test sets are almost fixed; random sampling of documents would jeopardize the element of distribution shift present across the training and test sets. We therefore do not record the variance (or deviation) of these results over multiple samples, and hence assume nothing about the sampling scheme; we only assume that the provided results are reliable estimates of the algorithms' performance on each dataset. Furthermore, we avoid comparison over multiple samplings of the datasets because, when testing on a single dataset, usually the mean performance and its variance over repeated training and testing on random samples of examples are computed; since these samples are usually related, tests tend to give biased estimates of variance, which may lead to elevated false positives. Moreover, if we compared the classifiers on an individual-dataset basis by drawing multiple samples, then with m classifiers and k datasets there would be k · m(m - 1)/2 pairwise comparisons, making things very complex. Therefore we need to perform tests that compare the algorithms across multiple datasets. Running the classifiers on multiple datasets naturally gives a sample of independent measurements; in this setting, the source of variance is the difference in performance over independent datasets and not over dependent samples (as in the case of sampling). These considerations make comparisons over multiple datasets more meaningful, reliable and simpler than classifier comparisons on a single dataset or on multiple samples of multiple datasets. Furthermore, the discussion in this section draws heavily from Demsar (2006), an excellent and well cited work that is very relevant to our aim in this section. Formally, we are testing 10 learning algorithms on 44 training-testing dataset pairs based on two performance measures, namely accuracy and AUC (44 results for accuracy and 44 for AUC). If k is the number of classifiers and N is the total number of results (in this case 44), then let c_i^j be the performance score of the j-th algorithm on the i-th dataset. Our aim is to decide which algorithms are statistically different based on the values c_i^j.

6.3.2 Comparisons of Two Classifiers

Two types of tests are available in the statistical literature: ones which can only compare two classifiers at a time, and others which can compare multiple classifiers simultaneously. While the former tests tell which classifier is statistically better, the latter only indicate whether there is at least one classifier that is statistically different, and hence require a post-hoc analysis to indicate which classifier (or classifiers) differ. In this subsection we mention 4 different techniques discussed by Demsar (2006) that are frequently used to compare the performance of two classifiers.

Averaging Over Datasets: Averaging the performance of classifiers over several datasets has not found much acceptance in the machine learning community. Firstly, performance over datasets from different domains might not be comparable. Secondly, averages are quite susceptible to outliers, so a poor performance on one dataset could overshadow fair performance on many other datasets (or vice versa).

Paired T-Test: This test checks whether the average difference between the performance of two classifiers over multiple datasets is significantly different from zero. If $d_i = c_i^1 - c_i^2$ is the difference between the performances of the two classifiers on the i-th dataset, then the t statistic is computed as $\bar{d}/\sigma_{\bar{d}}$ and is distributed according to the Student's t distribution with N - 1 degrees of freedom, where $\bar{d}$ and $\sigma_{\bar{d}}$ are the average difference and its standard error. The t-test is not suitable for our problem setting for three reasons: a) it requires the $d_i$ to follow a normal distribution, which in our case is violated, as can be seen from Fig. 6.4; b) it is similar to averaging over datasets, because outliers skew the test statistic and decrease the test's power by increasing the estimated standard error; and c) lastly, it suffers from the problem of commensurability, i.e. its numerator is nothing but the averaged difference between the two classifiers (similar to averaging).

Figure 6.4: Histogram of the differences between the percentage accuracy of the DTWC/PSSF and ME/ME-SSL classifiers on the 44 training testing pair sets. The mean of the distribution is 9.16, with standard deviation of 7.28.

Wilcoxon Signed Ranks Test: This is a non-parametric alternative to the paired t-test that ranks the absolute differences between the performances of two classifiers on each dataset and then compares the rank sums of the positive and the negative differences. The $|d_i|$ are sorted in ascending order, with the smallest difference getting a rank of 1 and the largest difference getting the highest rank of N (the number of datasets). Then R is chosen as the smaller of the two rank sums, for $d_i > 0$ and for $d_i < 0$ (the ranks of $d_i = 0$ are split evenly between the two sums). The performance of the classifiers is then said to be statistically different at p = 0.05 if z is smaller than -1.96, where

\[
z = \frac{R - \frac{1}{4}N(N+1)}{\sqrt{\frac{1}{24}N(N+1)(2N+1)}} \tag{6.1}
\]

The Wilcoxon signed ranks test is more powerful than the t-test when the assumptions of the latter are not met (as in our case). Unlike the t-test, it does not assume any distribution for the $d_i$. Even though greater differences count for more in the Wilcoxon signed ranks test, unlike the t-test it is less sensitive to outliers, because the absolute magnitudes of the differences are ignored and only their ranks matter.

Sign Test: Oftentimes researchers compare classifiers by counting the number of times an algorithm wins. Sometimes a table of pairwise comparisons of multiple classifiers is also presented, as a matrix showing the number of times one classifier of each pair was outperformed by the other. These counts can be used in inferential statistics with a form of binomial test known as the sign test. Two classifiers are said to be equivalent if each wins N/2 times. For a sizeable N, the number of wins is distributed according to the normal distribution N(N/2, √N/2), and a classifier is significantly better with p < 0.05 if it wins at least N/2 + 1.96·√N/2 times out of N. This test does not take into account the magnitude of the difference in performance scores: whether a classifier outperforms the other by, say, 30% accuracy or by only 1%, it is considered just a single win. Unlike the t-test, and like the Wilcoxon signed ranks test, it does not assume a normal distribution, but it is much weaker than the Wilcoxon signed ranks test, because a classifier has to win almost always to be considered significantly better, especially when N is small.
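For reference, both two-classifier tests are available in SciPy; the sketch below uses hypothetical placeholder accuracy arrays, not the thesis results.

import numpy as np
from scipy.stats import wilcoxon, binomtest

acc_a = np.array([91.0, 92.4, 87.5, 98.3, 97.2])   # placeholder per-dataset
acc_b = np.array([81.2, 83.8, 87.9, 96.6, 87.3])   # accuracies of two classifiers

stat, p = wilcoxon(acc_a, acc_b)                   # Wilcoxon signed ranks test
print(f"Wilcoxon: W = {stat}, p = {p:.4f}")

wins = int((acc_a > acc_b).sum())                  # sign test on win counts
n = int((acc_a != acc_b).sum())                    # ties are dropped
print(f"sign test: p = {binomtest(wins, n, 0.5).pvalue:.4f}")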

6.3.3 Multiple Classifier Comparison

The above mentioned tests are for comparing two classifiers and are not designed for comparing multiple classifiers (multiple hypothesis testing). That is because when many comparisons are made using such tests, a certain proportion of the null hypotheses is rejected due to random chance alone, a phenomenon referred to as the family-wise error: the probability of making at least one false positive in any of the comparisons. The two well known tests for comparing multiple classifiers are ANOVA (analysis of variance) and its non-parametric counterpart, the Friedman test; both test whether the observed differences are just a random phenomenon or not. If the classifiers are found to be different, then a post-hoc analysis is required to find which classifiers actually differ.

ANOVA: This is the most common statistical method for comparing more than two related sample means. It divides the total variability into three parts: the variability between the datasets, the variability between the classifiers, and the residual (or error) variability. If the residual variability is significantly smaller than the variability between the classifiers, then we can conclude that there are differences between the classifiers that are beyond random chance or error. Like the t-test, ANOVA suffers from drawbacks that make it inapplicable to our setting. Firstly, it assumes that the samples are drawn from a normal distribution, which is not guaranteed in our case. Secondly, ANOVA assumes sphericity, which requires all the samples to have equal variance. This property too is violated in our situation: for the accuracy measure the variance of the DTWC/PSSF sample is 76.34 while for balanced winnow it is 398.91, and for the AUC measure the variance ranges from 43.97 for ME to 448.61 for balanced winnow.

Friedman test: The Friedman test is the non-parametric equivalent of the repeated measures ANOVA. Unlike ANOVA, and like the Wilcoxon test, it does not assume a normal distribution for the samples, and it is more powerful than repeated measures ANOVA when the assumptions of the latter are not met, which is the case here. Like the Wilcoxon test it is based on ranking, but the ranking is done quite differently: it ranks the algorithms on each dataset and then compares their average ranks. The best performing algorithm on a dataset gets the rank of 1, the second best a rank of 2, and so on; in our case the ranks thus vary from 1 to 5. Subsequently the average ranks are computed and compared through the Friedman statistic. A complete explanation of the Friedman statistic [Friedman (1937; 1940)], its variations [Iman and Davenport (1980)] and the calculation of the critical values [Zar (1998), Sheskin (2000)] is out of the scope of this thesis.
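The Friedman test is likewise available in SciPy, as the following sketch with placeholder values illustrates.

from scipy.stats import friedmanchisquare

# one list of per-dataset accuracies per classifier (hypothetical values)
dtwc = [91.0, 92.4, 87.5, 98.3, 97.2]
nb   = [81.2, 83.8, 87.9, 96.6, 87.3]
svm  = [64.4, 69.6, 80.2, 96.2, 88.3]

stat, p = friedmanchisquare(dtwc, nb, svm)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
# a small p only says that *some* classifier differs; a post-hoc test
# must then identify which ones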

6.3.4 Post-Hoc Analysis

Both the ANOVA and the Friedman tests only tell us whether there is a significant difference between the classifier performances; they do not tell which classifiers actually differ. For this purpose, post-hoc analyses are performed. Of the many such tests for ANOVA, the two most relevant to our problem setting are the Tukey [Tukey (1949)] and Dunnett [Dunnett (1980)] tests: the Tukey test compares all classifiers with each other, while the Dunnett test compares all classifiers with a base or control classifier (DTWC/PSSF). There are different post-hoc analyses for the Friedman test, with the Nemenyi test [Nemenyi (1963)] as the most famous one; it is similar to the Tukey test for ANOVA and is used when all classifiers are compared to each other. When all classifiers are compared with a control classifier (DTWC/PSSF), we can use one of the general procedures for controlling the family-wise error in multiple hypothesis testing, such as the Bonferroni correction. Although the Bonferroni correction and similar procedures are generally weaker than the Nemenyi test, in our setting it is the other way around, because the latter adjusts the critical value for k(k - 1)/2 comparisons while the former makes only k - 1 comparisons. Campbell and Skillings (1985) proposed a step-down procedure that starts with the overall hypothesis that all k classifiers are similar; if this hypothesis is rejected, it considers the sub-hypotheses involving k - 1 classifiers, continuing until a hypothesis involves only two classifiers or no hypotheses are rejected. This procedure results in a sequence of subsets of classifiers with homogeneous characteristics, i.e. within each subset the classifiers are not statistically different. A detailed description and comparison of post-hoc analyses is out of the scope of this thesis; interested readers are referred to Demsar (2006).

From the above relevant tests we select two: the Wilcoxon signed ranks test for two-classifier comparison, and the Friedman test followed by the step-down approach of Campbell and Skillings (1985) as the post-hoc analysis procedure for multiple classifier comparison. We select them mostly because of their non-parametric nature, which does not assume that the classifier scores or their differences follow any distribution, and because they are less sensitive to outliers. Demsar (2006) endorses these choices: ".. we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple datasets". We use the widely used IBM SPSS 19 tool for performing these statistical tests.

6.3.5 Significance Test Results

In our analysis, we perform two Wilcoxon signed ranks tests (one each for accuracy and AUC values) comparing the differences in performance of DTWC/PSSF with each of the other classifiers evaluated for global and personalized filtering, respectively. We find that DTWC/PSSF's performance is significantly different (better) than the others on both accuracy and AUC values at the 0.05 confidence level. In fact, except for the significance level (or p-value) of 0.025 obtained when DTWC/PSSF is compared with ME/ME-SSL on AUC values, all other p-values are less than 0.001. This result verifies that the difference in performance observed between our algorithms and the others on both global and personalized filtering is statistically significant.

We also perform two Friedman tests (one each for accuracy and AUC values) comparing 10 classifiers (DTWC, NB, ME, BW, SVM, PSSF, NB-SSL, ME-SSL, BW-SSL, SVM-SSL) on 22 datasets (3 ECML-A, 15 ECML-B, 2 PU, and 2 ECUE datasets). The Friedman test for both the accuracy and the AUC measure indicates that there is a classifier that is statistically different. To determine which classifiers differ, we performed the stepwise step-down post-hoc analysis provided by the SPSS tool. Tables 6.3 and 6.4 show the homogeneous subsets view of the results of this test for the accuracy and AUC values, respectively. Each row in the Classifier group corresponds to a related sample (a classifier's output as either accuracy or AUC). Classifiers that are not statistically (significantly) different are grouped into the same subset, and each column corresponds to a different subset. If all samples are statistically different, then there is a separate subset for each sample; when none of the samples are statistically different, they are all grouped into a single subset. Each value in the Classifier group is the average rank of that classifier: the greater the value, the better the classifier's performance. The classifiers are therefore sorted from top to bottom with the classifier with the lowest average rank at the top, and the subsets are sorted from left to right with the subset on the right having an equal or better average rank than the subset on its left; hence the classifiers at the bottom right are the significantly better classifiers. The results in Tables 6.3 and 6.4 show that DTWC's performance in accuracy is statistically different from all other classifiers evaluated for global filtering, as it is placed in a subset that contains none of the other global filters; in fact, because of its superior performance, it is grouped with the personal filters of BW-SSL and ME-SSL. When performance is measured using AUC values, DTWC is placed in the same subset as ME, although it still maintains a higher average rank. For personalized filtering, PSSF is grouped with BW-SSL and ME-SSL in both tables. Nonetheless, PSSF maintains a


Table 6.3: Post-hoc analysis of Friedman's test for the accuracy measure. Homogeneous subsets are based on asymptotic differences. The significance level is 0.05. Each cell shows the classifier's average rank.

                              Subsets
Classifier               1       2       3       4       5
BW                     2.09
SVM                    3.31    3.31
SVM-SSL                        3.31
NB-SSL                         4.04    4.04
NB                             4.31    4.31
ME                                     5.04
DTWC                                           7.40
BW-SSL                                         8.00    8.00
ME-SSL                                         8.36    8.36
PSSF                                                   9.09
Test Statistic         6.54    4.80    3.90    5.54    2.81
Sig. (2-sided)         0.01    0.18    0.14    0.06    0.24
Adjusted Sig. (2-sided) 0.05   0.40    0.39    0.19    0.60

Table 6.4: Post-hoc analysis of Friedman's test for the AUC measure. Homogeneous subsets are based on asymptotic differences. The significance level is 0.05. Each cell shows the classifier's average rank.

                              Subsets
Classifier               1       2       3       4       5       6
BW                     2.04
NB-SSL                 3.02    3.02
NB                     3.29    3.29
SVM-SSL                        3.77    3.77
SVM                                    4.86
ME                                             6.40
DTWC                                           7.22    7.22
BW-SSL                                         7.31    7.31    7.31
ME-SSL                                                 7.90    7.90
PSSF                                                           9.13
Test Statistic         4.93    3.43    4.54    4.72    4.45    6.81
Sig. (2-sided)         0.08    0.18    0.03    0.09    0.10    0.03
Adjusted Sig. (2-sided) 0.25   0.48    0.15    0.28    0.31    0.10


Nonetheless, PSSF maintains a higher average rank, and the significance level of this subset under AUC is not very high (0.10). This could be the result of the low power of the post-hoc test; it often happens that the Friedman test suggests that the classifiers are different but a following post-hoc analysis fails to identify the difference because of its overly conservative nature [Salzberg (1997)]. Moreover, PSSF appears in only one subset, unlike the other two classifiers. Note that the differences in the average ranks of DTWC and PSSF, of 1.69 and 1.91 for accuracy and AUC respectively, emphasize the benefit of personalization. These statistical analyses confirm the overall superiority of our algorithms on global and personalized spam filtering. They also validate the robustness of our algorithms across different spam filtering settings.

6.4 DTWC as a Framework

From the techniques described in Chapters 4 and 5 emerges a framework for a class of algorithms. The framework is rich and diverse in terms of the different types of feature weighting, feature selection, and discriminative models that can be incorporated to match the needs of the problem scenario. We do not consider preprocessing a part of our framework, as it depends on the dataset; e.g. some data might benefit from stemming, while other data may not. The proposed framework consists of the following four major steps:

1. Feature weighting
2. Feature selection and partitioning
3. Opinion consolidation
4. Discriminative model

6.4.1 Feature Weighting Schemes

We compared the results of five feature weighting schemes (RR, log-RR, OR, log-OR and KL-divergence) in Chapter 5, but they were compared as part of DTWC. Therefore we compare them here on the basis of the local model alone (Eq. 4.9) to determine which weighting measure captures the discriminative information more effectively.

Table 6.5: Classification results for the local model. Values are percentage accuracies; the winning performance for each dataset is marked with an asterisk.

Dataset        RR       LRR      OR       LOR      KL
ECML-A       64.64    60.52    64.53    60.36    66.21*
ECUE-1       50.00    50.00    50.00    50.00    50.00
ECUE-2       50.05    50.10*   50.05    50.00    50.00
PU1          97.93*   95.86    97.93*   96.89    83.10
PU2          82.62    80.28    83.09*   81.22    81.69
Movies 600   68.83    67.76    68.94*   67.98    66.38

The results of this local model for the five weighting measures are presented in Table 6.5. Even though there is no clear winner in this table, a few points can easily be inferred. Firstly, the logarithmic variants (LRR and LOR) on average fare worse than their non-logarithmic counterparts. Secondly, there is no significant difference between the performance of OR and RR (the same holds for LOR and LRR). Thirdly, KL is on average about 2.5% behind RR and OR. Other feature weighting schemes, such as information gain, chi-square, or conditional probabilities, can also serve as good weighting measures. There is a vast number of statistical divergence measures that could be used in our framework, e.g. information theoretic measures such as the Hellinger distance, total variation distance, Renyi's divergence, and Jensen-Shannon divergence. Probability based measures such as the Bhattacharyya distance, the Levy-Prokhorov metric, and the Wasserstein metric (also known as the earth mover's distance) might capture the discriminative information more effectively for text classification problems. Approaches common in the computer vision literature, such as the Mahalanobis distance, can be used here as well. Each of these weighting measures has distinct properties that can significantly impact the classifier outcome. Only a few of these measures have been explored in the literature for feature selection, let alone feature weighting and feature transformation. A complete discussion of these weighting measures, their properties, and their impact on classification is out of the scope of this thesis and is left as future work.
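
To make the weighting step concrete, the sketch below computes the five weights from the class conditional probabilities a_j = p(x_j = 1 | y = k) and b_j = p(x_j = 1 | y = Y\k); the function name and the exact form of the KL variant are our own illustrative assumptions, not a definitive transcription of Chapter 5.

    import numpy as np

    def term_weights(a, b, scheme="RR", eps=1e-6):
        # a[j] = p(x_j = 1 | y = k), b[j] = p(x_j = 1 | y = Y\k)
        a = np.clip(np.asarray(a, float), eps, 1 - eps)
        b = np.clip(np.asarray(b, float), eps, 1 - eps)
        if scheme == "RR":                        # relative risk
            return a / b
        if scheme == "LRR":                       # log relative risk
            return np.log(a / b)
        if scheme == "OR":                        # odds ratio
            return (a / (1 - a)) / (b / (1 - b))
        if scheme == "LOR":                       # log odds ratio
            return np.log((a / (1 - a)) / (b / (1 - b)))
        if scheme == "KL":                        # assumed per-term KL contribution
            return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))
        raise ValueError(f"unknown scheme: {scheme}")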

6.4.2 Feature Selection and Partitioning

As discussed in Sect. 4.4.1 and 5.3.2, we partition the features into two sets of terms (Z^k and Z^{Y\k}) based on thresholding their class conditional probabilities. Thus, this feature selection is inherently built into DTWC/PSSF and no further weight computation is required. As a consequence, each word can give its opinion for one class only. If we want each word to give its opinion for every class, then we have to relax the partitioning restriction on the feature space. This can be done by computing two weights for each word, i.e. one for each class, as a result of which Eq. 5.2 will become

w_{1j}^k = f_1(j)^k   and   w_{2j}^k = f_2(j)^k                                  (6.2)

and Eq. 5.3 will become

f_1(j)^k = a_j / b_j   and   f_2(j)^k = b_j / a_j,   ∀j                          (6.3)

where a_j = p(x_j = 1 | y = k) and b_j = p(x_j = 1 | y = Y\k) (refer to Sect. 5.3). Similarly, Eq. 5.4, 5.5, 5.6 and 5.7 will also change. As a result, the sets Z^k and Z^{Y\k} will contain the same terms but with different weights, according to f_1(j)^k and f_2(j)^k, respectively. No further changes are required in the model for this setting. Intuitively this model should perform better than our original model because more features now contribute their opinion for a class, but this is not the case. In this model, if a term i with a_i > b_i adds weight w_{1i} to Score^k(x), then it also adds weight 1/w_{1i} to Score^{Y\k}(x); so the change in the difference between the two scores due to term i is w_{1i} − 1/w_{1i}, as opposed to w_{1i} in the original model. Therefore, in the newly transformed space the documents will not be as well separated as in Fig. 5.2. Another modification that could spread the scores out further is the following: if a term i with a_i > b_i adds weight w_{1i} to Score^k(x), then it should also add weight −w_{1i} or −1/w_{1i} to Score^{Y\k}(x). This is likely to spread out the points in the transformed two dimensional space. The new model and the suggested modifications are just a few of the techniques that could be used in this step of feature selection and partitioning within the proposed broad framework of DTWC. Currently, for feature selection we apply a threshold t on the absolute difference of the class conditional probabilities of a term. Instead, we could use any of the five weighting methods discussed in Chapter 5, or even a totally different method such as information gain. Some of the above mentioned feature selection and partitioning methods might suit one set of problems better than


others; exhaustively determining which method (or combination of methods) provides the best feature partitioning and selection can be explored as part of future work.
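
A minimal sketch of this step is shown below: the first function implements the current threshold-based partitioning into Z^k and Z^{Y\k}, and the second the relaxed two-weight variant of Eq. 6.2 and 6.3; both function names are hypothetical.

    import numpy as np

    def partition_terms(a, b, t=0.0):
        # Threshold the absolute difference of the class conditional
        # probabilities and split the surviving terms by its sign.
        a, b = np.asarray(a, float), np.asarray(b, float)
        significant = np.abs(a - b) > t
        z_k = np.where(significant & (a > b))[0]      # terms opining for class k
        z_rest = np.where(significant & (b > a))[0]   # terms opining for Y\k
        return z_k, z_rest

    def relaxed_weights(a, b, eps=1e-6):
        # Relaxed variant: every term opines on both classes (Eq. 6.2 and 6.3).
        a = np.clip(np.asarray(a, float), eps, None)
        b = np.clip(np.asarray(b, float), eps, None)
        return a / b, b / a                           # f1(j)^k and f2(j)^k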

6.4.3 Opinion Consolidation

The opinions of the discriminative terms are consolidated through opinion pooling techniques having some interesting properties; e.g. linear opinion pooling preserves the information gain provided by the features when they are weighted by the relative risk measure in our framework. We experiment with six different consolidation schemes in Table 6.6. The first three are occurrence based, i.e. for calculating Score^k(x), a discriminative term j contributes with x_j = 1 if it occurs in the document and with x_j = 0 otherwise. The first of these approaches is linear opinion pooling (LOP); the second is linear opinion pooling without the normalization by the sum of term occurrences (Σ_j x_j), i.e. the score is just the sum of the weights (we refer to it as Sum). The third one, referred to as Avg, is determined as



Score^k(x) = Σ_{j∈Z^k} x_j w_j^k / Σ_{j∈Z^k} x_j   and   Score^{Y\k}(x) = Σ_{j∈Z^{Y\k}} x_j w_j^{Y\k} / Σ_{j∈Z^{Y\k}} x_j        (6.4)

i.e. both scores are just the average of the weights of the discriminative terms of document x belonging to the sets Z^k and Z^{Y\k}, respectively. The latter three approaches can be termed frequency based, where the opinion of each term is weighted proportionally to the number of times it occurs in the document. These three are the same as the former approaches except that x_j is no longer boolean; instead, it is equal to the number of times the discriminative term j occurs in the document.

Table 6.6: Combining experts. All values are percentage accuracies. 1, 2 and 3 in the first column refer to the three users of the ECML-A dataset.

               Occurrence Based              Frequency Based
Dataset      LOP      Sum      Avg       FLOP     FSum     FAvg
1           91.00    91.00    78.72     90.56    90.56    78.16
2           92.32    92.36    78.64     92.36    92.40    77.24
3           87.52    87.52    83.64     85.72    85.72    80.44
ECUE-1      92.20    92.20    89.10     82.20    82.20    54.25
ECUE-2      83.30    83.45    85.15     92.10    92.10    89.85
PU1         98.27    98.27    96.89     97.24    97.24    96.55
PU2         97.18    97.18    96.71     95.30    95.30    98.59
Average     91.68    91.71    86.97     90.78    90.78    82.15


We refer to these three approaches with the same abbreviations, prefixed with an 'F', i.e. FLOP, FSum and FAvg, respectively. One thing that can be seen from this table is that there is no absolute winner or loser; almost every approach bags the highest accuracy on at least one of the datasets. The following points can be inferred from the results in this table: a) there is no significant difference between LOP and Sum (similarly for FLOP and FSum), which means that the normalization does not have a significant effect; b) Avg (and FAvg) clearly lags behind the LOP and Sum (FLOP and FSum) approaches, with Avg having a much higher average accuracy than FAvg and the latter having the most variance; and c) the occurrence based approaches on average perform better than their corresponding frequency based approaches. Since each of the six approaches has a winning performance on at least one dataset, this encourages experimentation with other expert opinion consolidation techniques. Many methods for combining experts, such as products of experts [Hinton (1999)] and the supra-Bayesian method [Jacobs (1995)], can be used based on their distinctive properties for better consolidation.
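
The sketch below illustrates how the six schemes differ for a single document; the normalizer of LOP/FLOP is assumed here to run over all significant terms present in the document, while Avg/FAvg normalize within each set as in Eq. 6.4, so treat it as an interpretation rather than the definitive Chapter 5 formulation.

    import numpy as np

    def class_scores(x, w_k, w_rest, z_k, z_rest, scheme="LOP"):
        # x: term counts of one document; w_k, w_rest: per-term weights;
        # z_k, z_rest: index sets of the discriminative terms.
        x = np.asarray(x, float)
        if scheme in ("LOP", "Sum", "Avg"):            # occurrence based
            x = (x > 0).astype(float)                  # x_j becomes boolean
        w_k, w_rest = np.asarray(w_k, float), np.asarray(w_rest, float)
        num_k = (x[z_k] * w_k[z_k]).sum()
        num_r = (x[z_rest] * w_rest[z_rest]).sum()
        if scheme in ("Sum", "FSum"):                  # unnormalized sums
            return num_k, num_r
        if scheme in ("Avg", "FAvg"):                  # within-set averages (Eq. 6.4)
            return (num_k / max(x[z_k].sum(), 1.0),
                    num_r / max(x[z_rest].sum(), 1.0))
        pool = max(x[z_k].sum() + x[z_rest].sum(), 1.0)
        return num_k / pool, num_r / pool              # LOP / FLOP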

6.4.4 Discriminative Model

After opinion pooling, the feature space is transformed from the number of discriminative terms to the number of classes in the dataset. Each point in this space corresponds to a document, while each dimension corresponds to the consolidated opinion of the discriminative terms for that class. A document in this space can be assigned the class for which it has the highest score (Eq. 4.9). This strategy oftentimes gives poor results (see Table 6.6), and is the reason why relative risk and odds ratio have not found much success in the text classification literature. The class distributions within a dataset can be very different from each other, as a result of which the number of features, and hence the computed class scores, might get skewed in favor of one class. For example, spammers distort the spellings of words to bypass word based filters, because of which the number of features selected for the spam class might be very high as compared to that of non-spam. In other problems, the documents of a class tend to be very short or terse as compared to the other classes, resulting in a small number of features for that class. We overcome this deficiency by building classifiers on the newly transformed two dimensional space (shown in Sect. 5.3.3) for the final classification. Keeping the weighting and expert consolidation schemes constant (i.e. RR and LOP), we compare three discriminative models to our technique in Fig. 6.5. This figure shows the comparison of four classifiers, namely DTWC's discriminative model, SVM, logistic regression (LR) and k-means (KM); other techniques such as neural networks, decision trees, and even naive Bayes can also be used for this final classification. All the algorithms perform comparably and there is no clear winner, but the discriminative model of DTWC and SVM stand out as the most consistent ones, while LR is the least consistent and KM is the worst performer.
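
As an illustration of this final step, the sketch below trains a stand-in discriminative learner (logistic regression from scikit-learn) on a placeholder two-dimensional score matrix; it is not the exact linear discriminant of DTWC, and any of the classifiers named above could be substituted.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X2d = rng.random((100, 2))                # placeholder (Score^+, Score^-) pairs
    y = (X2d[:, 0] > X2d[:, 1]).astype(int)   # placeholder labels

    clf = LogisticRegression().fit(X2d, y)    # global discriminative model
    print("training accuracy:", (clf.predict(X2d) == y).mean())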

[Figure: percentage accuracy (y-axis) of DTWC, SVM, LR, and KM across the datasets Inbox 1–3, ECUE-1, ECUE-2, PU1, PU2, and Movies 600.]

Figure 6.5: Performance of discriminative classifiers on the transformed two dimensional feature space.


6.5 Relation to LeGo Framework

Building global models from local patterns is a promising approach to classification [Mannila (2002), Knobbe et al. (2008), Bringmann et al. (2009), Knobbe and Valkonet (2009)]. This approach, often referred to as the LeGo (from local patterns to global models) framework for data mining, is a generic framework that utilizes existing local pattern mining techniques for global modeling in a variety of diverse data mining tasks. It focuses on finding individually promising (highly informative) patterns in the data that are then used as features in global models for classification. The LeGo framework is broad enough to cover and leverage frequent pattern mining, multiview learning, subgroup discovery, pattern teams, and several other popular algorithms [Knobbe et al. (2008)]. The use of both local and global learning also allows greater flexibility in updating model parameters in the face of distribution shift, and contributes to the robustness of the classifier to such shift. Our DTWC framework fits into the LeGo framework, i.e. the DTWC framework can be considered a specialization of the LeGo framework. The LeGo framework consists of three phases, namely, Local Pattern Discovery, Pattern Set Discovery, and Global Modeling. The first phase is responsible for generating a set of candidate patterns, which can turn out to be very large and redundant. The second phase therefore selects a handful of these patterns that are highly informative and relevant, such that they demonstrate only a little redundancy. The third and final phase then uses these selected patterns (referred to as pattern teams) in a global classifier. The first and last steps of the DTWC framework can be mapped to the first and last phases of the LeGo framework, while the second and third steps of the former map to the second phase of the latter. The local model takes advantage of generative probabilities to discover discriminating patterns, while the global model then performs pattern-based classification. It is interesting to note that the sets of significant spam and non-spam terms (Z^+ and Z^-) represent pattern teams [Knobbe and Valkonet (2009)] for a given value of t. To see this, define q(Z) = Σ_{j∈Z} w_j to be the quality measure of a set of patterns (terms) defined by the index set Z. Then, it follows from Eq. 4.3 that there exists no other set of patterns of size |Z^+| with a higher value of q(Z^+). A similar reasoning shows that the set of patterns defined by Z^- is a pattern team.

In other words, these sets are optimal and non-redundant with respect to the quality measure. This quality measure quantifies the collective discriminative power of a set of terms, and a pattern team with respect to this measure and a value of t will retain maximum discrimination power. For each e-mail, a feature is constructed from each pattern team by utilizing this definition of the quality measure. The aggregated opinions (one for spam and one for non-spam class) represent local patterns or features of an e-mail, learned from the training data, that are input to a global classification model. It is desirable that these patterns form a non-redundant and optimal set (i.e. a pattern team) [Knobbe and Valkonet (2009)]. Our local patterns form pattern teams based on the relative risk statistical measure. These local patterns can be model-independent or model-dependent [Bringmann et al. (2009)]. Our local patterns (significant term sets) are model-independent as they are based on discrimination information. Therefore, they can be used as an input to a number of global models such as SVM, KNN, NB, neural networks, etc. that operate on features derived from the pattern teams. We also relate our local and global approach to hybrid generative/discriminative models for classification in Sect. 4.4.4.
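
A small sketch of this pattern team construction is given below; for simplicity it thresholds the term weights directly, whereas Chapter 4 thresholds the difference of the class conditional probabilities, so the cut-off shown here is an illustrative stand-in.

    import numpy as np

    def pattern_team(w, t):
        # w: discriminative weights of all terms for one class; t: threshold.
        w = np.asarray(w, float)
        z = np.where(w > t)[0]      # indices of the significant terms
        q = w[z].sum()              # quality measure q(Z): collective power
        return z, q
    # Because z keeps exactly the highest-weight terms, no other set of |z|
    # terms can achieve a higher q(Z), i.e. z forms a pattern team.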

6.6 Conclusion

We elaborated on the time and memory scalability of DTWC/PSSF for personalized service-side spam filtering, and showed that it is one of the fastest algorithms for text classification and spam filtering, with the smallest memory footprint among the compared classifiers. The built-in feature selection and partitioning of DTWC/PSSF is capable of creating very small classifiers at a negligible performance overhead. This makes it a tempting choice for large-scale problems such as personalized spam filtering. The statistical significance tests confirm that DTWC/PSSF outperforms the rest of the classifiers by a significant margin in terms of the AUC and accuracy measures. We then derive the DTWC framework, which is a rich and diverse framework. It has the potential of generating hundreds of different classifiers, many of which could outperform the state of the art solutions for some problems. We have demonstrated its potential by evaluating it with five weighting measures, six opinion consolidation techniques, and four global classifiers. Out of these classifiers, the relative risk measure with linear opinion pooling, followed by a linear discriminant or an SVM classifier, performs consistently better than the rest. This framework has promising potential to develop into very efficient classifiers, in terms of performance, space, and computational complexity, to suit different text classification problems. We then relate it to the existing LeGo approach and show that the DTWC framework can be considered one of its instances. Since our pattern teams are model-independent, our approach can be applied to the various problem settings that follow the LeGo approach, such as pattern mining, multiview learning, subgroup discovery, and pattern teams, for various application domains.


Chapter 7

Conclusion and Future Work

Because of its simplicity, scalability, and robustness, DTWC/PSSF has found applications in text clustering, visualization, and feature extraction. It can also be extended for keyword extraction and topic identification. Evaluation of PSSF for spam filtering problems other than e-mail can be quite interesting as well. It can be further improved for better personalization, parameter convergence, and feature construction. We therefore divide this chapter into three parts: Sect. 7.1 serves as the conclusion to the thesis, in Sect. 7.2 we discuss the published research motivated by the work in this thesis, and finally in Sect. 7.3 we suggest improvements and discuss potential application areas of our thesis work.

7.1 Conclusion

Automatic content-based text classification into predefined categories is becoming extremely useful with the increasing availability of text documents in digital formats such as Web pages, e-mails, Web blogs, digital libraries, and corporate text databases. A common application of text classification is information filtering, where a stream of documents (e.g. e-mails) is classified before or after reaching its destination. Other applications of text classification include document organization, web page categorization, and query classification. More recently, text classification has been used for semantic analysis, review mining, and word sense disambiguation. The scope and scale of text classification applications is bound to increase in the future as more text documents in digital formats become available. However, text classification is challenging for a number of reasons. Firstly, text documents are sparsely represented in a very high dimensional feature space (easily in the hundreds of thousands), making learning and generalization difficult. Secondly, the high cost of labeling has forced researchers to collect training data from various sources besides the target domain, resulting in a distribution shift between training and test data, which is exacerbated by the evolving distribution of the target domain. Thirdly, although unlabeled data is easily available, its utilization in text classification for improved performance in practice remains a challenge. Furthermore, text classification methods must be scalable to large volumes of data and yet be robust to small sizes of labeled sets. E-mail spam continues to be a menace for users and e-mail service providers (ESPs), with content-based e-mail spam filters providing a popular defence against spam. Spam filtering solutions deployed at the ESP's side can either be a single global filter for all users or multiple filters, each personalized for a specific user. The motivation for having personalized filters is to cater for the differing spam and non-spam preferences of users. However, personalized spam filtering solutions need to be scalable before they are practically useful for large ESPs. Similarly, a global filtering solution needs to be robust to differing preferences of users for its practical usefulness. We address the above issues by presenting a robust and efficient text classification method based on discriminative term weighting, discrimination information pooling, and linear discrimination in a transformed feature space. Based on local and global modeling of discrimination, our filter learns a discriminant function in a two-dimensional feature space in which the document classes are well separated. The two dimensions of the feature space correspond to the linear opinion pool or ensemble average of the discrimination information (opinions) provided by the significant terms. This feature space representation makes our filter robust to distribution shift. In addition to a supervised version, named DTWC, we also present two semi-supervised versions, named PSSF1 and PSSF2, that are suitable for personalized spam filtering because of their superior robustness and personalizability. PSSF1/PSSF2 can be adapted to the distribution of e-mails received by individual users to improve filtering performance. We evaluate DTWC/PSSF on nine text classification datasets (twenty five training and test sets)


out of which six are e-mail datasets, and compare its performance with four benchmark classifiers, namely naive Bayes, maximum entropy, balanced winnow, and support vector machines, in both supervised and semi-supervised settings. In the supervised setting, DTWC outperforms all the other algorithms on all the datasets (except for a few training and test sets of one of the datasets). In the semi-supervised setting, PSSF wins on 22 out of 44 results (accuracy and AUC for the spam datasets) in global and personalized spam filtering. Statistical tests show that DTWC/PSSF performs significantly better than the other algorithms. In particular, DTWC/PSSF performs remarkably well when the distribution shift between training and test data is significant, which is common in e-mail systems. We experiment with five different weighting strategies and four different discriminative classifiers. We also evaluate our algorithms under varying distribution shift, on gray e-mails, on unseen e-mails, and under varying filter sizes. Our personalized spam filter, PSSF, is shown to scale well for personalized service-side spam filtering. In this thesis, we also discuss the nature of the spam filtering problem and the challenges to effective global and personalized spam filtering. We define key characteristics of e-mail classification, such as distribution shift and gray e-mails, and relate them to machine learning problem settings. We also propose a new learning framework which can be specialized to obtain robust and efficient text classifiers. Finally, we discuss the already published research that has benefited from our work, identify future research directions, and outline possible improvements to our approach. DTWC/PSSF is a state-of-the-art classifier for text classification and spam filtering, especially in scenarios with a large distribution shift. PSSF's ability to produce efficient, lightweight filters without requiring user feedback makes it highly desirable for personalized service-side spam filtering.

7.2 Extensions of our Work

DTWC/PSSF is simple, adaptable, and robust; therefore, it has gained importance in document clustering, feature reduction and extraction, and data visualization. In this section we focus on the published research that has already benefited from the technique proposed in this thesis.


7.2.1 Discriminative Document Clustering

Document clustering, a major part of text mining, is mainly used for understanding and summarizing large document collections. One recently proposed technique is CDIM (clustering via discrimination information maximization) [Hassan and Karim (2012)], an iterative partitional clustering algorithm that maximizes the sum of the discrimination information provided by the documents. CDIM not only outperforms the best clustering algorithms, but its clusters are also readily interpretable. CDIM performs clustering in a k-dimensional space, where k is the number of categories in the corpus. This space directly corresponds to the k-dimensional feature space of DTWC/PSSF formed by the aggregated discrimination scores of a document (i.e. Score^k(x)). Currently, the authors use relative risk as the weighting measure and linear opinion pooling for aggregating the discrimination information of each term, but they plan to explore the other weighting measures and pooling techniques discussed in this thesis. They also plan to incorporate soft clustering, and to develop hierarchical and repeated-bisection versions of CDIM. Hassan et al. (2009) make use of this discriminative clustering for content-based tag recommendation in social bookmarking systems. Their experiments show that even though tag based clustering is more accurate than term based clustering, combining the predictions from both gives even better results. Hassan et al. (2010) propose a self-optimization strategy for this approach that demonstrates promising results.

7.2.2 Dimensionality Reduction/Feature Extraction

In text mining and classification, dimensionality reduction through feature extraction is often used for efficient and effective understanding and modeling of text documents. One recently proposed approach is FEDIP [Tariq and Karim (2011)], a supervised dimensionality reduction technique through feature extraction, motivated by the discrimination information provided by a term for a category. The authors compare FEDIP with popular text dimensionality reduction techniques using SVM and naive Bayes classifiers. The results show that FEDIP produces a low-dimensional feature space that yields higher classification accuracy as compared to the other techniques; it is significantly faster as well. Just like DTWC/PSSF, FEDIP maps documents from the term space to a lower dimensional feature space through term weighting, term selection, and linear opinion pooling. Furthermore, they extend this mapping beyond the k-dimensional feature space, i.e. the feature space can now have up to as many dimensions as the total number of unique terms in the dataset. They empirically demonstrate that, for classification, FEDIP outperforms the most popular dimensionality reduction techniques such as latent semantic indexing and linear discriminant analysis, especially when the number of dimensions in the feature space is very low. Furthermore, the identified features are readily interpretable.

7.2.3 Feature Weighting Classifier

Malik et al. (2010a) propose a feature weighting classifier (FWC) that uses information gain (IG) to directly compute per-class feature weights. FWC is not only simple and fast, but its performance is at least comparable to, and often better than, naive Bayes, balanced winnow, and linear SVM on text and web datasets. FWC consists of a generative model that corresponds to the local model of DTWC (Eq. 4.9), except that it uses information gain for the feature weights and term frequencies instead of term occurrences. The authors acknowledge this by stating, “FWC draws inspiration from [22] and [13]...”, where [13] refers to our work.

7.2.4 Others

Apart from the above mentioned techniques, many other published works have compared their approaches and results with ours, and have cited the work included in this thesis on text classification and spam filtering. Some of these publications are: Caruana and Li (2012), Javed et al. (2012), Cormack (2007a), Zhang et al. (2007), Zhen et al. (2011), Teng and Teng (2008), Yoshinaka et al. (2011), Klangpraphant and Bhattarakosol (2009), Fu and Gali (2012), Mojdeh (2012), Bouguila and Amayri (2009), and Lynam (2009).


7.3 Future Work

The proposed work can be extended in two aspects: one is improvement of the approach, while the other is its extension to new domains. The proposed classifier framework can be improved for better personalization, parameter convergence, and feature construction; we briefly discuss a few such improvements below:

• In this thesis, spam filtering is discussed as a personalized vs. global filtering problem, with both having their pros and cons. The best of both worlds can be achieved if we combine the two. If users are clustered based on their interests, then a separate filter can be learned for each cluster. Users whose interests are starkly different from the others will form single-user clusters of their own. This semi-personalized solution would save memory, as the total number of filters is decreased, while giving better results for users belonging to a cluster, as each filter is learned on the collective e-mails of all its users.

• We have employed a naive semi-supervised learning (SSL) approach in PSSF. Experimenting with other SSL approaches might give better adaptation to the distribution shift. For both PSSF1 and PSSF2, when we obtain the initial labels for the test data, we totally discard the local model learned on the training data and learn the final local model on the newly labeled test data. Even though this results in good performance, it is unreasonable to assume that the whole local model learned on the training data was useless; surely we should make use of part of this training model as well. One approach could be to combine both local models completely. Another approach could be to combine only a part of the local model of the training data by selecting only the most discriminative terms, because for these terms a sharp distribution shift in the test data is unlikely. Yet another strategy could be to augment the test data, during the semi-supervised step, with some of the e-mails from the training data. These e-mails could be determined either by finding e-mails in the training data that are similar to the e-mails in the test data, or by performing clustering in the combined feature space of the training and test e-mails and selecting the training e-mails that are clustered with test e-mails. Currently we only perform batch adaptation; we can also try sequential updates of the model.

• According to the LeGo framework, the larger the number of local patterns, the likelier it is to converge to an optimal global classifier. We can increase the number of local patterns for DTWC/PSSF by adding bi-grams and tri-grams to the current set of uni-gram features. This is likely to increase the performance without any significant increase in the size of the final filter, as the less discriminative features are automatically filtered out by the threshold parameter.

• A single threshold parameter is currently used to determine the significant terms; instead, a separate threshold parameter can be learned for each category. This way we can choose a higher threshold for categories that are more prone to distribution shift, or a lower threshold for categories having fewer features. Additionally, a separate threshold can be used for each user inbox. This is likely to improve the filtering performance but will also incur a computational overhead.

• Currently, we consider each term independently of the others in the local discrimination model, which disregards any correlation between terms. One way of addressing this is to discover combinations of terms (or phrases) with high discrimination power and use them as weighted features in the local model.

• We can experiment with more weighting measures, opinion consolidation techniques, and global classifiers. We can also provide code for DTWC/PSSF and integrate it with popular classification and mining tools such as Weka and RapidMiner.

Due to its scalability, robustness, and simplicity, DTWC/PSSF can be successfully applied to domains related to text classification and spam filtering, especially where the distribution shift is large and there is a need for lightweight filters. We discuss some of them below:

• With the advent of Web 2.0, interaction through social networking sites, blogs, tweets, wikis, collaborative tagging, etc. has grown phenomenally; consequently, the problem of spam is no longer limited to e-mails only. The menace of spam is plaguing every electronic medium of communication, whether it be instant messaging, Usenet newsgroups, Web search engines, blogs, Wikipedia pages, online classified advertisements, mobile phone messaging, Internet forums, fax transmissions, or social networking applications. Despite the differences between these services, they share the common characteristics of high dimensionality, sparsity, distribution shift, and the adversarial nature of spammers. Given DTWC/PSSF's exceptional performance in these problem settings, it would be worthwhile to evaluate its performance on spam in other domains, especially in blogs, discussion forums, tweets, instant messages, and SMS messages.

• DTWC/PSSF is not limited to hard classification; instead, it outputs a separate score for each document for each category, which is thresholded to obtain the classification. Therefore it can be used in scenarios where a ranking of documents is desirable, such as the ordering of positive (or negative) reviews, or the ordering of customer complaints based on importance. Furthermore, it can be used for multi-label problems. It is reasonable to assume that DTWC/PSSF will perform reasonably well in such systems because its identified significant words form optimal pattern teams.

• By utilizing the input-to-feature transformation of DTWC, Tariq and Karim (2011) successfully identified terms that are highly related to a category. The utility of these terms was measured for classification, but these terms can also be used for the identification of category topics, for keyword identification, or for constructing summaries of the documents.

• With DTWC/PSSF giving good results even with very small filter sizes, its feasibility for client-side filtering, especially on memory-limited devices such as handheld devices and smartphones, is quite promising.

• For a multi-word query, a search engine determines weights for each unique term based on the history/context of a user. These weights help in retrieving the most relevant documents. The discriminative term weights proposed in this thesis could be a good substitute for these weights.


Bibliography

E. Agirre and O.L. de Lacalle. On robustness and domain adaptation using SVD for word sense disambiguation. In COLING-08: Proceedings of the 22nd International Conference on Computational Linguistics, pages 17–24. Association for Computational Linguistics, 2008.

R. Alkula. From plain character strings to meaningful words: Producing better full text databases for inflectional and compounding languages with morphological analysis software. Information Retrieval, 4(3):195–208, 2001.

E. Alpaydin. Introduction to Machine Learning. MIT Press, 2004.

R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005. ISSN 1532-4435.

G. Andrew. A hybrid markov/semi-markov conditional random field for sequence segmentation. In EMNLP 06: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006.

I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, and C.D. Spyropoulos. An experimental comparison of naive bayesian and keyword-based anti-spam filtering with personal e-mail messages. In SIGIR-00: Proceedings of the 23rd Conference on Research and Development in Information Retrieval. ACM, 2000.

C. Apté, F. Damerau, and S. M. Weiss. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12:233–251, 1994. ISSN 1046-8188.


S. Atkins. Size and cost of the problem. In Proceedings of the Fifty-sixth Internet Engineering Task Force (IETF) Meeting, pages 16–21, 2003.

M. Atzmueller, F. Lemmerich, B. Krause, and A. Hotho. Towards understanding spammers – discovering local patterns for concept description. In Workshop on From Local Patterns to Global Models, 2009.

P. Azevedo and A. Jorge. Ensembles of jittered association rule classifiers. Data Mining and Knowledge Discovery, 21:91–129, 2010. ISSN 1384-5810.

L.D. Baker and A. McCallum. Distributional clustering of words for text classification. In SIGIR-98: Proceedings of the 21st Conference on Research and Development in Information Retrieval. ACM, 1998.

R. Barzilay, K. R. McKeown, and M. Elhadad. Information fusion in the context of multi-document summarization. In ACL 99: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 1999.

M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006. ISSN 1532-4435.

S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS-07: Advances in Neural Information Processing Systems, pages 137–144. MIT Press, 2007.

S. Bickel. ECML-PKDD discovery challenge 2006 overview. In Proceedings of ECML-PKDD Discovery Challenge, 2006.

S. Bickel and T. Scheffer. Dirichlet-enhanced spam filtering based on biased samples. In NIPS-06: Advances in Neural Information Processing Systems, 2006.


S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10:2127–2155, 2009.

B. Bigi. Using Kullback-Leibler distance for text categorization. In ECIR-03: Proceedings of the 25th European Conference on Information Retrieval Research. Springer, 2003.

J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In EMNLP-06: Proceedings of the 11th Conference on Empirical Methods in Natural Language Processing, pages 120–128. Association for Computational Linguistics, 2006.

A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT 98: Proceedings of the Eleventh Annual Conference on Computational Learning Theory. ACM, 1998.

A.L. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, pages 245–271, 1997.

H. Borko and M. Bernick. Automatic document classification. Journal of the ACM, 10(2):151–162, April 1963.

G. Bouchard and B. Triggs. The trade-off between generative and discriminative classifiers. In IASC 04: 16th Symposium of IASC, Proceedings in Computational Statistics, 2004.

N. Bouguila and O. Amayri. A discrete mixture-based kernel for SVMs: Application to spam and image categorization. Information Processing and Management, 45(6):631–642, 2009.

P.O. Boykin and V.P. Roychowdhury. Leveraging social networks to fight spam. Computer, 38(4):61–68, 2005.

A.P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.

B. Bringmann, S. Nijssen, and A. Zimmermann. Pattern-based classification: A unifying perspective. In Workshop on From Local Patterns to Global Models, 2009.


A. Broder, M. Fontoura, V. Josifovski, and L. Riedel. A semantic approach to contextual advertising. In SIGIR 07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2007.

C. D. Manning, P. Raghavan, and H. Schütze. An Introduction to Information Retrieval. Cambridge University Press, England, 2009.

G. Campbell and J. Skillings. Nonparametric stepwise multiple comparison procedures. Journal of the American Statistical Association, 80(392), 1985.

M. F. Caropreso, S. Matwin, and F. Sebastiani. A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization, pages 78–102. IGI Publishing, Hershey, PA, USA, 2001. ISBN 1-878289-93-4.

G. Caruana and M. Li. A survey of emerging approaches to spam filtering. ACM Computing Surveys (CSUR), 44(2):9, 2012.

V. R. Carvalho and W. Cohen. Single-pass online learning: performance, voting schemes and online feature selection. In KDD 2006: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006.

D. Chakrabarti and K. Punera. Event summarization using tweets. In ICWSM 11: Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media. AAAI Press, 2011.

M. Chang, W. Yih, and R. McCann. Personalized spam filtering for gray mail. In CEAS-08: Proceedings of the 5th Conference on Email and Anti-Spam, 2008.

O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010. ISBN 0262514125, 9780262514125.

V. Cheng and C.H. Li. Personalized spam filtering with semi-supervised classifier ensemble. In WI-06: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, 2006.

Y. M. Chung and J. Y. Lee. A corpus-based approach to comparative evaluation of statistical term association measures. Journal of the American Society for Information Science and Technology, 52(4):283–296, 2001.

I. Cohen, F.G. Cozman, N. Sebe, M.C. Cirelo, and T.S. Huang. Semi-supervised learning of classifiers: Theory, algorithms for Bayesian network classifiers and application to human-computer interaction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26:1553–1567, 2004.

M. Collins and Y. Singer. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.

R. Collobert, F. Sinz, J. Weston, and L. Bottou. Large scale transductive SVMs. Journal of Machine Learning Research, 7:1687–1712, 2006. ISSN 1532-4435.

G. V. Cormack. Email spam filtering: A systematic review. Foundations and Trends in Information Retrieval, 1(4):335–455, 2007a.

G.V. Cormack. TREC 2006 spam track overview. In TREC-06: Proceedings of the 15th Text Retrieval Conference, 2006a.

G.V. Cormack. Harnessing unlabeled examples through application of dynamic Markov modeling. In Proceedings of ECML-PKDD Discovery Challenge Workshop, 2006b.

G.V. Cormack. TREC 2007 spam track overview. In TREC-07: Proceedings of the 16th Text Retrieval Conference, 2007b.

G.V. Cormack and T. R. Lynam. TREC 2005 spam track overview. In TREC-05: Proceedings of the 14th Text Retrieval Conference, 2005.

S. Corston-Oliver and M. Gamon. Normalizing German and English inflectional morphology to improve statistical word alignment. Machine Translation: From Real Users to Research, pages 48–57, 2004.


C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In NIPS-04: Advances in Neural Information Processing Systems, 2004.

E. Cortez, M. R. Herrera, A. S. da Silva, E. S. de Moura, and M. Neubert. Lightweight methods for large-scale product categorization. Journal of the American Society for Information Science and Technology, 62(9):1839–1848, 2011.

I. Dagan, Y. Karov, and D. Roth. Mistake driven learning in text categorization. In EMNLP-97: Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing, 1997.

A. Dasgupta, P. Drineas, B. Harb, V. Josifovski, and M. W. Mahoney. Feature selection methods for text classification. In KDD-07: Proceedings of the 13th International Conference on Knowledge Discovery and Data Mining. ACM, 2007.

H. Daumé, T. Deoskar, D. McClosky, B. Plank, and J. Tiedemann, editors. Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing. Association for Computational Linguistics, July 2010.

O. Dekel and O. Shamir. Multiclass-multilabel classification with more classes than examples. In AISTATS-10: Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.

S.J. Delany, P. Cunningham, and L. Coyle. An assessment of case-based reasoning for spam filtering. Journal of Artificial Intelligence Review, 24(3-4):359–378, 2005a.

S.J. Delany, P. Cunningham, A. Tsymbal, and L. Coyle. A case-based technique for tracking concept drift in spam filtering. Knowledge Based Systems, 18:187–195, 2005b.

S.J. Delany, P. Cunningham, and A. Tsymbal. A comparison of ensemble and case-base maintenance techniques for handling concept drift in spam filtering. In FLAIRS-06: Proceedings of the 19th International Florida Artificial Intelligence Research Society Conference. AAAI Press, 2006.

K. Dembczynski, W. Kotlowski, and R. Slowinski. ENDER: a statistical framework for boosting decision rules. Data Mining and Knowledge Discovery, 21:52–90, 2010. ISSN 1384-5810.

J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

I.S. Dhillon, S. Mallela, and R. Kumar. A divisive information theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, 3:1265–1287, 2003.

M. Dredze, T. Lau, and N. Kushmerick. Automatically classifying emails into activities. In IUI 06: Proceedings of the 11th International Conference on Intelligent User Interfaces. ACM, 2006.

G. Druck, C. Pal, A. McCallum, and X. Zhu. Semi-supervised classification with hybrid generative/discriminative methods. In KDD-07: Proceedings of the 13th Conference on Knowledge Discovery and Data Mining. ACM, 2007.

S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In CIKM 98: Proceedings of the Seventh International Conference on Information and Knowledge Management. ACM, 1998.

C. W. Dunnett. A multiple comparison procedure for comparing several treatments with a control. Journal of the American Statistical Association, 50:1096–1121, 1955.

H. Esquivel, T. Mori, and A. Akella. Router-level spam filtering using TCP fingerprints: Architecture and measurement-based evaluation. In Proceedings of the 6th Conference on Email and Anti-Spam (CEAS), 2009.

S. Eyheramendy, D. Lewis, and D. Madigan. On the naive bayes model for text categorization. In AISTATS-03: Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics, pages 332–339, 2003.

T. Fawcett. “In vivo” spam filtering: a challenge problem for KDD. SIGKDD Explorations Newsletter, 5(2):140–148, 2003.

D. A. Ford, R. Kraft, and G. Tewari. System and technique for dynamic information gathering and targeted advertising in a web based model using a live information selection and analysis tool. (6606644), August 2003. URL http://www.freepatentsonline.com/6606644.html.

G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289–1305, 2003a.

G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, pages 1289–1305, 2003b.

G. Forman and E. Kirshenbaum. Extremely fast text feature extraction for classification and indexing. In CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management, 2008.

C. Fox. A stop list for general text. ACM SIGIR Forum, 24(1-2):19–21, 1989.

W.B. Frakes. Stemming algorithms. Information Retrieval: Data Structures and Algorithms, pages 131–160, 1992.

M. Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32:675–701, 1937.

M. Friedman. A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics, 11:86–92, 1940.

L. Fu and G. Gali. Classification algorithm for filtering e-mail spams. Recent Progress in Data Engineering and Internet Technology, pages 149–154, 2012.

N. Fuhr and C. Buckley. A probabilistic learning approach for document indexing. ACM Transactions on Information Systems, 9:223–248, 1991. ISSN 1046-8188.

E. Gabrilovich, S. Dumais, and E. Horvitz. Newsjunkie: providing personalized newsfeeds via analysis of information novelty. In WWW 04: Proceedings of the 13th International Conference on World Wide Web. ACM, 2004.

Y. Genc, Y. Sakamoto, and J. Nickerson. Discovering context: Classifying tweets through a semantic transform based on Wikipedia. Foundations of Augmented Cognition: Directing the Future of Adaptive Systems, 6780:484–492, 2011.

X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML 11: Proceedings of the 28th International Conference on Machine Learning, 2011.

J. Goodman, G.V. Cormack, and D. Heckerman. Spam and the ongoing battle for the inbox. Communications of the ACM, 50(2):25–33, 2007.

M. Goodman. Prism: A case-based telex classifier. In IAAI 90: The Second Conference on Innovative Applications of Artificial Intelligence. AAAI Press, 1991.

A. Gray and M. Haahr. Personalised collaborative spam filtering. In CEAS-04: Proceedings of the 1st Conference on Email and Anti-Spam, 2004.

K. Gupta, V. Chaudhary, N. Marwah, and C. Taneja. Inductis India Pvt Ltd. 2006.

V. Gupta and G. Lehal. A survey of text mining techniques and applications. Journal of Emerging Technologies in Web Intelligence, 1(1), 2009.

W. Hämäläinen. StatApriori: an efficient algorithm for searching statistically significant association rules. Knowledge and Information Systems, 23:373–399. Springer London, 2010.

M. T. Hassan and A. Karim. Clustering and understanding documents via discrimination information maximization. In PAKDD 12: Proceedings of the 16th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2012.

M. T. Hassan, A. Karim, S. Manandhar, and J. Cussens. Discriminative clustering for content-based tag recommendation in social bookmarking systems. In Proceedings of ECML-PKDD Discovery Challenge, 2009.

M. T. Hassan, A. Karim, F. Javed, and N. Arshad. Self-optimizing a clustering-based tag recommender for social bookmarking systems. In ICMLA 10: Fourth International Conference on Machine Learning and Applications. IEEE Computer Society, 2010.

P.J. Hayes, P.M. Andersen, I.B. Nirenburg, and L.M. Schmandt. TCS: a shell for content-based text categorization. In CAIA 90: Sixth Conference on Artificial Intelligence Applications, 1990.

G. E. Hinton. Products of experts. In ICANN 99: Ninth International Conference on Artificial Neural Networks. IEE, 1999.

T.K. Ho, J.J. Hull, and S.N. Srihari. Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1):66–75, 1994.

J. Hovold. Naive bayes spam filtering using word-position-based attributes. In CEAS-05: Proceedings of the 2nd Conference on Email and Anti-Spam, 2005.

D. A. Hsieh, C. F. Manski, and D. McFadden. Estimation of response probabilities from augmented retrospective observations. Journal of the American Statistical Association, 80(391):651–662, 1985.

M. Ikonomakis, S. Kotsiantis, and V. Tampakas. Text classification using machine learning techniques. WSEAS Transactions on Computers, 4(8):966–974, 2005.

R. L. Iman and J. M. Davenport. Approximations of the critical region of the Friedman statistic. Communications in Statistics, pages 571–595, 1980.

Ipsos. 2010 MAAWG Email Security Awareness and Usage Report. Ipsos Public Affairs Research. Messaging Anti-Abuse Working Group, 2010.

D. Isa, L. H. Lee, V. P. Kallimani, and R. RajKumar. Text document preprocessing with the Bayes formula for classification using the support vector machine. IEEE Transactions on Knowledge and Data Engineering, 20:1264–1272, 2008. ISSN 1041-4347.

T.S. Jaakkola and D. Haussler. Exploiting generative models in discriminative classifiers. In NIPS 98: Advances in Neural Information Processing Systems, 1998.

R.A. Jacobs. Methods for combining experts' probability assessments. Neural Computation, 7:867–888, 1995.

F. Javed, M. T. Hassan, K. N. Junejo, N. Arshad, and A. Karim. Self-calibration: Enabling self-management in autonomous systems by preserving model fidelity. In ICECCS 12: 17th International Conference on Engineering of Complex Computer Systems. IEEE Computer Society, 2012.

J. Jiang. A literature survey on domain adaptation of statistical classifiers, 2007.

T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In ECML-98: Proceedings of the 10th European Conference on Machine Learning, 1998.

T. Joachims. Making large-scale SVM learning practical. Advances in Kernel Methods – Support Vector Learning. MIT Press, 1999a.

T. Joachims. Transductive inference for text classification using support vector machines. In ICML-99: Proceedings of the 16th International Conference on Machine Learning, 1999b.

T. Joachims. Training linear SVMs in linear time. In KDD-06: Proceedings of the 12th International Conference on Knowledge Discovery and Data Mining, 2006.

Z. Jorgensen, Y. Zhou, and M. Inge. A multiple instance learning strategy for combating good word attacks on spam filters. Journal of Machine Learning Research, 9:1115–1146, 2008.

A. Juan, D. Vilar, and H. Ney. Bridging the gap between naive bayes and maximum entropy for text classification. In PRIS-07: Proceedings of the 7th International Workshop on Pattern Recognition in Information Systems, 2007.

K. N. Junejo and A. Karim. Automatic personalized spam filtering through significant word modeling. In ICTAI-07: Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence, 2007a.

K. N. Junejo and A. Karim. PSSF: a novel statistical approach for personalized service-side spam filtering. In WI-07: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, pages 228–234, 2007b.

K. N. Junejo and A. Karim. A robust discriminative term weighting based linear discriminant method for text classification. In ICDM-08: Proceedings of the 8th International Conference on Data Mining, pages 323–332, 2008.

K. N. Junejo, M.M. Yousaf, and A. Karim. A two-pass statistical approach for automatic personalized spam filtering. In Proceedings of ECML-PKDD Discovery Challenge Workshop, 2006.

C. Kang and J. Tian. A hybrid generative/discriminative Bayesian classifier. In FLAIRS-06: Proceedings of the 19th International Florida Artificial Intelligence Research Society Conference. AAAI Press, 2006.

I. Katakis, G. Tsoumakas, and I. Vlahavas. Tracking recurring contexts using ensemble classifiers: An application to email filtering. Knowledge and Information Systems, 22(3):371–391, 2010.

B. M. Kelm, C. Pal, and A. McCallum. Combining generative and discriminative methods for pixel classification with multi-conditional learning. In ICPR-06: Proceedings of the 18th International Conference on Pattern Recognition, 2006.

J.E. Kennedy and M. P. Quine. The total variation distance between the binomial and poisson distributions. Annals of Probability, 17(1):396–400, 1989.

P. Klangpraphant and P. Bhattarakosol. E-mail authentication system: a spam filtering for smart senders. In ICIS 09: Proceedings of the 2nd International Conference on Interaction Sciences: Information Technology, Culture and Human. ACM, 2009.

B. Klimt and Y. Yang. The Enron corpus: A new dataset for email classification research. In ECML 2004: Proceedings of the European Conference on Machine Learning, 2004.

A. Knobbe and J. Valkonet. Building classifiers from pattern teams. In Workshop on From Local Patterns to Global Models, 2009.

A. Knobbe, B. Crémilleux, J. Fürnkranz, and M. Scholz. From local patterns to global models: the LeGo approach to data mining. In Workshop on From Local Patterns to Global Models, 2008.

A. Kolcz and W. Yih. Raising the baseline for high-precision text classifiers. In KDD-07: Proceedings of the 13th Conference on Knowledge Discovery and Data Mining. ACM, 2007.

A. Kolcz, M. Bond, and J. Sargent. The challenges of service-side personalized spam filtering: scalability and beyond. In InfoScale-06: Proceedings of the 1st International Conference on Scalable Information Systems, page 21. ACM, 2006.

R.E. Kraut, J. Morris, R. Telang, D. Filer, M. Cronin, and S. Sunder. Markets for attention: Will postage for email help?

In CSCW 02: Proceedings of 5th conference on computer supported

cooperative work. ACM, 2002. N. Krawetz. Anti-honeypot technology. Security & Privacy, IEEE, 2(1):76–79, 2004. 175

S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.

A. Kyriakopoulou and T. Kalamboukis. Text classification using clustering. In Proceedings of the ECML-PKDD Discovery Challenge Workshop, 2006.

K. Lai and D. Fox. Object recognition in 3D point clouds using web data and domain adaptation. The International Journal of Robotics Research, 29(8):1019–1037, 2010.

K. Lang. NewsWeeder: Learning to filter netnews. In ICML-95: Proceedings of the 12th International Conference on Machine Learning, 1995.

N. Leavitt. Vendors fight spam's sudden rise. Computer, 40(3):16–19, 2007.

M. LeBlanc and J. Crowley. Relative risk trees for censored survival data. Biometrics, 48(2):411–425, 1992.

J. R. Levine. Experiences with greylisting. In CEAS-05: Proceedings of the 2nd Conference on Email and Anti-Spam, 2005.

D. D. Lewis. An evaluation of phrasal and clustered representations on a text categorization task. In SIGIR-92: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1992.

D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

H. Li, J. Li, L. Wong, M. Feng, and Y. P. Tan. Relative risk and odds ratio: a data mining perspective. In PODS-05: Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2005.

J. Li, G. Liu, and L. Wong. Mining statistically important equivalence classes and delta-discriminative emerging patterns. In KDD-07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007.

J. Liu, J.-Q. Song, and Y.-L. Huang. A generative/discriminative hybrid model: Bayes perceptron classifier. In ICMLC-07: Proceedings of the 6th International Conference on Machine Learning and Cybernetics, 2007.

M. Lu and B. C. Tilley. Use of odds ratio or relative risk to measure a treatment effect in clinical trials with multiple correlated binary outcomes: data from the NINDS t-PA stroke trial. Statistics in Medicine, 20(13):1891–1901, 2001.

D. G. Luenberger. Linear and Nonlinear Programming. Addison-Wesley, Reading, MA, 2nd edition, 1984.

T. R. Lynam. Spam Filter Improvement Through Measurement. PhD thesis, School of Computer Science, University of Waterloo, 2009.

H. Malik, D. Fradkin, and F. Moerchen. Single pass text classification by direct feature weighting. Knowledge and Information Systems, pages 1–20, 2010a.

H. Malik, J. Kender, D. Fradkin, and F. Moerchen. Hierarchical document clustering using local patterns. Data Mining and Knowledge Discovery, 21:153–185, 2010b. ISSN 1384-5810.

H. Mannila. Local and global methods in data mining: basic techniques and open problems. In Automata, Languages, and Programming, volume 2380 of Lecture Notes in Computer Science, page 778. Springer, 2002.

C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

D. Mavroeidis, K. Chaidos, S. Pirillos, and M. Vazirgiannis. Using tri-training and support vector machines for addressing the ECML-PKDD 2006 Discovery Challenge. In Proceedings of the ECML-PKDD Discovery Challenge Workshop, 2006.

A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In AAAI-98: Workshop on Learning for Text Categorization, 1998.

A. McCallum, C. Pal, G. Druck, and X. Wang. Multi-conditional learning: generative/discriminative training for clustering and classification. In AAAI-06: Proceedings of the 21st National Conference on Artificial Intelligence, 2006.

A. K. McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.

J. Meng, H. Lin, and Y. Yu. Transfer learning based on SVD for spam filtering. In ICICCI-10: International Conference on Intelligent Computing and Cognitive Informatics, 2010.

V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive Bayes – which naive Bayes? In CEAS-06: Proceedings of the 3rd Conference on Email and Anti-Spam, 2006.

D. Miao, Q. Duan, H. Zhang, and N. Jiao. Rough set based hybrid algorithm for text classification. Expert Systems with Applications, 36(5):9168–9174, 2009. ISSN 0957-4174.

Microsoft. Microsoft Security Intelligence Report, volume 6. http://www.microsoft.com/sir, 2009.

Microsoft. Microsoft Security Intelligence Report, volume 12. http://www.microsoft.com/sir, 2012.

M. Mojdeh. Personal Email Spam Filtering with Minimal User Interaction. PhD thesis, University of Waterloo, 2012.

E. Montañés, I. Díaz, J. Ranilla, E. F. Combarro, and J. Fernández. Scoring and selecting terms for text categorization. IEEE Intelligent Systems, 20(3):40–47, 2005.

J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, and F. Herrera. A unifying view on dataset shift in classification. Pattern Recognition, 45(1):521–530, 2012.

P. B. Nemenyi. Distribution-free Multiple Comparisons. PhD thesis, Princeton University, 1963.

K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67, 1999.

K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39:103–134, 2000. ISSN 0885-6125.

S. Nijssen and E. Fromont. Optimal constraint-based decision tree induction from itemset lattices. Data Mining and Knowledge Discovery, 21:9–51, 2010. ISSN 1384-5810.

M. Noll and C. Meinel. Web search personalization via social bookmarking and tagging. In The Semantic Web, volume 4825 of Lecture Notes in Computer Science, pages 367–380. Springer Berlin / Heidelberg, 2007.

P. Ogilvie and J. Callan. Experiments using the Lemur toolkit. In Proceedings of the Tenth Text Retrieval Conference (TREC-10), 2001.

N. C. Oza and K. Tumer. Classifier ensembles: Select real-world applications. Information Fusion, 9(1):4–20, 2008.

P. N. Tan, V. Kumar, and J. Srivastava. Selecting the right objective measure for association analysis. Information Systems, 29(4):293–313, 2004.

B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In EMNLP-02: Proceedings of the 7th Conference on Empirical Methods in Natural Language Processing, pages 79–86. Association for Computational Linguistics, 2002.

B. Pang and L. Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, January 2008.

T. Peng, W. Zuo, and F. He. SVM based adaptive learning method for text classification from positive and unlabeled documents. Knowledge and Information Systems, 16(3):281–301, 2008.

B. Pfahringer. A semi-supervised spam mail detector. In Proceedings of the ECML-PKDD Discovery Challenge Workshop, 2006.

N. Provos. A virtual honeypot framework. In Proceedings of the 13th USENIX Security Symposium, 2004.

J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009. ISBN 0262170051, 9780262170055.

S. Radicati. Radicati Email Statistics Report, 2009–2013. The Radicati Group Incorporated, http://www.radicati.com/, 2009.

R. Raina, Y. Shen, and A. Y. Ng. Classification with hybrid generative/discriminative models. In NIPS-03: Advances in Neural Information Processing Systems, 2003.

R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: transfer learning from unlabeled data. In ICML-07: Proceedings of the 24th International Conference on Machine Learning, pages 759–766, 2007.

A. Ramachandran, N. Feamster, and S. Vempala. Filtering spam with behavioral blacklisting. In Proceedings of the 14th ACM Conference on Computer and Communications Security, 2007.

Ferris Research. Email Products: Market Shares, Version Deployed, Migrations, and Software Cost. http://email-museum.com/?p=318858, 2008.

S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995. ISBN 0131038052.

G. Salton and C. Buckley. Term weighting approaches in automated text retrieval. Technical Report 87-881, Dept. of Computer Science, Cornell University, 1987.

G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988. ISSN 0306-4573.

S. L. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:317–328, 1997.

E. P. Sanz, J. M. Gómez Hidalgo, and J. C. Cortizo Pérez. Email spam filtering. Advances in Computers, 74:45–114, 2008.

K.-M. Schneider. A comparison of event models for naive Bayes anti-spam e-mail filtering. In EACL-03: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2003.

K.-M. Schneider. On word frequency information and negative evidence in naive Bayes text classification. In Advances in Natural Language Processing, volume 3230 of Lecture Notes in Computer Science, pages 474–485. Springer Berlin / Heidelberg, 2004.

H. Schütze, D. A. Hull, and J. O. Pedersen. A comparison of classifiers and document representations for the routing problem. In SIGIR-95: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1995.

D. Sculley and G. V. Cormack. Going mini: extreme lightweight spam filters. In CEAS-09: Proceedings of the 6th Conference on Email and Anti-Spam, 2009.

F. Sebastiani. A tutorial on automated text categorisation. In ASAI-99: Proceedings of the 1st Argentinian Symposium on Artificial Intelligence, 1999.

F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47, 2002.

A. K. Seewald. An evaluation of naive Bayes variants in content-based learning for spam filtering. Intelligent Data Analysis, 11(5):497–524, 2007.

R. Segal. Combining global and personal anti-spam filtering. In CEAS-07: Proceedings of the 4th Conference on Email and Anti-Spam, 2007.

R. Segal and G. V. Cormack. The CEAS 2007 live spam challenge. http://www.ceas.cc/2007/challenge/challenge.html, 2007.

R. Segal, J. Crawford, J. Kephart, and B. Leiba. SpamGuru: an enterprise anti-spam filtering system. In CEAS-04: Proceedings of the 1st Conference on Email and Anti-Spam, 2004.

R. Segal, G. V. Cormack, and A. Bratko. The CEAS 2008 spam filter challenge. http://www.ceas.cc/2008/challenge/, 2008.

X. Shen, B. Tan, and C. X. Zhai. Context-sensitive information retrieval using implicit feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–50. ACM, 2005.

D. J. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman and Hall/CRC, 2000.

C. Siefkes, F. Assis, S. Chhabra, and W. Yerazunis. Combining Winnow and orthogonal sparse bigrams for incremental spam filtering. In PKDD-04: Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2004.

L. Spitzner. Honeypots: definitions and value of honeypots. Available from: www.tracking-hackers.com/papers/honeypots.html, 2003.

H. Stern. A survey of modern spam tools. In CEAS-08: Proceedings of the 5th Conference on Email and Anti-Spam, 2008.

J. Suzuki, A. Fujino, and H. Isozaki. Semi-supervised structured output learning based on a hybrid generative and discriminative approach. In ACL-07: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 2007.

Symantec. MessageLabs Intelligence: 2010 annual security report, 2010.

M. Szummer and T. Jaakkola. Information regularization with partially labeled data. In NIPS-03: Advances in Neural Information Processing Systems. MIT Press, 2003.

R. H. R. Tan and F. S. Tsai. Authorship identification for online text. In CW-10: International Conference on Cyberworlds, 2010.

A. Tariq and A. Karim. Fast supervised feature extraction by term discrimination information pooling. In CIKM-11: Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 2011.

D. M. J. Tax and R. P. W. Duin. Using two-class classifiers for multiclass classification. In ICPR-02: Proceedings of the 16th International Conference on Pattern Recognition, 2002.

W.-L. Teng and W.-C. Teng. A personalized spam filtering approach utilizing two separately trained filters. In WI-IAT-08: Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. IEEE Computer Society, 2008.

T. Tompkins and D. Handley. Giving e-mail back to the users: Using digital signatures to solve the spam problem. First Monday, 8(9), 2003.

N. Trogkanis and G. Paliouras. TPN2: Using positive-only learning to deal with the heterogeneity of labeled and unlabeled data. In Proceedings of the ECML-PKDD Discovery Challenge Workshop, 2006.

J. W. Tukey. Comparing individual means in the analysis of variance. Biometrics, 5:99–114, 1949.

K. Tumer and N. C. Oza. Input decimated ensembles. Pattern Analysis & Applications, 6(1):65–77, 2003.

P. D. Turney. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In ACL-02: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.

K. Tzeras and S. Hartmann. Automatic indexing based on Bayesian inference networks. In SIGIR-93: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1993.

D. Xing, W. Dai, G.-R. Xue, and Y. Yu. Bridged refinement for transfer learning. In PKDD-07: Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 324–335. Springer-Verlag, 2007.

J. C. Xue and G. M. Weiss. Quantification and semi-supervised classification methods for handling changes in class distribution. In KDD-09: Proceedings of the 15th Conference on Knowledge Discovery and Data Mining, pages 897–906. ACM, 2009.

A. M. Yegenian. Disposable Email Addresses. PhD thesis, Carnegie Mellon University, 2008.

T. Yoshinaka, S. Ishii, T. Fukuhara, H. Masuda, and H. Nakagawa. A user-oriented splog filtering based on a machine learning. In Recent Trends and Developments in Social Software, volume 6045 of Lecture Notes in Computer Science, pages 88–99. 2011.

Z. Zheng, X. Wu, and R. Srihari. Feature selection for text categorization on imbalanced data. ACM SIGKDD Explorations Newsletter, 6(1):80–89, 2004.

J. H. Zar. Biostatistical Analysis. Prentice Hall, Englewood Cliffs, 4th edition, 1998.

L. Zhang, J. Zhu, and T. Yao. An evaluation of statistical spam filtering techniques. ACM Transactions on Asian Language Information Processing (TALIP), 3(4):243–269, 2004.

X. Zhang, W. Dai, G.-R. Xue, and Y. Yu. Adaptive email spam filtering based on information theory. In B. Benatallah et al., editors, WISE 2007, volume 4831 of Lecture Notes in Computer Science, pages 159–170. Springer, 2007.

Z. Zhang and J. Zhou. Multi-task clustering via domain adaptation. Pattern Recognition, 45:465–473, 2012. ISSN 0031-3203.

Z. Zhen, X. Zeng, H. Wang, and L. Han. A global evaluation criterion for feature selection in text categorization using Kullback-Leibler divergence. In SoCPaR-11: International Conference of Soft Computing and Pattern Recognition. IEEE, 2011.

D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems, 2003.

D. Zhu. Improving the Relevance of Search Results: Search-term Disambiguation and Ontological Filtering. VDM Verlag, Germany, 2009. ISBN 3639140850, 9783639140859.

X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.

X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML-03: Proceedings of the 20th International Conference on Machine Learning, 2003.

C. C. Zou and R. Cunningham. Honeypot-aware advanced botnet construction and maintenance. In DSN-06: Proceedings of the 36th International Conference on Dependable Systems and Networks. IEEE/IFIP, 2006.
