A Hybrid Approach based on Particle Swarm Optimization and Random Forests for E-mail Spam Filtering
Dr. Bashar Al-Shboul
Assistant Professor, Dept. of BIT
Web Intelligence Research Group
The University of Jordan
[email protected]
Outline: Related Work, Particle Swarm Optimization, Proposed Method, Contribution, Experiment Setup, Results
Introduction
Don’t we all receive spam e-mails?
Spam filtering is treated as a text categorization problem.
The high dimensionality of the problem is a main challenge when sophisticated learning algorithms are applied to text categorization.
Representation: Vector Space Model
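To make the Vector Space Model concrete, here is a minimal sketch of how e-mails can be turned into term-frequency vectors. The toy e-mails, the whitespace tokenizer, and the helper names `build_vocabulary` and `to_vector` are assumptions made for illustration; the slides do not specify how the 86 features of the dataset were built.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return a sorted list of every distinct lowercase token in the corpus."""
    vocab = set()
    for text in documents:
        vocab.update(text.lower().split())
    return sorted(vocab)

def to_vector(text, vocabulary):
    """Map one document to a term-frequency vector aligned with the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Toy corpus, made up for illustration (not taken from SpamAssassin).
emails = ["win money now", "meeting moved to monday", "win a free prize now"]
vocabulary = build_vocabulary(emails)
vectors = [to_vector(e, vocabulary) for e in emails]
```

Every distinct term adds one dimension, which is why realistic corpora quickly become high-dimensional and why feature selection matters.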
Related Work
Artificial Neural Networks (ANN): Silva et al., 2012; Faris et al., 2015; Deshpande et al., 2007
Naïve Bayes (NB): Sakkis et al., 2003
K-Nearest Neighbor (kNN): Drucker et al., 1999
Support Vector Machine (SVM): Blanco et al., 2007
Ensemble Methods (RF, Boosting Trees, Combined SVMs, Voting, among others): Delany et al., 2006; Fernandez-Delgado et al., 2014; DeBarr & Wechsler, 2009; Rios & Zha, 2004
Particle Swarm Optimization
Inspired by the flocking model of Craig Reynolds (1986), in which each agent follows three rules:
avoid crowding local flockmates;
move towards the average heading of flockmates;
move towards the average position of flockmates.
The Algorithm: uses a number of agents (particles) that constitute a swarm moving around the search space looking for the best solution. Each particle adjusts its “flying” according to its own flying experience as well as the flying experience of the other particles.
Particle Swarm Optimization
Swarm: a collection of flying particles, i.e. changing solutions.
Search area: the possible solutions.
Particles move towards a promising area in order to reach the global optimum.
Each particle keeps track of:
its own best solution so far, the personal best (pbest);
the best value found by any particle, the global best (gbest).
Each particle dynamically adjusts its travelling speed according to its own flying experience and that of its colleagues, taking into account:
its current position and velocity;
the distance between its current position and pbest and gbest.
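Particle Swarm Optimization
Positions and velocities are updated as follows:

$$X_i(t+1) = X_i(t) + V_i(t+1)$$

$$V_i(t+1) = W \cdot V_i(t) + r_1 \cdot c_1 \cdot \left[\mathrm{pBest}_i - X_i(t)\right] + r_2 \cdot c_2 \cdot \left[\mathrm{gBest}_i - X_i(t)\right]$$

where
$X_i(t)$ is the position of particle $i$ at iteration $t$;
$V_i(t)$ is the velocity of particle $i$ at iteration $t$;
$W$ is the inertia weight;
$r_1$ and $r_2$ are random numbers between 0 and 1;
$c_1$ and $c_2$ are constants;
$\mathrm{pBest}_i$ is the personal best position of particle $i$;
$\mathrm{gBest}_i$ is the global best position known to particle $i$, i.e. the best position found by any particle.
In the velocity update, the first term is the inertia, the second term is the Personal Influence, and the third term is the Social Influence.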
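The update rules can be sketched in a few lines of Python. This is an illustrative sketch only: the parameter values (`w=0.7`, `c1=c2=1.5`), the search range, the `sphere` test objective, and the function names are assumptions rather than settings from the slides, and `r1` and `r2` are drawn once per particle per iteration.

```python
import random

def pso_step(positions, velocities, pbest, gbest, rng, w=0.7, c1=1.5, c2=1.5):
    """Apply one velocity and position update to every particle, in place."""
    for i in range(len(positions)):
        r1, r2 = rng.random(), rng.random()
        for d in range(len(positions[i])):
            personal = r1 * c1 * (pbest[i][d] - positions[i][d])  # personal influence
            social = r2 * c2 * (gbest[d] - positions[i][d])       # social influence
            velocities[i][d] = w * velocities[i][d] + personal + social
            positions[i][d] += velocities[i][d]

def minimize(objective, dim=2, n_particles=20, iterations=100, seed=0):
    """Minimize objective with a basic global-best PSO; returns (position, value)."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in positions]
    pbest_val = [objective(p) for p in positions]
    best_idx = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[best_idx][:], pbest_val[best_idx]
    for _ in range(iterations):
        pso_step(positions, velocities, pbest, gbest, rng)
        for i, p in enumerate(positions):
            val = objective(p)
            if val < pbest_val[i]:            # particle improved on its own best
                pbest[i], pbest_val[i] = p[:], val
                if val < gbest_val:           # and on the swarm's best
                    gbest, gbest_val = p[:], val
    return gbest, gbest_val

def sphere(x):
    """Toy objective (an assumption): sum of squares, minimized at the origin."""
    return sum(v * v for v in x)

best, best_val = minimize(sphere)
```

Because gbest is only replaced when a strictly better value is found, the best value returned never gets worse as the number of iterations grows.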
Geometric Particle Swarm Optimization
The only difference from regular PSO is that there is no clear definition of what velocity is, so particle positions cannot be updated as in canonical PSO. Instead, the position update is based on a three-parent mask-based geometric crossover followed by a mutation. The inertia, Personal Influence, and Social Influence components are represented as bit strings, crossed over, and then mutated.
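A minimal sketch of such a velocity-free position update on bit strings follows. It assumes the three parents are the particle's current position, its personal best, and the global best, that each bit is copied from one parent chosen with the given weights, and that mutation is an independent bit flip. The function names and the equal default weights are illustrative assumptions, not the authors' implementation.

```python
import random

def three_parent_crossover(current, pbest, gbest, weights, rng):
    """Build a child by copying each bit from one of three parents.

    The parent for each bit is drawn at random with the given weights, which
    play the role of inertia, personal influence, and social influence.
    """
    parents = (current, pbest, gbest)
    child = []
    for bit in range(len(current)):
        source = rng.choices(parents, weights=weights, k=1)[0]
        child.append(source[bit])
    return child

def bit_flip_mutation(bits, probability, rng):
    """Flip each bit independently with the given probability."""
    return [1 - b if rng.random() < probability else b for b in bits]

def gpso_move(current, pbest, gbest, rng, weights=(1 / 3, 1 / 3, 1 / 3), mutation=0.01):
    """New position of a particle: crossover of the three parents, then mutation."""
    child = three_parent_crossover(current, pbest, gbest, weights, rng)
    return bit_flip_mutation(child, mutation, rng)

rng = random.Random(42)
new_position = gpso_move([0, 1, 0, 1, 0, 1], [1, 1, 0, 0, 1, 1], [1, 1, 1, 0, 0, 1], rng)
```

Wherever all three parents agree on a bit, the child inherits it unless mutation flips it, so the swarm tends to keep the bits on which its best solutions agree.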
Proposed Method
Contribution General Goal: An enhanced spam e-mail classifier
Specific Contribution: Using GPSO to provide a better feature set to the RF spam e-mail classifier, in order to enhance classification quality
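One way to read this contribution is as a wrapper: each GPSO particle is a bit mask over the features, and its fitness is the quality of a classifier trained on the selected columns only. The sketch below shows that idea under stated assumptions. `train_and_score` is a hypothetical callback standing in for training and evaluating a Random Forest, and `toy_score` is a deliberately trivial stand-in so the example runs without any external library.

```python
def select_features(rows, mask):
    """Keep only the columns whose mask bit is 1."""
    return [[value for value, keep in zip(row, mask) if keep] for row in rows]

def subset_fitness(mask, rows, labels, train_and_score):
    """Score one candidate feature subset.

    train_and_score is a caller-supplied (hypothetical) routine that trains a
    classifier on the reduced data and returns a quality measure.
    """
    if not any(mask):
        return 0.0  # an empty feature subset gives the classifier nothing to use
    return train_and_score(select_features(rows, mask), labels)

def toy_score(rows, labels):
    """Stand-in for a real classifier: accuracy of the rule
    'predict spam if the first selected feature is positive'."""
    hits = sum((row[0] > 0) == bool(label) for row, label in zip(rows, labels))
    return hits / len(rows)

# Made-up data: 4 e-mails, 3 features, label 1 = spam.
rows = [[3, 0, 1], [0, 2, 1], [5, 1, 0], [0, 0, 1]]
labels = [1, 0, 1, 0]
good = subset_fitness([1, 0, 0], rows, labels, toy_score)
poor = subset_fitness([0, 0, 1], rows, labels, toy_score)
```

In a full run the optimizer would call `subset_fitness` once per particle per generation and keep the masks with the highest scores.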
Experiment Setup
Dataset:
Source: SpamAssassin
Size: 9346 e-mails (6951 non-spam), 86 features
Class distribution: imbalanced
Tool: Weka Data Mining Tool
Cost Functions / Evaluation Measures: Accuracy, F-Measure (F1), Area under the Receiver Operating Characteristic (ROC) curve, Root Mean Squared Error (RMSE)
Settings:
GPSO: 20 individuals, 20 generations per run, 1% mutation probability, other weights split equally
RF: 100 trees
Other Settings:
Decision Trees: J48
SVM with RBF kernel: gamma and cost tuned with 5-fold cross-validation
kNN: k = 1
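The evaluation measures can be written out directly. The sketch below is illustrative, not Weka's implementation: it assumes binary labels with 1 = spam, computes RMSE against a predicted spam probability, and computes the ROC area as the fraction of (spam, non-spam) pairs ranked correctly. It also works out the majority-class baseline implied by the dataset sizes, which shows why accuracy alone is not enough on imbalanced data.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rmse(y_true, scores):
    """Root mean squared error between labels and predicted probabilities."""
    return math.sqrt(sum((t - s) ** 2 for t, s in zip(y_true, scores)) / len(y_true))

# Dataset sizes from the slides: 9346 e-mails, 6951 of them non-spam.
total, non_spam = 9346, 6951
spam = total - non_spam
majority_baseline = non_spam / total  # accuracy of always predicting non-spam

# Made-up predictions for four e-mails, for illustration only.
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]
y_score = [0.9, 0.2, 0.4, 0.1]
```

A classifier that always answers "non-spam" already reaches roughly 74% accuracy on this dataset while detecting no spam at all, which is one reason to report F1 and the ROC area alongside accuracy.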
Results
Contact Details
First Author
Dr. Hossam Faris
Associate Professor, Department of Business Information Technology
The University of Jordan
[email protected]
Second Author
Dr. Ibrahim Aljarah
Assistant Professor, Department of Business Information Technology
The University of Jordan
[email protected]
Thank You!