Towards a Feature Rich Model for Predicting Spam Emails containing Malicious Attachments and URLs
AusDM 2013, Canberra
Khoi-Nguyen Tran*, Mamoun Alazab#, Roderic Broadhurst#
*Research School of Computer Science, #Regulatory Institutions Network, The Australian National University
Introduction (1/2)
• Email traffic
  – 183 billion emails sent every day
  – 66.5% are spam (122 billion)
  – 3.3% contain malicious attachments (6 billion)
• Malicious software (malware) in emails
  – Direct delivery with attachments
  – Indirect delivery with URLs (becoming a more common method)
Introduction (2/2)
• Identifying malicious spam emails
  – Scan attachments and URLs
  – Match against blacklists
• We propose using only the spam email text to predict malicious attachments and URLs
  – Novel features to capture text patterns
  – Self-contained (no external resources)
  – Two real-world data sets
Related Work
• Classification of Malicious Attachments
  – Generate features to model user behaviour
• Classification of Malicious URLs
  – Features from blacklists and from accessing URLs
• Wikipedia Vandalism Detection
  – Text patterns show vandalism regularities
  – Features applicable to detecting malicious content in spam emails
Emails (Standard Structure)

Header:
From: “Postal Service”
To: [email protected]; [email protected]
Date: Sun, 01 Jan 2013 01:23:45 +0100
MIME-Version: 1.0
Content-Type: multipart/mixed;
Subject: Track your parcel #12345

Body (Text Content):
------=_NextPart_001
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

You have an undelivered parcel!
Please follow the instructions attached to find your parcel here:
http://tracking.yourpostoffice.example.com

Attachments:
------=_NextPart_000
Content-Type: application/x-zip-compressed; name="tracking_instructions.zip"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="tracking_instructions.pdf.zip"

(base64 string of attachment)
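As a rough illustration of this structure, the sketch below splits a raw email into the three parts above (header, text content, attachments) using only Python's standard library. This is not the paper's code; "sample.eml" is a placeholder for any raw message file.

```python
from email import policy
from email.parser import BytesParser

# "sample.eml" is a placeholder path to any raw RFC 822 message
with open("sample.eml", "rb") as f:
    msg = BytesParser(policy=policy.default).parse(f)

# Header fields of interest for feature generation
header = {k: msg[k] for k in ("From", "To", "Date", "Subject")}

# Text content: prefer the plain-text part of the multipart body
body = msg.get_body(preferencelist=("plain",))
text = body.get_content() if body is not None else ""

# Attachments: file names and decoded (base64) payloads
attachments = [(part.get_filename(), part.get_payload(decode=True))
               for part in msg.iter_attachments()]
```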
Malicious Spam Emails (Header)
From: “Postal Service”
To: [email protected]; [email protected]
Date: Sun, 01 Jan 2013 01:23:45 +0100
MIME-Version: 1.0
Content-Type: multipart/mixed;
Subject: Track your parcel #12345
Malicious Spam Emails (Header)
From: “Postal Service”
To: [email protected]; [email protected]
Date: Sun, 01 Jan 2013 01:23:45 +0100
MIME-Version: 1.0
Content-Type: multipart/mixed;
Subject: Track your parcel #12345

Spam and spam campaigns are often sent
• In large quantities
• At certain times or time frames
Malicious Spam Emails (Header)
From: “Postal Service”
To: [email protected]; [email protected]
Date: Sun, 01 Jan 2013 01:23:45 +0100
MIME-Version: 1.0
Content-Type: multipart/mixed;
Subject: Track your parcel #12345

Social engineering
• Entices recipients to act
• Impersonates important points of contact
Malicious Spam Emails (Text Content)
------=_NextPart_001
Content-Type: ...
Content-Transfer-Encoding: ...

You have an undelivered parcel!
Please follow the instructions attached to find your parcel here:
http://tracking.yourpostoffice.example.com
Malicious Spam Emails (Text Content)
------=_NextPart_001
Content-Type: ...
Content-Transfer-Encoding: ...

You have an undelivered parcel!
Please follow the instructions attached to find your parcel here:
http://tracking.yourpostoffice.example.com

More social engineering
• Further enticement
• Instructive
Malicious Spam Emails (Text Content)
------=_NextPart_001
Content-Type: ...
Content-Transfer-Encoding: ...

You have an undelivered parcel!
Please follow the instructions attached to find your parcel here:
http://tracking.yourpostoffice.example.com

A seemingly harmless URL that can redirect to compromised Web sites.
Malicious Spam Emails (Attachments)
------=_NextPart_000
Content-Type: application/x-zip-compressed; name="tracking_instructions.zip"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="tracking_instructions.pdf.zip"

(base64 string of attachment)
Malicious Spam Emails (Attachments)
------=_NextPart_000
Content-Type: application/x-zip-compressed; name="tracking_instructions.zip"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="tracking_instructions.pdf.zip"

(base64 string of attachment)

Malicious software may be hidden in compressed files, behind inconspicuous file names and extensions.
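A hypothetical check for such deceptive double extensions might look like the sketch below. The extension lists are illustrative assumptions, not taken from the paper.

```python
# Illustrative extension lists (assumptions, not from the paper)
SUSPICIOUS_CONTAINERS = {".zip", ".rar", ".7z"}
DECOY_EXTENSIONS = {".pdf", ".doc", ".jpg", ".txt"}

def looks_deceptive(filename: str) -> bool:
    """Flag names like 'tracking_instructions.pdf.zip': a decoy
    document extension hidden inside a compressed container."""
    parts = filename.lower().rsplit(".", 2)
    if len(parts) < 3:
        return False
    decoy, container = "." + parts[1], "." + parts[2]
    return container in SUSPICIOUS_CONTAINERS and decoy in DECOY_EXTENSIONS

print(looks_deceptive("tracking_instructions.pdf.zip"))  # True
print(looks_deceptive("report.zip"))                     # False
```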
Email Spam Data Sets
• Real-world data sets
  – Compiled monthly from Jan-Dec 2012
  – Received in anonymised form
• Habul: from the Habul plugin for Thunderbird
  – Users manually label emails as spam
• Botnet: from a global system of spam traps
  – Emails are automatically labelled as spam
Email Spam Data Sets (Botnet - 1/2)
Email Spam Data Sets (Botnet - 2/2)
Feature Engineering
• Header Features (H)
• Text Features
  – Subject (S)
  – Content (P – Payload)
• Attachment Features (A)
• URL Features (U)
Generating Features (Example)
Subject: Track your parcel #12345
• Number of words
• Number of different characters
• Numbers in words
• Character diversity per word
• Non-alphanumeric characters per word
• …
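A minimal sketch of how these subject-line features could be computed follows; this is an assumed implementation, and the paper's exact feature definitions may differ.

```python
def subject_features(subject: str) -> dict:
    """Assumed implementations of the subject-line features above."""
    words = subject.split()
    chars = subject.replace(" ", "")
    n = max(len(words), 1)  # guard against empty subjects
    return {
        "num_words": len(words),
        "num_distinct_chars": len(set(chars)),
        "num_words_with_digits": sum(any(c.isdigit() for c in w) for w in words),
        "char_diversity_per_word": sum(len(set(w)) / len(w) for w in words) / n,
        "non_alnum_per_word": sum(sum(not c.isalnum() for c in w) for w in words) / n,
    }

print(subject_features("Track your parcel #12345"))
```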
Feature Ranking
• Ranking shows the most important patterns of text that distinguish malicious and non-malicious emails
• Example
  – Determined by a Random Forest classifier
  – Botnet, Attachments, Feature S21-BZ2
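A sketch of how such a ranking can be obtained from a Random Forest's impurity-based feature importances; the data here is synthetic, and X, y, and the feature names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic placeholders for the feature matrix, labels, and feature ids
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 2] > 0.5).astype(int)
feature_names = [f"S{i}" for i in range(5)]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by impurity-based importance, highest first
for idx in np.argsort(rf.feature_importances_)[::-1]:
    print(feature_names[idx], round(float(rf.feature_importances_[idx]), 3))
```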
Evaluation Methodology (1/3)
• Partition by months for training/testing
  – Training: combine the data sets from January up to the named month.
  – Testing: combine the remaining months.

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Evaluation Methodology (1/3)
• Partition by months for training/testing
  – Training: combine the data sets from January up to the named month.
  – Testing: combine the remaining months.
  – Example: July (Jul)

Training: Jan Feb Mar Apr May Jun Jul
Testing: Aug Sep Oct Nov Dec
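A sketch of this cumulative monthly partition, assuming each email is stored as a (month, features, label) record; the record layout and data are illustrative assumptions.

```python
# Placeholder records: (month, features, label), month 1 = January
emails = [(m, {"num_words": m}, m % 2) for m in range(1, 13)]

def split_by_month(records, cutoff):
    """Train on January..cutoff, test on the remaining months."""
    train = [r for r in records if r[0] <= cutoff]
    test = [r for r in records if r[0] > cutoff]
    return train, test

# The July example from the slide: train on Jan-Jul, test on Aug-Dec
train, test = split_by_month(emails, cutoff=7)
print(len(train), len(test))  # 7 5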
Evaluation Methodology (2/3)
• Select only spam emails with attachments and/or URLs
• Generate features
• Select feature sets appropriate for the classification task
• Train models and evaluate on test data
Evaluation Methodology (2/3)
• Three classifiers (from spam research)
  – Naïve Bayes (NB): a common baseline
  – Random Forest (RF): seen as robust
  – Support Vector Machine (SVM): seen as the best algorithm when training time is not an issue
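A sketch of the three classifiers using their scikit-learn equivalents; the slides do not name a toolkit, and the training data here is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic placeholders for the training and test feature matrices
rng = np.random.default_rng(0)
X_train, y_train = rng.random((100, 5)), rng.integers(0, 2, 100)
X_test = rng.random((20, 5))

models = {
    "NB": GaussianNB(),                                              # common baseline
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),  # robust
    "SVM": SVC(probability=True),                                    # slowest to train
}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]  # probability of "malicious"
    print(name, scores[:3])
```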
Evaluation Methodology (3/3)
• Measures of success
  – AUC-PR (area under the precision-recall curve)
    • Summarises the precision-recall trade-off; informative when malicious emails are the rare class
  – AUC-ROC (area under the ROC curve)
    • The probability that the classifier ranks a randomly chosen malicious spam email above a randomly chosen non-malicious one
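Both measures can be computed with standard scikit-learn routines, as in this sketch; the labels and scores are placeholders, and average_precision_score is a common approximation of AUC-PR.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Placeholder ground-truth labels (1 = malicious) and classifier scores
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

print("AUC-ROC:", roc_auc_score(y_true, scores))
# average_precision_score is a standard approximation of AUC-PR
print("AUC-PR: ", average_precision_score(y_true, scores))
```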
Classification Results (AUC-PR - 1/2)
Classification Results (AUC-PR - 2/2)
Classification Results (AUC-ROC - 1/2)
Classification Results (AUC-ROC - 2/2)
Discussion (1/2)
• Initial findings are encouraging
  – Success with spam emails containing malicious attachments, but not with malicious URLs
• Advantage of our approach: self-contained sets of features extracted from the email itself
  – No need for external resources, only text analysis and mining
Discussion (2/2)
• Limitation: features are not descriptive enough for URLs
  – Future work: add lexical features for URLs (illustrative sketch below)
• Spam campaigns may cause overly optimistic results
  – Basic analysis shows the diversity of spam campaigns in the data sets is high
  – High diversity strengthens our results
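An illustrative sketch of the kind of lexical URL features this future work could add; the feature choices are assumptions for illustration, not taken from the paper.

```python
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Hypothetical lexical features computed from the URL string alone."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_subdomains": max(host.count(".") - 1, 0),
        "num_digits": sum(c.isdigit() for c in url),
        "has_ip_host": host.replace(".", "").isdigit(),
        "path_depth": parsed.path.count("/"),
    }

print(url_features("http://tracking.yourpostoffice.example.com"))
```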
Conclusion (1/2)
• Task of classifying malicious spam emails with attachments or URLs
• Descriptive sets of text features for the content of spam emails
• Two real-world data sets (Habul and Botnet)
• Results show that spam emails with malicious attachments can be reliably predicted from the email content alone
Conclusion (2/2)
• Compared the classification performance of three classifiers: Naïve Bayes, Random Forest, and Support Vector Machine
• Up to 95% of spam emails with malicious attachments can potentially be identified from the email content
• Potentially large savings in the resources used to detect and filter high-risk spam
Future Work
• Add features to improve classification of spam emails with malicious URLs
• Extract more features from email headers (e.g. graph relationships of email addresses, and potential spam campaigns)
• Use a moving window of training data
• Increase the size and scope of our data sets
• Comprehensive analysis of new feature sets
Thank you!
Towards a Feature Rich Model for Predicting Spam Emails containing Malicious Attachments and URLs
[email protected], [email protected], [email protected]

Acknowledgements
• ARC Discovery Grant on the Evolution of Cybercrime (DP 1096833)
• Australian Institute of Criminology (Grant CRG 13/12-13)
• ANU Research School of Asia and the Pacific
• Australian Communications and Media Authority (ACMA)
• Computer Emergency Response Team (CERT) Australia
Related Work (1/4)
• Email Spam Filtering
  – Classify spam and non-spam emails
  – A mature research field
  – Identifying emails with malicious content remains a problem worthy of investigation
Related Work (2/4)
• Classification of Malicious Attachments
  – Malware inside attachments can do significant damage to computers and spread rapidly
  – Past research focuses on generating features to model the email activity of a user
  – Infected computers can show different outgoing email behaviour
Related Work (3/4)
• Classification of Malicious URLs
  – Blacklists are highly efficient
    • They rely on knowing the malicious URLs in advance
    • Cannot keep up with high-volume spam botnets
  – Generating features from URLs (text, hosting)
    • Requires many external resources
  – Accessing the Web pages of URLs for analysis
    • Comprehensive, but very slow
Related Work (4/4)
• Wikipedia Vandalism Detection
  – Generate features to capture the text patterns that identify vandalism
  – These text features can show regularities in spam emails
  – We hypothesise that spam emails with malicious content have different text patterns compared to non-malicious spam emails
Email Spam Data Sets (Habul – 1/2)
Email Spam Data Sets (Habul – 2/2)
Classification Results (AUC-PR - 1/4)
Classification Results (AUC-ROC - 1/4)
Classification Results (AUC-PR - 2/4)
Classification Results (AUC-ROC - 2/4)
Classification Results (ACC - 1/4)
Classification Results (ACC - 2/4)
Classification Results (ACC - 3/4)
Classification Results (ACC - 4/4)