Towards a Feature Rich Model for Predicting Spam Emails containing Malicious Attachments and URLs
AusDM 2013, Canberra
Khoi-Nguyen Tran*, Mamoun Alazab#, Roderic Broadhurst#
*Research School of Computer Science, #Regulatory Institutions Network, The Australian National University
Introduction (1/2)
• Email traffic
  – 183 billion emails sent every day
  – 66.5% are spam (122 billion)
  – 3.3% contain malicious attachments (6 billion)
• Malicious software (malware) in emails
  – Direct delivery with attachments
  – Indirect delivery with URLs (becoming a more common method)
Introduction (2/2)
• Identifying malicious spam emails
  – Scan attachments and URLs
  – Match against blacklists
• We propose using only the spam email text to predict malicious attachments and URLs
  – Novel features to capture text patterns
  – Self-contained (no external resources)
  – Two real-world data sets
Related Work
• Classification of Malicious Attachments
  – Generate features to model user behaviour
• Classification of Malicious URLs
  – Features from blacklists and from accessing URLs
• Wikipedia Vandalism Detection
  – Text patterns show vandalism regularities
  – Features applicable to detecting malicious content in spam emails
Emails (Standard Structure)

Header:
From: “Postal Service”
To: [email protected]; [email protected]
Date: Sun, 01 Jan 2013 01:23:45 +0100
MIME-Version: 1.0
Content-Type: multipart/mixed;
Subject: Track your parcel #12345

Body (Text Content):
------=_NextPart_001
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

You have an undelivered parcel!
Please follow the instructions attached to find your parcel here:
http://tracking.yourpostoffice.example.com

Attachments:
------=_NextPart_000
Content-Type: application/x-zip-compressed; name="tracking_instructions.zip"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="tracking_instructions.pdf.zip"

(base64 string of attachment)
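As a rough illustration of this structure, the sketch below splits a raw email into the three parts above (header, text content, attachments) using only Python's standard library. This is not the paper's code; "sample.eml" is a placeholder for any raw message file.

```python
from email import policy
from email.parser import BytesParser

# "sample.eml" is a placeholder path to any raw RFC 822 message
with open("sample.eml", "rb") as f:
    msg = BytesParser(policy=policy.default).parse(f)

# Header fields of interest for feature generation
header = {k: msg[k] for k in ("From", "To", "Date", "Subject")}

# Text content: prefer the plain-text part of the multipart body
body = msg.get_body(preferencelist=("plain",))
text = body.get_content() if body is not None else ""

# Attachments: file names and decoded (base64) payloads
attachments = [(part.get_filename(), part.get_payload(decode=True))
               for part in msg.iter_attachments()]
```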
Malicious Spam Emails (Header)
From: “Postal Service”
To: [email protected]; [email protected]
Date: Sun, 01 Jan 2013 01:23:45 +0100
MIME-Version: 1.0
Content-Type: multipart/mixed;
Subject: Track your parcel #12345
Malicious Spam Emails (Header)
From: “Postal Service”
To: [email protected]; [email protected]
Date: Sun, 01 Jan 2013 01:23:45 +0100
MIME-Version: 1.0
Content-Type: multipart/mixed;
Subject: Track your parcel #12345

Spam and spam campaigns are often sent
• In large quantities
• At certain times or time frames
Malicious Spam Emails (Header)
From: “Postal Service”
To: [email protected]; [email protected]
Date: Sun, 01 Jan 2013 01:23:45 +0100
MIME-Version: 1.0
Content-Type: multipart/mixed;
Subject: Track your parcel #12345

Social engineering
• Entices recipients to act
• Impersonates important points of contact
Malicious Spam Emails (Text Content)
------=_NextPart_001
Content-Type: ...
Content-Transfer-Encoding: ...

You have an undelivered parcel!
Please follow the instructions attached to find your parcel here:
http://tracking.yourpostoffice.example.com
Malicious Spam Emails (Text Content)
------=_NextPart_001
Content-Type: ...
Content-Transfer-Encoding: ...

You have an undelivered parcel!
Please follow the instructions attached to find your parcel here:
http://tracking.yourpostoffice.example.com

More social engineering
• Further enticement
• Instructive
Malicious Spam Emails (Text Content)
------=_NextPart_001
Content-Type: ...
Content-Transfer-Encoding: ...

You have an undelivered parcel!
Please follow the instructions attached to find your parcel here:
http://tracking.yourpostoffice.example.com

A seemingly harmless URL that can redirect to compromised Web sites.
Malicious Spam Emails (Attachments)
------=_NextPart_000
Content-Type: application/x-zip-compressed; name="tracking_instructions.zip"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="tracking_instructions.pdf.zip"

(base64 string of attachment)
Malicious Spam Emails (Attachments)
------=_NextPart_000
Content-Type: application/x-zip-compressed; name="tracking_instructions.zip"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="tracking_instructions.pdf.zip"

(base64 string of attachment)

Malicious software may be hidden in compressed files, behind inconspicuous file names and extensions.
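A hypothetical check for such deceptive double extensions might look like the sketch below. The extension lists are illustrative assumptions, not taken from the paper.

```python
# Illustrative extension lists (assumptions, not from the paper)
SUSPICIOUS_CONTAINERS = {".zip", ".rar", ".7z"}
DECOY_EXTENSIONS = {".pdf", ".doc", ".jpg", ".txt"}

def looks_deceptive(filename: str) -> bool:
    """Flag names like 'tracking_instructions.pdf.zip': a decoy
    document extension hidden inside a compressed container."""
    parts = filename.lower().rsplit(".", 2)
    if len(parts) < 3:
        return False
    decoy, container = "." + parts[1], "." + parts[2]
    return container in SUSPICIOUS_CONTAINERS and decoy in DECOY_EXTENSIONS

print(looks_deceptive("tracking_instructions.pdf.zip"))  # True
print(looks_deceptive("report.zip"))                     # False
```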
Email Spam Data Sets
• Real-world data sets
  – Compiled monthly from Jan-Dec 2012
  – Received in anonymised form
• Habul: from the Habul plugin for Thunderbird
  – Users manually label emails as spam
• Botnet: from a global system of spam traps
  – Emails are automatically labelled as spam
Email Spam Data Sets (Botnet - 1/2)
Email Spam Data Sets (Botnet - 2/2)
Feature Engineering
• Header Features (H)
• Text Features
  – Subject (S)
  – Content (P – Payload)
• Attachment Features (A)
• URL Features (U)
Generating Features (Example)
Subject: Track your parcel #12345
• Number of words
• Number of different characters
• Numbers in words
• Character diversity per word
• Non-alphanumeric characters per word
• …
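A minimal sketch of how these subject-line features could be computed follows; this is an assumed implementation, and the paper's exact feature definitions may differ.

```python
def subject_features(subject: str) -> dict:
    """Assumed implementations of the subject-line features above."""
    words = subject.split()
    chars = subject.replace(" ", "")
    n = max(len(words), 1)  # guard against empty subjects
    return {
        "num_words": len(words),
        "num_distinct_chars": len(set(chars)),
        "num_words_with_digits": sum(any(c.isdigit() for c in w) for w in words),
        "char_diversity_per_word": sum(len(set(w)) / len(w) for w in words) / n,
        "non_alnum_per_word": sum(sum(not c.isalnum() for c in w) for w in words) / n,
    }

print(subject_features("Track your parcel #12345"))
```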
Feature Ranking
• Ranking shows the most important patterns of text that distinguish malicious and non-malicious emails
• Example
  – Determined by a Random Forest classifier
  – Botnet, Attachments, Feature S21-BZ2
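A sketch of how such a ranking can be obtained from a Random Forest's impurity-based feature importances; the data here is synthetic, and X, y, and the feature names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic placeholders for the feature matrix, labels, and feature ids
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 2] > 0.5).astype(int)
feature_names = [f"S{i}" for i in range(5)]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features by impurity-based importance, highest first
for idx in np.argsort(rf.feature_importances_)[::-1]:
    print(feature_names[idx], round(float(rf.feature_importances_[idx]), 3))
```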
Evaluation Methodology (1/3)
• Partition by months for training/testing
  – Training: combine the data sets from January up to the named month.
  – Testing: combine the remaining months.

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Evaluation Methodology (1/3)
• Partition by months for training/testing
  – Training: combine the data sets from January up to the named month.
  – Testing: combine the remaining months.
  – Example: July (Jul)

Training: Jan Feb Mar Apr May Jun Jul
Testing: Aug Sep Oct Nov Dec
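A sketch of this cumulative monthly partition, assuming each email is stored as a (month, features, label) record; the record layout and data are illustrative assumptions.

```python
# Placeholder records: (month, features, label), month 1 = January
emails = [(m, {"num_words": m}, m % 2) for m in range(1, 13)]

def split_by_month(records, cutoff):
    """Train on January..cutoff, test on the remaining months."""
    train = [r for r in records if r[0] <= cutoff]
    test = [r for r in records if r[0] > cutoff]
    return train, test

# The July example from the slide: train on Jan-Jul, test on Aug-Dec
train, test = split_by_month(emails, cutoff=7)
print(len(train), len(test))  # 7 5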
Evaluation Methodology (2/3)
• Select only spam emails with attachments and/or URLs
• Generate features
• Select feature sets appropriate for the classification task
• Train models and evaluate on test data
Evaluation Methodology (2/3)
• Three classifiers (from spam research)
  – Naïve Bayes (NB): a common baseline
  – Random Forest (RF): seen as robust
  – Support Vector Machine (SVM): seen as the best algorithm when training time is not an issue
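A sketch of the three classifiers using their scikit-learn equivalents; the slides do not name a toolkit, and the training data here is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic placeholders for the training and test feature matrices
rng = np.random.default_rng(0)
X_train, y_train = rng.random((100, 5)), rng.integers(0, 2, 100)
X_test = rng.random((20, 5))

models = {
    "NB": GaussianNB(),                                              # common baseline
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),  # robust
    "SVM": SVC(probability=True),                                    # slowest to train
}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]  # probability of "malicious"
    print(name, scores[:3])
```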
Evaluation Methodology (3/3)
• Measures of success
  – AUC-PR (area under the precision-recall curve)
    • Summarises the precision-recall trade-off; informative when malicious emails are the rare class
  – AUC-ROC (area under the ROC curve)
    • The probability that the classifier ranks a randomly chosen malicious spam email above a randomly chosen non-malicious one
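Both measures can be computed with standard scikit-learn routines, as in this sketch; the labels and scores are placeholders, and average_precision_score is a common approximation of AUC-PR.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Placeholder ground-truth labels (1 = malicious) and classifier scores
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

print("AUC-ROC:", roc_auc_score(y_true, scores))
# average_precision_score is a standard approximation of AUC-PR
print("AUC-PR: ", average_precision_score(y_true, scores))
```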
Classification Results (AUC-PR - 1/2)
Classification Results (AUC-PR - 2/2)
Classification Results (AUC-ROC - 1/2)
Classification Results (AUC-ROC - 2/2)
Discussion (1/2)
• Initial findings are encouraging
  – Success with spam emails containing malicious attachments, but not with malicious URLs
• Advantage of our approach: self-contained sets of features extracted from the email itself
  – No need for external resources, only text analysis and mining
Discussion (2/2)
• Limitation: features are not descriptive enough for URLs
  – Future work: add lexical features for URLs (illustrative sketch below)
• Spam campaigns may cause overly optimistic results
  – Basic analysis shows the diversity of spam campaigns in the data sets is high
  – High diversity strengthens our results
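An illustrative sketch of the kind of lexical URL features this future work could add; the feature choices are assumptions for illustration, not taken from the paper.

```python
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Hypothetical lexical features computed from the URL string alone."""
    parsed = urlparse(url)
    host = parsed.netloc
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_subdomains": max(host.count(".") - 1, 0),
        "num_digits": sum(c.isdigit() for c in url),
        "has_ip_host": host.replace(".", "").isdigit(),
        "path_depth": parsed.path.count("/"),
    }

print(url_features("http://tracking.yourpostoffice.example.com"))
```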
Conclusion (1/2)
• Task of classifying malicious spam emails with attachments or URLs
• Descriptive sets of text features for the content of spam emails
• Two real-world data sets (Habul and Botnet)
• Results show that spam emails with malicious attachments can be reliably predicted from the email content alone
Conclusion (2/2)
• Compared the classification performance of three classifiers: Naïve Bayes, Random Forest, and Support Vector Machine
• Up to 95% of spam emails with malicious attachments can potentially be identified from the email content
• Potentially large savings in the resources used to detect and filter high-risk spam
Future Work
• Add features to improve classification of spam emails with malicious URLs
• Extract more features from email headers (e.g. graph relationships of email addresses, and potential spam campaigns)
• Use a moving window of training data
• Increase the size and scope of our data sets
• Comprehensive analysis of new feature sets
Thank you!
Towards a Feature Rich Model for Predicting Spam Emails containing Malicious Attachments and URLs
[email protected], [email protected], [email protected]

Acknowledgements
• ARC Discovery Grant on the Evolution of Cybercrime (DP 1096833)
• Australian Institute of Criminology (Grant CRG 13/12-13)
• ANU Research School of Asia and the Pacific
• Australian Communications and Media Authority (ACMA)
• Computer Emergency Response Team (CERT) Australia
Related Work (1/4)
• Email Spam Filtering
  – Classify spam and non-spam emails
  – A mature research field
  – Identifying emails with malicious content remains a problem worthy of investigation
Related Work (2/4)
• Classification of Malicious Attachments
  – Malware inside attachments can do significant damage to computers and spread rapidly
  – Past research focuses on generating features to model the email activity of a user
  – Infected computers can show different outgoing email behaviour
Related Work (3/4)
• Classification of Malicious URLs
  – Blacklists are highly efficient
    • They rely on knowing the malicious URLs in advance
    • Cannot keep up with high-volume spam botnets
  – Generating features from URLs (text, hosting)
    • Requires many external resources
  – Accessing the Web pages of URLs for analysis
    • Comprehensive, but very slow
Related Work (4/4)
• Wikipedia Vandalism Detection
  – Generate features to capture the text patterns that identify vandalism
  – These text features can show regularities in spam emails
  – We hypothesise that spam emails with malicious content have different text patterns compared to non-malicious spam emails
Email Spam Data Sets (Habul – 1/2)
Email Spam Data Sets (Habul – 2/2)
Classification Results (AUC-PR - 1/4)
Classification Results (AUC-ROC - 1/4)
Classification Results (AUC-PR - 2/4)
Classification Results (AUC-ROC - 2/4)
Classification Results (ACC - 1/4)
Classification Results (ACC - 2/4)
Classification Results (ACC - 3/4)
Classification Results (ACC - 4/4)