Automated Prevention of Phishing Attacks by Machine Learning Web Application Firewall and GPOs
Konstantinos Demertzis & Lazaros Iliadis
No system is safe!!!
INTRODUCTION
[~] $Whoami…
- Dr. Konstantinos Demertzis (2LT)
  Part-time Lecturer, Computer and Informatics Engineering Department, Eastern Macedonia & Thrace Institute of Technology.
- Dr. Lazaros Iliadis
  Professor of Applied Informatics, Department of Civil Engineering, Democritus University of Thrace, Greece.
Outline
● What is a Phishing Attack?
● Process, Objectives & Types of Phishing Attacks
● Why is Phishing Attack detection important?
● Signature based defense
● Proposed method
● Machine Learning for Security
● Modeling Methodology
● DGA identification
● Phishing Discovery
● Spiking Neural Network
● Results
● Future Directions
What is a Phishing Attack?
● Phishing is typically carried out by email spoofing or instant messaging, and it often directs users to enter personal information at a fake website whose look and feel are almost identical to the legitimate one.
● Communications purporting to be from social web sites, auction sites, banks, online payment processors or IT administrators are often used to lure victims.
● Phishing emails may contain links to websites that are infected with malware.
Phishing Attack Example
Objectives
● The main objectives of a phishing attack are:
  - Trick people into providing sensitive personal information such as account credentials or credit card numbers.
  - Gain further knowledge of internal assets.
  - Expand access into other systems.
● The information is then used to access important accounts and can result in identity theft and financial loss.
Techniques
● Some of the techniques used for phishing attacks include:
  - Spear phishing
  - Clone phishing
  - Whaling
  - Link manipulation
  - Filter evasion
  - Website forgery
  - Covert redirect
  - Social engineering
  - Zero-day
Why is phishing attack detection important?
● Rapid detection of a phishing attack can reduce, contain and prevent further impact of a breach.
● Detection of a phishing attack enables SecOps and InfoSec teams to act more efficiently.
Why is phishing attack detection important?
93% of phishing emails are now ransomware. Typical ransomware behavior:
● Modifies registry keys (mostly for persistence, i.e. to execute after reboot).
● Renames and encrypts files, changing their extensions (targets the user's documents, e.g. .doc, .xls, .ppt, .mp3, wallet).
● Modifies the Master Boot Record to prevent rebooting, usually by encrypting it, relocating it and placing a replacement.
● Removes Volume Snapshot Service (VSS) files, or volume shadow files, which are used for system restoration and backup.
● Polymorphic/metamorphic behavior.
Signature based defense
● The rise of ransomware exemplifies how malicious actors always adapt and create new methods of attack to bypass system protections.
● With ransomware in particular, specific verticals have been targeted because they depend heavily on information availability in order to operate.
● Current defense technologies such as antivirus and firewalls are based purely on static signatures.
● This signature-based approach means malicious actors can and will modify their code in order to bypass these defenses.
● The approach is limited and passive, forcing defenders to constantly develop and update signatures in order to detect and prevent malicious code attacks.
Proposed method
● A new approach that uses machine learning techniques and leverages the processing power of big data technologies may provide a different and more comprehensive defense, one that does not depend on static signatures.
● This paper proposes the development of the Machine Learning Web Application Firewall (MLWAF), which is innovative, ultra-fast and has low requirements.
● It is an automated smart tool which builds Group Policy Objects (GPOs) and pushes them into a Windows Domain for automated prevention of phishing attacks.
● It runs under the Windows Server operating system and its reasoning is based on advanced computational intelligence approaches.
Use the right tools for the job
MACHINE LEARNING FOR SECURITY
Big Data, ML & Cyber Security
● Big Data: Synthesis of technologies providing visibility into the analysis of large data sets and the ability to discover patterns, trends, and associations, especially relating to human behavior and interactions.
● Machine Learning: Subfield of computer science/statistics. Explores the study and construction of algorithms that can learn from and make predictions on data.
● ML allows us to go beyond static signature-based technologies, but it can be challenging to apply to enterprise volumes of user data.
● Combining traditional security tools with data science creates a scenario where detection of threats based on dynamic and multi-contextual indicators is possible.
Machine Learning
Advantages of using ML
● Using ML allows us to put together very large and distinct sources of data into a platform for analysis, interpretation and prediction.
● ML allows us to go beyond static signature-based technologies.
● ML creates a scenario where detection of threats based on dynamic and multi-contextual indicators is possible.
● An ML system that is randomly initialized and trained on some datasets will eventually learn good feature representations for a given task (Feature Learning).
● ML mostly employs a gradient-based method of optimizing a large array of parameters. It is not feasible for a human to find an optimal setting for a large number of parameters by hand, so large-scale ML algorithms such as Stochastic Gradient Descent are used to find one (Parameter Optimization).
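As an illustration of the parameter optimization point, the following is a minimal sketch of Stochastic Gradient Descent on a toy problem. It is not part of MLWAF; the function name `sgd_linear_fit`, the learning rate, the epoch count and the toy data are all assumptions chosen for the example.

```python
import random

def sgd_linear_fit(samples, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    # Random initialization, as described in the Feature Learning bullet
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)              # visit samples in a new random order
        for x, y in data:
            err = (w * x + b) - y      # prediction error on one sample
            w -= lr * err * x          # gradient of 0.5*err**2 w.r.t. w
            b -= lr * err              # gradient of 0.5*err**2 w.r.t. b
    return w, b

# Toy data drawn from y = 2x + 1 (noise-free, purely illustrative)
toy = [(x / 10.0, 2.0 * x / 10.0 + 1.0) for x in range(-10, 11)]
w, b = sgd_linear_fit(toy)
```

Each update uses a single sample, which is what makes the method practical when the number of parameters and samples is large.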
“But all too often we forget the first rule of battle - the battlefield – the attacker can escape everything it cannot escape the terrain – choose the terrain, use the terrain – we win” Sun Tzu
SECURITY ANALYTICS FOR DEFENSE
Modeling Methodology
● Step 1: DGA Identification
● Step 2: Phishing Discovery
● Step 3: Active Defense
● Step 4: Data
● Step 5: ML Algorithm
Garbage in, Garbage out
STEP 1: DGA IDENTIFICATION
DGA Identification
● Domain Generation Algorithm (DGA)
● Bot agents create a dynamic list of multiple FQDNs that can be used as rendezvous points with their C&C servers.
● The large number of potential rendezvous points makes it difficult for law enforcement to effectively shut down botnets, since infected computers will attempt to contact some of these domain names every day to receive updates or commands.
● Because of public-key cryptography, it is infeasible for law enforcement and other actors to mimic commands from the malware controllers: some worms automatically reject any update not signed by the malware controllers.
DGA Identification
● For example, an infected computer could create thousands of domain names such as www.gi9bfb4er2ig4fws8h.ir and would attempt to contact a portion of these in order to receive an update or commands.
● The malware embeds the DGA in its unobfuscated binary instead of a list of domains previously generated by the C&C servers. This protects against a strings dump whose output could be fed preemptively into a network blacklisting appliance to restrict outbound communication from infected hosts within an enterprise.
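To make the rendezvous idea concrete, here is a minimal sketch of a date-seeded generator. It is an assumption for illustration only: the hashing scheme, the label length and the function name `generate_domains` are hypothetical and do not reproduce the timestamp DGA used to build the dataset or any real malware family.

```python
import hashlib
from datetime import date

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def _label(seed_bytes, length):
    """Turn a SHA-256 digest into a base-36 label of the given length."""
    n = int(hashlib.sha256(seed_bytes).hexdigest(), 16)
    chars = []
    for _ in range(length):
        n, r = divmod(n, 36)
        chars.append(ALPHABET[r])
    return "".join(chars)

def generate_domains(day, count=5, tld=".ir", length=18):
    """Derive `count` pseudo-random domain names from a calendar date."""
    return [
        "www." + _label(f"{day.isoformat()}-{i}".encode("ascii"), length) + tld
        for i in range(count)
    ]

today_list = generate_domains(date(2017, 1, 1))
```

Because the list depends only on the date, a bot and its controller can compute the same names independently, which is why no domain list needs to be stored in the binary.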
Catching Fish
STEP 2: PHISHING DISCOVERY
Detecting phishing attacks
How Modern Web Phishing Works
● In most cases, phishing lures are just a very simple copy of a login page for Facebook, Google, banks, insurance companies, etc.
● The attackers include locally-stored images, CSS, and JavaScript to produce almost identical copies of the original login page.
● The important difference is the malicious PHP scripts, which send your username and password directly to the attacker.
● The stolen credentials and personal information are used to perform identity theft and fraudulent activities.
● It’s that simple.
Web phishing example
Technical Details of Advanced Phishing Attacks
[Figures: index.php; modules.php; part of chmod.php; visitor_log.php and its logging code]
"The best defense is a good offense"
STEP 3: ACTIVE DEFENSE
Protection for Windows Servers
● Malware and ransomware primarily target Microsoft Windows operating systems.
● Microsoft Windows is the most used operating system in most enterprises and by home users as well.
● In the case of ransomware, and because malicious code evolves constantly, it is very likely that users will still get infected despite protections and new detection technologies.
● One of the most common reasons users get infected despite technical protections is phishing (social engineering).
● In many cases users receive emails or visit websites that present misleading messages and drive them to allow execution of a malicious payload.
Good protection doesn't need to be offensive
● Some of the roadmap items for active defense include Group Policy Object (GPO) scripting and push into Active Directory (AD) once an attack has been detected, creation of Access Control Lists (ACLs) to isolate an infected host, and eventually an open output format that can be fed back into signature-based defense technologies.
● Active defense measures may also consist of operationalized action items performed in an automated fashion, such as service shutdowns, application disabling or computer isolation, which may be combined with the aforementioned items.
● The goal is to use machine learning to discover some common asset classes (in ML terms, class labels).
MLWAF: how does it work?
● MLWAF can extract the name of the payload as it is being served. That name is fed into a PowerShell script, which creates a GPO that disallows execution of the detected malware and is distributed across the systems in an Active Directory environment.
● Even if the names are randomized, MLWAF will find the current name and produce output.
● The script extracts the name of the malicious executable, connects to the Domain Controller using a service account, and pushes a GPO that prevents the executable from running.
MLWAF: how does it work?
● The MLWAF tool and script can be run from the popular security distribution “Security Onion” (https://securityonion.net/).
● You need to create a GPO before executing the script and reference it by name (e.g. 'AntiMalGPO').
● The proof-of-concept script requires the Python Paramiko library. It also requires an SSH server on the Windows Server (FreeSSHD) and appropriate permissions for the AD-linked SSH service account to execute the PowerShell script.
● Once the GPO is pushed, it can be refreshed via scheduled tasks in the Windows Server operating system.
Example of PowerShell GPO script/cmdlet
Set-GPRegistryValue -Name AntiMalGPO -Key 'HKCU\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer\DisallowRun' -ValueName '1' -Type String -Value 'WanaCrypt0r.exe'
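The push step described above could be sketched in Python roughly as follows. This is not the authors' proof-of-concept script: the function names `build_gpo_command` and `push_gpo`, the character check and the password-based login are assumptions made for illustration. The Paramiko calls used (`SSHClient`, `set_missing_host_key_policy`, `connect`, `exec_command`) are part of that library's public API.

```python
_FORBIDDEN = "'\"`;|&$\r\n"

def _checked(value):
    """Reject empty values and shell/PowerShell metacharacters."""
    if not value or any(ch in value for ch in _FORBIDDEN):
        raise ValueError("unexpected characters in value")
    return value

def build_gpo_command(executable, gpo_name="AntiMalGPO", slot=1):
    """Return a Set-GPRegistryValue command listing `executable` under DisallowRun."""
    key = r"HKCU\Software\Microsoft\Windows\CurrentVersion\Policies\Explorer\DisallowRun"
    return (
        f"Set-GPRegistryValue -Name {_checked(gpo_name)} -Key '{key}' "
        f"-ValueName '{int(slot)}' -Type String -Value '{_checked(executable)}'"
    )

def push_gpo(host, user, password, executable):
    """Run the command on the server over SSH (hypothetical helper, not called here)."""
    import paramiko  # imported lazily so the builder above works without it
    client = paramiko.SSHClient()
    # Accepting unknown host keys is only acceptable in a lab setup
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, password=password)
    try:
        _, stdout, stderr = client.exec_command(
            'powershell -Command "' + build_gpo_command(executable) + '"'
        )
        return stdout.read().decode(), stderr.read().decode()
    finally:
        client.close()

command = build_gpo_command("WanaCrypt0r.exe")
```

Validating the extracted name before it is placed in a command line matters here, because the name comes from attacker-controlled traffic.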
“It's not failure, it's data…”
STEP 4: DATA
DGA Dataset
● Domain Generation Algorithms dataset (DGA_dataset).
● 5 features + class (legit or malicious), containing 131,374 patterns.
● 100,000 URLs were chosen at random from the Alexa database of the 1 million most popular domain names, 16,374 malicious URLs were taken from the updated list of the Black Hole DNS database, and 15,000 malicious URLs were generated with the timestamp DGA algorithm.
Phishing Dataset
● To implement and test our approach, we used a dataset of 4000 emails (973 phishing and 3027 legitimate emails).
● Our approach uses sixteen relevant features: HTML format, IP-based URL, Age of Domain Name, Number of Domains, Number of Sub-domains, Presence of JavaScript, Presence of Form Tag, Number of Links, URL Based Image Source, Matching Domains, and 6 groups of Keywords.
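A few of the listed features can be computed from an email's HTML body with the Python standard library alone. The sketch below is an assumption about how such features might be extracted, not the authors' extraction code: the function name `extract_features`, the dictionary keys and the sample markup are hypothetical, and the second feature is read here as "a link whose host is a raw IP address".

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

class _TagCounter(HTMLParser):
    """Collect link targets and note whether form/script tags occur."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.has_form = False
        self.has_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "form":
            self.has_form = True
        elif tag == "script":
            self.has_script = True

IP_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

def extract_features(html_body):
    parser = _TagCounter()
    parser.feed(html_body)
    hosts = [urlparse(link).hostname or "" for link in parser.links]
    return {
        "number_of_links": len(parser.links),
        "presence_of_form_tag": parser.has_form,
        "presence_of_javascript": parser.has_script,
        "ip_based_url": any(IP_HOST.match(h) for h in hosts),
        "number_of_domains": len({h for h in hosts if h}),
    }

sample = ('<html><body><form action="http://192.0.2.7/login.php"></form>'
          '<a href="http://192.0.2.7/verify">Verify</a>'
          '<a href="https://example.com/help">Help</a>'
          '<script>var x = 1;</script></body></html>')
features = extract_features(sample)
```

Features such as Age of Domain Name need external lookups (for example WHOIS) and are left out of this sketch.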
“Maybe the only significant difference between a really smart machine and a human being was the noise they made when you punched them…”
STEP 5: MACHINE LEARNING ALGORITHM
Spiking Neural Network
● A typical spiking neuron model consists of dendrites, which simulate the input level of the network: they collect signals from other neurons and transmit them to the next level, which is called the soma.
● The soma is the processing level: when the input signal passes a specific threshold, an output signal is generated.
● The output signal is taken from the output level, called the axon, which delivers the signal (short electrical pulses called action potentials, or a spike train) to other neurons.
● A spike train is a sequence of stereotyped events generated at regular or irregular intervals. Typically, the spikes have an amplitude of about 100 mV and a duration of 1-2 ms.
● Although the same elements exist in a linear perceptron, the main difference between a linear perceptron and a spiking model is the action potential generated during the stimulation time.
Spiking Neural Network
● Furthermore, the activation function used in spiking models is a differential equation that tries to model the dynamic properties of a biological neuron in terms of spikes.
● The form of the spike does not carry any information; what is important is the number and the timing of spikes.
● The shortest distance between two spikes defines the absolute refractory period of the neuron, which is followed by a phase of relative refractoriness where it is difficult to generate a spike.
● Several spiking models have been proposed in recent years, aiming to model different neurodynamic properties of neurons.
Izhikevich spiking neuron model
● One of the simplest and most versatile models is the one proposed by Izhikevich.
● This model has only nine dimensionless parameters, and it is described by the following equations:
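The equations themselves are missing from this copy of the slides. The nine-parameter form of the Izhikevich model, which matches the parameter count stated above and is assumed here, is commonly written as:

  C dv/dt = k (v - v_r)(v - v_t) - u + I
  du/dt = a [ b (v - v_r) - u ]
  if v >= v_peak: v <- c, u <- u + d

where v is the membrane potential, u the recovery variable and I the input current. The sketch below integrates this form with a simple Euler scheme. The function name `simulate_izhikevich`, the step size and the default parameter values (a commonly published regular-spiking setting) are assumptions for illustration and are not taken from the slides.

```python
def simulate_izhikevich(current, duration_ms=1000.0, dt=0.5,
                        C=100.0, k=0.7, v_r=-60.0, v_t=-40.0,
                        a=0.03, b=-2.0, c=-50.0, d=100.0, v_peak=35.0):
    """Euler-integrate the nine-parameter model; return spike times in ms."""
    v, u = v_r, 0.0                    # start at rest
    spikes = []
    for step in range(int(duration_ms / dt)):
        dv = (k * (v - v_r) * (v - v_t) - u + current) / C
        du = a * (b * (v - v_r) - u)
        v += dt * dv
        u += dt * du
        if v >= v_peak:                # spike: record it and apply the reset
            spikes.append(step * dt)
            v = c
            u += d
    return spikes

quiet = simulate_izhikevich(current=0.0)    # no input current
active = simulate_izhikevich(current=70.0)  # constant input current
```

With no input the neuron stays at its resting potential, while a sufficiently strong constant current produces a spike train whose timing can then be used as the classifier's response.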
“If you know the enemy and know yourself you need not fear the results of a hundred battles…”
RESULTS
Classification Accuracy & RMSE

                 |           DGA Dataset             |         Phishing Dataset
Classifier       | ACC    RMSE    F-Score  ROC Area  | ACC    RMSE    F-Score  ROC Area
Izhikevich SNM   | 98.2%  0.3284  0.982    0.990     | 99.6%  0.2951  0.996    0.995
RBF ANN          | 89.8%  0.5766  0.900    0.980     | 91.3%  0.5514  0.910    0.985
GMDH             | 94.4%  0.5017  0.945    0.955     | 97.8%  0.3983  0.978    0.980
PANN             | 90.9%  0.5633  0.910    0.950     | 96.6%  0.4512  0.965    0.975
FNN-GA           | 96.7%  0.4972  0.967    0.970     | 99.1%  0.3048  0.990    0.990
FNN-PSO          | 96.2%  0.4911  0.962    0.975     | 99.2%  0.3009  0.992    0.990
FNN-ACO          | 89.4%  0.5791  0.895    0.900     | 92.7%  0.5336  0.927    0.950
FNN-ES           | 90.1%  0.5716  0.901    0.901     | 93.5%  0.5125  0.936    0.945
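For readers unfamiliar with the reported measures, the sketch below shows one common way to compute accuracy, F-score and RMSE for a binary classifier. It is an assumption about the definitions, not the evaluation code behind the table: the function name `classification_metrics`, the 0.5 threshold and the toy labels and scores are hypothetical, and the table may use averaged variants of these measures.

```python
import math

def classification_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, F-score and RMSE for binary labels (1 = malicious)."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = len(y_true) - tp - fp - fn
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    # RMSE compares the raw scores, not the thresholded predictions, to the labels
    rmse = math.sqrt(
        sum((t - s) ** 2 for t, s in zip(y_true, y_score)) / len(y_true)
    )
    return {"accuracy": accuracy, "f_score": f_score, "rmse": rmse}

labels = [1, 1, 0, 0]            # toy ground truth
scores = [0.9, 0.4, 0.2, 0.1]    # toy classifier outputs
metrics = classification_metrics(labels, scores)
```

Because RMSE is computed on the scores, a classifier can have high accuracy and still a noticeable RMSE when its outputs are not close to 0 or 1.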
“At this point all the hard work is done”…
FUTURE DIRECTION
MLWAF
References
[1] Demertzis K., Iliadis L. (2015). Evolving Smart URL Filter in a Zone-based Policy Firewall for Detecting Algorithmically Generated Malicious Domains. Proceedings of the SLDS (Statistical Learning and Data Sciences) Conference, LNAI (Lecture Notes in Artificial Intelligence) 9047, Springer, Royal Holloway University London, UK, April 2015, pp. 223-233. doi: 10.1007/978-3-319-17091-6_17.
[2] Virvilis N., Gritzalis D., Apostolopoulos T. (2013). Trusted Computing vs. Advanced Persistent Threats: Can a defender win this game? Proceedings of the 10th IEEE International Conference on Autonomic and Trusted Computing (ATC-2013), IEEE Press, Italy, December 2013, pp. 396-403.
[3] Holz T., Gorecki C., Rieck K., Freiling F. (2008). Measuring and detecting fast-flux service networks. NDSS ’08: Proceedings of the Network & Distributed System Security Symposium.
[4] Vazquez R. (2010). Izhikevich Neuron Model and its Application in Pattern Recognition. Intelligent Information Processing Systems, Vol. 11, No. 1, Neurodynamics.
[5] Zadeh J., Soto R. (2016). Aktaion: A signature-less open source machine-learning tool for ransomware detection. http://www.github.com/jzadeh/Atkaion
Q&A
?
Thanks
http://utopia.duth.gr/~kdemertz/