
Article

Multi-Layer Hidden Markov Model Based Intrusion Detection System

Wondimu K. Zegeye *, Richard A. Dean and Farzad Moazzami

Department of Electrical and Computer Engineering, Morgan State University, Baltimore, MD 21251, USA; [email protected] (R.A.D.); [email protected] (F.M.)
* Correspondence: [email protected]

Received: 14 October 2018; Accepted: 12 December 2018; Published: 25 December 2018

Mach. Learn. Knowl. Extr. 2019, 1, 265–286; doi:10.3390/make1010017

 

Abstract: The all-IP nature of next generation (5G) networks is going to open the door to many new vulnerabilities, and preventing the risks associated with them will be challenging. The majority of these vulnerabilities might be impossible to detect with simple network traffic monitoring tools. Intrusion Detection Systems (IDS) which rely on machine learning and artificial intelligence can significantly improve network defense against intruders. This technology can be trained to learn and identify uncommon patterns in massive volumes of traffic and notify system administrators, for example through alert flags, for additional investigation. This paper proposes an IDS design which makes use of machine learning algorithms such as the Hidden Markov Model (HMM) using a multi-layer approach. This approach has been developed and verified to resolve the common flaws in the application of HMMs to IDS, commonly referred to as the curse of dimensionality. It factors a huge problem of immense dimensionality into a discrete set of manageable and reliable elements. The multi-layer approach can be expanded beyond two layers to capture multi-phase attacks over longer spans of time. A pyramid of HMMs can resolve disparate digital events and signatures across protocols and platforms into actionable information, where lower layers identify discrete events (such as a network scan) and higher layers identify new states which are the result of multi-phase events at the lower layers. The concepts of this novel approach have been developed but the full potential has not been demonstrated.

Keywords: Intrusion Detection System (IDS); Hidden Markov Model (HMM); multi-stage attacks

1. Introduction

Intrusion Detection Systems have been the subject of much research in both academia and industry in the past few decades as interest in information security has grown rapidly. The National Institute of Standards and Technology (NIST) defines intrusion detection as "the process of monitoring the events occurring in a computer system or network and analyzing them for signs of intrusions, defined as attempts to compromise the confidentiality, integrity, availability, or to bypass the security mechanisms of a computer or network." A system which addresses or automates intrusion detection is referred to as an Intrusion Detection System (IDS) [1].
Intrusion detection systems come in different flavors and approaches. Based on their point of placement, they can be categorized into network-based intrusion detection systems (NIDS) and host-based intrusion detection systems (HIDS). A network intrusion detection system (NIDS) is placed at a strategic point in the network such that packets traversing a particular network link can be monitored. NIDSs monitor a given network interface by placing it in promiscuous mode. This helps the IDS hide its existence from network attackers while performing the task of network traffic monitoring. Host-based IDSs, on the other hand, reside in and monitor individual host machines.

HIDSs operate by monitoring and analyzing host system internals such as operating system calls and file systems. In the same way as a NIDS, they can also monitor the network interface of the host.
The techniques employed by modern IDSs to gather and analyze data are extremely diverse. However, those techniques share common basic features in their structure: a detection module which collects data that possibly contains evidence of intrusion, and an analysis engine which processes this data to identify intrusive activity. Those analysis engines mainly use two techniques of analysis: anomaly detection and misuse detection.
The intrinsic nature of misuse detection revolves around the use of an expert system which is capable of identifying intrusions based on a preordained knowledge base. Consequently, misuse systems are able to reach very high levels of accuracy in identifying even very subtle intrusions which are represented in their knowledge base; similarly, if this expert knowledge base is developed carefully, misuse systems produce a minimal number of false positives [2]. Unfortunately, these architectures have the drawback that a misuse detection system is incapable of detecting intrusions that are not represented in its knowledge base. Subtle variants of known attacks may also escape detection if the misuse system is not well constructed. Therefore, the efficiency of the system is highly dependent on the thorough and accurate creation of this knowledge base, an undertaking that calls for human expertise, hence the need to develop anomaly detection methods.
A wide variety of strategies have been explored to distinguish anomalous events from normal ones, including neural networks, statistical modeling and Hidden Markov Models, to name a few. These approaches rely on the same principle. At first, a baseline model that is representative of normal system behavior, against which anomalous events can be distinguished, is established. When an event indicates anomalous activity compared with the baseline, it is considered malicious. This system characterization can be used to separate anomalous traffic from normal traffic. One of the very attractive features of anomaly-based intrusion detection systems is their capability to identify previously unseen attacks. The baseline model in this case is usually automated and requires neither human interference nor a knowledge base. The downside of this approach is that the detection system may fail to detect even well-known attacks if they are crafted not to be substantially different from the normal behavior established by the system.
Currently, more than fifty percent of web traffic is encrypted, both normal and malicious. The volume of encrypted traffic is expanding even further, which creates confusion and challenges for security teams trying to monitor and identify malicious network traffic. The main goal of encryption is to enhance network security, but at the same time it gives intruders the power to hide command-and-control (C2) activity, giving them enough time to launch attacks and cause damage. To keep up with the intruders, security teams need to include additional automation and modern tools, developed using machine learning and artificial intelligence, to supplement threat detection, prevention and remediation [3]. More enterprises are now exploring machine learning and artificial intelligence to prevail over the effects of encryption and decrease adversaries' time.
These advanced concepts have the capability to keep up their performance without humans having to specify precisely how to accomplish the tasks they are given. Unusual patterns of web traffic that can indicate malicious activity can be detected automatically, as these advanced systems can, over time, "learn" by themselves. Machine learning plays a significant role in automatically detecting "known-known" threats, the types of attacks that have been seen before. But its main advantage in monitoring encrypted web traffic is that it is capable of detecting "known-unknown" threats (previously unknown distinct forms of known threats, malware subdivisions, or similar new threats) and "unknown-unknown" (net-new malware) threats. These technologies automatically alert network administrators to potential attacks, as they can learn to identify unusual patterns in massive volumes of encrypted web traffic.

Those automatic alerts are very important in organizations where there is a lack of knowledgeable personnel to enhance security defenses. Intelligent and automated tools using machine learning and artificial intelligence can help security teams fill the gaps in skills and resources, making them more effective at recognizing and responding to both well-known and emerging threats.
Several techniques of artificial intelligence (AI) have been explored in the path towards developing IDSs, for example, fuzzy logic, artificial neural networks (ANNs) and genetic algorithms (GA). In addition, hybrid intelligent IDSs, such as evolutionary neural network (ENN) and evolutionary fuzzy neural network (EFuNN) based IDSs, are also used.
This work proposes the potential application of the Hidden Markov Model (HMM) for intrusion detection, which is capable of providing a finer-grained characterization of network traffic using a multi-layer approach. Current implementations of HMMs for IDS are mainly based on a single HMM which is trained on any incoming network traffic to identify anomalous and normal traffic during testing. Other HMM-based IDS implementations rely on multiple HMM profiles where each HMM is trained for a specific application-type traffic and posterior probabilities are used to select network applications using only packet-level information that remains unchanged and observable after encryption, such as packet size and packet arrival time [4]. This approach, even if it factors based on application layer traffic, considers a limited number of features and is unable to detect multi-stage attacks which can be crafted to look like normal traffic. Similarly, the work in Reference [5] applied several pre-processing techniques on the dataset considered to implement a multi class system (MCS) HMM-based IDS. In addition to the capability of the multi-layer HMM design to detect multi-stage attacks, the IDS data analysis engine of the approach proposed here uses several features on which a dimension reduction technique is applied to extract only important features. A single-layer HMM which attempts to detect multi-stage attacks from a dataset made up of several steps of an attack trace is implemented in Reference [6]. Recent HMM-based implementations, such as Reference [7], are based on a database of HMM templates corresponding to each attack and eventually used to detect multiple multi-stage attacks.
HMMs use statistical learning algorithms whose cost grows exponentially as the volume of data grows. This aspect is commonly referred to as the curse of dimensionality [8]. HMMs tend to fail, more often than not, on a large-dimension state space. Considering a single-HMM-based IDS, as the incoming network traffic has a large hidden state space, it falls victim to this curse of dimensionality. The solution to these problems involves the principle of decomposition. The decomposition technique [9] comprises the following steps: the decomposition algorithm is applied at lower levels to construct individual HMMs, representing partial solutions, which are then combined to form the final global solution.
The multi-layer approach proposed in this work has been developed and verified to resolve the common flaws in the application of HMM to IDS, commonly referred to as the curse of dimensionality. It factors a huge problem of immense dimensionality into a discrete set of manageable and reliable elements.
The multi-layer approach can be expanded beyond two layers to capture multi-phase attacks over longer spans of time. A pyramid of HMMs can resolve disparate digital events and signatures across protocols and platforms into actionable information, where lower layers identify discrete events (such as a network scan) and higher layers identify new states which are the result of multi-phase events at the lower layers. The concepts of this novel approach have been developed but the full potential has not been demonstrated, as will be described in the next sections.

2. Design Approach

A challenge in applying the Markov model to intrusion detection systems is the lack of a standard method for the translation of the observed network packet data into a meaningful Markov model.

The first step towards building an IDS based on multiple layers of Hidden Markov Models involves processing network traffic into meaningful observations. The IDS designed in this work is as specified in the flow chart of Figure 1. It starts with data processing of the captured Wireshark file: feature generation, feature selection among those generated features or creation of new features combining the generated features, use of a machine learning algorithm for dimension reduction and, finally, application of vector quantization techniques to create meaningful observations for the HMMs. The details of the layered HMM design are discussed in Section 2.5 and the corresponding model structure is as shown in Figure 3.

Figure 1. Flow chart of the IDS model including the data processing.

2.1. CICIDS2017 Data Processing

Datasets such as DARPA98-99 (Lincoln Laboratory 1998–1999), KDD'99 (University of California, Irvine 1998–1999), DEFCON (The Shmoo Group, 2000–2002), UMASS (University of Massachusetts 2011), ISCX2012 (University of New Brunswick 2012) and ADFA'13 (University of New South Wales 2013), to name a few, have been used by several researchers to validate their IDS designs and approaches. Those datasets have shortcomings associated with them. For example, DARPA98-99 and KDD'99 are outdated and unreliable as several flows are redundant; NSL-KDD has been used to replace KDD'99 by removing the significant redundancy. The work in Reference [10] further analyzed the KDD'99 dataset and presented new proofs of these existing inconsistencies. CAIDA (Center of Applied Internet Data Analysis 2002–2016) traffic payloads are heavily anonymized. ADFA'13 lacks attack diversity and variety. Besides these popular datasets, several others are also explored in Reference [11].
Most of these datasets also do not simulate attacks which are multi-stage. The CICIDS2017 dataset is designed to cover what are commonly referred to as the eleven criteria for an IDS dataset evaluation framework (Network, Traffic, Label, Interaction, Capture, Protocols, Attack diversity, Anonymization, Heterogeneity, Features and Metadata). The details of this dataset compared to the commonly available IDS evaluation datasets are presented in Reference [12].
For testing the modeled IDS in this work, a dataset is prepared from the CICIDS2017 dataset. This dataset covers the eleven criteria which are necessary in building a reliable benchmark dataset.

It contains very common attacks such as XSS, Port Scan, Infiltration, Brute Force, SQL Injection, Botnet, DoS and DDoS. The dataset is made up of 81 features including the label and is publicly available on the Canadian Institute for Cybersecurity website [13]. Some of these features are directly extracted while others are calculated for both normal (benign) and anomalous flows using the CICFlowMeter [14]. The CICIDS2017 dataset comprises 3.1 million flow records and covers five days. It includes the following labeled attack types: Benign, DoS Hulk, Port Scan, DDoS, DoS, FTP Patator, SSH Patator, DoS Slow Loris, DoS Slow HTTP Test, Botnet, Web Attack: Brute Force, Web Attack: Infiltration, Web Attack: SQL Injection and Heartbleed. The portion of the "Thursday-WorkingHours-Morning-webattack" capture is used to create the training and testing dataset, which constitutes "BENIGN", "SSH-Patator" and "Web Attack-Brute Force" traffic.

2.2. Feature Selection and Creation

Feature selection involves choosing a subset of features from the initially available features, whereas feature creation is a process of constructing new features and is usually performed after the feature selection process. Feature selection takes a subset of features (M) from the original set of features (N), where M < N. To build a robust and high-performance IDS, the features created or constructed from the subset of selected features could follow a knowledge-based approach. Other approaches which can be applied to construct new features are data-driven, hypothesis-driven and hybrid [15,16].
Among the generated features, the Source Port is discarded. Categorical encoding is applied on the following features (Flow_ID, Source_IP, Destination_IP), resulting in numerical values. In addition, a new feature (label) which identifies the traffic type, such as BENIGN, SSH-Patator and Web Attack-Brute Force, is added. The values corresponding to this new feature are also categorically encoded.
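As an illustration of this encoding step, the following sketch uses pandas to drop the Source Port and integer-encode the identifier columns and the label. It is only a sketch: the file name and the exact column spellings are assumptions, since the actual CICIDS2017 CSV headers may differ from the feature names used here.

```python
import pandas as pd

# Hypothetical file name; the actual CICIDS2017 capture file may be named differently.
df = pd.read_csv("Thursday-WorkingHours-Morning-WebAttacks.csv")

# Discard the Source Port feature (ignored if the column is absent).
df = df.drop(columns=["Source_Port"], errors="ignore")

# Categorical (integer) encoding of the identifier columns and of the traffic label.
for col in ["Flow_ID", "Source_IP", "Destination_IP", "Label"]:
    if col in df.columns:
        df[col] = df[col].astype("category").cat.codes
```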

2.3. Feature Extraction/Dimension Reduction

Feature extraction, also known as dimension reduction, is applied after the feature selection and feature creation procedures on the original set of features. It is a form of transformation whereby a new set of features is extracted from the initial features through some functional mapping [17]. If we initially have n features (or attributes), A1, A2, ..., An, then after feature selection and creation, feature extraction results in a new set of features B1, B2, ..., Bm (m < n), where Bi = Fi(A1, A2, ..., An) and Fi is a mapping function.
Principal Component Analysis (PCA) is a classic technique used to compute a linear transformation by mapping data from a high dimensional space to a lower dimension. The original n features are replaced by another set of m features computed from linear combinations of the initial features. The first principal component contributes the highest variance in the original dataset, and so on. Therefore, in the dimension reduction process, the last few components can be discarded as this results in only a minimal loss of information [18]. The main goals of PCA are:

1. Extract the maximum variation in the data.
2. Reduce the size of the data by keeping only the significant information.
3. Make the representation of the data simple.
4. Analyze the structure of the variables (features) and observations.

Generally speaking, PCA provides a framework for minimizing data dimensionality by identifying principal components, linear combinations of variables, which represent the maximum variation in the data. Principal axes linearly fit the original data so that the first principal axis minimizes the sum of squares for all observational values and maximally reduces residual variation. Each subsequent principal axis maximally accounts for variation in the residual data and acts as the line of best fit directionally orthogonal to the previously defined axes. Principal components represent the correlation between the variables and the corresponding principal axes. Conceptually, PCA is a greedy algorithm fitting each axis to the data while conditioning upon all previous axes' definitions. Principal components project the original data onto these axes, where the axes are ordered such that Principal Component 1 (PC1) accounts for the most variation, followed by PC2, ..., PCp for p variables (dimensions).

Steps to compute PCA using Eigenvalue Decomposition (EVD):
1. From each of the dimensions subtract the mean.
2. Determine the covariance matrix.
3. Determine the eigenvalues and eigenvectors of the calculated covariance matrix.
4. Reduce dimensionality and construct a feature vector.
5. Determine the new data.

The PCA procedure used here applies Singular Value Decomposition (SVD) instead of Eigenvalue Decomposition (EVD) [19]. SVD is numerically more stable as it avoids the computation of the covariance matrix, which is an expensive operation.

Singular Value Decomposition (SVD) for Principal Component Analysis (PCA)

Any matrix X of dimension N × d can be uniquely written as X = U × Σ × V^T, where:
• r is the rank of matrix X (i.e., the number of linearly independent vectors in the matrix);
• U is a column-orthonormal matrix of dimension N × d;
• Σ is a diagonal matrix of dimension d × d where the σi (the singular values) are sorted in descending order across the diagonal;
• V is a column-orthonormal matrix of dimension d × d.

Given a data matrix X, the PCA computation using SVD is as follows:
• X^T X is a rank r (N ≥ d ⇒ r ≤ d), square, symmetric d × d matrix.
• {v̂1, v̂2, ..., v̂r} is the set of orthonormal d × 1 eigenvectors with eigenvalues {λ1, λ2, ..., λr}.
• The principal components of X are the eigenvectors of X^T X.
• σi = √λi are positive real and termed singular values.
• {û1, û2, ..., ûr} is the set of orthonormal N × 1 vectors defined by ûi = (1/σi) X v̂i.
• X v̂i = σi ûi (the "value" form of SVD), where ||X v̂i|| = σi.
• Σ is N × d and diagonal.
• The σi are called the singular values of X. It is assumed that σ1 ≥ σ2 ≥ ... ≥ σr ≥ 0 (rank ordered).

For N > (r = d), the bottom N − r rows of Σ are all zeros and will be removed; the first r rows of Σ and the first r columns of U will be kept, which results in the decomposition shown in Figure 2.

Figure 2. Decomposition of matrix X.

PCA and SVD relation

Let X = UΣV^T be the SVD of matrix X and C = (1/(N − 1)) X^T X be its covariance matrix of dimension d × d. The eigenvectors of C are the same as the right singular vectors of X.

This can be shown with the following proof:

X^T X = VΣU^T UΣV^T = VΣΣV^T = VΣ²V^T    (1)

C = V (Σ²/(N − 1)) V^T    (2)

As C is symmetric, C = VΛV^T. As a result, the eigenvectors of the covariance matrix are the columns of V (the right singular vectors) and the eigenvalues of C can be determined from the singular values: λi = σi²/(N − 1).

PCA using EVD and SVD can be summarized as follows. Objective: project the original data matrix X using the largest m principal components, V = [v1, ..., vm].
1. Zero mean the columns of X.
2. Apply PCA or SVD to find the principal components of X.
   PCA:
   a. Determine the covariance matrix, C = (1/(N − 1)) X^T X.
   b. V corresponds to the eigenvectors of C.
   SVD:
   a. Determine the SVD of X = UΣV^T.
   b. V corresponds to the right singular vectors.
3. Project the data into an m-dimensional space: Y = XV.

To perform dimension reduction and form a feature vector using PCA, order the eigenvalues from the highest to the lowest by value. This ordering places the components in order of significance to the variance of the original data matrix. Then we can discard components of lesser significance. If a number of the eigenvalues are small, we keep most of the information we need and lose only a little information or noise, reducing the dimension of the original data. For example, if we have data of d dimensions and we choose only the first r eigenvectors:

(λ1 + λ2 + ... + λr) / (λ1 + λ2 + ... + λr + ... + λd) = (∑_{i=1}^{r} λi) / (∑_{i=1}^{d} λi)    (3)

Feature Vector = (λ1 λ2 λ3 ... λr)    (4)
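As a concrete illustration of the SVD-based PCA summarized above, the following NumPy sketch zero-means the data, computes the economy SVD, derives the eigenvalues of the covariance matrix from the singular values and keeps the first r components by a cumulative-variance threshold. The 0.99 threshold and the function name are illustrative choices, not values fixed by the paper.

```python
import numpy as np

def pca_svd(X, ptv=0.99):
    """Project data onto the principal components that retain `ptv` of the total variation."""
    Xc = X - X.mean(axis=0)                              # zero-mean the columns
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)    # economy SVD: Xc = U @ diag(s) @ Vt
    eigvals = s ** 2 / (Xc.shape[0] - 1)                 # eigenvalues of the covariance matrix
    cum = np.cumsum(eigvals) / eigvals.sum()             # cumulative proportion of variation (PTV)
    r = int(np.searchsorted(cum, ptv) + 1)               # smallest r with PTV >= threshold
    V = Vt[:r].T                                         # principal axes (right singular vectors)
    Y = Xc @ V                                           # data projected into the r-dimensional space
    return Y, V, eigvals, r
```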

Since each PC is orthogonal, each component independently accounts for data variability and the Percent of Total Variation Explained (PTV) is cumulative. PCA offers as many principal components as variables in order to explain all variability in the data. However, only a subset of these principal components is notably informative. Since variability is shifted into the leading PCs, many of the remaining PCs account for little variation and can be disregarded to retain maximal variability with reduced dimensionality. For example, if 99% of the total variation should be retained in the model for d-dimensional data, the first r principal components should be kept such that

PTV = (∑_{k=1}^{r} λk) / (∑_{k=1}^{d} λk) ≥ 0.99

PTV acts as the signal-to-noise ratio, which flattens with additional components. Typically, the number of informative components r is chosen using one of three methods: (1) Kaiser's eigenvalue > 1; (2) Cattell's scree plot; or (3) Bartlett's test of sphericity. The amount of variation in redundant data decreases from the first principal component onwards. There are several methods to compute the cut-off value for retaining a sufficient number of principal components out of all p components. In this work, we use Cattell's scree plot, which plots the eigenvalues in decreasing order [20]. The number of principal components to be kept is determined by the elbow where the curve becomes asymptotic and additional components provide little information. Another method is the Kaiser criterion [21], which retains only factors with eigenvalues > 1.

2.4. Vector Quantization

Vector quantization (VQ) has historically been used in signal representation to produce feature vector sequences. One of the applications of K-Means is vector quantization, thus information theory terminology used in VQ is commonly applied. For example, the "code book" represents the set of cluster centroids and "code words" represent the individual cluster centroids. The codebook maps the cluster indexes, also known as "codes," to the centroids. A basic VQ can be achieved using K-Means clustering with the goal of finding an encoding of vectors which minimizes the expected distortion.

2.4.1. Normalization

Before applying VQ on the data matrix resulting from the PCA, the normalization process is applied on the original data matrix preceding the PCA steps. Normalization is helpful in generating effective results as it standardizes the features of the dataset by giving them equal weights. In doing so, noisy or redundant objects will be removed, resulting in a dataset which is more reliable and viable, which improves the accuracy. Normalization can be performed using several methods such as Min-Max, Z-Score and Decimal Scaling, to name a few. In a dataset such as the one used in this work, where there is a high degree of variation among attributes, the use of another type of normalization known as log-normalization is very convenient [22]. The notation and steps for applying this normalization are described below.

Notation:
xij — the initial value in row i and column j of the data matrix
bij — the adjusted value which replaces xij

The transformation below is a generalized procedure that (a) tends to preserve the original order of magnitudes in the data and (b) results in values of zero when the initial value was zero.
Given:
Min(x) is the smallest non-zero value in the data
Int(x) is a function that truncates x to an integer by dropping digits after the decimal point
c = order of magnitude constant = Int(log(Min(x)))
d = decimal constant = log^(-1)(c)
Then the transformation is

bij = log(xij + d) − c
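A minimal sketch of this log-normalization, assuming the data matrix is a non-negative NumPy array and base-10 logarithms; the constants c and d follow the definitions above.

```python
import numpy as np

def log_normalize(X):
    """Apply b_ij = log10(x_ij + d) - c, with c and d derived from the smallest non-zero value."""
    min_nonzero = X[X > 0].min()
    c = np.trunc(np.log10(min_nonzero))   # order of magnitude constant, Int(log(Min(x)))
    d = 10.0 ** c                          # decimal constant, log^-1(c)
    return np.log10(X + d) - c
```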

A small number must be added to all data points if the dataset contains zeros before applying the log-transformation. For a data set where the smallest non-zero value is 1, the above transformation simplifies to

bij = log(xij + 1)

2.4.2. Clustering

Once the PCA procedure is applied to the initial data, it results in the mapping of the data to a new feature space using the principal components. In the newly constructed feature space, VQ (clustering) is achieved by applying the K-Means algorithm. The K-Means objective function [23]:

• Let µ1, ..., µK be the K cluster centroids (means).
• Let rnk ∈ {0, 1} denote whether point xn belongs to cluster k.
• It minimizes the total sum of distances of each point from its cluster center (the total distortion):

J(µ, r) = ∑_{n=1}^{N} ∑_{k=1}^{K} rnk ||xn − µk||²

The steps followed by this algorithm are as follows:
1. Input: N examples {x1, x2, ..., xN} (xn ∈ R^D).
2. Initialization: K cluster centers µ1, ..., µK. K can be initialized:
   • randomly anywhere in R^D, or
   • by randomly taking any K examples as the initial cluster centers.
3. Iteration:
   • Assign each example xn to its closest cluster center: Ck = {n : k = arg min_k ||xn − µk||²} (Ck corresponds to the set of examples closest to µk).
   • Re-calculate the new cluster centers: µk = (1/|Ck|) ∑_{n∈Ck} xn (the mean/centroid of the set Ck).
   • Repeat until convergence is achieved; a simple convergence criterion is that the cluster centers do not move anymore.
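The iteration above can be written directly in NumPy. This is a bare-bones sketch (random initialization from K examples and a fixed iteration cap), not the exact implementation used in the paper.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-Means: assign points to nearest centroid, recompute centroids, repeat."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]         # init centers from K examples
    for _ in range(n_iter):
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances to all centers
        labels = d.argmin(axis=1)                             # assign each point to its closest center
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])                # recompute centroids (keep empty ones)
        if np.allclose(new_mu, mu):                           # convergence: centers stop moving
            break
        mu = new_mu
    return labels, mu
```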

In a nutshell, K-Means clustering is an algorithm which attempts to find groups in a given set of observations. Each element of the vector µ refers to the sample mean of its corresponding cluster, x refers to each of the examples and C contains the assigned cluster labels [24]. The optimal number of clusters is determined using the Elbow method, which is among the many different heuristics for choosing a suitable K. The algorithm is run for different values of K and the heterogeneity is plotted. In general, this measurement decreases as K increases, since the size of the clusters decreases. The point where this measurement starts to flatten out (the elbow of the plot) corresponds to the optimal value of K [25].

2.5. Design of Layered Hidden Markov Model (LHMM) Based Intrusion Detection System (IDS)

This section discusses the novel layered HMM which is designed to detect multi-stage attacks. This layering technique can be further extended beyond the specific structure (Figure 3) which is simulated and discussed in this work.

2.5.1. Problem Statement Summary and Preliminaries

(1) The problem space is decomposed into two separate layers of HMMs.
(2) Each layer constitutes two levels: the observation data is used to train the HMMs and estimate the model's parameters at the first level of each layer, and those parameters are used to find the most probable sequence of hidden states at the second level of the same layer.
(3) The probable state sequences from each of the HMMs are used to construct the training data at Layer 2. This data is used for training the upper-layer HMM, which is able to use the information from the lower-layer HMMs to learn new patterns which are not necessarily recognized by the lower-layer HMMs.

2.5.2. Model Structure of HMM

An HMM is a doubly stochastic process. In other words, it represents two related stochastic processes, with the underlying stochastic process not necessarily observable but observable through another set of stochastic processes that produces the sequence of observations [26,27]. A typical notation for a discrete-observation HMM is:

T = observation sequence length
N = number of states in the model
M = number of distinct observation symbols per state
Q = {q1, q2, ..., qN} = distinct "hidden" states of the Markov process
V = {v1, v2, ..., vM} = set of observation symbols per state
S = {s1, s2, ..., sN} = the individual states

It is specified by a set of parameters (A, B, Π) and each of the parameters is described below. At time t, ot and qt denote the observation and state symbols, respectively.

1. The prior (initial state) distribution Π = {Πi}, where Πi = P(q1 = si) are the probabilities of si being the first state in a state sequence.
2. The state transition probability matrix A = {aij}, where aij = P(qt+1 = sj | qt = si) is the probability of going from state si to state sj.
3. The observation (emission) probability distribution B = {bi(k)}, where bi(k) = P(ot = vk | qt = si) is the probability of observing symbol vk given qt = si.

Conventionally, the HMM model is represented by λ = (A, B, Π). Given an HMM model, there are three problems to solve. One of them, also known as model training, is adjusting the model parameters to maximize the probability of the observations given a particular model; this is achieved using the Baum-Welch algorithm, which is a type of Expectation Maximization (EM) [28,29]. This procedure computes the maximum-likelihood estimates (local maxima) of the HMM model parameters (A, B, Π) using the forward and backward algorithms. In other words, for HMM models λ1, λ2, ..., λn and a given sequence of observations O = o1, o2, ..., ot, we choose λ = (A, B, Π) such that P(O|λi), i = 1, 2, ..., n, is locally maximized.
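For concreteness, this train-then-decode pattern (Baum-Welch training followed by Viterbi decoding) can be sketched with the hmmlearn library on a sequence of discrete VQ codewords. The library choice, the toy observation sequence and the number of states are illustrative assumptions, not details from the paper; in hmmlearn versions before 0.3 the discrete-emission class is MultinomialHMM rather than CategoricalHMM.

```python
import numpy as np
from hmmlearn import hmm

# Toy sequence of discrete VQ codewords; hmmlearn expects shape (n_samples, 1).
obs = np.array([0, 2, 1, 3, 2, 0, 1, 1, 3, 2]).reshape(-1, 1)

model = hmm.CategoricalHMM(n_components=3, n_iter=100, random_state=0)  # N hidden states
model.fit(obs)                # Baum-Welch (EM) estimation of (A, B, Pi)
states = model.predict(obs)   # Viterbi: most probable hidden state sequence
```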

2.5.3. Model Structure of Two-Layered HMM (LHMM)

1. At Layer 1, we have HMM1, HMM2, ..., HMMp with their corresponding numbers of hidden states S1, S2, ..., Sp.

2. Considering the same time granularity (t = T) of each of the HMMs:



• The observation sequences for each of the HMMs are given as: O1T = {O11, O12, ..., O1T}, O2T = {O21, O22, ..., O2T}, ..., OpT = {Op1, Op2, ..., OpT}.



• The probable sequences of states for each of the HMMs are given as: Q1T = {q11, q12, ..., q1T}, Q2T = {q21, q22, ..., q2T}, ..., QpT = {qp1, qp2, ..., qpT}.

3. A new feature vector is constructed from the Layer 1 HMMs' probable sequences of states. This statistical feature can be considered as a new data matrix on which VQ can be applied, and a new sequence of observations will be created for the Layer 2 HMM. The feature vector is constructed as follows:

fi = (qi1, qi2, ..., qiT)^T, ∀i = 1, 2, ..., p    (5)

F = (f1, f2, ..., fj), ∀j = 1, 2, ..., p    (6)
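A sketch of how the matrix F of Equations (5) and (6) could be assembled from the Layer 1 models, assuming each model exposes a Viterbi predict method (as in hmmlearn) and all decoded sequences share the same length T; the function and variable names are illustrative.

```python
import numpy as np

def build_layer2_features(layer1_models, layer1_observations):
    """Stack each Layer-1 HMM's Viterbi state sequence (q_i^1 ... q_i^T) as column f_i of F."""
    columns = [m.predict(o) for m, o in zip(layer1_models, layer1_observations)]
    return np.column_stack(columns)   # F has shape (T, p), per Equations (5) and (6)
```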

[Figure 3 layout: the Layer 1 HMMs — HMM1 (A, B, Π) with O1T = {O11, O12, ..., O1T}, HMM2 (A, B, Π) with O2T = {O21, O22, ..., O2T}, ..., HMMp (A, B, Π) with OpT = {Op1, Op2, ..., OpT} — feed the feature matrix F, which is vector quantized (VQ) and used as the observation sequence of the Layer 2 HMM (A, B, Π) to produce the estimated states.]

Figure 3. Multi-Layered HMM (LHMM) model structure.

2.5.4. Learning in LHMM

The HMMs at Layer 1 and the HMM at Layer 2 are trained independently. Every HMM in each layer constitutes two levels, where the first level determines the model parameters and the second level finds the most probable sequence of states. At the first level of Layer 1, given the discrete-time sequence of observations, the Baum-Welch algorithm is used for training. At the second level of Layer 1 and of Layer 2, the Viterbi algorithm is used for finding the probable state sequences based on the parameters learned at the first level of the same layer.

1. Learning at Layer 1
• The Vector Quantization technique using K-Means clustering is applied on the training dataset.
• The Baum-Welch algorithm, an Expectation Maximization (EM) method, is used to compute the maximum log-likelihood estimates of the HMM model parameters (A, B, Π).

2. Learning at Layer 2
• The Vector Quantization technique using K-Means clustering, as defined in Section 2.4.2, is applied on the training dataset. Here the training dataset corresponds to the matrix F in Equation (6).
• As we have a single HMM at Layer 2, the Baum-Welch method is used to compute the maximum log-likelihood estimates of the HMM model parameters (A, B, Π).
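Putting the pieces together, the layered training described above might look like the following sketch, using scikit-learn's KMeans for VQ and hmmlearn for Baum-Welch/Viterbi. The variables http_pca and ssh_pca stand for the PCA-projected feature matrices from Section 2.3 (assumed to have the same number of rows T), and all cluster and state counts are placeholders rather than the paper's settings.

```python
import numpy as np
from hmmlearn import hmm
from sklearn.cluster import KMeans

def train_hmm_on_codes(codes, n_states):
    """First level: Baum-Welch on a discrete codeword sequence; second level: Viterbi decoding."""
    model = hmm.CategoricalHMM(n_components=n_states, n_iter=100, random_state=0)
    model.fit(codes.reshape(-1, 1))
    return model, model.predict(codes.reshape(-1, 1))

# Layer 1: HTTP and SSH traffic, each vector quantized into codewords by K-Means.
http_codes = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(http_pca)
ssh_codes = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(ssh_pca)
http_hmm, http_states = train_hmm_on_codes(http_codes, n_states=3)
ssh_hmm, ssh_states = train_hmm_on_codes(ssh_codes, n_states=3)

# Layer 2: the feature matrix F of Layer-1 state sequences is itself vector quantized,
# and the resulting codeword sequence trains the single upper-layer HMM.
F = np.column_stack([http_states, ssh_states])
layer2_codes = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(F)
layer2_hmm, layer2_states = train_hmm_on_codes(layer2_codes, n_states=3)
```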

For the simulation of the LHMM, the two HMMs considered at the lower layer are HTTP and SSH traffic.

3. Results

This section discusses the results of the data processing for the modeled IDS: the dataset analysis using PCA for dimension reduction, K-Means clustering for vector quantization and, finally, the results of the LHMM.

3.1. Data Processing Results

1. HTTP Traffic PCA Analysis

The results of the HTTP traffic PCA analysis are demonstrated below. Figure 4 shows the scree plot of the dimensions with respect to the percentage of explained variance and the eigenvalues. The elbow method can be used to determine the number of dimensions to be retained. Equivalently, the cumulative percent of variance with respect to the number of dimensions, as shown in Table 1, can be used to determine the number of dimensions to be retained.

Figure 4. (a) Scree plot of the percentage of explained variance; (b) Scree plot of eigenvalues.

Table 1. Principal components with their variance contribution.

Dim.     Eigenvalue          Percent of Variance    Cumulative Percent of Variance
Dim.1    4.762758 × 10^2     5.826067 × 10^1        58.26067
Dim.2    2.011200 × 10^2     2.460210 × 10^1        82.86276
Dim.3    7.160326 × 10^1     8.758904               91.62167
Dim.4    2.027640 × 10^1     2.480320               94.10199
Dim.5    1.437619 × 10^1     1.758575               95.86056
Dim.6    1.010701 × 10^1     1.236345               97.09691
Dim.7    4.111222            5.029072 × 10^−1       97.59981
Dim.8    4.057777            4.963695 × 10^−1       98.09618
Dim.9    2.812110            3.439927 × 10^−1       98.44018
Dim.10   2.672606            3.269278 × 10^−1       98.7671
Dim.11   1.433143            1.753099 × 10^−1       98.94241
Dim.12   1.283170            1.569643 × 10^−1       99.09938
Dim.13   1.226318            1.500099 × 10^−1       99.24939
Dim.14   9.864789 × 10^−1    1.206715 × 10^−1       99.37006
Dim.15   8.110305 × 10^−1    9.920970 × 10^−2       99.46927
Dim.16   7.242188 × 10^−1    8.859041 × 10^−2       99.55786
Dim.17   6.931104 × 10^−1    8.478506 × 10^−2       99.64264
Dim.18   6.044243 × 10^−1    7.393649 × 10^−2       99.71658
Dim.19   4.243713 × 10^−1    5.191142 × 10^−2       99.76849

Table 2. The selected 8 Principal components and head (6 of the original features only displayed here).

Feature              PC1                 PC2                PC3                PC4                 PC5                 PC6                 PC7                 PC8
Flow_ID              7.955283 × 10^−5    1.457065 × 10^−2   1.644281 × 10^−3   3.287887 × 10^−1    5.998037 × 10^−2    9.81543 × 10^−2     2.131072 × 10^−1    3.108175 × 10^−1
Source_IP            7.387470 × 10^−3    4.617471 × 10^−2   4.837233 × 10^−3   2.641242 × 10^−1    2.070467 × 10^−2    2.057702 × 10^−2    2.77837 × 10^−30    6.927052 × 10^−1
Destination_IP       2.773583 × 10^−7    1.824271 × 10^−2   7.112974 × 10^−3   4.124236 × 10^−1    2.091758 × 10^−4    5.842059 × 10^−2    1.562471 × 10^−1    2.848223 × 10^−1
Destination_Port     1.232595 × 10^−30   1.232595 × 10^−3   5.772448 × 10^−3   1.774937 × 10^−28   8.493351 × 10^−30   2.097338 × 10^−29   1.203706 × 10^−29   4.930381 × 10^−30
Flow.Duration        1.309048            4.731603           2.933845 × 10^−4   3.366436 × 10^−1    4.514761            5.269066            9.913239            6.871774 × 10^−1
Total.Fwd.Packets    8.387569 × 10^−2    2.23493 × 10^−3    2.357599 × 10^−2   3.299392 × 10^−3    3.453238 × 10^−2    1.038723 × 10^−1    6.913426 × 10^−1    4.477584

For the HTTP traffic, 8 principal components, which correspond to 98.09618% of the explained variance, are selected from Table 1. Those selected 8 PCs and the first 6 (head) out of the total number of features are shown in Table 2. Each PC is constructed from a linear combination of the total features, with their multiplying coefficients specified in the table.

2. SSH Traffic PCA Analysis

The results of the SSH traffic PCA analysis are demonstrated below. Figure 5 shows the scree plot of the dimensions with respect to the percentage of explained variance and the eigenvalues. The elbow method can be used to determine the number of dimensions to retain. Equivalently, the cumulative percent of variance with respect to the number of dimensions, as shown in Table 3, can be used to determine the number of dimensions to be retained.

(a)

(b)

Figure 5. (a) Scree Plot of the percentage percentage of of explained explained variance; variance; (b) (b) Scree Scree plot plot of of eigenvalues. eigenvalues. Table Table 3. 3. Principal Principal components components with with their their variance variancecontribution. contribution. Eigenvalue

Percentof of Variance Cumulative Cumulative Percent Percent Percent of of Variance 1 Dim.1 1.11834 × 10 8.53962 × 10 Variance Variance85.39622 13 1 Dim.2 Dim.1 1.11834 1.77023 × ×10 1014 1.35175 8.53962 × 101 × 10 85.3962298.91367 11 Dim.3 6.84686 × 1013 5.22826 1× 10−1 99.4365 Dim.2 1.77023 × 10 11 1.35175 × 10 98.91367 Dim.4 4.27484 × 10 3.26427 × 10−1 99.76292 11 5.22826 × 10−1× 10−1 99.4365 99.95681 Dim.5 Dim.3 6.84686 2.53906 × ×10 1011 1.93883 11 −1 10 − 2 3.26427 × 10 × 10 99.7629299.97848 Dim.6 Dim.4 4.27484 2.83901 × ×10 10 2.16787 10 Dim.7 Dim.5 2.53906 1.66896 × ×10 1011 1.27442 1.93883 × 10−1× 10−2 99.9568199.99123 −3 Dim.8 Dim.6 2.83901 6.49554 ××10 10109 4.95999 2.16787 × 10−2× 10 99.9784899.99619 −3 Dim.9 Dim.7 1.66896 3.08202 ××10 10109 2.35343 × 10 −2 1.27442 × 10 99.9912399.99854 Dim.10 1.31229 × 1099 1.00207 −3× 10−3 99.99954 Dim.8 6.49554 × 10 4.95999 × 10 99.99619 99.9998 Dim.11 3.37909 × 108 2.58027 × 10−4 9 3.08202 2.35343 × 10−3× 10−5 99.9985499.99989 Dim.12Dim.9 1.19652 ××10 108 9.13658 7 − 5 9 −3 1.00207 × 10 × 10 99.9995499.99993 Dim.13Dim.10 1.31229 5.03480 ××10 10 3.84457 Dim.14Dim.11 3.37909 3.36857 ××10 1087 2.57224 2.58027 × 10−4× 10−5 99.9998 99.99996 −5 Dim.15Dim.12 1.19652 2.31915 ××10 1087 1.77090 × 10 −5 9.13658 × 10 99.9998999.99998 Dim.16 1.72495 × 1077 1.31717 −5× 10−5 99.99999 Dim.13 5.03480 × 10 7 3.84457 × 10 99.99993 Dim.17 1.00092 × 10 7.64299 × 10−6 100 Dim.14 3.36857 × 107 2.57224 × 10−5 99.99996 Dim.15 2.31915 × 107 1.77090 × 10−5 99.99998 For theDim.16 SSH traffic, 4 principal components which correspond to 99.76292% of the explained 7 −5 1.72495 × 10 1.31717 × 10 99.99999 variance areDim.17 selected from Table three× dimensions of the PCA100 retains slightly over 99% of 1.00092 × 103.7 The first 7.64299 10−6 the total variance (i.e., information) contained in the data. Those selected 3 PCs and the first 6 (head) out of the total number of features are shown in Table 4.


For the SSH traffic, 4 principal components, which correspond to 99.76292% of the explained variance, are selected from Table 3. The first three dimensions of the PCA retain slightly over 99% of the total variance (i.e., information) contained in the data. Those selected 4 PCs and the first 6 (head) out of the total number of features are shown in Table 4; a brief sketch of this component-selection step follows the table.

Table 4. The selected 4 Principal components and head (6 of the original features only displayed).

                      PC1                PC2                PC3                PC4
Flow_ID               5.20233 × 10^-10   9.18848 × 10^-9    4.46286 × 10^-8    9.02167 × 10^-8
Source_IP             6.00175 × 10^-13   1.93802 × 10^-12   1.48850 × 10^-11   1.77674 × 10^-11
Destination_IP        1.23260 × 10^-30   0.00000            1.23260 × 10^-30   1.01882 × 10^-29
Destination_Port      0.00000            0.00000            0.00000            1.23260 × 10^-30
Flow.Duration         3.52624 × 10^1     7.10929 × 10^-1    8.77584 × 10^-1    7.02152 × 10^-2
Total.Fwd.Packets     1.22041 × 10^-11   5.02931 × 10^-11   6.34210 × 10^-9    4.41869 × 10^-10
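To make the component-selection step concrete, the following minimal sketch (illustrative only; the variable names, the scikit-learn usage, and the 99% cutoff are assumptions, not the exact code of this work) computes the cumulative explained variance of the principal components and keeps the smallest number of components that reaches the cutoff.

    import numpy as np
    from sklearn.decomposition import PCA

    def select_components(X, threshold=0.99):
        """Fit PCA and keep the fewest components whose cumulative
        explained variance ratio reaches the threshold."""
        pca = PCA()                         # keep all components initially
        scores = pca.fit_transform(X)       # project flows onto the PCs
        cumvar = np.cumsum(pca.explained_variance_ratio_)
        k = int(np.searchsorted(cumvar, threshold) + 1)
        return scores[:, :k], cumvar

    # Example usage (X: rows = flows, columns = numeric flow features):
    # reduced, cumvar = select_components(X, threshold=0.99)
    # print(cumvar[:4])   # compare with the cumulative percentages in Table 3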

Vector Quantization

The goal of the vector quantization is to simplify the dataset from a complex higher dimensional space into a lower dimensional space so that it is easier to visualize and find patterns. In this work, it is achieved by using a very common unsupervised machine learning algorithm, K-Means clustering.

To determine the number of clusters (K) in K-Means, the simplest method involves plotting the number of clusters against the within groups sum of squares and picking the 'elbow' point in this plot. This is similar in concept to the scree plot for the PCA discussed in Section 2.3.

1. HTTP Clustering Analysis

K-Means clustering is applied on the HTTP traffic after PCA, and the number of clusters is determined where the elbow occurs in Figure 6, which is K = 4. The plot shows the within cluster sum of squares (wcss) as the number of clusters (K) varies; an illustrative sketch of this elbow analysis follows the figure.

Figure 6. The wcss vs. number of clusters for HTTP traffic.
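As an illustration of this elbow analysis, the sketch below (a hypothetical example assuming the PCA-reduced HTTP flows are held in an array named reduced) computes the within-cluster sum of squares for a range of K; in scikit-learn this quantity is exposed as the inertia_ attribute of a fitted KMeans model.

    import numpy as np
    from sklearn.cluster import KMeans

    def wcss_curve(data, k_max=10, seed=0):
        """Within-cluster sum of squares for K = 1..k_max."""
        wcss = []
        for k in range(1, k_max + 1):
            km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
            wcss.append(km.inertia_)   # sum of squared distances to the nearest centroid
        return np.array(wcss)

    # Example usage with the PCA-reduced HTTP flows:
    # curve = wcss_curve(reduced, k_max=10)
    # The elbow of this curve (K = 4 for the HTTP traffic) fixes the codebook size;
    # the labels_ of the chosen KMeans model then serve as discrete observation symbols.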

2. SSH Clustering Analysis

On the SSH traffic, K-Means is applied after PCA analysis, similar to the HTTP traffic above, and the number of clusters where the elbow occurs, K = 3, is selected from Figure 7.

Figure 7. The wcss vs. number of clusters for SSH traffic.


3.2. Layered HMM Experimental Results

For the simulation of the LHMM, the two HMMs considered at the lower layer are HTTP and SSH traffic.

3.2.1. Training the LHMM

The lower layer HMMs are trained using their corresponding training data, and the optimized model parameters are determined using the Baum-Welch algorithm.

1. HTTP HMM training

The HMM model parameters (A, B, π) after training are as shown below.

A = \begin{bmatrix} 0.9827 & 0.0173 \\ 0.0140 & 0.9860 \end{bmatrix}

B = \begin{bmatrix} 0.3088 & 0.0973 & 0.2007 & 0.3932 \\ 0.0000 & 0.8129 & 0.0952 & 0.0919 \end{bmatrix}

π = \begin{bmatrix} 1 \\ 0 \end{bmatrix}

The hidden state symbols of the HTTP training traffic are shown in Table 5. The corresponding state symbol sequence is plotted against the HTTP training data in Figure 8; an illustrative training sketch follows the figure.

Table 5. Hidden State Symbols of the HTTP traffic.

State Symbols    HTTP
1                HTTP-BENIGN
2                HTTP-Web-attack-bruteforce

Figure 8. State symbols against the time series HTTP training data.
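For reference, training such a discrete HMM can be sketched with the hmmlearn package, which implements Baum-Welch (EM) re-estimation; this is an illustrative outline, not the exact code of this work, and it assumes the K-Means cluster labels form the observation sequence and that a recent hmmlearn release providing CategoricalHMM is available (older releases expose the same discrete model as MultinomialHMM).

    import numpy as np
    from hmmlearn import hmm

    def train_discrete_hmm(obs, n_states=2, seed=0):
        """Baum-Welch training of a discrete-emission HMM on integer symbols."""
        model = hmm.CategoricalHMM(n_components=n_states, n_iter=100,
                                   random_state=seed)
        # hmmlearn expects a column of integer symbols, shape (n_samples, 1)
        model.fit(np.asarray(obs).reshape(-1, 1))
        return model

    # Example usage (http_labels: K-Means cluster labels of the HTTP flows):
    # http_model = train_discrete_hmm(http_labels, n_states=2)
    # print(http_model.transmat_)       # compare with A above
    # print(http_model.emissionprob_)   # compare with B above
    # print(http_model.startprob_)      # compare with π above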

2. Training the SSH HMM

The HMM model parameters (A, B, π) after training were found to be

A = \begin{bmatrix} 0.9772 & 0.0228 \\ 0.0308 & 0.9692 \end{bmatrix}

B = \begin{bmatrix} 0.6135 & 0.1518 & 0.2348 \\ 0.5092 & 0.2482 & 0.2427 \end{bmatrix}

π = \begin{bmatrix} 1 \\ 0 \end{bmatrix}


The hidden state symbols of the SSH training traffic are shown in Table 6. The corresponding state symbol sequence is plotted against the SSH training data in Figure 9.

Table 6. Hidden State Symbols of the SSH traffic.

State Symbols    SSH
1                SSH-BENIGN
2                SSH-Patator

Figure 9. State symbols against the time series SSH training data.

3. Training Upper Layer HMM

Following the training of the lower layer HMMs, the model parameters (A, B, π) of the upper layer HMM after training were found to be

A = \begin{bmatrix}
0.9860 & 0.0000 & 0.0000 & 0.0000 & 0.0140 \\
0.0011 & 0.8962 & 0.0026 & 0.0988 & 0.0014 \\
0.0000 & 0.2557 & 0.2870 & 0.4240 & 0.0334 \\
0.0255 & 0.7642 & 0.0024 & 0.1401 & 0.0678 \\
0.0568 & 0.0105 & 0.0684 & 0.0007 & 0.8636
\end{bmatrix}

B = \begin{bmatrix}
1.0000 & 0.0000 & 0.0000 & 0.0000 \\
0.0000 & 1.0000 & 0.0000 & 0.0000 \\
0.9996 & 0.0000 & 0.0004 & 0.0000 \\
1.0000 & 0.0000 & 0.0000 & 0.0000 \\
0.0000 & 0.0000 & 1.0000 & 0.0000
\end{bmatrix}

π = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}

The hidden states in the upper layer HMM training traffic are shown in Table 7. The corresponding state symbol sequence is plotted against the upper layer HMM training data in Figure 10; a sketch of how the upper-layer observation sequence is assembled from the lower-layer outputs follows the figure.

Table 7. Hidden State Symbols of the Upper layer in the training data.

State Symbols    HTTP                          SSH
1                HTTP-BENIGN                   SSH-BENIGN
2                HTTP-Web-attack-bruteforce    SSH-Patator
3                HTTP-Web-attack-bruteforce    SSH-BENIGN
4                HTTP-Web-attack-bruteforce    SSH-Patator
5                HTTP-BENIGN                   SSH-Patator


Figure 10. State symbols against the time series Upper training data.
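One possible way to assemble the upper-layer observation sequence from the lower-layer models is sketched below (an assumption for illustration; the function and variable names are hypothetical): the decoded state symbols of the HTTP and SSH HMMs over a common sequence of time windows are paired, and each distinct pair is mapped to a single discrete symbol on which the upper-layer HMM is trained in the same way as the lower layers.

    import numpy as np

    def combine_lower_outputs(http_states, ssh_states, n_ssh_states=2):
        """Map pairs of lower-layer state symbols to one upper-layer observation.

        http_states, ssh_states: equal-length 0-based integer sequences obtained
        by decoding the HTTP and SSH HMMs over the same time windows.
        """
        http_states = np.asarray(http_states)
        ssh_states = np.asarray(ssh_states)
        # Pair (h, s) -> symbol h * n_ssh_states + s, giving 4 symbols here.
        return http_states * n_ssh_states + ssh_states

    # Example usage:
    # upper_obs = combine_lower_outputs(http_decoded, ssh_decoded)
    # The upper-layer HMM (5 hidden states) is then trained on upper_obs with the
    # same discrete-HMM training routine sketched earlier.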

3.2.2. Testing the LHMM

The sequence of network states during testing is determined using the Viterbi algorithm, which takes as input the model parameters determined during the training phase; an illustrative decoder sketch follows Figure 11.

1. Testing HTTP HMM

During testing, the hidden states of the HTTP HMM shown in Table 8 are similar to the training phase hidden states. The corresponding state symbol sequences are plotted with the HTTP testing data in Figure 11.

Table 8. Hidden State Symbols of the HTTP traffic during testing.

State Symbols    HTTP
1                HTTP-BENIGN
2                HTTP-Web-attack-bruteforce

Figure 11. State symbols against the time series HTTP test data.
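For completeness, a minimal log-space Viterbi decoder is sketched below (an illustrative implementation, not the exact code used in this work); it takes the trained parameters (A, B, π) and an integer observation sequence and returns the most likely hidden state path. With hmmlearn, the equivalent call would be model.predict(obs.reshape(-1, 1)).

    import numpy as np

    def viterbi(A, B, pi, obs):
        """Most likely state path for integer observations obs given (A, B, pi)."""
        A, B, pi = np.log(A + 1e-12), np.log(B + 1e-12), np.log(pi + 1e-12)
        n_states, T = A.shape[0], len(obs)
        delta = np.zeros((T, n_states))            # best log-probability per state
        psi = np.zeros((T, n_states), dtype=int)   # back-pointers
        delta[0] = pi + B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + A     # (from_state, to_state)
            psi[t] = np.argmax(scores, axis=0)
            delta[t] = np.max(scores, axis=0) + B[:, obs[t]]
        path = np.zeros(T, dtype=int)
        path[-1] = np.argmax(delta[-1])
        for t in range(T - 2, -1, -1):             # backtrack
            path[t] = psi[t + 1, path[t + 1]]
        return path

    # Example usage with the trained HTTP parameters and the HTTP test observations:
    # states = viterbi(A_http, B_http, pi_http, http_test_obs)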

2. Testing SSH HMM

Similarly, the SSH HMM hidden state symbols are the same as the SSH training data hidden states, as shown in Table 9. These state symbols are plotted against the time series of the testing data, as shown in Figure 12.

Table 9. Hidden State Symbols of the SSH traffic in testing data.

State Symbols    SSH
1                SSH-BENIGN
2                SSH-Patator

Figure 12. State symbols against the time series SSH test data.

3. Testing Upper Layer HMM

The Upper layer HMM testing hidden states are shown in Table 10 and constitute the hidden states of the lower layer HMMs. The final results shown in Figure 13 prove the validity of the multi-layer HMM in determining the hidden states within the IDS detection engine.

Table 10. Hidden State Symbols of the Upper layer testing data.

State Symbols    HTTP                          SSH
1                HTTP-BENIGN                   SSH-BENIGN
2                HTTP-Web-attack-bruteforce    SSH-Patator
3                HTTP-Web-attack-bruteforce    SSH-BENIGN
4                HTTP-Web-attack-bruteforce    SSH-Patator
5                HTTP-BENIGN                   SSH-Patator

Figure 13. State symbols against the time series Upper test data.


4. Discussion

A high detection rate is essential in a machine learning based IDS, alongside the evaluation metrics mentioned above. The main aspects to consider when measuring the accuracy are:

•  True Positive (TP): Number of intrusions correctly detected.
•  True Negative (TN): Number of non-intrusions correctly detected.
•  False Positive (FP): Number of non-intrusions incorrectly detected as intrusions.
•  False Negative (FN): Number of intrusions that are missed (incorrectly classified as non-intrusions).

The performance of the LHMM can be evaluated by calculating the common performance measures: Accuracy, Sensitivity, Specificity, Precision, Recall and F-Measure. These metrics are among those commonly used for evaluating IDS performance [30–32]; a brief computation sketch follows the definitions below.

a) Accuracy: the proportion of true results (both true negatives and true positives) with respect to the total number of cases.

Accuracy = \frac{tp + tn}{tp + tn + fp + fn}

b) Precision: the fraction of the states which were classified as the interesting state (the attack state in this case) that really are that state.

Precision = \frac{tp}{tp + fp}

c) Recall: the fraction of the interesting states that were correctly predicted as such. It is also referred to as Sensitivity.

Recall = \frac{tp}{tp + fn}

d) F-measure: the harmonic mean of precision and recall, which combines the two into a single score.

F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}
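The four measures above reduce to a few lines of code once the confusion counts are available; the sketch below is a hypothetical helper (the names are illustrative) that computes them from true and predicted label sequences, with the attack class encoded as 1.

    import numpy as np

    def ids_metrics(y_true, y_pred):
        """Accuracy, precision, recall and F1 with the attack class as positive (1)."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return accuracy, precision, recall, f1

    # Example usage:
    # acc, prec, rec, f1 = ids_metrics(true_states, decoded_states)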

The performance of the LHMM for the above test is as follows:

[Accuracy, Precision, Recall, F-measure] = [0.9898, 0.9793, 1.0000, 0.9895]

Additionally, the CPU consumption, the throughput and the power consumption are important metrics for the evaluation of intrusion detection systems running on different hardware in specific settings, such as high-speed networks or hardware with limited resources. In this work, however, the performance of the developed IDS is measured solely with the above metrics.

As compared to a single-layer HMM, the layered approach has several advantages: (1) A single-layer HMM may have to be trained over a large observation space. In this case, the model can be over-fitted when not enough training data is used. As the observation space increases, the amount of data needed to train the model well also increases; as a result, it incurs what is commonly referred to as the curse of dimensionality. On the contrary, the layers in this approach are trained over small-dimensional observation spaces, which results in more stable models and does not require a large amount of training data. (2) The lower layer HMMs are defined and trained with their corresponding data as needed. (3) The second-layer HMM is less sensitive to variations in the lower layer features, as its observations are the outputs of the lower layer HMMs, which are expected to be well trained. (4) The two layers (lower and upper) are expected to be well trained independently. Thus, we can explore different HMM combination systems. In particular, we can replace the first-layer


HMMs with models that are more suitable for network traffic data sequences, with the goal of gaining a better understanding of the nature of the data being used. The framework is thus easier to improve and interpret individually at each level. (5) The layered framework in general can be expanded to learn new types of network traffic that may be defined in the future, by adding additional HMMs in the lower layers.

The results demonstrate how a Markov model can capture the statistical behavior of a network and determine the presence of attacks and anomalies based on known normal network behaviors gathered from training data. Using the vector quantization method, we are able to include multiple dimensions of information in the model, and this will be helpful in reducing false positives and determining more attack states in the future. The model can be re-trained to identify new network states based on the robustness of the training data. This is a promising approach because it is extensible.

5. Conclusions

This work highlights the potential for the use of a multi-layer HMM-based IDS. PCA using SVD is used for dimension reduction of the CICIDS2017 data after feature selection and feature creation. By applying K-Means clustering as a vector quantization technique on the reduced data, the cluster labels are used as a sequence of observations for an HMM model. The IDS is based on a multi-layer HMM. The proposed approach has been developed and verified to resolve the common flaws in the application of HMM to IDS, commonly referred to as the curse of dimensionality. It factors a huge problem of immense dimensionality into a discrete set of manageable and reliable elements. The multi-layer approach can be expanded beyond 2 layers to capture multi-phase attacks over longer spans of time. A pyramid of HMMs can resolve disparate digital events and signatures across protocols and platforms into actionable information, where lower layers identify discrete events (such as a network scan) and higher layers identify new states which are the result of multi-phase events of the lower layers. The concepts of this approach have been developed but the full potential has not been demonstrated.

Author Contributions: The authors contributed equally to this work.

Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Bace, R.; Mell, P. Intrusion Detection Systems. In NIST Special Publication 800-31; November 2001. Available online: https://csrc.nist.gov/publications/detail/sp/800-31/archive/2001-11-01 (accessed on 7 July 2018).
2. Brown, D.J.; Suckow, B.; Wang, T. A Survey of Intrusion Detection Systems; Department of Computer Science, University of California: San Diego, CA, USA. Available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.408&rep=rep1&type=pdf (accessed on 10 September 2018).
3. Cisco 2018 Annual Cybersecurity Report. Available online: https://www.cisco.com/c/en/us/products/security/security-reports.html (accessed on 14 July 2018).
4. Johns Hopkins University. HMM Profiles for Network Traffic Classification (Extended Abstract). In Towards Better Protocol Identification Using Profile HMMs; JHU Technical Report JHU-SPAR051201; Johns Hopkins University: Baltimore, MD, USA, 2004.
5. Zegeye, W.K.; Moazzami, F.; Dean, R.A. Design of Intrusion Detection System (IDS) Using Hidden Markov Model (HMM). In Proceedings of the International Telemetering Conference, Glendale, AZ, USA, 5–8 November 2018.
6. Ourston, D.; Matzner, S.; Stump, W.; Hopkins, B. Applications of Hidden Markov Models to Detecting Multi-stage Network Attacks. In Proceedings of the 36th Hawaii International Conference on System Sciences, Big Island, HI, USA, 6–9 January 2003; p. 10.
7. Shawly, T.; Elghariani, A.; Kobes, J.; Ghafoor, A. Architectures for Detecting Real-time Multiple Multi-stage Network Attacks Using Hidden Markov Model. arXiv 2018, arXiv:1807.09764.
8. Rust, J. Using Randomization to Break the Curse of Dimensionality. Econometrica 1997, 65, 487–516. [CrossRef]
9. Cherki, D. Decomposition des Problemes de Decision Markoviens. Ph.D. Thesis, Faculty of Science Rabat, Rabat, Morocco, 2002; pp. 64–69.
10. Amjad, M.; Tobi, A.; Duncan, I. KDD 1999 generation faults: A review and analysis. J. Cyber Secur. Technol. 2018. [CrossRef]
11. Hindy, H.; Brosset, D.; Bayne, E.; Seeam, A.; Tachtatzis, C.; Atkinson, R.; Bellekens, X. A Taxonomy and Survey of Intrusion Detection System Design Techniques, Network Threats and Datasets. arXiv 2018, arXiv:1806.03517v1.
12. Sharafaldin, I.; Lashkari, A.A.; Ghorbani, A.A. Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization. In Proceedings of the 4th International Conference on Information Systems Security and Privacy (ICISSP), INSTICC, Funchal, Portugal, 22–24 January 2018; pp. 108–116. [CrossRef]
13. Sharafaldin, I.; Gharib, A.; Lashkari, A.; Ghorbani, A.A. Towards a reliable intrusion detection benchmark dataset. Softw. Netw. 2017, 177–200. [CrossRef]
14. CIC. CICFlowMeter (2017); Canadian Institute for Cybersecurity (CIC): Fredericton, NB, Canada, 2017.
15. Manhart, K. Artificial Intelligence Modelling: Data Driven and Theory Driven Approaches. In Social Science Micro Simulation (Dagstuhl Seminar, May 1995); Springer: Berlin/Heidelberg, Germany, 1996; pp. 416–431.
16. Pivovarov, R.; Elhadad, N. A hybrid knowledge-based and data-driven approach to identifying semantically similar concepts. J. Biomed. Inform. 2012, 45, 471–481. [CrossRef] [PubMed]
17. Wyse, N.; Dubes, R.; Jain, A.K. A critical evaluation of intrinsic dimensionality algorithms. In Pattern Recognition in Practice; Gelsema, E.S., Kanal, L.N., Eds.; North-Holland: Amsterdam, The Netherlands, 1980; pp. 415–425.
18. Smith, L. A Tutorial on Principal Components Analysis. Technical Report OUCS-2002-12; Department of Computer Science, University of Otago: Dunedin, New Zealand. Available online: http://www.cs.otago.ac.nz/research/publications/OUCS-2002-12.pdf (accessed on 11 June 2018).
19. Shlens, J. A Tutorial on Principal Component Analysis; Version 3.02; Google Research: Mountain View, CA, USA, 2014.
20. Cattell, R.B. The scree test for the number of factors. Multivar. Behav. Res. 1966, 1, 245–276. [CrossRef] [PubMed]
21. Kaiser, H.F. The application of electronic computers to factor analysis. Educ. Psychol. Meas. 1960, 20, 141–151. [CrossRef]
22. McCune, B.; Grace, J.B.; Urban, D.L. Analysis of Ecological Communities; MjM Software Design: Gleneden Beach, OR, USA, 2002.
23. Lloyd, S.P. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
24. Alpaydin, E. Introduction to Machine Learning; MIT Press: Cambridge, MA, USA, 2004.
25. Ng, A. The K-Means Clustering Algorithm. CS229: Machine Learning. Available online: http://cs229.stanford.edu/notes/cs229-notes7a.pdf (accessed on 14 September 2018).
26. Rabiner, L.R.; Juang, B. An Introduction to Hidden Markov Models. IEEE ASSP Mag. 1986, 3, 4–16. [CrossRef]
27. Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 1989, 77, 257–286. [CrossRef]
28. Baum, L.; Petrie, T. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat. 1966, 37, 1554–1563. [CrossRef]
29. Bilmes, J. A Gentle Tutorial of the EM Algorithm and Its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models; International Computer Science Institute: Berkeley, CA, USA, 1997.
30. Aminanto, M.E.; Choi, R.; Tanuwidjaja, H.C.; Yoo, P.D.; Kim, K. Deep Abstraction and Weighted Feature Selection for Wi-Fi Impersonation Detection. IEEE Trans. Inf. Forensics Secur. 2018, 13, 621–636. [CrossRef]
31. Atli, B. Anomaly-Based Intrusion Detection by Modeling Probability Distributions of Flow Characteristics. Ph.D. Thesis, Aalto University, Helsinki, Finland, 2017. Available online: http://urn.fi/URN:NBN:fi:aalto-201710307348 (accessed on 14 July 2018).
32. Hodo, E.; Bellekens, X.; Hamilton, A.; Tachtatzis, C.; Atkinson, R. Shallow and deep networks intrusion detection system: A taxonomy and survey. arXiv 2017, arXiv:1701.02145.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
