Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000. Digital Object Identifier 10.1109/ACCESS.2018.2881998
Framework for Mobile Internet of Things Security Monitoring Based on Big Data Processing and Machine Learning

Igor Kotenko 1,2 (Senior Member, IEEE), Igor Saenko 1,2, and Alexander Branitskiy 1
1 St. Petersburg Institute for Informatics and Automation (SPIIRAS), 14-th line, 39, Saint-Petersburg, 199178, Russia
2 St. Petersburg National Research University of Information Technologies, Mechanics and Optics, 49, Kronverkskiy prospekt, Saint-Petersburg, Russia

Corresponding author: Igor Kotenko (e-mail: [email protected]).
This research was partially supported by grants of the RFBR (projects No. 16-29-09482, 18-07-01369 and 18-07-01488), by the budget (project No. АААА-А16-116033110102-5), and by the Government of the Russian Federation (Grant 08-08).
ABSTRACT The paper discusses a new framework, combining the capabilities of Big Data processing and machine learning, developed for security monitoring of the mobile Internet of Things. The mathematical foundations and the problem statement are considered. The description of the data set used and the architecture of the proposed security monitoring framework are provided. The framework incorporates several machine learning mechanisms intended for solving classification tasks. The classifier operation results are subjected to plurality voting, weighted voting and soft voting. The framework performance and accuracy are assessed experimentally.

INDEX TERMS Big Data, machine learning, security monitoring, mobile security, Internet of Things, classifier.
I. INTRODUCTION
Nowadays, extending the concept of the Internet of Things (IoT) to mobile computing devices is a new and promising trend that leads to the creation of the new concept of the mobile IoT. The mobile IoT assumes that the computer network envelops not only traditional computer elements (servers, workstations, network devices) and electronic user devices ("things") of different types, but also mobile computing devices that are connected to the remaining elements of the IoT network by means of WiFi, mobile and/or sensor networks. Applications of the mobile IoT are continuously expanding. The mobile IoT finds successful application in such areas as medicine, transport management, security monitoring in public places, smart house and/or smart city, electricity consumption, industrial production, robotics, etc.

However, the successful development and effective functioning of the mobile IoT are impossible without solving a set of problems in the area of computer security. One such problem of key importance is the problem of mobile IoT security monitoring. Security monitoring consists in continuously collecting large arrays of heterogeneous data about security events taking place in mobile IoT networks. These data are subjected to further analysis to detect the signs of possible harmful activity, with the purpose of devising timely countermeasures against existing and prospective attacks on the mobile IoT infrastructure. Solving this problem will substantially enhance mobile IoT security.

One of the most promising directions for solving this problem is combining the results obtained in the fields of Big Data processing and machine learning. In the field of Big Data processing, frameworks implementing parallel stream data handling are of great interest. Methods and algorithms of machine learning allow us to find regularities in the processed data and to solve different Data Mining problems in an automated mode. Combining the capabilities of Big Data and machine learning frameworks is considered a rather promising direction for solving the problem of online analysis of mobile IoT security data.

The main contribution of the paper is as follows: 1) Creation of a framework in which the possibilities of Big Data processing and machine learning algorithms are integrated; 2) Implementation in this framework of procedures for
analyzing a reference data set containing data on mobile IoT traffic; 3) The experimental assessment of the developed framework.

The further structure of the paper is as follows. Section 2 lists related work. Section 3 considers the mathematical foundations and the problem statement. Section 4 describes the data set used. Section 5 discusses the proposed framework architecture. Section 6 contains the results of the experiments. Section 7 contains the conclusion and the directions of future research.
II. RELATED WORK
Related works can be divided into four groups: (i) applying machine learning for computer security; (ii) development of deep learning frameworks for IoT security; (iii) development of Big Data frameworks; (iv) implementation of machine learning algorithms.

(i) Applying machine learning for computer security. The works of the first group show that the problem of applying machine learning methods to computer security has become extremely relevant. Chan et al. [1] showed that the use of computers in mobile networks (including the mobile IoT) has become more ubiquitous and connected, and the attacks against these networks have become more pervasive and diverse. Conventional security software requires a lot of human effort to identify threats in such networks. This labor-intensive process can be made more efficient by applying machine learning algorithms. Shamili et al. [2] showed a successful implementation of one of the machine learning algorithms, namely a distributed Support Vector Machine (SVM), in order to detect malicious software (malware) in a network of mobile devices. Sahs et al. [3] presented a machine learning based system to detect malware on Android devices, which is also based on SVM algorithms. However, a greater effect can be achieved by combined application of different machine learning algorithms. The manifesto [4] directly claims that one of the promising directions for the application of machine learning is the protection of mobile devices together with assisted malware analysis. Ford et al. [5] reviewed examples of intrusion detection systems which are specifically based on machine learning methods due to their adaptability to new and unknown attacks. Among possible attacks, attacks against mobile IoT networks, such as smart meter energy consumption profiling and surveillance camera robbery, are considered. Arslan et al. [6] carried out an analysis of the known works on the application of machine learning methods for the detection of mobile threats. They showed, firstly, that SVM algorithms have the greatest popularity and, secondly, that the success rates of detection systems built with machine learning methods lie between 80% and 99.6%.
It is necessary to note that in our work we reached a higher value of this indicator. Xiao et al. [7] investigated IoT security solutions based on machine learning methods. These methods included supervised learning, unsupervised learning and reinforcement learning and were applied to IoT authentication, access control, secure offloading and malware detection. The results of the investigation show that one of the main challenges that has to be addressed to implement machine learning based security methods in practical IoT systems is computation and communication overhead. In our proposed framework we set a goal to overcome this challenge.

(ii) Development of deep learning frameworks for IoT security. Deep learning algorithms are popular tools for high-dimensional object analysis. This is due to their high ability to generalize data: the mechanism for tuning such structures is aimed at approximating a set which covers both the training elements and elements which were not encountered in the training process. Zhou et al. [29] proposed a framework based on deep learning for detecting internet intrusions in the IoT. The authors emphasize the applicability of the developed framework within the concept of a "smart city". A particularity of the considered approach is that it allows enhancing the classification accuracy and creating a trade-off between the correctness of attack detection and the speed of this process. Nguyen et al. [30] investigated the applicability of deep learning to the detection of cyber attacks on the example of three data sets (NSL-KDD, UNSW-NB15, KDDCup 1999). The framework developed by the authors includes four consecutive steps: (1) applying principal component analysis for reducing the redundancy and dimension of the input data, (2) pre-training the neural network using the Gaussian binary restricted Boltzmann machine, (3) the deep learning step, consisting in iterative adjustment of weights, (4) constructing the output signal using softmax regression. The paper [31] is devoted to building a distributed attack detection system based on a deep learning model. The application of the developed system is the detection of attacks in the social IoT. The developed system architecture allows updating the learning parameters of each cooperative node on a single master node and providing the resulting updates to the worker nodes. In [32] the task of distributed learning of a deep neural network is considered within the context of mobile big data analytics. The proposed approach was implemented in the Spark environment and consists of the following steps: (1) learning partial models (worker nodes perform neural network tuning using different data partitions), (2) parameter averaging (the parameters of the models are transmitted to the master node for averaging), (3) parameter dissemination (the updated parameters are passed to the worker nodes, which again
perform the neural network tuning). These steps are repeated several times until the convergence criterion is satisfied. At this stage of our work, we limited ourselves to considering a two-layer neural network due to its simple structure and acceptable learning speed. In the future, it is planned to investigate the applicability of deep neural networks.

(iii) Development of Big Data frameworks. The development of frameworks for Big Data is considered in many works. As a rule, these frameworks execute MapReduce operations and are based on special software systems such as Hadoop, Spark, Flink, etc. [8, 9]. At the same time, these frameworks have a rather broad scope. Some Big Data frameworks used in mobile networks and the mobile IoT are known as well. Scherbakov et al. [10] presented a framework for processing web applications. Kim et al. [11] discussed a framework for medical data analysis. Zygouras et al. [12] considered a framework for monitoring data on bus traffic control. Big Data frameworks developed for the benefit of computer security [13-16] also exist. However, the machine learning algorithms in these frameworks are not used efficiently. An approach for combining several classifiers has already been suggested by Branitskiy et al. [17, 18]. However, the aspects of Big Data processing have not been considered there.

(iv) Implementation of machine learning algorithms. Machine learning algorithms are applied to solve different Data Mining tasks. For classification tasks, the most suitable algorithms are K-Nearest Neighbors [19], Naive Bayes [20], and SVM [21]. Regression problems are solved with the help of Linear Regression [22], Random Forests [23], and Bagging [24] algorithms. The K-Means algorithm [25] and Density-Based Spatial Clustering of Applications with Noise [26] are applied to the clustering problem. To improve the classification performance of individual classifiers, the method of combining them is used. For example, in [33, 34] the composition (the rule for combining the basic classifiers) is represented as a Dempster-Shafer rule, and in [35] three classification schemes (one-against-all, one-against-one, directed acyclic graph) designed for combining binary classifiers are considered. In this paper, we realize three different voting methods: plurality voting and two of its modifications (weighted and soft voting). These and other machine learning algorithms are actively developed. However, their implementation in parallel computing systems, especially for security problems, has received little attention. Therefore, the results considered in the paper are relevant.

The need for an integrated application of Big Data processing and machine learning methods is as follows. First, an increasing number of network devices can lead to the generation of large data streams transmitted between these devices, possessing a heterogeneous structure and containing
confidential user data. To solve the problem of processing such data, the well-proven MapReduce approach can be used: first, the data are divided into several parts, each of which is processed independently by some function; then the results are aggregated to form a final decision. Second, the variety of devices (both in terms of the hardware and the software settings) can complicate the task of detecting anomalies in the IoT. The presence of implicit and hidden patterns in the analyzed parameters makes it difficult or even impossible to use such classical tools as expert-logical inference or signature-based analysis for this task; therefore, it is proposed to use machine learning methods.
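To illustrate the MapReduce idea referred to above, the following minimal Python sketch (our own illustration; the paper does not prescribe a particular implementation) splits a set of records into chunks, processes every chunk independently, and aggregates the partial results:

```python
# Minimal sketch of the MapReduce idea: "map" processes chunks independently,
# "reduce" aggregates the partial results. The chunk size, worker count and the
# example counting function are illustrative assumptions, not the framework's code.
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_reduce(records, chunk_size, map_fn, reduce_fn, workers=4):
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(map_fn, chunks))      # independent per-chunk processing
    return reduce(reduce_fn, partial)                 # aggregation into a final decision

# Usage sketch: count "suspicious" records in parallel (toy predicate).
total = map_reduce(records=list(range(1_000_000)), chunk_size=100_000,
                   map_fn=lambda chunk: sum(1 for r in chunk if r % 97 == 0),
                   reduce_fn=lambda a, b: a + b)
```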
III. MATHEMATICAL FOUNDATIONS
The problem solved in the paper is the task of classifying objects. It is formulated as follows. Let there be a set of pairs P = {(x_i, c_i)}, i = 1, …, M, which consists of the feature descriptions of the classified objects in the form of vectors x_i and the class labels c_i assigned to them. It is required to develop an algorithm R which allows us to approximate the set P based on the available information about the vectors {x_i}:

$$\#\{x_i \mid R(x_i) \neq c_i\} \rightarrow \min. \qquad (1)$$
To solve this task, machine learning methods are used, which are known for their ability to reveal hidden regularities in the analyzed data. In this paper we consider the following machine learning methods: (i) principal component analysis (PCA), (ii) the support vector machine (SVM), (iii) the K-nearest neighbor method (KNN), (iv) the Gaussian naïve Bayes (GNB), (v) the two-layer artificial neural network (ANN), (vi) the decision tree (DT).

(i) Principal component analysis: The PCA consists in decreasing the dimensionality of the analyzed vectors with the greatest preservation of the variability of the initial data. The essence of the approach is a linear mapping of the vector z into a new space:
$$F'(\mathbf{z}) = [\mathbf{v}_1, \ldots, \mathbf{v}_{n'}]^T (\mathbf{z} - \overline{\mathbf{x}}), \qquad (2)$$
where $\mathbf{v}_1, \ldots, \mathbf{v}_{n'}$ are the orthonormal eigenvectors (sorted in descending order of their corresponding eigenvalues) of the covariance matrix made up of the elements of the training data set, $\overline{\mathbf{x}}$ is the mathematical expectation of a random vector represented by the training data, n' is the selectable dimension of the new space, and $\mathbf{v}_i^T \mathbf{z}$ is the i-th principal component of the vector z. This method allows discarding those features that are insignificant from the point of view of their informativeness and taking into account linear combinations of the most significant features.
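The following NumPy sketch (our own illustration; the paper does not name an implementation library) shows how the projection of formula (2) can be obtained from the eigenvectors of the covariance matrix of the training set:

```python
import numpy as np

def pca_fit(X_train, n_components):
    """Builds the projection of formula (2): eigenvectors of the covariance matrix,
    sorted in descending order of their eigenvalues."""
    mean = X_train.mean(axis=0)                    # mathematical expectation of the data
    cov = np.cov(X_train, rowvar=False)            # covariance matrix of the training set
    eigvals, eigvecs = np.linalg.eigh(cov)         # real eigenpairs of a symmetric matrix
    order = np.argsort(eigvals)[::-1]              # descending order of eigenvalues
    V = eigvecs[:, order[:n_components]]           # v_1, ..., v_{n'}
    return mean, V

def pca_transform(Z, mean, V):
    return (Z - mean) @ V                          # i-th column is the i-th principal component

# Usage sketch: compress 115-dimensional records to n' = 10 components.
# mean, V = pca_fit(X_train, n_components=10); X_low = pca_transform(X_train, mean, V)
```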
(ii) The support vector machine: The SVM consists in constructing a separating hyperplane which is equidistant from the closest objects belonging to different classes. The model is presented as follows:

$$F^{(1)}(\mathbf{z}) = \mathrm{sign}\left(b + \sum_{i=1}^{M_s} w_i \, \mathbf{x}_i^T \mathbf{z}\right), \qquad (3)$$

where $w_i$ are weight coefficients, which are the products of the non-zero Lagrange multipliers and the desired output values, $\mathbf{x}_i$ are the support vectors (i = 1, …, $M_s$), and b is the bias parameter. In this formula, it is assumed that the training sample is linearly separable. Otherwise, it is necessary to use special transformations, namely kernels.
(iii) The K-nearest neighbor method: The KNN assigns to the analyzed vector z the label of the class whose instances are the most numerous among the K training objects closest to this vector. Formally, this approach is as follows:

$$F^{(2)}(\mathbf{z}) = \arg\max_{c \in \Omega} \sum_{i=1}^{K} [\mathbf{x}'_i \in c], \qquad (4)$$

where $\mathbf{x}'_1, \ldots, \mathbf{x}'_K$ are the training vectors for which the value $\sum_{i=1}^{K} \|\mathbf{z} - \mathbf{x}'_i\|$ is minimal among all training vectors, Ω is the set of classes, and $[\cdot]$ is the indicator, equal to 1 if its argument is true and 0 otherwise. We can say that this method does not require preliminary tuning (training); for its functioning it is sufficient to store the entire training sample.

(iv) The Gaussian naïve Bayes: The functioning of the naïve Bayes classifier is based on the application of the Bayes theorem. It is assumed that each pair of features is independent, which allows us to apply the conditional probability formula:
$$F^{(3)}(\mathbf{z}) = \arg\max_{c \in \Omega} \left[ P(\mathbf{z} \mid c) \cdot P(c) \right], \qquad (5)$$
where P(z | c) is the probability of the appearance of the record z among all analyzed objects belonging to the class c, and P(c) is the unconditional probability of the appearance of a record of class c in the data set.

(v) The two-layer artificial neural network: A two-layer artificial neural network is a layered structure in which successive linear and nonlinear transformations of the input vector are performed. After passing through each k-th layer, the composition of the non-linear activation function and the weighted sum of the components of the vector output by the (k − 1)-th layer is computed:

$$F^{(4)}(\mathbf{z}) = \varphi_1^{(2)}\left( \theta_1^{(2)} + \sum_{i=1}^{N_1} w_{1i}^{(2)} \, \varphi_i^{(1)}\left( \theta_i^{(1)} + \sum_{j=1}^{n} w_{ij}^{(1)} z_j \right) \right), \qquad (6)$$

where $w_{ij}^{(1)}$ and $w_{1i}^{(2)}$ are the weights of the first (hidden) layer of dimension $N_1$ and of the second (output) layer, which are customizable parameters in the learning process, φ is the activation function, and $\theta_i^{(1)}$ and $\theta_1^{(2)}$ are bias parameters.
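For illustration only (the paper does not state how the network was implemented), the forward pass of formula (6) for a network with n inputs, N1 hidden neurons and one output can be written directly in NumPy:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))        # an assumed non-linear activation function

def two_layer_ann(z, W1, theta1, w2, theta2):
    """Forward pass of formula (6).
    W1: (N1, n) hidden-layer weights, theta1: (N1,) hidden biases,
    w2: (N1,) output-layer weights, theta2: scalar output bias."""
    hidden = sigmoid(theta1 + W1 @ z)       # inner sums of formula (6)
    return sigmoid(theta2 + w2 @ hidden)    # outer activation of formula (6)

# Usage sketch with random (untrained) parameters and illustrative dimensions.
rng = np.random.default_rng(0)
n, N1 = 10, 16
y = two_layer_ann(rng.normal(size=n), rng.normal(size=(N1, n)),
                  rng.normal(size=N1), rng.normal(size=N1), 0.0)
```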
(vi) The decision tree: A decision tree is a hierarchical structure containing numerical attributes and predicates as nonterminal nodes and class labels as terminal nodes. While descending down the tree, depending on the truth of the predicate for the observed vector components, one of the two paths in the decision tree is selected. The approach is as follows:

$$F^{(5)}(\mathbf{z}) = R(T, \mathbf{z}) = \begin{cases} c, & \text{if } T \text{ contains only the terminal node } c,\\ R(T^{(L)}, \mathbf{z}), & \text{if } P^{(T)}(\mathbf{z}),\\ R(T^{(R)}, \mathbf{z}), & \text{if not } P^{(T)}(\mathbf{z}), \end{cases} \qquad (7)$$

where $T^{(L)}$ and $T^{(R)}$ are the left and right subtrees of T, c is a class label, and $P^{(T)}$ is the predicate located at the root of the tree T. This definition implies a recursive calculation of the class label by descending into one of the two parts of the tree.

To combine the classifiers $F^{(1)}, \ldots, F^{(5)}$ we have investigated three compositions: plurality voting (PV), weighted voting (WV) and soft voting (SV). Plurality voting is represented as follows:

$$G^{(1)}(\mathbf{z}) = \arg\max_{c \in \Omega} \sum_{i=1}^{N} [F^{(i)}(\mathbf{z}) = c], \qquad (8)$$

where N = 5 is the number of basic classifiers to be combined. Note that in formula (8) it is assumed that the basic classifiers $F^{(1)}, \ldots, F^{(5)}$ have the same discrete set of output values represented as a set of class labels (e.g. {0, …, #Ω − 1}). An improvement of formula (8) is the weighted voting, in which a weighted encouragement of the classifiers is carried out, proportional to their level of correctness in recognizing the training instances, according to the following formula:
$$G^{(2)}(\mathbf{z}) = \arg\max_{c \in \Omega} \sum_{i=1}^{N} w_i \cdot [F^{(i)}(\mathbf{z}) = c], \qquad
w_i = \frac{\#\{x_j \mid F^{(i)}(x_j) = c_j\}_{j=1}^{M}}{\sum_{k=1}^{N} \#\{x_j \mid F^{(k)}(x_j) = c_j\}_{j=1}^{M}}, \qquad (9)$$

where the coefficients $w_i$ are calculated in such a way that their sum is equal to 1. Soft voting can be considered as an extension of formula (9) obtained by introducing weight coefficients assigned to both the basic classifier and the class label predicted by it:
$$G^{(3)}(\mathbf{z}) = \arg\max_{c \in \Omega} \sum_{i=1}^{N} w_{ic} \cdot [F^{(i)}(\mathbf{z}) = c], \qquad
w_{ic} = \frac{\#\{x_j \mid F^{(i)}(x_j) = c_j \wedge c = c_j\}_{j=1}^{M}}{\sum_{c' \in \Omega} \#\{x_j \mid F^{(i)}(x_j) = c_j \wedge c' = c_j\}_{j=1}^{M}}, \qquad (10)$$
where the coefficients $w_{ic}$ are calculated in such a way that their sum is equal to 1 for each basic classifier $F^{(1)}, \ldots, F^{(5)}$. The soft voting method thus assigns each basic classifier a vector of weights. This vector can be interpreted as the probability of
the correct determination of classes for the elements of the training sample with the help of the corresponding basic classifier. Thereby the dimension of this vector is the number of distinct class labels found in the training set, and each j-th component of this vector reflects how well this classifier is able to correctly assign the elements of the training subsample marked with the j-th class label to the j-th class. The result of the final classification is the class for which the total probability of the object belonging to it, over all classifiers, is the highest.
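The three compositions (8)-(10) can be sketched in Python as follows; this is our own illustration, assuming each basic classifier exposes a predict(x) method that returns a single integer class label:

```python
import numpy as np

def plurality_voting(classifiers, z, n_classes):
    votes = np.zeros(n_classes)
    for clf in classifiers:                       # formula (8): one vote per classifier
        votes[clf.predict(z)] += 1
    return int(np.argmax(votes))

def weighted_voting(classifiers, weights, z, n_classes):
    votes = np.zeros(n_classes)
    for clf, w in zip(classifiers, weights):      # formula (9): votes scaled by w_i
        votes[clf.predict(z)] += w
    return int(np.argmax(votes))

def soft_voting(classifiers, class_weights, z, n_classes):
    votes = np.zeros(n_classes)
    for clf, w_ic in zip(classifiers, class_weights):
        c = clf.predict(z)
        votes[c] += w_ic[c]                       # formula (10): per-class weight w_ic
    return int(np.argmax(votes))

def fit_weights(classifiers, X_train, y_train):
    """Weights of formula (9): each classifier's share of correctly recognized
    training instances, normalized so that the weights sum to 1."""
    correct = np.array([sum(clf.predict(x) == y for x, y in zip(X_train, y_train))
                        for clf in classifiers], dtype=float)
    return correct / correct.sum()

def fit_class_weights(clf, X_train, y_train, n_classes):
    """Weights of formula (10) for one classifier: the per-class share of its
    correct answers, summing to 1 over the classes."""
    correct_per_class = np.zeros(n_classes)
    for x, y in zip(X_train, y_train):
        if clf.predict(x) == y:
            correct_per_class[y] += 1
    return correct_per_class / max(correct_per_class.sum(), 1.0)
```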
IV. DATA SET
The experiments were carried out using the data set "detection_of_IoT_botnet_attacks_N_BaIoT" [27], which contains records describing the network traffic transmitted between 9 mobile IoT devices. These records are represented in the form of CSV files: each record consists of 115 attributes separated by commas. In the data set there are 11 classes, among which 10 classes are attacks and 1 class is a benign class. Figure 1 demonstrates the distribution of the data set size by the 9 IoT device types.

FIGURE 1. Distribution of the data set size by IoT device types

In total, the data set contains 7009270 records, some of which are duplicated, namely instances of the classes gafgyt tcp and gafgyt udp. Therefore the first step is the removal of identical records, which allowed us to exclude 1.65% of the records from consideration. Deleting duplicate records allows training the classifiers on distinct (non-duplicated) instances and thus provides a better coverage of the training sample. The second step is a min-max normalization, which is applied to each attribute of a training vector. Applying this procedure reduces the variability in the values of particular parameters and puts all parameters inside the same range of values, e.g. [0, 1]. The third step is principal component analysis (PCA), which compresses the input vectors to vectors of smaller dimension without significant loss of information about the parameters of the initial vector. For this purpose, a linear transformation of the centered vector is performed using the eigenvectors of the covariance matrix of the training sample. Figures 2 and 3 demonstrate the projection of one of the training subsamples onto the first two and three principal components for the device Danmini Doorbell. From these figures it can be seen that the training subsamples of different classes can be closely located, which complicates the process of constructing the classification model. Therefore, we should consider a higher-dimensional space. To accelerate the processing of high-dimensional vectors, we compress the training instances to the first 10 principal components. This number of components was chosen because, for the considered example, it is the smallest number that delivers 99% of the informativeness of the initial data. From Figure 4 we can see that with a larger number of components the curve becomes almost horizontal, and thus using more than 10 principal components does not add significant variability to the analyzed data.

FIGURE 2. Projection of the training sample onto the first two principal components

Figure 5 shows the absolute correlation between the first 10 principal components and the class label. In particular, the last column (or last row) depicts the degree of dependence between the components and the predicted class labels. This figure shows that the components themselves are independent of each other (the value of the correlation coefficient is close to zero), and that the second component is the most significant for recognizing the class label (the absolute value of their pairwise correlation is equal to 0.724).
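The three preprocessing steps described above (duplicate removal, min-max normalization and compression to the first 10 principal components) could be expressed, for example, with pandas and scikit-learn; this is an illustrative sketch, since the paper does not specify which libraries were used:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

def preprocess(csv_paths, n_components=10):
    # Step 1: load the CSV files of the data set and remove identical (duplicated) records.
    data = pd.concat([pd.read_csv(p) for p in csv_paths], ignore_index=True).drop_duplicates()

    # Step 2: min-max normalization, putting every attribute into the range [0, 1].
    scaler = MinMaxScaler()
    scaled = scaler.fit_transform(data.values)

    # Step 3: projection onto the first 10 principal components
    # (about 99% of the informativeness for the device considered above).
    pca = PCA(n_components=n_components)
    return pca.fit_transform(scaled), scaler, pca

# Usage sketch: features, scaler, pca = preprocess(["benign.csv", "gafgyt_tcp.csv"])
# (the file names are hypothetical placeholders for the per-class CSV files)
```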
FIGURE 3. Projection of the training sample onto the first three principal components

V. FRAMEWORK ARCHITECTURE
Figure 6 shows the framework architecture designed for mobile Internet of Things security monitoring. The main idea in the design of this architecture is to explicitly distinguish two modes of functioning of the systems based on such an architecture: the training mode and the analysis mode. At the first level, the analyzed data set is divided between several classifiers. In the training mode, a specific training sample is created for each binary classifier. In the analysis mode, the analyzed data set is first split, and then independent copies of the classifiers are assigned the task of processing each of the formed pieces. At the second level, principal component analysis is used to reduce the dimension of the analyzed vector. At the third level, in the training mode the parameters of the basic classifiers and their compositions are adjusted, and in the analysis mode the classification indicators of the trained classifiers are calculated. Thereby, within the proposed architecture, the MapReduce concept is realized: at the first two levels, preliminary processing of the input data and their separation between several processes is performed (map); at the third level, the aggregation of the results is performed (reduce).
FIGURE 4. Informativeness measure of the principal components on the training data set

FIGURE 5. Correlation dependence of the first 10 principal components and the class label on the training set

FIGURE 6. Framework for mobile Internet of Things security monitoring using Big Data processing and machine learning
Within the developed framework there are two modes of functioning of the mobile IoT security monitoring system. The first mode is the mode of learning the classifiers, and the second is the mode of analyzing instances. In addition, there are three layers for each mode: (1) extraction and decomposition of the data set, (2) compression of the feature vectors, and (3) learning and classification.

At the first level, the training sample is formed and the analyzed data set is decomposed. Well-known algorithms for constructing representative training samples, such as CADEX and DUPLEX [28], possess a complexity that is quadratic in the
cardinality of the original data set. This particularity negatively affects the performance of algorithms designed to handle large amounts of data. Therefore, in this paper we use a heuristic approach to extract training instances, which consists in the removal of strongly correlated records within randomly selected subsets rather than within the entire data set. First, the initial data set D is divided into several smaller fragments D_1, …, D_Q. Next, for each pair of elements (e_i, e_j) with i < j inside each subset D_k, if their correlation exceeds T (where T is an a priori established threshold), the element e_j is excluded from the data set D_k. Then this procedure is repeated again for the truncated set D_1, …, D_Q. In the mode of analyzing instances, the data set is divided into several disjoint subsets of approximately the same size. For processing each of the resulting subsets, a separate parallel thread with its own set of classifiers is used.

At the second level, the analyzed feature vector is compressed by the PCA. As was shown in Section 4, the dimension of the input vector is decreased from 115 to 10 components, while retaining more than 99% of the informativeness of the initial set of training vectors.

At the third level, the classifiers are placed, which first adjust their parameters, i.e. learn, and then predict the class label of the analyzed object from its features.
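The level-1 heuristic for extracting training instances described above can be sketched as follows; the fragment size and the threshold T are illustrative parameters, and the correlation test is applied only inside each randomly formed fragment:

```python
import numpy as np

def extract_training_records(D, fragment_size, T, seed=0):
    """Drops, inside each randomly formed fragment D_k, every record e_j whose
    absolute correlation with an already kept record e_i exceeds the threshold T."""
    rng = np.random.default_rng(seed)
    D = D[rng.permutation(len(D))]                     # random split into fragments
    kept = []
    for start in range(0, len(D), fragment_size):
        fragment, selected = D[start:start + fragment_size], []
        for e_j in fragment:
            if not any(abs(np.corrcoef(e_i, e_j)[0, 1]) > T for e_i in selected):
                selected.append(e_j)                   # e_j is not strongly correlated
        kept.extend(selected)
    return np.asarray(kept)

# Usage sketch: D_train = extract_training_records(D, fragment_size=5000, T=0.98)
```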
To assess the effectiveness of the classifiers, two parameters were used: (1) the accuracy:

$$\mathrm{ACC} = \frac{\#\{x_i \mid F(x_i) = c_i\}_{i=1}^{M}}{M}; \qquad (11)$$
(2) the difference between the true positive rate and the false positive rate:

$$\mathrm{TPR} - \mathrm{FPR} = \frac{\#\{x_i \mid F(x_i) \neq c_b \wedge c_i \neq c_b\}_{i=1}^{M}}{\#\{x_i \mid c_i \neq c_b\}_{i=1}^{M}} - \frac{\#\{x_i \mid F(x_i) \neq c_b \wedge c_i = c_b\}_{i=1}^{M}}{\#\{x_i \mid c_i = c_b\}_{i=1}^{M}}, \qquad (12)$$

where the notation $c_b$ denotes the class of normal traffic.
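Formulas (11) and (12) translate directly into code; in the following sketch (our own helper functions), y_true and y_pred are arrays of class labels and benign_label plays the role of c_b:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Formula (11): the share of correctly classified instances."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def tpr_minus_fpr(y_true, y_pred, benign_label):
    """Formula (12): the rate of attacks classified as "not benign" minus
    the rate of benign records classified as "not benign"."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    attack, benign = y_true != benign_label, y_true == benign_label
    tpr = float(np.mean(y_pred[attack] != benign_label)) if attack.any() else 0.0
    fpr = float(np.mean(y_pred[benign] != benign_label)) if benign.any() else 0.0
    return tpr - fpr
```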
A multi-level scheme for combining the classifiers was used for carrying out the experiments. Its representation is shown in Figure 7. After processing by the PCA, the input vector is analyzed using various binary classifiers. In the role of such classifiers we used the support vector machine, the k-nearest neighbor method, the Gaussian naïve Bayes, the artificial neural network and the decision tree. The quantity of these binary classifiers is equal to 55 (the number of combinations of 2 classes out of a total of 11 classes). Each copy of a specific classifier is trained using a subsample containing only two classes. Such fragmentation of the training set allows reducing the time costs of training the classifiers by introducing a parallel mode, and also tuning the classifiers to be more sensitive in recognizing objects belonging to two classes (instead of eleven). The created binary classifiers are combined into a multi-class model $F^{(i)}$ using the one-vs-one approach (i = 1, …, 5). The resulting classification is performed using a classifier F constructed on the basis of plurality voting, weighted voting or soft voting. After completing the training procedure, the structures of the classifiers are stored to allow their deserialization and the calculation of their performance in the mode of analyzing instances.
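A condensed sketch of this scheme (55 binary classifiers per basic method, combined by the one-vs-one approach and trained in parallel threads) is shown below; the scikit-learn estimator and the thread pool are illustrative choices and not a statement about the authors' actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
import numpy as np

def train_one_vs_one(make_estimator, X, y, n_jobs=8):
    """Trains one binary classifier per pair of classes (55 pairs for 11 classes)."""
    pairs = list(combinations(np.unique(y), 2))

    def fit_pair(pair):
        a, b = pair
        mask = (y == a) | (y == b)            # subsample containing only two classes
        clf = make_estimator()
        clf.fit(X[mask], y[mask])
        return pair, clf

    with ThreadPoolExecutor(max_workers=n_jobs) as pool:   # parallel training mode
        return dict(pool.map(fit_pair, pairs))

def predict_one_vs_one(pair_classifiers, x):
    """Combines the binary decisions by pairwise voting (one-vs-one scheme)."""
    votes = {}
    for (a, b), clf in pair_classifiers.items():
        label = int(clf.predict(x.reshape(1, -1))[0])
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Usage sketch with an SVM as the basic method:
# from sklearn.svm import SVC
# model = train_one_vs_one(lambda: SVC(kernel="rbf"), X_train, y_train)
# label = predict_one_vs_one(model, X_test[0])
```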
FIGURE 7. Multi-level scheme for combining the classifiers

VI. EXPERIMENTS AND DISCUSSION

The training data set consists of 27500 unique instances (2500 unique instances per class). The testing data set was formed from the remaining unique elements, which were not seen in the training set. The training and testing processes were performed ten times for each of the nine IoT devices, and each time the contents of the training set were generated randomly. In the role of performance indicators, the accuracy (ACC) and the difference between the true positive rate and the false positive rate (TPR − FPR) were used. Table I contains the maximum values of the performance indicators calculated for the five basic classifiers and their combinations.
TABLE I. MAXIMUM VALUES OF PERFORMANCE INDICATORS OF CLASSIFIERS AND THEIR COMBINATIONS
(Devices: D1 = Danmini Doorbell, D2 = Ecobee Thermostat, D3 = Ennio Doorbell, D4 = Philips B120N10 Baby Monitor, D5 = Provision PT 737E Security Camera, D6 = Provision PT 838 Security Camera, D7 = Samsung SNH 1011 N Webcam, D8 = SimpleHome XCS7 1002 WHT Security Camera, D9 = SimpleHome XCS7 1003 WHT Security Camera)

Classifier | Indicator | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9
SVM  | ACC       | 99.3086% | 98.0729% | 99.2815% | 89.8452% | 97.2226% | 97.1428% | 99.2009% | 88.5611% | 88.1491%
SVM  | TPR − FPR | 99.8995% | 99.8572% | 99.8734% | 99.8979% | 99.715%  | 99.8098% | 99.8621% | 99.8204% | 99.8283%
k-NN | ACC       | 99.1377% | 97.1721% | 99.4354% | 96.8944% | 97.2106% | 97.4817% | 99.3598% | 98.1248% | 97.6554%
k-NN | TPR − FPR | 99.8406% | 99.7115% | 99.768%  | 99.8746% | 99.6782% | 99.7795% | 99.7588% | 99.5898% | 99.7527%
GNB  | ACC       | 75.6666% | 71.4082% | 64.2376% | 79.2933% | 72.9288% | 75.9799% | 66.2622% | 70.8056% | 68.1603%
GNB  | TPR − FPR | 99.4431% | 99.5928% | 99.3554% | 99.3083% | 99.6172% | 99.711%  | 99.7334% | 98.1972% | 99.172%
ANN  | ACC       | 90.8075% | 88.728%  | 71.3483% | 91.2059% | 86.6745% | 88.7023% | 98.8189% | 89.5733% | 88.0869%
ANN  | TPR − FPR | 99.6634% | 99.6577% | 99.6457% | 99.349%  | 99.6206% | 99.7528% | 99.761%  | 99.5902% | 99.5274%
DT   | ACC       | 99.1287% | 97.5543% | 99.5212% | 98.0183% | 97.4918% | 98.0422% | 99.5311% | 98.0592% | 97.6382%
DT   | TPR − FPR | 99.9122% | 99.8919% | 99.8828% | 99.9228% | 99.8447% | 99.8583% | 99.8967% | 99.7183% | 99.806%
PV   | ACC       | 99.4611% | 98.9797% | 99.5361% | 98.3458% | 97.5095% | 98.8028% | 99.39%   | 99.211%  | 99.1102%
PV   | TPR − FPR | 99.8691% | 99.811%  | 99.8341% | 99.8888% | 99.7109% | 99.7927% | 99.8326% | 99.7464% | 99.8072%
WV   | ACC       | 99.4749% | 98.9523% | 99.5361% | 98.3464% | 97.5286% | 98.8423% | 99.39%   | 99.1908% | 99.0829%
WV   | TPR − FPR | 99.8694% | 99.8023% | 99.8341% | 99.887%  | 99.704%  | 99.8001% | 99.8293% | 99.7529% | 99.808%
SV   | ACC       | 99.502%  | 99.0225% | 99.5289% | 98.3362% | 97.5193% | 98.8498% | 99.363%  | 99.1911% | 99.1385%
SV   | TPR − FPR | 99.8643% | 99.7955% | 99.8368% | 99.8828% | 99.7072% | 99.7929% | 99.8066% | 99.7525% | 99.7742%
For seven IoT devices out of nine possible, an increase in the indicator ACC is observed when using the combined classifiers PV, WV and SV in comparison with the separate basic classifiers SVM, k-NN, GNB, ANN and DT. If we consider a specific combined classifier, namely SV, then we obtain a total increase in the indicator ACC by 4.685% in comparison with the greatest value of the indicator ACC demonstrated among the basic classifiers.

To accelerate the analysis of records, the data set was divided into several approximately equal chunks, each of which was processed by a separate parallel thread. Figures 8 and 9 demonstrate the dependence of the processing time of the data sets on the number of threads for the device Danmini Doorbell. In the case of the training set, the rate of data processing increased by a factor of 7.065 in the transition from one thread to eight, and in the case of the testing set, by a factor of 6.296.

FIGURE 8. Dependence of the processing time of the training data set (27500 instances) on the number of threads

FIGURE 9. Dependence of the processing time of the testing data set (969039 instances) on the number of threads
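The described splitting of the analyzed set into roughly equal chunks, each handled by its own thread, can be sketched as follows; predict_batch stands for an arbitrary per-chunk classification routine (a hypothetical callable):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def analyze_in_parallel(X, predict_batch, n_threads=8):
    """Splits X into n_threads nearly equal chunks and classifies them concurrently."""
    chunks = np.array_split(X, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(predict_batch, chunks))    # one chunk per thread
    return np.concatenate(results)

# Usage sketch: y_pred = analyze_in_parallel(X_test, predict_batch=model.predict, n_threads=8)
```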
We also estimated the time spent on training and testing each classifier. Figure 10 shows the time characteristics of the training process for each classifier. The averaging of this indicator was carried out over 10 launches, and the red vertical line indicates the amplitude of the deviation from the average value. As a result of the performed experiments, it was established that the training process of the neural network is the longest in comparison with the same characteristic of the other classifiers. The time for training the plurality voting is equal to 0, since no preliminary information on the aggregated basic classifiers is used in constructing the decision rule based on this composition.
FIGURE 10. Dependence of the training time on the type of classifier
Among the basic classifiers (SVM, KNN, GNB, ANN, DT), the least training time belongs to GNB and KNN, since their training process uses rather primitive operations: in the first case it is necessary to calculate the frequency of the correspondence of certain feature values to a given label, and in the latter case it is sufficient to store the correspondence of the training instances to the class labels. Figure 11 shows the time characteristics of the testing process for each classifier. Among the basic classifiers, the neural network has the least processing time on the testing data set, and among the compositions, plurality voting and weighted voting are the fastest. The soft voting requires a greater amount of time when processing input vectors in comparison with the other compositions.
FIGURE 11. Dependence of the testing time on the type of classifier

VII. CONCLUSIONS

The paper presents a new framework that combines the capabilities of Big Data processing and machine learning for security monitoring of the mobile Internet of Things. The architecture of a Big Data and machine learning framework is proposed, in which five machine learning mechanisms intended for solving the classification problem operate in parallel threads. The classifier operation results are subjected to plurality voting, weighted voting and soft voting. The experimental assessment of the performance and accuracy of the framework was made on a data set generated from the network traffic transmitted between mobile IoT devices. The assessment showed that the proposed framework provides a gain in information processing productivity and higher accuracy of attack detection. This indicates the efficiency of the developed framework and the high potential of integrating Big Data processing and machine learning algorithms.

In further work, it is planned to examine and experimentally investigate other machine learning methods for the detection of anomalies, such as dynamic Bayesian networks (DBNs) and deep learning neural networks (DLNNs). DBNs are characterized by a graph-like structure, which defines a set of observable and unobservable random variables changing over time in accordance with a transition model. This allows one to detect multi-step attacks and provides the ability to calculate the probability that an observed event is anomalous. DLNNs are aimed at detecting anomalies by repeatedly and sequentially transforming the analyzed object: the first layers of a DLNN process the signals specifying the low-level features of the object, and as these signals are passed (transformed) towards the output layer, a more abstract representation of the object is formed. A further research direction is the implementation of the developed framework in special software environments such as Hadoop and Spark.

REFERENCES
[1] Ph. K. Chan and R. P. Lippmann, "Machine Learning for Computer Security", Journal of Machine Learning Research, vol. 7, 2006, pp. 2669-2672.
[2] A. S. Shamili, C. Bauckhage, and T. Alpcan, "Malware detection on mobile devices using distributed machine learning", in Proc. 2010 20th Int. Conf. on Pattern Recognition (ICPR), IEEE, 2010, pp. 4348-4351.
[3] J. Sahs and L. Khan, "A machine learning approach to Android malware detection", in Proc. 2012 European Intelligence and Security Informatics Conference (EISIC), 2012, pp. 141-147.
[4] A. D. Joseph, P. Laskov, F. Roli, J. D. Tygar, and B. Nelson, "Machine Learning Methods for Computer Security (Dagstuhl Perspectives Workshop 12371)", Dagstuhl Manifestos, vol. 3, no. 1, 2013, pp. 1–30.
[5] V. Ford and A. Siraj, "Applications of Machine Learning in Cyber Security", in Proc. 27th Int. Conf. on Computer Applications in Industry and Engineering, October 13-15, 2014.
[6] B. Arslan, S. Gunduz, and S. Sagiroglu, "A review on mobile threats and machine learning based detection approaches", in Proc. 2016 4th Int. Symp. on Digital Forensic and Security (ISDFS), 2016, pp. 7-13.
[7] L. Xiao, X. Wan, X. Lu, Y. Zhang, and D. Wu, "IoT Security Techniques Based on Machine Learning", CoRR, vol. abs/1801.06275, 2018, http://arxiv.org/abs/1801.06275.
[8] A. Holmes, Hadoop in Practice. Manning Publications Co., Greenwich, CT, USA, 2012.
[9] A. Gh. Shoro and T. R. Soomro, "Big Data Analysis: Apache Spark Perspective", Global Journal of Computer Science and Technology: Software & Data Engineering, 2015, vol. 15, no. 1, pp. 1–8.
[10] M. Scherbakov, D. Kachalov, V. Kamaev, N. Scherbakova, A. Tyukov, and S. Strekalov, "A Design of Web Application for Complex Event Processing Based on Hadoop and Java Servlets", International Journal of Soft Computing, vol. 10, no. 3, 2015, pp. 218–219.
[11] M.-J. Kim and Y.-S. Yu, "Development of Real-time Big Data Analysis System and a Case Study on the Application of Information in a Medical Institution", International Journal of Software Engineering and Its Applications, vol. 9, no. 7, 2015, pp. 93–102.
[12] N. Zygouras, N. Zacheilas, V. Kalogeraki, D. Kinane, and D. Gunopulos, "Insights on a Scalable and Dynamic Traffic Management System", in Proc. 18th Int. Conf. on Extending Database Technology, 2015, pp. 653–664.
[13] Ch. P. Garware and B. A. Tidke, "A Security Framework for Big Data Computing through Distributed Cloud Data Centres in G-Hadoop", International Journal of Computer Science and Mobile Computing, vol. 5, no. 6, June 2016, pp. 355–360.
[14] Ph. Derbeko, Sh. Dolev, E. Gudes, and Sh. Sharma, "Security and privacy aspects in MapReduce on clouds: A survey", Computer Science Review, vol. 20, May 2016, pp. 1-28.
[15] I. Kotenko, I. Saenko, and A. Kushnerevich, "Parallel Big Data processing system for security monitoring in Internet of Things networks", Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications (JoWUA), vol. 8, no. 4, December 2017, pp. 60-74.
[16] I. Kotenko, A. Kuleshov, and I. Ushakov, "Aggregation of Elastic Stack Instruments for Collecting, Storing and Processing of Security Information and Events", in Proc. 14th IEEE Conference on Advanced and Trusted Computing (ATC 2017), San Francisco, USA, August 4-8, 2017. IEEE Computer Society, Los Alamitos, California, 2017, pp. 1550-1557.
[17] A. Branitskiy and I. Kotenko, "Hybridization of computational intelligence methods for attack detection in computer networks", Journal of Computational Science, 2017, no. 23, pp. 145–156.
[18] A. Branitskiy and I. Kotenko, "Network anomaly detection based on an ensemble of adaptive binary classifiers", Computer Network Security, Lecture Notes in Computer Science, Springer-Verlag, vol. 10446, pp. 143–157.
[19] H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang, "An adaptive b+-tree based indexing method for nearest neighbor search", ACM Transactions on Database Systems, vol. 30, no. 2, 2005, pp. 364–397.
[20] H. Zhang, "The optimality of naive Bayes", in Proc. 17th Int. Florida Artificial Intelligence Research Society Conference, 2004, pp. 562–567.
[21] C. Cortes and V. Vapnik, "Support-vector networks", Machine Learning, vol. 20, no. 3, 1995, pp. 273-297.
[22] G. A. Seber and A. J. Lee, Linear Regression Analysis. John Wiley & Sons, 2012.
[23] L. Breiman, "Random forests", Machine Learning, vol. 45, no. 1, 2001, pp. 5–32.
[24] L. Breiman, "Bagging predictors", Machine Learning, vol. 24, no. 2, 1996, pp. 123–140.
[25] A. Coates and A. Y. Ng, Learning Feature Representations with K-Means. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 561–580.
[26] H.-P. Kriegel, P. Kröger, J. Sander, and A. Zimek, "Density-based clustering", Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 1, no. 3, 2011, pp. 231–240.
[27] Machine Learning Repository. Accessed: Aug. 14, 2018. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/detection_of_IoT_botnet_attacks_N_BaIoT
[28] Z. Reitermanova, "Data splitting", WDS'10 Proceedings of Contributed Papers, Matfyzpress, Prague, vol. 10, 2010, pp. 31–36.
[29] Y. Zhou, M. Han, L. Liu, J. S. He, and Y. Wang, "Deep learning approach for cyberattack detection", in Proc. IEEE INFOCOM 2018 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2018, pp. 262–267.
[30] K. K. Nguyen, D. T. Hoang, D. Niyato, P. Wang, D. Nguyen, and E. Dutkiewicz, "Cyberattack detection in mobile cloud computing: A deep learning approach", in Proc. Wireless Communications and Networking Conference (WCNC), 2018, pp. 1–6.
[31] A. A. Diro and N. Chilamkurti, "Distributed attack detection scheme using deep learning approach for Internet of Things", Future Generation Computer Systems, vol. 82, 2018, pp. 761–768.
[32] M. A. Alsheikh, D. Niyato, S. Lin, H. Tan, and Z. Han, "Mobile big data analytics using deep learning and Apache Spark", IEEE Network, vol. 30, no. 3, 2016, pp. 22–29.
[33] Z. Liu, Q. Pan, J. Dezert, J.-W. Han, and Y. He, "Classifier fusion with contextual reliability evaluation", IEEE Transactions on Cybernetics, vol. 48, no. 5, 2018, pp. 1605–1618.
[34] Z. Liu, Q. Pan, J. Dezert, and A. Martin, "Combination of classifiers with optimal weight based on evidential reasoning", IEEE Transactions on Fuzzy Systems, vol. 26, no. 3, 2017, pp. 1217–1230.
[35] C. W. Hsu and C. J. Lin, "A comparison of methods for multiclass support vector machines", IEEE Transactions on Neural Networks, vol. 13, no. 2, 2002, pp. 415–425.
IGOR KOTENKO (M’03–SM’17) graduated with honors from St. Petersburg Academy of Space Engineering and St. Petersburg Signal Academy. He obtained the Ph.D. degree in 1990 and the National degree of Doctor of Engineering Science in 1999. He is Professor of computer science, Chief scientist and Head of the Laboratory of Computer Security Problems of St. Petersburg Institute for Informatics and Automation. Igor Kotenko is also Head of the International Laboratory "Information security of cyber-physical systems", ITMO University, St. Petersburg. He is the author of more than 350 refereed publications. Igor Kotenko has a high experience in the research on computer network security and participated in several projects on developing new security technologies. For example, he was a project leader in the research projects from the US Air Force research department, via its EOARD (European Office of Aerospace Research and Development) branch, EU FP7 and FP6 Projects, HP, Intel, F-Secure, etc. The research results of Igor Kotenko were tested and implemented in more than fifty Russian research and development projects. The research performed under these contracts was concerned with innovative methods for network intrusion detection, simulation of network attacks, vulnerability assessment, security protocols design, verification and validation of security policy, etc. He has chaired several International conferences and workshops, and serves as editor on multiple editorial boards.
IGOR SAENKO graduated with honors from St. Petersburg Signal Academy. He obtained the Ph.D. degree in 1992 and the National degree of Doctor of Engineering Science in 2001. He is Professor of computer science and Leading Researcher of the Laboratory of Computer Security Problems of St. Petersburg Institute for Informatics and Automation. Igor Saenko is also senior researcher of the International Laboratory "Information security of cyber-physical systems", ITMO University, St. Petersburg. He is the author of more than 200 refereed publications and participated in several
Russian and international research projects. His main research interests are security policy management, access control, management of virtual computer networks, database management, knowledge modeling, soft and evolutionary computation, information and telecommunication systems.
ALEXANDER BRANITSKIY graduated with honors from the Department of Mathematics and Mechanics of St. Petersburg State University with a degree in “Software and Administration of Information Systems” in 2012. He is a junior researcher of Laboratory of Computer Security Problems of St. Petersburg Institute for Informatics and Automation. Alexander Branitskiy is also researcher of the International Laboratory "Information security of cyberphysical systems", ITMO University, St. Petersburg. He is the author of more than 20 refereed publications. Main research interests are information security, artificial intelligence, and computer networks.