Full Paper Proc. of Int. Conf. on Advances in Information and Communication Technologies 2012
© 2012 ACEEE DOI: 03.CSS.2012.3.17_1

A Qualitative Information Security Risk Assessment Model using Machine Learning Techniques

Mete Eminagaoglu
Yasar University / Dept. of Science Culture, Izmir, Turkey
Email: [email protected]

Abstract: One of the most crucial concepts in information security management is the proper and accurate assessment of information security risks. In this study, a new model is proposed for assessing qualitative information security risks. The study’s aim is twofold. The first aim is to design, derive and implement a unique information security risk analysis survey for a specific institution. The second aim is to implement an original machine learning classification model that deduces and prioritizes the risks from the data set derived from the survey results. The model is refined by observing and comparing the performance values of binary classifier algorithms’ train and test results. The results show that the model can be accepted as a successful prototype. Some recommendations for further improvements and new research areas are also included in the study.

Index Terms: information security risk assessment, qualitative risk analysis, information security risk survey, machine learning, binary classifiers

I. INTRODUCTION
In today’s business world and even in daily life, the proper and accurate assessment and management of information security risks has become a crucial issue. Several models, methodologies and standards have been developed to assess and analyze information security risks [1], [2], [3], [4], [5]. However, information security risks cannot always be estimated reliably, because each company or organization might face different risks, or the same risks at different levels, due to divergent environments, cultures, processes and organizations [6]. This creates a need for new methods and models for information security risk assessment and analysis. In recent years, several studies have implemented information security risk assessments using machine learning and similar computational intelligence and reasoning models such as fuzzy logic and belief functions [7], [8], [9], [10], [11]. Most of these studies focus either on multi-classifier machine learning models for quantitative risks or on technological aspects of information security such as network firewalls, intrusion prevention / detection systems, e-mail filter systems and so on [12], [13]. However, there are also information security risks that cannot be properly quantified, due to a lack of sufficient statistical data, their intangibility, or their basis in human factors. In such situations, there is a need for reliable and accurate automated qualitative risk assessment models which can be used easily by senior management without depending on the knowledge of information security experts. In addition, such new models must be implemented so as to minimize the drawbacks of qualitative risk methodologies such as subjectivity, uncertainty and false predictions [1], [2].

Machine learning provides the technical basis of data mining [14]. It is a technology for mining knowledge from data, in which a system is trained so that it can deliver improved and reliable performance in new situations with new data [14]. Data mining can be defined as the extraction of useful information from large data sets or databases. It is also widely accepted as a new discipline, lying at the intersection of statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and other areas [15]. Data mining is also accepted as a stage of a larger process known as “Knowledge Discovery in Databases” (KDD). Knowledge discovery is defined as a process for identifying new, valid, understandable and potentially useful patterns in large data sets. Hence, machine learning is used in cooperation with data mining; the two disciplines intersect, and some algorithms and models, such as BayesNet, decision trees and neural networks, are commonly used in both [14], [15], [16]. Some recent literature also uses the term “statistical learning” instead of machine learning, since most of the algorithms and methodologies are more or less based on statistical theory [17]. Machine learning has been applied successfully in areas such as optimizing industrial systems, uncovering financially valuable patterns, minimizing business risks, fraud detection, and the prediction and assessment of new investments [16], [17], [18].
II. MODEL AND THE CASE STUDY
In this study, a proprietary qualitative information security risk assessment model based on machine learning was implemented in one of the medium-sized public hospitals in Izmir, the third biggest city in Turkey with respect to population size and socio-economic parameters. Due to the privacy and legal concerns of the hospital managers, the name of the hospital is not given in this study. First, an information security survey was conducted. Then, a machine learning model was generated using the results obtained from the survey. The model was tested with different selected binary classifier algorithms, and the results were analyzed using some standard measures in order to obtain their comparative performances. The comments on the observations are given in the conclusion section of this paper. The processes and main steps of the study made it possible to derive a preliminary information security risk assessment model for qualitative risks; this model is shown in Figure 1. The proposed model is intended to be adapted and used for assessing and estimating information security risks with a qualitative approach, where the risks cannot be measured or quantified in real values.
Figure 1. The proposed model for qualitative information security risk assessment.

The first step in the proposed model is the definition of the information security risks and the collection of qualitative evaluations regarding those risks in an institution. It is recommended that this be done by conducting a survey rather than having a single manager or information security expert assign qualitative scores. The second step is to adjust the values collected in the surveys for the data analysis and machine learning implementation. Since binary classifiers were used in this study, the estimated scores for each of the risks must be converted into two classes, where some of the qualitative risk scores are accepted as above the organization’s acceptable risk level and the remaining ones are defined as below it. It is also necessary to refine the sample data collected from the surveys: obsolete or misleading answers must be discarded. This was done by manual analysis in this study; however, additional automated tools might be used in future implementations of this model.

The data are then input to the machine learning algorithms, and several different binary classifiers are executed using these data. It must be stressed that for each of the algorithms, not only train results but also test results must be obtained and observed. The selection of the best-suited algorithms must be based on both the train and the test results. The performance analysis also needs a minimum threshold value for accuracy, which could change from implementation to implementation or be decided according to the organization’s business requirements. After comparing the results, the algorithms that give the most satisfactory results are chosen. Then, another analysis must be done in order to decide which risks will be accepted as risky, i.e. above the organization’s acceptable risk level. The results are compared across all of the chosen algorithms, and the number of risky instances for each of the risks is counted and ordered, because the risks most commonly found to be risky across the algorithms are to be chosen as the essential risky ones. In this step, a suitable and reliable statistical methodology is required to interpret the distribution of the counted and ordered values. Since this is a preliminary model, quartiles are used, and the risks whose counts were above the second quartile (50th percentile) are selected as the risky ones for the organization. These are listed as the risk assessment output in an executive summary report, which is also the final phase in the model.

A. Information Security Risk Survey
The information security survey was implemented using a qualitative methodology. The survey’s respondents were the data entry operators and their department manager in the hospital. The department manager and the author of this study held several meetings to design the survey. Based on the major business processes in the data entry department and the data entry personnel, some of the specific information security risk issues of the organization were identified. During these meetings, the information security issues that cannot be quantified and those that were the most crucial were included in the survey’s scope. Hence only some specific assets, vulnerabilities and threats were taken into consideration. These are grouped and given in Table 1.

TABLE I. LIST OF THE ASSETS, THREATS AND VULNERABILITIES DEFINED IN THE INFORMATION SECURITY RISK SURVEY

It should be noted that in most information security risk assessment surveys or qualitative information security risk assessment systems, the risks are given in a format with the relevant asset name, impacting threat name and related vulnerability [1], [2], [5]. However, in this study the survey was designed in an alternative manner: only the risks were described and presented to the users in the questionnaire forms, and these risks’ corresponding assets, threats and vulnerabilities were analyzed, evaluated and measured by the experts afterwards. This approach enabled the users to understand the risks and answer the questions more easily and more effectively. For instance, one of the risks in the scope of the study was related to natural disasters and their impact on the hospital’s computers used for data entry. Rather than asking “if the asset is computer, the threat is natural disasters and the vulnerability is the lack of business continuity plans; how would you evaluate…”, the question was structured as “if the computers are to be damaged or malfunctioning due to natural disasters and also there is no business continuity plan in your organization, how would you evaluate…”.

A total of 6 assets, 10 threats and 9 vulnerabilities were included in the scope of the study. Thus, for each of the 6 assets, each of the 10 threats might impose a possible risk by exploiting each of the 9 possible vulnerabilities. Given this many-to-many relationship, there would be a total of (6 x 10 x 9) = 540 possible combinations, which would imply 540 distinct risks. However, in real-life situations most of these possible combinations and relations are neither relevant nor sensible, and their probability is 0. These were automatically discarded from the survey. Some of the other possible combinations were also left out of the survey’s scope due to the institution’s primary managerial requirements and strategic decisions. Consequently, the scope of the survey was limited to 30 distinct topics, or in other words, 30 possible combinations of assets, threats and vulnerabilities. The questionnaire forms were given to the survey respondents in print-out form, on which they had to fill in a total of 270 answers for 30 risks and 9 evaluation criteria. The evaluation criteria and relevant survey questions were also decided by the department manager and the author of this study. These questions were to be answered on a (Yes / No), (1 / 2 / 3 / 4 / 5) or (0 / 1 / 2 / 3 / 4) qualitatively ranked scale. They are shown separately in Table 2, Table 3 and Table 4.
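The arithmetic behind the survey scope can be checked with a short sketch. The asset, threat and vulnerability names below are hypothetical placeholders (the actual entries are those of Table 1), and the reduction from 540 candidates to 30 risks was a manual expert decision in the study, so it is not modeled here.

```python
from itertools import product

# Hypothetical placeholder names; the real entries are those listed in Table 1.
ASSETS = [f"asset_{i}" for i in range(1, 7)]                    # 6 assets
THREATS = [f"threat_{i}" for i in range(1, 11)]                 # 10 threats
VULNERABILITIES = [f"vulnerability_{i}" for i in range(1, 10)]  # 9 vulnerabilities


def candidate_risks(assets, threats, vulnerabilities):
    """Return every (asset, threat, vulnerability) triple as a candidate risk."""
    return list(product(assets, threats, vulnerabilities))


def answers_per_respondent(n_risks, n_criteria):
    """Number of answers on one questionnaire form."""
    return n_risks * n_criteria


candidates = candidate_risks(ASSETS, THREATS, VULNERABILITIES)
print(len(candidates))                # 6 * 10 * 9 = 540 candidate risks
print(answers_per_respondent(30, 9))  # 30 risks * 9 criteria = 270 answers
```

Enumerating the full product first and filtering afterwards mirrors the description in the text, where irrelevant combinations were discarded from the complete many-to-many set.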
TABLE II. SURVEY QUESTIONS EVALUATED WITH YES / NO
All of the employees in the data entry department, including their supervisor, were gathered in several group sessions. They answered the survey by giving scores to these 9 questions for each of the 30 risk relations and filled in the answer forms on paper.

TABLE III. SURVEY QUESTIONS EVALUATED WITH THE SCALE BETWEEN 0 TO 4
A total of 64 respondents attended the survey. The author of this study also attended all of these sessions, assisted the survey respondents and ensured that the surveys were carried out independently and objectively. After the answer sheets were collected, all of the qualitative survey scores were entered into an MS Excel file specially designed for the questionnaire, which also provided some cross-checks and necessary corrections.

TABLE IV. SURVEY QUESTIONS EVALUATED WITH THE SCALE BETWEEN 1 TO 5

B. Machine Learning Implementation with Binary Classifiers
The qualitative scores obtained from the respondents were analyzed by the department manager and the author of this study, and the threshold value for overall risk was agreed upon as “4”. The overall risk scores evaluated as “1, 2 or 3” by the survey respondents were defined as acceptable risks and marked as “Risk = No”, whereas all the other scores (“4” or “5”) were marked as “Risk = Yes”. Thus, the basic model was to estimate whether an instance was risky or not. This made the problem suitable for binary classifiers, in which each instance from the data set is assigned to one of two possible classes [14]. After this classification, the results were reorganized as a proper data set to be used as input for the machine learning classifiers. The entire data set was made of 12 attributes (9 survey questions plus asset, threat, vulnerability parameters and one resultant binary Risk class) and 1920 instances (64 x 30, where 64 survey respondents each answered 30 distinct risk topics). All of the attributes were set to non-numeric (nominal) type. Before the machine learning analysis, all the answers and questionnaire sheets were checked for irrelevant or missing answers. Such answers were discarded, and the resultant data set was reduced to 1750 instances, of which 940 were marked as “Risk = Yes” and the remaining 810 as “Risk = No”. This data set was then processed in the machine learning stage, where the performances of a total of 68 different binary classifiers were observed over approximately 3200 experiments with different parameter settings of the relevant classifier algorithms. In this research, all the experiments for the machine learning analysis and evaluation steps were carried out using Weka software (version 3.6.0).

For each of the 68 binary classifier algorithms, the data set was used for observing the learning performances (error rates and some other recommended metrics) with several data sample sets (train and test sets) using the cross-validation methodology [14], [19], [20]. For each of the performance observations among these classifier algorithms, there were two phases. The first phase was the training phase, where the whole data set was used for training the classifier model. In the second phase, the same data set was used as the test set with the aid of cross-validation [14]. 10-fold stratified cross-validation was chosen as a best-practice option [14]. This was a crucial point in this study, because no classifier algorithm can be evaluated reliably only by observing its performance values for training [19]. Some classifier algorithms are prone to over-fitting, which can be detected by using test sets as well as train sets [14]. The primary success or accuracy criterion for any of the classifier algorithms was to achieve an error rate of at most 10% on the test set. In other words, the sum of TP (True Positive) and TN (True Negative) classifications should be at least 90% of the entire test set. This can be simply formulated as follows:
$$C = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \qquad (1)$$

If $C \geq 90\%$, then accept as accurate.

In the formula above, FP stands for False Positives and FN stands for False Negatives in the test set. This implies that if a classifier algorithm correctly distinguishes at least 90% of the risky (Risk = Yes) and non-risky (Risk = No) instances in the test set, then it is accepted as a reliable risk learner and classifier. This threshold was chosen in line with the information security risk assessment model in this study and the department manager’s criteria. However, after the analysis of the preliminary results, this criterion had to be relaxed, and the minimum acceptable accuracy was set to 80% (a maximum error rate of 20%) for the test performances. All the classifier algorithms having the best performances with the optimized parameter values (only the values that were changed from their default values predefined in Weka) are given in Table 5. In addition to the error rates, all the algorithms were compared and evaluated with respect to their Kappa statistics and F-measures (weighted averages for “Risk = Yes” and “Risk = No”). These metrics are also recommended, alongside error rates, for observing the performance of machine learning classifiers [14], [15]. The algorithms were also comparatively analyzed for different data set sizes to observe whether the total size of 1750 instances in the study was sufficient for their learning process. Some of the results regarding the learning performances of these algorithms are also given in the “Results” section of this paper.

III. RESULTS
The results obtained for the binary classifier algorithms, using the whole data set for the train and test phases with 10-fold stratified cross-validation, are given in Table 5. It should be noted that there were some other algorithms with higher training performances, but they all had test accuracy values much lower than 0.8. All the algorithms listed in Table 5 are the ones with the best test performances. It can be seen from Table 5 that some of these algorithms are hybrid (ensemble) algorithms, in which a main classifier algorithm is combined with another classifier algorithm as a sub-function [14], [21]; the latter is labeled “sub-classifier” in Table 5.
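The acceptance criterion of Eq. (1) and the two additional metrics mentioned above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the Weka implementation used in the study: the function names are hypothetical, the Kappa statistic is computed with the standard Cohen's kappa formula, and the weighted F-measure is assumed to be the per-class F-measure averaged by the number of actual instances in each class.

```python
def to_risk_class(overall_score, threshold=4):
    """Map an overall risk score (1 to 5) to the binary class used in the study."""
    return "Yes" if overall_score >= threshold else "No"


def binary_metrics(tp, tn, fp, fn):
    """Return (accuracy in percent, Cohen's kappa, weighted F-measure)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total * 100  # Eq. (1)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / total
    p_chance = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total**2
    kappa = (p_observed - p_chance) / (1 - p_chance)

    # Per-class F-measure, then an average weighted by class support (assumption).
    f_yes = 2 * tp / (2 * tp + fp + fn)
    f_no = 2 * tn / (2 * tn + fp + fn)
    weighted_f = (f_yes * (tp + fn) + f_no * (tn + fp)) / total
    return accuracy, kappa, weighted_f


def is_accepted(accuracy, min_accuracy=80.0):
    """Apply the relaxed acceptance threshold on test accuracy."""
    return accuracy >= min_accuracy


# Illustrative confusion-matrix counts only, not results from the study.
acc, kappa, f_weighted = binary_metrics(tp=40, tn=40, fp=10, fn=10)
print(acc, kappa, f_weighted, is_accepted(acc))
```

Reporting Kappa next to accuracy is useful here because the two classes are not perfectly balanced (940 versus 810 instances), so part of the raw agreement could be due to chance.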
TABLE V. BINARY CLASSIFIERS WITH THE BEST PERFORMANCE RESULTS

All of these algorithms were also observed for their learning performances and learning curves, using random data sets of different sizes drawn from the whole data set. Each classifier algorithm’s accuracy and F-measure were observed for each of these data sets, and the results were analyzed. The sizes of the data sets were chosen as 87, 175, 350, 525, 700, 875, 1050, 1225, 1400, 1575, 1662, 1715 and 1750. The learning performances of these algorithms were observed, and it was shown that the data set size was sufficient for all of these classifier algorithms’ learning process. Some of these results are given, with their learning curve progression, in Table 6 and Figure 2 for the “RandomForest” algorithm with the parameters set as “Number of Trees = 92” and “Maximum Depth = 3”; the details of these settings can be found in similar studies [14], [22], [23].

Figure 2. Learning curve for RandomForest algorithm.

TABLE VI. LEARNING PERFORMANCE FOR RANDOMFOREST CLASSIFIER

The last step in the qualitative information security risk assessment was to compare and analyze the results (risk predictions) of all of the chosen classifier algorithms with respect to the specific risks classified as risky and non-risky. The resultant data from the Weka software were exported to another table and compared. The intersection sets of risky and non-risky results for each of the classifier algorithms were analyzed. As a result, sixteen different risks were identified as “risky” for the hospital. These results were also shared with the hospital executives and the head of the data entry department, who accepted the results as feasible. These sixteen risks are shown in Table 7, where their corresponding asset, threat and vulnerability instances are also given.
It should be noted that the total number of instances (tuples) that were predicted and classified as “risky” by the algorithms was counted for each risk, and the counts were then grouped into quartiles, which is a descriptive statistical method [24]. Only the risks having an instance count above the second quartile (50th percentile) were chosen as the risky ones.
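The counting and quartile cut described above can be illustrated with a minimal sketch. The data layout (a mapping from classifier name to a list of `(risk_id, predicted_class)` pairs) and the function name are assumptions made for illustration, and the second quartile is taken as the ordinary sample median; other sample quantile definitions [24] could move the cut-off slightly.

```python
from collections import Counter
from statistics import median


def select_risky(predictions_by_classifier):
    """Count "Risk = Yes" predictions per risk over all classifiers and
    return (risks whose count is above the second quartile, all counts)."""
    counts = Counter()
    for predictions in predictions_by_classifier.values():
        for risk_id, predicted_class in predictions:
            # Adding 0 still registers the risk, so never-risky risks are counted.
            counts[risk_id] += 1 if predicted_class == "Yes" else 0

    second_quartile = median(counts.values())  # 50th percentile of the counts
    selected = sorted(r for r, c in counts.items() if c > second_quartile)
    return selected, dict(counts)


# Toy predictions for two hypothetical classifiers and four hypothetical risks.
toy = {
    "classifier_a": [("R1", "Yes"), ("R1", "Yes"), ("R2", "Yes"),
                     ("R2", "No"), ("R3", "No"), ("R4", "No")],
    "classifier_b": [("R1", "Yes"), ("R2", "Yes"), ("R3", "Yes"), ("R4", "No")],
}
risky, counts = select_risky(toy)
print(counts)  # per-risk number of "Yes" predictions
print(risky)   # risks above the median count
```

Using a strict "greater than" comparison keeps risks that sit exactly on the median out of the final list, which matches the wording "above the second quartile".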
TABLE VII. 16 RISKS CLASSIFIED AS RISKY (ABOVE THE HOSPITAL’S ACCEPTABLE LEVEL) BY THE CLASSIFIER ALGORITHMS

CONCLUSIONS
In this study, a qualitative information security risk assessment model was developed with the aid of machine learning classifier algorithms. Since the risk deduction is based on two classes (risky and non-risky), binary classifier algorithms were shown to be a suitable model for the hospital’s information security risk assessment. Similar models and promising implementations can also be derived for other hospitals, organizations and companies in similar or different business / service areas. In addition, based on this model, new models can be generated for other information security domains if risks are to be predicted by qualitative assessments. It was also observed that the data set size and sample test data sizes were sufficient for the model, which made it possible to observe over-fitting issues and to deduce whether the learning curve had reached its maximum level.

One of the crucial aims of this study was to derive a model and methodology which could predict or assess qualitative information security risks in a more reliable and robust way, rather than relying on the subjective judgment of experts or deciding only upon the descriptive statistics derived directly from the survey answers. The results showed that, with the aid of machine learning algorithms, this aim could be fulfilled to some extent. Another important aspect of this study is the parameter selection and usage within the classifier algorithms. Some of the parameters might be changed and additional results could be observed with the same data set. In this way, enhanced performance values for the learning capabilities of these algorithms might be obtained. Similarly, additional experiments could be made using some other suitable classifier algorithms.

In this study, the survey and the questionnaire form were designed so that only the risks themselves were described in single sentences, and they were given to the users without mentioning their corresponding assets, threats and vulnerabilities. This was a strategic decision taken during the design of the questionnaire, made in order to simplify the questions for the users during the survey. It was observed that the users answered the questions and responded to the survey more easily and more effectively. This approach can be used similarly in all types of qualitative information security risk assessment surveys / questionnaires, and it can also be implemented in qualitative information security risk assessment applications and tools.

It should also be mentioned that a subjectivity problem could exist in this study (and in similar studies as well) due to the values / scores provided by employees in qualitative risk assessments and relevant surveys. This problem might be mitigated by adding new mechanisms and methods to the risk assessment process. In order to re-evaluate and decide which risk instances will be treated as the “risky” ones, quartiles were preferred in this study. However, for new implementations of this model, some other statistical or numerical methods might be used instead of quartiles. In addition, some fuzzification / defuzzification methodologies might also be implemented so that, instead of counting and ranking the total number of instances, an alternative quantification and evaluation could be achieved.
ACKNOWLEDGMENT
The author wishes to thank Prof. Dr. Saban Eren and Assoc. Prof. Dr. Yilmaz Kilicaslan for their valuable collaboration and support regarding the statistical analyses and machine learning implementations in this study. The author would also like to thank Assoc. Prof. Dr. Yilmaz Göksen for enabling the formal endorsement from the hospital for the information security survey and for conducting the survey.

REFERENCES
[1] T. R. Peltier, Information Security Risk Analysis. USA: Auerbach Publications, 2001.
[2] D. J. Landoll, The Security Risk Assessment Handbook - A Complete Guide for Performing Security Risk Assessments. USA: Auerbach Publications, 2006.
[3] ISO, Information Security Management Systems - Requirements ISO/IEC 27001:2005. ISO pub., 2005.
[4] H. F. Tipton and M. Krause, Information Security Management Handbook. USA: Auerbach Publications, 2007.
[5] ISO, Information Security Risk Management ISO/IEC 27005:2008. ISO pub., 2008.
[6] S. L. Pfleeger and R. K. Cunningham, “Why measuring security is hard,” IEEE Security & Privacy, vol. 8(4), pp. 46-54, 2010.
[7] W. Qu and D. Z. Zhang, “Security metrics, models and application with SVM in information security management,” IEEE Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, vol. 6, pp. 3234-3238, August 2007.
[8] J. J. Lv, W. H. Qiu, Y. Z. Wang and N. Zou, “A study on information security optimization based on MFDSM,” IEEE Proceedings of the Fifth International Conference on Machine Learning and Cybernetics, pp. 2732-2736, August 2006.
[9] L. Sun, R. P. Srivastava and T. J. Mock, “An information systems security risk assessment model under Dempster-Shafer Theory of belief functions,” Journal of Management Information Systems, vol. 22(4), pp. 109-142, 2006.
[10] X. Long, Q. Yong and L. Qianmu, “Information security risk assessment based on analytic hierarchy process and fuzzy comprehensive,” IEEE Proceedings of the 2008 International Conference on Risk Management & Engineering Management, pp. 404-409, November 2008.
[11] D-M. Zhao, J-H. Wang, J. Wu and J-F Ma, “Using fuzzy logic and entropy theory to risk assessment of the information security,” IEEE Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, vol. 4, pp. 2448-2453, August 2005.
[12] T. Pietraszek and A. Tanner, “Data mining and machine learning - towards reducing false positives in intrusion detection,” Information Security Technical Report, Elsevier, vol. 10(3), pp. 169-183, 2005.
[13] G. Giacinto, F. Roli and L. Didaci, “Fusion of multiple classifiers for intrusion detection in computer networks,” Pattern Recognition Letters, vol. 24(12), pp. 1795-1803, 2003.
[14] I. H. Witten, E. Frank and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed. Burlington, MA: Elsevier Inc., 2011.
[15] C. Hand, H. Mannila and P. Smyth, Principles of Data Mining. London: The MIT Press, 2001.
[16] N. Nedjah, M. M. Luiza and J. Kacprzyk, Innovative Applications in Data Mining. Berlin: Springer-Verlag, 2009.
[17] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, Data Mining, Inference and Prediction, 2nd edition. New York: Springer, 2009.
[18] D. Delen, R. Sharda and P. Kumar, “Movie forecast Guru: A Web-based DSS for Hollywood managers,” Decision Support Systems, vol. 43(4), pp. 1151-1170, 2007.
[19] G. M. Weiss and F. Provost, “Learning when training data are costly: the effect of class distribution on tree induction,” Journal of Artificial Intelligence Research, vol. 19, pp. 315-354, 2003.
[20] M. Vuk and T. Curk, “ROC curve, lift chart and calibration plot,” Metodoloski Zvezki - Advances in Methodology and Statistics, vol. 3(1), pp. 89-108, 2006.
[21] E. Alpaydın, Introduction to Machine Learning, 2nd edition. USA: The MIT Press, 2010.
[22] L. Breiman, “Random Forests,” Statistics Department, University of California, Berkeley, pp. 1-35, 1999.
[23] A. Liaw and M. Wiener, “Classification and Regression by randomForest,” R News, vol. 2/3, pp. 18-22, 2002.
[24] R. J. Hyndman and Y. Fan, “Sample quantiles in statistical packages,” American Statistician, vol. 50(4), pp. 361-365, 1996.