Chuchro, M., Szostek, K., Piórkowski, A., Danek, T.: Using Data Mining Techniques for Diagnostic of Virtual Systems Under Control of KVM. In: Emerging Trends in Computing, Informatics, Systems Sciences, and Engineering, LNEE, vol. 151, Springer, 2013, pp. 1011-1022.

Using Data Mining Techniques for Diagnostic of Virtual Systems under Control of KVM

Monika Chuchro, Kamil Szostek, Adam Piórkowski, and Tomasz Danek
Department of Geoinformatics and Applied Computer Science, AGH University of Science and Technology, Cracow, Poland.
[email protected], {szostek, pioro, tdanek}@agh.edu.pl

Abstract — Analysis of the logs of remote network services is one of the most difficult and time-consuming tasks, as both the amount and the variety of logs keep growing. With the increasing number of services, the volume of logs generated by computer programs grows until their analysis becomes impossible for the common user. Yet this analysis is essential, because it provides a large amount of information necessary for keeping the system in good shape and thus ensuring the safety of its users. All methods of filtering the relevant information, which reduce the logs for further analysis, require human expertise and considerable work. Nowadays researchers take advantage of data mining techniques such as genetic algorithms, clustering algorithms and neural networks to analyze system security logs in order to detect intrusions or suspicious activity. Some of these techniques achieve satisfactory results, yet they require a very large number of attributes gathered from network traffic to extract useful information. To address this problem we use and evaluate several data mining techniques (decision trees, correspondence analysis and hierarchical clustering) on a reduced number of attributes, over log data sets acquired from a real network, in order to classify traffic logs as normal or suspicious. The results obtained allow an independent interpretation and make it possible to determine which attributes were used to reach a decision. This approach reduces the number of logs the administrator is forced to view, improves efficiency, and helps identify new types and sources of attacks.

Index Terms — data mining, Kernel Virtual Machine

I. INTRODUCTION

Since the 1990s, various data access services have become available through the Internet. This has changed our lives and society completely. On the one hand, it facilitates and speeds up work and contact with others; on the other hand, it is also a new source of danger and has become a new venue for the development of crime. It is therefore necessary to design a system that can analyze the logs of virtual machines in real time, to minimize the threat posed by the high activity of multiple users working on a virtual machine server at the same time. Logging itself is a simple activity, but as services generate ever more logs, various concerns appear. They are connected with the amount of storage, the time required to analyze the logs, and the difficulty of that process. The storage requirement is

This is the accepted version. The official version is available on: http://link.springer.com/chapter/10.1007%2F978-1-4614-3558-7_86

the only hardware concern, and it is in fact not a real problem today. All the others result in the necessity of analysis by a human expert, which is unfortunately slow and expensive [1]. With the increasing use of computers in every area of life, the question arose whether it is possible to replace some physical machines with virtualization. Nowadays this idea has become feasible thanks to high-performance, low-cost hardware. With virtualization, from a single physical machine we can create even a few dozen full machines ready to work. This idea is also advantageous from the standpoint of environmental protection: the energy consumed by a single physical machine serves at least a few workstations. The solution also opens up new opportunities for education (easy adaptation for use in e-learning). Furthermore, the lower costs of maintaining and operating the machines allow new workplaces to be created in places where, for economic reasons, this was not previously possible. However, such explosive growth in the number of machines may cause high risk, due to the physical difficulty of interpreting the quantity of logs supplied.

The methods of analysis presented in this paper were inspired by the architecture of the workstations in the university computer rooms. The computers in the classrooms are terminals connecting to a server which, thanks to the installed software, provides hardware virtualization. KVM allows the same environment to be shared by all students. This solution is very convenient from the viewpoint of the administrator maintaining the system: it does not require as much commitment as maintaining the system on traditional machines. It also poses several risks, however. It is therefore necessary to design a system that can analyze the logs of the virtual machines in real time, to minimize the threat posed by the high activity of multiple users working on the virtual machine server at the same time.
Today the network administrator is equipped with tools and techniques that can be used to analyze system logs, such as text parsers and regular expression searchers, artificial ignorance, and intrusion detection systems:

A. Grep
Grep is a simple and useful tool that can be used to search for patterns in logs, which can improve the efficiency of log analysis

in an easy way. This solution is simple, intuitive and quite powerful, and is often used by administrators to obtain relevant information from the large amounts of output generated by a computer system. However, this method is inefficient, and even impossible to use when handling a larger number of machines, because the process of retrieving relevant information is only slightly automated (all decisions must be taken by the administrator).

B. Artificial ignorance
Artificial ignorance is a method that filters, orders and evaluates logged system events to collect the important entries. This technique requires the network administrator to have knowledge of the system and the events that occur in it, in order to separate ordinary events from real threats.

C. Data mining methods
This approach boils down to the use of advanced mathematical and statistical methods to forecast future events based on the analysis of data collected in the past. Methods based on genetic algorithms, artificial neural networks and text analysis can yield much better results, because they evolve along with the amount of collected and analyzed data [2].

Very frequent security incursions usually do not have specific targets. Worms and malware (malicious software) contain exploits which are used to create networks of compromised machines (botnets) by infecting as many computers as possible. Such a network is then used to perform more sophisticated attacks, which today are often connected to criminal organizations. These botnets may perform attacks on an enterprise or institution that result in denial of service. What is more, botnets may be used to send unsolicited email (spam) or to hinder identification of the attacker. Data mining requires the analysis and characterization of large amounts of data [1]. The large quantities of logs generated by the services running across multiple machines can be a useful source of the information necessary for the proper functioning of data mining mechanisms.
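The grep-based filtering described in subsection A can be sketched in a few lines. This is a minimal illustration only; the log lines and the pattern below are invented for the example and do not come from the paper's data.

```python
import re

# Hypothetical auth-log excerpt; real KVM/SSH logs differ in detail.
logs = [
    "sshd[311]: Accepted password for alice from 10.0.0.5",
    "sshd[312]: Failed password for root from 192.0.2.7",
    "sshd[313]: Accepted publickey for bob from 10.0.0.9",
    "sshd[314]: Failed password for admin from 192.0.2.7",
]

# Equivalent of `grep -E 'Failed|invalid' auth.log`: keep only the
# lines matching a suspicious pattern, discarding the rest.
pattern = re.compile(r"Failed|invalid", re.IGNORECASE)
suspicious = [line for line in logs if pattern.search(line)]

for line in suspicious:
    print(line)
```

This also makes the scaling problem visible: the administrator must still read every matching line and decide what it means.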
In our case, inference methods will be used to predict the possible risks of system failure on the server systems where the virtual machines are located. We will also try to predict intrusions, whether by users of the virtual machines or from outside the system. The application of data mining methods allows one administrator to keep several machines in good condition. Virtual servers are often used to share the computing power of the machine on which they are located. This environment allows virtual machines to be created: a network card, hard drive and graphics card are virtualized for every virtual machine. The performance is much higher than with a purely software solution, because KVM uses the hardware virtualization technology found in newer processors. Under KVM it is possible to install, and run at the same time, any number of Linux, Windows and other systems.


II. INTRODUCTION TO THE ANALYSIS

The data used in the analysis were gathered from virtual machines used by students, so the volume of logs generated by the users is huge and the logs are very diverse. With such a huge amount of data, exploration using data mining methods is the most practical way to analyze the accumulated information with respect to possible risks.

A. Pre-processing
Data mining algorithms require the dataset to be assembled, and the dataset must be large enough to contain the patterns which the algorithm is to uncover. Preprocessing is the most important step for multivariate logs, before any clustering or data mining. The target set is then cleaned by removing noise and missing data. The results can be described as feature vectors: summarized versions of the raw data observations.

B. Tasks of data mining
The tasks of data mining are as follows:
- Classification – the data are arranged into predefined groups. Decision tree learning, nearest neighbor, naive Bayesian classification and neural networks are used at this stage.
- Clustering – the algorithm tries to group similar items together.
- Regression – attempts to model the data.
- Association rule learning – searches for relationships between variables. The purpose of association rules is to help find items that imply the presence of other items in the same moment or transaction. This tool is very useful in supermarket management: it helps uncover shopping patterns for groups of customers, for example the most famous shopping rule: man + Friday + diapers = beer.

C. Results validation
The last step of the process is the verification of the patterns discovered by the data mining algorithms. The most important question is whether the patterns exist in wider data sets. Sometimes, when the learning process is very advanced, algorithms can find patterns between data sets which do not exist in general. This is called overfitting.
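One common guard against overfitting is to split the data into learning, testing and validation sets while keeping the class proportions equal in each part. A minimal sketch of such a stratified split, using invented labelled records rather than the actual experimental data:

```python
import random

# Toy labelled log records: True = suspicious, False = normal (20% suspicious).
records = [(f"log-{i}", i % 5 == 0) for i in range(100)]

def stratified_split(data, fractions=(0.6, 0.2, 0.2), seed=42):
    """Split into learning/testing/validation sets so that each part
    keeps roughly the same proportion of every class (the label)."""
    rng = random.Random(seed)
    by_class = {}
    for item in data:
        by_class.setdefault(item[1], []).append(item)
    parts = [[], [], []]
    for items in by_class.values():
        rng.shuffle(items)
        n = len(items)
        a = int(n * fractions[0])
        b = a + int(n * fractions[1])
        parts[0] += items[:a]
        parts[1] += items[a:b]
        parts[2] += items[b:]
    return parts

train, test, valid = stratified_split(records)
```

Each of the three parts ends up with the same 20% share of suspicious records, so a pattern learned on the first set can be fairly checked on the other two.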
To detect overfitting, the data are divided into three parts: a learning set, a testing set and a validation set. The algorithms learn on the first set and are then tested on the second and third, to check whether the pattern found is real. If the pattern found does not exist in the testing or validation set, either the pattern does not exist in general or there was a problem with the assignment of data to the sets; each set should contain the same percentage of each kind of instance.

D. Kernel-based Virtual Machine (KVM)
The latest open source virtualization is represented by the Kernel-based Virtual Machine (KVM) project, whose objective was to produce a modern hypervisor built on the experience of previous generations of technologies and taking advantage of modern hardware. The Linux kernel can be converted into a bare metal hypervisor thanks to a loadable kernel module, which is the KVM

implementation. The KVM project turned into a stable and high-performance hypervisor very quickly thanks to the adoption of two key design patterns, which have also made KVM the leader among open source hypervisors [3].

E. Data mining software
This article focuses on analyses carried out in the most popular programs for data mining. We tried to compare the performance and capabilities of commercial software with those of free software, and to choose the best solution for our application. Since at the present stage our project does not yet run in real time, we could afford to survey the available technologies and select the best of them in terms of capability, speed of calculation and availability. For the analysis of the textual data referred to in this article, data analysis software with appropriate modules for data collection and analysis is well suited. In our considerations we focus mainly on:
- WEKA (Waikato Environment for Knowledge Analysis) – a popular machine learning suite developed at the University of Waikato. WEKA is free software available under the GNU General Public License. It supports several standard data mining tasks: data preprocessing, clustering, classification, regression, visualization and feature selection.
- Statistica (trademarked in capitals as STATISTICA) – a statistics and analytics software package developed by StatSoft. Statistica provides data analysis, data management, data mining and data visualization procedures.
- RapidMiner (formerly YALE, Yet Another Learning Environment) – an environment for machine learning and data mining experiments. It allows experiments to be composed of a large number of arbitrarily nestable operators, described in XML files created with RapidMiner's graphical user interface. RapidMiner is used for both research and real-world data mining tasks and is distributed under a GNU license.
It also integrates the learning schemes and attribute evaluators of the WEKA learning environment.
- WordStat – a commercial text mining tool from Provalis Research, integrated with SimStat and QDA Miner.
- Microsoft Analysis Services (part of Microsoft SQL Server, a database management system) – Microsoft has included in SQL Server a number of services related to Business Intelligence and Data Warehousing. Similar functions are also present in the latest versions of MS Excel, which can easily be applied to a rough analysis of simple data.
We are interested in the analysis methods these programs offer and in their performance add-ons. The text analyses performed in the next paragraphs are also good tests of the usefulness of the above-mentioned applications for beginner and advanced data miners.


III. METHODOLOGY

A. Text analysis
Text mining is the process of extracting meaningful information from unstructured data [4]. The method can be used for database texts, web pages, e-books, server logs, e-mails and other text documents. Text mining also makes it possible to compare two or more documents. It includes text categorization, text clustering, relation modeling, sentiment analysis, document summarization and entity relation modeling [5]. Nowadays there are many open source and commercial programs for text mining analysis:
- R (tm extension package),
- WEKA,
- Statistica,
- WordStat from Provalis Research,
- RapidMiner,
- MS SQL Server.
The task of text categorization is to assign a document to one or more categories [5], based on the document's contents. Classification can be divided into two methods: supervised and unsupervised document classification. The first applies when some external mechanism exists, for example human feedback; in the second case classification is made without reference to external information. Here we can use neural networks, decision trees, Bayes models, k-nearest neighbours, genetic algorithms, case-based reasoning and fuzzy logic.
Text clustering is closely related to the concept of data classification. The main issue of this method is to find which documents resemble each other: documents sharing common words are stored in the same group, and numbers indicate how often each word occurs in a document. In the same way, clustering can be performed inside a document, on its paragraphs. Cluster analysis has two main types: hierarchical and partitional clustering.
Correspondence analysis unveils the distribution of words among subgroups relative to the total distribution, or the relationships between words [6]. The interpretation of correspondence analysis is easy with 3D or 2D graphs, but it should be carried out with caution.
In some text mining tools it is possible to perform automated text classification.
This is a supervised machine learning task. The method can be used to automatically classify documents into the proper categories or to find relevant keywords. The same applies to correspondence analysis: data mining programs can find connections between words and sentences, but a scientist is needed to judge whether the rule found is true.

B. Text analysis process
Text analysis with text mining consists of 5 steps:

The first of them is text preprocessing. The second is the transformation of the text, and the next is feature selection [7]. The fourth step is pattern discovery with data mining tools. Interpretation and evaluation are the fifth and last step of the analysis. All the programs mentioned above have similar functions, so for the analysis we chose WordStat 6.0.1 with its graphical interface; it is the easiest tool for beginning users.
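The supervised text categorization described above can be sketched as a tiny naive Bayes classifier. The training lines below are invented for the illustration; the paper itself used WordStat rather than hand-written code.

```python
import math
from collections import Counter

# Invented labelled log lines: supervised categorization into
# "normal" vs "suspicious".
train = [
    ("accepted password for user", "normal"),
    ("session opened for user", "normal"),
    ("failed password for invalid user", "suspicious"),
    ("authentication failure unknown user", "suspicious"),
]

# Count word frequencies per category (naive Bayes, Laplace smoothing).
word_counts = {"normal": Counter(), "suspicious": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def classify(text):
    scores = {}
    for label in word_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("failed password for unknown user"))  # prints "suspicious"
```

With human-labelled examples as the external mechanism, the classifier generalizes to unseen lines; the same "scientist must judge the rule" caveat from the text applies here too.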

IV. RESULTS

Pattern discovery from the data was performed in WordStat with the English dictionary and without added substitution and exclusion dictionaries.

Analysis with WordStat 6.0.1 from Provalis Research
Importing data into the program is very easy. For the analysis it is possible to use documents in the following formats:
- Microsoft Excel xls and xlsx,
- SPSS sys and sas,
- MS Access mdb,
- Paradox db,
- Lotus/Symphony wk and wr,
- Quattro Pro wq and wb,
- CSV and TAB, MMO, SSS, XML.
There is also a document conversion wizard for many other file extensions. The next step is to make a dictionary and specify how the textual information should be processed and which words should be ignored.
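Two of the word-level processing options configured at this stage, removing common English suffixes and decomposing words into character sequences of length 3, 4 or 5, can be sketched as follows. This is an illustration only, not WordStat's actual implementation, and the suffix list is a small invented sample.

```python
# A small illustrative suffix list; WordStat's real rules are richer.
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def strip_suffix(word):
    """Remove the first matching common English suffix, if the stem
    that remains is long enough to be meaningful."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def char_ngrams(word, n):
    """Decompose a word into overlapping character sequences of length n."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(strip_suffix("failed"))     # "fail"
print(char_ngrams("login", 3))    # ['log', 'ogi', 'gin']
```

Suffix stripping merges inflected forms of the same word, while character n-grams make the analysis robust to misspellings.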

Fig 2. Cross table.

During this process a univariate frequency analysis was made [8]. The frequency matrix may contain included, leftover or unknown words, with the count or percentage of occurrences in the document. During this analysis it was discovered that words connected with some threat appear in only 1% of all logs; the most common word in the logs was "accepted", at 99%. Examining the relationships between the included categories was the second step of pattern discovery (Fig. 3). It was made using cross tables (Fig. 2). Here we discover words which occur together; these connected words will help with finding threats, for example "password + change + failed" or "login + failed or unknown".
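The cross-table idea, counting how often pairs of words occur in the same log line, can be sketched as follows. The log lines are invented for the example; in the paper the cross tables were produced by WordStat.

```python
from collections import Counter
from itertools import combinations

# Invented log lines standing in for the real data set.
logs = [
    "accepted password for alice",
    "failed password for unknown user",
    "failed password for invalid user",
    "accepted publickey for bob",
]

# Cross-tabulate: count in how many lines each pair of words co-occurs.
pair_counts = Counter()
for line in logs:
    words = sorted(set(line.split()))
    pair_counts.update(combinations(words, 2))

print(pair_counts[("failed", "password")])  # co-occurs in 2 lines
```

Pairs with unusually high co-occurrence, such as "failed" with "password", are exactly the connected words that point at threats.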

Fig 1. Input data.

The preprocessing step was made in SimStat 2.5.7 [7]. The data were imported into the program, checked and saved as a dbf file (Fig. 1). In this step basic statistics can be computed, if the data have numerical variables. After switching to WordStat, preprocessing continues: common English suffixes can be removed, and each word can be decomposed into sequences of 3, 4 or 5 characters. In the transformation process individual words can be replaced with another word or a sequence of words; automated correction of common misspellings also belongs to this step of the analysis. Words that should not appear in the analyzed document are added to the exclusion list in the exclusion process. Feature selection is performed to obtain a better subset which describes the data from the original dataset; the main aim of this step is to choose the relevant data and reduce the amount of data. In our log data all variables are relevant, but if many log files are chosen for the analysis, feature selection becomes an important step.


Fig 3. Trust data.

Correspondence analysis of the data is the next step. The correspondence analysis graph for this case is shown in Fig. 4. Here we can make 2D or 3D graphs which show the relationships between words [8].

VI. FUTURE WORK

Fig 4. Correspondence analysis graph.

The last step in pattern discovery is making a dendrogram: a graphical presentation of hierarchical clustering [8] (Fig. 5).
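The hierarchical clustering behind a dendrogram can be sketched with a minimal single-linkage procedure. The word frequency vectors below are invented for the example; the dendrograms in the paper came from WordStat.

```python
# Word -> frequency profile across three invented log categories.
vectors = {
    "failed":   (9, 1, 0),
    "invalid":  (8, 2, 0),
    "accepted": (0, 9, 8),
    "session":  (1, 8, 9),
}

def distance(a, b):
    # Euclidean distance between two frequency profiles.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

clusters = [[w] for w in vectors]
merges = []  # the merge order is what a dendrogram draws

while len(clusters) > 1:
    # Find the closest pair of clusters (single linkage:
    # the minimum distance between any two of their members).
    i, j = min(
        ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
        key=lambda ij: min(
            distance(vectors[u], vectors[v])
            for u in clusters[ij[0]] for v in clusters[ij[1]]
        ),
    )
    merges.append((clusters[i], clusters[j]))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

print(merges[0])  # the two most similar words merge first
```

Reading the merge list bottom-up gives the dendrogram: "failed" and "invalid" join first because their profiles are nearly identical, and the threat-related branch separates from the normal-activity branch only at the top.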

The article describes only a proposal: the possibility of using data mining methods for the analysis of system logs. Effective analysis of the system logs recorded by computer programs can prevent attacks, system failures and data loss, so optimizing the process of log analysis is very important. The project is currently only in the planning phase of implementation, and so far only a few methods have been used. The next step of the project will be a full and complex analysis with classification tools using neural networks, fuzzy logic and many more. Different methods of text mining will also be applied, to find proper and useful tools for detecting threats in logs. Ultimately, we want to create a complete system for the analysis of the aggregated logs of the KVM server and the surrounding machines, using the chosen tools. The program would allow the performance of the machines to be profiled, as well as warn of potential threats and intrusions in real time. The system will be used to monitor the virtual machines on the servers of the university where classes are conducted with students. It can therefore be expected that in the course of system administration classes the students will generate many potential risks, allowing the creation of inference rules for our system. Ultimately, we want to create a web application allowing users to monitor the system status and to manage it entirely online.

ACKNOWLEDGMENT
This work was financed by the AGH University of Science and Technology, Faculty of Geology, Geophysics and Environmental Protection, as a part of statutory project number 11.11.140.561.

REFERENCES


Fig 5. Dendrogram.


V. CONCLUSION

Log analysis using data mining methods can significantly speed up the retrieval of the most important information within the collected data. Searching for patterns using automated text-search methods prevents the human errors that result from the omission of relevant information. Moreover, the use of mathematical reasoning for a thorough analysis of the relationships within huge amounts of data (a quick comparison of the results of many observations, even in real time) can often detect threats which would be impossible for a human to identify. In addition, analyzing the data from the right angle can prevent likely future problems, forecast on the basis of phenomena that have already been registered. Log analysis using data mining methods is a rapidly growing field with a large range of possibilities before us.



[1] A. Grégio, R. Santos, A. Montes, "Evaluation of data mining techniques for suspicious network activity classification using honeypots data," Data Mining, Intrusion Detection, Information Assurance, and Data Networks Security 2007, Proceedings of the SPIE, vol. 6570, 2007.
[2] M. Chuchro, A. Piórkowski, "Methods and tools for data mining of intensity variability inlet to municipal wastewater treatment plant," Studia Informatica, vol. 31, no. 2B, Gliwice 2010, pp. 347-358.
[3] A. Shah, "Kernel-based virtualization with KVM," Linux Magazine, vol. 86, 2008.
[4] R. Feldman, J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, Cambridge University Press, 2007, pp. 64-92.
[5] W. Himmel, U. Reincke, H. W. Michelmann, "Text mining and natural language processing approaches for automatic categorization of lay requests to web-based expert forums," J Med Internet Res, 2009;11(3):e25.
[6] M. Rajman, M. Vesely, "From text to information: document processing and visualization, a text mining approach," in Text Mining and its Applications: Results of the NEMIS Launch Conference, S. Sirmakessis (ed.), Springer, Berlin, 2003, pp. 7-25.
[7] WordStat 6 Content Analysis Module for QDA Miner & SimStat – User's Guide, Provalis Research, http://www.provalisresearch.com/Documents/WordStat6.pdf.
[8] R. Ivancsy, I. Vajk, "Frequent pattern mining in web log data," Acta Polytechnica Hungarica, vol. 3, no. 1, 2006, pp. 76-90.