World Congress on Internet Security (WorldCIS-2014)
Anomaly Detection System: Towards a Framework for Enterprise Log Management of Security Services

Omer Ozulku, ECS, University of Southampton, UK ([email protected])
Nawfal F. Fadhel, ECS, University of Southampton, UK ([email protected])
David Argles, ECS, University of Southampton, UK ([email protected])
Gary B. Wills, ECS, University of Southampton, UK ([email protected])
Abstract— In recent years, enterprise log management systems have been widely used by organizations, and several companies (IBM, McAfee, Splunk, etc.) have brought their own log management solutions to the market. However, these systems often require proprietary hardware and do not apply web usage mining to analyse the log data. This paper investigates an approach towards a framework for managing security logs in enterprise organizations, called the Anomaly Detection System (ADS), built to detect anomalous behaviour inside computer networks. The ADS is free from hardware constraints and benefits from web usage mining to extract useful information from the log files.

Keywords— anomaly detection; enterprise log management; web usage mining algorithm; RESTful style log data collection
I. INTRODUCTION
Securing the Information Technology (IT) environment is of paramount concern in cyber security. Security-related metadata such as logs and audit information are crucial for an organization, especially for distributed IT systems. Gathering security data from various sources and analysing it can be challenging, because devices, applications and servers usually produce logs in different file formats and in different storage locations. For non-cyber-security professionals, the vast amount of log files in different locations and formats makes them hard to access and understand, as interpreting them requires a high level of IT skill and knowledge. There are a number of proprietary solutions for centralised log management designed by companies such as IBM, McAfee and Splunk. However, this paper investigates an approach towards a framework that facilitates log management at the enterprise level: a system able to collect and analyse log data from various sources and to provide an easy-to-use interface. Data sources include web servers, operating systems, networking devices, or any kind of system that produces logs that can be used to understand the current status of the organization's security infrastructure.

A. Approach and application specification

The objective of this approach is to help organizations manage security-related data that is stored in different storage locations with different file formats (centralised enterprise log management). The objectives of this system are:
• Collection of data logs from different network infrastructure servers.
• Monitoring of the Internet Information Server (IIS) via its log files, parsing and analysing them in order to generate security alerts.
• Extraction of meaningful information using a web usage-mining algorithm.

The approach is demonstrated through an application developed using .Net and targeted to work on all Windows platforms. A hybrid software development methodology was applied, in order to develop the system within the limited time available while responding to possible changes in requirements. The development was divided into two major increments: the first is the main application, which collects security-related data from different sources via related software interfaces; the second is a sensor application that sends security log files from IIS to the main application.

B. Application

In response to the objectives of the system, an application was designed to fulfil the following:
• The system can be integrated with any organization, collecting security-related log data into a central database and application.
• Parsing and analysing different kinds of log data allows the detection of possible attacks and weak points in the IT infrastructure.
• The system's simplified user interface allows non-technical professionals/employees to understand more about the security insights of the organization's IT systems without delving into much detail.
• To detect possible IT infrastructure security breaches, the framework provides an efficient and easy way to analyse the log files.
Overall, the centralised enterprise log management system provides a convenient and easy way to manage security-related log data that is stored in different locations, and its user interface does not require a high level of IT understanding. This capability will also help organizations to report security breaches to the relevant national authorities.
II. LOG MANAGEMENT SYSTEMS
The Enterprise Log Management System (ELMS) is an area widely studied by industry and academia; there are plenty of academic and white papers discussing ELMSs. Managing log files is a key issue for system and network administration in enterprise organizations, as log files reflect the current status of the system and contain useful information related to IT security. Log files have the potential to be used in benchmarking security systems and can support suggestions such as increasing security levels, fortifying authentication procedures, and tracking malicious attempts by users. Due to the expanding number of threats to networks and servers, log data is becoming crucial. However, organizations with a distributed IT environment usually experience problems with log collection, and analysing the log data in an effective way is another challenge [1].

III. DATA MINING FOR SECURITY RELATED INFORMATION
During the last decade there has been exponential growth in the web, in web sites and in users; therefore, large volumes of data related to users' activities within web sites have been produced. This data is stored in web log files and is broadly known as Web Usage Data (WUD) [2]. Web mining is considered the newest variant of data mining, applied to web data, and covers various web activities [3]. Log analysis is required in order to conduct web usage mining. Knowledge Discovery from Web Usage Data (KDWUD) is the discipline of web mining [4], and data mining techniques are utilized to process the high volume of WUD and discover potentially useful patterns [2]. Web access log files provide the relevant datasets for such analyses. In practice, web data mining is broadly divided into three distinct parts: web content, web structure, and web usage mining. The automated discovery of user access patterns from web servers is called web usage mining; each organization's web servers generate high volumes of data in serving their daily operations, and web usage mining involves the discovery and analysis of user access patterns from web server logs. This paper focuses on web usage mining, so it only covers algorithms that work for web usage mining; some of these algorithms and methods for mining log files are introduced in the next section.

IV. LOG ANALYSIS ALGORITHMS AND METHODS FOR WEB USAGE MINING
Log analysis algorithms have been employed in domains such as security analysis of IT infrastructures and web usage analysis for web servers. Log files contain relevant information about users/agents in the web access history. Web usage mining is utilised to understand users' web access patterns; it involves web log mining as well as data pre-processing, which is an essential task for web log mining [5]. Analysing web log files provides meaningful information, subject to the perspective chosen (server point of view, client point of view). Availability of the servers, vulnerability of servers and security loopholes can be identified by analysing web log files from the server point of view [3]. Pattern recognition systems were used by Guo for studying client behaviours [6]; that study deeply analysed web log data to discover client behaviour patterns and web usage patterns, in order to increase the quality of service.

Web mining research has emerged in recent years [7][4], and especially web usage mining (WUM) [8]. Work on WUD is focused on the discovery and extraction of cluster and sequential patterns. Sequential pattern mining was first proposed by Agrawal and Srikant [8], using the association rule mining technique presented in the famous Apriori algorithm. Afterwards, the Apriori, AprioriAll and AprioriSome algorithms were published to solve the sequential mining problem. Following that, the generalized sequential patterns (GSP) algorithm gave up to 20 times better performance than the Apriori algorithm of the earlier work. The Prefix Tree for Sequential Patterns (PSP) algorithm is much the same as the GSP algorithm [8]. The concept of graph traversal mining, proposed in [9], uses a basic non-weighted graph to show the associations between the pages within web sites. The FP-tree structure re-arranges and stores frequently occurring database transactions in a prefix tree [10]. As a statistical approach, the web utilization miner (WUM) tool focused on the discovery of interesting sequential patterns [11]. The discovery of frequent patterns with WAP-mine was proposed in [12]. The presence of contiguous sequence patterns in web log files was investigated by Xiao and Dunham [13]. The FS-Miner algorithm depends on the FS-Tree, a compressed tree for representing sequences [14]. ApproxMAP merges clustering and sequential patterns to explore multiple-alignment sequential pattern mining [15]. Ezeife and Lu presented the Pre-Order Linked WAP-Tree Mining (PLWAP) algorithm in order to extract sequential patterns from web log files [15]. Tuğ, Şakiroğlu and Arslan proposed automated log mining to discover sequential accesses within web log files [16]. The Sequential Web Access based Recommender System (SWARS) is an intelligent recommender system that utilises sequential access pattern discovery [17]. Raju et al. presented the Cluster and Extract Sequential Pattern (CESP) approach, which extracts the behaviour of the users who browse a web site; to discover their behaviours, CESP splits the log files into sub-parts and then analyses them, as seen in Figure 1 [2].
Figure 1 Block diagram of CESP approach [2]
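To make the operation these sequential-pattern miners share more concrete, the toy sketch below (in C#, the implementation language used later in this paper) computes the support of a candidate page sequence, i.e. the fraction of sessions containing it as an ordered, not necessarily contiguous, subsequence. Algorithms such as Apriori and GSP add candidate generation and pruning on top of this step; the page names here are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SequenceSupport
{
    // True if `candidate` occurs in `session` in order (gaps allowed).
    static bool Contains(IReadOnlyList<string> session, IReadOnlyList<string> candidate)
    {
        int j = 0;
        foreach (string page in session)
            if (j < candidate.Count && page == candidate[j]) j++;
        return j == candidate.Count;
    }

    // Support = fraction of sessions containing the candidate sequence.
    public static double Support(List<List<string>> sessions, List<string> candidate)
        => (double)sessions.Count(s => Contains(s, candidate)) / sessions.Count;

    static void Main()
    {
        var sessions = new List<List<string>>
        {
            new() { "/", "/products", "/cart", "/checkout" },
            new() { "/", "/about", "/products" },
            new() { "/products", "/cart" },
        };
        // "/products" followed later by "/cart" appears in 2 of 3 sessions.
        Console.WriteLine(Support(sessions, new List<string> { "/products", "/cart" }));
    }
}
```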
V. ANOMALY DETECTION SYSTEM (ADS)
A data flow diagram is used to give a high-level understanding of the application from a data point of view, and the processing steps of the data through the system can be easily represented with data process flowcharts. The pre-processing step gathers log files using different processes. Figure 2 depicts the journey of the log data through the system: different types of input log data are pre-processed and sent through the knowledge discovery engine, which allows the detection of possible patterns; the final phase, called pattern analysis, is where the analysis results are gathered from the bulk of the log data. Each process in the flowchart relies heavily on the previous process's output, as sketched below.
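The flow in Figure 2 can be summarised as three chained stages. The following is a minimal structural sketch in C# (the paper's implementation platform); the interface and type names (IPreProcessor, IKnowledgeEngine, IPatternAnalyser, Pattern, Report) are inventions of this sketch, not the actual ADS types.

```csharp
using System.Collections.Generic;

// Illustrative records for the data passed between stages (hypothetical types).
record Pattern(string Description, int Support);
record Report(IReadOnlyList<string> Findings);

// One interface per flowchart stage; each stage consumes the previous stage's output.
interface IPreProcessor    { List<string> Clean(IEnumerable<string> rawLogLines); }
interface IKnowledgeEngine { List<Pattern> Discover(List<string> cleanedEntries); }
interface IPatternAnalyser { Report Analyse(List<Pattern> patterns); }

class LogPipeline
{
    private readonly IPreProcessor pre;
    private readonly IKnowledgeEngine engine;
    private readonly IPatternAnalyser analyser;

    public LogPipeline(IPreProcessor p, IKnowledgeEngine e, IPatternAnalyser a)
        => (pre, engine, analyser) = (p, e, a);

    // Pre-processing -> knowledge discovery -> pattern analysis, as in Figure 2.
    public Report Run(IEnumerable<string> rawLogLines)
        => analyser.Analyse(engine.Discover(pre.Clean(rawLogLines)));
}
```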
VI. DETAILED REQUIREMENTS
The IIS sensor application monitors the web server by constantly listening to the log folder, while the main processes of the system are log parsing and log analysis. Parsing extracts the existing information from the log files (illustrated in the sketch below), while analysis is the phase in which the CESP web usage-mining algorithm is utilised. Both processes have their own decision mechanisms; after both have run, the system activity tables are updated to reflect the new results.
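As a hedged illustration of the parsing step, the sketch below reads an IIS log file in the W3C extended format, in which a "#Fields:" directive names the space-separated columns that follow. This illustrates the log format generically; the ADS parser's internal structure is not described in the paper.

```csharp
using System.Collections.Generic;
using System.IO;

static class IisLogParser
{
    // Yields one dictionary (field name -> value) per log entry.
    public static IEnumerable<Dictionary<string, string>> Parse(string path)
    {
        string[] fields = null;
        foreach (string line in File.ReadLines(path))
        {
            if (line.StartsWith("#Fields:"))
            {
                // e.g. "#Fields: date time s-ip cs-method cs-uri-stem sc-status ..."
                fields = line.Substring("#Fields:".Length).Trim().Split(' ');
                continue;
            }
            if (line.StartsWith("#") || fields == null) continue; // other directives

            string[] values = line.Split(' ');
            var entry = new Dictionary<string, string>();
            for (int i = 0; i < fields.Length && i < values.Length; i++)
                entry[fields[i]] = values[i];
            yield return entry;
        }
    }
}
```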
VII. SENSOR APPLICATION FOR DATA COLLECTION FOR THE FRAMEWORK
The RESTful architecture style was applied to the design of the ADS web service: possible log sources (a DNS server, an active directory server, and the IIS web server) send HTTP POST requests to the web service, which processes each request and returns a JSON or XML response. The RESTful design provides a uniform interface that handles the complexity of receiving log data from different sources and allows easy interaction with the system over HTTP. Any log data source running on any platform can integrate with the web service as long as it can issue a basic HTTP POST request, so sensor applications can be developed in any language and technology. Each sensor application creates an HTTP POST message to pre-defined URLs, carrying the relevant log file information; the structure of the HTTP POST message can be found in Figure 3. Three parameters (fileName, fileSize and fileBase64) represent the log file within the HTTP packet, as in the sketch following Figure 3.
Figure 2 System process flowchart

Figure 3 HTTP POST message exchange
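The following is a minimal sensor-side sketch in C#, posting one log file using the three parameters named above. Only the fileName, fileSize and fileBase64 parameter names come from the paper; the endpoint URL and the log file path are hypothetical examples.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

class LogSensor
{
    static async Task Main()
    {
        // Example path of a default IIS log file; deployment-specific in practice.
        byte[] logBytes = File.ReadAllBytes(@"C:\inetpub\logs\LogFiles\W3SVC1\u_ex140101.log");

        var form = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["fileName"]   = "u_ex140101.log",
            ["fileSize"]   = logBytes.Length.ToString(),
            ["fileBase64"] = Convert.ToBase64String(logBytes) // log payload as Base64 text
        });

        using var client = new HttpClient();
        // Hypothetical ADS endpoint; the service returns a JSON or XML response.
        HttpResponseMessage response = await client.PostAsync(
            "http://ads.example.org/api/logs", form);

        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}
```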
Figure 4 Centralised log management system
VIII. ADS DEMONSTRATION

One of the aims of the project is to provide easy-to-use interfaces for non-technical people. The architecture of the ADS is illustrated in Figure 4. The re-use principle was implemented using the component library of the .Net framework, and traditional UI components are used heavily in the interfaces to benefit from users' existing familiarity with Windows Forms. Forms usually include date/time pickers, data grid views, buttons and textboxes, which an average computer user can easily recognise.

1.1 Alert and log file viewer
The log viewer in Figure 5 allows users to find a particular log file for in-depth analysis. If an alert for a particular date and time is received, the user can easily navigate to the log file viewer form and filter the relevant range. After finding the log entry, all the details can be accessed within the rich textbox on the same form.
Figure 5 shows the alert viewer form. The system constantly monitors the new log entries arriving through the data collection web service. There are specific patterns in web server log files that indicate unusual activity; to detect these possible unusual activities, some patterns are pre-defined in the system. TABLE 1 shows five different patterns that have the potential to lead to a breach, and the sketch after the table illustrates how such rules can be evaluated.

Figure 5 Main Application / Alert Viewer

TABLE 1 FIVE PATTERNS FOR ALERT GENERATION

Title | Description
Unusual GET request size | The average size of GET requests is calculated; if any GET request is 50% larger than the average, an alert is generated.
Unusual POST request size | The average size of POST requests is calculated; if any POST request is 50% larger than the average, an alert is generated.
Concurrent request overflow | If more than 1000 requests come from the same IP address in less than 10 seconds, an alert is generated.
Unusual invalid URL requests | More than 100 GET/POST requests for a non-existent URL in less than 10 seconds triggers an alert.
Directory browsing attempts | Unexpected directory browsing attempts on any specific folder such as /images, /js or /css trigger an alert.
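As an illustration of how two of these pre-defined patterns could be evaluated over a batch of parsed requests, consider the sketch below. The thresholds (50%, 1000 requests, 10 seconds) come from TABLE 1; the Request record and the rule signatures are assumptions of this sketch, not the ADS implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

record Request(string ClientIp, string Method, long Bytes, DateTime Timestamp);

static class AlertRules
{
    // Alert on any GET request more than 50% larger than the average GET size.
    public static IEnumerable<Request> UnusualGetSize(IReadOnlyList<Request> reqs)
    {
        var gets = reqs.Where(r => r.Method == "GET").ToList();
        if (gets.Count == 0) yield break;
        double threshold = gets.Average(r => r.Bytes) * 1.5;
        foreach (var r in gets.Where(r => r.Bytes > threshold))
            yield return r;
    }

    // Alert when one IP issues more than 1000 requests within 10 seconds.
    public static IEnumerable<string> ConcurrentOverflow(IReadOnlyList<Request> reqs)
    {
        foreach (var group in reqs.GroupBy(r => r.ClientIp))
        {
            var times = group.Select(r => r.Timestamp).OrderBy(t => t).ToList();
            // A window of 1001 requests spanning under 10 seconds trips the rule.
            for (int i = 0; i + 1000 < times.Count; i++)
                if ((times[i + 1000] - times[i]).TotalSeconds < 10)
                {
                    yield return group.Key; // flag this IP once
                    break;
                }
        }
    }
}
```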
1.2 Log Analysis Results
The log analysis results were computed using the CESP algorithm, one of the well-known web usage mining algorithms for extracting meaningful information from log files. Two main sets of information are extracted with this algorithm: frequency and recency. To find the frequency of visitors, unique sessions are counted; the number of page views within these sessions and the percentages of new and returning users are then counted as well, as in the sketch below. The CESP algorithm provides these percentages very efficiently in terms of time and computing resources. This form also allows users to filter the log analysis by start and end date.
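A simplified sketch of the frequency statistics described above follows. The sessionisation heuristic used here (client IP plus a 30-minute inactivity gap) is an assumption for illustration, not necessarily how CESP segments the logs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

record LogEntry(string ClientIp, DateTime Timestamp, string Url);

static class SessionStats
{
    static readonly TimeSpan Timeout = TimeSpan.FromMinutes(30);

    public static void Summarise(IEnumerable<LogEntry> entries)
    {
        var sessions = new List<List<LogEntry>>();
        foreach (var group in entries.GroupBy(e => e.ClientIp))
        {
            List<LogEntry> current = null;
            DateTime last = DateTime.MinValue;
            foreach (var e in group.OrderBy(e => e.Timestamp))
            {
                // Start a new session after 30 minutes of inactivity.
                if (current == null || e.Timestamp - last > Timeout)
                {
                    current = new List<LogEntry>();
                    sessions.Add(current);
                }
                current.Add(e);
                last = e.Timestamp;
            }
        }
        if (sessions.Count == 0) { Console.WriteLine("No sessions."); return; }

        double avgPageViews = sessions.Average(s => s.Count);

        // A visitor counts as "returning" if its IP appears in more than one session.
        var sessionsPerIp = sessions.GroupBy(s => s[0].ClientIp).ToList();
        int returning = sessionsPerIp.Count(g => g.Count() > 1);
        double returningPct = 100.0 * returning / sessionsPerIp.Count;

        Console.WriteLine($"Sessions: {sessions.Count}, " +
                          $"avg page views: {avgPageViews:F1}, " +
                          $"returning visitors: {returningPct:F1}%");
    }
}
```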
IX. RESULTS: SYSTEM COMPARISON
The framework is evaluated using the demonstration system: the ADS is compared against similar systems, of which there are many in the market. McAfee Enterprise Log Manager and Splunk Enterprise are good examples of enterprise log management applications; detailed information can be found in their data sheets (Splunk Enterprise Data Sheet, 2013; McAfee Enterprise Log Manager Data Sheet, 2012). TABLE 2 shows the results of comparing the Splunk and McAfee products against the ADS.
TABLE 2 COMPARISONS WITH OTHER AVAILABLE SYSTEMS IN THE MARKET

Feature | Splunk | McAfee | ADS
Requires hardware | No | Yes | No
Web usage mining involved | No | No | Yes
Any kind of log source can be introduced | Yes | No | Yes
Log search and view | Yes | Yes | Yes
Automatic alert generation | No | Yes | Yes
Communication over HTTP | No | No | Yes
RESTful style data collection web service | No | No | Yes
Multiple platforms availability | Yes | No | Partial*

* The data collection web service is available for every platform, but the main and sensor applications only work with Microsoft operating systems.
The system provides broader functionality than its competitors: none of them analyses the gathered log data via a web usage-mining algorithm, and thanks to its RESTful-style data collection web service, the ADS offers a hardware-free solution.
X. RELATED WORK

Accorsi and Hohl [18] provided a secure logging system that stores log information on marginally trusted collectors. First, they proposed an approach for storing the log data in a secure way and introduced a non-repudiation mechanism for collector applications. Secondly, accepting that storage and computing resources are limited, they investigated the extent to which their protocol could securely delegate relaying of the log information.

Söderström and Moradian [1] proposed a system for secure log management that utilizes a database, file system, and log auditing, in parallel with managing the original log files. The system focuses on security, flexibility, performance and portability, and enables organizations to send their encrypted audit log transactions to one centralized server.

Kent and Souppaya, in the Guide to Computer Security Log Management, proposed four measures for organizations that face challenges in log management [19]. These are:
• Prioritize log management appropriately throughout the organization.
• Establish policies and procedures for log management.
• Create and maintain a secure log management infrastructure.
• Provide adequate support for all staff with log management responsibilities.

Log files include important pieces of evidence and are very useful in establishing the chain of custody. For example, a username can be found in logs while information relating to the user's role and privileges may be absent; it is usually easy to understand which system was used, but difficult to determine what data was accessed and the user's identity. Here, McAfee's Enterprise Log Manager provides a comprehensive log management solution, including an intelligent log collection system, rich context analysis, flexible storage options and full-text search capability (see the McAfee Enterprise Log Manager Data Sheet, 2012). Security Information and Event Management (SIEM) and log management tools cover much the same ground. SIEM emerged 10 years ago with the expectation of reducing the load on firewalls and IPS devices, but it is broadly seen as too complex and slow to implement, and it struggles to prove that the investment is worthwhile. End users, on the other hand, want simple solutions that satisfy compliance requirements and help them improve their security operations [20]. The key features of SIEM/log management solutions in the market usually include [20]:
Log Aggregation: Collection and aggregation of log data from different sources, e.g. network, security, servers, databases, identity systems, and applications.
•
Correlation: Detection of attacks by analysing different log data sets from multiple data sources. This is not possible when only looking at one data source.
•
Alerting: Generating alerts on the user interface depending on the pre-defined rules and thresholds.
•
Dashboards: Showing the current status system with key security indicators and possible alerts.
•
Forensics: Functionality for exploring incidents by indexing and searching relevant events.
•
Reporting: Producing the needed documentation, which includes control sets, other relevant security operations and compliance activities. XI.
CONCLUSION
The aim of this research is to provide an initial step towards an extensible and flexible enterprise-level framework for organizations to manage their log files. Software engineering principles were applied during the design, build and testing of the demonstration system. Testing was carried out using a wide range of tools and techniques: Telerik's Test Studio, Cucumber and the built-in functions of Visual Studio were utilised. The ADS is capable of constructing a centralised enterprise log management system for organizations to collect, parse, analyse and manage their log files. The system comparison demonstrated that the framework has a number of advantages over its potential competitors.
XII. FUTURE WORK
The ADS and its demonstration are aligned with the following requirements:
• The system can be integrated with any organization, collecting security-related log data into a central database and application.
• Parsing and analysing different kinds of log data allows the detection of possible attacks and weak points in the IT infrastructure.
• The system's simplified user interface allows non-technical professionals/employees to understand more about the security insights of the organization's IT systems without delving into much detail.
• The framework provides an efficient and easy way to analyse the log files in order to detect possible IT infrastructure security breaches.
In the future, the functionality of the demonstration application will be extended, for example by adding the ability to change the alerts and conditions via a user interface to increase system flexibility, since these are hard-coded in the current demonstration. Stored log files may also require archiving in the future due to limited storage, so an archiving strategy should be considered and integrated with the system. Finally, because there is currently a single sensor application, only the IIS web server provides data to the system; additional sensor applications would make the log database richer and enable different kinds of log analysis, such as aggregation analysis.
XIII. REFERENCES

[1] O. Söderström and E. Moradian, “Secure Audit Log Management,” Procedia Comput. Sci., vol. 22, pp. 1249–1258, Jan. 2013.
[2] G. T. Raju, P. S. Satyanarayana, and L. M. Patnaik, “Knowledge Discovery from the Web Usage Data (KDWUD),” Int. J. Innov. Comput. Inf. Control, vol. 4, pp. 381–389, 2008.
[3] M. Pratap Yadav, P. K. Keserwani, and S. G. Samaddar, “An efficient web mining algorithm for Web Log analysis: E-Web Miner,” in Proc. 1st Int. Conf. on Recent Advances in Information Technology, Mar. 2012, pp. 607–613.
[4] R. Cooley, B. Mobasher, and J. Srivastava, “Data Preparation for Mining World Wide Web Browsing Patterns,” Knowl. Inf. Syst., vol. 1, no. 1, pp. 5–32, 1999.
[5] F. Yuan, L. Wang, and G. Yu, “Study on data preprocessing algorithm in web log mining,” in Proc. Int. Conf. on Machine Learning and Cybernetics, Nov. 2003, pp. 2–5.
[6] D. Guo, “Collector Engine System: A Web Mining Tool for E-Commerce,” in Proc. First Int. Conf. on Innovative Computing, Information and Control (ICICIC’06), 2006, vol. 1, pp. 632–635.
[7] R. Cooley, B. Mobasher, and J. Srivastava, “Web Mining: Information and Pattern Discovery on the World Wide Web,” in Proc. 9th IEEE Int. Conf. on Tools with Artificial Intelligence (ICTAI), 1997, pp. 558–567.
[8] R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and Performance Improvements,” in Proc. 5th Int. Conf. on Extending Database Technology (EDBT), 1996, pp. 3–17.
[9] A. Nanopoulos and Y. Manolopoulos, “Finding generalized path patterns for web log data mining,” Data Knowl. Eng., vol. 37, no. 3, pp. 243–266, 2000.
[10] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining Frequent Patterns Without Candidate Generation: A Frequent-Pattern Tree Approach,” Data Min. Knowl. Discov., vol. 8, no. 1, pp. 53–87, Jan. 2004.
[11] M. Spiliopoulou, “The Laborious Way from Data Mining to Web Mining,” J. Comput. Syst. Eng., vol. 14, Special Issue on Semantics of the Web, pp. 113–126, 1999.
[12] J. Pei, J. Han, B. Mortazavi-Asl, and H. Zhu, “Mining Access Patterns Efficiently from Web Logs,” in Proc. 4th Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD), 2000, pp. 396–407.
[13] Y. Xiao and M. H. Dunham, “Efficient mining of traversal patterns,” Data Knowl. Eng., vol. 39, no. 2, pp. 191–214, Nov. 2001.
[14] M. El-Sayed, C. Ruiz, and E. A. Rundensteiner, “FS-Miner: Efficient and Incremental Mining of Frequent Sequence Patterns in Web Logs,” in Proc. 6th Annual ACM Int. Workshop on Web Information and Data Management, 2004, pp. 128–135.
[15] H. Kum, “Approximate Mining of Consensus Sequential Patterns,” PhD dissertation, University of North Carolina, 2004.
[16] E. Tuğ, M. Şakiroğlu, and A. Arslan, “Automatic discovery of the sequential accesses from web log data files via a genetic algorithm,” Knowledge-Based Syst., vol. 19, no. 3, pp. 180–186, Jul. 2006.
[17] B. Zhou, S. C. Hui, and K. Chang, “An intelligent recommender system using sequential Web access patterns,” in Proc. IEEE Conf. on Cybernetics and Intelligent Systems, 2004, vol. 1, pp. 393–398.
[18] R. Accorsi and A. Hohl, “Delegating secure logging in pervasive computing systems,” in Security in Pervasive Computing, 2006, pp. 58–72.
[19] K. Kent and M. Souppaya, “Guide to Computer Security Log Management,” National Institute of Standards and Technology, Special Publication 800-92, 2006.
[20] Securosis, “Understanding and Selecting SIEM/Log Management,” White Paper, 2010.