2012 15th International Conference on Network-Based Information Systems

Anomaly Teletraffic Intrusion Detection Systems on Hadoop-based Platforms: A Survey of Some Problems and Solutions Hae-Duck J. Jeong, WooSeok Hyun*, Jiyoung Lim, and Ilsun You Department of Computer Software Korean Bible University Seoul, South Korea {joshua, wshyun, jylim, isyou}@bible.ac.kr

Abstract — Telecommunication networks are becoming more important in our social lives because many people want to share their information and ideas. Thanks to the rapid development of the Internet and of ubiquitous technologies, including mobile devices such as smartphones, mobile phones and tablet PCs, the quality of our lives has been greatly influenced and rapidly changed in recent years, and the number of Internet users has grown exponentially. Meanwhile, the explosive growth of teletraffic, called big data, for user services threatens current networks, and we face menaces from various kinds of intrusive incidents through the Internet. A variety of network attacks on network resources have continuously caused serious damage. Thus, active and advanced technologies for the early detection of anomaly teletraffic on Hadoop-based platforms are required. In this paper, a survey of some problems and technical solutions for anomaly teletraffic intrusion detection systems based on the open-source software platform Hadoop is presented, and a new platform is proposed.

Keywords - Anomaly teletraffic intrusion detection system; big data; Hadoop; HDFS; MapReduce; cloud computing

I. INTRODUCTION

Today, we are experiencing a flood of information, and there is an intense need for efficient data processing in the enterprise and science communities. Thus far, several approaches have been employed to ensure that applications can deal with increasing data volumes. First, cloud computing, now emerging as a promising paradigm, provides practical access to a large number of resources for computation, storage and networking [1]. Second, the MapReduce model of the Hadoop framework expands the potential spectrum of applications by providing a high-level abstraction for massive data processing. The MapReduce programming model was developed as an important implementation of cloud computing. However, in spite of the smart structuring of the system, the configuration of MapReduce has not been evaluated in detail, and a number of issues must be addressed to achieve the best performance [2]. Hadoop is basically a large-scale data processing system operated as a distributed computing platform in which distribution is the core concept [3]. Hence, Hadoop efficiently enriches data storage space and computation capability through parallel processing, and it can enable an implementation of the MapReduce programming model for cloud computing. The core fraction of Hadoop is its distributed file system. Meanwhile, the explosive growth of teletraffic for user services threatens current networks, and we face menaces from various kinds of intrusive incidents through the Internet; a variety of network attacks on network resources have continuously caused serious damage [4], [5]. Therefore, active and advanced technologies for the early detection of anomaly teletraffic on Hadoop-based platforms should be understood. In this paper, a survey of anomaly intrusion detection systems based on the open-source software platform Hadoop is presented, and some problems and solutions for those systems are also suggested.

The rest of this paper is organized as follows: Section II presents a brief overview of Hadoop. Section III introduces anomaly teletraffic intrusion detection systems. Section IV surveys some problems and technical solutions for anomaly intrusion detection systems on Hadoop and proposes a new platform, and Section V summarizes conclusions and future work.

* Corresponding Author.

978-0-7695-4779-4/12 $26.00 © 2012 IEEE DOI 10.1109/NBiS.2012.139

II. UNDERSTANDING HADOOP

Hadoop is an open-source software framework that supports data-intensive distributed applications. It enables applications to work with thousands of computationally independent computers and with petabytes of data [6], [7]. Hadoop was developed for distributed web search and derives from MapReduce, a distributed processing system, and the Google File System, a distributed file system. The essential components of Hadoop are distributed storage (HDFS, the Hadoop Distributed File System) and distributed processing (MapReduce). Hadoop increases storage space and processing power by uniting many computers into one. A small Hadoop cluster includes a single master and multiple worker nodes (slaves), as in Figure 1. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node acts as both a DataNode and a TaskTracker. In a large cluster, HDFS is managed through a dedicated NameNode server that hosts the file system index, and a secondary NameNode that can generate snapshots of the NameNode's memory structures, thus preventing file system corruption and reducing loss of data.

Figure 1. Hadoop architecture.

A. Hadoop Distributed File System

HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. Each node in a Hadoop instance typically has a single DataNode. A cluster of DataNodes forms the HDFS cluster, which has a NameNode in the role of metadata server. The NameNode performs file system namespace operations such as open, close and rename for files and directories. A file is divided into one or more blocks that are stored in DataNodes, and HDFS determines the mapping between blocks and DataNodes. DataNodes serve the read and write operations that file system clients require. Since HDFS is written in Java, any computer that runs Java can run the NameNode or DataNode software, which saves considerable cost in both the hardware-architecture and management phases. HDFS is the common file system in Hadoop but not the only one. For example, Amazon's Elastic Compute Cloud (EC2) uses S3, its own file system, instead of HDFS. Brisk of DataStax uses the Cassandra File System, which includes data query and analytic functions for the integration of real-time data storage and analysis.

B. MapReduce

MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner [8]. Above the file system sits the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.

A limitation of Hadoop is that it cannot be directly mounted to an existing operating system; getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. Another limitation is that Hadoop is not efficient for sub-second data reports and frequent data changes: Hadoop is less effective than an RDBMS where frequent data insertion, deletion and update occur and complete multi-step transactions are used for data processing. An advantage of Hadoop is scalability, which Yahoo and Facebook demonstrate very well. Another advantage is cost-effectiveness: keeping raw data in isolation makes RDBMS storage intensive, whereas Hadoop treats raw data and processed data as equally significant.

III. ANOMALY TELETRAFFIC INTRUSION DETECTION SYSTEM

An Intrusion Detection System (IDS) is a security tool whose goal is to strengthen the security of information and communication systems by monitoring network or system activities, identifying intrusions, and making reports [9], [10], [11]. In terms of analysis technologies, IDSes can be classified into two types: the Signature-based Teletraffic Intrusion Detection System (ST-IDS)¹ and the Anomaly-based Teletraffic Intrusion Detection System (AT-IDS). ST-IDSes apply patterns of known attacks to find attacks. For this approach, it is necessary to build in advance a signature database expressing known attacks. However, despite their efficiency, ST-IDSes are not effective against new or unknown attacks. On the other hand, AT-IDSes decide whether an attack has happened by checking whether the deviation between a given event and the normal behavior is greater than a pre-defined threshold. It is worth noting that this approach can detect previously unseen attacks. Considering the characteristics of the Hadoop framework, we focus on the AT-IDS approach in this paper.
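The map/shuffle/reduce flow that the JobTracker distributes across TaskTracker nodes can be illustrated with a minimal, framework-free sketch of the classic word-count job; the function names below are our own illustrative choices, not part of the Hadoop API.

```python
from collections import defaultdict

# Framework-free sketch of the MapReduce model (word count);
# names are illustrative, not the Hadoop API.

def map_fn(line):
    # Map phase: each input record becomes (key, value) pairs.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle phase: pairs are grouped by key (Hadoop does this
    # across the network, between map and reduce nodes).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reduce phase: each key's values are combined into one result.
    return key, sum(values)

def mapreduce(records):
    mapped = (pair for rec in records for pair in map_fn(rec))
    return dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())

counts = mapreduce(["intrusion detection", "intrusion alert"])
print(counts)  # {'intrusion': 2, 'detection': 1, 'alert': 1}
```

In Hadoop, only `map_fn` and `reduce_fn` are supplied by the application; the framework handles the shuffle, data locality, and failure recovery described above.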

¹ ST-IDS is also called a Misuse-based Teletraffic Intrusion Detection System.


Figure 2. Anomaly intrusion detection system stages.

Generally, an AT-IDS is composed of the following three stages, as shown in Figure 2:
⋅ Parameterization: the events obtained from the target system are expressed in a predefined form.
⋅ Training: the system's normal behavior is characterized and, as a result, a model is established.
⋅ Detection: the built model is applied to check whether the deviation between a given event and the normal behavior goes beyond the pre-defined threshold.
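The three stages above can be sketched as plain functions; the numeric event format (packets-per-second samples) and the three-standard-deviation threshold are illustrative assumptions for this example, not prescribed by the paper.

```python
# Sketch of the three AT-IDS stages; the event format and the
# 3-sigma threshold are illustrative assumptions.

def parameterize(raw_events):
    # Parameterization: express raw events in a predefined numeric form.
    return [float(e) for e in raw_events]

def train(normal_samples):
    # Training: characterize normal behavior as a simple model (mean, std).
    n = len(normal_samples)
    mean = sum(normal_samples) / n
    var = sum((x - mean) ** 2 for x in normal_samples) / n
    return mean, var ** 0.5

def detect(event, model, k=3.0):
    # Detection: flag the event if its deviation from the normal
    # behavior exceeds the pre-defined threshold (k standard deviations).
    mean, std = model
    return abs(event - mean) > k * std

model = train(parameterize(["100", "105", "98", "102", "95"]))
print(detect(500.0, model))  # anomalous burst -> True
print(detect(101.0, model))  # normal load -> False
```

Any of the statistical, knowledge-based, or machine learning-based techniques discussed next can stand in for the simple mean/standard-deviation model used here.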

Anomaly detection technologies can be categorized into three types: statistical-based, knowledge-based, and machine learning-based approaches [9]. First, the statistical-based approach captures the system or network traffic activity and generates a profile corresponding to its stochastic behavior; observed events are then checked against this profile by the deviation between their behavior and the normal one. This approach involves univariate, multivariate, time series, and self-similar models. Second, in the knowledge-based approach, audited events are classified based on pre-defined rules expressing knowledge. This approach includes finite state machines, description languages, and expert systems. Finally, the machine learning-based approach, which includes Markov models, genetic algorithms, neural networks, Bayesian networks, fuzzy logic, and clustering and outlier detection, applies machine learning methods to detect attacks. Figure 3 and Table I show the three approaches and their related technologies. For details, see [9].

Figure 3. Techniques for anomaly teletraffic intrusion detection.

TABLE I. ADVANTAGES AND DISADVANTAGES OF THE AT-IDS TECHNIQUES.

Statistical-based
  Techniques: ⋅ Univariate models ⋅ Multivariate models ⋅ Time series models ⋅ Self-similar models
  Advantages and disadvantages: ⋅ Prior knowledge about normal activity not required ⋅ Accurate notification of malicious activities ⋅ Susceptible to being trained by attackers ⋅ Difficult setting of parameters and metrics ⋅ Unrealistic quasi-stationary process assumption

Knowledge-based
  Techniques: ⋅ Finite state machines ⋅ Description languages ⋅ Expert systems
  Advantages and disadvantages: ⋅ Robustness, flexibility and scalability ⋅ High-quality knowledge/data difficult and time-consuming to obtain

Machine learning-based
  Techniques: ⋅ Markov models ⋅ Genetic algorithms ⋅ Neural networks ⋅ Bayesian networks ⋅ Fuzzy logic ⋅ Clustering and outlier detection
  Advantages and disadvantages: ⋅ Flexibility and adaptability ⋅ Capture of interdependencies ⋅ High dependency on the assumption about the behavior accepted for the system ⋅ High resource consumption
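As one illustration of the machine learning-based category, the "clustering and outlier detection" technique from Table I can be sketched by flagging events far from the centroid of normal training traffic; the two-dimensional feature vectors (e.g., packets/s and bytes/s) and the distance threshold are assumptions made for the example.

```python
# Sketch of clustering/outlier detection (Table I, machine
# learning-based): events far from the centroid of normal traffic
# are flagged. Features and threshold are illustrative.

def centroid(points):
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]

def distance(a, b):
    # Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def is_outlier(event, center, threshold):
    return distance(event, center) > threshold

# Normal training traffic as [packets/s, bytes/s] feature vectors.
normal = [[100.0, 2000.0], [110.0, 2100.0], [90.0, 1900.0]]
center = centroid(normal)  # [100.0, 2000.0]
print(is_outlier([105.0, 2050.0], center, threshold=300.0))  # False
print(is_outlier([900.0, 9000.0], center, threshold=300.0))  # True
```

The same detect-by-deviation pattern underlies the statistical-based models as well; what changes is the model of normal behavior.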

IV. SOME PROBLEMS AND TECHNICAL SOLUTIONS FOR AT-IDS

Table II shows some problems and technical solutions for each attribute of big data. Using those technical solutions, we have proposed a new platform for AT-IDS on Hadoop [12], [13].
⋅ Storage volume: it is hard to store a massive volume of data (e.g., exabytes, zettabytes) in current systems. Such big data are stored in HDFS and HBase, a distributed, scalable, and portable storage layer.
⋅ Velocity: we need at least the processing power of supercomputers to get the final results from big data. The parallel/distributed processing framework MapReduce is one of the best frameworks for easily writing applications which process massive amounts of data.
⋅ Variety: we must also consider how to store structured and unstructured data and various types of data sources such as texts, images, and videos. A non-relational DBMS (NoSQL) stores such types of data.
⋅ IDS: considering the characteristics of the Hadoop framework, anomaly-based intrusion detection systems are suitable because they can detect new or unknown attacks.
⋅ Cost: we normally pay a high cost to implement a framework, but we can implement one at low cost if the open-source framework Hadoop is used.

TABLE II. PROBLEMS AND TECHNICAL SOLUTIONS FOR EACH ATTRIBUTE OF BIG DATA.

Storage Volume
  Problem: difficulties in storing a large volume of data in current systems
  Solution: HDFS

Velocity
  Problem: at least the processing power of supercomputers is required to obtain the final results from big data
  Solution: parallel/distributed processing (MapReduce)

Variety
  Problem: ⋅ various types of data such as texts, images, and videos ⋅ structured and unstructured data
  Solution: non-relational DBMS (NoSQL)

IDS
  Problem: new or unknown anomaly teletraffic
  Solution: AT-IDS

Cost
  Problem: high cost to implement a platform
  Solution: open-source framework (Hadoop)

Figure 4 shows that the proposed platform for AT-IDS on Hadoop consists of four main modules: collector, storage, analyzer, and GUI.

Figure 4. A new proposed platform for AT-IDS on Hadoop.

⋅ The collector module collects web page data, SNS data, and system log data through SNS open APIs (e.g., Facebook, Twitter), distributed file collection tools, data collection robots, and data aggregators.
⋅ The storage module stores and manages big data in file storage, data storage and structured data storage through filters and real-time analysis.
⋅ The analyzer module analyzes, clusters, and classifies parallel and distributed data. This module also plays an important role in contents analysis, descriptive analysis, predictive analysis, natural language processing, text mining, etc. In particular, AT-IDS is implemented to detect anomaly teletraffic using the MapReduce framework.
⋅ The GUI module provides automated statistical analysis of results, such as starting and monitoring the system, obtaining information about the real-time status of the system, and real-time statistics for AT-IDS.
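A minimal sketch of how the four modules could hand data to one another is given below; the module interfaces, log format, and the keyword-based stand-in for AT-IDS are hypothetical, since in the proposed design the analyzer would run AT-IDS as MapReduce jobs over HDFS-resident data rather than in-process Python.

```python
# Hypothetical sketch of the four-module flow of the proposed
# platform; interfaces and the keyword filter are illustrative
# stand-ins, not the actual design.

def collector(sources):
    # Collector: gather records from web pages, SNS APIs, system logs.
    return [record for source in sources for record in source]

def storage(records):
    # Storage: persist big data; an in-memory stand-in for HDFS/HBase.
    return list(records)

def analyzer(records, keyword="FAILED"):
    # Analyzer: classify traffic; a trivial stand-in for AT-IDS.
    return [r for r in records if keyword in r]

def gui(alerts):
    # GUI: report real-time statistics of the detection results.
    return f"{len(alerts)} anomalous record(s) detected"

syslog = ["login OK", "login FAILED", "login FAILED"]
weblog = ["GET /index OK"]
report = gui(analyzer(storage(collector([syslog, weblog]))))
print(report)  # 2 anomalous record(s) detected
```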

V. CONCLUSIONS

The massive volume of data, called big data, for user services threatens current networks, and we face menaces from various kinds of intrusive incidents through the Internet. A variety of network attacks on network resources have continuously caused serious damage. Therefore, active and advanced technologies for the early detection of anomaly teletraffic on a Hadoop-based platform are required. In this paper, we proposed an anomaly teletraffic intrusion detection system based on the open-source software platform Hadoop, and some problems and solutions for this system were also investigated. The proposed framework will be developed and experimented with on Hadoop in the future.



ACKNOWLEDGMENT

The authors would like to thank the funding agency for providing financial support. Parts of this work were supported by a research grant from Korean Bible University, South Korea.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, 51(1):107-113, 2008.
[2] G. Wang and A. R. Butt, "Using realistic simulation for performance analysis of MapReduce setups," LSAP '09: Proceedings of the 1st ACM Workshop on Large-Scale System and Application Performance, 19-26, June 2009.
[3] X. Su and G. Swart, "Oracle in-database Hadoop: when MapReduce meets RDBMS," SIGMOD '12: Proceedings of the 2012 International Conference on Management of Data, 779-790, May 2012.
[4] J.-S. Lee, H.-D. Jeong, D. McNickle, and K. Pawlikowski, "Self-similar properties of spam," in the Fifth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS-2011): Future Internet and Next Generation Networks (FINGNet-2011), Seoul, Korea, 2011, pp. 347-352.
[5] J.-S. Lee, H.-D. Jeong, D. McNickle, and K. Pawlikowski, "Self-similar properties of malicious teletraffic," International Journal of Computer Systems Science and Engineering (IJCSSE), 2012 (Accepted).
[6] http://en.wikipedia.org/wiki/Apache_Hadoop
[7] T. White, Hadoop: The Definitive Guide. O'Reilly Media, Inc., USA, 2009.
[8] http://hadoop.apache.org/
[9] P. Garcia-Teodoro, J. Diaz-Verdejo, G. Macia-Fernandez, and E. Vazquez, "Anomaly-based network intrusion detection: techniques, systems and challenges," Computers & Security, vol. 28, pp. 18-28, 2009.
[10] T. Verwoerd and R. Hunt, "Intrusion detection techniques and approaches," Computer Communications, vol. 25, no. 15, pp. 1356-1365, September 2002.
[11] Y. Bai and H. Kobayashi, "Intrusion detection systems: technology and development," in Proc. of the 17th International Conference on Advanced Information Networking and Applications (AINA 2003), pp. 710-715, IEEE Computer Society, March 2003.
[12] H.J. Lee, "Big data platform and open-source," 2012 IT 21 Global Conference, pp. 173-186, June 2012.
[13] J.-Y. Ko, S.-S. Hong, H.-S. Kim, D. Yang, D. Lim, and H.-D.J. Jeong, "A new Hadoop-based platform for detecting anomaly teletraffic," in Proc. of 2012 Korean Society for Internet Information Summer Conference, PyeongChang, South Korea, June 2012.
