Information Systems 44 (2014) 22–33


Outlier detection in audit logs for application systems

H.D. Kuna (a,*), R. García-Martinez (b), F.R. Villatoro (c)

(a) Computer Science Department, School of Sciences, National University of Misiones, Félix de Azara 1552, Zip Code N3300LQH, Posadas, Misiones, Argentina
(b) Information Systems Research Group, Productive and Technological Development Department, National University of Lanús, Argentina
(c) Language and Computer Science Department, University of Malaga, Spain

Article history: Received 15 January 2014; Received in revised form 5 March 2014; Accepted 8 March 2014; Available online 26 March 2014

Abstract

An outlier is defined as an observation that is significantly different from the other data in its set. An auditor will employ many techniques, processes and tools to identify these entries, and data mining is one such medium through which the auditor can analyze information. The enormous amount of information contained within transactional processing systems' logs means that auditors must employ automated systems for anomalous data detection. Several data mining algorithms have been tested, especially those that deal specifically with classification and outlier detection. A group of these previously described algorithms was selected for use in designing and developing a process to assist the auditor in anomalous data detection within audit logs. We have been successful in creating and ratifying an outlier detection process that works in the alphanumeric fields of the audit logs from an information system, thus constituting a useful tool for system auditors performing data analysis tasks. © 2014 Elsevier Ltd. All rights reserved.

Keywords: Data mining; Systems audit; Outlier detection

* Corresponding author. E-mail address: [email protected] (H.D. Kuna).

http://dx.doi.org/10.1016/j.is.2014.03.001
0306-4379/© 2014 Elsevier Ltd. All rights reserved.

Contents

1. Introduction
   1.1. Related work: data mining in systems auditing
2. Data mining applied to outlier detection
3. Materials and methods
   3.1. Algorithms
      3.1.1. Outlier detection algorithms
      3.1.2. Classification algorithms
   3.2. Algorithm selection process
      3.2.1. Selection of outlier detection specific algorithms
      3.2.2. Classification algorithm selection
   3.3. Designing the proposed process
4. Experimentation on real databases
   4.1. Academic management system of a university
   4.2. Purchase management system of a local government
   4.3. Inventory control system of a wholesaler
   4.4. Results
      4.4.1. Academic management system
      4.4.2. Purchase management system of a local government
      4.4.3. Inventory control system of a wholesaler
   4.5. Discussion on the experimentation
5. Conclusions
References

1. Introduction

Information has become a key resource for all organizations, and Information Technologies (IT) have become increasingly widespread and are now firmly rooted in every aspect of organization management. Organizations have made gathering quality information, to aid both in everyday activities and in decision making, an important goal. Corporations must have formal processes in place to guarantee the legality, security and quality of their information. Systems auditing is composed of a series of tasks aimed at ensuring that all information systems within an organization function properly and at providing the basis that enables corporations to fulfill their strategic objectives. The list of good practices developed by ISACA (Information Systems Audit and Control Association) [8] within its COBIT framework provides guidelines to aid organizations in achieving their corporate IT governance and management objectives.

Audit logs contain records of every operation carried out within a software information system and play a key role in guaranteeing that each organization's procedures and regulations are observed. Finding anomalies through manual queries or analyses of the audit logs' stored data requires highly trained staff and a significant expenditure of man-hours. An outlier is an observation [16] suspected of originating from an alternative input mechanism due to its distinguishing features. Detecting these outliers in audit logs is extremely useful, as their existence can provide the auditor with crucial information, but manual searches would be too time-consuming due to the huge amounts of data found in these logs. Automated mechanisms, and data mining in particular, are of great use in this field because of their ability to detect patterns and non-obvious correlations among different pieces of data.

Data mining, described as the process of intelligently extracting useful, non-apparent information from databases, has been widely utilized in systems auditing [7]. Some data mining techniques focus on outlier detection. Anomalous data may stem from the software systems' operating noise, and detecting these entries should be of paramount importance for the system auditor. Anomalous data detection in transactional software audit logs is particularly important, as the risks posed by these anomalies may threaten the system's proper operation. Real databases contain anomalies related to different causes, including errors in data collection, errors in the information systems, probable malicious actions, and so on. In the particular case of the audit logs of application systems, anomalies can occur due to operations carried out within the system that are not considered common, errors



in the information system recording the audit logs, modifications to the audit log, and so on. In all cases, the audit trails flagged as containing anomalous values must eventually be analyzed by the auditor, because the underlying cause of these outliers may imply a risk to the security or quality of the data.

This paper aims to introduce a process that employs data mining techniques to automate outlier detection in system audit logs that include alphanumeric data. Automated detection can allow an auditor to detect hints of anomalous activities, which will most likely require closer scrutiny. The process must also be usable by system auditors who are not experts in data mining.

1.1. Related work: data mining in systems auditing

Computer-Assisted Auditing Techniques (CAATs) make it possible to use computers as part of the auditing process. Data mining is one of the available techniques, but there are other options, namely, the following:

– Data analysis software.
– Network security assessment software.
– Assessment software for operating systems and database management systems.
– Software and source code testing tools.

Many works have been published on intrusion detection in the log files of network operating systems, although fewer studies exist on management systems logs [22,31,38]. Outlier detection applications may also be found within databases [40,41]. Extant papers define a taxonomy for anomalies found through outlier detection [7], while others describe work on fraud detection for credit cards [4,37] and cellular phones [12].

Clustering is a data mining technique that may be employed for outlier detection. This strategy consists of unsupervised learning, during which data are automatically assigned to one of several clusters according to certain shared characteristics. A tuple is more likely to be tagged as an outlier the further it falls from the rest of the sample.
Several clustering techniques are available, including the following:

– Hierarchical clustering, which produces a hierarchy of clusters within the dataset. This technique's results are usually presented in a dendrogram.
– Partitioning methods, in which the dataset is successively partitioned. Objects are clustered into different groups such that each object's deviation from its cluster's center is kept to a minimum.



– Density-based clustering, in which clusters are defined by object density. Objects in low-density regions are considered anomalous.

Other clustering procedures include fuzzy clustering, neural networks, evolutionary algorithms, and entropy-based methods, to name a few.

We have already mentioned that some methods exist for outlier detection within operating systems audit logs. However, no procedure has been formally established to construct a system auditing tool from data mining techniques applicable to alphanumeric fields, which is the goal of this article. Furthermore, the tool we develop must not require its user to be an expert in data mining.

The paper is organized as follows. In Section 2, we briefly review the data mining techniques applied to the detection of outliers. All materials and methods are described in Section 3. Experimentation is presented in Section 4. Finally, our main conclusions can be found in Section 5.

2. Data mining applied to outlier detection

Currently, data mining plays a major role in outlier detection. It comprises a wide selection of techniques that use several types of algorithms to classify and define outliers according to their specific characteristics, as stated by Zhang [43], among others. We must note that these techniques have evolved in both efficiency and efficacy. The data mining-based methods for outlier detection [17] include the following:

– Distribution-based methods assume that the data follow a probability distribution, such as a normal, Poisson or binomial distribution. Once the data distribution has been defined, it is statistically tested to determine which points differ significantly from the defined distribution and are therefore considered anomalous [15].
– Depth-based methods represent each object as a point in a k-dimensional space. Each point has a depth, and the shallower ones are classified as outliers [19].
– Distance-based methods identify outliers by measuring the distance [20,21] between a point and its neighbors, usually the Euclidean distance.
– Density-based methods take local density into account to identify anomalous data [5]. They do not separate examples into categories, i.e., outliers and non-outliers, but instead provide a value for each example to signify how likely that object is to be an actual outlier.
– Clustering-based methods employ data mining techniques to isolate outliers in a cluster [11]. These methods were not designed for this particular purpose, but rather to group data that share certain characteristics.
– RNN-based methods are known for their ability to distinguish normal from abnormal cases; abnormal cases are not reproduced well in the output layer [35,39].
– Support Vector-based methods are usually employed for classification or regression analysis in data mining. Normal and abnormal objects are identified by mapping them into a feature space in which the outliers are either far away from the rest of the dataset or located in areas with low object density [33,36].
– Subspace-based methods consider cluster density distributions within a subspace with a few dimensions. Those with lower than normal density are considered abnormal, and the outliers are the points within said abnormal projections [1].

3. Materials and methods

There are three approaches for the detection of anomalous values in databases [7,17], as follows:

– Approach type 1: detection of outliers with unsupervised learning. In this case, no previous knowledge is necessary for the determination of the anomalous values in the databases. The data are processed with a static distribution, which determines the more distant points of the data sets; these points are marked as potential atypical values. Once the database covers a sufficient number of cases, new records are compared against it and the possibility of anomalous data is determined.
– Approach type 2: detection of outliers with supervised learning. In this case, the data must be previously classified, and the normal and atypical data must be determined. This approach may be utilized, for example, for on-line classification, in which the algorithm learns from the model to better predict new examples whose type is unknown. Ultimately, this approach can classify both normal and anomalous data.
– Approach type 3: detection of outliers with semi-supervised learning. This approach is also known as novelty detection and consists of a semi-supervised classification technique in which the algorithm is trained to recognize a normal type of data. Given that the algorithm knows the normal data, it learns to recognize the anomalous data. The final objective of this approach is to define a normality limit. As no anomalous data are required during the training period, this strategy is useful when obtaining previously labeled abnormal data is very expensive.
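As a small, generic illustration of the statistical (distribution-based) detection mentioned above — not one of the algorithms this paper later selects — a value can be flagged when it lies several standard deviations from the mean of a presumed normal distribution. A minimal Python sketch (the function name and threshold are ours):

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]
```

Real methods of this family apply a proper statistical test rather than a fixed cut-off, but the principle — compare each point against a fitted distribution — is the same.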

The latest methods studied for the detection of anomalous values in databases have been termed hybrids [3,9,27]. These methods combine at least two of the approaches previously described. The objective of combining different algorithms to detect outliers is to overcome the deficiencies of any single algorithm, benefitting from the strengths of each algorithm while minimizing its weaknesses. In this paper, a hybrid methodology is proposed to detect atypical values within the audit log of an application system. Previous work showed that a single technique would not be sufficient to obtain the quality of results required by such activities as systems auditing [32].


Moreover, the hybrid approach has been successfully used in several studies [3,6,9,27,34]. In our opinion, the great diversity of uncertain scenarios in the detection of anomalous values in audit logs may benefit greatly from the integration of different data mining algorithms. The experimentation was carried out initially on an artificially created database; subsequently, the results were validated on three real databases. To decide which algorithms would be integrated into our hybrid approach, the following elements were taken into consideration:

– The ability of the algorithm to produce results that are comprehensible for the final user.
– The efficacy in its detection of outliers [5,11].
– The false positive rate, i.e., the fraction of data misclassified as outliers [2,20].
– The compatibility of the algorithms with the objectives of the procedure [24].
– The expected improvement of the efficacy by combining several techniques in comparison to using them separately [32].
– That the algorithm can operate on alphanumeric data.
– That the algorithms do not require a large number of parameters and can be easily automated. This is very important in cases lacking auditors who are experts in data mining.
– Finally and importantly, that the algorithms are capable of improving their capacity to specifically detect outliers.

3.1. Algorithms

Two criteria were used to select the algorithms and techniques. The first criterion covered algorithms specifically designed for outlier detection in alphanumeric data, owing to the type of problem to be solved. The second criterion covered classification algorithms of a more general nature, not designed for outlier detection; instead, these algorithms had to be able to complement the anomalous-data-detecting algorithms in order to validate the detected outliers. Our aim when designing these criteria was to ensure that the process met the efficiency and efficacy demands of systems auditing. The second group of techniques was put in place to obtain models that would reinforce and/or improve the results from the first group. When selecting the techniques for the second group, Pyle's [28] recommendations on optimal algorithm selection according to the data mining environment were taken into account. Considering the particular characteristics of the problem we set out to solve, the literature suggests the use of the following techniques:

– Rule extraction.
– Decision trees, Top-Down Induction of Decision Trees (TDIDT).
– Bayesian networks.
– Neural networks.


3.1.1. Outlier detection algorithms

We considered the following algorithms for outlier tuple detection (a mix of density- and distance-based methods):

– LOF [5]: This technique detects outliers based on data density. The deviation of a given data point with respect to its neighbors is measured by the so-called Local Outlier Factor (LOF). The algorithm is unsupervised; hence, it does not require a prior classification of the data.
– DBSCAN [11]: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm quite similar to LOF and one of the most commonly used algorithms for unsupervised data clustering.
– DB-Outliers [20]: Distance-Based Outliers (DB-Outliers) is based only on distance, instead of data density as in LOF and DBSCAN.
– COF [30]: Class Outlier Factor (COF) is an algorithm quite similar to DB-Outliers, except that it detects outliers within classes.
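To make the distance-based idea concrete, the following is a minimal one-dimensional sketch in the spirit of DB-Outliers [20] (the actual algorithm operates on multi-dimensional tuples; the function name and parameters here are ours, not the authors'):

```python
def db_outliers(points, d, p):
    """Distance-based outliers: a point is an outlier when more than a
    fraction p of the dataset lies farther than distance d from it."""
    n = len(points)
    return [x for x in points
            if sum(1 for y in points if abs(x - y) > d) / n > p]
```

Note that, as the paper observes below for DB-Outliers and COF, such methods need their parameters (here d and p) fixed in advance, which is a drawback when the number of outliers is unknown.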

3.1.2. Classification algorithms

We considered the following algorithms for data classification:

– C4.5 [25]: This algorithm generates top-down induction decision trees (TDIDT) that can be used for the classification of tuples.
– Bayesian network [26]: This technique builds a probabilistic graphical model, a directed acyclic graph, to classify the data into classes.
– PRISM [18]: This rule extraction algorithm based on decision trees is similar to Quinlan's ID3 (Induction of Decision Trees) algorithm but uses a different induction strategy to produce rules that are modular.
– PART [13]: This rule learner uses partial decision trees obtained with the C4.5 algorithm to select the best branch to turn into a rule, and is effective on data sets in high-dimensional spaces.
– Perceptron network [29]: This neural network acts as a linear classifier to decide whether a tuple belongs to a certain class. Its use is limited to problems that are linearly separable.
– Multilayer perceptron network [14]: To address non-linearly separable classification problems, multiple layers of neurons are employed. Usually, this approach consists of an input layer, an output layer, and several hidden layers in between.

3.2. Algorithm selection process

The algorithms presented in the previous section, both those designed for outlier detection and those for general use, were analyzed according to the following criteria:

– Their efficacy in detecting outliers [5,11].
– The number of errors in their classification processes [2,20].



– End-user legibility of the algorithms' results.
– Compatibility of the algorithms with our processing objectives.
– Improved efficacy when the techniques are employed together [32].
– The algorithms' ability to work with alphanumeric data.
– That the algorithms do not require numerous parameters, as parameters that can be determined automatically make the process more user-friendly for systems auditors who are not experts in data mining.
– That the classification algorithms can work with and complement the algorithms designed for outlier detection, so that both types of algorithm can be merged.

Testing and selection were based on an artificial database created in accordance with guidelines set forth by several authors [2,19,39]. Outliers were added to the database randomly, and the resulting database was compared with the original to verify that all anomalous values in the test database were detected. Table 1 details the characteristics of the test database:

Table 1. Artificial database characteristics.

  Characteristic               Quantity
  Number of tuples             500
  Attribute number per tuple   8
  Value number per attribute   5
  Outlier tuple percentage     5%

3.2.1. Selection of outlier detection specific algorithms

Table 2 shows our test results on two separate performance metrics, efficacy percentage and false positives, for each tested algorithm [10,23,42]. Efficacy is defined as the number of outliers successfully detected divided by the total number of outliers in the database. False positives represent the number of tuples mistakenly classified as outliers divided by the number of non-outlier tuples in the database. The goal of this process is to strike a balance between high efficacy (at least 65%) and a percentage of false positives below 5%.

Table 2. Results for outlier-detecting algorithms.

  Algorithm     Total outliers   Detected outliers   Efficacy (%)   False positives (%)
  LOF           25               17                  68             2.2
  DBSCAN        25               12                  48             1.6
  DB-Outliers   25               19                  76             1.2
  COF           25               18                  72             1.5

DB-Outliers and COF performed excellently, but they were discarded because they require the total number of outliers to be detected to be provided beforehand. As this number is unknowable in real conditions, LOF and DBSCAN were chosen instead.

3.2.1.1. Merging results from LOF and DBSCAN. To optimize results, we proposed that the results from applying the individual algorithms be merged, as shown in Fig. 1. After each algorithm is applied, a binary attribute is added to determine whether a tuple is an outlier. Using LOF, values over 1.2 are considered outliers; DBSCAN places outliers in cluster 0.

Fig. 1. Algorithm merging.

To accomplish this, four columns are added to the audit table under evaluation: "LOF", "LOF_value", "DBSCAN_value" and "outlier_type". These values are completed for each tuple according to the following criteria:

Apply LOF. Save the value returned by the algorithm to the "LOF" attribute.
If "LOF" ≤ 1.2, then "LOF_value" = "0".
If "LOF" > 1.2, then "LOF_value" = "1".
Apply DBSCAN.
If the tuple belongs to a cluster other than cluster 0, then "DBSCAN_value" = "0".
If the tuple belongs to cluster 0, then "DBSCAN_value" = "1".

3.2.1.2. Outlier determination rules for the LOF and DBSCAN algorithms. To reduce the number of false positives, the database undergoes a cleaning process, during which each tuple is analyzed using the following criteria:

If "LOF_value" = "1" and "DBSCAN_value" = "1", then "outlier_type" = "double".
If "LOF_value" = "1" and "DBSCAN_value" = "0", then "outlier_type" = "simple".
If "LOF_value" = "0" and "DBSCAN_value" = "1", then "outlier_type" = "simple".
If "LOF_value" = "0" and "DBSCAN_value" = "0", then "outlier_type" = "no_outlier".
If "outlier_type" = "simple" and "LOF_value" = "1" and "LOF" > 1.26, then "outlier_type" = "simple".
If "outlier_type" = "simple" and "LOF_value" = "1" and "LOF" ≤ 1.26, then "outlier_type" = "no_outlier".

(The 5% increase in the LOF threshold reflects the stricter confirmation demanded of outliers detected by LOF alone, in order to avoid false positives.)

If "outlier_type" = "simple" and "DBSCAN_value" = "1" and "LOF" > 1.14, then "outlier_type" = "simple".
If "outlier_type" = "simple" and "DBSCAN_value" = "1" and "LOF" ≤ 1.14, then "outlier_type" = "no_outlier".

(In this case, the 5% decrease in the LOF threshold is due to a higher level of trust in the results from DBSCAN, as per the number of errors recorded.)

3.2.1.3. Combining LOF and DBSCAN. A new database is produced from merging LOF and DBSCAN to include a new attribute, "outlier_type".
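The merging and cleaning rules above can be sketched as a single function. This is a Python illustration of the stated rules, not the authors' Java implementation, and the function name is ours:

```python
def merge_outlier_flags(lof_score, dbscan_cluster):
    """Merge LOF and DBSCAN results for one tuple.
    Returns (LOF_value, DBSCAN_value, outlier_type)."""
    lof_value = 1 if lof_score > 1.2 else 0        # LOF flags values over 1.2
    dbscan_value = 1 if dbscan_cluster == 0 else 0  # DBSCAN places outliers in cluster 0
    if lof_value == 1 and dbscan_value == 1:
        outlier_type = "double"
    elif lof_value == 1 and dbscan_value == 0:
        # LOF-only detection: require the 5% stricter threshold (1.26)
        outlier_type = "simple" if lof_score > 1.26 else "no_outlier"
    elif lof_value == 0 and dbscan_value == 1:
        # DBSCAN-only detection: accept with the 5% relaxed threshold (1.14)
        outlier_type = "simple" if lof_score > 1.14 else "no_outlier"
    else:
        outlier_type = "no_outlier"
    return lof_value, dbscan_value, outlier_type
```

For example, a tuple with a LOF score of 1.25 flagged only by LOF falls below the stricter 1.26 threshold and is cleaned to "no_outlier", whereas a tuple in DBSCAN's cluster 0 with a LOF score of 1.18 survives as a "simple" outlier.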



Table 3. Results comparison.

  Metric                LOF    DBSCAN   LOF plus DBSCAN
  Efficacy (%)          68     48       72
  False positives (%)   2.2    1.6      1.2

In Table 3, we can see that outlier detection improved by 4% compared with the use of LOF alone and by 24% compared with the use of DBSCAN alone. False positives were reduced by 1% compared with LOF and by 0.4% compared with DBSCAN.

3.2.2. Classification algorithm selection

Our search for general purpose algorithms to be applied after analyzing the database with LOF and DBSCAN aimed to further improve the results of combining these algorithms, thereby improving outlier detection efficacy and reducing false positives. These classification algorithms predict the value of the "Class" attribute based on other attributes from the dataset. Applying the outlier detection algorithms produces an attribute that tells us whether the tuple is an outlier; afterwards, the classification algorithms check their own rule sets and confirm or deny the outlier status. The merger of LOF and DBSCAN provides a single attribute, "outlier_type", that expresses the result of the combined analysis. This eliminates the need to modify the database to include a target attribute for the classification data mining algorithms, thus greatly simplifying experimentation.

Table 4 shows the results from implementing each of the algorithms presented in Section 3.1.2. The "Efficacy difference" and "False positives difference" columns compare the results obtained from each general purpose algorithm with those of LOF plus DBSCAN. The C4.5 and PRISM algorithms maintained the outlier detection percentage observed with LOF and DBSCAN; every other algorithm decreased the efficacy percentage.

Table 4. General purpose algorithm implementation results.

  Algorithm                       Efficacy (%)   False positives (%)   Efficacy difference (%)   False positives difference (%)
  C4.5                            72             1.2                   0                         0
  Bayesian network                4              0                     -68                       -1.2
  PRISM                           72             1.2                   0                         0
  PART                            66             1                     -6                        -0.2
  Perceptron network              6              0.4                   -64                       -0.8
  Multilayer perceptron network   32             0.4                   -40                       -0.8

3.2.2.1. Classification algorithm combination. As shown in Fig. 2, we chose to combine the C4.5, Bayesian Network and PART algorithms. PART was chosen over PRISM based on the points made by Pyle [28].

Fig. 2. Combination of the C4.5, Bayesian Network and PART algorithms.

Table 5. Results from classification algorithm implementation.

  C4.5 difference from   BN difference from   PART difference from
  LOF plus DBSCAN        LOF plus DBSCAN      LOF plus DBSCAN
  -4%                    +64%                 -6%
  -0.4%                  0%                   -0.2%

C4.5 and the Bayesian Network (BN) are included because the former kept the efficacy level intact without increasing the percentage of false positives, whereas the latter eliminated all false positives but drastically reduced outlier detection efficacy. PART was included in an effort to obtain the best possible global results through their respective models.

3.2.2.2. Results from implementing the C4.5, Bayesian Network and PART algorithms individually. When the three classification algorithms were applied to the ad hoc test database, we observed how false positive results are reduced under a combined approach. Table 5 shows the results of this test. After applying LOF plus DBSCAN, a set of rules is established within the procedure to optimize the combination of results from the C4.5, PART and Bayesian Network algorithms. This means there are two possible final values for the "outlier_type" attribute in each tuple: "outlier" or "clean". Thus, C4.5's results outweigh BN's when classifying outliers, but BN outweighs C4.5 in reducing false positives. PART is key for providing a global balance to outlier detection, and the results from LOF are considered a validation element. All of these rules arose from the experimentation. The final objective of implementing these three classification algorithms is to optimize the results provided by the outlier detection algorithms by following a set of rules.

3.2.2.3. Outlier determination rules for the classification algorithms. The final value of the "outlier_type" class attribute in each tuple is determined by applying the following set of rules:

– If "outlier_type" = simple_outlier, then "outlier_type" = double_outlier. (The reason for this operation is to generate only two values of the "outlier_type" attribute, thus simplifying classification.)
– Classify each tuple with the C4.5, BN and PART algorithms.
– For each tuple: if "outlier_type" = double_outlier for all classification algorithms, then "outlier_type" = outlier.



– If "outlier_type" = double_outlier for the BN and PART algorithms, then "outlier_type" = outlier.
– If "outlier_type" = double_outlier for the BN algorithm, "outlier_type" = no_outlier for the C4.5 and PART algorithms, and "LOF_value" > 1.26, then "outlier_type" = outlier.
– If "outlier_type" = double_outlier for C4.5 and PART, and "outlier_type" = no_outlier for the BN algorithm, then "outlier_type" = outlier.
– If "outlier_type" = double_outlier for C4.5, "outlier_type" = no_outlier for PART, "outlier_type" = no_outlier for BN, and "LOF_value" > 1.32, then "outlier_type" = outlier. (The 10% increase over the original LOF threshold required for a tuple to be considered an outlier was obtained through experimentation and significantly improves outlier detection.)
– If "outlier_type" = double_outlier for PART, "outlier_type" = no_outlier for C4.5, "outlier_type" = no_outlier for BN, and "LOF_value" > 1.32, then "outlier_type" = outlier. (The same 10% increase applies.)
– If a tuple meets none of these conditions, then "outlier_type" = clean.
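These determination rules can be sketched as a voting function. This is a Python illustration rather than the authors' implementation; the function name is ours, and the raw LOF score is passed directly where the rules write "LOF_value":

```python
def final_outlier_type(c45, bn, part, lof_score):
    """Combine per-algorithm verdicts ('double_outlier' or 'no_outlier')
    into a final 'outlier' / 'clean' label."""
    d = "double_outlier"
    if c45 == d and bn == d and part == d:
        return "outlier"            # unanimous vote
    if bn == d and part == d:
        return "outlier"            # BN and PART agree
    if bn == d and c45 != d and part != d and lof_score > 1.26:
        return "outlier"            # BN alone, backed by a high LOF score
    if c45 == d and part == d and bn != d:
        return "outlier"            # C4.5 and PART agree
    if c45 == d and part != d and bn != d and lof_score > 1.32:
        return "outlier"            # C4.5 alone, stricter LOF threshold (+10%)
    if part == d and c45 != d and bn != d and lof_score > 1.32:
        return "outlier"            # PART alone, stricter LOF threshold (+10%)
    return "clean"
```

For instance, a tuple flagged only by C4.5 needs a LOF score above 1.32 to be confirmed, while one flagged by both BN and PART is confirmed regardless of its LOF score.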

3.3. Designing the proposed process

The hybrid procedure presented in this section combines two approaches: on the one hand, a type 1 approach based on four unsupervised learning algorithms (LOF, DBSCAN, PART and Bayesian networks); on the other hand, a type 2 approach based on the supervised algorithm C4.5. Fig. 3 shows that the process is composed of the following steps:

a. Read and pre-process the database.
b. Apply LOF. Add the "LOF_value" attribute to each tuple and record the LOF results as per the criteria developed in point 3.2.1.1.
c. Apply DBSCAN. Add the "DBSCAN_value" attribute to each tuple and record the DBSCAN results as per the criteria developed in point 3.2.1.1.
d. Merge the results. Add the "outlier_type" attribute to each tuple and record the results as per the criteria developed in point 3.2.1.2.
e. Read the database.
f. Apply C4.5, determine the value of the "outlier_type" target attribute for this algorithm and save it.
g. Apply BN, determine the value of the "outlier_type" target attribute for this algorithm and save it.
h. Apply PART, determine the value of the "outlier_type" target attribute for this algorithm and save it.
i. Merge the results.
j. Apply the rules for outlier tuple determination as per the criteria in point 3.2.2.3.
k. Save the final result of the "outlier_type" target attribute for each tuple; it can be either "clean" or "outlier".
l. End the procedure.

4. Experimentation on real databases

To validate the performance trends observed through experimentation with artificial databases, we proceeded with experimentation over the audit logs of three real-world databases: the academic management system of a university, the purchase management software of a local government, and the inventory control system of a wholesale company of edible products. In all three cases, audit logs from various tables of the aforementioned systems were used. The software used to carry out the experimentation was written in Java using the NetBeans 7.1 integrated development environment. The source code is available at https://mega.co.nz/#F!1YsWGaTI!blKGdzqnxlE2gSlXRvkiRA.

[Fig. 3. Suggested process. Flowchart: read DB → LOF / DBSCAN → merge results → apply outlier detection rules → read DB → C4.5 / BN / PART → merge results → outlier determination rules → outliers detected.]

4.1. Academic management system of a university

The first experimentation was carried out on a database from a university student management system, investigating the audit records from the "Exam Management", "Course Management", and "Enrollment Management" modules. The selection of modules and tables on which to test the suggested process was made jointly with the staff who manage the system. Table 6 shows the tables containing the audit logs from each analyzed module. The system experts recommended using data belonging to the 1998–2001 period, during which anomalous operations took place within the system. A minimum efficacy of 65% and a maximum of 1% false positives were established after consultation with the aforementioned experts, two system administrators who work on the student management system and have ample experience with academic management systems. They performed an in-depth manual analysis of all the data in the audit logs from the "Exam Management" module and pinpointed all anomalous tuples to validate


Table 6
Log tables selected for analysis.

Modules                  Table name                        Rows     Columns
Exam management          log_exam_certificate              10,805   28
                         log_certificate_details           84,263   23
Course management        log_course_certificate            18,572   19
                         log_class_certificate             74,03    22
                         log_course_certificate_details    17,321   20
                         log_class_certificate_details     12,776   19
Enrollment management    log_students                      5701     16
                         log_degree_aspire                 9049     15

the results by applying our process. The module was chosen due to the critical task it performs within the system. Pre-processing was performed on the tables related to the Exams module, which included eliminating attributes that provided no information or that had null values. The "log_exam_certificate" and "log_certificate_details" tables were merged so the process could be applied to a single audit log table within the Exams module, thus creating the "log_exams" database. The data selected for the test came from the year 2000, as the experts determined that most anomalous operations occurred during that year. To further optimize analysis and outlier detection, we followed another piece of advice from the experts and divided the database into ten separate tables, one for each exam shift, as shown in Table 7. To clarify the meaning of outliers in the academic management system, some examples of activities that the experts consider anomalous are: activities in the audit log during holidays or outside personnel shifts, operations not matching the profile or permissions of a given user, activities going against the internal regulations defined by the university, and data recorded outside the dates established in the university calendar. Such activities could generate negative effects such as the following: errors in the academic reports handed to students or lecturers, errors in the issuance of degrees, low quality of the information employed for decision-making, and anomalous data in performance indicators sent to the control bodies. The auditor must analyze these anomalous activities in the audit log records to verify that they are truly anomalous and were not carried out within a framework of legality.

Table 7
Tables employed to test the process.

Table name               Row quantity
log_first_exam_2000      7840
log_second_exam_2000     4468
log_third_exam_2000      13,590
log_fourth_exam_2000     7830
log_fifth_exam_2000      3985
log_sixth_exam_2000      1896
log_seventh_exam_2000    17,009
log_eighth_exam_2000     7912
log_ninth_exam_2000      9663
log_special_exam_2000    2236

Table 8
Database metadata utilized in the experimentation.

Properties                   Values
System modules               Check-books, Accounting entries, Checks, Purchases, Delivery order, Purchase petty cash, Receipts, End utilization, Bank book, Purchase order, Payments petty cash, Records, Suppliers, Resources, Seals, Accounting parameters
Number of rows               38,120
Number of columns            10
Record average per module    1524
Period considered            Year 2012
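The per-shift split described above amounts to partitioning the merged log by a shift column. A minimal sketch follows; the `shift` column name is hypothetical, since the paper does not give the schema of "log_exams":

```python
from collections import defaultdict

def split_by_shift(rows, shift_key="shift"):
    """Partition the merged log_exams tuples into one table per exam shift.
    `rows` is a list of dicts; the column name `shift` is an assumption."""
    tables = defaultdict(list)
    for row in rows:
        tables[row[shift_key]].append(row)
    return dict(tables)
```

Applied to the merged year-2000 data, this yields the ten tables of Table 7, one per exam shift.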

4.2. Purchase management system of a local government

For this experimentation, data corresponding to the audit logs of certain modules of the purchase management system of a local government were used. This selection was carried out in cooperation with the system administrators, who also evaluated the results. Theirs is an old system that has been modified and updated in recent years. Its method for auditing user operations is centralized: a single table in which the operations of all the modules of the system are recorded. Table 8 shows some metadata of the database used. Experts in the management of this system recommended analyzing the operations corresponding to the first semester of 2012. During this period, several changes to the system were carried out, so anomalous data could have been introduced into the audit log on this critical occasion. The system administrators, with wide experience in this sector, recommended a minimum efficacy of 70% and a maximum percentage of false positives of 1%, so these limits were established for the performance evaluation. They carried out a manual analysis of the data to


Table 9
Tables generated for the experimentation.

Table name          Row quantity
log_january_2012    1729
log_february_2012   2163
log_march_2012      3103
log_april_2012      3682
log_may_2012        3152
log_june_2012       2893

Table 10
Log tables selected for analysis.

Table name               Rows      Columns
purchase_log             5650      19
log_detail_purchase      8772      13
concepts_log             6087      5
Items_log                5970      20
supplier_item_log        13,469    9
suppliers_log            14,394    25
payments_log             13,575    13
detailed_payments_log    6551      22

mark those tuples that could be considered anomalous and that should be detected by our procedure. In the context of this system, some common operations considered anomalous in the audit log are: operations made outside the working hours of the local government's employees, user access to functionalities not inherent to their duties, purchases or expenditure authorizations outside the regular parameters, and transactions unauthorized under the internal procedures of the entity. Such fraudulent activities could affect the system in terms of completeness, truthfulness, and legality. They can also lead to budgetary, administrative, and even legal problems for the authorities involved. For this reason, the anomalies detected require subsequent manual analysis by the systems auditor. To carry out our experimentation, no particular module was singled out, given the centralized nature of the available data. On the experts' recommendation, and to obtain the greatest benefit from the results, each month of the semester was analyzed separately. Six tables, one per month of the semester, were generated, as shown in Table 9.

4.3. Inventory control system of a wholesaler

Additional experimentation was carried out using the data belonging to the audit logs of the "Purchases" module of the inventory control system of a company. This module contains details of the purchases of physical inventory the company carries out. The choice of this module was made in agreement with the system administrators. Table 10 shows some of the tables contained in the module under evaluation. Experts in this system recommended that data corresponding to 2012 be used because this period contained several updates to the system,

altering the state of the database. As with the previous experiments, a minimum efficacy and a maximum percentage of false positives were established; the experts recommended 65% and 1%, respectively. Likewise, the experts carried out a manual analysis to determine which of the rows in the database could be considered outliers and should be detected when the generated procedure was executed. In this inventory control system, outliers could represent anomalous activities such as the following: errors in the amounts of recorded operations, wrong accesses on the part of registered users, amounts in certain purchases above the usual ones, excessive purchase volumes relative to the average quantity, recording of operations inconsistent with both internal regulations and those of the various control bodies, and security errors. This type of activity would have damaging effects on the company, mainly on the quality and validity of the information on which executives base their decisions and from which the tax reports periodically submitted to the control bodies are generated; errors in those reports could create legal problems for the company. The anomalies detected in the audit trail require subsequent analysis by the auditor. We must emphasize that, prior to the experimentation, the data were pre-processed. Attributes that were not considered relevant were eliminated, either because they did not provide important information or because they presented a high percentage of null values. Additionally, the "purchase_log" and "log_detail_purchase" tables were joined, using only the data of the second semester of 2012, to generate a new table named "detailed_purchase_log". The experimentation was carried out on this table. All of these decisions were based on recommendations from the experts. Table 11 shows the characteristics of the new table.
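The pre-processing step described above — discarding attributes that carry no information or are mostly null — can be sketched as follows. The 50% null threshold is our assumption; the paper does not state a cutoff:

```python
def drop_uninformative_columns(rows, max_null_fraction=0.5):
    """Drop attributes that are mostly null or take a single value.
    `rows` is a list of dicts sharing the same keys; the threshold is an assumption."""
    keep = []
    for col in rows[0]:
        values = [r[col] for r in rows]
        null_fraction = sum(v is None for v in values) / len(values)
        distinct = {v for v in values if v is not None}
        # keep only columns that are sufficiently populated and non-constant
        if null_fraction <= max_null_fraction and len(distinct) > 1:
            keep.append(col)
    return [{col: r[col] for col in keep} for r in rows]
```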

4.4. Results

4.4.1. Academic management system
Table 12 shows the results of applying our process compared against those from the manual analysis performed by the system administrators. The efficacy was always over 66%, with a mean value of 76%. The number of false positives was in all cases smaller than 0.67%, with a minimum value of 0.10%. Table 13 shows the classification of the types of outliers detected by our procedure applied to the academic management database; false positives were not included. The most common abnormalities were the following:

Table 11
Properties of the table generated for the analysis.

Property               Value
Number of rows         7132
Number of columns      23
Name of the table      "detailed_purchase_log"
Period considered      July–December 2012
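The efficacy and false-positive percentages reported in Tables 12, 14 and 16 are consistent with the following computation. This is a sketch under our reading — efficacy as the fraction of real outliers detected, false positives counted against the full table size — since the paper does not state the formulas explicitly:

```python
def evaluate(detected_ids, true_outlier_ids, total_rows):
    """Efficacy (% of real outliers found) and false-positive rate
    (% of all rows wrongly flagged); definitions are our assumption."""
    detected, true = set(detected_ids), set(true_outlier_ids)
    true_positives = len(detected & true)
    efficacy = 100.0 * true_positives / len(true)
    false_positive_rate = 100.0 * len(detected - true) / total_rows
    return efficacy, false_positive_rate

# Hypothetical IDs matching log_first_exam_2000 (7840 rows, 94 real outliers;
# 70 detections of which 62 are correct) give roughly 65.95% / 0.10%.
true = set(range(94))
detected = set(range(62)) | {10_000 + i for i in range(8)}
efficacy, fp_rate = evaluate(detected, true, 7840)
```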


Table 12
Results for the academic management system.

Table name               Total outliers    Detected outliers    Efficacy (%)    False positives (%)
log_first_exam_2000      94                70                   65.95           0.10
log_second_exam_2000     55                53                   76.36           0.24
log_third_exam_2000      93                93                   69.89           0.20
log_fourth_exam_2000     88                72                   65.90           0.17
log_fifth_exam_2000      57                64                   84.21           0.40
log_sixth_exam_2000      38                39                   78.94           0.47
log_seventh_exam_2000    92                81                   66.30           0.11
log_eighth_exam_2000     77                90                   88.31           0.27
log_ninth_exam_2000      97                100                  76.28           0.26
log_special_exam_2000    60                69                   90.00           0.67


Table 13
Classification of the outliers detected in the academic management system.

Type of outlier detected                                                    Percentage (%)
Operations at non-habitual hours                                            26.98
Wrong accesses of registered users                                          24.52
Anomalous operations inconsistent with the operation regulations            29.46
Records not corresponding to the periods defined in the academic calendar   19.04

Table 14
Results for the local government's purchase management system.

Table name          Total outliers    Detected outliers    Efficacy (%)    False positives (%)
log_january_2012    33                27                   72.73           0.18
log_february_2012   38                42                   89.47           0.38
log_march_2012      41                31                   75.61           0.00
log_april_2012      57                41                   70.18           0.03
log_may_2012        27                51                   100.00          0.77
log_june_2012       36                28                   75.00           0.04

operations inconsistent with the current regulations of the system (over 29%), operations performed outside the scheduled hours (about 27%), wrong accesses of registered users (about 25%), and records not corresponding to the periods defined in the academic calendar (19%).

4.4.2. Purchase management system of a local government
Table 14 compares the results of executing our procedure on the purchase system of a local government with the manual analysis carried out by the system administrators. The efficacy was always over 70%, with a mean value of 80%. In fact, on the data from May, a 100% efficacy was achieved. The number of false positives was in all cases smaller than 0.77%, with a minimum value of 0.00% on the data of March. Table 15 shows a classification of the types of outliers detected automatically by our procedure applied to the local government's database; note that false positives were not considered. The most common

Table 15
Classification of the outliers detected in the local government's management system.

Type of outlier detected                             Percentage (%)
Operations at non-habitual hours                     25.88
Wrong accesses of registered users                   17.25
Operations outside the habitual parameters           23.85
Operations inconsistent with internal regulations    33.02

Table 16
Results for the commercial management system.

Table name                 Total outliers    Detected outliers    Efficacy (%)    False positives (%)
"detailed_purchase_log"    108               96                   87.04           0.03

Table 17
Classification of the outliers detected in the commercial management system.

Type of outlier detected                                         Percentage (%)
Wrong accesses of registered users                               28.72
Operations with purchase amounts or volumes above the average    24.48
Inconsistencies with internal/external norms                     26.59
Security errors                                                  20.21

anomalies were operations inconsistent with internal regulations (33%), followed by operations outside customary hours (26%), operations outside the standard parameters (24%), and wrong accesses to the system by registered users (17%).

4.4.3. Inventory control system of a wholesaler
Table 16 shows the results of a comparison between the manual and the automatic audit of the inventory control system of a wholesaler; the former was carried out by the administrators of the system and the latter using our procedure. As shown in Table 16, the efficacy of our procedure was approximately 87%, and the number of false positives was very small (0.03%). Table 17 shows a classification of the types of outliers detected by our automatic procedure in the commercial management database; false positives were not included. The most common anomaly detected by our procedure was wrong accesses of registered users (29%). Inconsistencies with the internal and external norms of the company composed 27% of the anomalies. Operations with purchase amounts or volumes significantly above the average corresponded to 25% of the outliers in the audit log of the system. Finally, approximately 20% of the abnormalities in the log were security errors.

4.5. Discussion of the experimentation
The evaluation of new CAAT procedures requires the collaboration of expert auditors. They must be asked to


advise on the maximum acceptable number of false positives. Furthermore, the efficacy of the results obtained automatically must be assessed by comparison with those obtained by manual auditing. In the experimentation carried out on the three databases of real systems, our procedure exceeded, in all cases, the minimum values imposed by the experts. The average efficacy in the detection of outliers was near 80%, with a minimum of 66% and, occasionally, a maximum of 100%. The percentage of false positives was in no case greater than 1%, so this requirement was also met.

5. Conclusions

We have successfully created a process that combines different data mining algorithms for outlier detection in alphanumeric data within information systems' audit logs. This aim was achieved by combining outlier detection algorithms with classification algorithms that validate the results for each tuple. All of the proposed quality goals were met. Based on these findings, we can conclude that the merged data mining-algorithm approach was a success, allowing us to develop a process, apply it to audit tables from real databases, and thus facilitate the system auditor's job. In future work, we contemplate further optimizing the process' efficacy and reducing the rate of false positives to even lower levels. We have also analyzed the convenience of employing fuzzy logic, as in many cases a tuple does not fit neatly into the two values that our process establishes.
