Intelligent Clustering Scheme for Log Data Streams

Basanta Joshi, Umanga Bista, and Manoj Ghimire
Immune APS Nepal, Lalitpur, Nepal
{baj,umb,mg}@dev.immunesecurity.com

Abstract. Mining patterns from log messages is valuable for real-time analysis and for detecting faults, anomalies, and security threats. A data-streaming algorithm with an efficient pattern-finding approach is a more practical way to classify these ubiquitous logs. Thus, in this paper the authors propose a novel online approach for finding patterns in log data sets, in which a locality-sensitive signature is generated for similar log messages. The similarity of these log messages is identified by parsing them and then logically analyzing the signature bit streams associated with them. In addition, the approach is intelligent enough to reflect the changes when a totally new log appears in the system. The proposed method is validated by comparing the F-measure of its clustering results on labeled datasets, and the word-order match percentage of the log messages in a cluster for the unlabeled case, with those of SLCT.

Keywords: event log mining, similarity search, log clustering, locality-sensitive hashing, sketching.

1 Introduction

Log messages are generated to record events from one or more services within a network and are an important indication of the current status of the system(s) they monitor. Modern computer systems and networks are growing in size and complexity at an exponential rate, and massive amounts of log messages are generated from these networks. These messages can be used for network management and systems administration, for monitoring and troubleshooting system behavior, and even for security analysis. Therefore, efficient methods for processing the ever-growing volumes of log data have taken on increased importance, and mining patterns from these log messages is indispensable. Numerous log file analysis tools and techniques are available to carry out a variety of analyses. Several studies have provided insights of varying degrees into log file analysis, and tools have been developed and applied in the areas of network management, network monitoring, etc. [1–6]. One of the simplest tools for log message clustering, the Simple Log file Clustering Tool (SLCT), was proposed by Risto Vaarandi [1]. Its basic operation is inspired by Apriori algorithms for mining frequent item


sets. In this iterative approach, the clustering algorithm is based on finding frequent item sets from log file data and requires human interaction; an automatic approach is not mentioned [7]. The clustering task is accomplished in three passes over the log file. The first pass builds a data summary, i.e. the frequency count of each word in the log file according to its position in each line. In the second pass, cluster candidates are built by choosing log lines with words that occur more often than a user-specified threshold. In the third pass, the clusters are chosen from those candidates that occur at a frequency higher than the user-specified threshold. The words in each candidate that have a frequency lower than the threshold are considered the variable part. This algorithm was designed to detect frequently occurring patterns, where the frequency is a user-specified threshold; all log lines that do not satisfy this condition are considered outliers [8]. The clusters of log files produced by SLCT can be viewed with a visualization tool called LogView [2]. This method uses treemaps to visualize the hierarchical structure of the clusters produced by SLCT. It speeds up the analysis of the vast amount of data contained in log files by summarizing different application log files, and can be used to detect security issues in a given application. Other tools have been proposed for mining temporal patterns from logs with various association rule algorithms [9–11]. Most of these algorithms assume that the event log has been normalized, i.e., that all events in the event log have a common format. In reality, however, logs coming from different sources may have different formats, and the temporal correlation of the data can only be established by normalizing the logs received from different sources. To achieve this, many machine-learning techniques have been applied to aid the process of extracting actionable information from logs. Splunk with Prealert [5] is one popular log management tool using intelligent algorithms. Another important aspect of a log analysis system is real-time analysis of millions of log messages per second. This is only possible if the system knows the pattern of the log messages it receives, or can create the pattern for future reference when it receives new data. To address these issues, the authors propose a locality-sensitive streaming algorithm in which a similarity search is done to create clusters of log messages from large volumes of log data. The clustered logs share a single signature pattern, and the list of cluster patterns is maintained as universal signatures. Inspired by the cluster evaluation strategy used in Makanju et al. [12], the performance of the proposed algorithm is compared with the widely used Simple Log file Clustering Tool.

2 Related Literature

Efficient processing of log messages has taken on increased importance due to the growing availability of large volumes of log data from a variety of sources connected to a system. In particular, monitoring huge and rapidly changing streams of data that arrive online has emerged as the data-streaming model and has recently received a lot of attention. This model differs from computation


over traditional stored data sets, since algorithms must process their input in one or a small number of passes, using only a limited amount of working memory. The streaming model applies to settings where the size of the input far exceeds the size of the available main memory and the only feasible access to the data is by making one or more passes over it [13]. A fundamental computational primitive for dealing with massive datasets is the Nearest Neighbor (NN) problem, defined as follows: given a collection of n objects, build a data structure that, given an arbitrary query object, reports the dataset object most similar to the query [14]. Nearest neighbor search is inherently expensive due to the curse of dimensionality. For high-dimensional spaces, there are often no known algorithms for nearest neighbor search that are more efficient than simple linear search. As linear search is too costly for many applications, this has generated interest in algorithms that perform approximate nearest neighbor search, in which non-optimal neighbors are sometimes returned. Such approximate algorithms can be orders of magnitude faster than exact search while still providing near-optimal accuracy. A detailed study of exact and approximate search schemes has been presented by Panigrahy [15]. Dramatic performance gains are obtained using approximate search schemes such as the popular Locality Sensitive Hashing (LSH) [16]. Several extensions have been proposed to address the limitations of this algorithm, in particular by choosing more appropriate hash functions to better partition the vector space [17, 18]. One method of generating the hash functions is sketching, a space-efficient approach. A sketch is a binary string representation of an object; the similarity of two objects can be estimated using the Hamming distance of their respective sketches [3]. A fast hash function and sketches of an object can be created using a Bloom filter, a simple space-efficient randomized data structure for representing a set in order to support membership queries [19, 20]. Inspired by these methods, the authors propose a hybrid method for clustering log messages in real time, in which LSH is combined with a method of generating the hash map by sketching. The details of this method are discussed in Section 3.

3 Log Clustering Scheme

Common sources of log information are facilities such as Syslog under Unix-like operating systems and Windows Event Logs under Microsoft Windows. Several log facilities collect events with a free text field, which is used in many ways and is not further standardized. A special scheme is necessary for analyzing this unstructured text field. A locality-sensitive streaming algorithm is therefore a viable option, in which a similarity search is done to create clusters of log messages from large volumes of log data. A signature for each log message is generated using a sketching scheme, and similarity is identified by analyzing the difference between the bit pattern of the current log message and the bit patterns stored in the system database. The steps are described in the subsections below.

Table 1. Signature pattern for log messages

Log1: User Manoj Ghimire logged out.
  Hashed functions   H1:  5 11  4 37 23
                     H2: 14 20 20 34  1
                     H3: 10 38 19 25 38
  Signature: 0100110000110010000110010100000000100110

Log2: User Basanta Joshi logged out.
  Hashed functions   H1:  5  1 22 37 23
                     H2: 14 19 10 34  1
                     H3: 10 33 17 25 38
  Signature: 0100010000100010010100110100000001100110

3.1 Signature Generation

In this algorithm, a log message is first tokenized so that it is represented as a list of words, and each word is assigned a numerical value using a non-cryptographic hash function, specifically the MurmurHash function [21]. The seed value of the MurmurHash function is varied to generate 3-4 random hashed values for each tokenized word. A log message is then represented as a set of vectors, each vector representing a particular word. Inspired by the algorithms proposed by Muller-Molina et al. [3], an n-bit signature pattern is generated for each log message. Only those bits in the pattern are set that are indicated by the 3-4 random hashed values of the tokenized words in the log message [19]. The range of the randomized hash functions is defined by the number of bits in the signature pattern. For example, suppose we have two log messages, three hash functions, and a 40-bit signature, as shown in Table 1. Here, each log message has five tokenized words. The hashed values of the word "User" under the three randomized hash functions are 5, 14, and 10, respectively, which set the 6th, 15th, and 11th bits of the signature. Similarly, all other bits in the signature are set that are indicated by the random hashed values of the remaining tokenized words.
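As an illustration, a minimal sketch of this signature-generation step is given below in Python. It assumes the third-party mmh3 MurmurHash bindings; the function and parameter names are illustrative, not from the paper.

```python
import mmh3  # assumed third-party MurmurHash bindings (pip install mmh3)

NUM_BITS = 256    # signature length; the paper finds 256 bits a good trade-off
NUM_HASHES = 3    # 3-4 seeded hash functions per tokenized word

def signature(message: str, num_bits: int = NUM_BITS,
              num_hashes: int = NUM_HASHES) -> int:
    """Return an n-bit Bloom-filter-style sketch of a log message."""
    sig = 0
    for token in message.split():
        for seed in range(num_hashes):
            # Varying the seed simulates independent hash functions;
            # the modulo maps each value into the signature's bit range.
            bit = mmh3.hash(token, seed, signed=False) % num_bits
            sig |= 1 << bit
    return sig
```

For the Table 1 example, signature("User Manoj Ghimire logged out.", num_bits=40) would produce a 40-bit sketch of the same shape, although the exact bits set depend on the actual hash values.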

3.2 Similarity Search

Let us assume that we are maintaining a list of universal signatures for all the log files. A log signature is matched against the universal signature list: if the log signature is similar to a universal signature in the list, it is assigned to that group; otherwise, a new universal signature is created, as shown in Fig. 1. The criterion for deciding whether a log message belongs to a particular pattern is the ratio of the number of common set bits between the log signature and the universal pattern to the number of set bits in the log signature and the universal pattern combined. ANDing the bits of the log signature and the universal pattern yields the common set bits, while ORing them yields the set bits in both, as shown in Table 2.

Fig. 1. Log clustering algorithm

Table 2. Similarity estimation terms

Term                     Symbol   Operation
No. of common set bits   N_c      ANDing of universal signature and log signature
Total no. of set bits    N_s      ORing of universal signature and log signature

As seen in Eq. 1, if the ratio is greater than some threshold percentage, the log message is considered similar to the universal signature. In this case, the universal signature is updated by ANDing the bits, which constitutes the learning in the algorithm. Ideally, after rigorous learning, the set bits in the universal bit stream reflect the exact pattern of a log cluster.

N_c / N_s ≥ T_p    (1)
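A minimal sketch of the similarity test of Eq. 1 and the ANDing update follows; it reuses the hypothetical signature() helper from the previous sketch, and the 0.65 default mirrors the threshold used later in the experiments. Python's int.bit_count() (3.10+) counts the set bits.

```python
def match_and_update(universal: list, log_sig: int,
                     threshold: float = 0.65) -> int:
    """Assign log_sig to the first sufficiently similar universal
    signature (updating it by ANDing), or append a new cluster."""
    for i, uni in enumerate(universal):
        n_c = (uni & log_sig).bit_count()  # common set bits
        n_s = (uni | log_sig).bit_count()  # bits set in either signature
        if n_s and n_c / n_s >= threshold:
            universal[i] = uni & log_sig   # learning: keep only shared bits
            return i
    universal.append(log_sig)              # unseen pattern: new cluster
    return len(universal) - 1
```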

4 Results and Discussion

The performance of the proposed algorithm for clustering log messages was assessed using data collected from different servers and comparing the results with the Simple Log file Clustering Tool. The F-measure was compared for the


clusters generated by the proposed algorithm against the clusters produced by SLCT, for the datasets in which the events (or classes) were known. The results on unlabeled log datasets, where manual identification of events (or classes) is time-consuming or almost impossible, are evaluated using our validation algorithm, in which the percentage of matched word order between the messages in a particular cluster is used as the evaluation measure. The validation algorithm is only used for the proposed clustering algorithm. All processing was carried out on Mac OS X Lion with a 2.9 GHz i7 processor and 8 GB RAM.

4.1 Test Data Set

The test data comprise a variety of log data collected from different servers, including Syslog, Windows server logs, Windows firewall logs, authentication logs, SSHD login logs, Linux server logs, web server logs, mail server logs, etc. For the present test, nine samples of data are used, listed in order of message volume in Table 3. The identifier in this table distinguishes the different types of log files, which vary in number of log messages and log types. The first four sets are labeled, with known events (or classes), whereas the other five are unlabeled, with unknown events (or classes).

Table 3. Test data statistics

SN.  Identifier   Description                             No. of Logs  Data size (MB)  Avg. Word Count  No. of Classes
Datasets with known labels for log messages
1    auth         Authentication logs                     132204       16.4            12               14
2    win          Windows firewall                        59531        52.8            134              42
3    sshd         SSHD logs                               50661        5.3             11               7
4    misc         Mix of Windows server and Linux logs    91375        16.3            18               20
Datasets with unlabeled log messages
5    unlabeled1                                           465          0.091           20               Unknown
6    unlabeled2                                           260386       53              23               Unknown
7    unlabeled3                                           235381       254             118              Unknown
8    unlabeled4                                           684035       430             70               Unknown
9    unlabeled5                                           798459       568             10               Unknown

4.2 Clustering and Validation of Labeled Datasets

In the first test, the performance of SLCT and the proposed algorithm was evaluated by F-measure. SLCT has various parameters, but the test was run by tuning the support threshold only, with the other parameters left at their defaults. The computation of the F-measure for clustering is not straightforward, because the classes do not always conform to the clusters; the same class may split into multiple clusters. For such clusterings, the appropriate formulas for calculating the F-measure [22] are summarized in Fig. 2.

Let N be the number of data points, C the set of classes, K the set of clusters, and n_ij the number of data points of class c_i ∈ C that are elements of cluster k_j ∈ K. Then

P(c_i, k_j) = n_ij / |k_j|
R(c_i, k_j) = n_ij / |c_i|
F(c_i, k_j) = 2 · P(c_i, k_j) · R(c_i, k_j) / (P(c_i, k_j) + R(c_i, k_j))
F(C, K) = Σ_{c_i ∈ C} (|c_i| / N) · max_{k_j ∈ K} F(c_i, k_j)

Fig. 2. Computation of F-measure
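The following sketch computes this clustering F-measure from two parallel lists of class and cluster labels; the function name and input format are our own, not from the paper.

```python
from collections import Counter

def clustering_f_measure(labels, clusters) -> float:
    """F(C, K) of Fig. 2 for parallel lists of class and cluster ids."""
    n = len(labels)
    n_ij = Counter(zip(labels, clusters))   # contingency counts
    class_sizes = Counter(labels)           # |c_i|
    cluster_sizes = Counter(clusters)       # |k_j|
    total = 0.0
    for ci, ci_size in class_sizes.items():
        best = 0.0
        for kj, kj_size in cluster_sizes.items():
            nij = n_ij.get((ci, kj), 0)
            if nij:
                p, r = nij / kj_size, nij / ci_size   # precision, recall
                best = max(best, 2 * p * r / (p + r))
        total += ci_size / n * best         # weight by class size
    return total
```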

A summary comparison of the proposed method with SLCT is shown in Fig. 3. This F-measure evaluation was done on the labeled datasets. The results reveal that the algorithm performs better than SLCT. Moreover, it is non-trivial to tune the support threshold of SLCT for optimal F-measure, because the support threshold depends on the number of log messages in the dataset. The proposed method has no such weakness and performs well, with an F-measure around 0.7 irrespective of the number of logs in the datasets. Further, SLCT is a batch-processing scheme and requires the entire dataset to be available before analysis, whereas the proposed algorithm is an online scheme and can therefore be deployed for real-time log analytics. Another parameter that has to be analyzed for the proposed method is the number of bits used to represent a log message. For the labeled datasets, the performance and number of clusters were evaluated with varying signature bit lengths, as shown in Fig. 4. It can be observed that the proposed algorithm performs well with a 256-bit signature irrespective of the dataset, which is a good trade-off between performance and space requirements.

Fig. 3. Clustering comparison of SLCT and proposed method

Fig. 4. Algorithm performance for varying signature bit length

4.3 Clustering and Validation of Unlabeled Datasets

For the unlabeled datasets, the results of clustering are summarized in Table 4. The number of clusters produced by SLCT is always greater than that of the proposed method, and the choice of an appropriate support threshold is non-trivial. In contrast, the threshold of the proposed method was fixed at 0.65. Table 4 shows the number of clusters created by both methods for particular choices of support/threshold.

In the case of the unlabeled datasets, we have no a priori information about the number of events (or classes), unlike in the labeled case, so the clusters created by the proposed method are validated with the validation scheme described below. At first, a log message in a particular cluster is tokenized so that it is represented as a list of words, and this message is matched against all

the other messages in that cluster to identify the common words. For each common word, we find its index in each of the two messages and calculate the difference between the indices. If the index difference is constant over a run of words, those words occur in the same order; otherwise a word is either not present or not in the order in which it appears in the previous log message. As shown in Table 5, the common words in two different log messages falling in the same cluster are identified and the index difference of each common word is calculated. Here, the first four common words are in order, so the index difference 0 occurs four times in a row. For each run of constant index differences, we count the number of occurrences, giving N_1, N_2, ..., N_n. The measure of word order among the common words of two log messages is the ratio of the sum of the squares of these values to the square of the number of common words N_c. To normalize, the dynamic range of numerator and denominator is compressed by taking their natural logarithms, and the ratio is scaled to give the word-order match percentage shown in Eq. 2.

M_p = [log(N_1² + N_2² + ... + N_n²) / log(N_c²)] × 100    (2)
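A sketch of Eq. 2 follows; grouping constant index differences into runs N_1, ..., N_n is our reading of the procedure described above, so treat it as illustrative rather than the authors' exact implementation.

```python
import math
from itertools import groupby

def word_order_match(log1: str, log2: str) -> float:
    """Word-order match percentage M_p between two log messages (Eq. 2)."""
    w1, w2 = log1.split(), log2.split()
    pos2 = {}
    for i, w in enumerate(w2):
        pos2.setdefault(w, i)               # first index of each word in log2
    diffs = [i - pos2[w] for i, w in enumerate(w1) if w in pos2]
    n_c = len(diffs)                        # number of common words
    if n_c < 2:
        return 0.0                          # avoid a log(1) = 0 denominator
    # Runs of constant index difference give N_1, N_2, ..., N_n.
    runs = [len(list(g)) for _, g in groupby(diffs)]
    return math.log(sum(r * r for r in runs)) / math.log(n_c * n_c) * 100
```

For the Table 5 example the index differences are [0, 0, 0, 0, 1, 1, 1], giving runs of 4 and 3, so M_p = log(16 + 9) / log(49) × 100 ≈ 83%.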

The validation of all clusters generated for the unlabeled datasets of Table 3 is done by this algorithm. A process similar to the above is used to count the log messages within a threshold on the word-order match percentage. The percentage of valid messages in a cluster, i.e. the similarity score of a cluster, is obtained by dividing this count by the total number of messages in the cluster. The average percentage of valid messages in a sample, i.e. the similarity score of a sample, is calculated by dividing the sum of the cluster similarity scores by the number of clusters in the sample. The similarity scores calculated for all the samples are summarized in Table 6. In this table, the similarity scores for unlabeled datasets 1 and 2 are calculated with a word-order match threshold of 80%, while the results for the remaining unlabeled datasets are calculated with a word-order match threshold of 75%.

Table 4. Clustering of log messages

             SLCT                                  Proposed method
Dataset      Support Threshold  No. of Clusters    Threshold  Clusters
Unknown1            10                 14             0.65        50
Unknown2            30                394             0.65       307
Unknown3            60                965             0.65        35
Unknown4            60               1785             0.65       619
Unknown5            30                 21             0.65        18
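For completeness, a hedged sketch of the per-cluster and per-sample similarity scores described above; it reuses word_order_match() from the previous sketch, and the exact pairing of messages within a cluster (each message checked against all others) is our interpretation of the paper.

```python
def cluster_score(messages: list, match_threshold: float = 80.0) -> float:
    """Percentage of messages in a cluster whose word-order match against
    the other messages in the cluster stays above the threshold."""
    if not messages:
        return 0.0
    valid = 0
    for i, msg in enumerate(messages):
        others = messages[:i] + messages[i + 1:]
        if others and all(word_order_match(msg, o) >= match_threshold
                          for o in others):
            valid += 1
    return 100.0 * valid / len(messages)

def sample_score(clusters: list) -> float:
    """Average per-cluster similarity score over a sample."""
    if not clusters:
        return 0.0
    return sum(cluster_score(c) for c in clusters) / len(clusters)
```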


Table 5. Method for cluster validation

Log1:  Conn(0) attempt(1) from(2) machine(3) XYZ(4) 192.168.1.1(5) to(6) 192.168.1.2(7) on(8) port(9) 80(10)
Log2:  Conn(0) attempt(1) from(2) machine(3) ram(4) kr.(5) 192.168.2.1(6) to(7) 192.168.2.2(8) on(9) port(10) 23(11)

Common words (index difference): Conn (0), attempt (0), from (0), machine (0), to (1), on (1), port (1)

Table 6. Validation of generated clusters

Unlabeled dataset No.           1    2    3    4    5
Percentage word order match     96   99   85   89   95

It can be observed that the average percentage of valid messages stays within an appreciable range even though the sample files vary in number of log messages and message length, demonstrating the robustness of the proposed clustering scheme.

5 Conclusion

An intelligent algorithm for generating signatures of log messages has been proposed, in which log messages are clustered based on their percentage similarity with the signatures of the patterns stored in the system database. If the similarity of a log message's signature is within the limit, the signature in the database is modified to reflect the effect of that log in the cluster; otherwise a new signature is stored in the database for the pattern. From the evaluation of the F-measure, it was observed that the proposed method performs better than SLCT. For the unlabeled datasets, the percentage of word-order match was greater than 80% for most of the tests. It was also observed that the results of the SLCT tool depend strongly on the threshold, and that SLCT regards log messages as anomalies if they fail to satisfy the frequency criterion. Further, SLCT being a batch-processing method limits its use in real-time applications. The method proposed by the present authors not only generates better clusters than SLCT but is also applicable to real-time log message pattern identification. The identified patterns can give insights into anomalies and security threats. However, some modification of the proposed method is required to use it in a fully distributed environment.

Acknowledgment. The authors would like to thank the members of the LogPoint family for their valuable inputs at different times during this research work.


References

1. Vaarandi, R.: A data clustering algorithm for mining patterns from event logs. In: Proceedings of the IEEE IPOM 2003, pp. 119–126 (2003)
2. Makanju, A., Brooks, S., Zincir-Heywood, A.N., Milios, E.E.: LogView: Visualizing event log clusters. In: Sixth Annual Conference on Privacy, Security and Trust, PST 2008, pp. 99–108 (2008)
3. Muller-Molina, A.J., Shinohara, T.: Efficient similarity search by reducing I/O with compressed sketches. In: Proceedings of the Second International Workshop on Similarity Search and Applications, SISAP 2009, pp. 30–38. IEEE Computer Society, Washington, DC (2009)
4. Hansen, S.E., Atkins, E.T.: Automated system monitoring and notification with Swatch. In: Proceedings of the 7th Systems Administration Conference, Monterey, CA, pp. 145–155 (1993)
5. Stearley, J., Corwell, S., Lord, K.: Bridging the gaps: Joining information sources with Splunk. In: Proceedings of the Workshop on Managing Systems via Log Analysis and Machine Learning Techniques (2010)
6. Yamanishi, K., Maruyama, Y.: Dynamic syslog mining for network failure monitoring. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD 2005, pp. 499–508. ACM, New York (2005)
7. Seipel, D., Neubeck, P., Köhler, S., Atzmueller, M.: Mining complex event patterns in computer networks. In: Appice, A., Ceci, M., Loglisci, C., Manco, G., Masciari, E., Ras, Z.W. (eds.) NFMCP 2012. LNCS, vol. 7765, pp. 33–48. Springer, Heidelberg (2013)
8. Nagappan, M., Vouk, M.A.: Abstracting log lines to log event types for mining software system logs. In: 2010 7th IEEE Working Conference on Mining Software Repositories (MSR), pp. 114–117 (2010)
9. Mannila, H., Toivonen, H., Inkeri Verkamo, A.: Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery 1, 259–289 (1997)
10. Zheng, Q., Xu, K., Lv, W., Ma, S.: Intelligent search of correlated alarms from database containing noise data. In: 2002 IEEE/IFIP Network Operations and Management Symposium, NOMS 2002, pp. 405–419 (2002)
11. Wen, L., Wang, J., Aalst, W., Huang, B., Sun, J.: A novel approach for process mining based on event types. Journal of Intelligent Information Systems 32, 163–190 (2009)
12. Makanju, A.A., Zincir-Heywood, A.N., Milios, E.E.: Clustering event logs using iterative partitioning. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2009, pp. 1255–1264. ACM, New York (2009)
13. Demetrescu, C., Finocchi, I.: Algorithms for data streams. In: Handbook of Applied Algorithms: Solving Scientific, Engineering, and Practical Problems, p. 241 (2007)
14. Andoni, A.: Nearest Neighbor Search: the Old, the New, and the Impossible. PhD thesis, Massachusetts Institute of Technology (2009)
15. Panigrahy, R.: Hashing, Searching, Sketching. PhD thesis, Stanford University (2006)
16. Paulevé, L., Jégou, H., Amsaleg, L.: Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters 31, 1348–1358 (2010)
17. Slaney, M., Lifshits, Y., He, J.: Optimal parameters for locality-sensitive hashing. Proceedings of the IEEE 100, 2604–2623 (2012)
18. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In: 47th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2006, pp. 459–468 (2006)
19. Broder, A., Mitzenmacher, M.: Network applications of Bloom filters: A survey. Internet Mathematics 1, 485–509 (2004)
20. Song, H., Dharmapurikar, S., Turner, J., Lockwood, J.: Fast hash table lookup using extended Bloom filter: an aid to network processing. SIGCOMM Comput. Commun. Rev. 35(4), 181–192 (2005)
21. Appleby, A.: MurmurHash 2.0 (2010), http://sites.google.com/site/murmurhash/
22. Fung, B.C., Wang, K., Ester, M.: Hierarchical document clustering using frequent itemsets. In: Proceedings of the Third SIAM International Conference on Data Mining (2003)
