A Novel Mind Map Based Approach for Log Data Extraction

P.W.D.C. Jayathilake

Abstract—Software log file analysis helps immensely in software testing and troubleshooting. The first step in automated log file analysis is extracting log data, which requires decoding the log file syntax and interpreting the data semantics. The expected output of this phase is an organization of the extracted data suitable for further processing. Log data extractors can be developed in popular programming languages targeting one or a few log file formats. Rather than repeating this process for each log file format, it is desirable to have a generic scheme for interpreting the elements of a log file and filling a data structure suitable for further processing. The new log data extraction scheme introduced in this paper is an attempt to provide the advanced features demanded by modern log file analysis procedures. It is a generic scheme capable of handling both text and binary log files with complex structures and difficult syntax. Its output is a tree filled with the information of interest for the particular case.

Index Terms—log data extraction, log file modeling, log analysis, mind map

I. INTRODUCTION

Software log files are analyzed for many purposes. In certain cases, the goal is to verify the functional conformance of the software with a given specification, where the log file entries are observed to confirm that the application generates the desired outputs at the intended instances [1]. Log files are heavily used in software troubleshooting and profiling too. When an application malfunctions in production, the only trace available for developers to investigate the cause is, more often than not, the application log file. In addition, log file analysis helps in extracting vital usage statistics from web servers and in security monitoring.

Potential uses of log file analysis extend beyond the popular practices mentioned above. Many powerful software analysis tools have emerged over the recent past with capabilities to monitor software with respect to memory leaks, corruptions, IO overhead, access failures, performance bottlenecks, security pitfalls, failed lower-level API calls, etc. Almost every such tool generates some form of log which can be used to identify software defects that are hard to detect in conventional testing. Correlating data extracted from a monitoring tool with an application log can reveal valuable information on system level events caused by important application events [2].

P.W.D.C. Jayathilake is with the University of Moratuwa, Sri Lanka (phone: +94 776 985444; fax: +94 11 4721198; e-mail: [email protected]).

Despite its benefits, software log file analysis is a labor-intensive and error-prone activity when performed completely manually [3]. Furthermore, it generally demands domain and tool expertise, which is an expensive resource. Therefore, automating at least a part of the process is of paramount importance. The first step in automated log analysis is the automatic extraction of relevant information from log files. Doing this in a generic way is a challenging task, given that different log files have different structures and formats. Because of this, existing commercial log file analysis tools are restricted to a limited set of log file formats. For example, there are tools available to analyze the logs generated by popular web server software such as Apache and IIS [4].

There have been very few attempts in academic research [5] to devise a generic mechanism for extracting data from log files. These start with a formal definition of a log file [6], followed by a scheme for expressing its structure and format. The scheme is typically declarative and is essentially simpler than implementing a log data extractor in a general purpose programming language. It maps the log file data of interest into a set of variables. Despite their usefulness in certain contexts, these schemes share a set of drawbacks that hinder them from being used in many practical analysis processes. They are usually limited to interpreting simple, line-based log entries, whereas many log files generated by modern applications and tools consist of complex and often hierarchical entries. Furthermore, the regular expression based parsers employed with them are unable to capture many log entry patterns. The output of these schemes, which is a set of standalone variables, does not provide information on the relationships between data entries for further processing. In addition, the power of xml technologies cannot be used to process this output.

This paper describes a novel log data extraction scheme which generates a mind map containing the extracted data. Mind maps are an effective means of organizing data and visualizing relationships between them [7]. The scheme interprets a log file as a structural organization of data and provides a declarative language to map log entries into nodes in a mind map. It is driven by a log file definition model devised after analyzing common patterns in present-day log files. It is robust, as it includes an error recovery mechanism to deal with corrupted log files. The mind map generated by the scheme can be analyzed further using a framework [2] developed by the author.

Section II describes the goals of the new scheme, which is followed by a formal definition of the scheme in Section III. The parsing mechanism is then explained in Section IV with examples. Section V illustrates the usage of the scheme through case studies, and Section VI includes a discussion focusing on practical aspects. Related work is covered in Section VII. Section VIII concludes the work, and potential future developments are detailed in Section IX.

II. GOALS OF THE SCHEME

Information extraction schemes generally employ statistical methods [8] because the formats of the files involved are usually not known at the time of analyzer design. Log file analysis, on the other hand, is a domain where it is safe to assume prior knowledge of the file structure and format. Therefore, the new log data extraction scheme should include a mechanism to provide a formal definition of the syntax and semantics of the log file to be analyzed. In addition, its data entity separation mechanism should be strong enough to handle tricky log file formats. The scheme should be rich enough to handle various types of log files such as line logs (flat text files containing data line by line), xml logs, message logs, tabular logs and binary logs. Its way of defining entries in a log file should be capable of using different attributes such as length, minimum length, maximum length, delimiters and possible values to distinguish one entry from others. The output of log data extraction should resemble a mind map in order to be easily processed by the framework described in [2]. Since malformed log files are not uncommon, the scheme should be resilient to them; it should provide mechanisms to recover from the presence of unexpected data chunks and resume parsing from the next correct log entry. Because the users understand the semantics of a log file best, they should be able to guide the parser, via appropriate parameters, on how to recover from an error condition.

III. DEFINITIONS

In this section we provide a formal definition of the new scheme for expressing the format of a log file and the semantic interpretation of its elements. In the following section we describe the rules used when extracting information from a log file using this definition.

Log File Unit (LFU): This is the smallest data entity in a log file. A log file is viewed as a sequence of LFUs. For example, a character (single byte or multi byte) can be the LFU in a text log and a byte can be the LFU in a binary log.

Log Entry (LE): A log entry stands for a single semantic unit in a log file. It can fall into one of the following three definitions.

Type A: A sequence of other log entries defined by the pair ([LE1, LE2, …, LEN], ERROR_RECOVERY) where the LEi are log entries. The sequence should be built with the same order of log entries as specified inside the square brackets in the first element of the pair. ERROR_RECOVERY is a flag that indicates whether the system should try to recover from parse errors for this type of log entry. Error recovery is discussed in more detail later.

Type B: A sequence of other log entries defined by the 4-tuple ({LE1, LE2, …, LEN}, MAX, MIN, ERROR_RECOVERY) where the LEi are log entries. The sequence can be built from those log entries by putting them in any order, and each LEi can appear in the sequence zero or more times. The list containing the LEi is termed the candidate list for the sequence. MAX is the maximum number of log entries permitted in the sequence; if its value is -1, there is no limit on the length of the sequence. Similarly, MIN is the minimum number of log entries that should be present in the sequence; -1 indicates that there is no lower bound on the length of the sequence. ERROR_RECOVERY is a flag with the same semantics as in Type A.

Type C: A singleton (k) where k is an LFU; for example, (‘u’).

Log File (LF): A log file itself is defined as a log entry.
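To make these definitions concrete, the short sketch below encodes the three entry types as plain Python tuples that mirror the notation above. This encoding, and the names used, are illustrative assumptions; the scheme itself does not prescribe any particular representation.

```python
# An illustrative encoding of the three log entry types as plain Python tuples.
# The first element tags the type; the remaining elements follow the notation
# in the text. This representation is an assumption, not part of the scheme.

# Type C: a singleton holding one log file unit (LFU), e.g. the character 'u'
U = ("C", "u")

# Type A: an ordered sequence of log entries plus an error-recovery flag,
#         i.e. ([LE1, LE2, ..., LEN], ERROR_RECOVERY)
VAL = ("A", [("C", "v"), ("C", "a"), ("C", "l")], False)

# Type B: a repetition drawn from a candidate list, with MAX and MIN lengths
#         (-1 meaning "no bound"), i.e. ({LE1, ...}, MAX, MIN, ERROR_RECOVERY)
SPACES = ("B", [("C", " "), ("C", "\t")], -1, -1, False)
```

A log file definition is then simply the top-level entry of such a structure.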

IV. PARSER

A log file is interpreted as a sequence of LFUs. The current log file position, a variable that keeps track of the LFU being accessed at a given point during parsing (denoted CLFP in the remainder of the text), is set to the first LFU in the sequence. Parsing starts from the top parsing rule, which is the definition of the log file itself as a log entry; the log file therefore becomes the current log entry. Parsing then continues according to the following criteria.

A. Parsing log entries

1. Parsing a log entry of type A

Log entries in the definition sequence are processed in the given order. The operation succeeds if all the constituent log entries are parsed successfully. Otherwise the parsing operation returns with a failure.

2. Parsing a log entry of type B

Each log entry in the candidate list is processed in the order in which it appears in the list. When a log entry parses successfully, it is added to the resulting sequence and the next parsing iteration starts from the beginning of the candidate list again. Parsing terminates with success if one of two conditions is satisfied: (i) the current length of the resulting sequence is not less than the MIN length and the current parsing iteration fails for every log entry in the candidate list, or (ii) the MAX length of the sequence is reached. Parsing terminates with failure if the current parsing iteration fails for every log entry in the candidate list while the resulting sequence is still shorter than the MIN length.
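The rules above for type A and type B entries (together with the type C rule described below) can be sketched as a small recursive parser over a text log, where the LFU is a character. The sketch reuses the tuple encoding assumed earlier and is purely illustrative; lazy parsing (described next) and error recovery (Section IV-C) are deliberately omitted.

```python
# A simplified, illustrative parser for type A, B and C entries over a text
# log (LFU = character). Lazy parsing and error recovery are not modeled.

def parse(entry, text, pos):
    """Try to parse `entry` at position `pos`. Returns (success, new_pos, tree)."""
    kind = entry[0]

    if kind == "C":                               # Type C: match a single LFU
        _, lfu = entry
        if pos < len(text) and text[pos] == lfu:
            return True, pos + 1, lfu
        return False, pos, None

    if kind == "A":                               # Type A: fixed ordered sequence
        _, children, _recover = entry
        subtree, cur = [], pos
        for child in children:
            ok, cur, node = parse(child, text, cur)
            if not ok:                            # one failure fails the whole entry
                return False, pos, None
            subtree.append(node)
        return True, cur, subtree

    if kind == "B":                               # Type B: repetition over a candidate list
        _, candidates, max_len, min_len, _recover = entry
        subtree, cur = [], pos
        while max_len == -1 or len(subtree) < max_len:
            for child in candidates:              # first candidate that parses wins
                ok, nxt, node = parse(child, text, cur)
                if ok:
                    subtree.append(node)
                    cur = nxt
                    break
            else:                                 # no candidate matched in this iteration
                break
        if min_len != -1 and len(subtree) < min_len:
            return False, pos, None               # shorter than MIN: failure
        return True, cur, subtree

    raise ValueError("unknown log entry type: %r" % kind)


# Example: S = ({SPACE, TAB}, -1, -1, NO) applied to the text "  \tx"
S = ("B", [("C", " "), ("C", "\t")], -1, -1, False)
print(parse(S, "  \tx", 0))   # -> (True, 3, [' ', ' ', '\t'])
```

Note that, as stated in the rules, the position is advanced only on success; a failed entry leaves the CLFP where it was.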

a. Lazy Parsing

When parsing a type B log entry, the parsing mode changes to lazy parsing after the resulting sequence reaches the MIN length. In this mode, after each successful parse operation, the parser transfers control to the parent log entry before trying to parse entries in the candidate list again. The parent log entry then attempts its next parse operation as if the child parse operation had terminated with success. If this attempt succeeds, the child log entry that is in lazy parsing mode terminates its parse operation with success and parsing continues with the parent. If the parent too is in lazy parsing mode, parsing control transfers to its parent, and so on. However, if the parent's parse attempt fails, the child gets the chance to attempt its next parse operation with the candidate list.

3. Parsing a log entry of type C

Parsing is successful if the LFU at the CLFP is equal to the LFU given in the log entry definition, and fails otherwise.

The general parsing operation is depicted in Fig. 1. In each iteration, the parser is given the next candidate log entry definition and the CLFP. If parsing succeeds, the CLFP is updated; otherwise it remains unchanged. The final result of the parse operation is a tree containing the evaluated values for the log entries.

B. Example

Let us consider the following line appearing in a log file.

val = 2.3

It can be identified as a string formed by concatenating the string "val" with zero or more spaces, an equals sign, zero or more spaces again and a positive real number. We can define it as a log entry as follows.

LE ≡ ([A, S, E, S, B], NO); A ≡ ([A1, A2, A3], NO); A1 ≡ (‘v’); A2 ≡ (‘a’); A3 ≡ (‘l’);
S ≡ ({SPACE, TAB}, -1, -1, NO); SPACE ≡ (‘ ‘); TAB ≡ (‘\t’); E ≡ (‘=’);
B ≡ ({ZERO, ONE, …, NINE, DECIMAL_POINT}, -1, 1, NO);
ZERO ≡ (‘0’); ONE ≡ (‘1’); … ; NINE ≡ (‘9’); DECIMAL_POINT ≡ (‘.’)

LE and A are log entries of type A, while S and B are type B entries. The others fall into type C. The value -1 in the definitions of S and B indicates that there is no upper bound (and, in the case of S, no lower bound) on the length of those log entries. \t in the definition of TAB stands for the tab character. One possible instance of this log entry is val=2.3, which results in the tree shown in Fig. 2 after the parse operation.

C. Error recovery

If parsing a log entry terminates with failure and its ERROR_RECOVERY flag is set to YES, the parser, instead of returning control to the parent log entry, repeatedly tries to parse the log entry by advancing the CLFP by one position each time. This feature is useful when dealing with corrupted log files. A log file can get corrupted for many reasons, such as multiple application threads writing to the log file without proper synchronization, abrupt application termination and disk file corruption. It is highly likely that certain parts of a corrupted log file do not follow the definition and hence lead the whole parsing process to failure. This can be avoided by setting the ERROR_RECOVERY flag in the appropriate log entry definitions. With this mechanism, the parser can resume after a failure, ignoring only the corrupted part. This is illustrated in the case study on the trading system log file in Section V.
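A rough sketch of this recovery rule is shown below. It assumes the parse() function and the tuple encoding from the earlier sketches, so it is illustrative rather than self-contained; the helper names are assumptions.

```python
# A sketch of the error-recovery rule: when an entry whose ERROR_RECOVERY flag
# is set fails to parse, the parser advances the CLFP one LFU at a time and
# retries, instead of propagating the failure to the parent entry.

def wants_recovery(entry):
    # ERROR_RECOVERY is the last element of type A and type B tuples
    return entry[0] in ("A", "B") and entry[-1]

def parse_with_recovery(entry, text, pos):
    cur = pos
    while True:
        ok, new_pos, tree = parse(entry, text, cur)   # parse() from the earlier sketch
        if ok:
            return True, new_pos, tree                # resume from the next valid entry
        if not wants_recovery(entry) or cur >= len(text):
            return False, pos, None                   # give up; CLFP stays unchanged
        cur += 1                                      # skip one corrupted LFU and retry
```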

V. CASE STUDIES

This section demonstrates the usage of the proposed scheme to interpret different types of log files by providing definitions for three log files that differ significantly from each other in structure and format. The usage of the new scheme's features in addressing the complexities inherent to each log file is also discussed. For convenience when expressing the log file definitions, we define a few special entities beforehand. Char is the set containing all lowercase and uppercase English letters along with the underscore. Num is the set containing the digits 0-9. Spaces is the set containing the space character and the tab character. All is the set containing all the characters that can be entered using the keyboard except carriage return and new line. \r, \n and \t denote the carriage return, new line and tab characters respectively. Furthermore, we use the shorthand notation "c1c2…cN" for the log entry ([C1, C2, …, CN], NO) where C1 ≡ (c1); C2 ≡ (c2); … ; CN ≡ (cN) and the ci are characters.
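Under the same illustrative tuple encoding used in the earlier sketches, these special entities and the shorthand could be materialized as follows; the exact character sets and the helper names are assumptions, not part of the scheme.

```python
# Illustrative definitions of the shorthand entities from the case studies,
# using the same tuple encoding as the earlier sketches.

import string

CHAR_SET   = set(string.ascii_letters) | {"_"}   # English letters plus underscore
NUM_SET    = set(string.digits)                  # 0-9
SPACES_SET = {" ", "\t"}                         # space and tab
ALL_SET    = {chr(c) for c in range(32, 127)}    # assumption: printable ASCII, no \r or \n

def one_of(chars, max_len=-1, min_len=1, recover=False):
    """A type B entry whose candidates are the singletons for `chars`."""
    return ("B", [("C", c) for c in sorted(chars)], max_len, min_len, recover)

def literal(s, recover=False):
    """Expand the shorthand "c1c2...cN" into ([C1, ..., CN], NO)."""
    return ("A", [("C", c) for c in s], recover)

VAL_TOKEN = literal("val")            # the fixed token "val" from the Section IV example
NUMBER    = one_of(NUM_SET | {"."})   # a digit/decimal-point run, as in entry B
```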

Fig. 1. The parsing operation. Parser moves CLFP forward in the log file unit sequence only after a successful parse.

Fig. 2. The parse tree for: val=2.3. This is the mind map used for further processing.

Fig. 3-(a). The left part (containing first four columns) of Microsoft Sharepoint log file

Fig. 3-(b). The right part (containing 5th column to 8th column) of Microsoft Sharepoint log file

Fig. 4. A simplified version of the log file generated by Microsoft Application Verifier

A. Microsoft Sharepoint log file

Fig. 3-(a) and Fig. 3-(b) show a part of the log file generated by Microsoft Sharepoint Server. There are 8 columns in the log file, and each record occupies one line. The values of a record corresponding to the 8 columns are separated by spaces (space and tab characters); however, spaces also appear inside some values. The log file grammar is shown in Fig. 5.

1. Remarks

The log file consists of a headings line followed by record lines. Log entries are defined for each of the values in a record line and for their components. Determining the ends of the "Area" value and the "Category" value in a record is challenging because those strings themselves can contain spaces. In order to handle this, we define a separator log entry SPACES2 which evaluates to either a single tab or multiple spaces (or tabs). The occurrence of a SPACES2 entry signifies the end of the current string, because a SPACES2 cannot appear inside a string. In a regular expression based interpreter this separation would be very difficult, if not impossible.
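As an illustration of the SPACES2 idea (not the actual grammar of Fig. 5), the following plain-Python sketch treats a single tab, or a run of two or more whitespace characters, as a field separator while leaving lone spaces inside a value. The input line and function name are made up for the example.

```python
# Illustrative field splitter: a SPACES2-style separator is a single tab or a
# run of two or more whitespace characters; a lone space stays inside a value.

def split_on_spaces2(line):
    fields, buf, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == "\t":                                       # a single tab ends the value
            fields.append("".join(buf)); buf = []
            i += 1
        elif ch == " " and i + 1 < len(line) and line[i + 1] in " \t":
            fields.append("".join(buf)); buf = []
            while i < len(line) and line[i] in " \t":        # swallow the whole run
                i += 1
        else:
            buf.append(ch)                                   # single spaces stay in the value
            i += 1
    if buf:
        fields.append("".join(buf))
    return fields

# Hypothetical record fragment with embedded single spaces in two values
print(split_on_spaces2("Search Server   Query Processing\tMedium"))
# -> ['Search Server', 'Query Processing', 'Medium']
```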

B. Microsoft Application Verifier log file

Microsoft Application Verifier is a monitoring tool which reports on memory leaks, corruptions, user access failures, etc., during an application run. It generates an xml log file containing all the information required for an analysis. Fig. 4 shows a simplified form of its log file. The log file grammar is shown in Fig. 6.

1. Remarks

Since xml is a popular format for log files, an implementation of the scheme can provide easy, reusable shorthand notations for expressing xml elements and attributes. This reduces the amount of typing required and also improves the readability of the grammar. Deriving the log entry definition from the document type definition of an xml log is another desirable feature of an implementation.

C. Trading System log file

Fig. 7 shows a simplified version of a log file generated by an electronic trading system. The log comprises different messages generated in the system. Each message has a number of fields, and a field has a value which can be a string, a number or another message (called a sub-message). The second message shown in the example is corrupted because part of it is missing from the log file. The log file grammar is shown in Fig. 8.

1. Remarks

The highlight in this case is the error recovery after the corrupted second message. In the definition, the MESSAGE entry has its ERROR_RECOVERY flag set to YES. This tells the parser that, whenever the parse operation for a MESSAGE terminates with failure, it should try to parse a new MESSAGE by incrementing the CLFP until it succeeds. Because of this, the parsing operation successfully resumes from the third message after the failure caused by the corrupted second message. Without the error recovery mechanism, parsing would stop prematurely at the second message.

VI. DISCUSSION

An implementation of the scheme should ideally provide a user interface to display the resulting mind map containing the extracted log data. It is also desirable to provide the capability to collapse a subtree and merge the contents of its nodes into its root in order to manage the tree size.

Fig. 6. Interpretation grammar for Microsoft Application Verifier log file

Fig. 7. A part of the log file generated by a trading system. The log file comprises messages generated by the system. The second message is corrupted and does not have a proper ending.

Fig. 8. Interpretation grammar for Trading System log file

Though all the case studies provided deal with text log files (the most common form of log files), the scheme is equally applicable to binary log files; byte is the appropriate log file unit in such cases. Users will most probably need to convert a sequence of bytes into a single value, for example converting four contiguous bytes into an integer. It is desirable for an implementation to provide simple functions for these frequently used operations.

A careful analysis of the structure and format of the log file is encouraged before writing the log file definition grammar. The separators between log entries and the possible values for each log entry need to be clearly identified. The appropriate level of error recovery for the log file should also be decided.
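As a sketch of the byte-conversion helper mentioned above, the following uses Python's standard struct module to turn four contiguous bytes from a binary log into an integer; the function name and the little-endian default are assumptions.

```python
# Illustrative helper: convert four contiguous bytes from a binary log into one integer.
import struct

def bytes_to_int32(chunk, little_endian=True):
    fmt = "<i" if little_endian else ">i"   # signed 32-bit, chosen endianness
    return struct.unpack(fmt, chunk)[0]

print(bytes_to_int32(b"\x2a\x00\x00\x00"))  # -> 42
```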

VII. RELATED WORK

One important attempt to standardize system event logging is the IETF draft called Universal Format for Logger Messages [9], which has since expired. It provides a guideline for formalizing the semantics of log file messages. It defines a set of tags for commonly used fields in log messages such as filename, line, protocol and IP address. Hierarchical tags are also supported. Since the tags are standardized and log files that follow the standard associate data with tags, log file readers can extract the semantics of the log data with respect to the tags. Jan Valdman [5] suggests a scheme for extracting data from flat log messages into variables. The final result is a set of variables holding data to be used in further analysis. This scheme is useful for analyzing log files that comprise simple messages. However, it cannot be used with log files that contain hierarchical information, such as xml.

VIII. CONCLUSIONS

The new scheme described in this paper is capable of expressing both text and binary log files with different structures and formats, ranging from flat messages to complex hierarchies. Compared to existing log data extraction schemes, it aligns well with xml, which is a very popular format in modern log files. The scheme can be used to separate data fields in log files fairly easily, even when the separation logic is complicated to a degree that existing schemes fail to handle. The output of the parser is a mind map, which is more suitable for further processing than the set of variable-value pairs generated by existing log data extractors. To our knowledge, this is the first log data extraction scheme that generates a mind map. Due to its flexibility, the scheme can also be used for tasks outside the log file analysis domain as a generic means of extracting information from files. The error recovery mechanism adds a great degree of robustness to the scheme. It allows the author of the log file definition grammar to decide the level of error recovery, and the parser resumes after falling through a corrupted part of the log file accordingly. This mechanism works particularly well with log file corruptions caused by abrupt application terminations and disk file corruption.

IX. FUTURE WORK

The ability to identify the length limits of a log entry at run time, rather than specifying them in the definition grammar, is a desirable feature, particularly when analyzing binary log files. For example, the length (number of bytes) of a subsequent field in a binary file may be specified elsewhere in the file itself as an integer value. Since we already have a mind map based language to process the extracted data further [2], the next step is to formulate a system to present the results of the analysis in formats such as html and pdf.

ACKNOWLEDGMENT

The author expresses his gratitude to Dr. Amal Shehan Perera from the University of Moratuwa for supporting this work.

REFERENCES

[1] J. H. Andrews, "Testing using log file analysis: tools, methods and issues," Proc. 13th IEEE International Conference on Automated Software Engineering, Oct. 1998, pp. 157-166.
[2] D. Jayathilake, "A mind map based framework for automated software log file analysis," International Conference on Software and Computer Applications, in press.
[3] T. Takada and H. Koike, "MieLog: a highly interactive visual log browser using information visualization and statistical analysis," Proc. USENIX Conf. on System Administration, Nov. 2002, pp. 133-144.
[4] L. Destailleur, "AWStats." [Online]. Available: http://awstats.sourceforge.net
[5] J. Valdman, "Log file analysis," Department of Computer Science and Engineering (FAV UWB), Tech. Rep. DCSE/TR-2001-04, 2001.
[6] J. H. Andrews, "Theory and practice of log file analysis," Department of Computer Science, University of Western Ontario, Tech. Rep. 524, May 1998.
[7] T. Buzan and B. Buzan, The Mind Map Book. New York: Penguin Books, 1994, pp. 79-91.
[8] J. Cowie and W. Lehnert, "Information extraction," Comm. ACM, vol. 39, 1996, pp. 80-91.
[9] J. Abela and T. Debeaupuis, "Universal Format for Logger Messages," IETF Internet-Draft. [Online]. Available: http://tools.ietf.org/html/draft-abela-ulm-05
