Applying Frequent Sequence Mining to Identify Design ... - CiteSeerX

Applying Frequent Sequence Mining to Identify Design Flaws in Enterprise Software Systems

Trevor Parsons1*, John Murphy1 and Patrick O’Sullivan2

1 Performance Engineering Laboratory, University College Dublin, Dublin 4, Ireland.
2 Dublin Software Lab, IBM Software Group, Damastown, Dublin 15, Ireland.
Email: [email protected], [email protected], [email protected]

Abstract. In this paper we show how frequent sequence mining (FSM) can be applied to data produced by monitoring distributed enterprise applications. In particular we show how we applied FSM to run-time paths to highlight repeating sequences of interest by using alternative support counting techniques. We show how the patterns identified can be used to highlight design flaws in enterprise applications. We also discuss some algorithm scalability problems related to applying FSM to run-time paths and give solutions to these issues.

1 Introduction

Over the past decade the research community has made a major effort to develop efficient algorithms for data mining. Efficient techniques are certainly important, considering the recent advances in computing, communication and digital storage technologies, which today make it possible to easily store very large volumes of data. However, there is a real need to show how such algorithms can be applied to different domains to extract interesting and useful information, since it is irrelevant how efficient an algorithm is if it cannot be put to a meaningful task. Due to technological advances there are many different domains which are “drowning in information but starving for knowledge”.

One such domain is the area of enterprise application development. Enterprise applications are the large software applications that companies use to manage their day-to-day operations. Today’s enterprise applications are extremely large and complex and tend to be physically distributed across many different machines. To understand these systems, and identify problems, developers are often required to sift through large volumes of information produced during monitoring, or to correlate the logs produced by the different software components that comprise the system. This can be an extremely tedious task, and very often developers do not have time to make sense of this large volume of data.

In this paper we show how frequent sequence mining (FSM) [1] can be applied to data produced by monitoring distributed enterprise applications. In particular we show how we applied FSM to run-time paths [2] [3] to highlight repeating sequences of interest (e.g. resource intensive loops) by using an alternative support counting technique. We show how the patterns identified can be used to highlight design flaws in enterprise applications that lead to poor system performance. We also discuss scalability problems (in terms of both the algorithm runtime and the data produced) related to applying FSM to run-time paths and give solutions to these issues.

Section 2 gives some background information on enterprise applications and run-time paths. Section 3 introduces FSM and outlines why we feel FSM is the most appropriate mining approach for analysing run-time paths. In section 4 we outline different support count options that can be applied in FSM and introduce our notion of non-overlapping weighted support. We also show how further measures of “interest” (and not merely frequency) can be applied when using FSM with run-time paths to identify resource intensive sequences (section 5). In section 6 we introduce a number of preprocessing techniques that can be applied to improve the run-time of the algorithm. Section 7 gives results on the performance of a number of different FSM implementations when applied to run-time paths. In this section we also show how the output can be applied to identify design flaws in enterprise applications. Sections 8 and 9 detail related work and our conclusions respectively.

* Our work is funded under the Commercialisation Fund from the Informatics Research Initiative of Enterprise Ireland. We would like to thank Dr. Sean Murphy for his patient assistance with some of the mathematical aspects of this paper. Also his help with the algorithm implementation is much appreciated.

2 Enterprise Applications and Run-Time Paths

A typical enterprise application is made up of a number of distributed components (e.g. web, application and database servers). Many of the components in such a system can be subdivided into smaller software components which interact to service user requests (e.g. web, business and database tier components). When problems arise in these systems it can be very difficult for developers to determine the exact cause of the issue. Often developers must spend hours analysing and correlating the different server logs in order to gain an understanding of how the components interact when the system is running. Considering the number of software components and sub-components in a typical enterprise application (this can be in the order of hundreds), the number of paths that can be taken through the system is generally quite large.

Recent advances in monitoring technology allow for automatic collection of run-time paths spanning physically distributed systems [2]. Run-time paths capture the ordered sequence of (component) events that service client requests in an application. They can also contain performance metrics associated with servicing such a request. Figure 1 gives an example run-time path in text format. Each line contains the component type, the component name, the method called and the method execution time. An indent from one line to the next shows that the component on the line above the indent is the parent component (or caller) of the method directly below (and of any subsequent methods until the level of indentation changes). For quick comprehension run-time paths can be displayed in a diagrammatic format as in 1.

Fig. 1. Run-time Path

Monitoring enterprise systems in this manner can be very useful since run-time paths contain valuable information (e.g. component relationships) that can be used to analyse the overall system design. An issue with analysing run-time paths from enterprise applications is that (a) the paths can be very long and (b) there may be a large number of different paths for a given system. Rather than having to analyse the paths manually, it is desirable to perform automatic analysis, so that the amount of data to be examined by developers is reduced. From a performance design perspective we are interested in finding frequently occurring method calls across the run-time paths that might suggest instances of potential performance design flaws in the application, that is, frequent sequences of method calls that are very resource intensive. Identified design flaws can be refactored or optimized such that the overall system performance can be improved.
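To make the text format of a run-time path concrete, the following minimal Python sketch parses an indented path into a call tree and flattens it back into the ordered event sequence that a mining algorithm would consume. Figure 1 is not reproduced here, so the exact line layout (four whitespace-separated fields, two spaces per indentation level), the component names in EXAMPLE, and the helper names Call, parse_path and flatten are all assumptions made for illustration and not the format used by the monitoring tool.

```python
class Call:
    """One method invocation in a run-time path (hypothetical structure)."""

    def __init__(self, component_type, component, method, millis):
        self.component_type = component_type
        self.component = component
        self.method = method
        self.millis = millis
        self.children = []  # calls made by this method, in order


def parse_path(text, indent=2):
    """Parse an indented run-time path into a tree of Call objects.

    Assumed line layout: '<type> <component> <method> <time>', where the
    indentation depth identifies the caller (the nearest shallower line above).
    """
    stack = []  # stack[d] is the most recent call seen at depth d
    root = None
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = (len(line) - len(line.lstrip(" "))) // indent
        ctype, component, method, millis = line.split()
        call = Call(ctype, component, method, float(millis))
        del stack[depth:]  # drop calls that are deeper than or level with this one
        if stack:
            stack[-1].children.append(call)
        elif root is None:
            root = call
        else:
            raise ValueError("a run-time path is assumed to have a single root")
        stack.append(call)
    return root


def flatten(call):
    """Pre-order traversal: the ordered sequence of events in the path."""
    events = [call.component + "." + call.method]
    for child in call.children:
        events.extend(flatten(child))
    return events


# Invented example path; names and timings are illustrative only.
EXAMPLE = "\n".join([
    "Servlet ShopServlet doGet 120",
    "  SessionBean CartBean getItems 90",
    "    EntityBean ItemBean findByPrimaryKey 30",
    "    EntityBean ItemBean findByPrimaryKey 28",
    "  SessionBean CartBean getTotal 10",
])

root = parse_path(EXAMPLE)
events = flatten(root)
```

The flattened list keeps the order of the calls, which is the property the mining technique in the next section relies on, while the tree retains the caller relationships and timings.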

3 Frequent Sequence Mining

Frequent Itemset Mining (FIM) [4] is particularly suited to finding patterns in transactional data. However, an issue with FIM (in relation to run-time paths) is that it does not take the order of items in a transaction into account. Since a run-time path maintains the order of the events that constitute it, it is important that our analysis technique also respects this order. Mining frequent item sequences [1] in transactional data considers the order of the transactions and thus is more suited to finding patterns in run-time paths. FSM is a general case of FIM.

FIM is concerned with discovering all frequent itemsets within a transactional database. Most FIM algorithms work on the following principle: an itemset X of variables can only be frequent if all its subsets are also frequent (i.e. the downward closure property). Using this principle, the general approach is to first find all frequent sets of size 1. Assuming these are known, candidate sets of size 2 can be generated (i.e. sets {A, B} such that {A} is frequent and {B} is frequent).
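The candidate generation step can be sketched as follows. This is an illustrative Python fragment rather than the implementation evaluated in this paper; the function name generate_candidates and the representation of itemsets as frozensets are assumptions. It joins frequent sets of size k into sets of size k + 1 and keeps only those whose size-k subsets are all frequent, which is exactly the downward closure property.

```python
from itertools import combinations


def generate_candidates(frequent_k, k):
    """Build candidate itemsets of size k + 1 from the frequent itemsets of size k.

    frequent_k: iterable of frozensets, each of size k.
    A candidate is kept only if every one of its size-k subsets is frequent.
    """
    frequent_k = set(frequent_k)
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) != k + 1:
                continue
            if all(frozenset(s) in frequent_k for s in combinations(union, k)):
                candidates.add(union)
    return candidates


# Frequent sets of size 1 give every pair as a candidate of size 2.
frequent_1 = {frozenset("A"), frozenset("B"), frozenset("C")}
candidates_2 = generate_candidates(frequent_1, 1)

# If only {A, B} and {A, C} turn out to be frequent, {A, B, C} is pruned
# because its subset {B, C} is not frequent.
frequent_2 = {frozenset("AB"), frozenset("AC")}
candidates_3 = generate_candidates(frequent_2, 2)
```

The pairwise join is quadratic in the number of frequent sets, which is acceptable for a sketch; practical implementations typically use a prefix-based join instead.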

Fig. 2. Example Transaction with Different Support Counting Approaches

The frequency of the candidate sets can then be calculated by scanning the transactional database. Infrequent candidate sets are removed, which gives the frequent sets of size 2. Next, the candidate sets of size 3 can be constructed and their frequency calculated. This process is repeated until no candidate sets are generated. This is known as the Apriori algorithm [4].

Calculating the frequency of a candidate itemset is referred to as determining the support count of the candidate. The support counts of the candidates are calculated by reading in each transaction from the database and determining whether each candidate is contained within the transaction. Where this is true, the support count is incremented. For itemsets the containment relation corresponds to the set inclusion (⊆) relation. In the case of item sequences, a sequence $\langle a_1, a_2, \ldots, a_n \rangle$ is contained in another sequence $\langle b_{j_1}, b_{j_2}, \ldots, b_{j_m} \rangle$ if there exist integers $1 < j_1 < j_2 < \ldots < j_n$
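The difference between the two containment relations can be illustrated with a short Python sketch. The definition of sequence containment is cut off in the text above, so the sketch assumes the usual order-preserving subsequence relation (items must appear in the same order, gaps allowed); that reading, and the function names used here, are assumptions. Support is counted once per transaction, as described above.

```python
def contains_itemset(transaction, candidate):
    """Set inclusion: every candidate item occurs somewhere in the transaction."""
    return set(candidate) <= set(transaction)


def contains_sequence(transaction, candidate):
    """Assumed subsequence containment: candidate items occur in order, gaps allowed."""
    remaining = iter(transaction)
    # 'item in remaining' advances the iterator up to and including the match,
    # so each later item must be found after the previous one.
    return all(item in remaining for item in candidate)


def support_count(database, candidate, contains):
    """Number of transactions that contain the candidate (at most one per transaction)."""
    return sum(1 for transaction in database if contains(transaction, candidate))


# Invented toy database of three transactions.
database = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["A", "C"],
]

itemset_support = support_count(database, ["A", "B"], contains_itemset)
sequence_support = support_count(database, ["A", "B"], contains_sequence)
```

Here the itemset {A, B} is supported by the first two transactions, whereas the sequence A followed by B is supported only by the first, since the second transaction contains the items in the opposite order.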
