Distance-based database user sessions clustering

Qingsong Yao, Aijun An, and Xiangji Huang
Department of Computer Science, York University, Toronto M3J 1P3 Canada
{qingsong,aan}@cs.yorku.ca, [email protected]

Abstract. Analysis of task-oriented database user sessions has been shown to provide useful insight into the query behavior of database users. A database user session is a sequence of queries issued by a user (or an application) to achieve a certain task. It consists of one or more database transactions, each of which is a sequence of operations performed as a logical unit of work. In this paper, we assume that a set of session instances has already been obtained, and we focus on grouping these sessions into different session classes. We propose a distance-based clustering algorithm built on three session similarity metrics. We also show experimental results.

1 Introduction

The performance of a database system is influenced by the characteristics of its hardware and software components as well as by the workload it has to process. Workload analysis has played an important role in optimizing the performance of database systems. In recent years, analysis of task-oriented user sessions has been recognized as a useful source of insight into the query behavior of database users. A database user session is a sequence of queries issued by a user (or an application) to achieve a certain task. It consists of one or more database transactions, each of which is a sequence of operations performed as a logical unit of work. Analysis of sessions allows us to discover high-level patterns that stem from the structure of the task the user is solving. The discovered patterns can be used to predict incoming user queries based on the queries that the user has already issued [Sapia, 2000b, Yao and An, 2003, Bowman and Salem, 2004] and to redesign and rewrite the queries within a user session to achieve better performance [Yao and An, 2004, Behm et al., 2004].

In this paper, we assume that a set of database user sessions has already been obtained from database workloads. Session instances usually carry out a certain task and are controlled by certain business logic. Therefore, it is necessary to group these session instances into different groups/classes according to the tasks performed. We propose a distance-based session clustering algorithm to group session instances into different classes. The distance between two session instances is measured according to three similarity metrics/scores: the coefficient score, the alignment score and the neighborhood score. This approach considers not only the local similarity between sessions (the coefficient score and the alignment score), but also the global similarity (the neighborhood score).

The rest of the paper is organized as follows. In Section 2, we discuss the background of our project. Related work is discussed in Section 3. In Section 4, a distance-based session clustering algorithm is proposed, and three similarity scores are discussed. We analyze our clustering algorithm in Section 5 and give experimental results in Section 6. Finally, we conclude the paper in Section 7.

This work is supported by research grants from Communications and Information Technology Ontario (CITO) and the Natural Sciences and Engineering Research Council of Canada (NSERC).

2 Background

The work presented in this paper is part of a large research project that investigates how data mining can be used for database query optimization (see http://www.cs.yorku.ca/~qingsong for details). We particularly focus on how to discover and model user access patterns from database workloads and how to use the user access patterns to improve system performance.

Fig. 1. The procedure of our workload analysis methods

Fig. 1 shows our database workload analysis process. First, database workloads/traces are collected either at the clients or at the server, and descriptive statistics are obtained (step 0). Since the submitted SQL queries usually follow a certain format, they can be classified into different query templates (step 1). In particular, we replace each data value embedded in a SQL query with the wildcard character '%' to obtain a query template (in [Yao and An, 2004], a query template is called a user access event that contains an SQL template and a set of parameters). A query template represents a set of queries that share the same format. By replacing each submitted SQL statement with the corresponding query template, we obtain a set of query template sequences, referred to as request sequences (steps 2 and 3), one for each database connection. Fig. 2 shows an example of query templates and request sequences, where each query template has a unique label and the field spid is the connection id.

Fig. 2. Query template and request sequence

Each database connection may contain many database user sessions. We assume that queries within a user session have the same connection id, and that the sessions of a connection do not interleave. Thus, a request sequence corresponds to a sequence of sessions. The task of session identification is to separate database sessions from request sequences (step 4). In [Yao et al., 2004a], we use an algorithm based on statistical language modeling to identify database sessions. The method does not rely on any time intervals when identifying session boundaries. Instead, it uses an information-theoretic approach that identifies session boundaries dynamically by measuring the change of information in the sequence of requests.

Table 1. An instance of schedule display session procedure

Label  Statement
q30    select authority from employee where employee_id = '1025'
q09    select count(*) as num from customer where cust_num = '1074'
q10    select card_name from customer t1, member_card t2 where t1.cust_num = '1074' and t1.card_id = t2.card_id
q20    select contact_last, contact_first from customer where cust_num = '1074'
q47    select t1.branch, t2.* from record t1, treatment t2 where t1.contract_no = t2.contract_no and t1.cust_id = '1074' and check_in_date = '2003/03/04' and t1.branch = 'scar'
q49    select top 10 contract_no from treatment_schedule where cust_id = '1074' order by check_in_date desc

Session instances usually carry out a certain task and are controlled by certain business logic. For example, Table 1 shows a user session instance that retrieves and displays a user's treatment schedule. From the table, we observe that the queries within a session follow a certain order, and that the query parameters are related (customer id '1074' is shared by several queries). Thus, it is necessary to group session instances into different session classes and to use a model to represent the query order and the query relationships. In this paper, a session clustering algorithm is proposed to group session instances into different session classes (step 5). We use user access graphs, which are weighted directed graphs, to represent the query execution order and query relationships of a session class [Yao and An, 2004]. An algorithm that builds user access graphs is discussed in [Yao et al., 2004b] (step 6).
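
The template step described earlier in this section (replacing each embedded data value with '%') can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `to_template` and `label_templates`, the regular expression, and the label format are assumptions. The pattern only handles single-quoted literals and bare numbers, so a constant such as the 10 in `top 10` is also replaced.

```python
import re

# Assumed pattern: a single-quoted literal, or a standalone number.
# Digits inside identifiers such as t1 are not matched because of \b.
_LITERAL = re.compile(r"'[^']*'|\b\d+(?:\.\d+)?\b")

def to_template(sql):
    """Replace each data value in a SQL statement with the wildcard '%'."""
    return _LITERAL.sub("%", sql)

def label_templates(queries):
    """Give each distinct template a label and return (labels, request sequence)."""
    labels = {}
    sequence = []
    for query in queries:
        template = to_template(query)
        if template not in labels:
            labels[template] = f"q{len(labels) + 1:02d}"  # hypothetical label format
        sequence.append(labels[template])
    return labels, sequence

queries = [
    "select count(*) as num from customer where cust_num = '1074'",
    "select count(*) as num from customer where cust_num = '2210'",
]
labels, sequence = label_templates(queries)
print(sequence)  # both queries share one template, hence one label
```

Two queries that differ only in their data values map to the same label, which is what makes the resulting request sequences comparable across sessions.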

3 Related Work

Data clustering is a subject of active research in several fields such as statistical analysis, pattern recognition, and machine learning. There are many clustering techniques, such as hierarchical clustering, partition-based clustering, density-based partitioning and grid-based methods. A review of clustering techniques is given in [Jain et al., 1999], and a survey of data clustering algorithms can be found in [Berkhin, 2002]. Guha et al. [2000] present a clustering algorithm, ROCK (Robust Clustering using linKs), that deals with categorical data in which each data point is a set of items. The algorithm is based on links between data points instead of distance-based metrics or the Jaccard coefficient. Wang and Zaïane [2002] and Hay et al. [2001] use an idea similar to the sequence alignment presented in this paper to cluster web user sessions; however, they use different scoring schemes.

Clustering is one of the common techniques used in characterizing the workload in DBMS environments. Transactions can be grouped according to their consumption of system resources [Yu and Dan, 1992], or according to their database reference patterns [Yu and Dan, 1994], which is called affinity clustering. Artis [1978] characterizes the workload of an IBM MVS system with the aim of determining its capacity. Several parameters are used for developing cluster descriptions of the transaction workload, such as total database calls and total number of locks. Nikolaou et al. [1998] propose several clustering algorithms to classify OLTP transactions according to their database reference patterns. However, these works focus on grouping database transactions instead of user sessions.

Several papers have discussed how to use user session information to improve system performance. Sapia [2000a,b] discusses the PROMISE approach, which provides the cache manager with user navigation patterns within sessions to make predictions for an OLAP system. Bowman and Salem [2004] present the Scalpel system, which detects certain query streams and optimizes them by using context-based predictions of future requests. Behm et al. [2004] present an example that reduces the number of SQL statements in a user session by embedding data-change operations in selection operations. Yao and An [2003] propose algorithms to analyze the semantic relationships between the queries of a user session, and propose three types of solutions to rewrite queries. Yao and An [2004] observe that the pseudo-code/queries provided by the TPC-W [2002] benchmark are not efficient, and try to rewrite these queries in different ways to improve system performance.

4 Distance-based Session Clustering

Given a set of database session instances, s = {s1, s2, ..., sn}, where each session instance contains a sequence of requests, i.e., si = < ri1 ri2 ... rim >, our task is to group the session instances into meaningful session classes. In this paper, each request is a query template and is represented by its template label. We assume that the number of tasks and the structure of the tasks in the target application are unknown. Thus, grouping sessions is an unsupervised learning step that tries to find hidden patterns in the data. Cluster analysis is a commonly used unsupervised learning method. The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.

Since similarity is fundamental to the definition of a cluster, a similarity measure is necessary. Given a pair of sessions, si and sj, we can estimate a similarity score between them. The similarity score is between 0 and 1, and a higher score indicates higher similarity. We use a set of representative session instances to represent a session class. Then, the similarity between two session classes is defined as the average similarity between their representative session instances. The dissimilarity between two session classes can be viewed as the distance between them; thus, distance-based clustering algorithms can be used to cluster session instances.

We propose a distance-based session clustering algorithm that groups session instances into different classes. We first consider each session instance as a session class, and calculate the distances between them. Then, the session groups are merged according to their inter-group distances, and the inter-group distances are updated correspondingly. The clustering procedure stops when all inter-group distances exceed a pre-defined distance threshold. The session similarity score is the combination of three scores: the coefficient score, the alignment score, and the neighborhood score. Among these scores, the alignment score is based on the idea of sequence alignment, which is a crucial operation in bioinformatics and genetics research.

4.1 Session Distance and Similarity Scores

The distance between session instances si and sj is defined as:

$$distance(s_i, s_j) = 1.0 - \alpha_1 \times csim(s_i, s_j) - \alpha_2 \times asim(s_i, s_j) - \alpha_3 \times nsim(s_i, s_j), \qquad (1)$$

where csim, asim, and nsim are the coefficient score, the alignment score, and the neighborhood score, respectively. α1, α2, and α3 are the similarity parameters, and their sum is 1.0.

The coefficient score is based on the Jaccard coefficient [Jaccard, 1912], which measures the similarity between sets: the similarity between two sets A and B is defined as the fraction of items they have in common, i.e., $|A \cap B| / |A \cup B|$. We treat each session instance as an unordered request set; the coefficient score, csim(si, sj), is then defined as $|s_i \cap s_j| / |s_i \cup s_j|$. The coefficient score rests on the assumption that session instances belonging to the same session class usually have a large number of requests in common.

Moreover, we observe that if two sessions belong to the same group, they are very likely to have similar request sequences. The coefficient score, however, does not reflect the request order. Therefore, we propose another scoring scheme based on the idea of sequence alignment, referred to as the alignment score. In sequence alignment, two or more strings are aligned so as to obtain the highest similarity score. For example, given two sequences X = "ABCDD" and Y = "ABED", the aligned sequences can be:

ABCDD
ABED-

Well-known sequence alignment algorithms are the Needleman-Wunsch algorithm [Needleman and Wunsch, 1970] and the Smith-Waterman algorithm [Smith and Waterman, 1981]. Both algorithms are based on dynamic programming, but the Needleman-Wunsch algorithm tries to find a globally optimal alignment, while the Smith-Waterman algorithm tries to find the longest or best matching subsequence pairs in the sequences.
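
As an illustration of the coefficient score and of formula (1), the following is a minimal Python sketch. The function names and the default weights are assumptions made for this example, and the alignment and neighborhood scores are taken as precomputed inputs.

```python
def csim(s_i, s_j):
    """Coefficient (Jaccard) score: sessions are treated as unordered request sets."""
    a, b = set(s_i), set(s_j)
    if not a and not b:
        return 0.0  # assumption: two empty sessions are given score 0
    return len(a & b) / len(a | b)

def session_distance(csim_score, asim_score, nsim_score, alphas=(0.4, 0.3, 0.3)):
    """Formula (1): 1.0 - a1*csim - a2*asim - a3*nsim, with a1 + a2 + a3 = 1.0."""
    a1, a2, a3 = alphas
    if abs(a1 + a2 + a3 - 1.0) > 1e-9:
        raise ValueError("similarity parameters must sum to 1.0")
    return 1.0 - a1 * csim_score - a2 * asim_score - a3 * nsim_score

x = ["A", "B", "C", "D", "D"]
y = ["A", "B", "E", "D"]
# Request sets {A, B, C, D} and {A, B, E, D}: 3 common requests out of 5 distinct.
print(csim(x, y))  # 0.6
```

Because the three weights sum to 1.0 and each score lies in [0, 1], the resulting distance also stays in [0, 1].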

In our case, we first use the Needleman-Wunsch algorithm to align two sessions (since we want a globally optimal alignment). Once the optimal alignment is obtained, we calculate the alignment score from the aligned session sequences as follows. We believe that the session sequences are controlled by certain logic/program code. The code may contain branches (such as if/else and switch-case statements) or cycles (such as for-loop and do-while statements) that may cause requests to be executed repeatedly. The branches and cycles can be observed in the aligned session instances. For example, from the aligned sequences

ABCDD
ABED-

we can see two matches (A and B), one branch (C/E) and one cycle (DD/D). We assign each match a score of 2, each branch a score of 1, and each cycle a score of 1. To normalize the score, we divide the assigned value by the length of the aligned session instances, which is defined as 2 × (num. of matches + num. of branches + num. of cycles). Here the assigned value is 2 × 2 + 1 + 1 = 6 and the length is 2 × (2 + 1 + 1) = 8, so the final alignment score is 6/8 = 0.75. The hidden logic/code in a real application is more complex than the above example, but the principle is still applicable.

In some situations, two session instances belong to the same session class but are not "near" each other. Thus, simply using the two similarity scores above as the distance metric is not enough; it is necessary to take global similarity into consideration. We call two session instances si and sj "neighbors" if the local distance between them is within a pre-defined threshold β2. The local distance is estimated from the combination of the coefficient score and the alignment score, assuming that the neighborhood score is 0. Thus, each session has a set of neighbors. Each session pair < si, sj > has a value nsim(si, sj), the fraction of common neighbors between them, which is called the neighborhood score. The idea of the neighborhood score comes from the ROCK clustering algorithm. Guha et al. [2000] observe that previous clustering algorithms only take the local similarity, i.e., the similarity between two points, into consideration. They suggest taking global similarity into consideration as well.

4.2 Group Representation and Group Distance

Computing session distances has a high space and time complexity. For example, given a data set that contains k session instances, $k^2$ scores need to be calculated. Meanwhile, to align two sequences of lengths m and n, the Needleman-Wunsch algorithm requires O((m + 1) × (n + 1)) space to store a matrix, O(m × n) time to compute the matrix, and then O(m + n) time to find an optimal path. In this section, several approaches are proposed to address the problem.

First, we observe that there are some repeated session instances in the data set (i.e., their request sequences are the same). These session instances belong to the same session class. Thus, we can represent repeated sessions by a single session si associated with its occurrence frequency, freq(si).

The next approach is concerned with session class representation. It is impractical to use all session instances to represent a session class. Thus, we use two sets together, the request set rset(gj) and the session set sset(gj), to represent a session class gj. rset(gj) contains all distinct requests that appear in gj, and sset(gj) contains a set of representative session instances. We observe that if session groups gi and gj are likely to be merged, they should have a large proportion of requests in common. Thus, we use the condition

$$\frac{|rset(g_i) \cap rset(g_j)|}{|rset(g_i) \cup rset(g_j)|} > \beta_3$$

to eliminate unrelated groups in advance when merging. The sset is used to compute the distance between session groups. Frequent session instances are usually in sset. The distance between two session groups is then defined as:

$$distance(g_i, g_j) = \frac{\sum_{s_m \in sset(g_i)} \sum_{s_n \in sset(g_j)} distance(s_m, s_n) \times freq(s_m) \times freq(s_n)}{\sum_{s_m \in sset(g_i)} \sum_{s_n \in sset(g_j)} freq(s_m) \times freq(s_n)} \qquad (2)$$

When two session groups are merged, their sset and rset are updated correspondingly. Furthermore, since it is impractical to store and compute distances between all session instances in the data, data sampling is used to reduce the computational complexity and space requirement.

4.3 Session Clustering Algorithm

The distance-based session clustering algorithm contains three steps. In the first step, a set of N session instances is randomly chosen from the data set. Then, repeated session instances are merged, session frequencies are updated, and a set of session instances S: (s1, s2, ..., sM) is obtained. We calculate the three similarity scores and the distance for each pair of session instances, and an M × M similarity matrix is obtained. We treat each session instance as a session group and calculate the distance between groups based on the average inter-group distance. Then, we agglomerate the session groups until the distance threshold β1 is reached. During this step, a hash table is constructed. Each entry of the table has the format < seq, grp >, where seq is a session sequence and grp is the group id. The hash table is used in the next step to assign the sessions on disk. When there is no space left in the hash table, infrequent items, i.e., sessions with low frequency, are replaced.

In the second step, we assign the sessions residing on disk to the identified session groups according to the distance between each session and the session groups. The algorithm first checks whether the hash table contains the given session. If the session is in the hash table, we look up the corresponding group and assign the session to it.
Otherwise, we find a set of candidate groups. The distances between the candidate groups and the session are estimated, and the group with the minimal distance is selected. If the minimal distance is larger than the distance threshold β1, we create a new session group for the session; otherwise, the session is put into the group with the minimal distance. In this step, the neighborhood score cannot be used as part of the distance, since it cannot be calculated unless all data is read into memory; the values of α1, α2, and α3 are adjusted correspondingly.

We observe that a certain number of session instances are very short. These session instances are either noise or belong to certain session classes. The distance-based clustering algorithm cannot handle such session instances effectively. For example, the session < a > has the same distance to session < a, b > as to session < a, c >. We therefore ignore session instances with length smaller than 3 in steps 1 and 2, and process them in the final step. These session instances are either discarded, merged into an existing session class, or used to form a new session class; domain knowledge helps in processing them. In addition, session classes with a small number of session instances (fewer than a predefined threshold) are treated as noise and removed.
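
The alignment score and the neighborhood score of Section 4.1 can be illustrated with the sketch below. It is one possible reading of the scoring rules, not the authors' code. The names, the ASCII gap symbol '-', and the following choices are assumptions: a run of alignment columns over the same request that contains both a matched column and a gap counts as one cycle, a gap with no adjacent match of the same request counts as a branch, each session is its own neighbor, and "fraction of common neighbors" is taken over the union of the two neighbor sets. The alignment itself (e.g., by Needleman-Wunsch) is assumed to be done beforehand.

```python
from itertools import groupby

GAP = "-"  # assumed gap symbol in the aligned sequences

def _column_symbol(x, y):
    """Request carried by an alignment column, or None for a mismatch."""
    if x == GAP:
        return y
    if y == GAP or x == y:
        return x
    return None

def alignment_score(aligned_a, aligned_b):
    """(2*matches + branches + cycles) / (2 * (matches + branches + cycles))."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = branches = cycles = 0
    columns = zip(aligned_a, aligned_b)
    for symbol, run in groupby(columns, key=lambda c: _column_symbol(*c)):
        run = list(run)
        gaps = sum(1 for x, y in run if GAP in (x, y))
        if symbol is None:
            branches += len(run)   # each mismatched column is a branch
        elif gaps == 0:
            matches += len(run)    # plain matches
        elif gaps < len(run):
            cycles += 1            # repeated request, e.g. DD aligned with D-
        else:
            branches += len(run)   # assumption: isolated gaps count as branches
    total = matches + branches + cycles
    return (2 * matches + branches + cycles) / (2 * total) if total else 0.0

def neighbor_sets(local_dist, beta2):
    """Neighbors of each session: local distance (nsim taken as 0) within beta2."""
    n = len(local_dist)
    return [{j for j in range(n) if local_dist[i][j] <= beta2} for i in range(n)]

def nsim(neighbors, i, j):
    """Neighborhood score: common neighbors over all neighbors of i and j."""
    union = neighbors[i] | neighbors[j]
    return len(neighbors[i] & neighbors[j]) / len(union) if union else 0.0

print(alignment_score("ABCDD", "ABED-"))  # 2 matches, 1 branch, 1 cycle: 6/8 = 0.75
```

The same functions accept lists of template labels instead of strings, since they only compare elements column by column.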

5 Analysis

Typical pattern clustering activity involves the following steps [Jain and Dubes, 1988]: (1) data/pattern representation, feature extraction or selection, (2) definition of a pattern proximity measure appropriate to the data domain, (3) clustering or grouping, (4) data abstraction, and (5) assessment of output.

In our clustering algorithm, the data representation, i.e., the representation of session instances, is a collection of request sequences, where each request is a query template. The pattern representation, i.e., the representation of session classes, consists of the request set and the representative session set. Feature selection and extraction are performed on the original session instances: certain session information, such as the user id, the connection id, and the original request (SQL query), is ignored.

Pattern proximity is usually measured by a distance function defined on pairs of patterns. A variety of distance measures are in use in the various communities. A simple distance measure such as the Euclidean distance can often be used to reflect the dissimilarity between two patterns. The pattern proximity of our clustering algorithm is based on the three similarity scores. This approach considers not only the local similarity between sessions (the coefficient score and the alignment score) but also the global similarity (the neighborhood score).

The clustering/grouping step can be performed in a number of ways. Traditionally, clustering techniques are broadly divided into hierarchical and partitioning methods. Hierarchical clustering is further subdivided into agglomerative (bottom-up) and divisive (top-down). While hierarchical algorithms build clusters gradually, partitioning algorithms learn clusters directly. In doing so, they either try to discover clusters by iteratively relocating points between subsets, or try to identify clusters as areas highly populated with data. Our session clustering algorithm is a bottom-up agglomerative hierarchical algorithm. The stopping criterion is a distance threshold rather than a number of clusters. Data sampling is used to reduce time and space complexity.

In the clustering context, a typical data abstraction is a compact description of each cluster, usually in terms of cluster prototypes or representative patterns such as the centroid. During the session clustering step, each session class is represented by the request set and the representative session set. This approach reduces the computational complexity without losing accuracy. After session classes are generated, we use a user access graph to represent each session class. The user access graph was proposed in [Yao and An, 2004] to describe the query execution order and the query relationships in session classes.

Our algorithm can handle noise as well. A random session instance, i.e., a session instance that does not belong to any session class, is usually far away from all session groups; it ends up in a group with a small number of session instances, which is discarded. If a session instance contains only a few random requests, it is very likely to be clustered into the correct session group, since the three similarity scores take the random requests into account as well.
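
The agglomerative step, with the frequency-weighted group distance of formula (2) and a distance threshold as the stopping criterion, can be sketched as follows. This is a simplified illustration under stated assumptions: every session of a group is kept in its sset (the paper keeps only representative sessions), the rset pre-filter with β3 and the data sampling are omitted, and the names `group_distance` and `cluster_sessions` are hypothetical.

```python
from itertools import combinations

def group_distance(g_i, g_j, dist, freq):
    """Formula (2): frequency-weighted average distance between two groups."""
    num = sum(dist[m][n] * freq[m] * freq[n] for m in g_i for n in g_j)
    den = sum(freq[m] * freq[n] for m in g_i for n in g_j)
    return num / den

def cluster_sessions(dist, freq, beta1):
    """Merge the closest pair of groups until every group distance exceeds beta1."""
    groups = [[i] for i in range(len(dist))]  # start with one group per session
    while len(groups) > 1:
        (a, b), d = min(
            (((a, b), group_distance(groups[a], groups[b], dist, freq))
             for a, b in combinations(range(len(groups)), 2)),
            key=lambda item: item[1],
        )
        if d > beta1:
            break  # stopping criterion: distance threshold, not cluster count
        groups[a] = groups[a] + groups[b]
        del groups[b]  # b > a, so index a stays valid
    return groups

# Toy data (made up): sessions 0 and 1 are close, session 2 is far from both.
dist = [[0.0, 0.1, 0.9],
        [0.1, 0.0, 0.8],
        [0.9, 0.8, 0.0]]
freq = [3, 1, 2]
print(cluster_sessions(dist, freq, beta1=0.5))  # [[0, 1], [2]]
```

Recomputing all pairwise group distances in every iteration keeps the sketch short; a practical implementation would update only the distances that involve the merged group.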

6 Performance Evaluation and Experimental Results

6.1 Performance Evaluation Strategy

In general, there are two ways to validate and evaluate the output of a clustering algorithm. One is to assess the clustering algorithm against certain criteria or measurements. The criteria can be external criteria, such as pre-defined class labels; internal criteria, such as quantities that involve the data set itself; or relative criteria, such as a comparison with other clustering schemes produced by the same algorithm with different parameter values. With regard to this approach, we can evaluate our algorithm with different threshold values and similarity parameters. The performance of the session clustering algorithm depends on the selection of the distance threshold values (β1, β2, β3) and the session similarity parameters (α1, α2, α3). Among the three thresholds, β1 is the most important one, since the other two values are related to β1 and can be derived from it. If the distance thresholds are too low, many session classes are generated. If they are too high, many session instances that belong to different classes may be merged into a single session group. The selection of threshold values depends on the specific application. A small threshold value can be used for applications in which the difference between session classes is significant, i.e., the distance between them is "far"; otherwise, large threshold values are used since they may discriminate the trivial differences between session classes. The choice of similarity parameters is also an important factor. The parameters can be viewed as the weights of the three similarity scores. Different applications may have different parameter values, and the parameters need to be adjusted accordingly.

Another clustering validation and evaluation approach is actually an assessment of the data domain.
The clustering system can be assessed by a domain expert using domain knowledge. Cluster interpretability and cluster visualization are two common methods of this assessment. For this approach, we can build a user access graph, a weighted directed graph, to represent each session class (see [Yao et al., 2004b] for details). In this sense, we can evaluate user access graphs instead of session classes using the following two methods: request classification and next-request prediction.

The evaluation methods are described as follows. Given a request sequence r: < r1, ..., rn > that is part of a given session s of the testing data, we can estimate the probability that r belongs to a session class gi, and the probability that the next request is rm, from the corresponding models (user access graphs). In particular, we can estimate the values of P(r|gi), P(gi), and P(rm|r, gi) from the user access graphs. Meanwhile, the distance-based algorithm is used to find the nearest session class gj. The next request rn+1 can be obtained from the testing data. For request classification, we say it is a "hit" if gi and gj are the same. For request prediction, we say it is a "hit" if rm and rn+1 are the same. The performance measure is the F-measure, which has been used in information retrieval to measure retrieval performance. The F-measure is defined as

$$F\text{-}Measure = \frac{2 \times precision \times recall}{precision + recall},$$

where precision is defined as the ratio of the number of correctly clustered sessions/requests to the total number of estimated sessions/requests, and recall is the hit rate, that is, the ratio of the number of correctly clustered sessions/requests to the total number of sessions/requests. A higher F-measure value means better overall performance. This approach can be viewed as a statistics-based cluster interpretation technique.

6.2 Application Domain

To test our ideas in the project, we have been using a clinic OLTP application as a test bed. Each day, the client applications installed in the branches make connections to the central database server, which runs Microsoft SQL Server 7.0. In each connection, a user may perform one or more tasks, such as checking in patients, making appointments, displaying treatment schedules, explaining treatment procedures and selling products. We obtained a database trace log (400M bytes) that contains 81,417 events belonging to 9 different applications, such as front-end sales, daily report, monthly report, data backup, and system administration. The target application of this paper is the front-end sales application. After preprocessing the trace log, we obtain 7,244 SQL queries and 18 database connection instances of the front-end sales application. The queries are classified into 190 query templates, and 18 request sequences are obtained (one sequence per connection). From the 18 request sequences, 1510 session instances are obtained. In the experiments, we choose 721 session instances belonging to 4 request sequences as the input of the clustering algorithm, and use the other 789 sessions as the testing data.

6.3 Clustering Results with Different Threshold and Similarity Parameter

We test the performance of the clustering algorithm in two ways. We first test the number of clusters generated with different similarity parameters in the sampling step. We set the neighborhood similarity parameter to 0.3, vary the coefficient parameter from 0.0 to 0.7, and change the alignment parameter correspondingly. The result is shown in Fig. 3.
The figure shows that more clusters are generated when the coefficient parameter is large, and fewer clusters are generated when the alignment parameter is large.

Fig. 3. Number of clusters vs. various similarity parameter values (x-axis: coefficient weight; y-axis: number of clusters)

In the next step, we use 0.4, 0.3, 0.3 as the similarity parameters in the first step, and 0.5, 0.5, 0.0 in the second step. We use different threshold values (β1) to cluster the session instances. The result is shown in Fig. 4. From the figure, we observe that the number of clusters increases when the threshold value increases; however, the rate of increase after pruning is lower than that before pruning. Finally, we set the threshold value to 0.5 and remove any clusters with fewer than 10 session instances. 21 session clusters (or classes) are found from the training data.

Fig. 4. Number of clusters vs. various threshold values (pruning threshold = 10; x-axis: threshold value; curves: number of clusters before pruning and after pruning)

6.4 Performance with requests classification and request prediction

Figure 5 shows an example of user access graphs obtained from our test data. Each node in the graph is a query template, and an edge is represented by ek: (vi, vj, σvi→vj), where σvi→vj is the probability of vj following vi, which is called the confidence of the edge. A session instance of user access graph P1 is shown in Table 1. The instance can be interpreted as employee '1025' retrieving the profile and treatment schedule of customer '1074' at branch 'scar' on 2003-03-04. We treat some query data values as the parameters of the graph. As a result, graph P1 has four parameters: user id (g_uid), customer id (g_cid), branch id (g_bid), and login date (g_date), shared by all nodes.

Fig. 5. User access graphs.
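To make the edge definition concrete, the following is a minimal sketch of a user access graph in Python. It is not the implementation used in the experiments: the class name `UserAccessGraph`, its methods, and the query template names are hypothetical, and the sketch assumes that the confidence of an edge is estimated as the fraction of observed transitions out of $v_i$ that lead to $v_j$.

```python
class UserAccessGraph:
    """Directed graph: nodes are query templates, edges carry a confidence."""

    def __init__(self):
        # counts[vi][vj] = number of times template vj directly followed vi
        self._counts = {}

    def add_session(self, templates):
        """Record one session instance, given as an ordered list of templates."""
        for vi, vj in zip(templates, templates[1:]):
            successors = self._counts.setdefault(vi, {})
            successors[vj] = successors.get(vj, 0) + 1

    def confidence(self, vi, vj):
        """Estimated probability that vj follows vi (assumed: relative frequency)."""
        successors = self._counts.get(vi, {})
        total = sum(successors.values())
        return successors.get(vj, 0) / total if total else 0.0

    def predict_next(self, vi, min_confidence=0.5):
        """Return the most likely successor of vi, or None if its confidence
        is below min_confidence (the predictor then abstains)."""
        successors = self._counts.get(vi)
        if not successors:
            return None
        vj = max(successors, key=successors.get)
        return vj if self.confidence(vi, vj) >= min_confidence else None


# Hypothetical session instances of one session class.
graph = UserAccessGraph()
graph.add_session(["q_login", "q_profile", "q_schedule"])
graph.add_session(["q_login", "q_profile", "q_billing"])
graph.add_session(["q_login", "q_profile", "q_schedule"])

next_after_profile = graph.predict_next("q_profile", min_confidence=0.6)
```

Raising `min_confidence` makes the predictor abstain more often, which is one simple way to trade coverage for accuracy.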

Our experimental results show that the overall precision of the classification algorithm is 89% and the recall is between 60% and 80%. The overall precision of the prediction algorithm is 84%, and the recall is between 50% and 70%. The reason is that we prefer high precision in order to avoid the high cost of a wrong classification or prediction; as a consequence, the coverage (recall) is lower.
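The trade-off between precision and coverage can be illustrated with a short sketch. The exact definitions used in the experiments are not given here, so the code below assumes one common convention: precision is the fraction of issued predictions that are correct, and recall (coverage) is the fraction of all requests that are correctly predicted. The function name and the sample labels are hypothetical.

```python
def precision_recall(predictions, actuals):
    """Compute (precision, recall) when the predictor may abstain.

    predictions: predicted labels, with None meaning "no prediction made".
    actuals:     the true labels, in the same order.
    """
    issued = [(p, a) for p, a in zip(predictions, actuals) if p is not None]
    correct = sum(1 for p, a in issued if p == a)
    precision = correct / len(issued) if issued else 0.0
    recall = correct / len(actuals) if actuals else 0.0
    return precision, recall


# Five requests; the predictor abstains on two of them.
predicted = ["a", "b", None, "a", None]
actual = ["a", "b", "c", "b", "a"]
precision, recall = precision_recall(predicted, actual)
```

In this example three predictions are issued and two are correct, so precision is 2/3 while recall is 2/5. Abstaining on uncertain requests keeps precision up at the expense of recall.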

7 Conclusion

In this paper, we discuss our approach to clustering database user sessions. The results of our approach can be used to tune the database system and to predict incoming queries based on the queries already submitted, which in turn can improve database performance through effective query prefetching, query rewriting and cache replacement. The work presented in the paper has a broader impact on the database and data mining fields. Although the data set used in the paper comes from a clinic application, the ideas presented can be applied to any database-based application, such as ERP or CRM applications that may contain hundreds or even thousands of different types of sessions. They can also be applied to Web log analysis and DNA sequence analysis. In the future, we plan to apply the ideas proposed in this paper to OLAP trace logs.

