SeqTrie: An Index for Data Mining Applications

Witold Andrzejewski and Tadeusz Morzy
Institute of Computing Science, Poznań University of Technology
Piotrowo 2, 60-965 Poznań, Poland
{wandrzejewski,tmorzy}@cs.put.poznan.pl

Abstract. Large databases of sales data are not amenable to manual analysis. In order to extract useful knowledge from them, one must use data mining algorithms (the so-called market basket analysis). Unfortunately, these algorithms, depending on data and parameters, may generate a large number of patterns. These patterns are easier to analyse than the raw data set; nevertheless, they must still be analysed by the end user of the data mining application. Such analysis involves executing many queries on complex data types that are not well supported by commercially available database management systems. In this paper, we present an index that may be used to improve the performance of such queries and, therefore, the performance of the analysis of data mining results.

1 Introduction

Database systems, data warehouses and other data repositories are becoming increasingly popular. It is estimated that since the year 2000, the size of the stored data has doubled. Each year, 5 million terabytes of new data are stored. Effective access to and analysis of such large volumes of data is today an essential problem, and such analysis is impossible to do “by hand”. To solve this problem, a large number of different techniques for knowledge discovery in databases (also known as data mining) have been developed. The purpose of these techniques is to discover new, useful, correct and understandable patterns in large databases. Such patterns are very useful in many fields, such as science, medicine, finance and marketing. Many different types of data may be analysed using data mining algorithms including, but not limited to, stock prices, cash register data or web server logs.

Particularly interesting is the analysis of sales data, also known as market basket analysis. Through market basket analysis one may obtain frequent itemsets, association rules or sequential patterns. Market basket analysis algorithms frequently involve searching for supersequences or supersets of a given sequence or set. Unfortunately, the number of discovered patterns is very often large, and thus they need to be stored in a separate database for further analysis. Such analysis also involves searching for supersequences or supersets (the so-called subset or set subsequence queries). All of the aforementioned patterns have a complex structure. Frequent itemsets are sets of categorical data, association rules are represented by two itemsets, and sequential patterns are sequences of itemsets. Such complex data types, although possible to store, are not well supported in commercial database systems. Thus, the search for supersets or supersequences is very costly.

Retrieval of sets has been widely investigated in the literature and many indexing schemes have been developed, such as signature files [7], inverted files [3], RD-Trees [6], S-Trees [4] or the hierarchical bitmap index [12]. Several indexing schemes have also been proposed for sequences of atomic values, such as the ISO-Depth index [16], the SEQ-Join index [10] or the SEQ family of indexes [14]. To the best of our knowledge, the only index for sequences of sets developed so far was proposed by us in [2]. However, this index was designed for sequences of timestamped sets and its main task was to support a very special case of set subsequence queries, where sets were also timestamped.

In this paper we propose a new indexing scheme capable of efficient retrieval of sets or sequences of sets based on subset/non-contiguous subsequence containment. We present the physical structure of the index and we develop algorithms for query processing. The rest of the paper is organized as follows. In Section 2 we introduce basic definitions used throughout the paper. Section 3 contains an overview of the related work. We present our index in Section 4. Experimental evaluation of the index is presented in Section 5. Finally, the paper concludes in Section 6 with a summary and a future work agenda.

2 Basic Definitions

Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of literals called the items. A non-empty subset of $I$, denoted $A = \{a_1, a_2, \ldots, a_m\}$, is called an itemset. An ordered list of itemsets, denoted $S = \langle s_1, s_2, \ldots, s_k \rangle$, where $s_i$ are itemsets, is called a sequence. Itemsets that are a part of the sequence are called elements. We assume that all of the elements in the sequence are numbered by consecutive positive integers starting with 1. We define the length of the sequence, denoted $len(S)$, as the number of all the items in all of the elements of the sequence. Given a sequence $S$ and an item $i$, we say that the item $i$ is contained within the sequence $S$, denoted $i \in S$, if there exists an element in the sequence $S$ that contains the item $i$. Given sequences $S$ and $T$, we say that the sequence $T$ is a subsequence of $S$, denoted $T \subseteq S$, if the sequence $T$ may be obtained from the sequence $S$ by removing some items from it and removing empty elements if such occur. Conversely, we say that the sequence $S$ is a supersequence of the sequence $T$.

A database is a set of either sets or sequences, called the database entries. We denote the database as $DB$. Sets from the database are denoted as $A^{id}$ and sequences from the database are denoted as $S^{id}$, where $id$ is the unique identifier of the database entry. Without loss of generality, we assume those identifiers to be consecutive integers. Given the database of sets $DB$ and a query set $Q$, we define a subset query as an operation of retrieving identifiers of all the sets from the database that are supersets of the query set. Formally, a subset query returns the set $\{id : A^{id} \in DB \wedge Q \subseteq A^{id}\}$. Given the database of sequences $DB$ and a query sequence $Q$, we define a set subsequence query as an operation of retrieving identifiers of all the sequences from the database that are supersequences of the query sequence. Formally, a set subsequence query returns the set $\{id : S^{id} \in DB \wedge Q \subseteq S^{id}\}$. The set returned by a query is called the result set.

We define the support of the item $i$, denoted $sup(i)$, as the number of sequences from the database that contain this item. We define a set of extended items as a set containing elements of the form $x_i$, where $x \in I$ and $i$ is a positive integer. We define a string as a special case of a sequence, in which all of the elements contain only a single item. We shall denote strings as “$x_1 x_2 \ldots x_n$”, where $x_i$ are items or extended items.
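To make the definitions above concrete, the following is a minimal Python sketch of $len(S)$, the subsequence test $T \subseteq S$, the set subsequence query and $sup(i)$, evaluated by a plain scan of the database without any index. The function names and the representation of a sequence as a list of Python sets are assumptions made for this illustration only. The subsequence test relies on the observation that $T \subseteq S$ holds exactly when the elements of $T$ can be matched, in order, to distinct elements of $S$ that are their supersets, so a greedy earliest match suffices.

```python
def seq_len(s):
    """len(S): the number of all items in all elements of the sequence."""
    return sum(len(element) for element in s)

def is_subsequence(t, s):
    """T ⊆ S: match each element of t to the earliest unused superset element of s."""
    pos = 0
    for element in t:
        while pos < len(s) and not element <= s[pos]:
            pos += 1
        if pos == len(s):
            return False
        pos += 1  # the next element of t must match a later element of s
    return True

def set_subsequence_query(db, q):
    """Identifiers of all database sequences that are supersequences of q."""
    return {sid for sid, s in db.items() if is_subsequence(q, s)}

def support(db, item):
    """sup(i): the number of database sequences that contain the item."""
    return sum(1 for s in db.values() if any(item in element for element in s))

# Hypothetical toy database: identifier -> sequence of itemsets.
db = {
    1: [{1, 3}, {4}, {3, 4}],
    2: [{1, 3, 4}, {2, 3}],
    3: [{2, 4}, {1, 4}],
}
print(set_subsequence_query(db, [{1}, {4}]))
print(seq_len(db[1]), support(db, 2))
```

A scan of this kind inspects every database entry for every query, which is the cost that an index is meant to avoid.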

3 Related Work

3.1 Indexing of sets

Indexing of sets has been widely researched in the literature. Many solutions, capable of supporting different types of queries on set-valued attributes, have been developed. Different types of queries include: equality, subset, superset and similarity queries. Such queries search the database for sets that are, respectively, equal to, supersets of, subsets of and similar to the query set. Amongst the developed indexes one may mention: inverted file [3], RD-Tree [6], signature files [7], S-Tree [4], Group Bitmap Index [13] and Hierarchical Bitmap Index [12].

3.2 Indexing of sequences

Most research on indexing of sequence data is focused on three distinct areas: indexing of time series, indexing of strings of symbols, and indexing of web logs (sequences of timestamped symbols). Most indexes proposed for time series support searching for similar or exact subsequences by exploiting the fact that the elements of the indexed sequences are numbers. This is reflected both in the index structure and in the similarity metrics. Often, a technique for reducing the dimensionality of the problem is employed, such as the discrete Fourier transform [1]. String indexes usually support searching for subsequences based on identity or similarity to a given query sequence. The most common distance measure for similarity queries is the Levenshtein distance [8], and index structures are built on a suffix tree [17, 15] or a suffix array [11].

Indexing of web log data differs significantly from indexing of strings. The main difference is that each element in such a sequence is assigned a timestamp that must be taken into consideration when processing a query. Several different approaches have been considered so far. The first one uses a special transformation technique to transform the original problem into the well-researched problem of indexing of sets [14]. Other approaches include the ISO-Depth index [16], which is based on a trie structure, and the SEQ-Join index [10], which uses a set of relational tables and a set of B+-tree indexes. The ISO-Depth index was extended by us in [2] to support timestamped set subsequence and subsequence similarity queries on databases of timestamped sequences of sets.

4 SeqTrie Index

We will now proceed to the presentation of our index for sequences of non-timestamped sets. The main idea is based on the well-known trie structure [5], and on a tree node numbering scheme similar to the ones used to index XML data [9, 16]. The general idea for the index is as follows. We transform sequences of sets from the database to strings. Next, such strings are stored in a trie structure. After that, each of the nodes in the trie is assigned some additional data that allows testing whether one node is in a subtree of another node. Next, nodes are grouped into lists according to the corresponding items. The lists obtained in the last step form the index. Finally, the trie structure is removed as it is no longer needed. To perform a query, we read the lists corresponding to all the items in the query sequence and check whether they contain nodes that were lying on a single path in the trie.

Let us start with the description of the transformation of a sequence to a string. In order to perform such a transformation we need to assume an arbitrary total order among the items. Given such an order, we sort all of the items in all of the elements. Next, we convert all of the items to extended items by appending to each of the items the number assigned to its respective element. As the next step, we convert each of the elements in the sequence to a string by concatenating all of the extended items it contains. Finally, we concatenate all of the obtained strings.

Example 1. As an example, we shall convert the sequence $\langle \{5, 3\}, \{3, 2, 1\}, \{2, 1, 2\} \rangle$ to a string. Because we use integers as items, we shall assume a total order defined by the relation $\leq$. After sorting, we obtain the following sequence: $\langle \{3, 5\}, \{1, 2, 3\}, \{1, 2, 2\} \rangle$. Next we perform the conversion of items to extended items: $\langle \{3_1, 5_1\}, \{1_2, 2_2, 3_2\}, \{1_3, 2_3, 2_3\} \rangle$.
Finally, we concatenate all of the extended items in elements, and then concatenate the obtained strings to create the final string: “$3_1 5_1 1_2 2_2 3_2 1_3 2_3 2_3$”.

Before we present the algorithm for index construction, we shall briefly describe the trie structure and introduce our modifications to it. The trie structure is a tree used for storing strings. An empty trie is composed of only a single root node. When inserting a string to the trie, one starts from the root and then travels along the edges labeled by the consecutive items from the string. If such an edge does not exist, it is created. Every inserted string is terminated with a special item, which denotes the end of the string. Our algorithm for index construction uses a slightly modified version of the trie structure. The only modification introduced by us removes the special item terminating the strings. Instead, in the node where the inserted string ends, we store the identifier of that string. Such a modification does not change the complexity of the trie building algorithms; however, it allows us to remove the nodes that are unnecessary during construction of our index. Steps for SeqTrie index creation are given by Algorithm 1.

Algorithm 1 An algorithm for SeqTrie index creation.
1. Convert all of the sequences from the database to strings.
2. Insert all of the strings to the modified trie along with their identifiers from the database.
3. Number all of the nodes of the trie using Depth First Search order, and for each of the nodes assign the biggest node number in its subtree. We shall denote the node's number in DFS order as $d$ and the biggest node number in the node's subtree as $m$.
4. For each of the distinct items in the database create an empty list (these lists are called the appearance lists).
5. Start traversing the trie in DFS order and for each of the nodes:
   (a) Decompose the extended item $i_e$, labelling the edge pointing to this node, to the item $i$ and the element number $e$.
   (b) Append an entry $\langle (d, m), e \rangle$, where $(d, m)$ is an interval created from the values $d$ and $m$, to the list corresponding to the item $i$.
6. For each of the nodes where a string has ended, create a list labeled by the node's number and containing the list of sequence identifiers stored in that node. These lists are called the position lists.
7. Appearance lists and position lists form the SeqTrie index. The trie structure may now be discarded, as it is no longer necessary.

Table 1. Exemplary database

(a) Before conversion
$S^1$: $\langle \{1, 3\}, \{4\}, \{3, 4\} \rangle$
$S^2$: $\langle \{1, 3, 4\}, \{2, 3\} \rangle$
$S^3$: $\langle \{2, 4\}, \{1, 4\} \rangle$

(b) After conversion
$S^1$: “$1_1 3_1 4_2 3_3 4_3$”
$S^2$: “$1_1 3_1 4_1 2_2 3_2$”
$S^3$: “$2_1 4_1 1_2 4_2$”
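The steps of Algorithm 1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: extended items are represented as (item, element number) tuples, trie nodes as plain dictionaries, the children of a node are visited in sorted tuple order, and the names `to_string` and `build_seqtrie` are hypothetical.

```python
from collections import defaultdict

def to_string(seq):
    """Step 1: sort items within each element and tag them with the element number."""
    return [(item, e) for e, element in enumerate(seq, start=1)
            for item in sorted(element)]

def build_seqtrie(db):
    """Steps 2-7: build the modified trie, number it in DFS order, extract the lists."""
    def new_node():
        return {"children": {}, "ids": []}

    root = new_node()
    for sid, seq in db.items():
        node = root
        for ext in to_string(seq):
            node = node["children"].setdefault(ext, new_node())
        node["ids"].append(sid)          # identifier stored where the string ends

    appearance = defaultdict(list)       # item -> entries in DFS order
    position = {}                        # node number d -> sequence identifiers
    counter = 0

    def visit(node, ext):
        nonlocal counter
        counter += 1
        d = counter
        item, e = ext
        entry = [d, d, e]                # m is filled in after the subtree is numbered
        appearance[item].append(entry)
        if node["ids"]:
            position[d] = list(node["ids"])
        for child_ext in sorted(node["children"]):
            visit(node["children"][child_ext], child_ext)
        entry[1] = counter               # biggest node number in this subtree

    for ext in sorted(root["children"]):
        visit(root["children"][ext], ext)

    lists = {i: [((d, m), e) for d, m, e in entries]
             for i, entries in appearance.items()}
    return lists, position               # the trie itself is no longer needed

db = {1: [{1, 3}, {4}, {3, 4}], 2: [{1, 3, 4}, {2, 3}], 3: [{2, 4}, {1, 4}]}
appearance, position = build_seqtrie(db)
print(appearance[4])
print(position)
```

With the child ordering assumed here, running the sketch on the database of Table 1(a) yields the appearance and position lists given in Table 2. The recursive traversal is chosen for brevity; very long strings would call for an explicit stack.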

Example 2. In this example, we shall build an index for a small database of three sequences presented in Table 1(a). First, we transform all of the sequences from the database to strings, to obtain the database of strings shown in Table 1(b). Next, we store these strings in a trie, to obtain the structure presented in Figure 1. As a final step we extract appearance lists and position lists, to obtain the lists presented in Tables 2(a) and 2(b).

Table 2. An index for exemplary database

(a) Appearance lists
1: (1, 8), 1; (11, 12), 2
2: (4, 5), 2; (9, 12), 1
3: (2, 8), 1; (5, 5), 2; (7, 8), 3
4: (3, 5), 1; (6, 8), 2; (8, 8), 3; (10, 12), 1; (12, 12), 2

(b) Position lists
5: $S^2$
8: $S^1$
12: $S^3$

Fig. 1. Trie for exemplary database

As one may notice, the intervals labelling the trie nodes have a special property. Node A is in the subtree of node B if, and only if, the interval which labels the node A is contained within the interval of node B. This property is used by the subsequence query algorithm (Algorithm 2).

Sets may be processed using the same algorithms as sequences (both construction of the index and query execution) if we treat them as sequences with only a single element. However, if we plan to use the index only for storing sets, then the whole index structure may be simplified. In such a case, there is no need to store data about the order of the items (intervals) and the index degrades to a structure similar to the inverted file [3].

Example 3. In this example we shall find all supersequences of the sequence $Q = \langle \{1\}, \{4\} \rangle$ using the index from the previous example. We begin query processing by converting the query sequence to a string. The resulting string is as follows: $Q$ = “$1_1 4_2$”. We also need to create and initialize the semiResults array. The first extended item from the query string is $1_1$, which means that the first appearance list to analyze is the list corresponding to the item 1. The first entry on this list is $\langle (1, 8), 1 \rangle$. Because our semiResults array is empty, we may start analyzing the appearance list for item 4 (which is the next item from the query string) by calling the procedure queryRek. Because the next item in the query string comes from the next set, we search for all entries on this list such that their intervals are contained within the interval (1, 8) and their corresponding set number is greater than 1. The first such entry is the entry $\langle (6, 8), 2 \rangle$. Because we have found an entry corresponding to the last item from the query string, we store the interval (6, 8) in the semiResults array. Next (and

Algorithm 2 An algorithm for set subsequence query execution.
1. Convert the query sequence $Q$ to the string $Q$ using the same total order as the one used to build the index.
2. Prepare an empty array called semiResults.
3. Let $i_{e_q}$ be the first extended item from the query string. Decompose it to the item $i$ and the element number $e_q$.
4. For each of the consecutive entries $\langle (d, m), e \rangle$ from the appearance list associated with the item $i$, if the interval $(d, m)$ is not contained within any of the intervals from semiResults, call queryRek($d$, $m$, $e$, $e_q$, 2) (Algorithm 3).
5. For each of the intervals $(d, m)$ stored in the semiResults array, read all entries from the position lists that are labeled by a number contained within the interval $(d, m)$. These entries contain the result of the query.

Algorithm 3 Recursive procedure queryRek used by the algorithm for set subsequence query execution.
1. The procedure's parameters are: $d'$, $m'$, $e'$, $e'_q$ and pos.
2. if pos = len(Q) + 1 then store the interval $(d', m')$ in the semiResults array.
3. if pos
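The query procedure can be sketched in Python on the index of Table 2. Because the remaining steps of Algorithm 3 are not available in this text, the matching rule below is an assumption inferred from Example 3 and the interval property: a candidate entry must lie inside the interval of the previously matched entry, and its element number must equal the previous one when the query item stays in the same query element, or be greater when the query item starts the next element. The names `contains`, `query` and `query_rek` are hypothetical, and positions are 0-based here, unlike in Algorithm 2.

```python
# Index from Table 2: item -> [((d, m), e), ...] and node number -> identifiers.
APPEARANCE = {
    1: [((1, 8), 1), ((11, 12), 2)],
    2: [((4, 5), 2), ((9, 12), 1)],
    3: [((2, 8), 1), ((5, 5), 2), ((7, 8), 3)],
    4: [((3, 5), 1), ((6, 8), 2), ((8, 8), 3), ((10, 12), 1), ((12, 12), 2)],
}
POSITION = {5: [2], 8: [1], 12: [3]}

def contains(outer, inner):
    """True if the interval `inner` lies within the interval `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def query(q_string, appearance, position):
    """q_string is the query converted to a list of (item, element number) tuples."""
    semi_results = []

    def query_rek(prev_interval, prev_e, prev_eq, pos):
        if pos == len(q_string):              # every query item has been matched
            semi_results.append(prev_interval)
            return
        item, eq = q_string[pos]
        for interval, e in appearance.get(item, []):
            if not contains(prev_interval, interval):
                continue                      # not in the subtree of the previous node
            # Assumed rule: same query element -> same element number,
            # next query element -> strictly greater element number.
            if (eq == prev_eq and e == prev_e) or (eq != prev_eq and e > prev_e):
                query_rek(interval, e, eq, pos + 1)

    item, eq = q_string[0]
    for interval, e in appearance.get(item, []):
        if not any(contains(r, interval) for r in semi_results):
            query_rek(interval, e, eq, 1)

    result = set()
    for low, high in semi_results:            # step 5 of Algorithm 2
        for d, ids in position.items():
            if low <= d <= high:
                result.update(ids)
    return result

print(query([(1, 1), (4, 2)], APPEARANCE, POSITION))
```

For the query of Example 3 the sketch stores the interval (6, 8), as described there, and the only position list label inside the stored intervals is 8, which identifies $S^1$. A practical implementation would exploit the ordering of the lists instead of scanning them linearly; the linear scans are kept here only for readability.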
