A Parallel Data Mining Architecture for Massive Data Sets

Arno Knobbe
Syllogic B.V.
Postbus 2729, 3800 GG Amersfoort, The Netherlands
email: [email protected]

Felicity George
High Performance Research Center, Tandem Business Unit, Compaq
Wallace View, Hillfoots Road, Stirling FK9 5PY, Scotland
email: [email protected]

Abstract

This paper discusses a parallel data mining architecture which provides the capability to mine massive data sets highly efficiently, scanning millions of rows of data per second. In this architecture the mining process is divided into two distinct components. A parallel server, Compaq's Data Mining Server (DMS)¹, provides a set of data mining primitives which are utilized by a data mining client, Syllogic's DMT/MP, which implements the actual data mining algorithms. The parallel architecture and the primitives used to operate on the data will be discussed, together with the mining algorithms' use of these primitives. Performance figures will be presented for both the primitives and the high-level mining algorithms.

1. Introduction

A common trait of many data mining algorithms is that they search a large space, considering numerous different alternatives and scanning the data repeatedly. In order to handle these large demands on the database and to be able to work with industrial-sized databases in an interactive fashion, a parallel architecture is required ([4], [6], [7], [11], [17]). This paper presents such an architecture, setting it in the context of other possible parallel designs, and discussing the entire mining process from simple low-level data manipulation primitives up to the data mining algorithms. The data mining solution presented here has been designed primarily for speed and scalability; in order to process large amounts of data rapidly, simplicity of design has also been crucial. Other features of the system are that it is easy to extend to add further primitives as required, and that, whilst the server is targeted at data mining, the primitives are general enough that other more traditional decision support tools could also make use of the server with relatively little effort.

Section 2 discusses possible ways of parallelising the data mining process. Section 3 looks at the server providing the mining primitives, DMS. The primitives it provides and their performance are discussed. Section 4 presents ways in which a data mining client could use the server, focussing in particular on two well known data mining techniques within DMT/MP, association analysis and decision tree induction. Section 5 discusses future directions for DMS. Finally, section 6 presents a brief summary of the material covered.

¹ A commercial version of DMS, called InfoCharger, is available through the Tektonic division of Compaq.

2. Methods for Parallelising Data Mining

For compute-intensive applications, parallelisation is an obvious means for improving performance and achieving scalability. A variety of techniques may be used to distribute the workload involved in data mining over multiple processors. A good classification of different approaches to parallel processing for data mining is presented in [4]. Four major classes of parallel implementations are distinguished. The classification tree in Figure 1 demonstrates this distinction. The first distinction made in this tree is between task-parallel and data-parallel approaches.

Figure 1: Methods of Parallelism (Task Parallelism: Divide and Conquer, Task Queue; Data Parallelism: Record Based, Attribute Based)

Task-parallel algorithms assign portions of the search space to separate processors. The task-parallel approaches can again be divided into two groups. The first group is based on a Divide and Conquer strategy that divides the search space and assigns each partition to a specific processor. The second group is based on a task queue that dynamically assigns small portions of the search space to a processor whenever it becomes available. A task-parallel implementation of decision tree induction will form tasks associated with branches of the tree. A Divide and Conquer approach seems a natural reflection of the recursive nature of decision trees. Experiments on a task-parallel implementation of decision trees that seems to work well due to the independent nature of the sub-tasks are presented in [4]. However, task-parallel implementations suffer from load balancing problems caused by uneven distributions of records between branches. The success of a task-parallel implementation of decision trees seems to be highly dependent on the structure of the data set.

The second class of approaches, called data-parallel, distributes the data set over the available processors. Data-parallel approaches come in two flavors. A partitioning based on records will assign non-overlapping sets of records to each of the processors. Alternatively, a partitioning of attributes will assign sets of attributes to each of the processors. Attribute-based approaches are based on the observation that many algorithms can be expressed in terms of primitives that consider every attribute in turn. If attributes are distributed over multiple processors, these primitives may be executed in parallel. For example, when constructing decision trees, at each node in the tree all independent attributes are considered, in order to determine the best split at that point. Attribute-based approaches work well if a good load balancing can be achieved by a careful distribution of attributes. Unfortunately, the number of attributes is usually of the same order of magnitude as the number of processors, which makes an uneven distribution of attributes more likely. Also, the set of attributes that is considered during the course of the algorithm may differ, for example because certain attributes have been discarded or completely analyzed. Finally, many data mining primitives consider multiple attributes at once, which may require duplication of data in order to minimize communication between processors.

Record-based partitioning is based on the notion that most primitives common in data mining that consider all records can be executed in parallel if the records are distributed over the processors. The execution of a single primitive will be farmed out to each of the processors. The results of the operations are then combined into a single result. Record-based approaches are dependent on the fact that every record has an equal probability of being considered during the execution of a primitive. For a random distribution of records this is the case. The number of records is usually several orders of magnitude larger than the number of processors. This will facilitate an equal distribution of the data over the available processors. This last parallelisation method is the one used by DMS.
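To make the record-based scheme concrete, here is a minimal sketch (illustrative Python, not DMS code) of how a single primitive such as a histogram can be farmed out over record partitions and the partial results consolidated; the attribute name and helper functions are assumptions for the example.

```python
from collections import Counter
from multiprocessing import Pool

def partial_histogram(partition):
    """Compute a histogram (value -> count) for one partition of records."""
    return Counter(row["colour"] for row in partition)  # "colour" is an illustrative attribute

def parallel_histogram(partitions, workers=4):
    """Farm the primitive out over the record partitions and merge the partial results."""
    with Pool(workers) as pool:
        partials = pool.map(partial_histogram, partitions)
    merged = Counter()
    for p in partials:              # consolidation step: combine per-partition results
        merged.update(p)
    return merged

if __name__ == "__main__":
    # Records are split into non-overlapping, roughly equal partitions.
    partitions = [
        [{"colour": "red"}, {"colour": "blue"}],
        [{"colour": "red"}, {"colour": "red"}],
    ]
    print(parallel_histogram(partitions, workers=2))  # Counter({'red': 3, 'blue': 1})
```

In DMS the role of the pool is played by the manager, which farms the primitive out to the servers and consolidates their responses, as described in the next section.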

3. DMS: Architecture, Primitives and Performance

This section describes DMS, the data mining server, discussing the architecture, the primitives offered by the server, the benefits and drawbacks of this design, and some basic performance analysis.

3.1. System Architecture

As mentioned in the previous section, DMS implements a data-parallel record-based parallelisation of data mining primitives. In Figure 2 an overview of DMS is shown; basically, the system consists of a manager and a number of servers which each process a subset of the data. Requests are received by the manager from the client data mining tool. The request may be handled entirely by the manager if it is a simple query; for example, a request for a list of data sets currently available to the client. If the request requires processing of the data, it is farmed out to the servers. The manager then consolidates responses from the servers and returns the required results to the client. The manager also maintains a catalog detailing the data which DMS currently has access to.

Figure 2: An Overview of DMS's Architecture (the client tool connects via an ORB interface to the DMS manager, which coordinates a number of servers)

Figure 3 shows the server components of DMS. Each server consists of four modules:

• The factory, which inputs data into DMS from a variety of sources including text files and Tandem NSSQL and Oracle databases. The factory converts the data into COREs (Column Represented Data Objects), DMS's internal data representation.
• The engine, which undertakes all operations on DMS data, such as building cross-tables and fetching rows of data to be passed through to the front end client.
• The cache, which stores and manages COREs and other internal DMS objects.
• The fileserver, which manages the permanent storage of and access to COREs on disk.

Figure 3: The internal structure of the DMS servers (each server contains a CORE engine, CORE factory, CORE cache and CORE file server; the factory reads source data and the file server manages the CORE data held on disk)

The high data processing speeds attained by DMS are due to several factors, primarily:

• Effective parallelisation of the data.
• Efficient encoding of the data into COREs; these structures are such that they are compact in memory and fast to access. The encoding also makes the COREs data-type independent, which simplifies the code and further compacts the data.
• Simple and optimized algorithms are used to manipulate the COREs.
• Storing data by column rather than by row, reducing disk access times and memory requirements for column based operations.
• Zoom-in functions enabling fast processing of subsets of the data, removing the need to scan the entire data.
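The exact CORE layout is not given in the paper; the sketch below assumes a dictionary-style column encoding purely to illustrate why a column-oriented, data-type independent representation is compact and fast to scan. The class and method names are hypothetical, not the DMS implementation.

```python
from array import array

class Core:
    """Illustrative column store: values are replaced by integer codes into a dictionary,
    so operations such as histograms only touch small, uniformly sized integers."""
    def __init__(self, values):
        self.dictionary = sorted(set(values))                 # distinct values of the column
        index = {v: i for i, v in enumerate(self.dictionary)}
        self.codes = array("i", (index[v] for v in values))   # one compact code per row

    def histogram(self, row_subset=None):
        """Count occurrences per value, optionally restricted to a subset of row numbers."""
        rows = row_subset if row_subset is not None else range(len(self.codes))
        counts = [0] * len(self.dictionary)
        for r in rows:
            counts[self.codes[r]] += 1
        return dict(zip(self.dictionary, counts))

# Example: the Area column of Table 1 below.
area = Core([1, 1, 1, 1, 2, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6])
print(area.histogram())  # {1: 4, 2: 1, 3: 2, 4: 2, 5: 4, 6: 3}
```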

3.2. System Primitives

DMS can be accessed via a number of function calls, which can be divided into three major categories:

• administration
• visualization
• operations

The sections below deal with each of these areas in turn. The full interface to DMS is not presented in this paper; [9] provides a full specification of the interface, and [8] is an accompanying user guide.

3.2.1. Administrative Primitives

The primary purpose of the administration interface is to manage the creation and storage of CORE objects upon which the data mining functions operate. COREs may be implemented as Binary encoded Attributes (BATS) [11], bitmaps or lists of row identifiers; however, the interface to these data mining objects is independent of the data-structure. Some examples of administrative functions are as follows:

• Register a data source; DMS is made aware of a data source, which may be for example a text file or a relational table, and is told how to access it. From this point on, the client may query DMS about the data source, asking for a list of attribute names and data types for example, or extracting data from the source into COREs.
• Define the parallel partitioning of CORE data required for a given data source.
• Create CORE objects for a set of attributes from a specified data source.
• Delete CORE objects from DMS.
• Remove a table from DMS; this erases all knowledge of the external table from DMS.
• Define relationships between source tables; hierarchical data in a star or snowflake schema can be described and loaded by DMS.
• Transform data; any set of COREs may be transformed via basic arithmetic and logical operations to create a new CORE; this may be done during data analysis or also in a data pre-processing phase.
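The actual call signatures are specified in [9] and are not reproduced here. Purely as an illustration of the order in which these administrative primitives would typically be used, the sketch below drives a stub client; every method name and argument is hypothetical, not the real DMS/InfoCharger API.

```python
class DmsClientStub:
    """Minimal stand-in for a DMS client; it only records the calls made,
    so the administrative workflow can be shown end-to-end."""
    def __init__(self):
        self.log = []
    def __getattr__(self, name):
        def call(*args, **kwargs):
            self.log.append((name, args, kwargs))
        return call

dms = DmsClientStub()
dms.register_source("sales", kind="text", path="sales.txt")               # register a data source
dms.define_partitioning("sales", servers=4)                               # record-based partitioning
dms.create_cores("sales", ["Area", "Age_of_Property", "Cost_of_property"])
dms.transform("sales", new_core="Price_band",                             # derive a CORE via a transformation
              expr="Cost_of_property / 50000")
dms.remove_table("sales")                                                 # erase all knowledge of the source
print([name for name, _, _ in dms.log])
```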

3.2.2. Data Description/Visualization Primitives

This section gives some examples of functions used to retrieve information about the structure and contents of source data and of COREs. Some of these functions are as follows:

• Fetch a list of the tables currently available to DMS. These tables are source data tables, possibly Oracle tables or text files, which have been registered with DMS.
• For a given table, get a list of the attribute names and their data types.
• For a given table, get a list of the COREs already created for this table. Only data for which COREs have been created can be operated on in DMS.
• Fetch a number of rows of data; the number of rows, and the first row to start fetching from, may be set by the user. A subset of the attributes forming a row may be fetched, or the entire row. Rows from a subset of the rows in the table may also be fetched.

3.2.3. Operational Primitives

The client tool operational interfaces provide support for manipulating CORE objects in a variety of ways. For example:

• Creating histograms.
• Creating cross-tables.
• Creating aggregate tables.
• Identifying subsets of data (zooming).
• Creating association tables.

For a given column, a histogram is a one-dimensional representation of the distribution of values within the column. Cross-tables are the n-dimensional counterparts of histograms. By default cross-tables are indexed by a number of columns, and each cell in the table is a count of the number of occurrences of all column index values in the same row. Other aggregate functions may also be calculated within the cross-table cells; for example minimum, maximum, average, standard deviation and sum

aggregates of any attribute may be returned in the cross-table. Example: let us assume that we have a text file or relational database containing a table which relates to property purchase and has, in its list of attributes, Age_of_Property, Area and Cost_of_property. Table 1 shows the data in these attributes.

Area   Age   Cost
1      10    30,000
1      1     85,000
1      2     100,000
1      1     65,000
2      10    20,000
3      8     30,000
3      2     20,000
4      10    200,000
4      2     150,000
5      1     100,000
5      1     95,000
5      1     80,000
5      1     105,000
6      2     45,000
6      2     35,000
6      2     50,000

Table 1: An example data set

To analyze this data, we would firstly register the table in DMS, then request that COREs be built for the above attributes. We could then request DMS to create a cross-table of counts on age and area; Table 2 would be returned.

Area \ Age   1    2    …   8    9    10
1            2    1                  1
2                                    1
3                 1        1
4                 1                  1
5            4
6                 3

Table 2: A cross-table counting age and area

The result set is actually returned as a packed list of values. Also, the user may specify a threshold below which results should not be returned. Counts of 0 are never returned to the user. If the average cost of a house for each area and age were required, Table 3 would be returned.

Area \ Age   1         2         …   8        9    10
1            75,000    100,000                     30,000
2                                                  20,000
3                      20,000        30,000
4                      150,000                     200,000
5            95,000
6                      43,333

Table 3: A cross-table averaging house price over age and area

In addition to providing operations that scan the entire data set efficiently, we must also have the ability to zoom in on areas of data. Many data mining algorithms will start by scanning the entire data set to find simple patterns, then progressively reduce the amount of data they are processing as they search for more complex patterns in the data. In order to effectively make use of the zooming behavior of most data mining algorithms, an efficient scheme is required that allows the dynamic construction and destruction of subsets. Lightweight indexes called pCOREs (predicate COREs) provide the means for storing temporary subsets of records. pCOREs can be created by specifying a predicate (a set of conditions) either on the complete table or on a subset defined by existing pCOREs. When performing data mining primitives, these pCOREs can be used to limit the set of rows that are examined. During a single search for patterns a large number of pCOREs will be created, and at any time multiple pCOREs may exist.

Example: we could now decide to focus on properties which are only one year old by creating a pCORE which represents the subset shown in Table 4.

Area   Age   Cost
1      1     85,000
1      1     65,000
5      1     100,000
5      1     95,000
5      1     80,000
5      1     105,000

Table 4: A subset of the data in Table 1 in which only one year old properties are selected

This subset would be represented by a list of row numbers:

2 4 10 11 12 13

Table 5: A pCORE representing properties which are a year old

One popular data mining technique which is aided by slightly more advanced primitives is association analysis. Using this technique, we take a set of transactions, where each transaction consists of a set of items, and form an association rule X => Y, where X and Y are sets of items. Such a rule indicates that transactions that contain X are likely to contain Y also. For example, in the retail industry a transactional table might consist of rows which each hold a basket identifier, one product found in the basket and some other information pertaining to the sale such as price. We could then use association analysis (explained more fully in section 4.2) on this table to discover which products commonly occur together in the same basket. Table 6 shows an example with 7 customers and 4 different products, stored as COREs. Association analysis could provide us with the results shown in Table 7, where the x- and y-axes are the product-ids and the counts are the number of times that given combinations occur. The diagonal counts the number of duplicates of an item in a group. This result shows us, for example, that the strongest correlations between products are between mortgage and insurance and between loan and insurance, which both occur in the same group 5 times.

For this type of analysis, we need to understand the concept of sets of rows, such as the set of rows forming one basket or all purchases by one customer. Primitives exist within DMS to allow basic association analysis and work with such sets of rows. For example, when creating a pCORE, a grouping CORE may be input, such as a CORE representing basket number, and a pCORE may be created in which every member of a group (basket) is included in the subset whenever one of its members satisfies the given predicate. In this way, subsets of data can be built for transactional association analysis representing subsets such as 'all baskets that contain product A'.

customer-id (nCORE)   product-id (nCORE)
1                     savings
1                     mortgage
1                     insurance
1                     loan
2                     mortgage
2                     insurance
2                     savings
3                     savings
3                     mortgage
4                     loan
4                     insurance
5                     mortgage
5                     insurance
5                     loan
6                     loan
7                     savings
7                     mortgage
7                     insurance
7                     loan
7                     insurance

Table 6: Some example customer data

             mortgage   savings   insurance   loan
mortgage     0          -         -           -
savings      4          0         -           -
insurance    5          4         1           -
loan         3          2         5           0

Table 7: An association table that shows how often combinations of two products are found in groups
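As a sketch of the grouped counting behind Table 7 (illustrative code, not the DMS primitive itself): group the product column by the customer-id column, count every pair of items within each group, and let pairs of identical items feed the diagonal. Running this on the Table 6 data reproduces the counts above, for example the two counts of 5.

```python
from collections import Counter
from itertools import combinations

# The Table 6 data: (customer-id, product-id) rows.
rows = [
    (1, "savings"), (1, "mortgage"), (1, "insurance"), (1, "loan"),
    (2, "mortgage"), (2, "insurance"), (2, "savings"),
    (3, "savings"), (3, "mortgage"),
    (4, "loan"), (4, "insurance"),
    (5, "mortgage"), (5, "insurance"), (5, "loan"),
    (6, "loan"),
    (7, "savings"), (7, "mortgage"), (7, "insurance"), (7, "loan"), (7, "insurance"),
]

# Group the product column by the customer-id column (the 'grouping CORE').
groups = {}
for customer, product in rows:
    groups.setdefault(customer, []).append(product)

# Count every unordered pair of items within each group; a pair of identical
# items contributes to the diagonal (duplicates of an item in a group).
pair_counts = Counter()
for items in groups.values():
    for a, b in combinations(items, 2):
        pair_counts[frozenset((a, b))] += 1

print(pair_counts[frozenset(("mortgage", "insurance"))])  # 5, as in Table 7
print(pair_counts[frozenset(("loan", "insurance"))])      # 5
print(pair_counts[frozenset(("insurance",))])             # 1: the duplicate insurance of customer 7
```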

3.3. DMS: Pros and Cons

This section takes a brief look at the positive features of DMS, and at some of the drawbacks of the architecture. On the positive side, DMS has the following attributes:

• Speed: performance figures are presented in section 3.4.
• Scalability: provided that the data set is not too small, and the amount of data is proportional to the number of processors, this system scales very well, as will be shown in the next section.
• Generality: the general primitives provided so far, and the ability to add new primitives easily, make the system adaptable and useful for many data mining applications, and also for other decision support systems. The nature of the interface allows front-end data mining tools to hook in to DMS easily.
• Extensibility: new features can easily be added to DMS. The simplicity and modularity of design largely account for the extensibility.

A few drawbacks of the architecture used in DMS are:

• Performance will degrade if a CPU is short of memory relative to the number of attributes and rows of data that it is processing. However, the memory requirements are not unreasonable; given a 1 million row subset of data which one CPU has to handle, an average of 1 Mbyte of memory will be required for each attribute used in any given operation.
• As mentioned previously, a drawback of this mode of parallelism is that if a subset of the rows is used in an operation, it may be that these rows are not evenly distributed over processors and the load balance is poor. However, all parallelism techniques have similar 'worst case' scenarios, and sensible data distribution should in the main prevent this problem from occurring too often.
• A final point about the architecture is that it is less effective for small data sets. There is some overhead in client/manager and manager/server communications, and also with the manager accumulating and consolidating server results. If each server processes a substantial amount of data, this overhead is negligible and the manager and communications do not form a bottleneck. However, if each server has only, say, a few thousand rows of data, the parallelism will not be worthwhile.

3.4. Performance Analysis

This section presents some performance figures for DMS. This is not intended to be a thorough analysis of system behavior and characteristics. The intention is simply to present some performance figures to give the reader a feel for the speed and scalability of DMS. Measurements are presented in terms of basic primitives. These figures were taken from tests run on a 4 processor 200 MHz Pentium Pro NT server with 256 Mbytes of memory. The largest system that DMS has run on to date is a 64 processor NT system (16 nodes of 4 processors each). A data set of 100 million rows was used in this case, and the performance scaled nearly linearly in comparison to the performance on a single node. This is simply because most time is spent scanning and operating on the data; obviously as the system size grows the communications and data passing through the manager increase, but at a system size of 64, the time this takes is not very significant.

3.4.1. CORE Creation and storage requirements

In general, CORE creation is bounded by the time it takes to scan data from the source database. Retrieving the data is the most time consuming part of the process; constructing COREs from the data is very straightforward. To give a feel for creation times, if we were to create 10 COREs for a text table of 100,000 rows, this would take a little under 8 seconds on one processor of our NT server. To create 10 COREs for a text table of 500,000 rows on one CPU of the NT server would take 37.5 seconds; the sub-linear scaling is due to overhead in the manager, which is constant. The attributes created in this test all have cardinality of approximately 100; higher cardinality attributes will take more time to convert to COREs. It should be noted that once created, COREs may be reused indefinitely, and need only be reloaded if the source data changes.

The storage required per CORE depends on the cardinality of the attribute it encodes. One attribute of a 1 million row table, with a cardinality of, say, 200, would require a total of 1 Mbyte of storage in DMS. An attribute with cardinality 1 million would require 4 Mbytes. An attribute of cardinality 2 would require only 125 Kbytes. Currently, the only other permanent data stored in DMS is meta data about the tables and attributes it has access to. Other structures, such as pCOREs, are dynamically created and destroyed within one DMS session. Therefore, it can be seen that the storage requirements are very modest. Given a table of 1 million rows, and 40 attributes, with an average cardinality of 100, DMS would require, depending on the exact cardinalities of the columns, between 10 and 40 Mbytes of permanent storage to encode the whole table. In comparison, a standard relational DBMS would require in the region of 200 Mbytes.

The memory required by the system depends on the number of COREs which are operated on concurrently. DMS works most efficiently when all COREs can reside in memory. COREs are cached and can be overwritten in memory if there is insufficient space to hold them all, but this clearly leads to increased I/O costs.
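The storage figures quoted above are consistent with an encoding that packs each value into 1, 8, 16 or 32 bits depending on cardinality. The widths are an assumption (the paper only states the resulting sizes), but the estimate below reproduces the 125 Kbyte, 1 Mbyte and 4 Mbyte examples for a 1 million row column.

```python
def core_bytes(rows, cardinality):
    """Estimate CORE storage assuming values are packed into 1, 8, 16 or 32 bits
    (assumed widths; the paper only gives the resulting sizes)."""
    if cardinality <= 2:
        bits = 1
    elif cardinality <= 256:
        bits = 8
    elif cardinality <= 65536:
        bits = 16
    else:
        bits = 32
    return rows * bits / 8

rows = 1_000_000
print(core_bytes(rows, 2))          # 125000.0 bytes, roughly 125 Kbytes
print(core_bytes(rows, 200))        # 1000000.0 bytes, roughly 1 Mbyte
print(core_bytes(rows, 1_000_000))  # 4000000.0 bytes, roughly 4 Mbytes
```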

3.4.2. Creating cross-tables

The figures presented in this section give a feel for DMS's performance when creating cross-tables. The tests are run on one processor and four processors to give some feel for scalability, using tables of 2,500,000 rows per processor; thus the table sizes used are 2,500,000 rows and 10,000,000 rows. In all cases, the COREs operated on are in memory before the operation. If COREs must be read from disk first, there is a simple overhead depending on the I/O speed of the system. The source table holds 12 attributes, but since DMS stores each attribute individually, the width of a row does not affect the performance. Factors which do affect the performance are the cardinalities of the attributes used (different cardinalities are encoded in COREs in different ways), and the number of attributes used concurrently.

cardinalities of   operation                   performance in seconds
COREs used                                     1 CPU     4 CPUs
2                  histogram                   0.062     0.078
112                histogram                   0.094     0.156
65536              histogram                   0.438     0.890
2,2                cross-table (count)         0.218     0.234
112,94             cross-table (2D, count)     0.516     0.625
112,5,2            cross-table (3D, count)     1.484     1.515
2,4,104,94         cross-table (4D, count)     1.797     1.937
2,2,104            cross-table (2D, average)   1.266     1.281
112,5,2,94         cross-table (3D, average)   1.828     1.984

Table 8: Performance of DMS creating cross-tables for 2,500,000 and 10,000,000 row tables

It can be seen from Table 8 that the system scales well from 1 processor to 4. There are two major reasons that the performance significantly differs when scaling up to 4 processors. Firstly, if the size of the result table is very large, then sending results from each server to the manager will take some time, and be a noticeable factor. This is the case, for example, when the 4-dimensional cross-table is built. This effect also shows when a histogram is built for an attribute with cardinality 65536. In this case the problem is further compounded by the second reason for performance differences: very simple operations such as building histograms do not provide much work for the servers even when each is dealing with 2,500,000 rows, and thus the communications overhead of several servers becomes noticeable.

The best processing time achieved by DMS, in terms of the number of rows processed per second, was in building a simple, low cardinality histogram. In this case roughly 40 million rows can be processed per second. The slowest operation was the largest histogram over 4 processors, processing roughly 1.25 million rows per second per CPU. An 'average' data mining operation might well be creating a two-dimensional cross-table holding simple counts, which can be done at a speed of between roughly 5 and 12 million rows per second per CPU.
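As a quick arithmetic check of the throughput figures quoted above, using the 2,500,000 rows handled per processor and the Table 8 timings:

```python
rows_per_cpu = 2_500_000          # rows per processor in the Table 8 tests
print(rows_per_cpu / 0.062)       # ~40.3 million rows/s: low-cardinality histogram, 1 CPU
print(rows_per_cpu / 0.218)       # ~11.5 million rows/s per CPU: 2D count cross-table, cardinalities 2,2
print(rows_per_cpu / 0.516)       # ~4.8 million rows/s per CPU: 2D count cross-table, cardinalities 112,94
```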

3.4.3. Creation and use of pCOREs

As explained in section 3.2.3, pCOREs are used to identify subsets of the data upon which the mining algorithm wishes to focus. pCOREs make access to a subset faster, as only the data which is being used need be scanned, and they also provide a simple access method to a subset. The supported data mining algorithms need some method of identifying subsets of the data, and building a cross-table would take considerably longer than the times shown in Table 8 if the subset had to be identified during processing. Creating a pCORE from an entire table takes slightly longer than building a histogram or a cross-table on the same data. However, once a pCORE has been constructed, it may be reused many times in the application of any operation to the subset of data the pCORE represents. The figures given in Table 9 show the time taken to generate cross-tables on a subset of the data. Comparing these results to the generation of the same cross-tables on all the data, shown in Table 8, the improvement in performance is clear. The operations are between 10.5 and 5.8 times faster than those on the full data, processing 7.5% of the data.

cardinalities of   operation                   time (seconds)
COREs used                                     1 proc    4 proc
112,5,2            cross-table (3D, count)     0.141     0.187
112,5,2,94         cross-table (3D, average)   0.234     0.343

Table 9: Generating cross-tables for a subset (7.5%) of the data

3.4.4. Data Mining Performance

Finally, it is helpful to put these figures in the context of data mining operations with one simple example; building decision trees for the tables used in the previous sections. Two decision trees were built for each table, as shown in Table 10. Three attributes have a cardinality of around 100, the other nine attributes have a cardinality of less than 10. As can be seen from Table 10, building a 24 node decision tree using 10 million rows of data takes only 17.5 seconds.

number of          number of nodes generated   time (seconds)
attributes input   (number of splits)          1 proc    4 proc
11                 11 (5)                      12.3      15.4
12                 24 (10)                     14        17.5

Table 10: Building decision trees using DMS

4. Client

This section describes some of the implementations of data mining algorithms in the Syllogic DMT/MP. It demonstrates how the data mining primitives described in the previous section support the efficient implementation of a range of algorithms. Before discussing details of DMT/MP’s algorithms, it is worth explaining some of the concepts of our approach, which is based on the idea of a level-wise search for patterns, where each level represents patterns of similar complexity. The search algorithm will start examining simple regularities. Regularities with sufficient support will be used to produce a set of candidate patterns of a higher complexity, which will form the next level in the search process. The process will continue recursively until all potential patterns have been examined. ([14], [16], [20]). A large number of algorithms can be expressed as variations on such a top-down search for patterns. The application of a level-wise algorithm to the discovery of a variety of patterns (called 'sentences' by the authors) is discussed in [16]. Examples include the discovery of association rules ([2], [10], [16], [20]), 'strong rules', and inclusion dependencies.

Another application of the level-wise algorithm is described in [14]. It discusses the discovery of keys and foreign key relations. The level-wise algorithm is conceptually quite simple; a search algorithm will produce candidate patterns, and each candidate will have to be validated by scanning the data. The primitives that are executed in order to validate the patterns perform very similar operations on similar sets of records. It is possible to reuse results from previous primitives during the execution of subsequent primitives. In particular, the similarity in operation between a pattern and its extensions can be exploited in order to increase performance. One form of reusing results from higher levels comes from the observation that the set of records that support a particular pattern is a superset of the set of records supporting a more complex pattern derived from the original pattern. In the process of extending simple patterns, the examination of records in the database-table will zoom into an increasingly smaller subset of records. This zooming-effect can be exploited by storing the set of records that support the original pattern and focusing on these records during the validation of derived patterns. As described in previous sections, pCOREs are used to provide this zooming.

4.1. Decision Trees

The induction of decision trees is a popular data mining operation ([1], [4], [5], [6], [12], [18]). It involves the recursive partitioning of a data set, with the goal of producing subsets of records with a common value for a particular target attribute. Each partitioning (a split in the decision tree) is based on a simple test involving one attribute from a set of independent attributes. By splitting on attributes that have a high correlation with the target attribute, a decision can be made about the value of the target attribute. The resulting decision tree can be used to classify new records for which the value of the target attribute is unknown. Each node in the decision tree represents a set of records. The root of the tree represents the whole data set that was used for induction. At each node, the set of records associated with that node

is divided into multiple mutually exclusive subsets on the basis of the test performed in the node. Our implementation will use pCOREs as instantiations of the sets of records associated with each of the nodes except for the root node. Computing the correct pCORE for each node has two purposes. The first purpose is to guarantee that primitives performed at the node at hand produce results for the correct subset of records. The second purpose is to optimally benefit from the zooming effect by restricting all operations to the relevant records. Full table scans are therefore prevented.

At each node in the tree a consideration of all independent attributes is required in order to determine the best attribute to split on. This involves considering the correlation between every independent attribute and the target attribute. A range of splitting criteria is described in the literature: information gain, gain ratio etc. [18]. The computation of each of these criteria is based on information about the frequency of every pair of values. This information is obtained from the current subset of records by executing the DMS primitive to create a cross-table with the current pCORE as an argument. The cross-table primitive provides all relevant information about the dependency between two attributes and can be used to compute the splitting criterion of choice. Depending on the splitting algorithm, the information in a cross-table is combined to provide a single measure of dependency. A cross-table can also be used in the presence of value groupings or discretisations, both for the independent attribute and for the target attribute. The counts in rows pertaining to values belonging to the same group are simply summed. Although this grouping on the fly is very flexible, it may be inefficient if the simplification of cross-tables is performed frequently. A grouping of values during pre-processing is more desirable in this case. Transformation functions as described in previous sections can be applied to define such groupings.

The decision tree will be constructed in a depth-first manner. At each node in the tree a cross-table will be constructed for every available independent attribute. The cross-tables will be used to compute the best split. The most determining independent attribute will be used as a basis for the test that will distribute records over several sub-trees. For each sub-tree a new pCORE is created from the current pCORE, using the associated test as a restricting predicate. When a pCORE is created, it is used to build a sub-tree recursively. After the sub-tree is created the associated pCORE is deleted, and the same process is repeated for the remaining sub-trees. If none of the available independent attributes provides a relevant split, a leaf is created and the algorithm will continue with a different branch, until the tree is complete.

The above described depth-first algorithm will create and manage several pCOREs at the same time. At any time at most d pCOREs exist for a tree of depth d (an empty tree has depth 0). A pCORE will only exist as long as any work is required on subsets of this pCORE. One can easily prove that the amount of work performed at each level of the tree is proportional to O(m*n + p(d)), where m is the number of attributes and n is the number of records. p(d) is the total number of nodes at level d in the tree. For a binary tree, this amounts to p(d) = O(2^d).

Alternative implementations of decision trees that require full table scans at each node, and hence do not benefit from the zooming effect, will require O(m*n*p(d)) time to compute all relevant information at level d, which is clearly slower.
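Below is a compact sketch of the depth-first induction loop described above, with stand-ins for the cross-table and pCORE primitives: lists of row numbers play the role of pCOREs, and the splitting criterion is reduced to a simple count-based score rather than information gain. It is illustrative code under those assumptions, not the DMT/MP implementation.

```python
from collections import Counter

def cross_table(column, target, rows):
    """Stand-in for the DMS cross-table primitive: counts of (value, target value)
    pairs over the given row subset (the 'pCORE')."""
    table = Counter()
    for r in rows:
        table[(column[r], target[r])] += 1
    return table

def best_split(columns, target, rows):
    """Pick the attribute whose per-value majority classes cover the most rows
    (a crude criterion standing in for information gain, gain ratio, etc.)."""
    def score(table):
        per_value = {}
        for (v, t), c in table.items():
            per_value.setdefault(v, Counter())[t] += c
        return sum(max(cnt.values()) for cnt in per_value.values())
    scored = [(score(cross_table(col, target, rows)), name)
              for name, col in columns.items()]
    return max(scored)[1]

def build_tree(columns, target, rows, depth=0, max_depth=2):
    classes = Counter(target[r] for r in rows)
    if len(classes) == 1 or depth == max_depth or not columns:
        return classes.most_common(1)[0][0]          # leaf: majority class
    attr = best_split(columns, target, rows)
    tree = {}
    for value in {columns[attr][r] for r in rows}:
        # New 'pCORE': the subset of the current rows satisfying the test.
        subset = [r for r in rows if columns[attr][r] == value]
        rest = {n: c for n, c in columns.items() if n != attr}
        tree[value] = build_tree(rest, target, subset, depth + 1, max_depth)
    return {attr: tree}

# Toy run: predict whether Cost > 60,000 from Area and Age (values from Table 1).
area = [1, 1, 1, 1, 2, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6]
age  = [10, 1, 2, 1, 10, 8, 2, 10, 2, 1, 1, 1, 1, 2, 2, 2]
cost = [30, 85, 100, 65, 20, 30, 20, 200, 150, 100, 95, 80, 105, 45, 35, 50]
target = ["high" if c > 60 else "low" for c in cost]
print(build_tree({"Area": area, "Age": age}, target, rows=list(range(16))))
```

In DMS, the per-node cross-tables and the subset construction in this loop would be issued as server primitives, so each recursion step scans only the rows of the current subset rather than the full table.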

4.2. Association Rules

Another popular paradigm in data mining is the discovery of association rules ([1], [2], [5], [10], [13], [15], [21]). Given a set of transactions, where each transaction consists of a set of items, an association rule is an expression X => Y, where X and Y are sets of items. Such a rule indicates that transactions that contain X are likely to contain Y also. Two measures, confidence and support, are used to describe the quality of a rule. The confidence of X => Y refers to the percentage of transactions that contain Y, within the set of transactions that contain X. Intuitively, it can be viewed as a measure for the truth of the implication. The support of X => Y refers to the percentage of transactions that contain both X and Y within the complete set of transactions. The support of a rule is therefore a measure for the generality of this rule.

A popular algorithm for discovering association rules, called Apriori, is described in [2]. It is a two-stage algorithm of which only the first stage requires the examination of the data. This first stage is aimed at the discovery of all frequent item-sets, sets of items that occur in many transactions. In the second stage, these frequent item-sets are combined to form rules.

The algorithm for discovering all frequent item-sets is an example of the level-wise algorithm discussed earlier in section 4. The algorithm starts by testing all item-sets consisting of a single item. All frequent item-sets are then extended to form the set of candidate item-sets for the next level. This process is continued until no new candidates can be generated or all levels have been examined.

Each item-set represents a set of transactions. It is clear that the set of transactions that supports an item-set X is a superset of the set of transactions that supports XA, where X is an item-set and A is a single item. Therefore, in order to compute the support of a new candidate XA, only the transactions that support X will have to be examined. Again we will use pCOREs as instantiations of the sets of supporting transactions. This will enable us to benefit from the zooming-effect inherent in the level-wise algorithm. The computation of the support of an item-set X will result in the creation of a new pCORE that represents all transactions supporting X. At each level in the search-space a number of overlapping pCOREs will exist, each associated with a single frequent item-set. These pCOREs will be used as input to the candidate-tests in the next level.

In our previous discussion, no indication was given as to how transactions and items are represented in the Data Mining Server. Our approach supports two types of representation, each requiring different primitives and a different search algorithm. However, the general approach described in the previous paragraphs applies to both representations.

4.2.1. Propositional representation

The first representation supported by our solution is a propositional one. A transaction is represented by a single record, and each item is represented by a single binary attribute. A transaction can therefore be seen as a bit-vector, of which the ones indicate the items belonging to the transaction. A bit-vector is clearly a good representation of a transaction if transactions are likely to contain many items. It is less favorable if transactions contain only a few items on average.

We assume that new candidate item-sets are derived from the frequent item-sets in the preceding level. Any extension XA of a frequent item-set X that does not contain any infrequent subsets will be tested. In order to test XA, a histogram primitive is executed that will count the number of transactions that contain a "1", within the set of transactions that support X. If XA is frequent, it will be reported as such, and the set of supporting transactions will be instantiated as a pCORE. pCOREs will be deleted as soon as no possible extensions can be made from the associated item-sets.

Propositional representations can be viewed as rules of binary expressions. In this context, a rule A => B simply states that if attribute A = True, it is likely that B = True. We can easily extend this by also allowing negative binary expressions, or even expressions involving nominal or numeric attributes. Our solution currently supports rules of conjunctions of expressions of the type A = value. For binary attributes it is possible to set whether just positive values, or both values, are used (effectively treating the attribute as nominal). It should be noted that including nominal attributes in the search process greatly increases the size of the search-space. A binary attribute can be used in a single way to extend a given item-set. A nominal attribute of cardinality n can give rise to n new candidates. Fortunately, the support of each of these extensions can be computed with a single call to the histogram primitive.

Toivonen ([22]) describes some of the effects that allowing negative values in association rules has on the size of the rule-set that is returned. It turns out that attributes that rarely have the value "1" appear very frequently in the rules that are produced. It may become very hard to manually inspect the rules that were discovered due to an abundance of rules containing negative expressions.
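The following is a sketch of the propositional level-wise search described above: transactions are bit-vectors, each frequent item-set keeps its list of supporting rows (playing the role of a pCORE), and candidates are tested only against the supporting rows of their parent item-set. The infrequent-subset pruning step of Apriori is omitted for brevity, and the code is illustrative rather than the DMT/MP implementation.

```python
def frequent_itemsets(transactions, items, min_support):
    """Level-wise search over a propositional (bit-vector style) representation.
    For each frequent item-set we keep the list of supporting row numbers,
    which plays the role of a pCORE for the next level."""
    # Level 1: single items, scanning all transactions.
    current = {}
    for item in items:
        rows = [i for i, t in enumerate(transactions) if t[item]]
        if len(rows) >= min_support:
            current[frozenset([item])] = rows
    frequent = dict(current)

    while current:
        nxt = {}
        for itemset, rows in current.items():
            for item in items:
                if item in itemset:
                    continue
                candidate = itemset | {item}
                if candidate in nxt:
                    continue
                # 'Histogram' restricted to the supporting rows of the parent item-set.
                subset = [r for r in rows if transactions[r][item]]
                if len(subset) >= min_support:
                    nxt[candidate] = subset
        frequent.update(nxt)
        current = nxt
    return frequent

# Toy transactions as dictionaries of binary attributes.
items = ["savings", "mortgage", "insurance", "loan"]
transactions = [
    {"savings": 1, "mortgage": 1, "insurance": 1, "loan": 1},
    {"savings": 1, "mortgage": 1, "insurance": 1, "loan": 0},
    {"savings": 1, "mortgage": 1, "insurance": 0, "loan": 0},
    {"savings": 0, "mortgage": 0, "insurance": 1, "loan": 1},
    {"savings": 0, "mortgage": 1, "insurance": 1, "loan": 1},
]
for itemset, rows in frequent_itemsets(transactions, items, min_support=3).items():
    print(sorted(itemset), "support =", len(rows))
```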

4.2.2. Transactional representation

The alternative representation of transactions supported by our solution uses multiple records to describe a single transaction. For each item contained in a transaction, a single record exists. A record consists of two fields: one field to describe the item, and one field to identify the transaction to which it belongs, so related records can be grouped. A transactional representation is attractive when transactions contain few items on average.

The same level-wise algorithm is used to discover frequent item-sets. However, modified primitives are required to accommodate the transactional representation. Both the pCORE construction primitive and the histogram primitive have their transactional counterparts. Because transactions are represented by sets of records with a common transaction identifier, sets of transactions will correspond to all records for which the transaction identifier is a member of the set of transactions. All records with the same transaction identifier will belong to the same transaction. Therefore, a record belongs to the set of records that support a particular item A if any of the records with the same transaction identifier holds item A. In order to produce a pCORE containing all transactions supporting item A, not only the records that hold item A, but also all records with related transaction identifiers will have to be included.

The transactional histogram primitive uses a similar grouping strategy. The primitive produces a list of counts, one for each item. A count for item A represents the number of transactions that contain A. This is a different operation from a regular histogram call on the item attribute, because duplicates within transactions can be ignored. A transaction that contains item A twice may be counted only once, depending on whether the association analysis wants to consider duplicate items or not. Note that the propositional representation, contrary to the transactional representation, does not allow duplicates.

5. Future Plans

The first release of DMS, as described in this paper, is currently in use. Future releases intend to add features such as:

• Enabling user defined data sources. Currently the range of data sources from which DMS can extract data is fairly limited, although the ability to read text files means that any data source which can dump data to a text file can load its data into DMS. An interface will be provided such that the user can supply and link in their own simple primitives for accessing data from their data source.
• Additional aggregate functions will be provided.
• Extensions to association analysis. In particular, sequence analysis will be supported, where not only the co-occurrence of items in a set is analyzed, but also the order in which they occur.
• Update, delete and append functionality for COREs; as a user's data changes, it would be useful to feed the changes through to DMS without necessitating a full reload of the data.

The Syllogic DMT/MP will be extended to use some of the existing features of DMS in new algorithms, as well as supporting the newer features. Extensions will include:

• Taxonomies over items in transactions. The existing framework for discovering association rules can be easily extended to incorporate taxonomies. Especially the zooming capabilities of pCOREs provide the ideal basis for an efficient implementation.
• Regression trees. Regression trees can be easily implemented analogously to regular decision trees. Instead of using a cross-table of counts, aggregate functions such as average and sum can be used.

6. Summary

In conclusion, we have presented a parallel data mining architecture that has the following features:

• Scalable.
• Capable of handling massive data sets.
• Fast (even on one processor, if adequate memory is provided).
• Portable; most modules are system independent.
• Extensible.
• Simple to use.

We have seen how the data mining primitives provided can be used by common data mining algorithms and the resulting performance achieved. Further extensions to DMS, as discussed in section 5, should significantly broaden its applicability in the future.

References

[1] Adriaans, P.W. and Zantinge, R. Data Mining, Addison-Wesley, 1996.
[2] Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A. Fast Discovery of Association Rules, in [5].
[3] Blockeel, H., De Raedt, L. Relational Knowledge Discovery in Databases, Proceedings BENELEARN-96.
[4] Chattratichat, J. et al. Large Scale Data Mining: Challenges and Responses, Proceedings KDD '97.
[5] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. Advances in Knowledge Discovery and Data Mining, AAAI Press/MIT Press, 1996.
[6] Freitas, A.A., Lavington, S.H. Using SQL-primitives and parallel DB servers to speed up knowledge discovery in large relational databases, Proceedings EMCSR '96, 1996.
[7] Galal, G., Cook, D.J., Holder, L.B. Improving Scalability in a Scientific Discovery System by Exploiting Parallelism, Proceedings KDD '97.
[8] George, F., Green, R., Robertson, I., Stones, T. InfoCharger User Guide, available from the authors.
[9] George, F., Green, R., Robertson, I., Stones, T. InfoCharger User Reference Manual, available from the authors.
[10] Holsheimer, M., Kersten, M., Mannila, H., Toivonen, H. A Perspective on Databases and Data Mining, Proceedings KDD '95.
[11] Holsheimer, M., Kersten, M., Siebes, A. Data Surveyor: Searching the Nuggets in Parallel, in [5].
[12] John, G.H., Lent, B. SIPping from the Data Firehose, Proceedings KDD '97.
[13] Kamber, M., Han, J., Chiang, J.Y. Metarule-Guided Mining of Multi-Dimensional Association Rules Using Data Cubes, Proceedings KDD '97.
[14] Knobbe, A.J., Adriaans, P.W. Discovering Foreign Key Relations in Relational Databases, Proceedings EMCSR '96, 1996.
[15] Knobbe, A.J., Adriaans, P.W. Analysing Binary Associations, Proceedings KDD '96.
[16] Mannila, H., Toivonen, H. On an algorithm for finding all interesting sentences, Proceedings EMCSR '96, 1996.
[17] Provost, F., Kolluri, V. Scaling Up Inductive Algorithms: An Overview, Proceedings KDD '97.
[18] Quinlan, J.R. C4.5: Programs for Machine Learning, Morgan Kaufmann, 1992.
[19] Siebes, A. Data Surveying: Foundations of an Inductive Query Language, Proceedings KDD '95.
[20] Siebes, A., Kersten, M.L. KESO: Minimizing Database Interaction, Proceedings KDD '97.
[21] Srikant, R., Agrawal, R. Mining Generalized Association Rules, Proceedings VLDB '95.
[22] Toivonen, H. Discovery of frequent patterns in large data collections, PhD Thesis, 1996.

[20] Siebes, A., Kersten, M.L. KESO: Minimizing Database Interaction, Proceedings KDD ’97. [21] Srikant, R., Agrawal, R. Mining Generalized Association Rules, Proceedings VLDB ’95. [22] Toivonen, H. Discovery of frequent patterns in large data collections, PhD Thesis, 1996.