INTEGRATING ASSOCIATION RULE MINING ALGORITHMS WITH RELATIONAL DATABASE SYSTEMS

Jochen Hipp
DaimlerChrysler AG, Research & Technology, Ulm, Germany
Wilhelm-Schickard-Institute, University of Tübingen, Germany
Email: [email protected]

Ulrich Güntzer
Wilhelm-Schickard-Institute, University of Tübingen, Germany
Email: [email protected]

Udo Grimmer
DaimlerChrysler AG, Research & Technology, Ulm, Germany
Email: [email protected]

Key words:

Data mining, Association rules, Relational database systems

Abstract:

Mining for association rules is one of the fundamental data mining methods. In this paper we describe how to efficiently integrate association rule mining algorithms with relational database systems. From our point of view, direct access of the algorithms to the database system is a basic requirement when transferring data mining technology into daily operation. This is especially true in the context of large data warehouses, where exporting the mining data and preparing it outside the database system becomes annoying or even infeasible. The development of our own approach is mainly motivated by shortcomings of current solutions. We investigate the most challenging problems by contrasting the prototypical but somewhat academic association mining scenario from basket analysis with a real-world application. We thoroughly compile the requirements arising from mining an operative data warehouse at DaimlerChrysler. We generalize the requirements and address them by developing our own approach. We explain its basic design and give the details behind our implementation. Based on this warehouse, we evaluate our own approach together with commercial mining solutions. It turns out that, regarding runtime and scalability, we clearly outperform the commercial tools accessible to us. More importantly, our new approach supports mining tasks that are not directly addressable by commercial mining solutions.

1 INTRODUCTION

With the growing amount of data stored in databases and data warehouses, the need to develop new analysis methods especially for very large collections of data became evident. Recent approaches, subsumed under data mining or knowledge discovery in databases (KDD), allow analyzing even very large databases.

1.1 Mining for Association Rules

One of the fundamental methods from the prospering field of KDD is the generation of association rules, as introduced in (Agrawal et al., 1993).

Generally speaking, an association rule is an "implication" X → Y, where X and Y are disjoint sets of items. The meaning of such rules is quite intuitive: Let D be a database of transactions, where each transaction T ∈ D is a set of items. An association rule X → Y then expresses "Whenever a transaction T contains X, then this transaction T also contains Y with probability c". The probability c is called the rule confidence and is supplemented by further quality measures like rule support and interest, c.f. (Agrawal and Srikant, 1994; Brin et al., 1997a).
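In the usual notation, support and confidence of a rule X → Y with respect to D are defined as

    \mathrm{supp}(X \rightarrow Y) \;=\; \frac{|\{\,T \in D : X \cup Y \subseteq T\,\}|}{|D|},
    \qquad
    \mathrm{conf}(X \rightarrow Y) \;=\; \frac{\mathrm{supp}(X \rightarrow Y)}{\mathrm{supp}(X)},

where supp(X) denotes the fraction of transactions containing X. The confidence is exactly the conditional probability c above.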

The prototypical application of association rules is the analysis of customer data from retail. All items bought together by a customer during one visit of a store form a so-called customer transaction. (Throughout the whole paper we will use the term transaction in the sense of such customer transactions and not in the sense of concurrency control.) The rules generated from such transaction databases are of the following type: "Customers who buy products x1, ..., xn also buy product y with probability c". But association rules are not restricted to dependency analysis in the context of retail applications. In fact, they are easily adapted to a wide range of business problems. Accordingly, a great deal of attention has been paid to the algorithms for association rule generation, c.f. (Agrawal et al., 1993; Houtsma and Swami, 1993; Agrawal and Srikant, 1994; Savasere et al., 1995; Zaki et al., 1997; Brin et al., 1997b; Han et al., 2000; Hipp et al., 2000b). A general survey and comparison can be found in (Hipp et al., 2000a). There are many extensions of the basic framework, e.g. generalized association rules, c.f. (Srikant and Agrawal, 1995; Hipp et al., 1998), quantitative association rules, c.f. (Srikant and Agrawal, 1996a; Fukuda et al., 1996), or sequential patterns, c.f. (Agrawal and Srikant, 1995; Srikant and Agrawal, 1996b; Mannila et al., 1997).

The main problem when generating association rules is the algorithmic complexity. In brief, the algorithms have to deal with a search space that grows exponentially with the number of possible items. Nevertheless, today's algorithms offer acceptable performance by restricting the generated rules through minimal thresholds on the rule quality measures. Most of the algorithms scale linearly with the number of transactions to be analyzed. Moreover, some algorithms, like Apriori (Agrawal and Srikant, 1994), impose practically no limits on the size of the data.
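To make the role of the minimum support threshold concrete, the following sketch shows the downward-closure pruning applied by Apriori-style algorithms when generating candidate itemsets. It is a minimal C++ illustration with assumed data structures, not code from the implementations discussed later.

    // Apriori exploits the fact that every subset of a frequent itemset must
    // itself be frequent: a candidate k-itemset can be discarded without
    // counting its support if any of its (k-1)-subsets is not frequent.
    #include <cstddef>
    #include <set>
    #include <vector>

    using Itemset = std::vector<int>;   // item codes, kept sorted

    bool survives_pruning(const Itemset& candidate,
                          const std::set<Itemset>& frequent_k_minus_1) {
        for (std::size_t skip = 0; skip < candidate.size(); ++skip) {
            Itemset subset;
            subset.reserve(candidate.size() - 1);
            for (std::size_t j = 0; j < candidate.size(); ++j)
                if (j != skip) subset.push_back(candidate[j]);
            if (frequent_k_minus_1.count(subset) == 0)
                return false;   // an infrequent subset rules out the candidate
        }
        return true;
    }

A candidate that fails this test cannot be frequent and is never counted against the database, which is what keeps the exponential search space manageable in practice.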

1.2 Integration with Relational Database Systems

Researchers normally rely on flat files when implementing prototypes of their mining algorithms. This is a good choice when developing and evaluating new algorithms and actually is not a serious disadvantage in the context of research and pilot projects. Nevertheless, this approach is no longer satisfying when transferring data mining technology into daily operation. In practice, data is rarely stored in flat files but normally resides in a database management system. As a matter of fact, this data is not directly accessible to the mining algorithms. Exporting huge parts of the data from a database turns out to be rather annoying and error prone. It is very time consuming, demands huge amounts of temporary disk space, and includes the often complicated task of transferring data into proprietary formats accepted by the algorithms. Moreover, we must keep in mind that knowledge discovery in databases is understood as a highly interactive and iterative process, c.f. (Fayyad et al., 1996; Brachman and Anand, 1996; Wirth and Hipp, 2000). Whenever we need to return to previous phases of the process or whenever the data in the database changes, we need to export and prepare the data again, c.f. (Hipp and Lindner, 1999) for some of our experiences.

As a consequence, instead of bringing the data to the algorithms, we decided to bring the algorithms to the data, that is, to the database system. Our experiences with several leading commercial data mining tools were rather discouraging. Although database access through ODBC or comparable interfaces is quite common, all tools that we considered came up with an integration of the association rule generating facilities with the relational database system that was far from satisfying our needs. The problems we encountered can be summarized as follows:

- Annoying run times that cannot be explained by the underlying mining algorithms.
- Inflexibility of the tools concerning different encodings of transactions (again in the sense of customer transactions, as introduced above) as relational tables.
- Inability of the tools to deal with transactions that are distributed over multiple relations.

The shortcomings of the commercial tools, both regarding performance and applicability to practical data mining tasks, motivated our own approach that is presented in this paper.

1.3 Related Work and Outline of the Paper

There have been several publications on integrating association rule algorithms with relational database systems during the last years. In (Agrawal and Shim, 1996; Sarawagi et al., 1998) the aspects of integration are discussed, but in a purely technical way. That is, alternatives to speed up the database access of the algorithms are evaluated. We base our own implementation on the valuable results from these publications, but we do not find answers to our questions concerning the mapping from relations to transactions in (Agrawal and Shim, 1996; Sarawagi et al., 1998). (Han et al., 1996) and (Meo et al., 1996) do not focus on aspects of actual realization. Instead, both present extended versions of SQL that directly support data mining. Whereas these approaches conceptually incorporate multiple rows and different table encodings, details that go beyond the formal definition of the language extensions are missing. In contrast, the goal of our own approach is to end up with an actual "piece of software" that efficiently integrates existing association rule algorithm implementations with a commercial database management system. We explicitly place the aspect of applicability to real-world projects in the foreground of our work.

Our paper is organized as follows: In Section 2, we identify demands arising from applying association rules. For that purpose we contrast the somewhat academic retail example with an operative database at DaimlerChrysler. We give some details on this database, which is also the basis for the evaluation of our implementation. Next, we present some typical mining scenarios as illustrative examples in order to show the shortcomings of today's approaches. Finally, we conclude with the result of our investigation: we present a list of requirements that, from our point of view, turned out to be inevitable for a successful integration of association rule algorithms with relational database systems. In Section 3, we present the idea behind our own realization that transfers the results of Section 2 into practice. We give reasons for our design decisions and explain the final concept. In brief, we do not aim at universally integrating data mining into the database system but treat the algorithms as just another application running on the database. We leave data intensive tasks like joining tables to the database system, where such tasks are in good hands. We finally give the details of our implementation that connects our own C++ implementations of association rule algorithms with DB2 Universal Database from IBM (Chamberlin, 1998). In Section 4, we evaluate the resulting approach based on the database introduced in Section 2. We compare our own implementation with commercial data mining tools, both regarding runtime and scalability. The results concerning performance are rather encouraging. More importantly, our new approach supports mining tasks that are not directly addressable by commercial mining solutions. Finally, in Section 5, we conclude with a short summary.

2 REQUIREMENTS FROM REAL-WORLD APPLICATION

The prototypical application of association rule mining is the analysis of customer transactions in a retail environment. Whereas this example application is well suited for tutorial purposes, the employment of association rules typically goes far beyond it. In this section we analyze such an "advanced" mining scenario based on QUIS (QUality Information System) at DaimlerChrysler. As we will see, there are several general implications from this special case for the efficient integration of the rule generation algorithms with relational databases.

2.1 The QUIS Database in Contrast to Typical Retail Data

Figure 1 shows the common relational representation of retail data, c.f. (Agrawal and Shim, 1996; Sarawagi et al., 1998). Each transaction and each item is uniquely identified by an integer. The two-column representation is implied by the set characteristic of the transactions. That is, the number of items typically contained in a transaction may vary largely. Moreover, a maximal transaction size may not be determined in advance.

    tid   | item_id
    ------+--------
    ...   | ...
    1147  | 4553
    1147  | 5596
    1148  | 8933
    1148  | 12780
    1148  | 27453
    ...   | ...

    Figure 1: Relational representation of retail data

In the late eighties QUIS was implemented at former Daimler-Benz. This huge database contains detailed information on each manufactured Mercedes-Benz vehicle, both technical production details and information on the complete history of garage visits during the warranty period. (DaimlerChrysler service partners are obliged to submit information on garage visits related to warranty or goodwill cases.) Today, information on more than 10 million vehicles is stored. Figure 2 shows a simplified extract from the entity-relationship model of QUIS. In addition, example tables are shown in Figure 3 and Figure 4.

There are three obvious differences compared to the retail situation: First of all, information is distributed over several tables. Instead of a single table that contains the transactions, we have separate tables. For example, there is one table describing the vehicles themselves, another giving information on garage visits, and yet another table containing details on the special equipment of vehicles.

[Figure 2: QUIS: simplified extract of the entity-relationship model. Entities: vehicle, special equipment, engine model, production plant, garage visit, garage, and claim, connected via the relationship tables ve_speq, ve_gv, ve_en_pl, gv_ga, and gv_cl.]
Moreover, there are tables that do not correspond to an entity but describe many-to-many relationships between entities, e.g. between vehicles and special equipment. Such a distribution of information is a natural consequence of the normalization paradigms of database design, e.g. (Codd, 1972). Second, only few of the tables contain sets of arbitrary size, e.g. table ve_speq in Figure 4, which directly corresponds to the canonical representation of retail transactions given in Figure 1. In most of the tables, e.g. table vehicle in Figure 3, each entity is described by exactly one row. Third, a database like QUIS typically does not contain integer identifiers. That is, most columns contain nominal values, like "W220" or "AirCondition", that need to be mapped to integers before the algorithms can process them.

    vehicle_id | model_type | type | prod_date   | ...
    -----------+------------+------+-------------+----
    ...        | ...        | ...  | ...         |
    A23943CI   | W202       | R34  | Jan 10 2000 | ...
    C33467CI   | W202       | R56  | Feb 12 2000 | ...
    C45940CI   | W220       | R56  | Feb 12 2000 | ...
    C59612CI   | W220       | R56  | Feb 12 2000 | ...
    C59613CI   | W220       | R56  | Feb 12 2000 | ...
    ...        | ...        | ...  | ...         |

    Figure 3: QUIS: example table vehicle

    vehicle_id | equip_id
    -----------+---------
    ...        | ...
    A23943CI   | A48543
    A23943CI   | A59032
    C33467CI   | B89494
    C33467CI   | A59032
    C33467CI   | B89569
    C45940CI   | A48543
    ...        | ...

    Figure 4: QUIS: example table ve_speq

2.2 Typical Mining Scenario

In the case of mining retail transactions the situation is rather straightforward. The basic task is to generate all association rules that satisfy certain thresholds on the quality measures. The only variants are restrictions on the items to be considered and selections of appropriate transaction subsets. When mining a database like QUIS, the situation becomes more complex. Indeed, our mining queries are no longer restricted to a single table. For example, we may be interested in dependencies between the special equipment and the garage visits of vehicles. Moreover, we no longer agree upon a single common entity as the reference. In brief, the mining of dependencies between attributes of vehicles is just one of many possible mining scenarios. For example, questions like "Are there any dependencies between the characteristics of garages and the warranty claims they submit?" may be answered by taking the garages as the reference entity. Summing up, we want to emphasize that the retail scenario is just a special case of a much more general setting.

2.3 Implied Properties

Based on the insights from above we conclude with the following list of properties:

1. Often the volume of data is rather huge. Naively exporting interesting subsets to the file system would result in tremendous overhead concerning execution time and disk space. Consequently, the algorithms need direct access to the database system.

2. We need to access both tables representing transactions by a varying number of (tid, item_id) pairs, as in the retail example, and tables where each row describes a separate transaction.

3. In general, data mining is just an additional application added to an already introduced database. Consequently, the integration of association rule algorithms should not require any changes to the underlying database.

4. We need to efficiently access data from different tables and convert the collected attribute values into transactions. That is, for each entity, we have to look up the appropriate attribute values in the database and enumerate them as a set.

5. The underlying algorithms work on integer values rather than on higher data types like strings. (Of course, the mining of association rules is restricted to symbolic data types.) The mapping between data types should happen transparently to the user.

6. The reference entity type is no longer predefined. For example, when mining QUIS data we might choose vehicles or garages as the underlying reference.

3 IDEA AND IMPLEMENTATION

In this section we describe the basic idea behind our integration of association rule algorithms and a relational database system. We give reasons for our decisions and present our final implementation. Our prototype meets the above requirements and connects the association rule algorithm implementations existing at DaimlerChrysler with DB2 Universal Database from IBM (Chamberlin, 1998).

3.1 Technical Prerequisites

For our experiments we replicated the QUIS database at our research department. The DB2 database system is running on a Sun UltraSPARC-II workstation under Solaris 2.7, equipped with 512 megabytes of main memory and clocked at 440 MHz. The association rule algorithms are implemented in C++ and for compiling we employ the gcc compiler, c.f. (FSF, 2001). The implementations have already proved their efficiency in several projects at DaimlerChrysler.

3.2 Basic Idea

As explained above, current association rule algorithms expect the data to be prepared as transactions. Building such transactions may require connecting information pieces from all over the database. Instead of enhancing our algorithms with such a collection functionality, we decided to pass this task to the database engine. This decision is quite natural because the join operations provided by the database system are already available and incorporate the experience of more than two decades of database research. We want to achieve a general solution that is not tied to any specific database design or schema. As a matter of fact, it is up to the user to supply an appropriate SQL-clause to select and join all information he wants to be incorporated in the analysis. Normally, figuring out this SQL-clause is rather straightforward. In Figure 5 we give an example of a more advanced query. The analysis aims at finding dependencies between the special equipment, the model type, and the engine type installed in vehicles during production year 2000. For this purpose, data from several tables must be collected, c.f. Figure 2.

    select ve.vehicle_id, ve.model_type, sp.equipment_name, en.engine_type
    from   vehicle as ve, ve_speq as vesp, special_equipment as sp,
           ve_en_pl as veen, engine_model as en
    where  ve.prod_year = 2000
      and  ve.vehicle_id = vesp.vehicle_id
      and  vesp.equip_id = sp.equip_id
      and  ve.vehicle_id = veen.vehicle_id
      and  en.engine_id = veen.engine_id

    Figure 5: Example query

In general, the resulting database table will contain a great amount of redundant information. This is because joining tables undoes the effects of database normalization, e.g. (Codd, 1972). Fortunately, we only read from the resulting table and therefore do not have to care about update or insert anomalies. The table resulting from the example SQL-clause might look like the one shown in Figure 6. It is important to note that none of the mining tools we looked at was able to deal directly with such a table. That is, the advanced query from above cannot be directly addressed by the commercial tools. We sorted this table according to the key attribute vehicle_id, which uniquely identifies every vehicle stored in the database. As a result, each vehicle is entirely described by consecutive rows. We have both attributes that have a single value for each entity, e.g. engine_type, and variable-length sets of attributes, e.g. equipment_name.

    vehicle_id | model_type | equipment_name | engine_type | ...
    -----------+------------+----------------+-------------+----
    ...        | ...        | ...            | ...         |
    A23943CI   | W202       | AirCond.       | D           | ...
    A23943CI   | W202       | 2nd Airbag     | D           | ...
    C33467CI   | W202       | Clutch         | P           | ...
    C33467CI   | W202       | 2nd Airbag     | P           | ...
    C33467CI   | W202       | Hardtop        | P           | ...
    C45940CI   | W220       | AirCond.       | P           | ...
    ...        | ...        | ...            | ...         |

    Figure 6: Example of a table that results from a user's SQL-clause

Transactions are derived from this table as follows: All rows are passed consecutively. As long as the value of the key attribute does not change, we add the values of all other attributes of the row to the current transaction. Of course, duplicates have to be eliminated. Whenever the value of the key attribute changes, we know that the current transaction is completed and we start with the next transaction.

With this approach, choosing the key attribute decides which entity from the database is the reference entity. For example, in the table from Figure 6, choosing model_type as the key attribute would make the model type the reference entity.
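The grouping step just described can be pictured as in the following sketch. It is merely an illustration with assumed, simplified data structures; the actual implementation reads the rows through DB2 CLI and writes the transactions into a binary cache file, as described in Section 3.3.

    // Sketch of the transaction building step (assumed, simplified types):
    // rows arrive sorted by the key attribute, consecutive rows with the same
    // key are merged into one transaction, duplicates are dropped and the
    // items are kept sorted.
    #include <set>
    #include <string>
    #include <vector>

    struct Row {
        std::string key;               // e.g. vehicle_id
        std::vector<int> item_codes;   // remaining attribute values, already encoded as integers
    };

    std::vector<std::vector<int>> build_transactions(const std::vector<Row>& rows) {
        std::vector<std::vector<int>> transactions;
        std::set<int> current;         // a set eliminates duplicates and keeps items sorted
        std::string current_key;
        for (const Row& row : rows) {
            if (row.key != current_key && !current.empty()) {
                transactions.emplace_back(current.begin(), current.end());
                current.clear();
            }
            current_key = row.key;
            current.insert(row.item_codes.begin(), row.item_codes.end());
        }
        if (!current.empty())
            transactions.emplace_back(current.begin(), current.end());
        return transactions;
    }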

3.3 Actual Implementation

The first decision is whether the algorithms should be moved into the database system or should be kept separate. From the several variants between these extremes, the so-called cache-mine approach presented in (Sarawagi et al., 1998) is one of the most promising concerning performance. In brief, association rule algorithms like Apriori make several passes over the whole database during rule generation. With cache-mine, the database is accessed only once and all following passes refer to a look-aside buffer on the local file system. Alternatively, in (Sarawagi et al., 1998) parts of the algorithms are expressed in SQL and run directly on the database engine. This latter approach would have required profound changes to our existing algorithm implementations and, furthermore, according to (Sarawagi et al., 1998), does not imply serious performance gains.

For our own implementation we adapted the cache-mine approach in the following way: Instead of relying on an embedded SQL cursor, we employed the CLI interface, c.f. (Chamberlin, 1998), provided by DB2 Universal Database. In contrast to embedded SQL, which needs to install a so-called package in the database before accessing its tables, a CLI-based implementation only requires read access to the tables containing the mining data. The SQL-clause to join and select the appropriate data is passed to the algorithm by the user and executed via CLI on the database engine. The resulting table is transformed into a compact binary representation. As described above, we pass through the rows of the table and whenever the key value changes we start generating a new transaction. This presumes that each entity is described by consecutive rows, which is ensured by an additional "order by" on the key attribute. After that, each attribute value is transformed into a character string with the column name as its prefix. Adding the column name is necessary because attribute values may occur in several columns, e.g. "black". This string is then mapped to an integer. For this purpose we employed the hashing functionality provided by the standard template library coming with the gcc compiler (FSF, 2001). Whenever a transaction is complete and duplicates are eliminated, the integer array [item_1, ..., item_size] is sorted. Then, together with its size, we append it to the cache file on the local disk. In the subsequent passes the mining algorithm reads chunks of several megabytes of the file directly into main memory through a "raw read". We implemented our algorithms in such a way that no further preparation is necessary. That is, the algorithms process the condensed transactions "as they are" through pointers to the appropriate memory segments.

In addition to this basic approach, we implemented an optimized version that executes the extraction step as a stored procedure, c.f. (Chamberlin, 1998). Stored procedures are executed as sub-processes of the database management system. The expected benefits are fewer context switches and reduced data transfer. The drawback is the need to "install" the stored procedure in the database. This requires additional database access privileges. Furthermore, running unfenced stored procedures, that is, code running in the address space of the database management system, may become a security issue. Whereas this is not a severe problem on our replicated database, running unfenced code is infeasible on the operative QUIS database.
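The item encoding described above can be sketched as follows. The class name and the use of std::unordered_map are our own illustrative choices (the original relies on the STL hashing facilities shipped with gcc); the sketch only shows the column-prefixing and the string-to-integer mapping.

    // Sketch of the item encoding: each attribute value is prefixed with its
    // column name (a value such as "black" may occur in several columns) and
    // mapped to a small integer item code that the mining algorithms work on.
    #include <string>
    #include <unordered_map>

    class ItemEncoder {
    public:
        int encode(const std::string& column, const std::string& value) {
            const std::string key = column + "=" + value;   // e.g. "equipment_name=AirCond."
            auto it = codes_.find(key);
            if (it != codes_.end())
                return it->second;                          // value seen before
            const int code = static_cast<int>(codes_.size());
            codes_.emplace(key, code);                      // assign next free integer
            return code;
        }

    private:
        std::unordered_map<std::string, int> codes_;
    };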

4 EVALUATION

For our evaluation we employed the system described above, based on a Sun workstation and the DB2 database. We decided to analyze dependencies between the special equipments installed in vehicles. That is, we mine association rules from the table ve_speq (Figure 4) of the QUIS database. The key attribute is vehicle_id and the items are taken from column equip_id. There is a variety of more than 10,000 equipment codes and on average about 20 are installed per vehicle. The total size of the table is approximately 200 million rows. In addition, we restricted the maximal rule size to three and set the minimum support threshold to 2%.

Out of the mining tools at our hands, only two were running on the Solaris machine. From our own implementations we chose Apriori. As mentioned earlier, this algorithm imposes practically no limits on the size of the data. One of the commercial solutions already collapsed at several hundred thousand rows. This tool internally encodes the transactions as bit vectors. With an average transaction size of 20 out of 10,000 equipment codes, the resulting bit vectors are rather sparse. The tool does not seem to take this sparseness into account and consequently runs into tremendous memory and runtime problems. This effect is even strengthened by keeping the complete dataset in main memory during analysis.
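A rough back-of-the-envelope estimate, based only on the figures quoted above (our own, not taken from the tool's documentation), illustrates why such a dense encoding cannot scale to the full dataset:

    % roughly 200 million rows at about 20 items per vehicle
    \frac{2 \cdot 10^{8}\ \text{rows}}{20\ \text{items/vehicle}} \;\approx\; 10^{7}\ \text{transactions}

    \text{dense bit vectors:}\quad 10^{7} \cdot 10^{4}\ \text{bit} \;\approx\; 12.5\ \text{GB}
    \qquad
    \text{sparse integer lists:}\quad 10^{7} \cdot 20 \cdot 4\ \text{byte} \;\approx\; 0.8\ \text{GB}

With only about 20 of more than 10,000 possible equipment codes set per transaction (roughly 0.2%), dense bit vectors waste almost all of their space; holding them for the complete dataset in the 512 megabytes of main memory of the test machine is clearly out of reach, whereas the sparse integer representation written to our cache file stays below one gigabyte and is read from disk in chunks anyway.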

The other commercial tool and our own implementation performed reasonably on the data. In Figure 7 the results of our evaluation are shown. We varied the number of considered transactions, given in millions, and show the corresponding run times in minutes.

[Figure 7: Evaluation on table special equipment. Runtime in minutes (0 to 350) plotted over the number of rows in millions (1 to 200) for the commercial tool and the new approach.]

As expected, c.f. (Hipp et al., 2000a), both implementations scale linearly with the number of rows to be analyzed. Our own approach clearly outperforms the commercial mining tool. Surprisingly, there was no significant difference between the run times of the completely CLI-based implementation and the one employing a stored procedure to extract the data. The explanation is that the effort to select the data from the database and to prepare the binary transactions dominates the extraction process.

In addition, the following is worth noting: Our performance comparison is by no means exhaustive. It only shows that our implementation is able to compete with commercial solutions. The main benefit of our new approach is its ability to support mining tasks that are not directly addressable by the considered mining tools. Furthermore, we were not able to analyze the internals of the commercial mining software. But from our own code we learned that the rule generation process is highly dominated by the effort for extracting and preparing the transactions. Only about 10% of the run time is spent on the rule generation itself. In (Sarawagi et al., 1998) similar experiences are described.

5 SUMMARY

In this paper we described a new approach to efficiently integrate association rule mining algorithms with relational database systems. By presenting QUIS, an operative data warehouse at DaimlerChrysler, we explained the shortcomings of today's solutions. We gave examples of "advanced" mining scenarios that originate from transferring association mining from the somewhat academic retail example to a practical mining scenario at DaimlerChrysler. We then concluded with a list of requirements for a satisfying integration of the mining algorithms with a database system. Based on these requirements we developed our new approach. We thoroughly explained the basic ideas behind it and gave the details of the implementation. By carrying out experiments on the QUIS database, we showed the superiority of our own approach compared with examples of commercial tools regarding performance. Finally, we want to emphasize that we do not see the main benefit of our work in performance improvements. More importantly, our new approach supports mining tasks that are typical for real-world applications but not directly addressable by commercial mining solutions.



REFERENCES

Agrawal, R., Imielinski, T., and Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (ACM SIGMOD '93), pages 207-216, Washington, USA.

Agrawal, R. and Shim, K. (1996). Developing tightly-coupled data mining applications on a relational database system. In Proceedings of the 2nd International Conference on Knowledge Discovery in Databases and Data Mining (KDD '96), Portland, Oregon, USA.

Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Databases (VLDB '94), Santiago, Chile.

Agrawal, R. and Srikant, R. (1995). Mining sequential patterns. In Proceedings of the International Conference on Data Engineering (ICDE), Taipei, Taiwan.

Brachman, R. J. and Anand, T. (1996). The process of knowledge discovery in databases. In Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., editors, Advances in Knowledge Discovery and Data Mining, chapter 2, pages 37-57. AAAI/MIT Press.

Brin, S., Motwani, R., and Silverstein, C. (1997a). Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the ACM SIGMOD International Conference on Management of Data (ACM SIGMOD '97), pages 265-276.

Brin, S., Motwani, R., Ullman, J. D., and Tsur, S. (1997b). Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data (ACM SIGMOD '97), pages 265-276.

Chamberlin, D. D. (1998). A complete guide to DB2 universal database. Morgan Kaufmann Publishers, San Francisco.

Codd, E. F. (1972). Further normalization of the data base relational model. Data Base Systems, Courant Computer Science Symposia Series, 6.

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27-34.

FSF (2001). The gcc home page. http://gcc.gnu.org.

Fukuda, T., Morimoto, Y., Morishita, S., and Tokuyama, T. (1996). Mining optimized association rules for numeric attributes. In Proceedings of the 15th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS '96), pages 182-191, Montreal, Quebec, Canada.

Han, J., Fu, Y., Wang, W., Koperski, K., and Zaiane, O. (1996). DMQL: A data mining query language for relational databases. In SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD '96), Montreal, Canada.

Han, J., Pei, J., and Yin, Y. (2000). Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA.

Hipp, J., Güntzer, U., and Nakhaeizadeh, G. (2000a). Algorithms for association rule mining - a general survey and comparison. SIGKDD Explorations, 2(1):58-64.

Hipp, J., Güntzer, U., and Nakhaeizadeh, G. (2000b). Mining association rules: Deriving a superior algorithm by analysing today's approaches. In Proceedings of the 4th European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD '00), pages 159-168, Lyon, France.

Hipp, J. and Lindner, G. (1999). Analysing warranty claims of automobiles: An application description following the CRISP-DM data mining process. In Proceedings of the 5th International Computer Science Conference (ICSC '99), pages 31-40, Hong Kong, China.

Hipp, J., Myka, A., Wirth, R., and Güntzer, U. (1998). A new algorithm for faster mining of generalized association rules. In Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD '98), pages 74-82, Nantes, France.

Houtsma, M. and Swami, A. (1993). Set-oriented mining for association rules in relational databases. Technical Report RJ 9567, IBM Almaden Research Center, San Jose, California.

Mannila, H., Toivonen, H., and Verkamo, I. (1997). Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3):259-289.

Meo, R., Psaila, G., and Ceri, S. (1996). A new SQL-like operator for mining association rules. In Proceedings of the 22nd International Conference on Very Large Databases (VLDB '96), Mumbai (Bombay), India.

Sarawagi, S., Thomas, S., and Agrawal, R. (1998). Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD Record (ACM Special Interest Group on Management of Data), 27(2):343-355.

Savasere, A., Omiecinski, E., and Navathe, S. (1995). An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st Conference on Very Large Databases (VLDB '95), pages 432-444, Zürich, Switzerland.

Srikant, R. and Agrawal, R. (1995). Mining generalized association rules. In Proceedings of the 21st Conference on Very Large Databases (VLDB '95), Zürich, Switzerland.

Srikant, R. and Agrawal, R. (1996a). Mining quantitative association rules in large relational tables. In Proceedings of the 1996 ACM SIGMOD Conference on Management of Data, Montreal, Canada.

Srikant, R. and Agrawal, R. (1996b). Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the 5th International Conference on Extending Database Technology (EDBT '96), Avignon, France.

Wirth, R. and Hipp, J. (2000). CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, pages 29-39, Manchester, UK.

Zaki, M. J., Parthasarathy, S., Ogihara, M., and Li, W. (1997). New algorithms for fast discovery of association rules. In Proceedings of the 3rd International Conference on KDD and Data Mining (KDD '97), Newport Beach, California.
