Efficient Rule Retrieval and Postponed Restrict Operations for Association Rule Mining

Jochen Hipp (1,3), Christoph Mangold (2,3), Ulrich Güntzer (3), and Gholamreza Nakhaeizadeh (1)

(1) DaimlerChrysler AG, Research & Technology, Ulm, Germany
    {jochen.hipp,rheza.nakhaeizadeh}@daimlerchrysler.com
(2) IPVR, University of Stuttgart, Germany
    [email protected]
(3) Wilhelm Schickard-Institute, University of Tübingen, Germany
    [email protected]

Abstract. Knowledge discovery in databases is a complex, iterative, and highly interactive process. When mining for association rules, interactivity is typically smothered by the execution times of the rule generation algorithms. Our approach is to accept a single, possibly expensive run, after which all subsequent mining queries are answered interactively by accessing a sophisticated rule cache. There are, however, two critical aspects. First, access to the cache must be efficient and convenient. Therefore we enrich the basic association mining framework with descriptions of items through application dependent attributes. Furthermore, we extend current mining query languages to deal with these attributes through ∃ and ∀ quantifiers. Second, the cache must be prepared to answer a broad variety of queries without rerunning the mining algorithm. A main contribution of this paper is that we show how to postpone restrict operations on the transactions from rule generation to rule retrieval from the cache. That is, without actually rerunning the algorithm, we efficiently construct those rules from the cache that would have been generated if the mining algorithm had been run on only a subset of the transactions. In addition, we describe how we implemented our ideas on a conventional relational database system. We evaluate our prototype concerning response times in a pilot application at DaimlerChrysler. It easily satisfies the demands of interactive data mining.

1 Introduction

1.1 Mining for Association Rules

Association rule mining [1] is one of the fundamental methods for knowledge discovery in databases (KDD). Let the database D be a multiset of transactions where each transaction T ∈ D is a set of items. An association rule A → B expresses that whenever we find a transaction which contains all items a ∈ A, then this transaction is likely to also contain all items b ∈ B. We call A the body and B the head of the rule. The strength and reliability of such rules are expressed by

various rule quality measures [1, 3]. The fraction of transactions T ∈ D containing an itemset A is called the support of A, suppD(A) = |{T ∈ D | A ⊆ T}| / |D|. The rule quality measure support is then defined as suppD(A → B) = suppD(A ∪ B). In addition, the rule confidence is defined as the fraction of transactions containing A that also contain B: confD(A → B) = suppD(A ∪ B) / suppD(A). These measures are typically supplemented by further rule quality measures. One measure that we found very helpful is called lift or interest [3]: liftD(A → B) = confD(A → B) / suppD(B). It expresses to what extent the confidence lies above or below the a priori probability of the rule head.

For example, DaimlerChrysler might consider vehicles as transactions and the attributes of these vehicles as items. Then one might get rules like Mercedes A-Class, AirCondition → BatteryTypeC. The support of this rule might be about 10% and the confidence about 90%. A lift of about 4 would indicate that the body of the rule raises the likelihood of BatteryTypeC clearly over its a priori probability.
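As an illustrative plausibility check with the approximate numbers above: by the definition of lift, a confidence of about 90% together with a lift of about 4 corresponds to an a priori head support of roughly

  suppD(BatteryTypeC) = confD(A → B) / liftD(A → B) ≈ 0.9 / 4 ≈ 0.225,

i.e. BatteryTypeC would occur in roughly 22% of all vehicles, but in about 90% of the A-Class vehicles with air conditioning.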

1.2 Motivation

Obviously the idea behind association rules is easy to grasp. Even non-experts in the field of data analysis directly understand such rules and can employ them for decision support. Unfortunately, in practice the generation of valuable rules turns out to involve much more than simply applying a sophisticated mining algorithm to a dataset. In brief, KDD is by no means a push-button technology but has to be seen as a process that covers several tasks around the actual mining. Although there are different process descriptions, e.g. [2, 4, 20], KDD is always understood as complex, incremental, and highly iterative. The analyst never walks strictly through the pre-processing tasks, mines the data, and then analyzes and deploys the results. Rather, the whole process has a cyclic character: we often need to iterate and to repeat tasks in order to improve the overall result. It is the human in the loop who, on the basis of current results, decides when to return to previous tasks or proceed with the next steps. In the end, it is the analyst's creativity and experience which determine the success of a KDD project.

When mining for association rules on large datasets, the response times of the algorithms easily range from minutes to hours, even with the fastest hardware and highly optimized algorithms available today [7]. This is problematic because investigating even speculative ideas often requires a rerun of the mining algorithm and possibly of data pre-processing tasks. Yet if every simple and speculative idea means being idle for a few minutes, then analysts will – at least in the long run – hold themselves back instead of diligently trying out whatever pops into their minds. So, creativity and inspiration are smothered by the annoying inefficiencies of the underlying technology.

1.3 Contribution and Outline of this Paper

Imielinski et al. describe the idea of working on associations stored during a previous algorithm run [10, 11]. In Section 2 we take this approach further by explicitly introducing a sophisticated rule cache. Other related work covers the idea of ‘rule browsing’, e.g. [12, 13]. Our idea is to accept a single and possibly expensive algorithm run to fill the cache. After this initial run, all mining queries arising during subsequent iterations through the phases of the KDD process are satisfied directly from the cache without touching the original data. The result is interactivity due to short response times that are actually independent of the size of the underlying dataset.

However, there are two problematic aspects. First, the analyst must be supported adequately when accessing the rule cache. For this purpose we suggest employing mining query languages as described in [5, 11, 14]. We enrich these languages with fundamental extensions: we add language support for attributes that describe single items and for aggregate functions on rules. Furthermore, we add explicit ∃ and ∀ quantifiers on itemsets. Our extensions supplement today's approaches. Second, a KDD process typically starts with the general case and comes down to the specifics in later phases. The rule cache can easily deal with restrictions concerning the items. Focusing the analysis on, e.g., dependencies between special equipments of vehicles is achieved simply by filtering the cached rules. Unfortunately, things get more difficult when restricting the underlying transactions. For example, the analyst might decide to focus on a special vehicle model, e.g. ask which rules would have been generated if only E-Class vehicles were taken into account. Today such a restriction is part of the pre-processing and therefore requires a complete regeneration of the associations in the cache. As one of the main contributions of our paper, we show in Section 3 how to answer even such queries from the cache without rerunning the rule generation algorithm.

In Section 4 we introduce the SMART SKIP system that efficiently implements the described ideas. The cache structure must be able to store a huge number of rules, occasionally up to several hundred thousand. In addition, it must be prepared to answer a broad variety of mining queries efficiently. In fact, we show how to realize such a rule cache on top of a conventional relational database system. The database system stores the rules together with additional information in several relational tables. Moreover, we implement an interpreter that translates our enhanced mining language to standard SQL that is executed directly on the database engine. We demonstrate the efficiency of the resulting system by presenting experiences from a pilot application at DaimlerChrysler. Finally, we conclude with a short summary of our results in Section 5.

2 Interactivity Through Caching and Efficient Retrieval

Current approaches focus mainly on speeding up the algorithms. Although there have been significant advances in this area, the response times achieved do not allow true interactivity [7]. One approach is to introduce constraints on the items and to exploit this restriction of the rule set during rule generation, e.g. [15, 19]. Yet the resulting response times are still far from the needs of the analyst in an interactive environment. In this section we tackle this problem by rule caching and sophisticated rule retrieval.

2.1 Basic Idea

Instead of speedup through restriction, our own approach does exactly the opposite: we accept that running the mining algorithm implies an interruption of the analyst's work. But if there must be a break, then the result of this interruption should at least be as beneficial as possible. In other words, if running the algorithm is inevitable, then the result should answer as many questions as possible. Hence, the goal of our approach is to broaden the result set by adding any item that might make sense and by lowering the thresholds on the quality measures. Typically, response times suffer; but as this is expected, it should not be a severe problem. In extreme cases, running the mining task overnight is a straightforward and acceptable solution.

In general, the number of rules generated is overwhelming and, of course, the result set is full of noise, trivial rules, or otherwise uninteresting associations. Simply presenting all rules would hardly make sense, because this would surely overtax the analyst. Our approach is to store all the generated rules in an appropriate cache and to give the analyst highly sophisticated access to this cache. The goal is to satisfy as many of the mining queries as possible directly from the cache, so that a further mining pass is only needed as an exception. Once the cache is filled, answering mining queries means retrieving the appropriate rules from the cache instead of mining rules from the data. Now interactivity is no longer a problem, because mining queries can be answered quickly without notable response times.

We want to point out that we distinguish strictly between running mining algorithms and retrieving rules from the cache. In other words, we want analysts to be aware of what exactly they are doing: they should either start a mining run explicitly or query the cache. Otherwise, accidentally causing a rerun of the rule generation by submitting a query that cannot be satisfied from the cache could interrupt the analysis for hours. Explicitly starting a rerun makes analysts think twice about their query and, moreover, gives them control.

2.2 Enhanced Rule Retrieval

The access to the rule cache must be as flexible as possible in order to be useful for a wide range of mining scenarios. The strict separation between rule mining and querying the cache allows us to also separate the retrieval language from the means to specify the rule generation task. For the latter, we refer the reader to [5, 6, 14], where accessing and collecting the underlying transactions is treated exhaustively. The mining language we need is focused on rule retrieval from the cache and is never concerned directly with the mining data itself. Languages that cover this aspect exist [5, 11, 14]. However, for the purpose of demonstrating our new ideas we decided to restrict ourselves to a simple ‘core’ language. We forego a formal language definition but sketch the underlying concept as far as necessary for understanding our ideas. Of course, our enhancements are meant to be integrated into a universal environment, e.g. [5, 11, 14]. We point out that we do not compete with these systems but see our ideas as supplements arising from experiences gained in practical mining projects, e.g. [8, 9].

A query in our simplified retrieval language always consists of the keyword SelectRulesFrom followed by the name of a rule cache and a sophisticated where-clause that filters the retrieved rules. The basic query restricts the rules by thresholds on the quality measures. For example, we may want to retrieve all rules from cache rulecache that have confidence above 75% and lift of at least 10:

SelectRulesFrom rulecache                                      (Query 1)
Where conf > 0.75 and lift >= 10;

Often we want to restrict rules based on the items they do or do not contain. For example, we might be interested in rules that ‘explain’ the existence of a driver airbag, that is, rules containing the item Airbag in the head:

SelectRulesFrom rulecache                                      (Query 2)
Where ‘Airbag’ in head and conf > 0.75 and lift >= 10;

At the same time we know that a co-driver airbag CoAirbag always implies a driver airbag. So by adding “not ‘CoAirbag’ in body” to the where-clause we might exclude all trivial rules containing this item in the body.

We think our ‘core’ language is rather intuitive to use, but up to this point its capabilities do not go beyond today's mining languages. The first feature we missed is aggregate functions on rule quality measures. In fact, it is often appropriate to specify thresholds for the quality measures not as absolute values but relative to the values reached by the generated rules. For example, the following query retrieves rules having relatively low support – less than 1% higher than the lowest support in rulecache – and at the same time having a relatively high confidence – more than 99% of the highest confidence found in the cache. The minimum support and the maximum confidence value of all rules in rulecache are denoted by min(supp) and max(conf) respectively. Other aggregate functions like average are also useful extensions:

SelectRulesFrom rulecache                                      (Query 3)
Where supp < 1.01*min(supp) and conf > 0.99*max(conf);

Association mining algorithms treat items as literals and finally map them to integers. Whereas this restriction makes sense during rule generation, we found it is not satisfying when retrieving rules from a cache. In brief, although items are literals from the rule generation point of view, in practical applications items normally have structure, and we came to the conclusion that rule retrieval can greatly benefit from exploiting this structure. For example, the items in a supermarket all have prices and costs associated with them. Similarly, production dates, costs, manufacturers etc. are assigned to parts of vehicles. Such attributes can be considered through discretization and taxonomies, but the resulting quantitative [18] or generalized [17] rules are typically not what we want.

Formally, we extend the basic framework of association mining as follows: let I ⊆ ℕ × A1 × · · · × Am be a set of items. Each item is uniquely identified by an ID id ∈ ℕ and described by attributes a1, . . ., am ∈ A1 × · · · × Am. For example, one attribute may be the name of the item, another the price of the item, costs associated with it, or other application dependent information. A set of rules is then defined as R ⊆ P(I) × P(I) × ℝ × · · · × ℝ. As usual, in addition to the body and the head (both subsets of I, i.e. elements of the power set P(I)), each rule is rated by a fixed number of real-valued quality measures. Adding structure to the items in this way does not affect the mining procedure but nevertheless introduces a new means to formulate practically important mining queries, as we will see. Let x.attname denote the value of the attribute attname of item x. For example, we may want to select all rules satisfying certain thresholds for support and confidence and ‘explaining’ an item of type special equipment SpEquip which incurs costs above 1000. Such queries can now be expressed through ∃ and ∀ quantifiers on itemsets:

SelectRulesFrom rulecache                                      (Query 4)
Where supp > 0.25 and conf > 0.975
  and exists x in head (x.type = ’SpEquip’ and x.costs > 1000);

A more complex and also very useful query is to find all rules with at least one special equipment in the head that originates from a manufacturer who also manufactures at least one special equipment in the body:

SelectRulesFrom rulecache                                      (Query 5)
Where exists x in head (x.type = ’SpEquip’
    and exists y in body (y.type = ’SpEquip’ and x.manu = y.manu))
  and supp > 0.25 and conf > 0.975;

The quantifiers and attributes on the items and the aggregate functions on the rules are both intuitive to use and flexible. In the hands of the analyst, a mining language enhanced to express queries that consider the structure of the items is a powerful means to efficiently break down the result space. The examples above give a first impression of its potential.
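The ∀ quantifier can be used analogously. For instance, assuming a forall keyword symmetric to exists (an illustrative query beyond the numbered examples above), one could select rules whose body consists exclusively of special equipments and whose confidence is close to the maximum in the cache:

SelectRulesFrom rulecache
Where forall x in body (x.type = ’SpEquip’)
  and conf > 0.9*max(conf);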

3 Postponing Restrict Operations on the Mining Data

The basic problem when caching association rules is the validity of the cache when the underlying data changes. In this section, we explain how to circumvent expensive regeneration in the practically very important case of data restrictions.

3.1 Restricting the Mining Data

Restriction in the sense of relational algebra means selecting a subset of the transactions for rule generation. This pre-processing task is quite common and often essential: for example, an analyst might restrict the transactions to a special vehicle model, because he is only interested in dependencies within this subset of the data. Or he might want to analyze each production year separately in order to compare the dependencies over the years. An example from the retail domain is separate mining runs for each day of the week. For example, after generating

the rules for all baskets, the analyst might decide to have a closer look at the rules that emerge if only baskets from Saturdays are taken into account. The problem with the restrict operation is that the quality measures of the rules change as soon as transactions are removed from the data. For example, a rule may hold with low confidence on the set of all vehicles. When restricting the mining data to a subset, e.g. to the vehicles of the A-Class model, we might find the same rule but with much higher confidence. For instance, a stronger battery type in an A-Class vehicle might imply air conditioning with high confidence. In contrast, for the more luxurious Mercedes E-Class, there are many more reasons to install a stronger battery type. Accordingly, the stronger battery type need not necessarily imply air conditioning with high confidence. When postponing restrict operations from pre-processing to post-processing, the adapted values for the quality measures must be derived from the rules in the cache. In the following we show how to do this for the fundamental quality measures.

3.2 Postponing Restrict Operations to Retrieval

We presume that attributes being employed for restriction of the mining data are contained as items in the transactions already during the initial rule generation. Such items describe the transactions, e.g. production date or vehicle model. They can be seen as pseudo items and must be distinguished from attributes that are attached to items, e.g. costs or manufacturer of a special equipment. Let D′ be a subset of D that is restricted to transactions containing a certain itemset R. The support of an itemset A in D′ can be derived from the support values in D as follows:

  suppD′(A) = suppD(A ∪ R) · |D| / |D′|

Therefore rule quality measures with respect to D′ can be derived from the rule cache generated with respect to D by the following equations:

  suppD′(A → B) = suppD′(A ∪ B) = suppD(A ∪ B ∪ R) · |D| / |D′|

  confD′(A → B) = suppD′(A ∪ B) / suppD′(A) = suppD(A ∪ B ∪ R) / suppD(A ∪ R)

  liftD′(A → B) = suppD′(A ∪ B) / (suppD′(A) · suppD′(B))
                = suppD(A ∪ B ∪ R) / (suppD(A ∪ R) · suppD(B ∪ R)) · |D′| / |D|

To give an illustrative example: let us restrict the mining data to the vehicles of the model E-Class by setting R = {E-Class}. Then the support of the rule AirCond → BatteryTypeC that would have been generated when mining only on the E-Class vehicles can be determined from the rule cache through:

  suppD′(AirCond → BatteryTypeC) = suppD({AirCond, BatteryTypeC, E-Class}) · |D| / |D′|

Further quality measures can be derived similarly. But of course the cache contains only the information to derive rules r with suppD′(r) ≥ minsuppD′, where

  minsuppD′ = |D| / |D′| · minsuppD.

This means, to derive rules at a reasonable threshold minsuppD′, minsuppD must be quite low or R must occur comparably often in the data. A low minsuppD can easily be achieved in our scenario, although drastically lowering this threshold may result in tremendous cache sizes, a potential performance problem concerning rule retrieval, cf. Section 4.2. Fortunately, the other condition turns out to be a minor problem in practical applications. It simply implies that the subset D′ to which we restrict D must be a reasonable ‘portion’ of D. For typical subsets we encountered, e.g. restrictions to a special model, a production year, or a day of the week, this was always the case. For instance, presuming approximately the same sales every day, lowering minsuppD to one seventh of the intended minsuppD′ is a practical guess for restrictions on ‘day of the week’.
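For instance, as an illustrative calculation with assumed numbers: if baskets from Saturdays make up about one seventh of D, i.e. |D′| ≈ |D|/7, and the analyst wants restricted rules down to minsuppD′ = 0.7%, then the cache must have been generated with

  minsuppD ≤ (|D′| / |D|) · minsuppD′ ≈ 0.7% / 7 = 0.1%.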

4 The SMART SKIP System

SMART SKIP is a prototypical implementation of the ideas described in this paper. As a starting point we took a collection of mining algorithms called ART – Association Rule Toolbox – which proved its efficiency in several prior research projects, e.g. [7, 8]. SMART SKIP implements a comfortable platform for interactive association rule mining. Queries can be submitted on a command shell or through a web-browser-based GUI. In addition to the features described in Section 2, queries can also contain statements to start mining runs explicitly or to restrict the mining data as suggested in Section 3.

4.1 Implementation

Instead of implementing specialized data structures to hold the cache, we employ a relational database system for this purpose. We did this for two reasons. First, the mining results are typically generated at high cost; storing them together with the data in the database is natural. Second, the implementation of the mining system benefits directly from the database system. Actually, we translate queries from our mining language to standard SQL and execute these SQL queries directly on the query engine of the database.

Although the generation of association rules from frequent itemsets (sets of items that occur at least with frequency minsupp in the data) is straightforward [1], we store rules instead of frequent itemsets for a good reason: typically, in our scenario there is no longer a significant difference between the number of frequent itemsets to be cached and the number of corresponding rules. We experienced that minsupp is rarely employed in the sense of a quality measure but rather as the only efficient means to reduce the response times of the mining algorithms. The cache releases us from tight runtime restrictions, so we expect the analyst to set

minsupp to relatively low values and filter the numerous resulting rules by strict thresholds on the rest of the measures. We learned that, as a consequence, the number of frequent itemsets and the number of rules become more and more similar. In addition, for such low support thresholds, generating rules from the frequent itemsets can be quite costly. The reason is simply the great number of frequent itemsets, which implies an even greater number of rules that need to be checked against the thresholds on the quality measures during rule generation.

Each rule to be stored consists of body, head and a vector of quality measures. An itemset may occur in several rules as body or head. In order to avoid redundancy, we keep the itemsets in a table separate from the rules and store each itemset only once. This saves a considerable amount of memory; in practical applications we experienced that memory usage decreased by between 50% and more than 90%. Of course, for each new itemset such a storage approach implies checking whether this itemset already exists before adding it to the database. This turned out to be rather inefficient when implemented on the database system, because itemsets are of arbitrary size and therefore are stored as (ID, Item-ID) pairs. Fortunately, our caches are always filled in a single pass with the complete result of one mining algorithm run. That means duplicates can be eliminated efficiently outside the database. The tables implementing the cache are given in Figure 1.

Rules(ID, ItemsetID Body, ItemsetID Head, Measure 1, ..., Measure m)
Itemsets(ID, ItemID)
Items(ID, Name, Att 1, ..., Att n)

Fig. 1. Database tables that implement the rule cache.
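For illustration, a minimal DDL sketch of this layout could look as follows; it is only a sketch assuming three quality measures (supp, conf, lift) and the two item attributes type and costs from the examples in Section 2, and the actual SMART SKIP schema may differ in detail:

CREATE TABLE items (
  id     INTEGER NOT NULL PRIMARY KEY,
  name   VARCHAR(64),
  type   VARCHAR(32),    -- application dependent attribute
  costs  FLOAT           -- application dependent attribute
);

CREATE TABLE itemsets (
  id     INTEGER NOT NULL,                       -- itemset ID, shared by all members of the set
  itemid INTEGER NOT NULL REFERENCES items(id)   -- one row per (itemset, item) pair
);

CREATE TABLE rules (
  id            INTEGER NOT NULL PRIMARY KEY,
  itemsetidbody INTEGER NOT NULL,   -- refers to an itemset ID in itemsets
  itemsetidhead INTEGER NOT NULL,   -- refers to an itemset ID in itemsets
  supp          FLOAT,
  conf          FLOAT,
  lift          FLOAT
);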

For each rule in the rules table, a rule-ID and two itemset-IDs are stored. The latter identify body and head of the rule in the itemsets table. The values for the rule quality measures are added as floats. The pairs stored in the itemsets table consist of an ID identifying the itemset and an item-ID referring to the IDs in the items table. The latter table is application dependent and stores the attribute values further describing the items. (We regard the name of an item as an attribute.)

Now that the rules are stored in the database system, we need to realize the access to the cache. For that purpose we translate the queries from our mining language into SQL queries. As mentioned before, at this point the usage of the database system pays off: implementing a translation unit is rather straightforward and executing the resulting SQL queries is handled entirely by the database query engine. For the realization we employ C++ and a parser generator. As an example, the translation of Query 4 from Section 2 is given in Figure 2.

CREATE VIEW extended_itemsets (itemsetid, itemid, type, costs) AS (
  SELECT Itemsets.id, itemid, type, costs
  FROM Itemsets INNER JOIN Items ON Itemsets.itemid = Items.id
)

SELECT * FROM rules
WHERE supp > 0.25 AND conf > 0.975
  AND itemsetidhead IN (
    SELECT itemsetid FROM extended_itemsets
    WHERE type = ’spec_equip’ AND costs > 1000
  )

Fig. 2. Translation of Query 4 from Section 2 to SQL.
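In the same style, the aggregate functions of Query 3 map to scalar subqueries over the rules table; an illustrative translation (the SQL actually generated by the translation unit may differ in detail) is:

SELECT * FROM rules
WHERE supp < 1.01 * (SELECT MIN(supp) FROM rules)
  AND conf > 0.99 * (SELECT MAX(conf) FROM rules)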

The translated queries may look complicated to a human reader but are nevertheless processed efficiently by the database system. The number and types of the attributes linked to the items and stored in the separate items table are of course application dependent. Therefore they are not hard-wired into our software. In fact, during the compilation of the translation unit and during the translation of mining queries the types need not be known. Types are not required until the translated queries are executed on the database engine. At this point, of course, a table describing the attributes with their types must exist.
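The nested quantifiers of Query 5 can be handled in the same spirit with correlated subqueries. The following is only an illustrative sketch (again, the generated SQL may differ); it assumes that the items table carries the attribute columns type and manu:

SELECT r.*
FROM rules r
WHERE r.supp > 0.25 AND r.conf > 0.975
  AND EXISTS (
    SELECT 1
    FROM itemsets h INNER JOIN items ih ON h.itemid = ih.id
    WHERE h.id = r.itemsetidhead AND ih.type = ’SpEquip’
      AND EXISTS (
        SELECT 1
        FROM itemsets b INNER JOIN items ib ON b.itemid = ib.id
        WHERE b.id = r.itemsetidbody AND ib.type = ’SpEquip’
          AND ib.manu = ih.manu
      )
  )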

4.2 Evaluation

For our evaluation, we used the QUIS database – Quality Information System – at DaimlerChrysler. This database is a huge source of interesting mining scenarios, e.g. [8, 9]. We consider mining dependencies between the special equipments installed in cars, together with additional attributes like model, production date, etc. We selected a set of 100 million database rows. The generation of association rules took up to more than four hours on a SUN ULTRASPARC-2 clocked at 400 MHz. As in [6, 16], we experienced that the time for rule generation was dominated by the overhead of database access. Obviously a rerun of the algorithm implies a considerable interruption of the analysis, so without rule caching, interactive mining of the data is impossible. We filled four separate caches at different levels of minsupp, containing about 10,000, 100,000, 200,000, and 300,000 rules. We found that rules with too many items hardly make sense in our domain and therefore restricted the maximal number of items per rule to four, with a single item in the head. In addition, before submitting the queries from Section 2, we modified the thresholds for the quality measures to restrict the returned rules to less than a thousand rules for each of the caches; we think the retrieval of larger rule sets does not make sense. We employed an item with support of about 33% in the where-clause of Query 2. In Query 5 there are five different manufacturers uniformly distributed over the special equipments. In Figure 3, the response times for Queries 1-5 from Section 2 on the different rule caches are shown. We employed IBM's DB2 V7.1 running under Linux on a 500 MHz Pentium III for our experiments.

[Figure 3: bar chart of response times (y-axis: time) per query for the four caches of 10,000, 100,000, 200,000, and 300,000 rules; the ranges annotated in the chart are 13-156 sec, 1-15 sec, 3-12 sec, 2-3 sec, and 1-15 sec.]

Fig. 3. Response times for Queries 1-5 from Section 2.1 in seconds.

The response times clearly satisfy the demands of interactive knowledge discovery. It is important to note that, once the cache is filled, response times no longer depend on the size of the underlying data. Whereas algorithm runs scale at least linearly with the number of transactions, retrieval from the cache depends only on the number of cached rules. Our experience shows that typically a growing number of transactions does not imply a growing number of rules (constant thresholds presumed). Obviously, in the data we analyzed, frequent patterns are more or less uniformly distributed over all transactions. As a consequence, retrieving rules from a cache generated from ten million transactions is not much worse than retrieving rules from a cache generated from one million transactions. The reason is simply that the sizes of the caches do not vary significantly.

For our experiments with postponed restricts we chose three different attributes for restriction that are contained as pseudo items in approximately 25%, 30% and 60% of the vehicles. The achieved response times were nearly the same for all three attributes. Not surprisingly, execution times grow linearly with the number of cached rules. Restricting the four caches took about 20 sec, 163 sec, 319 sec and 497 sec respectively. Although still much faster than a rerun of the algorithm, for very large caches the response times obviously suffer. We therefore treat the restrict operation explicitly, separately from the query language. The idea is to always transform a complete rule set and to store the result of this operation also in the database. Then this result is not lost for further mining iterations but accessible through the query language like any other cached rule set.
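For illustration, with the tables of Figure 1 and a restriction to R = {E-Class} stored as a pseudo item, such a transformation could be sketched as follows. This is a simplified sketch rather than the exact implementation: rules_eclass is a hypothetical result table, :scale is a placeholder for the factor |D|/|D′|, and both stripping the pseudo item from the stored body itemset and recomputing lift (which additionally needs suppD(B ∪ R)) are omitted. Following Section 3, a cached rule with body A ∪ {E-Class} and head B yields the restricted rule A → B with its support rescaled by |D|/|D′| and its confidence unchanged:

INSERT INTO rules_eclass (id, itemsetidbody, itemsetidhead, supp, conf)
SELECT r.id, r.itemsetidbody, r.itemsetidhead,
       r.supp * :scale,   -- suppD′(A → B) = suppD(A ∪ B ∪ R) · |D|/|D′|
       r.conf             -- confD′(A → B) = confD(A ∪ R → B)
FROM rules r
WHERE r.itemsetidbody IN (
  SELECT i.id
  FROM itemsets i INNER JOIN items it ON i.itemid = it.id
  WHERE it.name = 'E-Class'
);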

5 Conclusion

In this paper we set out how to support user interactivity in association rule mining. Our basic idea is exactly the opposite of the common approach taken

today: instead of improving response times by restrictions on the result sets, we accept one broad and possibly expensive algorithm run. This initial run fills a sophisticated rule cache. Answering refined search queries by retrieving rules from the cache typically takes only seconds, whereas rerunning an algorithm implies minutes up to hours of idle time for the analyst. In addition, response times become independent of the number of underlying transactions.

However, there are two critical aspects of rule caching, and for both problems we presented a promising approach in this paper. First, the analyst needs a powerful means to navigate the rather large rule cache. For this purpose, we enhanced the concept of association mining: in brief, we introduced attributes to describe items, quantifiers on itemsets, and aggregate functions on rule sets. The queries that become possible are powerful and practically relevant. Second, without rerunning the mining algorithm, the cache must satisfy a broad variety of queries. A KDD process typically starts with the general case and comes down to the specifics in later phases. A common and often employed task is to restrict the mining data to subsets for further investigation, e.g. to focus on a special vehicle model or on a special day of the week. Normally this implies a regeneration of all rules in the cache. We solved this problem and showed how to answer even those queries from the cache that specify rules that would have been generated if the mining data had been restricted to a subset. For this purpose we neither need to rerun the algorithm nor to touch the mining data at all.

Finally, we presented the SMART SKIP system, which implements the ideas introduced in this paper. An important aspect of SMART SKIP is that it greatly benefits from its implementation on top of a conventional relational database system. We evaluated it on a real database deployed at DaimlerChrysler. We found that our system is scalable and supports interactive data mining efficiently, even on very large databases.

References

1. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (ACM SIGMOD '93), pages 207–216, Washington, USA, May 1993.
2. R. J. Brachman and T. Anand. The process of knowledge discovery in databases: A human centered approach. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, chapter 2, pages 37–57. AAAI/MIT Press, 1996.
3. S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data (ACM SIGMOD '97), pages 265–276, 1997.
4. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27–34, November 1996.
5. J. Han, Y. Fu, W. Wang, K. Koperski, and O. Zaiane. DMQL: A data mining query language for relational databases. In Proc. of the 1996 SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD '96), Montreal, Canada, June 1996.
6. J. Hipp, U. Güntzer, and U. Grimmer. Integrating association rule mining algorithms with relational database systems. In Proc. of the 3rd Int'l Conf. on Enterprise Information Systems (ICEIS 2001), pages 130–137, Portugal, July 2001.
7. J. Hipp, U. Güntzer, and G. Nakhaeizadeh. Algorithms for association rule mining – a general survey and comparison. SIGKDD Explorations, 2(1):58–64, July 2000.
8. J. Hipp and G. Lindner. Analysing warranty claims of automobiles: an application description following the CRISP-DM data mining process. In Proc. of the 5th Int'l Computer Science Conf. (ICSC '99), pages 31–40, Hong Kong, China, December 13-15, 1999.
9. E. Hotz, G. Nakhaeizadeh, B. Petzsche, and H. Spiegelberger. WAPS, a data mining support environment for the planning of warranty and goodwill costs in the automobile industry. In Proc. of the 5th Int'l Conf. on Knowledge Discovery and Data Mining (KDD '99), pages 417–419, San Diego, California, USA, August 1999.
10. T. Imielinski, A. Virmani, and A. Abdulghani. Data mining: Application programming interface and query language for database mining. In Proc. of the 2nd Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD '96), pages 256–262, Portland, Oregon, USA, August 1996.
11. T. Imielinski, A. Virmani, and A. Abdulghani. DMajor – application programming interface for database mining. Data Mining and Knowledge Discovery, 3(4):347–372, December 1999.
12. M. Klemettinen, H. Mannila, and H. Toivonen. Interactive exploration of discovered knowledge: A methodology for interaction, and usability studies. Technical Report C-1996-3, University of Helsinki, Department of Computer Science, 1996.
13. B. Liu, M. Hu, and W. Hsu. Multi-level organisation and summarization of the discovered rules. In Proc. of the 6th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD '00), pages 208–217, Boston, MA, USA, August 20-23, 2000.
14. R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. In Proc. of the 22nd Int'l Conf. on Very Large Databases (VLDB '96), Mumbai (Bombay), India, September 1996.
15. R. Ng, L. S. Lakshmanan, J. Han, and T. Mah. Exploratory mining via constrained frequent set queries. In Proc. of the 1999 ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD '99), pages 556–558, Philadelphia, PA, USA, June 1999.
16. S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD Record (ACM Special Interest Group on Management of Data), 27(2):343–355, 1998.
17. R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of the 21st Conf. on Very Large Databases (VLDB '95), Zürich, Switzerland, September 1995.
18. R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In Proc. of the 1996 ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD '96), Montreal, Canada, June 1996.
19. R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In Proc. of the 3rd Int'l Conf. on KDD and Data Mining (KDD '97), Newport Beach, California, August 1997.
20. R. Wirth and J. Hipp. CRISP-DM: Towards a standard process model for data mining. In Proc. of the 4th Int'l Conf. on the Practical Applications of Knowledge Discovery and Data Mining, pages 29–39, Manchester, UK, April 2000.
