

A Parallel/Distributed Algorithmic Framework for Mining All Quantitative Association Rules

Ioannis T. Christou, Athens Information Technology, Monumental Plaza, Bld. C, 44 Kifisias Ave., Marousi 15125, Greece ([email protected], +30-210-668-2725) (Corresponding Author)
Emmanouil Amolochitis, Cepal Hellas Financial Services, Athens, Greece
Zheng-Hua Tan, Dept. of Electronic Systems, Aalborg University, Aalborg, Denmark

Abstract—We present QARMA, an efficient novel parallel algorithm for mining all Quantitative Association Rules in large multi-dimensional datasets, where items are required to have at least a single common attribute that is specified in the rules' single consequent item. Given a minimum support level and a set of threshold criteria on interestingness measures (such as confidence, conviction, etc.), our algorithm guarantees the generation of all non-dominated Quantitative Association Rules that meet the minimum support and interestingness requirements. Such rules can be of great importance to marketing departments seeking to optimize targeted campaigns or general market segmentation; they can also be of value in medical applications, as well as in financial and predictive maintenance domains. We provide computational results showing the scalability of our algorithm and its capability to produce all rules to be found in large-scale synthetic and real-world datasets such as Movie-Lens, within a few seconds or minutes of computational time on commodity off-the-shelf hardware.

Index Terms—I.2.6.g Machine Learning, G.2.1.a Combinatorial Algorithms

1 INTRODUCTION

Quantitative association rules (QAR) [1,2,3] form an integral part of Association Rule Mining [3,4,5]; nevertheless, most of the literature on algorithms for the discovery of association rules in databases focuses on Boolean association rule mining, that is, finding all rules with certain support and confidence levels of the form A → B, where A, B are subsets of the entire set of items (inventory catalogue) S that appear in a database D of historical transactions. Such rules are qualitative (Boolean) in nature, as no attribute-related information about the items in the antecedent or the head of the rule is taken into account; all that matters is whether the items in question appear in a transaction, even though certain information, such as the price paid for each item, could be of great importance.

In our context for Quantitative Association Rule Mining (QARM), the database D comprises historical transactions of users purchasing items from a collection S of items, each characterized by a set of categorical attributes A_i^C and quantitative attributes A_i^Q, together forming the set of attributes A_i = A_i^C ∪ A_i^Q for item i. For each attribute a ∈ A_i, depending on whether it is categorical or quantitative, there exists an associated set of possible values R_a the attribute may take, called the attribute's range:

$$R_a = \begin{cases} [l_a, u_a] \subseteq \bar{\mathbb{R}} = [-\infty, +\infty], & a \in A_i^Q \\ \{v_1^a, \dots, v_{n_a}^a\}, & a \in A_i^C \end{cases}$$

The database D is arranged in records of user histories H containing information about every item purchased by each user. A user history is therefore the set of all transactions made by that particular user. A single transaction is a record containing a unique identification number of the user who made the transaction, together with information about the single item in that transaction: the item's id, along with values for one or more of the item's attributes; for example, the item's purchase price.
We denote by S^i the set of all user histories containing at least one instance of item i. The QARM task is then to discover all interesting quantitative rules of the form

$$i_1[v^{a_{i_1,1}} \in r_{i_1,1}] \wedge \dots \wedge i_n[v^{a_{i_n,k}} \in r_{i_n,k}] \xrightarrow{s,c} j_1[v^{a_{j_1,1}} \in r_{j_1,1}] \wedge \dots \wedge j_m[v^{a_{j_m,l}} \in r_{j_m,l}] \qquad (1)$$

where the notation i_j[v^a ∈ r] (with a ∈ A_{i_j}) denotes the "restriction" of the records in H containing item i_j to those for which the attribute a takes values v^a from the set r ⊆ R_a. The notation in equation (1) should be interpreted as: "IF the value of attribute a_{i_1,1} for item i_1 is in the set r_{i_1,1}, ..., and the value of attribute a_{i_n,k} for item i_n is in the set r_{i_n,k}, THEN, with interestingness level(s) c and support s, item j_1 appears in the same record and the value of its attribute a_{j_1,1} is in the set r_{j_1,1}, ..., and item j_m appears in the same record and the value of its attribute a_{j_m,l} is in the set r_{j_m,l}." Such rules are in fact multi-dimensional quantitative association rules [6, pp. 251-257], as many distinct attributes (dimensions) are involved in the definition of the same rule.

By "interestingness" of a rule, we mean that on the dataset the rule must take values for a set of defined metrics above or below specified thresholds. In usual practice, the single metric defining the interestingness of a rule is its confidence (which must be higher than a specified threshold); however, other metrics or combinations thereof might equally well be applied within our system: for example, in the work we present, it is possible to define as interestingness criteria that a rule must have (a) confidence above a minimum confidence threshold and (b) absolute value of leverage [3] above a minimum leverage threshold. We will define the notions of support, confidence and conviction [7], [8, pp. 151-159] for quantitative rules in the subsequent sections, as natural (minor) extensions of the corresponding notions for standard qualitative association rules. The problem has been shown to be NP-hard in general [4, 9].

In this work, we present QARMA, a fast parallel exact algorithm that computes all non-dominated (a notion precisely defined in section 2.1; see Definition 4) quantitative association rules with minimum support and interestingness of the slightly restricted form

$$i_1[p \ge l_{p_1}] \wedge i_1[v^{a_{i_1}} \ge l_{a_{i_1}}] \wedge \dots \wedge i_n[p \ge l_{p_n}] \wedge \dots \wedge i_n[v^{a_{i_n,m}} \ge l_{a_{i_n,m}}] \xrightarrow{s,c} j[p \geqq l_j] \qquad (2)$$
i.e., we require that there is exactly one item specified in the rule's consequent, that there exists at least one shared quantitative attribute p for each item in the dataset, and we consider the problem of mining (multi-dimensional) quantitative association rules whose consequent item constrains only the value of this shared attribute to intervals of the form [l, +∞) or, alternatively, to exact values {l} (the symbol '≧' in eq. (2) is to be interpreted as "exclusively either '≥' or else '='"). In the antecedent, there may be any combination of attributes of any of the items, taking values on left-closed, half-open intervals of the previous form.

This restriction does not prevent the algorithm in any way from discovering rules that constrain the attribute values of the antecedent items to fully closed intervals: introducing an artificial attribute that takes values equal to the negative of the original attribute value results in the discovery of all rules where the intervals on the antecedent items' attributes may take any form (half-open on either end, fully closed, or even singleton value intervals). Indeed, it is trivial to see that the constraint i[v^{a_i} ∈ [l, u]] is equivalent to the expression i[v^{a_i} ≥ l] ∧ i[⟨−v^{a_i}⟩ ≥ −u], where ⟨−v^{a_i}⟩ denotes the artificial attribute just discussed. By setting l = u we also cover the singleton value interval case. Therefore, the rules our algorithm discovers are constrained from the general multi-dimensional form (1) only in that a shared attribute must be present in all items in the dataset, and in that the right-hand side of the rules may only specify half-open (left-closed) intervals (or single values) for the single shared attribute common to all items.
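The negated-attribute transformation above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; all names (`neg_` prefix, `price`) are ours: a closed-interval constraint i[v^a ∈ [l, u]] is checked using only the two ≥-form constraints i[v^a ≥ l] and i[⟨−v^a⟩ ≥ −u].

```python
def add_negated_attribute(transaction: dict, attr: str) -> dict:
    """Augment a transaction's attribute map with the artificial negated
    attribute <-v^a> (here stored under an illustrative 'neg_' key)."""
    t = dict(transaction)
    t["neg_" + attr] = -t[attr]
    return t

def satisfies_closed_interval(transaction: dict, attr: str, l: float, u: float) -> bool:
    """Check i[v^a in [l, u]] using only left-closed (>=) constraints."""
    t = add_negated_attribute(transaction, attr)
    return t[attr] >= l and t["neg_" + attr] >= -u

tx = {"price": 3.5}
print(satisfies_closed_interval(tx, "price", 2.0, 4.0))  # 3.5 in [2, 4] -> True
print(satisfies_closed_interval(tx, "price", 4.0, 5.0))  # 3.5 not in [4, 5] -> False
print(satisfies_closed_interval(tx, "price", 3.5, 3.5))  # singleton {3.5} -> True
```

Setting l = u demonstrates the singleton-value case mentioned in the text.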
In the following, we will consider the setting where the quantitative attribute in the rule’s consequent item must take values on left–closed half–open intervals only, and we will discuss the case of exact single values in section 3.4, together with some other possible extensions/modifications; in that section we also show through computational results the scalability of QARMA. We discuss the applications of this algorithm in a number of domains, and provide quantitative results showing the feasibility of our approach even for large–scale datasets.

1.1 Motivation

A major motivating factor for this research has been our work on movie recommendation systems for triple-play services providers [10–12]. In many such settings (see also [13–14]), the price of a movie is a function of time (items are depreciated over time) and plays a very important role in a user's final decision on whether to purchase an item; indeed, market research [15] shows that currently more than 90% of all items viewed by video-on-demand (VoD) subscribers belong to the "free item" category, and that the average subscriber in the EU pays just under two Euros per month on VoD rentals over and above their fixed monthly subscription fees. Yet recommender system algorithms do not systematically exploit this fact in order to either increase their recall rates and subsequently increase customer retention rates, or, even better, understand for each customer individually their "ideal" price range for a movie, and then potentially make a more appealing offer to each customer.

In the context of recommender systems, we use our algorithm in two ways: (a) to build a post-processing tool that re-ranks the top-n recommendations of another recommender engine [10–11] based on the rules found by the algorithm; and (b) to build a personalized reservation price estimator that computes an approximation of a given customer's reservation price for a given item, so as to possibly adjust the price of that item specifically for that customer alone. The second use of our algorithm (to obtain first-degree pricing differentiation, see [16]) can have a significant impact on e-commerce, and commerce in general, as our computational results show.

1.2 Related Work

Piatetsky-Shapiro [3] presented the first algorithm for Quantitative Association Rule Mining in 1991, even though the exposition was limited to single antecedent and single consequent attributes.
Srikant & Agrawal [1] showed how to overcome the problem of finding summaries for all combinations of attributes (which is exponentially large), required for a straightforward extension of Piatetsky-Shapiro's algorithm to multiple antecedent and/or consequent attributes, by a decomposition/partitioning of the quantitative attributes' intervals, followed by possible merging and by pruning the search space of candidate itemsets. Salleb-Aouissi et al. [17] developed QuantMiner, a genetic algorithm (GA) for mining interesting QARs via an optimization process; the idea of optimizing QARs originally appeared in Fukuda et al. [18], where, however, the presented algorithms aim to find intervals for quantitative attributes that maximize the support and/or confidence of the produced rules. More recent work on finding QARs using evolutionary algorithms is presented in [19], where the authors used multi-objective Genetic Algorithms incorporating the notion of rule interestingness in their fitness function, and similarly in [20], where Alvarez and Vasquez developed an evolutionary algorithm to discover interesting QARs without any a priori discretization. The work of Ruckert et al. [21] is also related to ours; the authors proposed looking for half-space conditions on the quantitative attributes instead of hyper-rectangular representations. In their work, the antecedent or the consequent is formalized as linear inequalities over the quantitative attributes, which includes the rules we search for as special cases. However, instead of seeking to produce all such interesting rules that exist, the authors resort to a gradient-descent-based algorithm for producing so-called locally optimal rules (optimal according to their notion of "rule interestingness"), and they give no guarantees as to the quality of the solutions their algorithm finds.
Finally, regarding the application of standard association rule mining in the field of recommender systems, a first approach to collaborative filtering-based recommendation via adaptive association rule mining is presented in [22]. Ye [23] presented a system that combines association rule mining and self-organizing maps to create personalized recommendations. Leung et al. [24] also developed a collaborative filtering framework using fuzzy association rules incorporating multi-level similarity measures, which improved prediction quality.

1.3 Our Contribution

We present QARMA (short for QAR Mining Algorithm), a new, highly parallelizable algorithm for mining all interesting quantitative association rules, that is, all non-dominated rules having support above a minimum support threshold s_min and whose values for a set of interestingness metrics (e.g. confidence) are above minimum specified threshold values c_min. The algorithm works on datasets where all items share at least one attribute (e.g. price), and produces rules that constrain the common attribute of the consequent item to a left-closed interval. The rules also specify minimum values for specific attributes of the items in the rule's antecedent. Our algorithm differs significantly from other algorithms in that it does not use any value discretization or bucketing technique (though such techniques can easily be incorporated, and may even be needed when dealing with dense datasets whose attributes take very many different numeric values; see section 3.4), and in addition it guarantees the generation of all such rules. We demonstrate its importance in real-world business decision-making problems. In section 3.2 we run experiments with a synthetic e-commerce dataset. In section 3.3 we run experiments on another synthetic dataset to test whether QARMA can detect rare-event conditions in predictive maintenance settings. In section 3.4 we run experiments with the Movie-Lens dataset ml-1m. We show the scalability of QARMA in terms of the speed-up obtained by using more processing units, and we also make comparisons against other QARM algorithms.

2 THE QARMA ALGORITHM

2.1 Definitions

In the following, we assume that Q denotes a finite set of tuples of the form (i, a, v) where i ∈ S, a ∈ A_i^Q, v ∈ ℝ. The pair (i, a) forms a key of that set, so that no two tuples with the same item and attribute can exist in Q. We denote a quantitative association rule of the form (2) as r = (B → I | Q), where B ⊂ S, I ∈ S, with the following interpretation: with sufficiently high support and interestingness, the existence of all items of B in a user history t, such that for each item i ∈ B appearing in a tuple (i, a, v) ∈ Q there exists a transaction in t for that item with value v_a ≥ v for attribute a, implies the existence of I in that user's set of historical transactions, with the value of the common shared attribute p of item I greater than or equal to the value specified in the tuple (I, p, v) ∈ Q. Due to our imposed restriction of mining rules that specify values only for a particular user-defined attribute p in the consequent item, the tuples in Q have the following property: ∀r = (B → I | Q): ((I, a, v) ∈ Q → a = p). For a given dataset D, we denote the set of all possible quantitative rules in that set by R_D (not to be confused with the d-dimensional real space ℝ^d).

Definition 1. (Rule Confidence) A rule r = (B → I | Q) has confidence c, denoted CONFIDENCE(r), equal to the number of user histories that contain each of the items in B ∪ {I} with attribute values, for each of the attributes specified in Q, at least equal to the value specified for that item-attribute pair in Q, divided by the number of user histories in which each of the items in B appears with attribute values, for each of the attributes specified in Q, at least equal to the value specified for that item-attribute pair in Q. If an item does not appear in any of the tuples in Q, the condition reduces to the mere existence of the item in one or more transactions of the user history, regardless of any attribute value.

Definition 2. (Rule Support) A rule r = (B → I | Q) has support s, denoted SUPPORT(r), equal to the number of user histories in which each of the items in B ∪ {I} appears with attribute values, for each of the attributes specified in Q, at least equal to the value specified for that item-attribute pair in Q, divided by the total number of user histories in the database. If an item does not appear in any of the tuples in Q, the condition reduces to the mere existence of the item in one transaction of the user history, regardless of any attribute value.

Definition 3. (Rule Conviction) A rule r = (B → I | Q) has conviction t, denoted CONVICTION(r), equal to the ratio (1 − CONS_SUPP(r)) / (1 − CONFIDENCE(r)), where the consequent support CONS_SUPP(r) is the number of user histories in which the item I appears with consequent attribute value at least equal to the value specified for that item-attribute pair in Q, divided by the total number of user histories in the database (if the consequent item does not appear in any of the tuples in Q, the condition reduces to the mere existence of the item in one transaction of the user history, regardless of attribute value).

Definition 4. (Rule Dominance) Given a transitive predicate formula LTF(r, r′) derived from a finite set of interestingness metrics M = {F: R_D → ℝ}, satisfying ∀r, r′, r″ ∈ R_D: LTF(r, r′) ∧ LTF(r′, r″) → LTF(r, r″), we say that a rule r = (B → I | Q) is dominated by another rule r′ = (B′ → I | Q′), and we write r′ ≻ r, if:
(i) B ⊇ B′; AND
(ii) (I, p, v) ∈ Q, (I, p, v′) ∈ Q′ → v′ ≥ v (i.e. the consequent's attribute value in r′ is greater than or equal to the consequent's attribute value in r); AND
(iii) SUPPORT(r) ≤ SUPPORT(r′); AND
(iv) LTF(r, r′); AND
(v) ∀(K ∈ B′, a ∈ A_K^Q, w^a) ∈ Q′ → ∃(K ∈ B, a ∈ A_K^Q, v^a) ∈ Q: w^a ≤ v^a.
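Definitions 1-3 translate directly into code. The sketch below is illustrative only (the data layout — user histories as lists of (item, attribute, value) tuples, and Q as a map from (item, attribute) to a minimum value — is our assumption, not the paper's implementation):

```python
def holds(history, items, Q):
    """True iff every item in `items` appears in the history and, for each
    (item, attr) pair constrained in Q, some transaction for that item has a
    value >= the threshold; unconstrained items need only be present."""
    for i in items:
        if not any(t[0] == i for t in history):
            return False
        for (j, a), v in Q.items():
            if j == i and not any(t[0] == i and t[1] == a and t[2] >= v
                                  for t in history):
                return False
    return True

def support(D, B, I, Q):       # Definition 2
    return sum(holds(h, B | {I}, Q) for h in D) / len(D)

def confidence(D, B, I, Q):    # Definition 1
    return sum(holds(h, B | {I}, Q) for h in D) / sum(holds(h, B, Q) for h in D)

def conviction(D, B, I, Q):    # Definition 3
    cons_supp = sum(holds(h, {I}, Q) for h in D) / len(D)
    return (1 - cons_supp) / (1 - confidence(D, B, I, Q))

# Toy database of four user histories; rule: a[price >= 5] -> c[price >= 8].
D = [
    [("a", "price", 5), ("c", "price", 9)],
    [("a", "price", 7), ("c", "price", 4)],
    [("a", "price", 2)],
    [("b", "price", 1)],
]
Q = {("a", "price"): 5, ("c", "price"): 8}
print(support(D, {"a"}, "c", Q))     # 0.25 (only the first history qualifies)
print(confidence(D, {"a"}, "c", Q))  # 0.5 (1 of 2 histories with a[price >= 5])
print(conviction(D, {"a"}, "c", Q))  # 1.5
```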
The predicate that we shall assume by default is LTF(r, r′) = CONFIDENCE(r) ≤ CONFIDENCE(r′); in another setting we could have LTF(r, r′) = CONFIDENCE(r) ≤ CONFIDENCE(r′) ∧ CONVICTION(r) ≤ CONVICTION(r′). The above dominance definition holds when the rules' right-hand sides are inequalities; when mining rules whose right-hand sides represent equalities, the rule dominance criterion must be slightly modified to require that the dominant and dominated rules specify exactly the same value for their consequent item's common attribute. There is no dominance relationship between two rules whose right-hand sides are equalities but specify different values for their consequent item attribute.

Definition 5. (Rule Coverage) The covering set of a quantitative rule r = (B → I | Q) is defined as the subset of user histories in the dataset for which the rule holds: COV(r) = {t ∈ D: (∀(i ∈ B, a ∈ A_i^Q, v) ∈ Q: ∃(i, a, v′) ∈ t: v′ ≥ v) ∧ ((I, p, v) ∈ Q → ∃(I, p, v′) ∈ t: v′ ≥ v)}. The covering set of a set of rules RS ⊆ R_D is defined as the union of the covering sets of its members: COV(RS) = ∪_{r∈RS} COV(r). From the definition of support, we have SUPPORT(r) = |COV(r)| / |D|.

Notice that while our definitions of support and interestingness of a rule are direct applications of the respective notions in the realm of binary association rule mining, we also have to introduce the notion of dominance defined in Definition 4 above, so as to discard valid (satisfying minimum support and interestingness) but useless rules: a dominated rule is of no use, as there exists another rule with equal or higher support and interestingness metrics that implies the same item as the dominated rule with equal or higher attribute value, while the dominated rule specifies higher values for each of the antecedent item attributes specified in the other (dominant) rule. As a result, the dominated rule provides no new information: whenever the dominated rule fires and signals a value for the common quantitative attribute of the consequent item I higher than a value v, the dominant rule also fires and signals an at least as high value for I's attribute (providing equal or more information), with equal or higher support and interestingness values. A nice property of the notion of rule dominance is transitivity: r_1 ≻ r_2 ∧ r_2 ≻ r_3 ⇒ r_1 ≻ r_3.
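The dominance conditions of Definition 4 (with the default predicate LTF(r, r′) = CONFIDENCE(r) ≤ CONFIDENCE(r′)) can be checked mechanically. The sketch below is an illustration under our own rule representation (a dict with antecedent set B, consequent I, constraint map Q, and precomputed support/confidence), not the paper's code:

```python
def dominates(r_prime, r, p="price"):
    """True iff r' dominates r (written r' > r), per Definition 4."""
    if r_prime["I"] != r["I"]:
        return False
    if not r["B"] >= r_prime["B"]:                  # condition (i): B superset of B'
        return False
    key = (r["I"], p)
    if r_prime["Q"][key] < r["Q"][key]:             # condition (ii): v' >= v
        return False
    if r["support"] > r_prime["support"]:           # condition (iii)
        return False
    if r["confidence"] > r_prime["confidence"]:     # condition (iv): default LTF
        return False
    # condition (v): every antecedent constraint of r' is matched in r
    # by an equal or higher threshold
    for (K, a), w in r_prime["Q"].items():
        if K == r_prime["I"]:
            continue
        if (K, a) not in r["Q"] or r["Q"][(K, a)] < w:
            return False
    return True

r1 = {"B": {"a"}, "I": "c", "Q": {("a", "price"): 5, ("c", "price"): 8},
      "support": 0.30, "confidence": 0.60}
r2 = {"B": {"a", "b"}, "I": "c",
      "Q": {("a", "price"): 7, ("b", "price"): 2, ("c", "price"): 8},
      "support": 0.25, "confidence": 0.55}
print(dominates(r1, r2))  # True: r1 needs fewer items and looser antecedents
print(dominates(r2, r1))  # False
```

Transitivity of `dominates` follows from the transitivity of ⊇, ≤ and the LTF predicate, matching the remark above.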
Another relation closely associated with the dominance relation between rules is the widening relation, which we define as follows: we say that a rule r′ is wider than a rule r, and we write r′ ⊃ r, if all the conditions of Definition 4 hold, with the possible exception of the conditions relating the support or interestingness values of the two rules. It is easy to see that the widening relation between rules is also transitive, and that the implication r′ ≻ r ⇒ r′ ⊃ r clearly holds, though the reverse implication is not true in general. It is easy to modify the QARMA algorithm to produce all the widest non-dominated rules that exist for a given dataset, sometimes significantly reducing the total number of rules produced (the rationale being that, as long as the rules have minimum support and interestingness, we do not care for rules for which wider ones exist in the result).

Finally, there is the issue of rule generalization: given a valid non-dominated rule r = (B → I | Q) that meets minimum support and interestingness criteria on a given training dataset D, the evidence on that dataset also supports the following generalizations of the restrictions on the rule's item-attribute quantifications: ∀i ∈ B ∪ {I}, (i, t, l_{i,t}) ∈ Q: i[v^{a_{i,t}} > p⁻_{i,t}], where p⁻_{i,t} is the greatest value that attribute t of item i takes on the dataset D that is strictly less than l_{i,t}; if there is no such value in the dataset, the restriction remains interpreted as i[v^{a_{i,t}} ≥ l_{i,t}].
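The generalization step above amounts to replacing a mined threshold l with a strict inequality at the largest dataset value below l. A minimal sketch (our own illustrative function, operating on the sorted distinct values an attribute takes in D):

```python
def generalize_threshold(values, l):
    """Relax i[v >= l] to i[v > p_minus], where p_minus is the greatest value
    in `values` strictly below l; keep i[v >= l] if no such value exists."""
    below = [v for v in values if v < l]
    if below:
        return (">", max(below))
    return (">=", l)

observed = [1.0, 2.5, 4.0, 4.0, 7.5]        # values the attribute takes in D
print(generalize_threshold(observed, 4.0))   # ('>', 2.5)
print(generalize_threshold(observed, 1.0))   # ('>=', 1.0): nothing below 1.0
```

The relaxed rule covers exactly the same user histories in D as the original, which is why the training evidence supports it.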

2.2 Algorithm Specification

Our algorithm for computing all QARs from a transactional database expects a set of user histories (of the form {UserID: X, Items-Purchased: SET-OF (ItemID: Y, Attr: W, AttrValue: Z)}). As a concrete example, in the single-attribute case, we interpret QARs of the form (2) as: "IF a user has purchased products A_1 … A_k for which the corresponding attribute values are p_{A_1} … p_{A_k} or higher, THEN such a user would be willing to purchase a product C whose attribute value is at least p_C."

The above definition of the input database allows for multiple "purchases" of the same item by the same user, which is very common in most of the settings our algorithm is designed for. For example, in e-commerce datasets, the same customer purchases, over a period of time and in multiple transactions, many copies of the same product. Similarly, in patient medical records, the same patient has the same medical test (e.g. a blood pressure measurement) multiple times over a period of time, and so on. Nevertheless, such a feature is seldom found in other implementations, which usually assume that the input dataset consists of a single large table in which, in each row, each column may take on a single value (which, in relational database design, corresponds to the 1st normal form).

Items are assumed to be in a transitive order relationship, given by a map ord: S → ℕ (which can be the sequence id by which items are stored in the db). Similarly, attributes within an item are assumed to be in a transitive order relationship, given by a map ord: A_Q × S → ℕ.

QARMA Algorithm
INPUTS: Database D, minimum required support s, set of interestingness metrics M ⊂ {F: R_D → ℝ}, minimum required interestingness thresholds c(m) ∀m ∈ M, transitive predicate formula LTF(r, r′) defining when a rule is more interesting than another, consequent items' shared common attribute p, ordering maps ord: S → ℕ and ord: A_Q × S → ℕ.
OUTPUTS: All non-dominated QARs of the above form with support and interestingness above the minimum specified thresholds.
Begin
1. run procedure INIT(D, s, M, c(.), LTF, p) to produce the set of all frequent item-sets Fs and map P.
2. let R = ∅.
3. foreach k = 2 … max_{ℓ∈Fs} ‖ℓ‖ do:
  3.0. foreach unit of execution do:
    3.0.1. Atomically get global read-lock.
    3.0.2. set R_local to be a copy of R local to the current unit of execution.
    3.0.3. Atomically release global read-lock.
  3.1. endfor.
  3.2. parallel foreach frequent k-itemset ℓ_k ∈ Fs do:


    3.2.1. set H1 = {r = (B → I): B = ℓ_k ∖ {I}, I ∈ ℓ_k}.
    3.2.2. foreach rule r = (B → I) ∈ H1 do:
      3.2.2.1. let T = ∅. // T is a FIFO queue of sets of tuples of the form (K ∈ ℓ_k, a ∈ A_K^Q, p ∈ {p_{K,a,1}, … p_{K,a,n_{K,a}}})
      3.2.2.2. foreach i = n_{I,p} (= dim(P(I, p))) … 1 do:
        3.2.2.2.1. let Q = {(I, p, p_{I,p,i})}. // set the value of the common shared attribute p of I to p_{I,p,i} = P(I, p)_i
        3.2.2.2.2. if SUPPORT(B → I | Q) < s continue.
        3.2.2.2.3. Insert Q onto T.
        3.2.2.2.4. while T ≠ ∅ do:
          3.2.2.2.4.1. Remove the first Q from T.
          3.2.2.2.4.2. foreach J ∈ B do:
            3.2.2.2.4.2.0. if ord(J) < max_{L∈B, a∈A_L^Q} {ord(L): ∃v: (L, a, v) ∈ Q} continue. // avoid duplicate set creation
            3.2.2.2.4.2.1. foreach a ∈ A_J^Q do:
              3.2.2.2.4.2.1.1. if ord(J) < max_{A∈B} {ord(A): ∃v: (A, a, v) ∈ Q} OR ord(a, J) ≤ max_{t∈A_J^Q} {ord(t, J): ∃v: (J, t, v) ∈ Q} continue. // avoid duplicate set creation
              3.2.2.2.4.2.1.2. foreach j = 1 … n_{J,a} (= dim(P(J, a))) do:
                3.2.2.2.4.2.1.2.1. let Q′ = Q ∪ {(J, a, p_{J,a,j})}.
                3.2.2.2.4.2.1.2.2. if SUPPORT(B → I | Q′) ≥ s then
                  3.2.2.2.4.2.1.2.2.1. Insert Q′ onto T.
                  3.2.2.2.4.2.1.2.2.2. if ∀m ∈ M: m(B → I | Q′) ≥ c(m) then
                    3.2.2.2.4.2.1.2.2.2.1. if ¬(Exists-dominating-rule-for((B → I | Q′), R_local, LTF)) then
                      3.2.2.2.4.2.1.2.2.2.1.1. Add (B → I | Q′) to R_local.
                    3.2.2.2.4.2.1.2.2.2.2. endif // no dominating rule exists
                  3.2.2.2.4.2.1.2.2.3. endif // interestingness metrics ok
                3.2.2.2.4.2.1.2.3. else break. // support not ok
                3.2.2.2.4.2.1.2.4. endif. // support
              3.2.2.2.4.2.1.3. endfor. // j
            3.2.2.2.4.2.2. endfor. // a
          3.2.2.2.4.3. endfor. // J
        3.2.2.2.5. endwhile. // T
      3.2.2.3. endfor. // i
    3.2.3. endfor. // rule loop over r
  3.3. endfor parallel. // loop over ℓ_k
  3.4. foreach unit of execution do:
    3.4.1. Atomically get global write-lock.
    3.4.2. set R = R ∪ R_local.
    3.4.3. Atomically release global write-lock.
  3.5. endfor.
4. endfor. // k
5. return R.
End

The procedure INIT() that is invoked as the first step of the QARMA algorithm (step 1) is as follows:

Procedure INIT
INPUTS: Database D, minimum required support s, set of interestingness metrics M ⊂ {F: R_D → ℝ}, minimum required interestingness thresholds c(m) ∀m ∈ M, transitive predicate formula LTF(r, r′) defining when a rule is more interesting than another, consequent items' shared common attribute p.
OUTPUTS: All frequent itemsets Fs with support s or higher (assuming the minimum attribute value level for all attributes, i.e. treating the database as a qualitative database in which all attribute values are at their minimum). Map P: S × A_Q → ⋃_{k=1}^{max(n_{i,a}: i∈S, a∈A_i^Q)} ℝ^k with the property that P(i, a) is a lexicographically sorted vector in ℝ^{n_{i,a}}, where, clearly, n_{i,a} is the number of distinct values in the database that attribute a takes for item i.
Begin
0. Compute all the different values with which each attribute a ∈ A_i^Q of every item i occurs in D, sort them in increasing order 0 ≤ p_{i,a,1} < p_{i,a,2} < … < p_{i,a,n_{i,a}}, and store them in the map P: S × A_Q → ⋃_{k=1}^{max(n_{i,a}: i∈S, a∈A_i^Q)} ℝ^k so that P(i, a) is the lexicographically sorted vector in ℝ^{n_{i,a}} of the n_{i,a} distinct values that attribute a takes for item i in the database.
1. Use any algorithm for frequent itemset generation (such as the FP-Growth algorithm) to generate all frequent itemsets Fs with support s or higher (assuming the minimum attribute value level for all attributes, i.e. treating the database as a qualitative database).
End

The operation Exists-dominating-rule-for((B → I | Q′), R_local, LTF) returns true if and only if the QAR (B → I | Q′) is dominated by at least one other rule in the current QAR-set R_local according to the predicate LTF. The tests in steps 3.2.2.2.4.2.0 (ord(J) < max_{L∈B, a∈A_L^Q} {ord(L): ∃v: (L, a, v) ∈ Q}) and 3.2.2.2.4.2.1.1 (ord(J) …
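Step 0 of procedure INIT — building the map P of sorted distinct attribute values per (item, attribute) pair — is straightforward to implement. The sketch below is illustrative (the user-history layout as lists of (item, attr, value) tuples is our assumption, not the paper's data format):

```python
from collections import defaultdict

def build_P(histories):
    """Collect, for every (item, attribute) pair occurring in the database,
    the sorted vector of distinct values that attribute takes for that item,
    i.e. the map P of INIT with dim(P(i, a)) = n_{i,a}."""
    vals = defaultdict(set)
    for h in histories:
        for item, attr, value in h:
            vals[(item, attr)].add(value)
    return {key: sorted(vs) for key, vs in vals.items()}

D = [
    [("a", "price", 5.0), ("c", "price", 9.0)],
    [("a", "price", 7.0), ("a", "price", 5.0), ("c", "price", 4.0)],
]
P = build_P(D)
print(P[("a", "price")])  # [5.0, 7.0]  (n_{a,price} = 2; duplicate 5.0 collapsed)
print(P[("c", "price")])  # [4.0, 9.0]
```

QARMA then scans each consequent vector P(I, p) from its largest index down to 1 (step 3.2.2.2), so the `break` in step 3.2.2.2.4.2.1.2.3 can prune all remaining, necessarily higher, antecedent thresholds once support drops below s.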
