Mining Fuzzy Association Rules: An Overview
Miguel Delgado
BISC Conference, December 2003
SUMMARY
The objective of this talk is to review the most relevant results on the use of Fuzzy Sets in Data Mining, especially in relation to the discovery of Association Rules.
INDEX
1. General ideas and the basic concepts of Data Mining to justify the need of Fuzzy Sets Theory.
2. A historical review of developments in the field.
3. Our research on Fuzzy Association Rules.
4. Some instances of using fuzzy association rules.
5. Some suggestions about future research and problems to be solved.
General Ideas
The larger the amount of stored data, the greater the demand for extracting the implicit information it contains to aid decision-making in business, health care services, research, etc. Knowledge Discovery/Data Mining, whose objective is to obtain useful knowledge from data, is recognized as a basic necessity. Data Mining is sometimes considered synonymous with Knowledge Discovery. Since the nineties, "Data Mining" has become a central topic in Databases and Artificial Intelligence.
A "classical" characterization: "Data Mining is the process of nontrivial extraction of implicit, previously unknown and potentially useful information (rules, constraints, regularities, etc.) from data in databases."
Data Mining is a very broad field. According to Vila et al. (1997), the most important problems in Data Mining are:
1. Mining Association Rules: to discover important associations between sets of attribute values.
2. Dependency modelling and link analysis: to disclose dependencies between the variables in the database.
3. Multilevel Data Generalization, Summarization and Characterization: to provide compact descriptions at a high conceptual level.
4. Pattern identification and description: to look for interesting patterns and to describe them in a concise and meaningful manner.
   – pattern identification,
   – pattern description.
Soft Computing, and more specifically Fuzzy Sets, play a significant role in Data Mining. It is widely recognized that many real-world relations are intrinsically fuzzy. Many techniques used in crisp data mining models have a corresponding "fuzzy version". Some problems treated in a "pure crisp" data mining context have a more natural formulation when fuzzy knowledge and fuzzy data representations are used. This talk analyzes the use of Fuzzy Set tools in the task of Mining Association Rules.
MINING (CRISP) ASSOCIATION RULES
The first works (Agrawal et al., 1993) were devoted to disclosing patterns in transactional databases from the retail industry and business. The present terminology of this field preserves its origin. Looking for Association Rules is sometimes also called "market basket analysis": looking for associations between the items that a purchaser selects in a retail shop. A classical example is to analyze the connections between different types of goods in a sales database, that is, to check whether customers who buy one kind of goods (e.g. bread) usually also buy another kind (e.g. milk).
Example

trans-id  Bread  Butter  Biscuits  Milk
1         1      1       0         1
2         0      1       1         0
3         1      0       1         0
4         1      1       0         1
5         1      1       1         0
6         1      1       1         1

It is very easy to detect an association rule: Bread & Butter ⇒ Milk. When a client bought bread and butter, he/she also bought milk. The rule has one exception: trans-id 5. Hence the need for a measure of reliability/accuracy.
Reliability/Accuracy
Support and Confidence framework. The "support" measures reliability as the relative frequency of co-occurrence of the rule's items. The "confidence" measures the rule's accuracy as the quotient between the support of the rule and the relative frequency of the items in the left-hand side of the rule. For the rule above, the support is 3/6 and the confidence is 3/4.
Support and Confidence framework
The early approaches to Data Mining look for Association Rules whose support and confidence exceed respective thresholds, which are assumed to be fixed by the user.
Formal model
Let I be a set of items (objects) and T a set of transactions (i.e. sets of items of I), both assumed to be finite.
Definition. An association rule is an expression of the form A ⇒ C, where A, C ⊆ I, A, C ≠ ∅, and A ∩ C = ∅.
The rule A ⇒ C means "every transaction of T that contains A contains C too".
Definition. The support of an itemset I0 ⊆ I with respect to a set of transactions T is
\[ \mathrm{supp}(I_0, T) = \frac{|\{\tau \in T \mid I_0 \subseteq \tau\}|}{|T|} \tag{1} \]
i.e., the probability that a transaction of T contains I0.
Definition. The support of the association rule A ⇒ C in T is
\[ \mathrm{Supp}(A \Rightarrow C, T) = \mathrm{supp}(A \cup C, T) \tag{2} \]
The confidence of A ⇒ C in T is
\[ \mathrm{Conf}(A \Rightarrow C, T) = \frac{\mathrm{supp}(A \cup C, T)}{\mathrm{supp}(A, T)} = \frac{\mathrm{Supp}(A \Rightarrow C, T)}{\mathrm{supp}(A, T)}. \]
Support is the percentage of transactions where the rule holds. Confidence is the conditional probability of C given A, i.e. the relative cardinality of C with respect to A. Techniques for mining association rules attempt to discover rules whose support and confidence are greater than user-defined thresholds, called minsupp and minconf respectively (strong rules). T is usually assumed to be fixed for each problem, so it is customary to omit any reference to it.
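The definitions of support and confidence can be checked on the bread/butter example with a few lines of Python. This is only a minimal sketch: the names `TRANSACTIONS`, `supp`, `rule_supp` and `conf` are illustrative choices, not part of the talk.

```python
from fractions import Fraction

# Transactions of the bread/butter example: trans-id -> set of items bought.
TRANSACTIONS = {
    1: {"Bread", "Butter", "Milk"},
    2: {"Butter", "Biscuits"},
    3: {"Bread", "Biscuits"},
    4: {"Bread", "Butter", "Milk"},
    5: {"Bread", "Butter", "Biscuits"},
    6: {"Bread", "Butter", "Biscuits", "Milk"},
}

def supp(itemset, transactions):
    """Equation (1): fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions.values() if itemset.issubset(t))
    return Fraction(hits, len(transactions))

def rule_supp(antecedent, consequent, transactions):
    """Equation (2): the support of A => C is the support of A ∪ C."""
    return supp(set(antecedent) | set(consequent), transactions)

def conf(antecedent, consequent, transactions):
    """Confidence of A => C: supp(A ∪ C) / supp(A). Assumes supp(A) > 0."""
    return rule_supp(antecedent, consequent, transactions) / supp(antecedent, transactions)

print(rule_supp({"Bread", "Butter"}, {"Milk"}, TRANSACTIONS))  # 1/2 (i.e. 3/6)
print(conf({"Bread", "Butter"}, {"Milk"}, TRANSACTIONS))       # 3/4
```

Exact fractions are used so that the values can be compared directly with the 3/6 and 3/4 quoted for the example.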
A new framework to measure accuracy and importance
Several authors have pointed out some drawbacks of the support/confidence framework for assessing association rules. A new approach: the concepts of certainty factor and very strong rule.
The certainty factor
Definition. The certainty factor of an association rule A ⇒ C is the value:
• if Conf(A ⇒ C) > supp(C), then
\[ CF(A \Rightarrow C) = \frac{\mathrm{Conf}(A \Rightarrow C) - \mathrm{supp}(C)}{1 - \mathrm{supp}(C)} \]
• if Conf(A ⇒ C) ≤ supp(C), then
\[ CF(A \Rightarrow C) = \frac{\mathrm{Conf}(A \Rightarrow C) - \mathrm{supp}(C)}{\mathrm{supp}(C)} \]
• if supp(C) = 1, then CF(A ⇒ C) = 1
• if supp(C) = 0, then CF(A ⇒ C) = −1.
The certainty factor takes values in [−1, 1].
Proposition. Conf(A ⇒ C) = 1 if and only if CF(A ⇒ C) = 1.
The certainty factor of an association rule achieves its maximum possible value, 1, if and only if the rule is totally accurate.
Strong Association Rules
Definition. An association rule is strong when its certainty factor and support are greater than the user-defined thresholds minCF and minsupp, respectively. An association rule A ⇒ C is very strong if both A ⇒ C and ¬C ⇒ ¬A are strong.
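As an illustration, the certainty factor and the strong/very strong tests can be sketched as follows for the rule Bread & Butter ⇒ Milk of the earlier example. Several things here are assumptions of the sketch rather than statements from the talk: the two boundary conventions are checked before the general cases, ¬A is read as "the transaction does not contain all of A", and the thresholds `MIN_CF` and `MIN_SUPP` are hypothetical values.

```python
from fractions import Fraction

def certainty_factor(conf, supp_c):
    """Certainty factor of A => C from Conf(A => C) and supp(C)."""
    if supp_c == 1:          # boundary convention, checked first (assumption)
        return 1
    if supp_c == 0:          # boundary convention, checked first (assumption)
        return -1
    if conf > supp_c:
        return (conf - supp_c) / (1 - supp_c)
    return (conf - supp_c) / supp_c

def is_strong(cf, supp, min_cf, min_supp):
    """A rule is strong when both measures exceed their thresholds."""
    return cf > min_cf and supp > min_supp

MIN_CF, MIN_SUPP = 0.4, 0.3  # hypothetical user thresholds

# A => C with A = {Bread, Butter}, C = {Milk}: Conf = 3/4, supp(C) = 3/6.
cf_rule = certainty_factor(Fraction(3, 4), Fraction(3, 6))

# not C => not A: transactions 2, 3, 5 lack Milk; of these, 2 and 3 also
# lack the pair, so Conf = 2/3, and supp(not A) = 2/6.
cf_contra = certainty_factor(Fraction(2, 3), Fraction(2, 6))

very_strong = (is_strong(cf_rule, Fraction(3, 6), MIN_CF, MIN_SUPP)
               and is_strong(cf_contra, Fraction(2, 6), MIN_CF, MIN_SUPP))
print(cf_rule, cf_contra, very_strong)  # 1/2 1/2 True
```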
Algorithms
Most algorithms obtain strong rules from large itemsets: itemsets with support greater than a user-defined threshold. The best-known algorithm, Apriori, is based on the Apriori property: "Every subset of a frequent itemset must be a frequent itemset too". This algorithm was designed to proceed iteratively, starting from frequent itemsets containing a single item. The original algorithms have been adapted, modified or improved in multiple ways: AIS, Apriori and AprioriTid, SETM, OCD, DHP, DIC, CARMA, TBAR, FP-Growth, etc.
Most of the existing algorithms work in two steps:
P.1. Find the itemsets with support greater than minsupp (the so-called frequent itemsets). This step is the most expensive from the computational point of view.
P.2. Obtain rules with accuracy greater than a user-defined threshold from the frequent itemsets obtained in step P.1. Specifically, if the itemsets A and A ∪ C are frequent, we can obtain the rule A ⇒ C. We must verify the accuracy of the rule in order to determine whether it is strong.
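The two steps can be sketched in a compact, unoptimized Apriori-style form. This is an illustrative sketch only, not any of the specific algorithms listed in the talk; the function names, the use of confidence as the accuracy measure, and the thresholds 0.4 and 0.7 are assumptions.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Butter", "Milk"},
    {"Butter", "Biscuits"},
    {"Bread", "Biscuits"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter", "Biscuits"},
    {"Bread", "Butter", "Biscuits", "Milk"},
]

def frequent_itemsets(transactions, minsupp):
    """Step P.1: level-wise search for itemsets with support > minsupp."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset.issubset(t)) / n

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        current = {s: support(s) for s in level}
        current = {s: v for s, v in current.items() if v > minsupp}
        frequent.update(current)
        # Candidates of size k + 1: unions of frequent k-itemsets, kept only
        # if all their k-subsets are frequent (Apriori property).
        k += 1
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent

def rules(frequent, minconf):
    """Step P.2: rules A => C with A ∪ C frequent and confidence > minconf."""
    out = []
    for itemset, supp_ac in frequent.items():
        for size in range(1, len(itemset)):
            for a in combinations(sorted(itemset), size):
                a = frozenset(a)
                confidence = supp_ac / frequent[a]  # A is frequent too
                if confidence > minconf:
                    out.append((a, itemset - a, supp_ac, confidence))
    return out

freq = frequent_itemsets(TRANSACTIONS, minsupp=0.4)
for a, c, s, cf in rules(freq, minconf=0.7):
    print(sorted(a), "=>", sorted(c), round(s, 3), round(cf, 3))
```

Step P.2 can look up `frequent[a]` directly because, by the Apriori property, every subset of a frequent itemset was already found in step P.1.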
Association Rules in Relational Databases
The development of models to find patterns in relational databases is a must. Data in relational databases are stored in tables, where each row is the description of an object and the columns are characteristics/attributes of the objects. For each object (row) t, t[X] stands for the value of attribute (column) X. Algorithms to mine association rules have been applied to relational databases by defining items as pairs ⟨attribute, value⟩ and transactions as tuples.
Let RE = {X1, ..., Xm} be a set of attributes. We denote by $I^{RE}$ the set of items associated to RE, i.e.
\[ I^{RE} = \{ \langle X_j, x \rangle \mid x \in Dom(X_j),\ j \in \{1, \dots, m\} \} \]
Every instance r of RE is associated to a T-set, denoted $T^r$, with items in $I^{RE}$. Each tuple t ∈ r is associated to a unique transaction $\tau^t \in T^r$:
\[ \tau^t = \{ \langle X_j, t[X_j] \rangle \mid j \in \{1, \dots, m\} \} \]
No pair of items in one itemset/transaction shares the same attribute (First Normal Form constraint).
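The mapping from tuples to transactions is simple enough to sketch directly. In this hypothetical example the relation, its attribute names and the helper `tuple_to_transaction` are illustrative assumptions; each row is a Python dict from attribute to value.

```python
def tuple_to_transaction(row):
    """Map a tuple t (dict: attribute -> value) to the items <X_j, t[X_j]>."""
    return frozenset(row.items())

# Hypothetical instance r of RE = {Year, Course, Lastname}.
r = [
    {"Year": 1991, "Course": 3, "Lastname": "Smith"},
    {"Year": 1991, "Course": 4, "Lastname": "Smith"},
]
T_r = [tuple_to_transaction(t) for t in r]

for transaction in T_r:
    print(sorted(transaction, key=str))
```

Because a dict holds one value per key, every transaction built this way contains exactly one item per attribute, which mirrors the First Normal Form constraint.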
The first works dealt with categorical attributes, but the problem of discovering association rules involving quantitative attributes in relational databases arose very early too: quantitative association rules. However, the support and the semantic content of quantitative association rules are quite poor when numerical values (the finest possible granularity) are used directly (a well-known problem in Machine Learning). Moreover, the mining task is very expensive.
A solution is to split (in some way) the domain of each quantitative attribute into intervals, and to consider this set of intervals as its new domain (i.e. to take a coarser granularity). Several approaches based on this idea have been proposed, performing the clustering either during the mining process or before it. Drawbacks:
• it may be difficult to find a meaning for the intervals that is natural to the user
• the importance and accuracy of rules can be very sensitive to (even small) variations of the boundaries.
To avoid these drawbacks, a soft alternative is to define a set of meaningful linguistic labels, represented by fuzzy sets on the domain of the quantitative attributes, and to use them as a new domain. Now the meaning of the values in the new domain is clear, and the rules are not sensitive to small changes of the boundaries because they are fuzzy. This approach has been adopted in many papers, with very good results, as it is a natural use of Fuzzy Sets.
Fuzzy Transactions and Fuzzy Association Rules
A historical overview
Data Mining in general, and (fuzzy) Association Rule Mining in particular, have received considerable attention from the beginning. Although they are young topics, the number of papers devoted to them (from both the theoretical and the practical point of view) is quite impressive. Since 1997, most papers in this field have been devoted to mining association rules involving quantitative attributes in relational databases. In the following we report the main research lines on Fuzzy Association Rules.
• Item: label/fuzzy subset
• One vs. several items per attribute
• Predetermined labels
• Labels obtained from a cluster-partition process
• "Horizontal" attributes
• "Hierarchical" attributes (taxonomy of attributes)
• Several approaches to measure support/confidence
• Several approaches to measure support/certainty factor (cardinality assessment questions)
• Other measurement schemes
• Relevance measures for items
• Relevance measures for rules
• Algorithms for mining fuzzy association rules
Fuzzy Transactions and Fuzzy Association Rules: Our Approach
We will present a summary of our research on this topic. We introduce some definitions and, after that, some ideas about the corresponding search algorithms.
Definition. A fuzzy transaction is a nonempty fuzzy subset $\tilde{\tau} \subseteq I$.
We denote by $\tilde{\tau}(i)$ the membership degree of every i ∈ I in $\tilde{\tau}$, and by $\tilde{\tau}(I_0)$ the degree of inclusion of an itemset I0 ⊆ I in $\tilde{\tau}$:
\[ \tilde{\tau}(I_0) = \min_{i \in I_0} \tilde{\tau}(i) \]
Any transaction is a special case of fuzzy transaction.
A set of fuzzy transactions may again be represented by a table. Columns and rows are labelled with identifiers of items and transactions respectively. The cell for item $i_k$ and transaction $\tilde{\tau}_j$ contains a value in [0, 1]: the membership degree of $i_k$ in $\tilde{\tau}_j$, $\tilde{\tau}_j(i_k)$.
Let I = {i1, i2, i3, i4} be a set of items.

      i1    i2    i3    i4
τ̃1   0     0.6   0.7   0.9
τ̃2   0     1     0     1
τ̃3   1     0.5   0.75  1
τ̃4   1     0     0.1   1
τ̃5   0.5   1     0     1
τ̃6   1     0     0.75  1

Some inclusion degrees are τ̃1({i3, i4}) = 0.7, τ̃1({i2, i3, i4}) = 0.6, τ̃4({i1, i4}) = 1.
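The inclusion degrees quoted for this table follow directly from the minimum-based definition. A minimal sketch, in which the dictionary layout `FT` and the function name `inclusion` are illustrative choices:

```python
# FT-set of the example: transaction id -> {item: membership degree}.
FT = {
    "t1": {"i1": 0,   "i2": 0.6, "i3": 0.7,  "i4": 0.9},
    "t2": {"i1": 0,   "i2": 1,   "i3": 0,    "i4": 1},
    "t3": {"i1": 1,   "i2": 0.5, "i3": 0.75, "i4": 1},
    "t4": {"i1": 1,   "i2": 0,   "i3": 0.1,  "i4": 1},
    "t5": {"i1": 0.5, "i2": 1,   "i3": 0,    "i4": 1},
    "t6": {"i1": 1,   "i2": 0,   "i3": 0.75, "i4": 1},
}

def inclusion(transaction, itemset):
    """Degree of inclusion of a crisp itemset in a fuzzy transaction (minimum)."""
    return min(transaction[i] for i in itemset)

print(inclusion(FT["t1"], {"i3", "i4"}))        # 0.7
print(inclusion(FT["t1"], {"i2", "i3", "i4"}))  # 0.6
print(inclusion(FT["t4"], {"i1", "i4"}))        # 1
```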
We call T-set a set of ordinary transactions, and FT-set a set of fuzzy transactions. Let us remark that a FT-set is a crisp set.
Definition. Let I be a set of items, T a FT-set, and A, C ⊆ I two crisp subsets, with A, C ≠ ∅ and A ∩ C = ∅. A fuzzy association rule A ⇒ C holds in T iff
\[ \tilde{\tau}(A) \le \tilde{\tau}(C) \quad \forall \tilde{\tau} \in T \]
This definition preserves the meaning of association rules: if A ⊆ τ̃ then C ⊆ τ̃ (as τ̃(A) ≤ τ̃(C)). Since a transaction is a special case of fuzzy transaction, an association rule is a special case of fuzzy association rule. Let us remark that the main characteristic feature of our approach is to model fuzzy transactions with crisp items. This is quite general, because when the items are actually fuzzy (labels, fuzzy numbers, etc.) they ultimately produce a table like the one above.
Support and confidence of fuzzy association rules
We employ a semantic approach based on the evaluation of quantified sentences. A quantified sentence is an expression of the form "Q of F are G", where F and G are fuzzy subsets of a finite set X, and Q is a relative fuzzy quantifier. Relative quantifiers are linguistic labels for fuzzy percentages that can be represented by means of fuzzy sets on [0, 1], such as "most", "almost all", or "many".
Example: "Many young people are tall", where
• Q = many,
• F and G are possibility distributions on X = people induced by "young" and "tall" respectively.
A special case of quantified sentence appears when F = X: "most of the terms in the profile are relevant".
Coherent quantifiers, i.e. those satisfying the following conditions, are especially relevant for us:
• Q(0) = 0 and Q(1) = 1
• If x < y then Q(x) ≤ Q(y) (monotonicity)
The evaluation of a quantified sentence yields a value in [0, 1] that assesses the degree to which the sentence is fulfilled.
Definition. Let I0 ⊆ I. The support of I0 in T is the evaluation of the quantified sentence
\[ Q \text{ of } T \text{ are } \tilde{\Gamma}_{I_0} \]
where $\tilde{\Gamma}_{I_0}$ is a fuzzy set on T defined as
\[ \tilde{\Gamma}_{I_0}(\tilde{\tau}) = \tilde{\tau}(I_0) \]
Definition. The support of A ⇒ C in the set of fuzzy transactions T is supp(A ∪ C), i.e., the evaluation of the quantified sentence
\[ Q \text{ of } T \text{ are } \tilde{\Gamma}_{A \cup C} \;=\; Q \text{ of } T \text{ are } (\tilde{\Gamma}_A \cap \tilde{\Gamma}_C) \]
Definition. The confidence of A ⇒ C in the set of fuzzy transactions T is the evaluation of the quantified sentence
\[ Q \text{ of } \tilde{\Gamma}_A \text{ are } \tilde{\Gamma}_C \]
These definitions establish families of support and confidence measures, depending on the evaluation method and the quantifier of our choice.
Many evaluation methods and quantifiers can be chosen, provided that the following four intuitive properties of the measures for ordinary association rules hold:
1. If $\tilde{\Gamma}_A \subseteq \tilde{\Gamma}_C$ then Conf(A ⇒ C) = 1.
2. If $\tilde{\Gamma}_A \cap \tilde{\Gamma}_C = \emptyset$ then Supp(A ⇒ C) = 0 and Conf(A ⇒ C) = 0.
3. If $\tilde{\Gamma}_A \subseteq \tilde{\Gamma}_{A'}$ (particularly when A' ⊆ A) then Conf(A' ⇒ C) ≤ Conf(A ⇒ C).
4. If $\tilde{\Gamma}_C \subseteq \tilde{\Gamma}_{C'}$ (particularly when C' ⊆ C) then Conf(A ⇒ C) ≤ Conf(A ⇒ C').
We evaluate the sentences by means of method GD (introduced by us in 2000). The evaluation of "Q of F are G" by means of GD is
\[ GD_Q(G/F) = \sum_{\alpha_i \in \Delta(G/F)} (\alpha_i - \alpha_{i+1})\, Q\!\left( \frac{|(G \cap F)_{\alpha_i}|}{|F_{\alpha_i}|} \right) \]
where $\Delta(G/F) = \Lambda(G \cap F) \cup \Lambda(F)$, $\Lambda(F)$ is the level set of F, and $\Delta(G/F) = \{\alpha_1, \dots, \alpha_p\}$ with $\alpha_i > \alpha_{i+1}$ for every $i \in \{1, \dots, p\}$, taking $\alpha_{p+1} = 0$.
F is assumed to be normalized. Otherwise, F is normalized and the same normalization factor is applied to G ∩ F.
The evaluation of a quantified sentence "Q of F are G" by means of method GD can be interpreted as:
• the evidence that the percentage of objects in F that are also in G (relative cardinality of G with respect to F) is Q
• a quantifier-guided aggregation of the relative cardinalities of G with respect to F for each α-cut of the same level of both sets.
• Supp(A ⇒ C) can be interpreted as the evidence that the percentage of transactions in $\tilde{\Gamma}_{A \cup C}$ is Q,
• Conf(A ⇒ C) can be seen as the evidence that the percentage of transactions in $\tilde{\Gamma}_A$ that are also in $\tilde{\Gamma}_C$ is Q.
The quantifier is a linguistic parameter that determines the final semantics of the measures.
The choice of the quantifier allows us to change the semantics of the values. Any coherent quantifier yields support and confidence measures that verify the four aforementioned properties. We have used the quantifier $Q_M(x) = x$, since it is coherent and reproduces the ordinary measures in the crisp case. From now on we shall consider support and confidence based on $Q_M$ and GD. The study of the support/confidence framework with other quantifiers is left for future research.
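A short derivation, sketched here from the GD formula alone (with the convention $\alpha_{p+1} = 0$), suggests why $Q_M$ reproduces the ordinary support. For the support of an itemset, F = T is crisp, so $|T_{\alpha}| = |T|$ at every level:

```latex
\[
\mathrm{supp}(I_0)
  = GD_{Q_M}\bigl(\tilde{\Gamma}_{I_0}/T\bigr)
  = \sum_{\alpha_i \in \Delta} (\alpha_i - \alpha_{i+1})\,
    \frac{|(\tilde{\Gamma}_{I_0})_{\alpha_i}|}{|T|}
  = \frac{1}{|T|} \sum_{\tilde{\tau} \in T} \tilde{\tau}(I_0)
\]
```

The last step holds because a transaction with inclusion degree m belongs to every α-cut with $\alpha_i \le m$, and the weights $(\alpha_i - \alpha_{i+1})$ of those levels telescope to m. When all degrees are 0 or 1, the right-hand side is exactly the crisp support of equation (1). For instance, with degrees 0, 0, 1, 1, 0.5, 1 for an item over six transactions, the support is 3.5/6 ≈ 0.583.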
Let I = {i1, i2, i3, i4} be a set of items.

      i1    i2    i3    i4
τ̃1   0     0.6   0.7   0.9
τ̃2   0     1     0     1
τ̃3   1     0.5   0.75  1
τ̃4   1     0     0.1   1
τ̃5   0.5   1     0     1
τ̃6   1     0     0.75  1
Itemset         Support
{i1}            0.583
{i4}            0.983
{i2, i3}        0.183
{i1, i3, i4}    0.266

Rule                 Support   Confidence
{i2} ⇒ {i3}          0.183     0.283
{i1, i3} ⇒ {i4}      0.266     1
{i1, i4} ⇒ {i3}      0.266     0.441

\[ \mathrm{Conf}(\{i_1, i_3\} \Rightarrow \{i_4\}) = GD_{Q_M}\bigl(\tilde{\Gamma}_{\{i_4\}} / \tilde{\Gamma}_{\{i_1, i_3\}}\bigr) = 1 \]
since $\tilde{\Gamma}_{\{i_1, i_3\}} \subseteq \tilde{\Gamma}_{\{i_4\}}$.
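The figures in these tables can be reproduced with a direct implementation of method GD. The sketch below is an illustration under stated assumptions, not the authors' code: fuzzy sets are plain lists of membership degrees, the default quantifier is $Q_M(x) = x$, the last level difference uses $\alpha_{p+1} = 0$, and the names `gd`, `gamma`, `supp` and `conf` are hypothetical.

```python
def gd(G, F, Q=lambda x: x):
    """Evaluate "Q of F are G" with method GD (lists of degrees over the same set)."""
    GF = [min(g, f) for g, f in zip(G, F)]
    height = max(F)
    if height == 0:
        raise ValueError("F must be a nonempty fuzzy set")
    # Normalize F and apply the same factor to G ∩ F.
    F = [f / height for f in F]
    GF = [x / height for x in GF]
    # Delta(G/F): union of both level sets, in decreasing order, without 0.
    levels = sorted({x for x in F + GF if x > 0}, reverse=True)
    total = 0.0
    for i, alpha in enumerate(levels):
        nxt = levels[i + 1] if i + 1 < len(levels) else 0.0
        card_f = sum(1 for f in F if f >= alpha)
        card_gf = sum(1 for x in GF if x >= alpha)
        total += (alpha - nxt) * Q(card_gf / card_f)
    return total

# FT-set of the example, one dict per fuzzy transaction.
FT = [
    {"i1": 0,   "i2": 0.6, "i3": 0.7,  "i4": 0.9},
    {"i1": 0,   "i2": 1,   "i3": 0,    "i4": 1},
    {"i1": 1,   "i2": 0.5, "i3": 0.75, "i4": 1},
    {"i1": 1,   "i2": 0,   "i3": 0.1,  "i4": 1},
    {"i1": 0.5, "i2": 1,   "i3": 0,    "i4": 1},
    {"i1": 1,   "i2": 0,   "i3": 0.75, "i4": 1},
]

def gamma(ft, itemset):
    """Fuzzy set on the transactions: degree of inclusion of `itemset` in each."""
    return [min(t[i] for i in itemset) for t in ft]

def supp(ft, itemset):
    return gd(gamma(ft, itemset), [1.0] * len(ft))

def conf(ft, antecedent, consequent):
    return gd(gamma(ft, consequent), gamma(ft, antecedent))

print(supp(FT, {"i1"}))                # ≈ 0.5833
print(conf(FT, {"i2"}, {"i3"}))        # ≈ 0.2833
print(conf(FT, {"i1", "i4"}, {"i3"}))  # ≈ 0.4417
```

The computed values agree with the tables when the latter are read as truncated to three decimals (for example 0.2667 appears as 0.266 and 0.4417 as 0.441).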
Certainty factors
Certainty factors and very strong rules were introduced in the setting of crisp Data Mining to avoid some drawbacks of the support and confidence measures. We now present the extension of these ideas to the fuzzy case.
Definition. The certainty factor of A ⇒ C is the value:
• if Conf(A ⇒ C) > supp(C), then
\[ CF(A \Rightarrow C) = \frac{\mathrm{Conf}(A \Rightarrow C) - \mathrm{supp}(C)}{1 - \mathrm{supp}(C)} \]
• if Conf(A ⇒ C) ≤ supp(C), then
\[ CF(A \Rightarrow C) = \frac{\mathrm{Conf}(A \Rightarrow C) - \mathrm{supp}(C)}{\mathrm{supp}(C)} \]
• if supp(C) = 1, then CF(A ⇒ C) = 1
• if supp(C) = 0, then CF(A ⇒ C) = −1.
This definition directly extends the crisp one. The only difference is how the confidence and support values involved are assessed.
Definition.
• A ⇒ C is strong when its certainty factor and support are greater than two user-defined thresholds, minCF and minsupp respectively.
• A ⇒ C is said to be very strong if both A ⇒ C and ¬C ⇒ ¬A are strong.
The itemsets ¬A and ¬C mean "absence of A" (resp. C):
\[ \tilde{\Gamma}_{\neg A}(\tilde{\tau}) = 1 - \tilde{\Gamma}_A(\tilde{\tau}) \quad \text{and} \quad \tilde{\Gamma}_{\neg C}(\tilde{\tau}) = 1 - \tilde{\Gamma}_C(\tilde{\tau}). \]
The rules A ⇒ C and ¬C ⇒ ¬A represent "the same knowledge". If both rules are strong, we can be more certain about the presence of that knowledge in a set of transactions.
Several experiments have shown that by using certainty factors and very strong rules we avoid reporting a large number of false, or at least doubtful, rules. In some experiments, the number of rules was reduced by a factor of 20 or even more. We propose to use certainty factors to measure the accuracy of a fuzzy association rule. In any case, let us point out that the assessment of the quality of (fuzzy) Association Rules is an open problem that is still receiving considerable attention.
Algorithms
Most of the existing association rule mining algorithms can be adapted in order to discover fuzzy association rules. Roughly speaking:
• In step P.1 of the general algorithm, the support and confidence are obtained from the corresponding quantified sentences.
• In step P.2, we obtain the certainty factor of the rules from the confidence and the support of the consequent.
Let us note that it is easy to decide whether a rule is strong, because its support and certainty factor are available. The different algorithms may be adapted to find strong rules.
Instances of Fuzzy Association Rules
Fuzzy "item" and "transaction" are abstract concepts that are usually associated to "an object" and "a subset of objects". Hence, it is possible to deal with "vague" patterns in different environments by particularizing them. We now briefly describe several instances of this "simple" idea.
Fuzzy Association Rules in Relational Databases
We have used the following model in several applications.
Let $Lab(X_j) = \{L_1^{X_j}, \dots, L_{c_j}^{X_j}\}$ be a set of linguistic labels for attribute $X_j$, where each label is a fuzzy set $L_k^{X_j} : Dom(X_j) \to [0, 1]$.
Let $L = \bigcup_{j \in \{1, \dots, m\}} Lab(X_j)$.
The set of "items" associated to RE is now:
\[ I_L^{RE} = \{ \langle X_j, L_k^{X_j} \rangle \mid X_j \in RE,\ k \in \{1, \dots, c_j\},\ j \in \{1, \dots, m\} \} \]
Every instance r of RE is associated to a FT-set, denoted $T_L^r$, with items in $I_L^{RE}$.
Each tuple t ∈ r is associated to a unique fuzzy transaction $\tilde{\tau}_L^t \in T_L^r$,
\[ \tilde{\tau}_L^t : I_L^{RE} \to [0, 1] \]
such that
\[ \tilde{\tau}_L^t\bigl(\langle X_j, L_k^{X_j} \rangle\bigr) = L_k^{X_j}(t[X_j]) \]
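To make the construction concrete, the sketch below builds the fuzzy transaction of one tuple from a set of linguistic labels. Everything specific in it is hypothetical: the attribute Age, the labels Young/Middle/Old, their trapezoidal shapes and breakpoints, and the helper names are illustrative assumptions, not labels used in the talk.

```python
def trapezoid(a, b, c, d):
    """Membership function: rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    def mu(x):
        if x < a or x > d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu

# Hypothetical linguistic labels Lab(Age).
LABELS = {
    "Age": {
        "Young": trapezoid(0, 0, 25, 40),
        "Middle": trapezoid(25, 40, 50, 65),
        "Old": trapezoid(50, 65, 120, 120),
    }
}

def fuzzy_transaction(row, labels):
    """Degree of each item <X_j, L_k> in the transaction of tuple `row`: L_k(t[X_j])."""
    return {(attr, name): mu(row[attr])
            for attr, attr_labels in labels.items()
            for name, mu in attr_labels.items()}

tau = fuzzy_transaction({"Age": 31}, LABELS)
for item, degree in tau.items():
    print(item, round(degree, 3))
```

With these hypothetical labels, a 31-year-old is Young to degree 0.6 and Middle to degree 0.4, so one attribute contributes several items with partial membership to the same fuzzy transaction.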
Recall
By using a suitable definition of items and transactions, fuzzy association rules can be employed to define and to mine other kinds of structures. Some of them are related to the concept of functional dependence.
Fuzzy and Approximate Functional Dependencies
Let RE be a set of attributes and r an instance of RE. A functional dependence X → Y, with X, Y ⊂ RE, holds in r if the value t[X] determines t[Y] for every tuple t ∈ r:
\[ \forall t, s \in r, \quad \text{if } t[X] = s[X] \text{ then } t[Y] = s[Y] \tag{3} \]
Finding perfect dependencies is difficult, mainly because exceptions usually exist. Two main groups of smoothed dependencies have been proposed:
• fuzzy functional dependencies: to introduce some fuzzy components (e.g. the equality can be replaced by a similarity relation)
• approximate dependencies: to establish functional dependencies with exceptions (i.e., with some uncertainty)
To represent approximate dependencies, transactions are associated to pairs of tuples and items to attributes. The item associated to the attribute X, I_X, is in the transaction τ_ts associated to the pair of tuples ⟨t, s⟩ when t[X] = s[X]. The set of transactions associated to an instance r of RE is denoted T_r and contains |r|² transactions. We define an approximate dependence in r to be an association rule in T_r. The support and certainty factor of an association rule I_X ⇒ I_W in T_r measure the importance and accuracy of the approximate dependence X → W.
RE = {ID, Year, Course, Lastname}

ID  Year  Course  Lastname
1   1991  3       Smith
2   1991  4       Smith
3   1991  4       Smith
The set T_r of transactions for r:

Pair     it_ID  it_Year  it_Course  it_Lastname
⟨1, 1⟩   1      1        1          1
⟨1, 2⟩   0      1        0          1
⟨1, 3⟩   0      1        0          1
⟨2, 1⟩   0      1        0          1
⟨2, 2⟩   1      1        1          1
⟨2, 3⟩   0      1        1          1
⟨3, 1⟩   0      1        0          1
⟨3, 2⟩   0      1        1          1
⟨3, 3⟩   1      1        1          1
Some association rules that hold in T_r, i.e. approximate dependencies that hold in r:

Ass. rule                  Conf.  Supp.  Appx. dep.
{it_ID} ⇒ {it_Y}           1      1/3    ID → Y
{it_Y} ⇒ {it_C}            5/9    5/9    Y → C
{it_Y, it_C} ⇒ {it_ID}     3/5    1/3    Y, C → ID
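The construction of T_r from pairs of tuples, and the confidence and support values of the example, can be checked with the following sketch. The function names are illustrative, and items are represented simply by attribute names.

```python
from fractions import Fraction
from itertools import product

r = [
    {"ID": 1, "Year": 1991, "Course": 3, "Lastname": "Smith"},
    {"ID": 2, "Year": 1991, "Course": 4, "Lastname": "Smith"},
    {"ID": 3, "Year": 1991, "Course": 4, "Lastname": "Smith"},
]

def pair_transactions(rows):
    """One transaction per ordered pair <t, s>: the attributes on which t and s agree."""
    return [frozenset(a for a in t if t[a] == s[a])
            for t, s in product(rows, repeat=2)]

def supp(itemset, transactions):
    hits = sum(1 for tr in transactions if set(itemset).issubset(tr))
    return Fraction(hits, len(transactions))

def conf(antecedent, consequent, transactions):
    return supp(set(antecedent) | set(consequent), transactions) / supp(antecedent, transactions)

T_r = pair_transactions(r)
print(len(T_r))                                   # 9, i.e. |r| squared
print(conf({"ID"}, {"Year"}, T_r))                # 1
print(conf({"Year"}, {"Course"}, T_r))            # 5/9
print(conf({"Year", "Course"}, {"ID"}, T_r))      # 3/5
```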
When there are quantitative attributes, the former developments may be extended by using a set of linguistic labels Lab(X_j) for each of them. This induces a fuzzy similarity relation S_{Lab(X_j)} in the domain of X_j, from which the fuzzy transactions are defined. Most of the existing definitions of "relaxed" functional dependencies can be characterized by replacing the equality and the universal quantifier in the definition by a similarity relation S and a fuzzy quantifier Q respectively. We have also addressed a more general problem, which we have called fuzzy quantified dependencies (i.e. fuzzy functional dependencies with exceptions).
Gradual Rules
Gradual rules are expressions of the form "The more X is $L_i^X$, the more Y is $L_j^Y$", e.g. "The more Age is Young, the more Height is Tall". The semantics of such rules is "the greater the membership degree of the value of X in $L_i^X$, the greater the membership degree of the value of Y in $L_j^Y$". There are several formal specifications of this idea.
One formal specification is
\[ \forall t \in r \quad L_i^X(t[X]) \le L_j^Y(t[Y]) \]
The items are pairs ⟨Attribute, Label⟩ and the transactions are associated to tuples. The item $\langle X, L_i^X \rangle$ is in the transaction $\tau_t$ associated to the tuple t when $L_i^X(t[X]) \le L_j^Y(t[Y])$, and the set of transactions, denoted $T^r_{G_1}(L)$, is a T-set.
Ordinary association rules in $T^r_{G_1}(L)$ are gradual rules in r.
A more general expression is
\[ \forall t \in r \quad L_i^X(t[X]) \rightarrow_{*} L_j^Y(t[Y]) \]
where $\rightarrow_{*}$ is a fuzzy implication. The inequality above is the particular case of the Rescher-Gaines implication.
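The inequality-based specification can be illustrated with a small check over a relation. This is a sketch under stated assumptions: the membership functions `young` and `tall`, the sample tuples and the function names are hypothetical, and the Rescher-Gaines implication is taken in its usual form (1 when the antecedent degree does not exceed the consequent degree, 0 otherwise).

```python
def rescher_gaines(a, b):
    """Rescher-Gaines implication: 1 if a <= b, else 0."""
    return 1 if a <= b else 0

def young(age):
    """Hypothetical label Young on Age: 1 up to 20, falling linearly to 0 at 40."""
    return max(0.0, min(1.0, (40 - age) / 20))

def tall(height_cm):
    """Hypothetical label Tall on Height: 0 up to 160 cm, rising linearly to 1 at 190 cm."""
    return max(0.0, min(1.0, (height_cm - 160) / 30))

def holds_gradual(rows, mu_x, x_attr, mu_y, y_attr):
    """True when L^X(t[X]) <= L^Y(t[Y]) for every tuple t of the relation."""
    return all(rescher_gaines(mu_x(t[x_attr]), mu_y(t[y_attr])) == 1 for t in rows)

r = [
    {"Age": 25, "Height": 185},
    {"Age": 35, "Height": 170},
    {"Age": 50, "Height": 165},
]
print(holds_gradual(r, young, "Age", tall, "Height"))  # True

# A young but short person is an exception to the rule.
print(holds_gradual(r + [{"Age": 20, "Height": 165}], young, "Age", tall, "Height"))  # False
```

Replacing `rescher_gaines` with another fuzzy implication would give the more general form, at the cost of yielding degrees rather than a yes/no answer.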
Another possible interpretation of the semantics of gradual rules is
\[ \forall t, s \in r, \quad \text{if } L_i^X(t[X]) \le L_i^X(s[X]) \text{ then } L_j^Y(t[Y]) \le L_j^Y(s[Y]) \]
Items are still pairs ⟨Attribute, Label⟩, but transactions are associated to pairs of tuples. This alternative can be extended with fuzzy implications, as was done above.
Fuzzy association rules in text mining
Searching the web is not always as successful as users expect. Most of the sets of documents retrieved in a web search meet the search criteria but do not satisfy the user's needs. This is generally due to a lack of specificity in the formulation of the queries: the user does not know the vocabulary of the topic, or the right query terms do not come to the user's mind at query time.
One possible solution to this problem is the process known as query expansion or query reformulation: after the query has been performed, terms are added to and/or removed from the query in order to improve the results:
• to discard uninteresting retrieved documents
• and/or to retrieve interesting documents that were not retrieved by the query.
Our proposal is to use mining technologies to build systems with query reformulation ability. The basic idea is to obtain association rules that suggest new terms that could be added to the query. In the following we summarize the main results of our approach to using fuzzy association rules for query optimization.
Text Items
We have considered term-level items. By using automatic indexing techniques from Information Retrieval, each document is represented by a set of terms, each with a weight that measures the presence of the term in the document.
Several weighting schemes for this purpose can be found in the literature. In our work we consider three of them:
• Boolean scheme: the weights are 0 or 1, i.e. absence or presence of the word in the document.
• Frequency scheme: the weight measures the relative frequency of the term in the document.
• TFIDF scheme: a high weight is associated to any term that occurs frequently in a document but infrequently in the collection.
Text Transactions
We define a text transaction τ_i ∈ T as the extended representation of document d_i. T_D = {d1, ..., dn} stands for the set of transactions associated to D. When a boolean weighting scheme is used (i.e. the weights in W take the values {0, 1}), the transactions can be called boolean or crisp. When we consider a normalized weighting scheme in the unit interval, we may speak of fuzzy transactions.
Document representation building process
Input: a set of documents D = {d1, ..., dn}.
Output: a representation for all documents in D.
1. Let D = {d1, ..., dn} be a collection of documents.
2. Extract an initial set of terms S from each document d_i ∈ D.
3. Remove stop words.
4. Apply stemming (via Porter's algorithm).
5. The representation obtained for d_i is a set of keywords S = {t1, ..., tm} with their associated weights {w_i1, ..., w_im}.
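The steps above can be sketched as follows for the boolean and frequency schemes. This is a simplified illustration: the stop-word list is a tiny hypothetical one, the stemming step is deliberately omitted (no Porter stemmer is implemented here), the TFIDF scheme is not covered, and the function name `represent` is an assumption.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}

def represent(doc, scheme="frequency"):
    """Return {term: weight} for one document.

    Steps: extract terms, remove stop words, weight. Stemming is omitted.
    """
    terms = [w for w in re.findall(r"[a-z]+", doc.lower()) if w not in STOP_WORDS]
    counts = Counter(terms)
    if scheme == "boolean":
        return {t: 1 for t in counts}
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

doc = "The mining of fuzzy rules and the mining of data"
print(represent(doc, "boolean"))
print(represent(doc, "frequency"))
```

With the frequency scheme the weights already lie in the unit interval, so each document representation can be read directly as a fuzzy transaction over the terms.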
Fuzzy Association Rules for Query Refinement
We start with a set of documents obtained from an initial query. From this initial retrieved set of documents, called the local set, association rules are found and additional terms are suggested to the user in order to refine the query.
Extraction process of fuzzy association rules from text transactions
Input: a set of transactions T_D = {d1, ..., dn}.
Output: a set of association rules.
1. Construct the itemsets from the set of transactions T.
2. Establish the thresholds minsupp and minconf.
3. Find all the itemsets with support above minsupp.
4. Generate the rules, discarding those with confidence below the threshold minconf.
The whole process is summarized as follows:
1. The user queries the system.
2. A first set of documents is retrieved.
3. From this set, the representation of the documents is extracted and association rules are generated.
4. Terms that appear in certain rules are shown to the user.
5. The user selects the terms most related to her/his needs.
6. The selected terms are added to the query, which is used to query the system again.
The Selection of Rules for Query Refinement
Once the (strong) association rules have been extracted, the selection of useful rules for query refinement depends on whether query terms appear in the antecedent and/or the consequent. Let us suppose that qterm is a term that appears in the query and let term ∈ S, S0 ⊆ S.
• Rules term ⇒ qterm. We suggest the term term to the user as a way to restrict the set of results.
• Rules S0 ⇒ qterm with S0 ⊆ S. We suggest the set of terms S0 to the user as a whole, i.e., to add S0 to the query.
• Rules qterm ⇒ term with term ∈ S. We suggest replacing qterm with term.
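One way to turn these three cases into code is sketched below. The rule representation (pairs of frozensets of terms), the function name `suggestions`, the exact matching conditions and the sample rules are all assumptions made for illustration; the talk does not prescribe an implementation.

```python
def suggestions(rules, query_terms):
    """Split rules into terms to add and candidate replacements for a query.

    rules: iterable of (antecedent, consequent) pairs of frozensets of terms.
    """
    query_terms = set(query_terms)
    to_add, to_replace = [], []
    for antecedent, consequent in rules:
        if consequent.issubset(query_terms) and not (antecedent & query_terms):
            # term => qterm or S0 => qterm: suggest adding the antecedent.
            to_add.append(antecedent)
        elif antecedent.issubset(query_terms) and not (consequent & query_terms):
            # qterm => term: suggest replacing qterm with term.
            to_replace.append((antecedent, consequent))
    return to_add, to_replace

# Hypothetical strong rules mined from the local set of documents.
RULES = [
    (frozenset({"dna"}), frozenset({"genome"})),
    (frozenset({"protein", "sequence"}), frozenset({"genome"})),
    (frozenset({"genome"}), frozenset({"genomics"})),
    (frozenset({"fuzzy"}), frozenset({"logic"})),
]

add, replace = suggestions(RULES, {"genome"})
print(add)      # antecedents suggested as additions to the query
print(replace)  # (qterm, term) pairs suggested as replacements
```

Rules that do not mention any query term, such as the last one here, are ignored by this sketch.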
The utility of the rules can be improved if a previous categorization of the documents is available and items stating that the document belongs to a given category are included in the document representation. Rules containing category labels can give us new information about the category itself. For instance, if a rule of the form term → category appears with enough accuracy, we can assert that documents where that term appears can be classified in that category. As future work, we will implement the application of the model to this query reformulation procedure and compare the results with other approaches to query refinement from Information Retrieval.
Further remarks. Future research
Mining association rules in fuzzy transactions is useful to find patterns when data are fuzzy in nature or when we must improve their semantics. The proposed models have been tested mainly for discovering fuzzy association rules in quantitative relational databases.
We have introduced a very general model for fuzzy transactions in which the items are considered as being crisp. We have shown that our model is easy to formulate and use. We have studied how our model can be employed in mining distinct types of patterns:
• ordinary fuzzy association rules,
• fuzzy and approximate functional dependencies,
• gradual rules.
The model will be used in multimedia data mining and web mining. We have paid special attention to the problem of text mining. Technical issues will be studied in the future as well.
Knowledge Discovery in Databases (KDD) is undoubtedly recognized as a key "technology" in business and industry. Fuzzy Sets and Fuzzy Logic are also considered necessary to represent the inherent non-random uncertainty that lies in most information and decision processes. However, papers about disclosing fuzzy patterns within this field are scarce. We have started a multidisciplinary research project to disclose and use Fuzzy Association Rules from financial data.
Mining (crisp) association rules has been shown to be of interest in the field of medical and clinical data. It is also of interest in proteomics and genomics, to detect biological patterns in the protein composition and the conditioning from a particular genome in the presence of some disease. We plan to study the use of our model to cope with these problems, because we consider that fuzzy tools will be very appropriate to model these associations, as vagueness is present everywhere in this field.