A DISSERTATION On
Fuzzy Clustering of Web Documents Using Equivalence Relations and Fuzzy Hierarchical Clustering Submitted in partial fulfillment of the requirements for the award of degree of
MASTER OF TECHNOLOGY In
COMPUTER ENGINEERING Submitted By SATENDRA KUMAR (Roll No. MCE-81-2K10) Under the Guidance of Mrs. Mamta Kathuria
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
YMCA UNIVERSITY OF SCIENCE & TECHNOLOGY
FARIDABAD – 121006
JUNE-2012
CANDIDATE'S DECLARATION
I hereby certify that the work which is being presented in this thesis titled “Fuzzy Clustering of Web Documents Using Equivalence Relations and Fuzzy Hierarchical Clustering”, in partial fulfillment of the requirements for the degree of Master of Technology in Computer Engineering, submitted to YMCA University of Science & Technology, Faridabad, is an authentic record of my own work carried out under the supervision of Mrs. Mamta Kathuria.
The work presented in this dissertation has not been submitted by me for the award of any other degree of this Institute or any other University.
SATENDRA KUMAR (MCE-81-2K10)
CERTIFICATE
This is to certify that the thesis titled “Fuzzy Clustering of Web Documents Using Equivalence Relations and Fuzzy Hierarchical Clustering”, submitted by Satendra Kumar to YMCA University of Science and Technology, Faridabad, for the award of the degree of Master of Technology in Computer Engineering, is an authentic record of bona fide work carried out by him under my supervision. In my opinion, the dissertation has reached the standard required for fulfilling the regulations of the degree. The work contained in this dissertation has not been submitted to any other University or Institute for the award of any other degree or diploma.

Mrs. Mamta Kathuria
(Assistant Professor)
Department of Computer Engineering,
YMCA University of Sc. & Tech.,
Faridabad
Dr. A. K. Sharma
(Professor & Chairman)
Department of Computer Engineering,
YMCA University of Sc. & Tech.,
Faridabad

The M.Tech viva voce examination of Satendra Kumar, Research Scholar, has been held on……
Signature of Head of Department
Signature of External Examiner
ACKNOWLEDGEMENT
It is with a deep sense of gratitude and reverence that I express my sincere thanks to my supervisor Mrs. Mamta Kathuria for her guidance, encouragement, help and useful suggestions throughout. Her untiring and painstaking efforts, methodical approach and individual help made it possible for me to complete this work in time. I consider myself very fortunate for having been associated with a scholar like her. Her affection, guidance and scientific approach served as a veritable incentive for the completion of this work.
I am thankful to the staff members of the Department of Computer Engineering at YMCA University of Science and Technology for their valuable suggestions. Although it is not possible to name everyone individually, I cannot forget my well-wishers, both at YMCA University of Science and Technology, Faridabad and outside, for their persistent support and cooperation.
This acknowledgement will remain incomplete if I fail to express my deep sense of obligation to my parents and God for their consistent blessings and encouragement.
Satendra Kumar
ABSTRACT
Conventional clustering algorithms have difficulty handling the challenges posed by collections of natural data, which are often vague and uncertain. Fuzzy clustering methods have the potential to manage such situations efficiently. Clustering can be considered the most important unsupervised learning problem; it deals with finding a structure in a collection of unlabeled data. A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters. However, such a crisp partition is insufficient to represent many real situations. A fuzzy clustering method is therefore offered to construct clusters with uncertain boundaries; it allows one object to belong to one or more clusters with some degree of membership. In other words, the essence of fuzzy clustering is to consider not only whether an object belongs to a cluster, but also to what degree it belongs. In this thesis, an algorithm is proposed for fuzzy clustering of web documents using equivalence relations and fuzzy hierarchical clustering.
TABLE OF CONTENTS

CANDIDATE'S DECLARATION                                              i
CERTIFICATE                                                         ii
ACKNOWLEDGEMENT                                                    iii
ABSTRACT                                                            iv
TABLE OF CONTENTS                                                    v
LIST OF FIGURES                                                    vii

Chapter                                                       Page No.

1. INTRODUCTION                                                      1
   1.1 Clustering                                                    1
   1.2 Fuzzy Clustering                                              1
   1.3 Web Mining                                                    2
       1.3.1 Web Mining Process                                      3
       1.3.2 Web Mining Taxonomy                                     3
             1.3.2.1 Web Content Mining                              4
             1.3.2.2 Web Structure Mining                            4
             1.3.2.3 Web Usage Mining                                5
   1.4 Fuzzy Logic                                                   5

2. SOFT COMPUTING                                                    6
   2.1 Introduction                                                  6
   2.2 Fuzzy Computing                                               7
       2.2.1 Fuzzy Sets                                              7
             2.2.1.1 Basic Properties of Fuzzy Sets                  9
             2.2.1.2 Basic Operations on Fuzzy Sets                 11
             2.2.1.3 Fuzzy Arithmetic                               12
       2.2.2 Fuzzy Relations                                        13
             2.2.2.1 Properties of Fuzzy Relations                  13
             2.2.2.2 Representation of Fuzzy Relations              14
             2.2.2.3 Operations on Fuzzy Relations                  14
       2.2.3 Fuzzy Logic                                            15
       2.2.4 Fuzzy Control                                          19

3. WEB MINING                                                       20
   3.1 Introduction                                                 20
   3.2 Categories of Web Mining                                     21
       3.2.1 Content and Structure Mining                           22
       3.2.2 Usage Mining                                           23
   3.3 Advantages of Web Mining                                     23
       3.3.1 Content and Structure Mining Advantages                24
       3.3.2 Usage Mining Advantages                                25
   3.4 Values Threatened by Web Mining                              26
       3.4.1 Privacy                                                26
       3.4.2 Individuality                                          27

4. RELATED WORK                                                     30

5. PROPOSED WORK                                                    33
   5.1 Proposed Algorithm                                           33

6. EXPERIMENTAL RESULTS                                             35
   6.1 Experimental Result for Euclidean Distance                   35
   6.2 Experimental Result for Hamming Distance                     39

7. CONCLUSION                                                       42

REFERENCES                                                          43

APPENDIX                                                            50
LIST OF FIGURES

FIGURE NO.   NAME OF FIGURE                                   PAGE NO.
1.1          Web mining process                                      3
1.2          Web mining taxonomy                                     4
2.1          Example of Crisp Sets                                   8
2.2          Fuzzy Set Representation                                9
2.3          Graphical Representation of the Analytical
             Representation                                         10
2.4          Example of fuzzy addition and subtraction              12
5.1          Flow chart of proposed algorithm                       34
6.1          Input files and keywords with doc ID                   35
6.2          Keywords with keyword ID                               36
6.3          Document clustering data                               36
6.4          Fuzzy relation matrix                                  37
6.5          Transitive closure matrix                              37
6.6          α-cuts of transitive closure matrix                    38
6.7          Dendrogram representation                              38
6.8          Fuzzy compatibility matrix                             39
6.9          Transitive closure matrix                              39
6.10         α-cuts of transitive closure matrix                    40
6.11         Dendrogram representation                              40
Chapter 1
INTRODUCTION

1.1 CLUSTERING
Clustering can be considered the most important unsupervised learning problem; it deals with finding a structure in a collection of unlabeled data. A loose definition of clustering could be the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters. Clustering algorithms can have different properties:
- Hierarchical or flat: hierarchical algorithms induce a hierarchy of clusters of decreasing generality; for flat algorithms, all clusters are at the same level.
- Iterative: the algorithm starts with an initial set of clusters and improves them by reassigning instances to clusters.
- Hard or soft: hard clustering assigns each instance to exactly one cluster; soft clustering assigns each instance a probability of belonging to each cluster.
- Disjunctive: instances can be part of more than one cluster.
Clustering involves two forms of clusters: overlapping and non-overlapping. A cluster is said to be overlapping when an object can also belong to another cluster, and non-overlapping when each object belongs to one and only one cluster.
1.2 FUZZY CLUSTERING
In fuzzy clustering, data elements can belong to more than one cluster, and associated with each element is a set of membership levels. These indicate the strength of the association between that data element and a particular cluster. Fuzzy clustering is the process of assigning these membership levels and then using them to assign data elements to one or more clusters. There are several fuzzy clustering algorithms, such as the fuzzy c-varieties (FCV) algorithm, the adaptive fuzzy clustering (AFC) algorithm, the fuzzy c-means (FCM) algorithm, the Gustafson-Kessel (GK) algorithm, and the Gath-Geva (GG) algorithm. Borgelt applied FCM to data with a high number of dimensions and special distribution characteristics. Fuzzy methods were selected there to find fuzzy clusters of ellipsoidal shape and differing size, since the data is too complex. According to Borgelt's studies, FCM succeeds in clustering documents and yields the best classification accuracy. In the information retrieval (IR) field, cluster analysis has been used to create groups of documents with the goal of improving the efficiency and effectiveness of retrieval, or to determine the structure of the literature of a field. The terms in a document collection can also be clustered to show their relationships. The two main types of cluster analysis methods are the non-hierarchical, which divide a data set of N items into M clusters, and the hierarchical, which produce a nested data set in which pairs of items or clusters are successively linked. The non-hierarchical methods, such as the single-pass and reallocation methods, are heuristic in nature and require less computation than the hierarchical methods. However, the hierarchical methods have usually been favored for cluster-based document retrieval. The commonly used hierarchical methods, such as single link, complete link, group average link, and Ward's method, have high space and time requirements.
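To make the fuzzy membership idea from this section concrete, the standard fuzzy c-means (FCM) updates mentioned above can be sketched as follows. This is a minimal, generic FCM illustration, not the implementation used in the studies cited here; the function name, random initialization, and stopping tolerance are our own assumptions.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    # X: (n_samples, n_features) data matrix; c: number of clusters;
    # m > 1 is the fuzzifier controlling how soft the memberships are.
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)              # each row sums to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]   # membership-weighted centers
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / d ** (2.0 / (m - 1.0))
        U_new /= U_new.sum(axis=1, keepdims=True)  # normalized memberships
        if np.abs(U_new - U).max() < tol:
            return U_new, V
        U = U_new
    return U, V

Each row of the returned matrix U gives one object's membership degrees across the c clusters, which is exactly the "degree of belonging" discussed above.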
1.3 WEB MINING
Over the last decade there has been tremendous growth of information on the World Wide Web (WWW). It has become a major source of information. The web creates new challenges for information retrieval, as the amount of information on the web and the number of users are growing rapidly. It is practically impossible for a user to search through this extremely large database for the information he or she needs. Hence the need for search engines arises. Search engines use crawlers to gather information and store it in a database maintained at the search engine side. For a given user's query, the search engine searches its local database and very quickly displays the results. The ability to form meaningful groups of objects is one of the most fundamental modes of intelligence. Humans perform this task with remarkable ease. Cluster analysis is a tool for exploring the structure of data. The core of cluster analysis is clustering: the process of grouping objects into clusters such that the objects from the same cluster are similar and objects from different clusters are dissimilar. The need to structure and learn from vigorously growing amounts of data has been a driving force for making clustering a highly active research area. Web mining is the use of data mining techniques to automatically discover and extract information from the web. Clustering is one of the possible techniques to improve the efficiency of the information finding process. It is a data mining tool used for grouping objects into clusters such that the objects from the same cluster are similar and objects from different clusters are dissimilar.
Web mining has fuzzy characteristics, so fuzzy clustering is sometimes better suited to web mining than conventional clustering. Fuzzy clustering is a relevant technique for information retrieval: since a document might be relevant to multiple queries, it should appear in each of the corresponding response sets; otherwise, the users would not be aware of it. Fuzzy clustering thus seems a natural technique for document categorization. There are two basic methods of fuzzy clustering: one, based on fuzzy c-partitions, is called the fuzzy c-means clustering method; the other, based on fuzzy equivalence relations, is called the fuzzy equivalence clustering method. The purpose of this research is to propose a search methodology for finding relevant information on the WWW. In this thesis, a method of document clustering is proposed which is based on fuzzy equivalence relations and helps information retrieval in terms of time and relevance of the retrieved information [8]. The World Wide Web is a major source of information, and it creates new challenges for information retrieval as the amount of information on the web increases exponentially. Web mining is the use of data mining techniques to automatically discover and extract information from web documents and services [6].
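As a preview of the equivalence-relation approach developed later in this thesis, the general scheme can be sketched as follows: build a fuzzy compatibility relation from pairwise document distances, close it under max-min transitivity, and read clusters off its α-cuts. This is a minimal sketch under assumed choices (Euclidean distance and a simple normalization), not the exact algorithm of Chapter 5.

import numpy as np

def compatibility_relation(X):
    # R[i, j] = 1 - d(i, j) / max distance: reflexive and symmetric.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return 1.0 - d / d.max()

def transitive_closure(R):
    # Repeat R <- max(R, R o R) with max-min composition until stable.
    while True:
        RR = np.max(np.minimum(R[:, :, None], R[None, :, :]), axis=1)
        R_next = np.maximum(R, RR)
        if np.array_equal(R_next, R):
            return R_next
        R = R_next

def alpha_cut_clusters(R, alpha):
    # Each alpha-cut of a fuzzy equivalence relation is a crisp
    # equivalence relation; its classes are the clusters at that level.
    crisp = R >= alpha
    labels, next_label = {}, 0
    for i in range(len(R)):
        if i not in labels:
            for j in np.flatnonzero(crisp[i]):
                labels[int(j)] = next_label
            next_label += 1
    return labels

Decreasing α merges clusters, yielding the nested (hierarchical) partitions that the dendrograms in Chapter 6 visualize.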
1.3.1 Web Mining Process
Web mining may be decomposed into the following subtasks:
- Resource discovery: the process of retrieving web resources.
- Information pre-processing: the process of transforming the results of resource discovery.
- Information extraction: automatically extracting specific information from newly discovered web resources.
- Generalization: uncovering general patterns at individual web sites and across multiple sites [3].
Fig.1.1 Web mining process

1.3.2 Web Mining Taxonomy
The web has different facets that yield different approaches to the mining process:
- Web pages consist of text.
- Web pages are linked via hyperlinks.
- User activity can be monitored via web server logs.
These three facets lead to the distinction of three categories: web content mining, web structure mining, and web usage mining [4][5][9][10]. Fig. 1.2 below shows the web mining taxonomy.
Fig.1.2 Web mining taxonomy
1.3.2.1 Web Content Mining (WCM) Web content mining is the process of extracting useful information from the contents of web documents. Content data is the collection of facts a web page is designed to contain. It may consist of text, images, audio, video, or structured records such as lists and tables. Application of text mining to web content has been the most widely researched.
1.3.2.2 Web Structure Mining (WSM) The structure of a typical web graph consists of web pages as nodes, and hyperlinks as edges connecting related pages. Web structure mining is the process of discovering structure information from the web.
1.3.2.3 Web Usage Mining (WUM) Web usage mining is the application of data mining techniques to discover interesting usage patterns from web usage data. Web usage data includes data from web server logs, browser logs, user profiles, registration data, cookies, etc. WCM and WSM use real or primary data on the web, whereas WUM mines secondary data derived from the users' interactions with the web.
1.4 FUZZY LOGIC
The modeling of imprecise and qualitative knowledge, as well as the handling of uncertainty at various stages, is possible through the use of fuzzy sets. Fuzzy logic is capable of supporting, to a reasonable extent, human-type reasoning in natural form by allowing partial membership for data items in fuzzy subsets [20]. Integration of fuzzy logic with data mining techniques has become one of the key constituents of soft computing in handling the challenges posed by massive collections of natural data [19]. Fuzzy logic is the logic of fuzzy sets. A fuzzy set has, potentially, an infinite range of truth values between one and zero [21]. Propositions in fuzzy logic have a degree of truth, and membership in fuzzy sets can be fully inclusive, fully exclusive, or some degree in between [23]. A fuzzy set is distinct from a crisp set in that it allows its elements to have a degree of membership. The core of a fuzzy set is its membership function: a function which defines the relationship between a value in the set's domain and its degree of membership in the fuzzy set, as in expression (2):

μ = f(s, x)    (2)

where μ is the fuzzy membership value for the element, s is the fuzzy set, and x is the value from the underlying domain. The relationship is functional because it returns a single degree of membership for any value in the domain [22]. Fuzzy sets provide a means of defining a series of overlapping concepts for a model variable, since they represent degrees of membership. A value from the complete universe of discourse for a variable can have membership in more than one fuzzy set.
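As a small illustration of such overlapping concepts, two triangular membership functions over a temperature domain can be defined as follows; the breakpoints are invented purely for illustration.

def triangular(a, b, c):
    # Membership function rising from a to a peak at b, falling to c.
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

warm = triangular(15, 22, 30)
hot = triangular(25, 35, 45)
print(warm(27), hot(27))   # 0.375 0.2: the value 27 belongs to both sets

A temperature of 27 degrees thus has a nonzero degree of membership in both "warm" and "hot", which a crisp partition of the domain could not express.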
Chapter 2 SOFT COMPUTING
2.1 INTRODUCTION
Soft Computing is a new multidisciplinary field proposed by Dr. Lotfi Zadeh, whose goal was to construct a new generation of Artificial Intelligence, known as Computational Intelligence. The idea of Soft Computing was initiated when Dr. Zadeh published his first paper on soft data analysis [24]. Since then, the concept of Soft Computing has evolved. Dr. Zadeh defined Soft Computing in its latest incarnation as the fusion of the fields of Fuzzy Logic, Neuro-computing, Evolutionary and Genetic Computing, and Probabilistic Computing into one multidisciplinary system. The main goal of Soft Computing is to develop intelligent machines and to solve nonlinear and mathematically unmodeled system problems [25-27]. The applications of Soft Computing have shown two main advantages. First, it made it possible to solve nonlinear problems in which mathematical models are not available. Second, it introduced human knowledge such as cognition, recognition, understanding, learning, and others into the fields of computing. This resulted in the possibility of constructing intelligent systems such as autonomous self-tuning systems and automated design systems.

Soft Computing is a new science, and the fields that comprise it are also rather new. However, a tendency toward the expansion of Soft Computing beyond what Dr. Zadeh initiated has been rapidly progressing. For example, Soft Computing has been given a broader definition in the literature to include Fuzzy Sets, Rough Sets, Neural Networks, Evolutionary Computing, Probabilistic and Evidential Reasoning, Multi-valued Logic, and related fields [28]. Other scientists [29] proposed the notion of Extended Soft Computing (ESC) as a new discipline developed by adding Chaos Computing and Immune Network Theory to the classical Soft Computing defined and proposed by Lotfi Zadeh. ESC was proposed for explaining complex systems and cognitive and reactive AI. Moreover, Fuzzy Logic, which is the basis on which Soft Computing is built, has been expanded into what is known today as Type-2 Fuzzy Logic [30]. Now on the rise is the new science of Bios and Biotic Systems. The author of this dissertation proposes, and expects, the inclusion of Bios Computing as one of the pillars of Soft Computing. The author also proposes the replacement of traditional probability computing techniques with Soft Probability to process soft systems' computations.

From the above presentation of the subject, it is obvious that Soft Computing is still growing and developing. Hence, a definite agreement on what comprises Soft Computing has not yet been reached. Different views of what it should include have been proposed, and more new sciences are still merging into Soft Computing. Therefore, only the four main components of Soft Computing as proposed by the founder, Dr. Zadeh, are considered in this introductory chapter. Furthermore, only Type-1 Fuzzy Logic, the original Fuzzy Logic, is presented.

2.2 FUZZY COMPUTING

2.2.1 Fuzzy Sets
Fuzzy Logic is built on the Fuzzy Set Theory, which was introduced to the world for the first time by Lotfi Zadeh [31]. The invention, or proposition, of Fuzzy Sets was motivated by the need to capture and represent the real world with its fuzzy data due to uncertainty. Uncertainty can be caused by imprecision in measurement due to the imprecision of tools or other factors. Uncertainty can also be caused by vagueness in language. We often use linguistic variables to describe, and maybe classify, physical objects and situations. Lotfi Zadeh realized that the Crisp Set Theory is not capable of representing those descriptions and classifications in many cases. In fact, Crisp Sets do not provide adequate representation for most cases. A very classical example is what is known as the Heap Paradox [32]. If we remove one element from a heap of grains or sand, we will still have a heap. However, if we keep removing single elements, one at a time, there will be a time when we do not have a heap anymore. At what point does the heap turn into a countable collection of grains that do not form a heap? There is no one correct answer to this question. This example represents a situation where vagueness and uncertainty are inevitable. Throughout history, until the end of the nineteenth century, uncertainty, whether due to vagueness or imprecision, was always considered undesirable in science and philosophy. Hence, the way to deal with uncertainty was to either ignore it, assume its non-existence, or try to eliminate it. However, obviously any investigation that involves a concept such as the heap will have to deal
with vagueness. Moreover, the ever existing imprecision due to the physical limitations of measurement tools would disqualify any investigation that assumes zero uncertainty. Uncertainty in the macroscopic world is always viewed as lack of knowledge. Nevertheless, both excess knowledge and uncertainty lead to increased complexity. Let us take for example the computational model of a car driver. Using a manual transmission rather than an automatic one requires more knowledge to drive, which increases the complexity involved. However, the uncertainty caused by not knowing the road or some bad weather also increases the complexity of the computational model of the driver. Uncertainty and complexity result in the failure of Crisp Sets to represent many concepts, notions, and situations. Crisp Sets can be ideal for certain applications. For example, we can use crisp sets to represent the classification of coins. We can list US coins and put a boundary on the set that encloses them so that other coins like French francs or English pounds are definitely out of the set and the US coins are definitely in the set, as illustrated in Fig.2.1.
Fig.2.1 Example of Crisp Sets
However, we cannot always define precise boundaries to describe sets in real life. In fact we often cannot do that. For instance, when we try to classify students in a school into tall students that qualify for a basketball team and short students who do not, if we consider students who are six feet
and four inches tall to be qualified, should we then exclude a student who is one-tenth of an inch less than the specified height? Should we even exclude a student who is a whole inch shorter? Instead of avoiding or ignoring uncertainty, Lotfi Zadeh developed a set theory that captures this uncertainty. The goal was to develop a set theory and a resulting logic system that are capable of coping with the real world. Therefore, rather than defining Crisp Sets, where elements are either in or out of the set with absolute certainty, Zadeh proposed the concept of a Membership Function. An element can be in the set with a degree of membership and out of the set with a degree of membership. Hence, Crisp Sets are a special case, or a subset, of Fuzzy Sets, where elements are allowed a membership degree of 100% or 0% but nothing in between. Figure (2.2) illustrates the use of Fuzzy Sets to represent the notion of a tall person. It also shows how we can differentiate between the notions of tall and very tall, resulting in a more accurate model than the classical set theory.
Fig.2.2 Fuzzy Set Representation

2.2.1.1 Basic Properties of Fuzzy Sets
Fuzzy Sets are characterized by Membership Functions. The membership function assigns to each element x in a fuzzy set a number, A(x), in the closed unit interval [0, 1]. The number A(x) represents the degree of membership of x in A. Hence, membership functions are functions of the form:

A: X → [0, 1]
In the case of Crisp Sets, the members of a set are either out of the set, with a membership degree of zero, or in the set, with the value one being the degree of membership. Therefore, Crisp Sets ⊆ Fuzzy Sets; in other words, Crisp Sets are special cases of Fuzzy Sets. There are four ways of representing fuzzy membership functions, namely, graphical representation, tabular and list representation, geometric representation, and analytic representation. Graphical representation is the most common in the literature. Fig. 2.2 above is an example of the graphical representation of fuzzy membership functions. Tabular and list representations are used for finite sets. In this type of representation, each element of the set is paired with its degree of membership. Two different notations have been used in the literature for tabular and list representation. The following example illustrates the two notations for the same membership function:

A = {<x1, 0.8>, <x2, 0.3>, <x3, 0.5>, <x4, 0.9>}
A = 0.8/x1 + 0.3/x2 + 0.5/x3 + 0.9/x4

The third method of representation is the geometric representation, which is also used for representing finite sets. For a set that contains n elements, an n-dimensional Euclidean space is formed and each element may be represented as a coordinate in that space. Finally, analytic representation is an alternative to graphical representation for representing infinite sets, e.g., a set of real numbers. The following example illustrates both the graphical and analytic representation of the same fuzzy function:

Fig.2.3 Graphical Representation of the Analytical Representation
The above example also illustrates the important notion of a fuzzy number. A Fuzzy Number is a fuzzy set represented by a membership function of the form

A: ℝ → [0, 1]

with the additional restriction that the membership function must capture an intuitive conception of a set of real numbers surrounding a given central real number, or interval of real numbers. In this context, the example above illustrates the concept of the fuzzy number “about six”, “around six”, or “approximately six”. Another very important property of fuzzy sets is the concept of the α-cut (alpha cut). α-cuts reduce a fuzzy set to an extracted crisp set. The value α represents a membership degree, i.e., α ∈ [0, 1]. The α-cut of a fuzzy set A is the crisp set αA, i.e., the set of all elements whose membership degrees in A are ≥ α [33].

2.2.1.2 Basic Operations on Fuzzy Sets
The basic operations on fuzzy sets are Fuzzy Complement, Fuzzy Union, and Fuzzy Intersection. These operations are defined as follows:

Fuzzy Complement: Ā(x) = 1 - A(x)

Fuzzy Union: (A∪B)(x) = max[A(x), B(x)] for all x ∈ X
Example: A(x) = 0.6 and B(x) = 0.4 ∴ (A∪B)(x) = max[0.6, 0.4] = 0.6
Notice that A∪Ā ≠ X, which violates the Law of Excluded Middle.

Fuzzy Intersection: (A∩B)(x) = min[A(x), B(x)] for all x ∈ X
Example: A(x) = 0.6 and B(x) = 0.4 ∴ (A∩B)(x) = min[0.6, 0.4] = 0.4
Also notice that A∩Ā ≠ ∅, which violates the Law of Contradiction. The above operations are the standard operations on Fuzzy Sets. Different definitions have been developed for the fuzzy union, intersection, and complement, but a detailed listing or study of the different definitions is beyond the scope of this dissertation.

2.2.1.3 Fuzzy Arithmetic
Fuzzy Arithmetic uses arithmetic on closed intervals. The basic fuzzy arithmetic operations are defined as follows:

Addition: [a, b] + [c, d] = [a + c, b + d]
Subtraction: [a, b] - [c, d] = [a - d, b - c]
Multiplication: [a, b] · [c, d] = [min(ac, ad, bc, bd), max(ac, ad, bc, bd)]
Division: [a, b] / [c, d] = [a, b] · [1/d, 1/c] = [min(a/c, a/d, b/c, b/d), max(a/c, a/d, b/c, b/d)]

Fig.2.4 illustrates graphical representations of fuzzy addition and subtraction.
Fig.2.4 Example of fuzzy addition and subtraction
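The standard operations, α-cuts, and interval arithmetic above translate directly into code. The following sketch uses a small invented fuzzy set; the element names and membership degrees are illustrative only.

A = {'x1': 0.8, 'x2': 0.3, 'x3': 0.5, 'x4': 0.9}
B = {'x1': 0.4, 'x2': 0.7, 'x3': 0.5, 'x4': 0.1}

complement_A = {x: 1 - A[x] for x in A}          # standard fuzzy complement
union = {x: max(A[x], B[x]) for x in A}          # standard fuzzy union
intersection = {x: min(A[x], B[x]) for x in A}   # standard fuzzy intersection
alpha_cut = {x for x in A if A[x] >= 0.5}        # crisp set {'x1', 'x3', 'x4'}

def interval_add(p, q):
    (a, b), (c, d) = p, q
    return (a + c, b + d)

def interval_sub(p, q):
    (a, b), (c, d) = p, q
    return (a - d, b - c)

def interval_mul(p, q):
    (a, b), (c, d) = p, q
    products = (a * c, a * d, b * c, b * d)
    return (min(products), max(products))

print(interval_add((1, 3), (2, 4)))   # (3, 7)
print(interval_sub((1, 3), (2, 4)))   # (-3, 1)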
2.2.2 Fuzzy Relations
2.2.2.1 Properties of Fuzzy Relations
Fuzzy Relations were introduced to supersede classical crisp relations. Rather than just describing the full presence or full absence of association of elements of various sets, as in the case of crisp relations, Fuzzy Relations describe the degree of such association. This gives fuzzy relations the capability to capture the uncertainty and vagueness in relations between sets and elements of a set. Furthermore, it enables fuzzy relations to capture the broader concepts expressed in fuzzy linguistic terms when describing the relation between two or more sets. For example, when classical sets are used to describe the equality relation, they can only describe the concept “x is equal to y” with absolute certainty, i.e., if x is equal to y with unlimited precision, then x is related to y; otherwise x is not related to y, even if it is only slightly different. Thus, it is not possible to describe the concept “x is approximately equal to y”. Fuzzy Relations make the description of such a concept possible. Table 2.1 provides a comparison of the special properties of Crisp and Fuzzy relations (Ex is the Equality Relation and Ox is the Empty Relation) [34].
Table.2.1 Properties of Crisp vs. Fuzzy Relations
It is important to note here that the concept of local reflexivity was introduced for the first time in Crisp Relational Theory by Bandler and Kohout. The fast fuzzy relational algorithms that employ local reflexivity in fuzzy computing were introduced also by Bandler and Kohout [35][36].
2.2.2.2 Representation of Fuzzy Relations
The most common methods of representing fuzzy relations are lists of n-tuples, formulas, matrices, mappings, and directed graphs. A list of n-tuples, i.e., ordered pairs, can be used to represent finite fuzzy relations. The tuple consists of a Cartesian product with its membership degree. When the membership degree is zero, the tuple is usually omitted. Suitable formulas are usually used to define infinite fuzzy relations, which involve n-dimensional Euclidean space, with n ≥ 2. Matrices, or n-dimensional arrays, are the most common method to represent fuzzy relations. In this method, the entries of the matrix are the membership degrees associated with the n-tuples of the Cartesian product. The mapping of fuzzy relations is an extension of the mapping method of classical binary relations. For fuzzy relations, the connections of the mapping diagram are labeled with the membership degree. The same technique is used to extend the directed graph representation of classical relations to represent fuzzy relations.

2.2.2.3 Operations on Fuzzy Relations
All the mathematical and logical operations on fuzzy sets explained above are also applicable to fuzzy relations. In addition, there are operations on fuzzy binary relations that do not apply to general fuzzy sets. Those operations are the inverse, the composition, and the BK-products of fuzzy relations. The inverse of a fuzzy binary relation R on two sets X and Y is also a relation, denoted by R⁻¹, such that xR⁻¹y = yRx. Therefore, for any fuzzy binary relation, (R⁻¹)⁻¹ = R. When using matrix representation, the inverse can be obtained by generating the transpose of the original matrix, i.e., swapping the columns and the rows of the matrix, as in the following example.
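The matrix example itself could not be recovered from the source, but the operation is ordinary matrix transposition, as this sketch with an invented relation shows.

import numpy as np

# A fuzzy relation R from X = {x1, x2} to Y = {y1, y2, y3}; the
# membership degrees below are made up for illustration.
R = np.array([[0.2, 0.7, 1.0],
              [0.9, 0.0, 0.4]])

R_inv = R.T                          # inverse relation: R_inv[j, i] = R[i, j]
assert np.array_equal(R_inv.T, R)    # (R^-1)^-1 = R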
The composition of two fuzzy relations is defined as follows:
Let P be a fuzzy relation from X to Y and Q be a fuzzy relation from Y to Z such that the membership degrees are defined by P(x, y) and Q(y, z). Then, a third fuzzy relation R from X to Z can be produced by the composition of P and Q, which is denoted as P∘Q. Fuzzy relation R is computed by the formula [37]:

R(x, z) = (P∘Q)(x, z) = max over y ∈ Y of {min[P(x, y), Q(y, z)]}

The idea of producing fuzzy relational compositions was expanded by Bandler and Kohout when they introduced, for the first time, special relational compositions called the Triangle and Square products [38-40]. The Triangle and Square products were named after their inventors and became known as BK-products. Since Bandler and Kohout introduced three new types of products, namely, the Triangle sub-product, the Triangle super-product, and the Square product, a name was needed for the original composition. Therefore, it was called the Circle product. The four different types of fuzzy relational products are defined as follows [41]:

Circle product: x(R∘S)z ⇔ xR intersects Sz
Triangle sub-product: x(R◁S)z ⇔ xR ⊆ Sz
Triangle super-product: x(R▷S)z ⇔ xR ⊇ Sz
Square product: x(R□S)z ⇔ xR ≅ Sz
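On matrix representations, the Circle (max-min) composition can be computed as in the following sketch; the two relations are invented for illustration.

import numpy as np

def max_min_composition(P, Q):
    # R(x, z) = max over y of min(P(x, y), Q(y, z)).
    return np.max(np.minimum(P[:, :, None], Q[None, :, :]), axis=1)

P = np.array([[0.3, 0.8],
              [1.0, 0.5]])
Q = np.array([[0.6, 0.2],
              [0.9, 0.7]])
print(max_min_composition(P, Q))   # [[0.8 0.7]
                                   #  [0.6 0.5]]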
2.2.3 Fuzzy Logic
According to the Fuzzy Set Theory, a statement about the status of an element in a set may not be true or false with unlimited certainty. For example, the proposition “x is in X” may be 80% true and 20% false. Consequently, any reasoning based on such a proposition would have to be approximate, rather than exact. Such a reasoning system is the goal and the basis of Fuzzy Logic. The attempts of philosophers and logicians to go beyond the classical two-valued logic, where propositions are either definitely true or definitely false, have been motivated by the need to represent and conduct reasoning on reality. Those attempts to expand the two-valued logic into a more realistic and flexible logic, where propositions may be partly true and partly false, started very early in history. Aristotle presented the problem of having propositions that may not be true or false in his work On Interpretation [37]. He argued that propositions about future events would not have any truth-values. The way to resolve their truth-values is to wait until the future becomes present. However, the indeterminate truth-values of many propositions may not be easily resolved.
With the advent of the twentieth century, we realized that propositions about future events are not the only propositions with problematic truth-values. In fact, the truth-values of many propositions may be inherently indeterminate due to uncertainty. This uncertainty may be due to measurement limitations, such as the one that resulted in the well-known Heisenberg Uncertainty Principle. It can also be caused by the intrinsic vagueness of the linguistic hedges of natural languages when used in logical reasoning.
Multi-valued logics have been invented to enable capturing the uncertainty in truth-values. Logicians such as Lukasiewicz, Bochvar, and Kleene devised three-valued logics, which relaxed the restrictions on the truth and falsity of propositions. In a three-valued logic, a proposition may have a truth-value of one half, in addition to the classically known zero and one. This resulted in the expansion of the concept of a tautology into the concept of a quasi-tautology, and of a contradiction into a quasi-contradiction. This was necessary since, with the truth-value of one half, we cannot have a truth-value of one or zero on all the rows of a truth table in many three-valued logic systems. The three-valued logic concept was generalized in the first half of the twentieth century into the n-valued logic concept, where n can be any number. Hence, in an n-valued logic, the degree of truth of a proposition can be any one of n possible numbers in the interval [0, 1]. Lukasiewicz was the first to propose a series of n-valued logics for which n ≥ 2. In the literature, the n-valued logic of Lukasiewicz is usually denoted as Ln, where 2 ≤ n ≤ ∞. Hence, L∞ is the infinite-valued logic, which is obviously isomorphic to Fuzzy Set Theory, just as L2 is the two-valued logic, which is isomorphic to Crisp Set Theory [37]. In a sense, Fuzzy Logic can be considered a generalization of a logic system that includes the class of all logic systems with truth-values in the interval [0, 1]. “In a broader sense, fuzzy logic is viewed as a system of concepts, principles, and methods for dealing with modes of reasoning that are approximate rather than exact” [32]. The inference rules of classical logic are certainly not suitable for approximate reasoning. However, those inference rules can be generalized to produce an inferential mechanism that is adequate for approximate reasoning and is based on Fuzzy Logic. The generalized modus ponens, introduced by Lotfi Zadeh, is the basic inference mechanism in fuzzy reasoning [42].
The classical modus ponens can be expressed as the conditional tautology:

[(p ⇒ q) ∧ p] ⇒ q

Alternatively, modus ponens can be represented by the schema:

Rule: p ⇒ q
Fact: p
___________________
Conclusion: q

The generalized modus ponens is formally represented as follows. Let χ and γ be variables taking values from sets X and Y, let A and A′ be fuzzy sets on X, and let B and B′ be fuzzy sets on Y.

Rule: χ is A ⇒ γ is B
Fact: χ is A′
___________________
Conclusion: γ is B′

If A′ is given, then B′ can be computed by the following equation:

B′(y) = sup over x ∈ X of {min[A′(x), R(x, y)]}

where R is a fuzzy relation on X × Y, and sup is the supremum (the least upper bound). The generalized modus ponens provided the beginning of the development of fuzzy inference rules. It was followed by the formulation of the generalized modus tollens and the generalized hypothetical syllogism, which together with the generalized modus ponens served as the basis for fuzzy-logic-based approximate reasoning [37]. The generalized modus ponens is the basis for interpreting fuzzy rule sets. The most common definition of a fuzzy rule base R was proposed by Mamdani and Assilian and is represented by the formula:

R = r1 ∨ r2 ∨ … ∨ rm

According to this definition, R is composed of a disjunction of a number m of fuzzy rules. Each rule ri associates a fuzzy state vector Xi = (Xi1, …, Xin) with the corresponding fuzzy action Yi [43].
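On finite domains the supremum becomes a maximum, so the generalized modus ponens can be sketched directly; the fact A′ and the relation R below are invented for illustration.

import numpy as np

def generalized_modus_ponens(A_prime, R):
    # B'(y) = max over x of min(A'(x), R(x, y)).
    return np.max(np.minimum(A_prime[:, None], R), axis=0)

A_prime = np.array([0.2, 1.0, 0.6])   # fact: chi is A'
R = np.array([[0.5, 0.9],             # rule encoded as a relation on X x Y
              [0.8, 0.3],
              [1.0, 0.7]])
print(generalized_modus_ponens(A_prime, R))   # conclusion B' = [0.8 0.6]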
Using the generalized modus ponens, an inference engine of a Fuzzy Controller can be built. The output of the resulting fuzzy system can be described by the formula:

B′(y) = max over i of {min[λi, Yi(y)]}

where λi is the degree of applicability of rule ri, which is determined by the matching of an input vector X′ = (X′1, …, X′n) with the n-dimensional state vector Xi = (Xi1, …, Xin). The degree of applicability, λi, is determined from the resulting degree of matching. Therefore, λi can be computed by the formula:

λi = min over j of {Poss[X′j, Xij]}

In this formula, Poss[X′j, Xij] is the possibility measure representing the matching between the reference state variable and the input, and can be computed as follows [42]:

Poss[X′, X] = sup over x of {min[X′(x), X(x)]}
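A sketch of this inference scheme on finite, sampled domains follows. It assumes min for clipping each rule's action and max for the disjunction of rules, matching the formulas above; the data structures are our own.

import numpy as np

def possibility(x_input, x_ref):
    # Poss[X', X] = max over x of min(X'(x), X(x)) on a finite domain.
    return np.max(np.minimum(x_input, x_ref))

def rule_output(inputs, rule_states, rule_action):
    # lambda_i = min over j of Poss[input_j, state_ij]; the rule's
    # action membership is then clipped at lambda_i.
    lam = min(possibility(x, s) for x, s in zip(inputs, rule_states))
    return np.minimum(lam, rule_action)

def system_output(inputs, rules):
    # Disjunction (pointwise max) of all clipped rule actions.
    # rules: list of (rule_states, rule_action) pairs, each a list of
    # membership vectors over the sampled variable domains.
    return np.maximum.reduce([rule_output(inputs, s, a) for s, a in rules])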
When utilizing the above inference mechanism for Fuzzy Control, an actuator is expected to be triggered eventually to perform some function. The action to be taken should be based on a single scalar value. Therefore, a defuzzification mechanism is needed to convert the fuzzy membership distribution into the required scalar value. A variety of defuzzification techniques exist. The selection of a defuzzification technique is application dependent and involves some tradeoff between elements of computational cost such as storage, performance, and time. The BK-products of fuzzy relations proved to be very powerful not only as a mathematical tool for operations on fuzzy sets and fuzzy relations but also as a computational framework for fuzzy logic and fuzzy control. In addition to the set-based definitions presented above, many-valued logic operations are also implied and are defined as follows:

Circle product: (R∘S)ik = ∨j (Rij ∧ Sjk)
Triangle sub-product: (R◁S)ik = ∧j (Rij → Sjk)
Triangle super-product: (R▷S)ik = ∧j (Rij ← Sjk)
Square product: (R□S)ik = ∧j (Rij ≡ Sjk)

where Rij and Sjk represent the fuzzy degrees of truth of the propositions xiRyj and yjSzk, respectively [41]. BK-products have been applied, as a powerful computational tool, in many fields such as computer protection, AI [44], medicine, information retrieval, handwriting classification, urban studies, investment, and control [45][41], and most recently in quality of service and distributed networking [46].
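A matrix sketch of these many-valued products follows. The definitions above do not fix the implication operator; the Gödel implication is assumed here as one common choice, so the numerical results depend on that assumption.

import numpy as np

def godel_imp(a, b):
    # Gödel implication: a -> b is 1 where a <= b, else b.
    return np.where(a <= b, 1.0, b)

def sub_product(R, S):     # (R <| S)_ik = min over j of (R_ij -> S_jk)
    return np.min(godel_imp(R[:, :, None], S[None, :, :]), axis=1)

def super_product(R, S):   # (R |> S)_ik = min over j of (R_ij <- S_jk)
    return np.min(godel_imp(S[None, :, :], R[:, :, None]), axis=1)

def square_product(R, S):  # (R [] S)_ik = min over j of (R_ij == S_jk)
    fwd = godel_imp(R[:, :, None], S[None, :, :])
    bwd = godel_imp(S[None, :, :], R[:, :, None])
    return np.min(np.minimum(fwd, bwd), axis=1)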
2.2.4 Fuzzy Control
Fuzzy Control is considered to be the most successful area of application of Fuzzy Set Theory and Fuzzy Logic. Fuzzy controllers revolutionized the field of control engineering by their ability to perform process control through the utilization of human knowledge, thus enabling solutions to control problems for which mathematical models may not exist, or may be too difficult or computationally too expensive to construct. A typical Fuzzy controller consists of four modules: the rule base, the inference engine, fuzzification, and defuzzification. A typical Fuzzy Control algorithm proceeds as follows:
1. Obtaining information: collect measurements of all relevant variables.
2. Fuzzification: convert the obtained measurements into appropriate fuzzy sets to capture the uncertainties in the measurements.
3. Running the inference engine: use the fuzzified measurements to evaluate the control rules in the rule base and select the set of possible actions.
4. Defuzzification: convert the set of possible actions into a single numerical value.
5. The loop: go to step one.
Several defuzzification techniques have been devised. The most common defuzzification methods are the center of gravity, the center of maxima, the mean of maxima, and the root sum square.
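As an illustration of the defuzzification step, the center-of-gravity method can be sketched on a sampled output domain; the domain and the aggregated membership distribution below are invented.

import numpy as np

def center_of_gravity(y, mu):
    # Membership-weighted mean of the sampled output domain.
    return np.sum(y * mu) / np.sum(mu)

y = np.linspace(0, 10, 101)                 # sampled actuator domain
mu = np.maximum(0, 1 - np.abs(y - 6) / 3)   # aggregated fuzzy output
print(center_of_gravity(y, mu))             # crisp command: 6.0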
Chapter 3 WEB MINING
3.1 INTRODUCTION
The World Wide Web can be seen as one of the largest databases in the world. This huge, and ever-growing, amount of data is a fertile area for data mining research. In the introduction of this book, Meij describes data mining as the process of extracting previously unknown information from (usually large quantities of) data, which can, in the right context, lead to knowledge. When data mining techniques are applied to web data, we speak of web mining. In their contribution to this book, Kosala, Blockeel and Neven explain that in the web mining context the term 'mining' is often used in a more general sense than just referring to data mining techniques in the classical sense. Therefore, they define web mining as the whole of data mining and related techniques that are used to automatically discover and extract information from web documents and services. Note that, again, in the right context this can lead to knowledge. By looking at web mining from an ethical perspective, we shall discover a field of tension between advantages on the one hand and disadvantages on the other. As ethics is the branch of philosophy concerned with the nature of morals and moral evaluation [47], an ethical perspective will raise questions like: what is right or wrong, what is beneficial or harmful? Ethical research focuses on three types of problems. First, there are situations in which normative principles are clearly disregarded. Then there are ethical problems concerning new issues (types of problems that do not match existing cases), where it is a question of how traditional principles can be applied. The third type of ethical situation deals with the category of normative conflicts. A normative conflict appears whenever there are both good and bad sides to a matter. The issue of web mining is a normative conflict where good refers to the benefits of web mining and bad refers to its possible harmful implications, in other words the ethical values that are threatened. Values are core beliefs or desires that guide or motivate attitudes and actions, and determine how people behave in certain situations. As ethics is a reflection on morality, ethical values can be described as that which a subject affirms as moral in human behavior [48]. Thus ethical values have a normative function and are the motive for moral human behavior. A value can be seen as a global goal. Such a goal needs to be driven by a means, presented by more specific norms. For instance, the value of privacy is driven
by norms like respecting someone's private life and not misusing someone's personal data. Norms would be meaningless without values. Knowledge discovered by mining the web could pose a threat to people when, for instance, personal data is misused. However, it is this same knowledge factor that can imply many different advantages, as it is of high value to all sorts of applications concerning planning and control. Kosala, Blockeel and Neven have already described some specific benefits of web mining, like improving the intelligence of search engines. Web mining can also contribute to marketing intelligence by analyzing the web user's on-line behavior and turning this information into marketing knowledge. It is this normative conflict, an area of tension between the advantages on the one hand and the disadvantages (threatened values) on the other, that is the subject of this study. It should be noted that ethical issues could arise from mining web data that does not involve personal data at all, such as data on different kinds of animals, or technical data on cars. However, this section is limited to web mining that does in some way involve personal data. We shall only look at the possible ethical harm that can be done to people, which means that harm done to organizations, animals, or other subjects of any kind is not part of the scope of this study. Since most web mining applications are currently found in the private sector, this will be our main focus. So web mining involving personal data will be viewed from an ethical perspective in a business context. Note that this chapter is not intended to be of a moralistic nature. Within the ethical perspective of this normative conflict, it is clearly recognized that web mining is a technique with a large number of good qualities and much potential. However, examining the possible threats might create a certain awareness, leading to a well-considered application and further development of this technique. This chapter proceeds as follows. It shall be made clear that there are different ways to mine the web. To grasp the possible problems we shall introduce two different categories of web mining. By describing some scenarios we shall illustrate those categories. The ethical context presented here might help to find suitable solutions so that web mining can be both as profitable and as harmless as possible. Finally, some concluding remarks will be made regarding the findings.
3.2 CATEGORIES OF WEB MINING
There are different ways to mine the web. To structurally analyze the field of tension we need to be able to distinguish between those different forms of web mining. The different ways to mine the web are closely related to the different types of web data. We can distinguish actual data on web pages,
web structure data regarding the hyperlink structure within and across web documents, and web log data regarding the users who browsed the web pages. Therefore, in accordance with Madria et al., we shall divide web mining into three categories. First, there is content mining, which analyzes the content data available in web documents. This can include images, audio files, etc.; however, in this study content mining shall only refer to mining text. Second, there is the category of structure mining, which focuses on link information. It aims to analyze the way in which different web documents are linked together. The third category is called usage mining. Usage mining analyzes the transaction data that is logged when users interact with the web. Usage mining is sometimes referred to as 'log mining', because it involves mining the web server logs. Structure mining is often more valuable when it is combined with content mining of some kind to interpret the hyperlinks' contents. Content and structure mining also share most of the same advantages and disadvantages. We shall discuss them together, considering them as one category. It should however be noted that content and structure mining are not the only ones that can be combined in one tool. Mining content data on a web site can for instance be of added value to usage mining analyses as well. By combining the different categories of web mining in one tool, the results can become more valuable. Web usage mining, however, is quite distinct in its application. As it is also used for different advantages and threatens values in a different way, we shall discuss it separately. The two remaining categories shall be used to see whether the different kinds of web mining will lead to different beneficial or harmful situations. But first the categories will be illustrated by the following scenarios.
3.2.1 Content and structure mining
Sharon really likes to surf the web and she loves to read books. On her personal homepage she likes to share information about her hobbies (surfing and reading) and she mentions her membership of a Christian Youth Association. In her list of recommended links she also refers to the web site of the Christian Youth Association. She has included her e-mail address in case someone wants to comment on her homepage. An on-line bookstore decides to use a web mining tool to search the web for personal homepages to identify potential clients. It matches the data provided on homepages to existing customer profiles. After analyzing the content and the structure of the mined pages, they discover that people who link to Christian web sites of some kind all show a great interest in reading and generally spend a lot of money on buying books. So if the bookstore then makes a special effort
to solicit Christians as customers, it might lead to more buying customers and an increase in profits. The web mining tool not only provides the bookstore with a list of names, but it also clusters people with the same interests, and so on. After analyzing the results, Sharon is identified as a potential, high-consuming customer. The bookstore decides to send Sharon a special offer by e-mail. Sharon is somewhat surprised to receive the e-mail from this bookstore. She has never heard of the store before and she wonders how they could have obtained her e-mail address. A bit annoyed, Sharon deletes the e-mail hoping she will never be bothered by this bookstore again.
3.2.2 Usage mining
Sharon always goes to her 'own' on-line bookstore. She frequently visits its web site to read about the newest publications and to see if there are any interesting special offers. The on-line bookstore analyses its web server logs and notices the frequent visits of Sharon. By analyzing her clickstreams and matching her on-line behavior with profiles of other customers, it is possible to predict whether or not Sharon might be interested in buying certain books, and how much money she is likely to spend on them. Based on their analyses they decide to make sure that a banner is displayed in her browsing window that refers to a special offer on a newly published book that will most likely be of interest to Sharon. She is indeed appealed by the banner and she follows the hyperlink by clicking on it. She decides to accept the special offer and she clicks on the order button. On the on-line ordering form there are a lot of fields to be filled in; some don't really seem relevant, but Sharon does not see any harm in providing the information that is asked for. The people at the bookstore who developed the ordering form intend to use the data for marketing intelligence analyses. In the privacy statement that can be found on the bookstore's web site this intended use of the collected information is explained. The statement also contains a declaration that the gathered information shall not be shared with third parties. However, after a while the on-line bookstore discovers that web users who come from a certain provider hardly ever buy anything but do cause a lot of traffic load on their server. They decide to close the adaptive part of their web site to visitors who use that provider. Sharon happens to be one of them, and the banner in her browser window no longer displays special offers when she visits the site of the bookstore.
3.3 ADVANTAGES OF WEB MINING
Web mining is attractive for companies because of several advantages. In the most general sense it can contribute to an increase in profits, be it by actually selling more products or services, or by minimizing costs. In order to do this, marketing intelligence is required. This intelligence can focus on marketing strategies and competitive analyses or on the relationship with customers. The different kinds of web data that are somehow related to customers are then categorized and clustered to build detailed customer profiles. This not only helps companies retain current customers by being able to provide more personalized services, but it also contributes to the search for potential customers. The scenarios clearly illustrate this. By analyzing the web log data (usage mining), Sharon's favorite bookstore discovered that Sharon is a potential buyer. They were able to make her a tempting offer by displaying a specific banner in her browser window. The other bookstore was able to identify Sharon as a potential customer by searching the web for homepages and analyzing the data on the pages (content and structure mining). She was sent a special offer by e-mail, which would most probably match her preferences. From Sharon's point of view we could say that she was pleased by the fact that the web site of her favorite bookstore displayed an interesting banner, and she was not aware of any missed offers from this bookstore. And although in this scenario Sharon was a bit annoyed by the unsolicited e-mail sent by the first bookstore, she might just as well have been attracted by the offer and she might even have become a customer of this bookstore. Clearly both web mining categories contribute to the general goal of gaining marketing intelligence, each in its own way.
3.3.1 Content and structure mining advantages
One of the things that make the web so special is that its content and structure data are largely publicly available. So in theory everybody can perform content and structure mining on the web, provided they have the right knowledge and sufficient means. Content and structure mining tools are developed by organizations that specialize in web search and data mining technologies. Techniques like finding related words based on common occurrence within the same page result in a larger number of interesting patterns. Applying data mining techniques to web content data could therefore result in better ways of finding relevant information on the web [49]. Structure mining can aid this goal by identifying popular sites (so-called 'authorities'), for example by analyzing the
number of links that refer to a particular site [50]. Web content and structure mining are not only used to improve the quality of public search engines. Special search services can also be offered. Content and structure mining tools can for instance track down online misuse of brands, or analyze the content and structure of competitors' web sites in detail to gain some strategic advantage. With content and structure mining tools, things like online curricula vitae or personal homepages can be collected. After interpreting the personal data found on personal pages, this information can be used for marketing purposes. Profiles of potential customers can be produced and more detailed information is added to profiles of current customers. So mining the web not only contributes to acquiring new customers; it can also aid in retaining existing ones.
3.3.2 Usage mining advantages
Just like content and structure mining, usage mining provides marketing intelligence [51]. In contrast to content and structure data, however, web usage data is not publicly available. Only those who provide the user access to the web and those who own the sites visited by the user are able to produce transaction logs. The advantages of usage mining therefore lie in building profiles based on this transactional data. Web logs provide an exciting new way of collecting information on visitors. In a way, a site owner can actually see what the visitor is looking at [52]. Any action that tailors the web experience to a particular user, or set of users, can be seen as 'web personalization' [53]. Web personalization is often regarded as an indispensable part of e-commerce. The ability to track web users' browsing behavior down to individual mouse clicks makes it possible to personalize services for individual customers on a massive scale [54]. This 'mass customization' of services not only helps customers by satisfying their needs, but also results in customer loyalty. Due to a more personalized and customer-centered approach, the content and structure of a web site can be evaluated and adapted to the customer's preferences, and the right offers can be made to the right customer. When web sites automatically improve their organization and presentation by learning from visitors' access patterns, we speak of adaptive web sites. Web usage mining is also used to create personalized search engines, which can understand a person's search queries in a personal way by analyzing and profiling the user's search behavior. It offers better, more personalized information after filtering out the links that a user most likely will not be interested in. Thus mining web usage data can improve personalization, create selling opportunities and, ultimately, lead to more profitable relationships with customers. Let us now have a look at the way in which individual web users can
benefit from all these advantages. Clearly, web users benefit from web mining when the techniques are used to improve the quality of public or personalized search engines, assisting them while navigating through the huge and ever-growing web. When companies provide more personalized services on their (adaptive) web sites, the individual web user can benefit from offers and services adjusted to his personal needs and preferences. Some even argue that the growing accuracy of profiles will lead to fewer unsolicited marketing approaches, for when a company approaches a web user, it will most likely be about something he or she is actually interested in. Many of the advantages of web mining are thus based on customer profiling. It is often more cost-efficient to look at a group rather than at each individual, because groups are cheaper and easier to approach (for instance by placing an ad in the right magazine instead of mailing every individual member of the group) [55]. These so-called group profiles can also add value to individual profiles, because some individual characteristics only become clear when looking at the individual from a group perspective [55]. A person who never reads may still be interested in books because he lives in a neighborhood where everybody reads a lot and he likes giving neighbors books on their birthdays. An individual profile will not show this characteristic, while the neighborhood group profile does. Furthermore, group profiling can result in better predictive values, making it easier to point out potential customers. Web mining can obviously be quite beneficial to both businesses and individuals. However, to make sure that this technique is further developed in a properly thought-out way, we shall also have to consider the possible objections to it. Awareness of the possible dangers is of great importance for a well-guided development.
3.4 VALUES THREATENED BY WEB MINING
In this section we shall point out that web mining involving the use of personal data can lead to the disruption of some important normative principles. One of the most obvious ethical objections lies in the possible violation of people's privacy. However, when the judgment and treatment of people are based on patterns resulting from web mining, the value of individuality is also threatened. Before discussing this value, we shall first have a look at the issue of privacy violation.
3.4.1 Privacy
Privacy generally refers to the quality or condition of being secluded from the presence or view of others. Privacy can involve one's private life, referred to as relational privacy, but it can also involve one's personal data. The latter is referred to as informational privacy, which can be defined as an individual's ability to autonomously control the unveiling and dissemination of data referring to his private life [56]. Privacy issues due to web mining usually fall within this category of informational privacy [56][57]; this study shall therefore be restricted to it, and in the remainder of this thesis the term privacy will be used to mean informational privacy. There are some differences between the privacy issues related to traditional information retrieval techniques and those resulting from data mining. The technical distinction between data mining and traditional information retrieval techniques has consequences for the privacy problems arising from the application of such techniques [57]. While with traditional information retrieval techniques one has to 'talk' to a database by specifically querying for information, data mining makes it possible to 'listen' to a database [58]: a system of algorithms searches the database for relevant patterns by formulating thousands of hypotheses on its own. That way, interesting patterns can be discovered in huge amounts of data. Tavani argues that it is this very nature of data mining techniques that conflicts with some of the current privacy guidelines formulated by the OECD [57]. These guidelines correspond to the European Directive 95/46/EC of the European Parliament and the Council, and every European Union member state has to implement them in its national laws. The guidelines are thus highly influential, and their definitions and principles should not be underestimated. Principles like the 'use limitation principle' and the 'purpose specification principle' state that information cannot be used for purposes other than the one it was originally retrieved for, and that this purpose has to be clearly explained to the person whose data is being mined before the actual collection. However, one of the features of data mining is that it is not clear in advance what kind of patterns will be revealed, which makes it impossible to clearly specify the exact purpose and notify the data subjects beforehand [57]. Besides, data mining is often performed on historical data sets, which also makes it rather difficult to comply with these guidelines. With web mining, and the way the different types of web data can be collected and analyzed by data mining tools, it is likewise difficult to comply with the current guidelines.
3.4.2 Individuality
Individuality can be described as the quality of being an individual; a human being regarded as a unique personality. Individuality is one of the strongest Western values: in most Western countries people share a core set of values maintaining that it is good to be an individual and to express oneself as such. Profiling through web mining can, however, lead to de-individualization, defined as a tendency to judge and treat people on the basis of group characteristics instead of their own individual characteristics and merits [59]. Vedder distinguishes between two types of group profiles. Distributive group profiles are profiles in which every group characteristic is actually shared by every individual member of the group. In non-distributive group profiles, by contrast, not every characteristic of the group is shared by every individual member: the properties apply to individuals as members of the group, but the individuals themselves need not in reality exhibit all of them. In non-distributive group profiles the data are framed in terms of probabilities, averages and so on, and are therefore generally made anonymous. When data is made anonymous before a profile is produced, the discovered information no longer links to individual persons and there is no direct sense of privacy violation. The profiles do not contain 'real' personal data, commonly defined as data relating to an identified or identifiable person. Basic principles from the European Directive, like the 'collection limitation principle' and the 'openness principle', state that personal data should be obtained by fair means, with the knowledge or consent of the data subject, and that a subject has the right to know what personal data is being stored. The 'individual participation principle' even gives a subject the right to object to the processing of his data. All of these principles heavily depend on the assumption that there is some kind of direct connection between a person and his or her data. In anonymous profiles, however, this direct connection has been erased. Nevertheless, the profiles are often used as if they were personal data, which sometimes makes the impact on individual persons stronger than with the use of 'real' personal data: because the information cannot be traced back to individual persons, individuals can no longer protect themselves with traditional privacy rules. When group profiles are used as a basis for decision-making and formulating policies, or if profiles somehow become public knowledge, people will be judged and treated as group members rather than as unique individuals. By threatening the value of individuality, people could even be discriminated against. This is, for instance, the case when profiles containing data of a sensitive nature are used for selections in certain allocation procedures. Some criteria are (usually) inappropriate and
discriminatory to use in decision-making, such as race, religion and so on, especially when the information is irrelevant (and often illegal to use) for decisions like turning someone down for a job or denying him a loan [60]. As both categories of web mining are used for profiling, both pose the same threat to the value of individuality and the closely related value of non-discrimination. Web mining ultimately jeopardizes the values of fair judgment and fair treatment. In the next section we shall try to demonstrate the seriousness of these dangers.
Chapter 4 RELATED WORK
The goal of traditional clustering is to assign each data point to exactly one cluster. In contrast, fuzzy clustering assigns different degrees of membership to each point, so that the membership of a point is shared among various clusters (Fung). Topics that characterize a given knowledge domain are somehow associated with each other, and those topics may also be related to topics of other domains. Hence, documents may contain information that is relevant to different domains to some degree. With fuzzy clustering methods, documents are attributed to several clusters simultaneously, and thus useful relationships between domains may be uncovered that would otherwise be neglected by hard clustering methods. Data Mining has emerged as a new discipline in a world of increasingly massive datasets. Data Mining is the process of extracting or mining knowledge from data, and it is becoming an increasingly important tool for transforming data into information. Knowledge Discovery from Data (KDD) is a synonym for Data Mining [8]. Clustering is one of the Data Mining techniques used to improve the efficiency of the information finding process. Many clustering algorithms have been developed and used in many fields; A. K. Jain, M. N. Murty and P. J. Flynn [11] provide an extensive survey of data clustering techniques. Clustering algorithms can be broadly categorized into partitional and hierarchical techniques. Agglomerative hierarchical clustering (AHC) algorithms are the most commonly used. They follow a bottom-up methodology, merging smaller clusters into larger ones using techniques such as the minimal spanning tree, and they turn out to be slow when applied to large document collections. AHC has different variants such as single-link, group-average and complete-link; single-link and group-average methods typically take O(n²) time, while the complete-link method typically takes O(n³) time. Partitional algorithms such as K-means are linear-time algorithms. They try to divide the data into subgroups such that the partition optimizes certain criteria, like inter-cluster or intra-cluster distance, and they typically take an iterative approach. The time complexity of K-means is O(nkt), where n is the number of data points, k is the number of desired clusters and t is the number of iterations.
Most document clustering algorithms work on the BOW (Bag of Words) model [5]. Oren Zamir and Oren Etzioni [12] listed the key requirements of web document clustering methods as relevance, browsable summaries, overlap, snippet tolerance, speed and accuracy, and proposed the STC (Suffix Tree Clustering) algorithm, which creates clusters based on phrases shared between documents. Michael Steinbach, George Karypis and Vipin Kumar [13] presented the results of an experimental study of some common document clustering algorithms, comparing the two main approaches to document clustering: agglomerative hierarchical clustering and the K-means method. Nicholas O. Andrews and Edward A. Fox [14] surveyed recent developments in document clustering. A single object often contains multiple themes; for example, a web document on the topic of Web Mining may contain themes such as Data Mining, clustering and information retrieval. Many traditional clustering algorithms assign each document to a single cluster, making it difficult for the user to retrieve information. On this basis, clustering algorithms can be divided into hard and soft clustering algorithms: in a traditional (hard) clustering algorithm each object belongs to exactly one cluster, whereas in a soft clustering algorithm each object can belong to multiple clusters [15]. The conventional clustering algorithms in Data Mining have difficulties handling the challenges posed by collections of natural data, which are often vague and uncertain. The modeling of imprecise and qualitative knowledge, as well as the handling of uncertainty at various stages, is possible through the use of fuzzy sets. Fuzzy clustering methods were therefore proposed to construct clusters with uncertain boundaries, allowing one object to belong to multiple clusters with some membership degree. Pawan Lingras, Rui Yan and Chad West [16] applied fuzzy techniques to discover usage patterns from web data: fuzzy c-means clustering was applied to the visitors of educational websites, and the analysis showed the ability of fuzzy c-means to distinguish different user characteristics. Anupam Joshi and Raghu Krishnapuram [17] developed a prototype Web Mining system that analyzes web access logs from a server and tries to mine typical user access patterns. Maofu Liu, Yanxiang He and Huijun Hu [18] proposed a web fuzzy clustering model; their experimental results on web user clustering prove the feasibility of web fuzzy clustering in web usage mining. Friedman et al. [1] presented a methodology for a new Fuzzy-based Document Clustering Method (FDCM) to cluster documents that are represented by variable-length
vectors. Each vector element consists of two fields: the first is the identification of a key phrase (its name) in the document, and the second denotes the frequency associated with this key phrase within the particular document. Juršič et al. [2] presented the fuzzy clustering of 2-dimensional points and documents; for document clustering they implemented fuzzy c-means in the Text Garden environment. Oren Etzioni was the first to coin the term Web Mining. Initially, two different approaches were taken to defining Web Mining. The first was a 'process-centric view', which defined Web Mining as a sequence of different processes [6], whereas the second was a 'data-centric view', which defined Web Mining in terms of the type of data being used in the mining process [7]. The second definition has become more widely accepted, as is evident from the approach adopted in most research papers [3][5]. Web Mining also lies at the intersection of databases, information retrieval and artificial intelligence [4].
Chapter 5 PROPOSED WORK
We propose a fuzzy clustering method based on a fuzzy equivalence relation. The crawler downloads the documents and stores them, together with the keywords contained therein, in a database. The indexer extracts all words from the entire set of documents and eliminates stop words such as 'a', 'the', 'on' and 'and' from all the documents. The proposed algorithm for fuzzy clustering of web documents using equivalence relations is given below; a code sketch of its core steps follows the algorithm.
Proposed Algorithm:
1. Input the document files.
2. Remove the stop words from all document files.
3. Generate the list of keywords (common words) together with the document numbers in which they occur.
4. Assign each keyword a keyword id.
5. Map the keyword ids to the document numbers to generate the document clustering data on the basis of shared keywords.
6. Determine a fuzzy compatibility relation in terms of an appropriate distance function applied to the given data:

$$R(x_i, x_k) = 1 - \delta \left( \sum_{j=1}^{p} \lvert x_{ij} - x_{kj} \rvert^{q} \right)^{1/q} \qquad \text{(i)}$$

for all pairs $(x_i, x_k) \in X$ (the set of data) and $q \in \mathbb{R}^{+}$, where $\delta$ is a constant that ensures $R(x_i, x_k) \in [0, 1]$; clearly, $\delta$ is the inverse of the largest distance in $X$.
7. Calculate the transitive closure of this fuzzy compatibility relation. Given the relation $R(X, X)$, its transitive closure $R_T(X, X)$ can be determined by a simple algorithm consisting of three steps:
   1. $R' = R \cup (R \circ R)$
   2. If $R' \neq R$, set $R = R'$ and go to step 1.
   3. Otherwise stop: $R' = R_T$.
8. Calculate the α-cuts of the transitive closure relation.
9. Draw the dendrogram with the help of the α-cuts; it shows the clustering of the web documents. The result should be the same for q = 1 and q = 2, since both are applied to the same data.
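The following is a minimal sketch of steps 6 to 8 in Java, the language used for the implementation in Chapter 6. It is illustrative rather than the thesis code: the class and method names are our own, and the documents are assumed to be already encoded as numeric feature vectors (for example, keyword counts).

import java.util.Arrays;

public class FuzzyEquivalenceClustering {

    // Step 6: fuzzy compatibility relation from the Minkowski distance of eq. (i).
    static double[][] compatibilityMatrix(double[][] x, double q) {
        int n = x.length;
        double[][] dist = new double[n][n];
        double maxDist = 0.0;
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                double s = 0.0;
                for (int j = 0; j < x[i].length; j++)
                    s += Math.pow(Math.abs(x[i][j] - x[k][j]), q);
                dist[i][k] = Math.pow(s, 1.0 / q);
                maxDist = Math.max(maxDist, dist[i][k]);
            }
        double delta = 1.0 / maxDist;  // inverse of the largest distance in X
        double[][] r = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                r[i][k] = 1.0 - delta * dist[i][k];
        return r;
    }

    // Step 7: transitive closure by iterating R' = R ∪ (R ∘ R) until R' = R,
    // where ∘ is max-min composition and ∪ is the element-wise maximum.
    static double[][] transitiveClosure(double[][] r) {
        int n = r.length;
        while (true) {
            double[][] rPrime = new double[n][n];
            for (int i = 0; i < n; i++)
                for (int k = 0; k < n; k++) {
                    double comp = 0.0;  // (R ∘ R)(i,k) = max over j of min(R(i,j), R(j,k))
                    for (int j = 0; j < n; j++)
                        comp = Math.max(comp, Math.min(r[i][j], r[j][k]));
                    rPrime[i][k] = Math.max(r[i][k], comp);
                }
            if (Arrays.deepEquals(rPrime, r)) return rPrime;
            r = rPrime;
        }
    }

    // Step 8: the α-cut of the closure is a crisp equivalence relation.
    static boolean[][] alphaCut(double[][] rt, double alpha) {
        int n = rt.length;
        boolean[][] cut = new boolean[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                cut[i][k] = rt[i][k] >= alpha;
        return cut;
    }

    public static void main(String[] args) {
        // The six data points used in the worked example of Chapter 6.
        double[][] x = {{0, 0}, {1, 1}, {1, 2}, {2, 3}, {2, 0}, {3, 4}};
        double[][] rt = transitiveClosure(compatibilityMatrix(x, 2));
        for (double[] row : rt)
            System.out.println(Arrays.toString(row));
    }
}

Note that the closure loop terminates exactly: max-min composition only ever selects among membership values already present in R, so no new values are introduced and equality is reached after finitely many passes.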
Fig.5.1 Flow Chart of proposed algorithm
Chapter 6 EXPERIMENTAL RESULTS
We present the results of an experimental evaluation of the proposed fuzzy clustering technique using equivalence relations and fuzzy hierarchical clustering. We implemented the algorithm in Java and ran our experiments in the NetBeans IDE 7.1.2 on a 64-bit Windows 7 system with an Intel Core i5 @ 2.40 GHz and 4 GB of RAM. The experiments show that the same clusters of web documents are obtained for q = 1 and q = 2; below we present the results for both values.
6.1 EXPERIMENTAL RESULT FOR EUCLIDEAN DISTANCE
(a) Input the folder of document files and enter the value q = 2, then start the process, which generates the keywords with document ids.
Fig.6.1 Input of files and keywords with document ids
In the above figure there are four documents and five keywords. Applying the above algorithm, the data are analyzed for q = 2, which corresponds to the Euclidean distance.
(b) A keyword id is generated for each keyword, and every keyword occurs only once.
Fig.6.2 Keywords with keyword ids
(c) The mapping data table obtained from the tables in (a) and (b) is shown below.
Fig.6.3 Document clustering data
There are six data points: x1 = (0,0), x2 = (1,1), x3 = (1,2), x4 = (2,3), x5 = (2,0), x6 = (3,4). The largest Euclidean distance between any pair of these data points is 5 (between x1 and x6), so δ = 1/5 = 0.20.
(d) The matrix of the fuzzy compatibility relation is shown next. The membership values of R are calculated from equation (i); for example,

$$R(x_2, x_4) = 1 - 0.20\,(1^2 + 2^2)^{1/2} \approx 0.55 .$$

Once determined, the relation R may conveniently be represented by a matrix over the data points.
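For illustration, two further entries follow by the same arithmetic on the data points listed above:

$$R(x_1, x_5) = 1 - 0.20\,(2^2 + 0^2)^{1/2} = 0.60, \qquad R(x_1, x_2) = 1 - 0.20\,(1^2 + 1^2)^{1/2} \approx 0.72 .$$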
Fig.6.4 Fuzzy relation matrix
(e) Calculate the transitive closure of the above relation matrix. The relation R itself is not max-min transitive, so its transitive closure is computed. The closure yields three distinct partitions through its α-cuts:
Fig.6.5 Transitive closure matrix
(f) Calculate the α-cuts of the above transitive closure matrix.
Fig.6.6 α-cuts of the transitive closure matrix
(g) Draw the dendrogram with the help of the α-cuts; it shows the clustering of the web documents. The result should be the same for q = 1 and q = 2, since both are applied to the same data. (Shown here for q = 2.)
Fig.6.7 Dendrogram representation
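As an illustrative sketch only (not the thesis implementation; the method name and the boolean-matrix representation carry over from the sketch in Chapter 5), the partition at a given α, i.e. the set of clusters drawn at that level of the dendrogram, can be read off the crisp α-cut matrix by grouping mutually related indices:

import java.util.ArrayList;
import java.util.List;

public class AlphaCutPartition {

    // Groups document indices into equivalence classes from a crisp α-cut
    // matrix. An α-cut of a max-min transitive closure is reflexive,
    // symmetric and transitive, so each row directly lists one whole class.
    static List<List<Integer>> classesOf(boolean[][] cut) {
        int n = cut.length;
        boolean[] assigned = new boolean[n];
        List<List<Integer>> classes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (assigned[i]) continue;
            List<Integer> cls = new ArrayList<>();
            for (int k = 0; k < n; k++)
                if (cut[i][k]) {
                    cls.add(k);
                    assigned[k] = true;
                }
            classes.add(cls);
        }
        return classes;
    }
}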
6.2 EXPERIMENTAL RESULT FOR HAMMING DISTANCE
Now we check the experimental result for q = 1 in eq. (i), which corresponds to the Hamming distance. Since the largest Hamming distance in the data is 7 (between x1 and x6), we have δ = 1/7 ≈ 0.14.
(a) The matrix form of the relation R given by eq. (i) is now:
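For example, the pair checked earlier for q = 2 now gives, using δ = 1/7 exactly:

$$R(x_2, x_4) = 1 - \tfrac{1}{7}\left( \lvert 1 - 2 \rvert + \lvert 1 - 3 \rvert \right) = 1 - \tfrac{3}{7} \approx 0.57 .$$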
Fig.6.8 Fuzzy compatibility matrix
(b) Its transitive closure is shown below.
Fig.6.9 Transitive closure matrix
(c) The transitive closure relation gives the following partitions in its α-cuts.
Fig.6.10 α-cuts of the transitive closure matrix
(d) Draw the dendrogram with the help of the α-cuts; it shows the clustering of the web documents.
Fig.6.11 Dendrogram representation
The dendrograms for both values of q (1 and 2) are the same, which supports the correctness of our experimental result. A dendrogram is a graphical representation of the results of hierarchical cluster analysis: a tree-like plot in which each step of the hierarchical clustering is represented as a fusion of two branches of the tree into a single one. The branches represent the clusters obtained at each step of the hierarchical clustering.
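Tying the sketches together (again a hypothetical helper, assuming the alphaCut and classesOf methods given earlier), the dendrogram levels can be enumerated by taking the distinct membership values of the closure matrix in decreasing order and printing the partition at each:

import java.util.Comparator;
import java.util.TreeSet;

public class DendrogramLevels {

    // One partition per distinct α occurring in the closure matrix R_T,
    // listed from the finest level (highest α) to the coarsest (lowest α,
    // where everything merges into a single cluster).
    static void printLevels(double[][] rt) {
        TreeSet<Double> alphas = new TreeSet<>(Comparator.reverseOrder());
        for (double[] row : rt)
            for (double v : row)
                alphas.add(v);
        for (double alpha : alphas)
            System.out.println("alpha = " + alpha + " -> "
                    + AlphaCutPartition.classesOf(
                          FuzzyEquivalenceClustering.alphaCut(rt, alpha)));
    }
}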
Chapter 7 CONCLUSION
Fuzzy clustering is better suited to Web Mining than conventional clustering, and it is also useful for detecting outlier data points or documents. The proposed document clustering technique, based on a fuzzy logic approach, improves the relevancy factor; the experimental results show the same clustering for both values of q. The technique keeps related documents in the same cluster, so that searching for documents becomes more efficient in terms of time complexity. In future work, the relevancy factor of fuzzy clustering for retrieving web documents can be improved further.
REFERENCES
[1] Menahem Friedman and Abraham Kandel, "A Fuzzy-Based Algorithm for Web Document Clustering", IEEE, 2004.
[2] Matjaž Juršič and Nada Lavrač, "Fuzzy Clustering of Documents", IMCIS, October 17, 2008, Ljubljana, Slovenia.
[3] Hillol Kargupta, Anupam Joshi, Krishnamoorthy Sivakumar and Yelena Yesha, "Data Mining: Next Generation Challenges and Future Directions", MIT Press, USA, 2004.
[4] Wang Bin and Liu Zhijing, "Web Mining Research", in Proceedings of the 5th International Conference on Computational Intelligence and Multimedia Applications (ICCIMA'03), 2003.
[5] R. Kosala and H. Blockeel, "Web Mining Research: A Survey", ACM SIGKDD Explorations, July 2000.
[6] Oren Etzioni, "The World Wide Web: quagmire or gold mine?", Communications of the ACM, November 1996.
[7] R. Cooley, B. Mobasher and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web", in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), 1997.
[8] Anjali B. Raut and G. R. Bamnote, "Web Document Clustering Using Fuzzy Equivalence Relations", CIS Journal, Volume 2, 2010-11.
[9] Sankar K. Pal, Varun Talwar and Pabitra Mitra, "Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions", IEEE Transactions on Neural Networks, Vol. 13, No. 5, September 2002.
[10] Andreas Hotho and Gerd Stumme, "Mining the World Wide Web: Methods, Applications and Perspectives", Künstliche Intelligenz, July 2007. (Available at http://kobra.bibliothek.uni-kassel.de/)
[11] A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: A review", ACM Computing Surveys, 31(3):264-323, September 1999.
[12] O. Zamir and O. Etzioni, "Web document clustering: A feasibility demonstration", in Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, June 1998.
[13] Michael Steinbach, George Karypis and Vipin Kumar, "A Comparison of Document Clustering Techniques", KDD Workshop on Text Mining, 2000.
[14] Nicholas O. Andrews and Edward A. Fox, "Recent Developments in Document Clustering", Department of Computer Science, Virginia Tech, 2007.
[15] King-Ip Lin and Ravikumar Kondadadi, "A Similarity-Based Soft Clustering Algorithm for Documents", in Proceedings of the 7th International Conference on Database Systems for Advanced Applications (DASFAA-2001), April 2001.
[16] Pawan Lingras, Rui Yan and Chad West, "Fuzzy C-Means Clustering of Web Users for Educational Sites", Springer, 2003.
[17] Anupam Joshi and Raghu Krishnapuram, "Robust Fuzzy Clustering Methods to Support Web Mining", in Proceedings of the Workshop on Data Mining and Knowledge Discovery, SIGMOD, 1998.
[18] Maofu Liu, Yanxiang He and Huijun Hu, "Web Fuzzy Clustering and Its Applications in Web Usage Mining", in Proceedings of the 8th International Symposium on Future Software Technology (ISFST-2004).
[19] Sankar K. Pal and P. Mitra, "Data Mining in Soft Computing Framework: A Survey", IEEE Transactions on Neural Networks, Vol. 13, No. 1, January 2002.
[20] R. Kruse and C. Borgelt, "Fuzzy Data Analysis: Challenges and Perspectives", http://citeseer.ist.psu.edu/kruse99fuzzy.html
[21] G. Raju, Th. Shanta Kumar and Binu Thomas, "Integration of Fuzzy Logic in Data Mining: A Comparative Case Study", in Proceedings of the International Conference on Mathematics and Computer Science, Loyola College, Chennai, pp. 128-136, 2008.
[22] G. J. Klir and T. A. Folger, Fuzzy Sets, Uncertainty and Information, Prentice Hall, 1988.
[23] J. C. Bezdek, Fuzzy Mathematics in Pattern Classification, Ph.D. thesis, Center for Applied Mathematics, Cornell University, Ithaca, N.Y., 1973.
[24] Zadeh, Lotfi: "What is Soft Computing" Soft Computing. Springer-Verlag, Germany/USA, 1997.
[25] Zadeh, Lotfi: "Fuzzy Logic and Soft Computing" Plenary Speaker, Proceedings of the IEEE International Workshop on Neuro Fuzzy Control. Muroran, Japan, 1993.
[26] Zadeh, Lotfi: "The Role of Soft Computing and Fuzzy Logic in the Conception, Design and Development of Intelligent Systems" Plenary Speaker, Proceedings of the International Workshop on Soft Computing in Industry. Muroran, Japan, 1996.
[27] Zadeh, Lotfi: "From Computing with Numbers to Computing with Words: from Manipulation of Measurements to Manipulation of Perceptions" Plenary Speaker, Proceedings of the IEEE International Workshop on Soft Computing in Industry. Muroran, Japan, 1999.
[28] Kacprzyk, Janusz (Editor): Advances in Soft Computing. Springer-Verlag, Heidelberg, Germany, 2001.
[29] Dote, Yasuhiko; S. Taniguchi, and T. Nakane: "Intelligent Control Using Soft Computing" Proceedings of the 5th Online World Conference on Soft Computing in Industrial Applications (WSC5), 2000.
[30] Mendel, Jerry: Uncertain Rule-Based Fuzzy Logic Systems: Introduction and New Directions. Prentice Hall PTR, NJ, USA 2001. [31] Zadeh, Lotfi: “Fuzzy Sets” Information and Control. 8(3), pp.338-353, 1965. (Cited by [Klir 1995, 1997], [Bonissone 1997]).
[32] Klir, George; U. St.Clair, and B. Yuan: Fuzzy Set Theory: Foundations and Applications. Prentice Hall, NJ, USA 1997.
[33] Kohout, Ladislav: "Notes on Fuzzy Logics" Class Notes, Fuzzy Systems and Soft Computing. Department of Computer Science, Florida State University, Tallahassee, FL, USA 1999.
[34] Bandler, W. and L. J. Kohout: "Special properties, closures and interiors of crisp and fuzzy relations" Fuzzy Sets and Systems 26(3): 317-332, June 1988.
[35] Bandler, W. and L. J. Kohout: "Relations, mathematical" In M.G. Singh, editor, Systems and Control Encyclopedia. Pergamon Press, Oxford, pp. 4000-4008, 1987.
[36] Kohout, Ladislav: "Boolean and fuzzy relations" In P.M. Pardalos and C.A. Floudas, editors, The Encyclopedia of Optimization. Kluwer, Boston, MA, USA 2001.
[37] Klir, G. and B. Yuan: Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall, NJ, USA 1995.
[38] Bandler, W. and L. Kohout: "Mathematical Relations, Their Products and Generalized Morphisms" Technical Report, EES-MMS-REL 77-3, Man-Machine Systems Laboratory, Department of Electrical Engineering, University of Essex, Colchester, Essex, UK 1977.
[39] Bandler, W. and L. Kohout: "Fuzzy Relational Products as a Tool for Analysis and Synthesis of the Behaviour of Complex Natural and Artificial Systems" In Fuzzy Sets: Theory and Applications to Policy Analysis and Information Systems. P. Wang and S. Chang, editors, pp. 341-367, Plenum Press, NY, USA 1980.
[40] Bandler, W. and L. J. Kohout: "Relations, mathematical" In M.G. Singh, editor, Systems and Control Encyclopedia. Pergamon Press, Oxford, pp. 4000-4008, 1987.
[41] Kohout, Ladislav: "Foundations of Knowledge Engineering for Systems with Distributed Intelligence: A Relational Approach" Encyclopedia of Microcomputers. A. Kent and J. Williams, editors, Vol. 24, Supplement 3, Marcel Dekker Inc., NY, USA 2000.
[42] Bonissone, Piero: "Soft Computing: The Convergence of Emerging Reasoning Technologies" Soft Computing. Springer-Verlag, Germany/USA 1997.
[43] Mamdani, E. and S. Assilian: "An Experiment in Linguistic Synthesis with a Fuzzy Logic Controller" International Journal of Man-Machine Studies. 7(1), pp. 1-13, 1975. (Cited by [Bonissone 1997].)
[44] Kohout, Ladislav: A Perspective on Intelligent Systems: A Framework for Analysis and Design. Chapman and Hall Computing, London, UK 1990.
[45] Kohout, Ladislav and W. Bandler: "Fuzzy relational products in knowledge engineering" In V. Novák et al., editors, Fuzzy Approach to Reasoning and Decision Making. pp. 51-66, Academia and Kluwer, Prague and Dordrecht, 1992.
[46] Moussa, Ahmed and L. Kohout: “Using BK-Products of Fuzzy Relations in Quality of Service Adaptive Communication” Proceedings of IFSA/NAFIPS-2001, pp. 681-686, IEEE, Vancouver, Canada, July 2001.
[47] Vos, H. (1995) Filosofie van de moraal. Inzicht in moraal en ethiek. Aula Spectrum, Utrecht. Weckert, J. (2000) Computer ethics: future directions, presented at the ARC Special Research Centre for Applied Philosophy and Public Ethics, Charles Sturt University in Canberra and The University of Melbourne. (http://www.acs.org.au/act/events/2000acs4.html)
[48] Xiaohe, L. (1998) on economic and ethical value, in The online journal of ethics, vol. 2, no. 1. (http://www.stthom.edu/cbes/volumetwo.html)
[49] Kosala R. and Blockeel, H. (2000) Web Mining Research: A Survey, in ACM SIGKDD, vol. 2, 1:1-15.
[50] Madria, S.K., Bhowmick, S.S., Ng, W.-K. and Lim, E.P. (1999) Research Issues in Web mining, in Lecture Notes in Computer Science, 1676:303-312.
[51] A.G. Büchner, S.S. Anand, M.D. Mulvenna, J.G. Hughes (1999) Discovering Internet Marketing Intelligence through Web Log Mining, in Proc. Unicom99 Data Mining & Datawarehousing: Realising the full Value of Business Data, pp. 127-138.
[52] Khabaza, T. (2000) As E-asy as falling off a web-log: data mining hits the web, in Proceedings of the fourth international conference on the practical applications of knowledge discovery and data mining,
Manchester
UK,
April
2000,
The
Practical
Application
Company.
(http://www.mining.dk/SPSS/Nyheder/as_easy.htm)
[53] Mobasher, B., Dai, H., Luo, T., Sun, Y. and Zhu, J. (2000) Integrating web usage and content mining for more effective personalization, in Proceedings of the International Conference on E-Commerce and Web Technologies (ECWeb2000). (http://maya.cs.depaul.edu/~mobasher/papers/ecweb2000.pdf)
[54] Mobasher, B., Cooley, R., Srivastava, J. (2000) Automatic personalization based on web usage mining, in Communications of the ACM, vol. 43, 8:142-151.
[55] Custers, B. (2001) Data mining and group profiling on the Internet. In Ethics and the Internet, A. Vedder (ed.), Antwerpen, Groningen, Oxford: Intersentia, 87-104.
[56] Vedder, A. (1998) Het einde van de individualiteit? Datamining groepsprofilering en de vermeerdering van brute pech en dom geluk, in Privacy & informatie, 3:115-120.
[57] Tavani, H.T. (1999a) Informational privacy, data mining, and the internet, in Ethics and Information Technology, 1: 137-145.
[58] Holsheimer, M. (1999) Datamining ontdekt waardevolle informatie in databases, in Privacy & informatie, 3:100-104.
[59] Vedder, A. (1999) KDD: The challenge to individualism, in Ethics and Information Technology, 1:275-281.
[60] Johnson, D.G. (2001) Computer ethics, 3rd. edition, Prentice-Hall, Upper Saddle River, New Jersey.