A Rough Set Approach for Clustering the Data Using

IEEE International Conference on e-Business Engineering

A Rough Set Approach for Clustering the Data Using Knowledge Discovery in World Wide Web for E-Business

Hrudaya Ku. Tripathy
Institute of Advanced Computer and Research, Prajukti Bihar, Rayagada, Orissa, India-765002
[email protected]

B.K. Tripathy
Department of Computer Science, Berhampur University, Bhanja Vihar, Berhampur, Orissa, India
[email protected]

successfully increasing competitiveness in the marketplace. Implementing an e-strategy is not as easy as simply adding an “e” in front of the current business strategy. Generally, the business data involved in e-business are enormous.

Abstract

Data mining and knowledge discovery are a very important part of today’s e-business. The World Wide Web has become, in reality, the largest online information source, available to practically anyone with access to the Internet. This paper proposes an e-business framework, together with a knowledge discovery technique to personalize e-business, increase cross-selling, and improve customer relationship management. Due to the enormous size of the Web and the low precision of user queries, the results returned by present Web search engines can run to hundreds or even thousands of items. Therefore, finding the right information can be difficult, if not impossible. One approach to this problem uses clustering techniques to group similar data together, so that results can be presented in a more compact form and the result set can be browsed. In this paper, a data clustering technique is presented with emphasis on its application to Web search results. An algorithm for clustering Web data based on rough sets is presented and its practical implementation is discussed.

Figure 1. Knowledge Discovery in World Wide Web for E-Business. (Components shown: Customers; Web Server; Commerce Server; Web log; External Database; Transaction Database; Business Rules Engine; Campaign, Promotion Manager; Business Manager; Reporting, Analysis and Mining; Knowledge Discovery Database.)

Keywords: Data Mining, Knowledge Discovery, WWW, Rough set, clustering.

Mostly, they contain customer information, purchase information, product and service information, supplier information, security and priority information, and management reports, including standing and statistical analyses of production, sales, finance, etc., as well as online web access data, as in Figure 1. Effectively organizing and managing the data is a fundamental task for all e-businesses [1]. In the last decade, we have seen an explosive increase in our capabilities both to collect and store data and to generate even more data by further computer processing. Knowledge Discovery in Databases (KDD) and other phrases, such as database mining,

1. Introduction

Today, there is more interest, more discussion and more excitement than ever about “e.” The Internet, e-commerce and e-business undoubtedly play an important part in every organization’s future and success, offering tremendous opportunities and worldwide markets. To differentiate themselves in the Internet economy, winning enterprises are realizing that e-business is much more than a simple buy/sell transaction; the right e-strategies are the key to

0-7695-3003-6/07 $25.00 © 2007 IEEE DOI 10.1109/ICEBE.2007.62



originality in e-business can be understood theoretically by distinguishing between innovation in e-business models and innovation at the systems level. Innovation management theory and the knowledge-based theory of the firm provide rich conceptual bases for exploring the relationship between business model innovation and information system innovation [3]. In the context of web mining, clustering could be used to group similar click-streams in order to determine learning behaviors in the case of e-learning, or general site access behaviors in e-business.

information harvesting or data mining, have been used to refer to the process of finding useful patterns (or nuggets of knowledge) in raw data. As in any emerging field, there are differing views on what the definition and scope of KDD should be. In some literature, the phrase “knowledge discovery in databases” is viewed as a broader discipline, and the term data mining is seen as just the component dealing with knowledge discovery methods (Fayyad et al., 1996) [2].

2. Knowledge discovery database in World Wide Web

[Figure 2 labels: Content and Structure Data; Raw usage data; Preprocessing; Preprocessed data; Pattern Discovery]

The World Wide Web has been one of the most exciting influences on human society in the last 10 years. It has changed the ways of doing business, providing and receiving education, managing organizations, etc. The most direct effect is the complete change in how information is collected, conveyed, and exchanged. Today, the Web has become the largest information source available on this planet. The Web is a huge, explosive, diverse, dynamic and mostly unstructured data repository, which supplies an incredible amount of information and also raises the complexity of dealing with that information from the different perspectives of users, Web service providers, and business analysts. Users want effective search tools to find relevant information easily and accurately [14]. E-business is fundamentally transforming industry structures by enabling exceptional networked business models facilitated by the Internet. KDD is a technique to discover and analyze useful information from Web data. Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. The term refers to applying data mining techniques (as in Figure 2) to automatically discover and extract useful information from World Wide Web documents and services [6]. The authors of [8] claim that the Web involves three types of data: data on the Web (content), Web log data (usage) and Web structure data. Web service providers want to find ways to predict users’ behavior and personalize information, to reduce the traffic load, and to design Web sites suited to different groups of users. Business analysts want tools to learn users’ needs. All of them expect tools or techniques that help them satisfy their demands and/or solve the

[Figure 2 labels: Pattern Analysis; Rules, Patterns and Statistics; “Interesting” Rules, Patterns and Statistics]

Figure 2. The web usage mining process [9].

E-business has changed the face of most business functions in competitive enterprises. Internet technologies have seamlessly automated the interface processes between customers and retailers, retailers and distributors, distributors and factories, and factories and their many suppliers. In general, e-business has enabled online transactions. Generating large-scale real-time data has also never been easier. With data pertaining to various views of business transactions readily available, it is only natural to turn to data mining to make sense of these data sets. Data mining in e-commerce mostly relies on the controller for generating the data to mine. In summary, it is little surprise that e-commerce is the killer application for data mining (Kohavi 2001) [3]. Web sites are often used to establish a company’s image, to promote and sell goods, and to provide customer support. The success of a web site directly affects and reflects the success of the company in the electronic market. A methodology has been proposed to improve the success of web sites, based on the exploitation of navigation-pattern discovery. In particular, its authors present a theory in which success is modeled on the basis of the navigation behavior of the site’s users. The fundamental meaning of this ongoing study is that


problems encountered on the Web. Therefore, Web mining has become an active and popular research field.

terminology, a data table is also called an information system. If some of the attributes are interpreted as outcomes of classification, it is also called a decision system. More formally, a data table is a tuple $(U, A, \{V_a\}_{a \in A})$ where
1. $U$ is a set of objects,
2. $A$ is a set of attributes $a : U \to V_a$,
3. $V_a$ is the set of values for the attribute $a$.
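As a minimal sketch of this definition, the following Python fragment stores a small information system as a dictionary and recovers the sets $U$, $A$ and $V_a$ from it. The table contents, the attribute names (`pages_viewed`, `purchased`) and the helper names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Hypothetical toy information system: each object x in U maps to its
# attribute values a(x). All names and values are illustrative assumptions.
table = {
    "x1": {"pages_viewed": "few", "purchased": "no"},
    "x2": {"pages_viewed": "few", "purchased": "no"},
    "x3": {"pages_viewed": "many", "purchased": "yes"},
    "x4": {"pages_viewed": "many", "purchased": "no"},
}


def universe(table):
    """Return U, the set of objects."""
    return set(table)


def attributes(table):
    """Return A, the set of attribute names used by any object."""
    return {a for row in table.values() for a in row}


def value_set(table, a):
    """Return V_a, the set of values that attribute a takes on U."""
    return {row[a] for row in table.values()}


def attribute_value(table, a, x):
    """Evaluate the attribute a : U -> V_a at the object x."""
    return table[x][a]
```

Reading `purchased` as an outcome of classification and `pages_viewed` as a measured feature would make this toy table a decision system in the sense described above.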

3. A rough set on knowledge discovery

Knowledge discovery from real-life databases is a multi-phase process consisting of numerous steps, including attribute selection, discretization of real-valued attributes, and rule induction. In this paper, we discuss a rule discovery process that is based on rough set theory. Data mining and knowledge discovery are a very hot issue nowadays: as more and more information is stored digitally, the ability to collect data far outweighs a person’s ability to analyze it. Prior to the emergence of the data mining field, it was common practice either to design a database application on online data or to use a statistical (or analytical) package on online data, along with a domain expert to interpret the results [13]. A piece of knowledge is a relationship or pattern among data elements that is potentially interesting and useful. In general, discovery means finding something that is hidden or previously unknown. A knowledge discovery system, then, is a system that can discover knowledge. When a knowledge discovery system operates on data in a large, real-world database, it becomes a KDD system or a data mining system [2]. Zdzislaw Pawlak developed rough set theory in the early 1980s. Rough set theory deals with the classification of discrete data tables in a supervised learning environment. Although in theory rough sets deal with discrete data, they are commonly used in conjunction with other techniques that discretize the data set. The main features of rough set data analysis are that it is non-invasive and able to handle qualitative data. This fits most real-life applications nicely. Rough sets have appeared in much research but have seldom found their way into real-world applications. Knowledge discovery with rough sets is a multi-phase process consisting mainly of:

We can split the set of attributes into two subsets, $C \subset A$ and $D = A \setminus C$, respectively the set of condition attributes and the decision (or class) attribute(s). Condition attributes represent measured features of the objects, while the decision attribute is an a posteriori outcome of classification [4]. Given the complexity and task-dependence of the KDD process, it is difficult to decompose it into elementary steps. Data cleaning and preprocessing is one of the application fields of rough sets. It is known that prior knowledge is very important in the KDD process. On the other hand, one of the main features of rough set data analysis is that it does not use information outside the target data set. These two facts seem to be incompatible. They can be reconciled as follows: first, rough sets are used to reduce and clean the data with minimal model assumptions; then, the result is used as a basis for further analysis performed with other methods. Missing values can be handled before rough set data analysis. We can use some data repair methods from KDD to obtain a complete system [5].

4. Why clustering the data in WWW?

Searching on the web is tedious and time-consuming. Search engines cannot index the huge and highly dynamic web content. The user’s “intention behind the search” is not clearly expressed, which results in queries that are too general and too short. Besides classification and association rule mining, clustering is a basic technique for knowledge discovery, as in Figure 3. For most data mining tasks, we start out with a pre-classified training set and attempt to develop a model capable of predicting how a new record will be classified. In clustering, there is no pre-classified data and no distinction between independent and dependent variables. Instead, we search for groups of records that are similar to one another, in the expectation that similar records represent similar customers, suppliers or products that will behave in similar ways. Clustering is done such that the web access logs within the same group or cluster are more

• Discretization
• Reducts and rules generation on training set
• Classification on test set

Because rough set theory is a symbolic method rather than a numerical one, it cannot process continuous data directly. Discretization is a process that converts continuous data into discrete intervals for use with rough sets. There are a couple of popular techniques for discretizing data. Reducts and rules generation is the core of the rough set approach. In this part, the algorithm goes through the data set to generate reducts and rules [4]. In Rough Set


respect to a suitable similarity measure. Patterns that are similar are allocated to the same cluster, while patterns that differ significantly are put in different clusters [10].

similar than data points from different clusters. Each of the generated clusters represents an access pattern of a group of people with similar behavior [15].

[Figure 3 labels: Clustering; Frequent Pattern; Separation of classes; Association Rules; Classification]

5. Rough set for clustering the data search

5.1. Cluster analysis

Clustering analysis is a technique for grouping together users or data items (pages) with similar characteristics. Clustering of user information or pages can facilitate the development and execution of future marketing strategies. Clustering of users helps to discover groups of users who have similar navigation patterns. It is very useful for inferring user demographics in order to perform market segmentation in e-commerce applications or to provide personalized Web content to individual users. Clustering of pages is useful for Internet search engines and Web service providers, since it can be used to discover groups of pages with related content [7].

Cluster analysis is a second fundamental technique in both traditional data analysis and in data mining. The technique is defined as grouping ‘individuals or objects into clusters so that objects in the same cluster are more similar to one another than they are to objects in other clusters’. Many clustering methods have been identified, including partitioning, hierarchical, nonhierarchical, overlapping, and mixture models. In the last few decades, as data sets have grown in size and complexity, and the field of data mining has matured, many new techniques based on developments in computational intelligence have started to be more widely used as clustering algorithms. A technique currently receiving considerable attention is the theory of rough sets (Pawlak 1991). Applied to clustering problems, the technique is referred to as rough clustering [11].

4.1. Some benefits of Clustering on the Web.

5.2. Approximations

Figure 3. Role of Clustering in the KDD Process.

• Increasing Web information accessibility.
• Decreasing lengths in Web navigation pathways.
• Improving Web users requests servicing.
• Improving information retrieval.
• Improving content delivery on the Web.
• Understanding users’ navigation behavior.
• Integrating various data representation standards.
• Extending current Web information organizational practices.

Figure 4. Data Clustering Process. (Stages shown: Data Collection; Data Representation; Clustering Algorithm; Clusters.)

Clustering is the unsupervised classification of patterns into groups. The clustering problem is that of partitioning a population into clusters, as in Figure 4. The population is a set of n elements described by m attributes. The goal is to group a set of patterns into a number of more or less homogeneous clusters with

Figure 5. Rough concept [12].

Consider a non-empty set of objects $U$ called the universe. Suppose we want to define a concept over the universe of objects $U$. Let us assume that our concept can be represented as a subset $X$ of $U$.

A generalized approximation space is defined as a quadruple $A = (U, I, v, P)$, where

$I(x) = [x]_R$, where $R \subseteq U \times U$ is an indiscernibility relation,

$v(X, Y) = \frac{|X \cap Y|}{|X|}$, where $X, Y \subseteq U$,

$P : I(U) \to \{0, 1\}$ is a structurality function, with $P(I(x)) = 1$ for every $x \in U$.


authors proposed a method of clustering the clicks of user navigations. Let $t_i \in T$ be a user click-stream. The upper approximation $R(t_i)$ is a set of transactions similar to $t_i$; that is, a user who visits the hyperlinks in $t_i$ may also visit the hyperlinks present in the other transactions in $R(t_i)$. Similarly, $RR(t_i)$ is a set of transactions that are

Approximations in $A$ of any $X \subseteq U$ are then defined as

$L_A(X) = \{x \in U : P(I(x)) = 1 \wedge v(I(x), X) = 1\}$

$U_A(X) = \{x \in U : P(I(x)) = 1 \wedge v(I(x), X) > 0\}$

The central point of rough set theory is the notion of set approximation: any set in $U$ can be approximated by its lower and upper approximations [12]. With the definition given above, generalized approximation spaces can be used in any application where $I$, $v$ and $P$ can be appropriately determined.
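The two approximations can be illustrated with a short Python sketch. It assumes the classical setting, in which $I(x)$ is the indiscernibility class of $x$, $P(I(x)) = 1$ for every $x$, and the inclusion function is taken as $v(X, Y) = |X \cap Y| / |X|$, so that $v(I(x), X) = 1$ exactly when $I(x) \subseteq X$. The toy classes and all function names are hypothetical.

```python
from fractions import Fraction


def inclusion(x_set, y_set):
    """Rough inclusion v(X, Y) = |X ∩ Y| / |X| (taken as 1 for empty X)."""
    if not x_set:
        return Fraction(1)
    return Fraction(len(x_set & y_set), len(x_set))


def lower_approximation(universe, neighbourhood, target):
    """Objects x whose class I(x) lies entirely inside the target set."""
    return {x for x in universe if inclusion(neighbourhood(x), target) == 1}


def upper_approximation(universe, neighbourhood, target):
    """Objects x whose class I(x) overlaps the target set."""
    return {x for x in universe if inclusion(neighbourhood(x), target) > 0}


# Hypothetical example: three indiscernibility classes over six objects.
classes = [{1, 2}, {3, 4}, {5, 6}]
U = {1, 2, 3, 4, 5, 6}


def I(x):
    """Return the indiscernibility class [x]_R that contains x."""
    return next(c for c in classes if x in c)


X = {1, 2, 3}
lower = lower_approximation(U, I, X)   # {1, 2}
upper = upper_approximation(U, I, X)   # {1, 2, 3, 4}
boundary = upper - lower               # {3, 4}
```

In this example the class {3, 4} only partly overlaps X, so objects 3 and 4 fall in the boundary region: they belong to the upper approximation but not to the lower one.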

possibly similar to $R(t_i)$, and this process continues until two consecutive upper approximations for $t_i$ are the same. The result is called the similarity upper approximation of $t_i$ and is denoted by $S_i$. The similarity upper approximations for the n user transactions are calculated as follows:
(1) for $t_1$, the similarity upper approximation is $S_1$,
(2) for $t_2$, the similarity upper approximation is $S_2$,
⋮
(n) for $t_n$, the similarity upper approximation is $S_n$.

Now, if $S_i = S_j$ (with $i$ and $j$ distinct), allocate $t_i$ and $t_j$ to the same cluster. Proceeding in this way, we obtain a distribution of m disjoint clusters. Let these m clusters be $C_j$ ($j = 1, 2, \ldots, m$). Here, the $C_j$ are all distinct and $\bigcup_j C_j = T$. These $C_j$ represent the subgroups of transactions that make up the transaction clustering.
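The 0/1 vector representation of a transaction and the similarity measure that these approximations build on can be sketched in Python as follows. The hyperlink names and the two transactions are made-up examples, and the function names are illustrative only.

```python
def to_vector(transaction, hyperlinks):
    """Encode a transaction as a 0/1 vector over the ordered hyperlinks."""
    return [1 if hl in transaction else 0 for hl in hyperlinks]


def sim(t, s):
    """sim(t, s) = |t ∩ s| / |t ∪ s| for two non-empty sets of clicks."""
    return len(t & s) / len(t | s)


# Hypothetical universe of clicks and two user transactions.
hyperlinks = ["hl1", "hl2", "hl3", "hl4"]
t = {"hl1", "hl2", "hl3"}
s = {"hl2", "hl3", "hl4"}

print(to_vector(t, hyperlinks))  # [1, 1, 1, 0]
print(sim(t, s))                 # 0.5
print(sim(t, t))                 # 1.0
```

Here t and s share two of the four distinct hyperlinks, giving a similarity of 0.5, while a transaction compared with itself gives 1.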

Figure 6. Visualization of the lower approximation, upper approximation and boundary region of an example set X in the universe U [12].

Based on a theoretical rough set approach to clustering data, the algorithm is designed for user access transactions over the Web. A user transaction is a sequence of items in e-business.

5.3. Algorithm:

Input: A set of n objects contained in a set U, threshold th ∈ [0,1].
Output: Cluster scheme C.
Step 1: Start.
Step 2: Initially consider each object of U as a cluster of one member, $C_i = \{x_i\}$, and $C = \{C_1, C_2, \ldots, C_n\}$.
Step 3: For each cluster $C_i \in C$, for the threshold th, find the similarity upper approximation $S_i$.
Step 4: For all $C_i, C_j \in C$ such that $S_i = S_j$ ($i \neq j$), merge the clusters and update C.
Step 5: Output C.
Step 6: Stop.

Let there be m users and let the user transactions be $T = \{t_1, t_2, t_3, \ldots, t_m\}$. Let $U$ be the set of n distinct clicks (hyperlinks/URLs) clicked by users, $U = \{hl_1, hl_2, \ldots, hl_n\}$. Here, each $t_i \in T$ is a non-empty subset of $U$. The temporal order of user clicks within transactions has not been taken into account. A user transaction $t \in T$ can be represented as a vector $t = \langle u^t_1, u^t_2, \ldots, u^t_n \rangle$ where

$u^t_i = \begin{cases} 1 & \text{if } hl_i \in t, \\ 0 & \text{otherwise.} \end{cases}$

Given two transactions $t$ and $s$, the measure of similarity between $t$ and $s$ is given by

$sim(t, s) = \frac{|t \cap s|}{|t \cup s|}$

From the above definition it can be seen that $sim(t, s) \in [0, 1]$: $sim(t, s) = 1$ when the two transactions $t$ and $s$ are exactly identical, and $sim(t, s) = 0$ when they have no items in common. In [10]

Note that Step 3 of the algorithm iteratively collects the similar transactions based on the upper approximation, and Step 4 merges the clusters that have the same similarity upper approximation.
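The steps above can be sketched in Python as follows. The text does not spell out how the threshold enters the upper approximation, so this sketch assumes that $R(t)$ contains every transaction $s$ with $sim(t, s) \ge th$, and that each further approximation is the union of $R(s)$ over the transactions $s$ collected so far. The function names and the sample transactions are hypothetical.

```python
def sim(t, s):
    """Similarity |t ∩ s| / |t ∪ s| between two non-empty click sets."""
    return len(t & s) / len(t | s)


def upper_approximation(index, transactions, th):
    """Assumed R(t_i): indices of transactions at least th-similar to t_i."""
    t = transactions[index]
    return {j for j, s in enumerate(transactions) if sim(t, s) >= th}


def similarity_upper_approximation(index, transactions, th):
    """Repeat the upper approximation until it stops growing (Step 3)."""
    current = upper_approximation(index, transactions, th)
    while True:
        expanded = set()
        for j in current:
            expanded |= upper_approximation(j, transactions, th)
        if expanded == current:
            return frozenset(current)
        current = expanded


def rough_cluster(transactions, th):
    """Group transactions whose similarity upper approximations coincide."""
    clusters = {}
    for i in range(len(transactions)):
        s_i = similarity_upper_approximation(i, transactions, th)
        clusters.setdefault(s_i, []).append(i)  # Step 4: merge equal S_i
    return sorted(clusters.values())


# Hypothetical click-stream transactions over hyperlinks a..f.
T = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"e", "f"}, {"e"}]
print(rough_cluster(T, th=0.5))  # [[0, 1, 2], [3, 4]]
```

Under these assumptions, transactions 0 and 2 are not directly similar at th = 0.5, but both are similar to transaction 1, so the iteration in Step 3 gives all three the same similarity upper approximation and Step 4 places them in one cluster.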

6. Conclusion

During the last ten years, the rapid development of information technology and the rise of the Internet and the Web have revolutionized the way people use and access information for e-business. The paper shows that rough set theory can be used as a tool for knowledge discovery in e-business. There has been a large increase in the amount of information available on the World Wide Web and in the number of online business databases. This abundance of information increases the complexity of locating relevant information. In our opinion, classical rough set theory is interesting for two main reasons. The first is the idea of approximating an inconsistent set with two consistent ones, which is very elegant. The second is its mathematical formulation, which is simple and formal at the same time. In this paper, we have explained theoretically how successive application of the upper and lower approximations from rough set theory can be used. The algorithm may also be applicable in practice to clustering web data for e-business.

7. References

[1] Amy Shi, Allen Long and David Newcomb, “Enhancing e-Business Through Web Data Mining”, Proceedings of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, University Lyon 2, Lyon, September 16, 2000.

[2] Vijay V. Raghavan, Jitender S. Deogun and Hayri Sever, “Data Mining: Trends and Issues”, Proceedings of the Rough Set and Current Trends in Computing 2004, pp. 275-289.

[3] N. R. Srinivasa Raghavan, “Data mining in e-commerce: A survey”, Sadhana, Vol. 30, Parts 2 & 3, April/June 2005, pp. 275–289.

[4] Ahmad Farhan Bin Jaafar, Jamilin Jais, Mohd Hakim Bin Haji Abdul Hamid, Zarina Binti Abdul Rahman and Djame Benaouda, “Using rough set as a tool for knowledge discovery in DSS”, Current Developments in Technology-Assisted Education, 2006, pp. 1011-1015.

[5] Matteo Magna, “Technical report on Rough Set Theory for Knowledge Discovery in Data Bases”, University of Bologna, Department of Computer Science, July 1, 2003.

[6] R. Cooley, “Web Usage Mining: Discovery and Application of Interesting Patterns from Web data”, PhD thesis, Dept. of Computer Science, University of Minnesota, May 2000.

[7] R. Cooley, B. Mobasher, and J. Srivastava, “Data preparation for mining World Wide Web browsing patterns”, Knowledge and Information Systems, 1(1), 1999.

[8] S.K. Madria, S.S. Bhowmick, W.K. Ng, and E.P. Lim, “Research issues in Web data mining”, Proceedings of the Data Warehousing and Knowledge Discovery, First International Conference, DaWaK ’99, pp. 303-312.

[9] Myra Spiliopoulou, Bettina Berendt and Ernestina Menasalvas, “Evaluation in Web Mining”, Tutorial at ECML/PKDD 2004, Pisa, Italy, September 20, 2004.

[10] Supriya Kumar De, P. Radha Krishna, “Clustering web transactions using rough approximation”, Fuzzy Sets and Systems, vol. 148, 2004, pp. 131–138.

[11] Kevin E. Voges, “Research Techniques Derived From Rough Sets Theory: Rough Classification and Rough Clustering”, Proceedings of the 4th European Conference on Research Methodology for Business and Management Studies.

[12] Ngo Chi Lang, “A tolerance rough set approach to clustering web search results”, Master thesis in Computer Science, Warsaw University, Faculty of Mathematics, Informatics and Mechanics, December 2003.

[13] Caroline John Peter, “The Rough Set Approach in Data mining”, Project Submitted for the Course Rough Sets and Applications, University of Regina.

[14] Bettina Berendt, Bamshad Mobasher, Myra Spiliopoulou, “Web Usage Mining for E-Business Applications”, ECML/PKDD-2002 Tutorial, 19 August 2002.

[15] S. Asharaf, M. Narasimha Murty, “A rough fuzzy approach to web usage categorization”, Fuzzy Sets and Systems, vol. 148, 2004, pp. 119-129.
