Comparative Study of Different Web Mining Algorithms to Discover Knowledge on the Web

Mohd Shoaib∗ and Ashish K. Maurya

PG Student, Faculty of Computer Science and Engineering, Shri Ramswaroop Memorial University, Uttar Pradesh 225 053, India.
e-mail: [email protected], [email protected]

∗ Corresponding author

Abstract. Nowadays the World Wide Web (commonly called the Web) is used widely and has impacted almost every facet of our lives. Searching and retrieving information from the Web requires effective and efficient techniques, as the task has become a challenge due to the expanding size and complexity of the Web. Web mining tackles this problem by gathering useful information from the Web through its three categories: web structure mining, web content mining and web usage mining. In this paper we discuss the area of web mining, its categories and the algorithms associated with it. The algorithms discussed are PageRank, SimRank, TF-IDF, k-nearest neighbour, PageGather and CDL4. We summarize the algorithms over parameters such as their working, input parameters, complexity, and their pros and cons. We also analyze the discussed algorithms over the parameters relevance, technique and regression analysis.

Keywords: Web mining, Web structure mining, Web content mining, Web usage mining, PageRank, SimRank, TF-IDF, kNN, PageGather, CDL4.

1. Introduction

Web mining is a data mining technique, commonly defined as the process of discovering useful patterns, knowledge or information from sources of data on the Web, in the form of texts, images, databases, multimedia, etc. The patterns must be suitable, convincing, potentially useful and understandable. Web mining is a multi-disciplinary area which combines fields such as artificial intelligence, information retrieval, machine learning, statistics, databases and visualization [1]. Web mining can be decomposed into the following subtasks [2]:

• Resource finding
• Information selection and pre-processing
• Generalization
• Analysis

Thus, the aim of web mining is to determine useful knowledge or information from usage data, page content and the web hyperlink structure. Although web mining uses various techniques of data mining, it is not simply an application of traditional data mining, due to the multifarious and semi-structured or unstructured nature of the data on the Web. A variety of new mining tasks and algorithms have been invented; based on the basic types of data used in the mining process, web mining tasks can be categorized into three types [3–5]: web content mining, web structure mining and web usage mining, as shown in figure 1.

2. Web Structure Mining

Web structure mining determines useful knowledge from hyperlinks (links), which represent the structure of the Web. For example, from the links it can be determined which web pages are important, which is a main technology used in search engines. It can also be used to discover communities of users who share common interests.




Figure 1. Web mining categories.

The Web is denoted by objects, which are the web pages, and by links, characterized by in-citation, out-citation and co-citation (two pages cited by the same page). An object may contain diverse attributes such as HTML tags, word appearances and anchor texts [6]. Link mining is done to deal with this diversity of objects. Some possible tasks applied in web structure mining are [7]:

• Link-based Classification
• Link-based Cluster Analysis
• Link Type
• Link Strength
• Link Cardinality

There are various web structure mining algorithms such as PageRank [8], weighted PageRank, topic-sensitive PageRank, HITS [8], DistanceRank [9], SimRank [10], etc. We have selected two structure mining algorithms, PageRank and SimRank, for discussion. PageRank is the basic and most used algorithm, while SimRank is a newer algorithm that can be varied using various techniques to get better results.

2.1 PageRank

PageRank [11], proposed by L. Page and S. Brin, calculates the significance of web pages using the link structure of the web. This approach works on the concept of counting in-links equally, normalized by the number of links present on a page. PageRank is defined by assuming that page X has pages T1 . . . Tn which cite it. The parameter d is a damping factor having a value between 0 and 1; it is usually set to 0.85. The PageRank of a page X is given by:

PR(X) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right) \qquad (1)

where C(T_i) is the number of out-links on page T_i, PR(T_i) is the PageRank of a page T_i that links to page X, and d is the damping factor. A probability distribution over the web pages is formed such that the sum of the PageRank of all web pages is one. The PageRank of a page can be calculated without first knowing the final PageRank values of the other pages: the algorithm runs iteratively, following the principle of the normalized link matrix of the web.
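As an illustration, the following is a minimal sketch in Python of the iterative computation of equation (1); the graph representation, iteration count and convergence threshold are our own assumptions, not prescribed by the paper.

```python
def pagerank(out_links, d=0.85, iterations=50, tol=1.0e-6):
    """Iteratively compute equation (1) for every page.

    out_links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its PageRank score.
    """
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}          # initial guess for PR(X)
    # Pre-compute in-links: which pages T cite each page X.
    in_links = {p: [] for p in pages}
    for p, targets in out_links.items():
        for t in targets:
            in_links.setdefault(t, []).append(p)
    for _ in range(iterations):
        new_pr = {}
        for x in pages:
            # PR(X) = (1 - d) + d * sum of PR(T)/C(T) over citing pages T
            rank_sum = sum(pr[t] / len(out_links[t]) for t in in_links[x])
            new_pr[x] = (1.0 - d) + d * rank_sum
        if max(abs(new_pr[p] - pr[p]) for p in pages) < tol:
            pr = new_pr
            break
        pr = new_pr
    return pr

# Example: a tiny three-page web.
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```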

2.2 SimRank

SimRank [10] measures similarity by link structure analysis, based on the intuition that "two objects are similar if they are related to similar objects" [10]. Depending on the domain and the suitable definition of similarity for that domain, various characteristics of objects can be used to decide similarity. SimRank proposes a general approach that makes use of the object-to-object relationships found in many domains. For example, on the Web we can say that two pages are related if there are hyperlinks between them; scientific papers and their citations, or any other cross-referenced documents, admit the same approach. Domains are denoted as graphs, with nodes denoting objects and edges denoting relationships. The SimRank algorithm analyses the (logical) graphs derived from data sets to compute similarity scores based on the structural context between nodes (objects). The basic concept behind the algorithm is that objects m and n are similar if they are related to objects x and y, respectively, and x and y are themselves similar. The similarity between objects m and n is given by s(m, n) ∈ [0, 1]. If m = n, then s(m, n) is defined to be 1. Otherwise,

s(m, n) = \frac{C}{|I(m)|\,|I(n)|} \sum_{i=1}^{|I(m)|} \sum_{j=1}^{|I(n)|} s(I_i(m), I_j(n)) \qquad (2)

where C is a constant between 0 and 1 and I(m) denotes the set of in-neighbours of m. A case to be pointed out here is that either m or n may have no in-neighbours; in this case there is no way to infer any similarity between m and n, so s(m, n) is set to 0, and the summation in equation (2) is defined to be 0 when I(m) = Ø or I(n) = Ø. To compute s(m, n), we iterate over all in-neighbour pairs (I_i(m), I_j(n)) of (m, n), sum the similarities s(I_i(m), I_j(n)), and then normalize by dividing the computed value by the total number of in-neighbour pairs, |I(m)||I(n)|. That is, the similarity between m and n is the average similarity between the in-neighbours of m and the in-neighbours of n.
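A minimal sketch of the iterative evaluation of equation (2) follows; the in-neighbour dictionary encoding, the constant C = 0.8 and the fixed iteration count are illustrative assumptions on our part.

```python
from itertools import product

def simrank(in_neighbours, C=0.8, iterations=10):
    """Iteratively compute equation (2).

    in_neighbours: dict mapping each node to the list of nodes linking to it.
    Returns a dict mapping node pairs (m, n) to similarity scores s(m, n).
    """
    nodes = list(in_neighbours)
    # s(m, m) = 1; s(m, n) = 0 initially for m != n.
    s = {(m, n): 1.0 if m == n else 0.0 for m in nodes for n in nodes}
    for _ in range(iterations):
        new_s = {}
        for m, n in product(nodes, nodes):
            if m == n:
                new_s[(m, n)] = 1.0
                continue
            I_m, I_n = in_neighbours[m], in_neighbours[n]
            if not I_m or not I_n:
                new_s[(m, n)] = 0.0   # summation defined as 0 for empty I(m) or I(n)
                continue
            total = sum(s[(a, b)] for a in I_m for b in I_n)
            new_s[(m, n)] = C * total / (len(I_m) * len(I_n))
        s = new_s
    return s

# Example: two pages cited by the same page become similar.
scores = simrank({"A": [], "B": ["A"], "C": ["A"]})
print(scores[("B", "C")])   # high similarity: B and C share in-neighbour A
```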


3. Web Content Mining

Web content mining extracts useful information or knowledge from the contents of web pages. For example, web pages can be automatically classified and clustered according to their topics. Patterns can also be discovered in web pages to extract useful data for different purposes, such as descriptions of products (to extract information about the types of users who access the products, their reviews, market share, etc.) or postings of forums (to extract information about the most discussed topics, reviews, feedback, etc.). Customer reviews and forum postings can also be extracted to determine consumer sentiment. In web content mining, a distinction can be made between two points of view: the agent-based approach and the database approach. The first approach aims at improving information finding and filtering, and can be divided into the following three categories [12]:

• Intelligent Search Agents
• Information Filtering/Categorization
• Personalized Web

The second approach is more structured: the aim is to model the data on the web so that standard database querying mechanisms can be applied and the data analysed by data mining applications. Its two main categories are multi-level databases and web query systems [3,13,14]. There are various web content mining algorithms, such as TF-IDF, k-nearest neighbour, Naïve Bayes, AdaBoost, decision-tree-based algorithms, induction-rule-based algorithms [15], etc. Here we discuss only TF-IDF and k-nearest neighbour, as they are the two most widely used content mining algorithms.

3.1 Term frequency-inverse document frequency

TF-IDF (Term Frequency-Inverse Document Frequency) [16] is a statistical measure of how significant a word/term is to a document in a collection or corpus. The significance of a word in a document increases proportionally with the number of times it appears, but is offset by the frequency of the word in the corpus. For a given user query, the TF-IDF weighting scheme, with some variations, is used by search engines as a major tool in scoring/ranking documents and deciding upon a document's relevance. Let D denote the collection of documents in the corpus and T the collection of terms (unique tokens) in D. For a given term t_i, the term frequency (tf) within a specific document d_j is defined as the number of occurrences of the term in d_j, which is equal to n_{i,j}, the number of occurrences of the term t_i in the document d_j:

tf_{i,j} = n_{i,j} \qquad (3)

The term frequency is normalized to avoid a bias towards longer documents, as shown below:

tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \qquad (4)

where n_{i,j} is the number of occurrences of the term t_i in the document d_j, and the denominator, the total number of terms in d_j, is used for normalization. The inverse document frequency (idf) is obtained by taking the logarithm of the total number of documents divided by the number of documents containing the term t_i:

idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \qquad (5)



where |D| is the total number of documents in the collection and |{d : t_i ∈ d}| is the number of documents in which the term t_i appears. To avoid division by zero, 1 is added to the denominator, making it 1 + |{d : t_i ∈ d}|. The tf-idf weight for a given corpus D is then defined as:

(tf\text{-}idf)_{i,j} = tf_{i,j} \cdot idf_i \qquad (6)

Over the corpus, TF-IDF therefore calculates the following:

• the number of times each term occurs in every document;
• the number of unique terms;
• the number of documents.
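A minimal sketch of equations (3) to (6) over a toy corpus follows; the pre-tokenized documents and the add-one smoothing of the denominator in equation (5) are our illustrative choices.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Compute the tf-idf weight of every term in every document.

    corpus: list of documents, each a list of tokens.
    Returns a list of dicts, one per document, mapping term -> tf-idf.
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)            # n_{i,j}, equation (3)
        total = len(doc)                 # sum over k of n_{k,j}, equation (4)
        weights.append({
            # equation (4) times the smoothed equation (5), combined as in (6)
            term: (n / total) * math.log(n_docs / (1 + df[term]))
            for term, n in counts.items()
        })
    return weights

docs = [["web", "mining", "web"], ["data", "mining"], ["web", "search"]]
print(tf_idf(docs)[1])   # weights for the second document
```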

3.2 k-Nearest neighbour

kNN [17] is among the oldest nonparametric classification algorithms. It is a lazy learning method, in the sense that no model is learned from the training data; learning occurs only when a test example needs to be classified. In k-nearest neighbour (kNN) classification, the closeness of the test object to a group of k objects in the training set is computed, and a label is assigned based on the prevalence of a particular class in this neighbourhood. There are three key components of this approach: a set of labelled objects (e.g., a set of stored records), a similarity or distance metric to compute the distance between objects, and the value of k, the number of nearest neighbours. To classify an unlabeled object, its distance to the labelled objects is computed, its k nearest neighbours are identified, and the class labels of these nearest neighbours are used to determine the class label of the object. Following are the steps of the algorithm [18]:

Algorithm kNN(D, d, k)
1) Calculate the distance between test instance d and every example in training set D.
2) Choose the k examples in training set D that are nearest to test instance d; denote this set by P (⊆ D).
3) Assign to test instance d the most frequent (majority) class in P.
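The three steps above can be sketched as follows, assuming Euclidean distance as the metric (the algorithm itself leaves the distance measure open):

```python
import math
from collections import Counter

def knn_classify(D, d, k):
    """Classify test instance d using the k nearest examples in D.

    D: list of (feature_vector, class_label) training pairs.
    d: feature vector of the test instance.
    """
    # Step 1: distance from d to every training example (Euclidean, assumed).
    dists = [(math.dist(x, d), label) for x, label in D]
    # Step 2: the k nearest examples form the neighbourhood P.
    P = sorted(dists)[:k]
    # Step 3: majority class in P.
    return Counter(label for _, label in P).most_common(1)[0][0]

train = [((1.0, 1.0), "spam"), ((1.2, 0.9), "spam"), ((5.0, 5.0), "ham")]
print(knn_classify(train, (1.1, 1.0), k=3))   # -> "spam"
```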

4. Web Usage Mining

Web usage mining is the discovery of meaningful user access patterns from web usage logs, which record every click made by each user. The meaningful patterns that can be discovered include: e-commerce and product-oriented user events; automatically generated data stored in server access logs, agent logs, client-side cookies and referrer logs; user profiles and/or user ratings; and meta-data such as page attributes, page content and site structure. One of the main concerns in web usage mining is the pre-processing of the click data in usage logs to produce appropriate and accurate data for mining. The challenges in web usage mining are categorised into three phases [2]:

• Pre-processing
• Pattern discovery
• Pattern analysis

Web usage mining depends on the cooperation of users to allow access to the web log records. Because of this reliance on users, privacy has emerged as a major issue in web usage mining: users should be made aware of privacy policies before they decide whether to reveal their personal data. There are various web usage mining algorithms, such as PageGather, CDL4, Leader, Cobweb, Iterate [19], etc.; we discuss the PageGather and CDL4 algorithms.

4.1 PageGather

The PageGather [20] algorithm uses cluster mining (relying on the visit-coherence assumption) to find collections of related pages at a web site. PageGather takes as input a web server access log [21] and maps it into a form ready for clustering; cluster mining is then applied to the data, and candidate index-page contents are produced as output. The algorithm has four basic steps:

1) Process the access log and identify visits.
2) Create a similarity matrix by computing the co-occurrence frequencies between pages.
3) Create a graph from the matrix, and find cliques (or connected components) in the graph.
4) For the documents in each cluster, create a web page consisting of links to them.
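Below is a minimal sketch of steps 2 and 3 (the co-occurrence matrix and the connected-component variant of the clustering), assuming step 1 has already turned the access log into visits represented as sets of pages; the co-occurrence threshold is an illustrative parameter.

```python
from collections import Counter
from itertools import combinations

def page_gather(visits, threshold=2):
    """Cluster pages that frequently co-occur within visits.

    visits: list of visits, each a set of pages viewed in one visit
    (assumed step 1 output, already extracted from the access log).
    threshold: minimum co-occurrence count for an edge (our assumption).
    Returns a list of page clusters (connected components, step 3).
    """
    # Step 2: co-occurrence frequencies between pairs of pages.
    co_occurrence = Counter()
    for visit in visits:
        for a, b in combinations(sorted(visit), 2):
            co_occurrence[(a, b)] += 1
    # Build the graph, keeping only sufficiently frequent pairs.
    graph = {}
    for (a, b), count in co_occurrence.items():
        if count >= threshold:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    # Step 3: find connected components by depth-first search.
    clusters, seen = [], set()
    for start in graph:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            page = stack.pop()
            if page not in seen:
                seen.add(page)
                component.add(page)
                stack.extend(graph[page] - seen)
        clusters.append(component)
    return clusters

visits = [{"home", "docs"}, {"home", "docs"}, {"home", "blog"}]
print(page_gather(visits))   # -> [{"home", "docs"}]
```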

4.2 CDL4

The goal of the CDL4 algorithm [22] is to incrementally construct a decision list L from a sequence of training instances under supervision. When a new instance x is misclassified, CDL4 revises L to classify x correctly. To accomplish this, the algorithm performs three steps:

1) Establish whether there is a misclassification and, if so, which decision D_j in L is responsible.
2) Establish to what extent D_j should be discriminated.
3) Discriminate D_j and modify L.

Here D_j denotes the j-th decision in the list. For the first task, CDL4 searches through the decision list and returns the decision D_j = (f_j, v_j), where j is the least index such that f_j(x) = 1; f_j is a conjunction of literals and v_j is a value in {0, 1}. As a whole, we refer to the pair (f_j, v_j) as the j-th decision, f_j as the j-th decision test and v_j as the j-th decision value. If v_j is not equal to x's given classification, then f_j is the decision test that needs to be discriminated; otherwise, the instance x is recorded as an example of D_j.
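The first task can be sketched as follows, assuming a conjunction of literals is encoded as a set of attributes that must all be true in the instance; the discrimination step of tasks 2 and 3 is omitted.

```python
def first_firing_decision(L, x):
    """Return (j, decision) for the least j with f_j(x) = 1, or None.

    L: decision list of (test, value) pairs, where each test is a
    frozenset of attributes that must all hold in x (our assumed
    encoding of a conjunction of literals).
    x: the set of attributes true in the instance.
    """
    for j, (f_j, v_j) in enumerate(L):
        if f_j <= x:                       # f_j(x) = 1: conjunction satisfied
            return j, (f_j, v_j)
    return None

def check_instance(L, x, label):
    """Decide whether the responsible decision D_j misclassifies (x, label)."""
    hit = first_firing_decision(L, x)
    if hit is None:
        return "no decision fired"
    j, (f_j, v_j) = hit
    if v_j != label:
        return f"decision {j} misclassifies x: test {set(f_j)} must be discriminated"
    return f"x recorded as an example of decision {j}"

L = [(frozenset({"a", "b"}), 1), (frozenset(), 0)]   # default decision last
print(check_instance(L, {"a", "b", "c"}, 0))
```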


Table 1. Comparison of web mining algorithms.

Table 2. Performance measurement of algorithms.

5. Comparison and Discussion

A point to be noted is that there is no clear boundary dividing the above categories. The three categories of mining (structure, content, usage) as defined here are widely accepted and can be used in isolation or combined in an application. Table 1 shows a comparison of the discussed algorithms in their respective categories of mining. These algorithms can be combined to produce an effective ranking of pages and documents on the web. Table 2 compares the web mining algorithms over another set of parameters: relevance, link analysis, supervised learning, un-supervised learning and regression analysis. Relevance is defined as how relevant the processing and results of an algorithm are to a useful application. Link analysis is a data analysis technique used to evaluate relationships between nodes. Supervised learning is machine learning from labelled data collected in the past, representing past experiences in some real-world application. Un-supervised learning is the discovery of hidden structure in unlabeled data. Regression analysis in machine learning is used to predict the relationship between variables.

PageRank, a link analysis algorithm, has lower relevance in this sense, as it ranks pages at indexing time and is used to predict a website's future PageRank based on the quality and quantity of back-links. SimRank, also a link analysis algorithm, has high relevance, as it computes structural, textual and context similarity and performs link prediction in social networks and recommendation systems. Both PageRank and SimRank can be implemented with supervised and un-supervised learning techniques. TF-IDF, a text classification technique, has high relevance, as it provides query-dependent results according to the relevance a document has to the query, and can be used to predict hyperlinks. kNN, a machine learning approach to text classification, has lower relevance, as test records are not classified until they exactly match some of the training records. PageGather, a partitioning clustering algorithm, has higher relevance, as it uses web server logs to identify user patterns, redesign web pages and predict specific URLs for recommendation. CDL4 has lower relevance: it computes and predicts the preferences of visitors, but its very complex decision rules are hard to interpret.

6. Conclusions

In this paper, we have studied various algorithms that are used for web mining, and discussed their applications and effectiveness. Within the web mining categories we discussed the PageRank, SimRank, Term Frequency-Inverse Document Frequency, k-Nearest Neighbour, PageGather and CDL4 algorithms. All these algorithms simplify the web mining process and together work to give a better result. SimRank is used in varied forms to give better results than PageRank. TF-IDF uses term weighting to rank the content of documents, while kNN uses training data to identify objects and find the closest similar object. CDL4 is not widely used due to its complex rules, but PageGather is easy to use, as it only needs access to web server logs. Users find it difficult to locate desired information, and to decide which information is relevant to them, using general-purpose search engines. The algorithms discussed make document searching on the internet easier, as together they can be used in the ranking process to refine results for web documents.

References

[1] G. D. Kumar and M. Gosul, "Web Mining Research and Future Directions", Advances in Network Security and Applications, 4th International Conference on Network Security and Applications, pp. 489–496, Springer Berlin Heidelberg, (2011).
[2] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, (2000).
[3] R. Kosala and H. Blockeel, "Web Mining Research: A Survey", ACM SIGKDD Explorations Newsletter, vol. 2, issue 1, pp. 12–15, June (2000).
[4] J. Srivastava, R. Cooley, M. Deshpande and Pang-Ning Tan, "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data", ACM SIGKDD Explorations Newsletter, vol. 1, issue 2, pp. 12–23, January (2000).


[5] B. Singh and H. K. Singh, IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), pp. 1–10, (2010).
[6] L. Getoor, "Link Mining: A New Data Mining Challenge", ACM SIGKDD Explorations, vol. 5, issue 1, pp. 84–93, July (2003).
[7] Miguel Gomes da Costa Júnior and Zhiguo Gong, "Web Structure Mining: An Introduction", Proceedings of the IEEE International Conference on Information Acquisition, Hong Kong and Macau, China, (2005).
[8] Daxin Jiang, Jian Pei and Hang Li, "Mining Search and Browse Logs for Web Search: A Survey", ACM Transactions on Computational Logic, vol. V, no. N, pp. 1–42, (2013).
[9] Ali M. Z. Bidoki and N. Yazdani, "DistanceRank: An Intelligent Ranking Algorithm for Web Pages", Information Processing & Management: An International Journal, vol. 44, issue 2, pp. 877–892, (2008).
[10] Glen Jeh and Jennifer Widom, "SimRank: A Measure of Structural-Context Similarity", Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 538–543, (2002).
[11] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Proceedings of the Seventh International World Wide Web Conference, (1998).
[12] R. Cooley, B. Mobasher and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web", Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence, pp. 558–567, (1997).
[13] A. A. Barfourosh, H. R. Motahary Nezhad, M. L. Anderson and D. Perlis, "Information Retrieval on the World Wide Web and Active Logic: A Survey and Problem Definition", Technical Report CS-TR-4291, UMIACS-TR-2001-69, UM Computer Science Department, (2002).
[14] W. Jicheng, H. Yuan, W. Gangshan and Z. Fuyan, "Web Mining: Knowledge Discovery on the Web", Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, vol. 2, pp. 137–141, (1999).
[15] M. Azmy, "Web Content Mining Research: A Survey", Draft Report, the Organizational Web Mining Group, the Central Lab for Agricultural Expert Systems, (2005).
[16] L. H. Patil and M. Atique, "A Novel Approach for Feature Selection Method TF-IDF in Document Clustering", IEEE 3rd International Advance Computing Conference (IACC), pp. 858–862, (2013).
[17] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, Angus Ng, B. Liu, Philip S. Yu, Zhi-Hua Zhou, M. Steinbach, D. J. Hand and D. Steinberg, "Top 10 Algorithms in Data Mining", Knowledge and Information Systems, Springer-Verlag, (2007).
[18] B. Liu, Web Data Mining: Exploring Hyperlinks, Contents and Usage Data, Springer-Verlag Berlin Heidelberg, (2007).
[19] D. Pierrakos, G. Paliouras, C. Papatheodorou and Constantine D. Spyropoulos, "Web Usage Mining as a Tool for Personalization: A Survey", User Modeling and User-Adapted Interaction, vol. 13, issue 4, pp. 311–372, Kluwer Academic Publishers, (2003).
[20] M. Perkowitz and O. Etzioni, "Adaptive Web Sites: Automatically Synthesizing Web Pages", Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence, pp. 727–732, (1998).
[21] S. Valsamidis, S. Kontogiannis, I. Kazanidis, T. Theodosiou and A. Karakos, "A Clustering Methodology of Web Log Data for Learning Management Systems", Educational Technology & Society, vol. 15, no. 2, pp. 154–167, (2012).
[22] W. M. Shen, "An Efficient Algorithm for Incremental Learning of Decision Lists", Technical Report USC-ISI-96-012, Information Sciences Institute, University of Southern California, (1996).


© Elsevier Publications 2014.
