International Journal of Scientific & Engineering Research Volume 3, Issue 2, February-2012
ISSN 2229-5518
Web Mining Using Topic Sensitive Weighted PageRank
Shesh Narayan Mishra, Alka Jaiswal, Asha Ambhaikar
Abstract: The World Wide Web contains a large number of information sources. When searching the Web for a particular topic, users often retrieve irrelevant and redundant information, which wastes both the user's time and the search engine's access time. To narrow down this problem, inferring users' interests and needs from their behavior has become increasingly important. Web structure mining plays an effective role in this approach. Page ranking algorithms such as PageRank and Weighted PageRank are commonly used in Web structure mining. The original PageRank algorithm ranks search-query results independently of any particular search query. To yield more specific and accurate search results for a particular topic, we propose a new algorithm, Topic Sensitive Weighted PageRank (TSWPR), based on Web structure mining, which is intended to determine the relevancy of pages to a given topic better than the existing PageRank, Topic Sensitive PageRank and Weighted PageRank algorithms. For ordinary keyword search queries, Topic Sensitive Weighted PageRank scores are intended to reflect the topic of the query.
Index Terms: Web structure mining; Weighted PageRank; Topic Sensitive PageRank; TSWPR.
—————————— ——————————
1 INTRODUCTION
Today, the World Wide Web is a popular and interactive medium for disseminating information. The Web is huge, diverse and dynamic. It contains a vast amount of information and provides access to it from any place at any time. Most people use the Internet to retrieve information, but much of the time they get many insignificant and irrelevant documents even after navigating several links. Web mining techniques are used to retrieve information from the Web.
2 WEB MINING OVERVIEW
Web mining is the application of data mining techniques to automatically discover and extract knowledge from the Web. According to Kosala et al. [3], Web mining consists of the following tasks:
Resource finding: the task of retrieving intended Web documents.
Information selection and pre-processing: automatically selecting and pre-processing specific information from retrieved Web resources.
Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites.
Analysis: validation and/or interpretation of the mined patterns.
There are three areas of Web mining, distinguished by the kind of Web data used as input to the data mining process, namely Web Content Mining (WCM), Web Usage Mining (WUM) and Web Structure Mining (WSM).
———— ——— ——— ——— ———
Mr. Shesh Narayan Mishra is presently studying in the final semester of M.Tech (Computer Technology) at Rungta College of Engineering and Technology, Bhilai, C.G., India. He obtained the degree of MCA from CSVTU, Bhilai University in 2009. Research interests include Web mining techniques.
Alka Jaiswal is presently working as Asst. Prof. in the Department of Information Technology at Rungta College of Engineering and Technology, Bhilai, C.G., India. Was awarded the degree of M.Tech (Information Security) with honors from National Institute of Technology, Bhopal (M.P.), India in 2010. Research interests include database security and data mining. One research paper has been published in an international journal.
Prof. Asha Ambhaikar is currently working as Associate Professor in Computer Science Engineering at RCET Bhilai, India, PH-09229655211. Email:
[email protected]
The three categories are Web usage mining, Web structure mining, and Web content mining. Web usage mining refers to the discovery of user access patterns from Web usage logs. Web structure mining tries to discover useful knowledge from the structure of hyperlinks, which helps to investigate the node and connection structure of Web sites. According to the type of Web structural data, Web structure mining can be divided into two kinds: 1) extracting the documents from hyperlinks in the Web, and 2) analysis of the tree-like structure of page structure. Based on the topology of the hyperlinks, Web structure mining will categorize the Web pages and generate information,
IJSER © 2012 http://www.ijser.org
such as the similarity and relationship between different Web sites. Web content mining is concerned with the retrieval of information from the WWW into a more structured form and with indexing the information so that it can be retrieved quickly. Web usage mining is the process of identifying browsing patterns by analyzing the user's navigational behavior. Web structure mining aims to discover the model underlying the link structures of Web pages, catalog them, and generate information such as the similarity and relationship between them, taking advantage of their hyperlink topology. Web classification is shown in Fig 1.
2.1 Web Content Mining (WCM)
Web Content Mining is the process of extracting useful information from the contents of Web documents. Web documents may consist of text, images, audio, video or structured records such as tables and lists. Mining can be applied to Web documents as well as to the result pages produced by a search engine. There are two approaches to content mining: the agent-based approach and the database-based approach. The agent-based approach concentrates on searching for relevant information, using the characteristics of a particular domain to interpret and organize the collected information. The database approach is used for retrieving semi-structured data from the Web.
2.2 Web Usage Mining (WUM)
Web Usage Mining is the process of extracting useful information from the secondary data derived from the interactions of the user while surfing the Web. It extracts data stored in server access logs, referrer logs, agent logs, client-side cookies, user profiles and metadata.
2.3 Web Structure Mining
The goal of Web Structure Mining is to generate a structural summary of the Web site and Web page. It tries to discover the link structure of the hyperlinks at the inter-document level. Based on the topology of the hyperlinks, Web Structure Mining categorizes the Web pages and generates information such as the similarity and relationship between different Web sites. This type of mining can be performed at the document level (intra-page) or at the hyperlink level (inter-page). It is important to understand the Web data structure for information retrieval.
3 RELATED WORK
3.1 PageRank
Brin and Page developed the PageRank algorithm during their PhD at Stanford University, based on citation analysis. The PageRank algorithm is used by the well-known search engine Google. They applied citation analysis to Web search by treating incoming links as citations to Web pages. However, simply applying citation analysis techniques to the diverse set of Web documents did not produce efficient outcomes. Therefore, PageRank provides a more advanced way to compute the importance or relevance of a Web page than simply counting the number of pages that link to it (called "back links"). If a back link comes from an "important" page, then that back link is given a higher weighting than back links that come from non-important pages. Put simply, a link from one page to another page may be considered a vote. However, it is not only the number of votes a page receives that is considered important, but also the "importance" or "relevance" of the pages that cast these votes. Assume an arbitrary page A has pages T1 to Tn pointing to it (incoming links). PageRank can then be calculated as follows.
$$PR(A) = (1 - d) + d\left(\frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)}\right) \tag{1}$$
The parameter d is a damping factor, usually set to 0.85 (to stop the other pages having too much influence, this total vote is "damped down" by multiplying it by 0.85). C(A) is defined as the number of links going out of page A. When the (1 - d) term is divided by the total number of pages, the PageRanks form a probability distribution over the Web pages, so the PageRanks of all Web pages sum to one. PageRank can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the Web.
3.2 Weighted PageRank
Wenpu Xing and Ali Ghorbani [1] proposed the Weighted PageRank (WPR) algorithm, which is an extension of the PageRank algorithm. This algorithm assigns larger rank values to the more important pages rather than dividing the rank value of a page evenly among its outgoing linked pages. Each outgoing link gets a value proportional to its importance. The importance is assigned in terms of weight values for the incoming and outgoing links, denoted $W^{in}_{(m,n)}$ and $W^{out}_{(m,n)}$ respectively. $W^{in}_{(m,n)}$, as shown in (2), is the weight of link(m, n) calculated based on the number of incoming links of page n and the number of incoming links of all reference pages of page m.
$$W^{in}_{(m,n)} = \frac{I_n}{\sum_{p \in R(m)} I_p} \tag{2}$$
$$W^{out}_{(m,n)} = \frac{O_n}{\sum_{p \in R(m)} O_p} \tag{3}$$
Here $I_n$ and $I_p$ are the number of incoming links of page n and page p, respectively, and R(m) denotes the reference page list of page m. $W^{out}_{(m,n)}$, as shown in (3), is the weight of link(m, n) calculated based on the number of outgoing links of page n and the number of outgoing links of all reference pages of page m, where $O_n$ and $O_p$ are the number of outgoing links of page n and page p, respectively. The formula proposed by Wenpu et al. for the WPR, shown in (4), is a modification of the PageRank formula.
$$WPR(n) = (1 - d) + d \sum_{m \in B(n)} WPR(m)\, W^{in}_{(m,n)}\, W^{out}_{(m,n)} \tag{4}$$
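As an illustration only, equations (1) and (4) can be iterated on a small link graph as in the sketch below. The names `links`, `pagerank`, `link_weights` and `weighted_pagerank` are hypothetical and do not come from the paper. The sketch assumes that B(n) is the set of pages linking to n and that R(m) is the set of pages that m links to, as in the WPR paper [1]; it also assumes every listed outgoing link is unique, and it treats a zero denominator in (3) as a zero weight, which is a choice made here rather than something the paper specifies.

```python
def all_pages(links):
    """Collect every page that appears as a source or a target."""
    return set(links) | {t for outs in links.values() for t in outs}


def pagerank(links, d=0.85, iters=50):
    """Iterate equation (1): PR(A) = (1 - d) + d * sum(PR(T) / C(T))."""
    pages = all_pages(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        new = {p: 1.0 - d for p in pages}
        for src, outs in links.items():
            for dst in outs:
                # Each page splits its rank evenly over its C(src) out-links.
                new[dst] += d * pr[src] / len(outs)
        pr = new
    return pr


def link_weights(links):
    """Return W_in(m, n) * W_out(m, n) for every link, per equations (2), (3)."""
    pages = all_pages(links)
    out_deg = {p: len(links.get(p, ())) for p in pages}
    in_deg = {p: 0 for p in pages}
    for outs in links.values():
        for dst in outs:
            in_deg[dst] += 1
    weights = {}
    for m, outs in links.items():
        sum_in = sum(in_deg[p] for p in outs)    # sum of I_p over R(m)
        sum_out = sum(out_deg[p] for p in outs)  # sum of O_p over R(m)
        for n in outs:
            w_in = in_deg[n] / sum_in
            # Assumption: if no reference page has out-links, use weight 0.
            w_out = out_deg[n] / sum_out if sum_out else 0.0
            weights[(m, n)] = w_in * w_out
    return weights


def weighted_pagerank(links, d=0.85, iters=50):
    """Iterate equation (4) using the precomputed link weights."""
    pages = all_pages(links)
    weights = link_weights(links)
    wpr = {p: 1.0 for p in pages}
    for _ in range(iters):
        new = {p: 1.0 - d for p in pages}
        for (m, n), w in weights.items():
            new[n] += d * wpr[m] * w
        wpr = new
    return wpr


# Toy graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
wpr = weighted_pagerank(links)
```

In this toy graph the link from A to C receives twice the weight of the link from A to B, because C has two incoming links and B has one; plain PageRank would split A's rank evenly between them.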
3.3 Topic Sensitive PageRank
In Topic Sensitive PageRank, several scores are computed: each page receives multiple importance scores, one per topic, which are combined into a composite PageRank score for the pages matching the query. During the offline crawling process, 16 topic-sensitive PageRank vectors are generated, using the top-level categories of the Open Directory Project (ODP) as a guideline. At query time, the similarity of the query to each of these topics is computed; subsequently, instead of using a single global ranking vector, a linear combination of the topic-sensitive vectors is formed, weighted by the similarity of the query to the topics. This method is reported to yield results that are more accurate and relevant to the context of the particular query.
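The two phases described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: the function names, the toy graph and the two toy topics (instead of the 16 ODP categories) are all hypothetical. The offline step uses the normalized form of PageRank in which the random jump lands only on pages listed under the topic; the query-time step assumes a uniform prior over topics and unsmoothed maximum-likelihood term estimates, so a topic that never contains a query term gets probability zero.

```python
def biased_pagerank(links, topic_pages, d=0.85, iters=50):
    """Offline step: PageRank whose jump vector covers one topic's pages only."""
    pages = set(links) | {t for outs in links.values() for t in outs}
    jump = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0)
            for p in pages}
    rank = dict(jump)
    for _ in range(iters):
        new = {p: (1.0 - d) * jump[p] for p in pages}
        for src, outs in links.items():
            for dst in outs:
                new[dst] += d * rank[src] / len(outs)
        rank = new
    return rank


def class_probabilities(query_terms, class_terms):
    """Query-time step: P(topic | query) from a unigram language model.

    Assumes a uniform prior over topics (not stated in the text).
    """
    scores = {}
    for topic, terms in class_terms.items():
        p = 1.0
        for t in query_terms:
            p *= terms.count(t) / len(terms)  # maximum-likelihood estimate
        scores[topic] = p
    total = sum(scores.values())
    if total == 0:
        # No topic matches the query: fall back to equal weights (assumption).
        return {topic: 1.0 / len(scores) for topic in scores}
    return {topic: s / total for topic, s in scores.items()}


def composite_scores(query_terms, class_terms, topic_ranks):
    """Linear combination of topic vectors, weighted by P(topic | query)."""
    probs = class_probabilities(query_terms, class_terms)
    pages = next(iter(topic_ranks.values()))
    return {p: sum(probs[t] * topic_ranks[t][p] for t in probs) for p in pages}


# Toy data: three pages, two topics.
links = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
topic_pages = {"sports": {"b"}, "arts": {"c"}}
class_terms = {
    "sports": ["football", "score", "football"],
    "arts": ["painting", "score"],
}
topic_ranks = {t: biased_pagerank(links, pages)
               for t, pages in topic_pages.items()}
scores = composite_scores(["score"], class_terms, topic_ranks)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Only the topic vectors depend on the link graph, so they can be computed once offline; the query-time work is limited to estimating the topic weights and summing the precomputed scores.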
3.3.1 ODP Biasing
The first step in this approach is performed once, during the offline preprocessing of the Web crawl. It generates a set of biased PageRank vectors using a set of basis topics. The specific methodology for creating the biased PageRank vectors is to use the URLs listed below the 16 top-level categories in ODP. These, then, constitute the personalization vectors. A single, non-biased PageRank vector is also computed for comparison purposes. In addition, a set of 16 class term-vectors is computed, consisting of the terms in the documents below each of the 16 top-level categories.
3.3.2 Query-Time Importance Score
The second step is performed at query time. If a query q was issued by highlighting the term q in some Web page u, then q' (the context of q) consists of the terms in u. For ordinary queries not done in context, let q' = q. Using a unigram language model, with parameters set to their maximum-likelihood estimates, the class probabilities for each of the 16 top-level classes in ODP conditioned on q' are computed. Using a text index, the URLs of all documents containing the original query terms q are retrieved. Finally, the query-sensitive importance score of each of these retrieved URLs is computed. The results are then ranked according to the composite score.
4 METHODOLOGY
We use Weighted PageRank to implement our Topic Sensitive Weighted PageRank: importance scores are precomputed offline using Weighted PageRank. For each page, we compute multiple importance scores, one with respect to each of various topics. At query time, the composite Weighted PageRank is formed by combining these importance scores.
5 CONCLUSION
The rapid proliferation of the World Wide Web has led Web content to increase tremendously. Hence, there is a great need for algorithms that can list relevant Web pages accurately and efficiently within the first few result pages. Most search engines use PageRank or Weighted PageRank, but users may not get the required documents easily. To resolve these problems, a new algorithm called Topic Sensitive Weighted PageRank, which employs Web structure mining, has been proposed. This algorithm is intended to improve the order of Web pages in the result list so that the user can get the most relevant pages easily.
REFERENCES
[1] W. Xing and A. Ghorbani, "Weighted PageRank Algorithm", Proc. of the Second Annual Conference on Communication Networks and Services Research (CNSR '04), IEEE, 2004.
[2] T. H. Haveliwala, "Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search", IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 4, pp. 784-796, July/August 2003.
[3] R. Kosala and H. Blockeel, "Web Mining Research: A Survey", SIGKDD Explorations, Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining, Vol. 2, No. 1, pp. 1-15, 2000.
[4] N. Duhan, A. K. Sharma and K. K. Bhatia, "Page Ranking Algorithms: A Survey", Proceedings of the IEEE International Conference on Advance Computing, 2009.
[5] M. G. da Gomes Jr. and Z. Gong, "Web Structure Mining: An Introduction", Proceedings of the IEEE International Conference on Information Acquisition, 2005.
[6] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins and J. Wiener, "Graph Structure in the Web", Computer Networks: The International Journal of Computer and Telecommunications Networking, Vol. 33, Issue 1-6, pp. 309-320, 2000.
[7] X. Wang, T. Tao, J. T. Sun, A. Shakery and C. Zhai, "DirichletRank: Solving the Zero-One Gap Problem of PageRank", ACM Transactions on Information Systems, Vol. 26, Issue 2, 2008.
[8] Z. Gyongyi and H. Garcia-Molina, "Web Spam Taxonomy", Proc. of the First International Workshop on Adversarial Information Retrieval on the Web, 2005.
[9] M. Bianchini, M. Gori and F. Scarselli, "Inside PageRank", ACM Transactions on Internet Technology, Vol. 5, Issue 1, 2005.
[10] C. H. Q. Ding, X. He, P. Husbands, H. Zha and H. D. Simon, "PageRank, HITS and a Unified Framework for Link Analysis", Proc. of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002.
[11] J. Cho and S. Roy, "Impact of Search Engines on Page Popularity", Proc. of the 13th International Conference on WWW, pp. 20-29, 2004.
[12] J. Cho, S. Roy and R. E. Adams, "Page Quality: In Search of an Unbiased Web Ranking", Proc. of ACM International Conference on Management of Data, pp. 551-562, 2005.
[13] A. M. Zareh Bidoki and N. Yazdani, "DistanceRank: An Intelligent Ranking Algorithm for Web Pages", Information Processing and Management, Vol. 44, No. 2, pp. 877-892, 2008.
[14] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan and A. Tomkins, "Mining the Link Structure of the World Wide Web", IEEE Computer Society Press, Vol. 32, Issue 8, pp. 60-67, 1999.
[15] L. Page, S. Brin, R. Motwani and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Technical Report, Stanford Digital Libraries SIDL-WP-1999-0120, 1999.
[16] S. Brin and L. Page, "The Anatomy of a Large Scale Hypertextual Web Search Engine", Computer Networks and ISDN Systems, Vol. 30, Issue 1-7, pp. 107-117, 1998.
[17] J. Hou and Y. Zhang, "Effectively Finding Relevant Web Pages from Linkage Information", IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 4, 2003.
[18] J. Dean and M. Henzinger, "Finding Related Pages in the World Wide Web", Proc. Eighth Int'l World Wide Web Conf., pp. 389-401, 1999.
[19] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins and E. Upfal, "Web as a Graph", Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Database Systems, 2000.
[20] R. Cooley, B. Mobasher and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web", Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), 1997.
[21] S. J. Kim and S. H. Lee, "An Improved Computation of the PageRank Algorithm", Proceedings of the European Conference on Information Retrieval (ECIR), 2002.
[22] R. Baeza-Yates and E. Davis, "Web Page Ranking Using Link Attributes", Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, pp. 328-329, 2004.