Improve Page Rank Algorithm using Normalized ...

6 downloads 110720 Views 680KB Size Report
All the major search engines such as Google, Yahoo, Ask, Bing, etc. rank web-pages based on certain factors that affect its ranking; therefore,. SEO aims at ...
Nidhi K.Chitroda |Improve Page Rank Algorithm using Normalized Technique

IIJDWM Journal homepage: www.ifrsa.org

Improve Page Rank Algorithm using Normalized Technique Nidhi K.Chitroda M.Tech. [Computer Engineering] Student, School Of Engineering R.K. University, Katurbadham, Tramba, Rajkot, Gujarat ABSTRACT Search engine is one kind of software, which enlists data about web sites. All the major search engines such as Google, Yahoo, Ask, Bing, etc. rank web-pages based on certain factors that affect its ranking; therefore, SEO aims at generating the right types of signals on the web-pages. The core methodology used in SEO is to upgrade both content and associated coding of the website to improve its visibility and prominence in organic searches made by the search engines. The optimized websites obtain better ranks, and typically get a higher number of visitors. This research is based on studying different optimization techniques for individual web-pages or the entire website to make them search engine friendly. Besides, this study also critically analyses and summarizes the core techniques proposed in the contemporary literature. An efficient web page ranking algorithm should meet out these challenges efficiently with compatibility with global principles of web technology. 1.

INTRODUCTION

Nowadays searching on the internet is most widely used operation on the World Wide Web. The amount of information is increasing day by day rapidly that creates the challenge for information retrieval. There are so many tools to perform efficient searching. Due to the size of web and requirements of users creates the challenge for search engine page ranking [19]. Ranking is the main part of any information retrieval system Today’s search engines may return millions of pages for a certain query It is not possible for a user to preview all the returned results So, page ranking is helpful in web searching. Rankers are classified into two groups: Content-based rankers and Connectivity-based rankers. Content-based rankers works on the basis of number of matched terms, frequency of terms, location of terms, etc. Connectivity-based rankers work on the basis of link analysis technique, links are edges that point to different

web pages. PageRank has been developed by Google and is named after Larry Page, Google’s co-founder and president [15]. PageRank is designed to simulate the behavior of a “random web surfer” [16] who navigate a web by randomly following links. If a page with none outgoing links is achieved, the surfer goes to a randomly chosen bookmark. In more to this normal surfing behavior, the surfer occasionally spontaneously goes to a bookmark instead of following a link. A. Importance of Page Rank and Back Links Back Linking can be an effective method for gaining Page Rank. Google Panda allowed the quality optimized sites with suitable Back Links, to top the search engine rankings and lower the less significant websites. Back Links have become extremely important to SEO and have furthermore developed into the main building blocks to high-quality SEO. [8] Back Links can be described as links that are directed towards a website. The amount of Back Links is a signal of the popularity/importance of that website. Search Engines such as Google, offers extra credit to websites that have a high number of quality Back Links and consider those websites more relevant than others in the Page Ranks. “The more back links website will have, the more frequently it will be visited by search engines robots, which means that new content will be indexed much faster”. [5] In theory, if one page on a website has a lot of extremely relevant Back Links, it will be ranked high in the search engines, whereas the remainder of the pages will be ranked low. “In reality, the whole website will be ranked high, because very often, each page will have a plenty of internal back links to the main page and vice versa”. [5] In short Page Rank is a "vote", by all the other pages of site on the Web, about how important a page is. [4] Any link to a web page determines as a vote of support. [4] If there's no link there's no support (but it's only an abstention from voting rather than a vote against the page). Page Rank is determined like this:

IFRSA International Journal of Data Warehousing & Mining |Vol 4|issue1|Feb. 2014

42

Nidhi K.Chitroda |Improve Page Rank Algorithm using Normalized Technique We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set in between 0 & 1. We usually set d to 0.85. There are more details about d in the next section. And also C (A) is defined as the number of links going out of page A. The Page Rank of the page A is given as follows: [4] PR (A) = (1-d) + d (PR (T1)/C (T1) + … + PR (Tn)/C (Tn)) The sum of all web pages Page Ranks will be one as Page Ranks form a probability distribution over web pages. PageRank or PR (A) can be calculated using the various algorithm by matrix of the web links. 2.

HOW TO MEASURE PAGE RANK

Let begin by picturing the Web net as a directed graph, with nodes represented by web pages and edges represented by the links between them. [10] Suppose for instance, that we have a small Internet consisting of just 4 web sites www.page1.com, www.page2.com, www.page3.com, www.page4.com, referencing each other in the manner suggested by the picture:

Figure 2.2 Node links In our model, each page should transfer evenly its importance to the pages that it links to. Node 1 has 3 outgoing edges, so it will pass on of its importance to each of the other 3 nodes. Node 3 has only one outgoing edge, so it will pass on all of its importance to node 1. In general, if a node has k outgoing edges, it will pass on of its importance to each of the nodes that it links to. Let us better visualize the process by assigning weights to each edge.

Figure 2.3 Importance Distribution So, denote by A the transition matrix of the graph, A=

Figure 2.1 Webpages We "translate" the picture into a directed graph with 4 nodes, one for each web site. When web site i references j, we include a directed edge between node i and node j in graph. For the purpose of computing their page rank, we ignore any navigational links such as back, next buttons, as we only care about the connections between different web sites. For instance, Page1 links to all of the other pages, so node 1 in the graph will have outgoing edges to all of the other nodes. Page3 has only one link, to Page 1, therefore node 3 will have one outgoing edge to node 1. After analyzing each web page, we get the following graph:

. After making such a matrix one can follow any algorithm to compute page rank of web page. 3.

PAGERANK ALGORITHMS

Page ranking algorithms are the heart of search engine and give result that suites best in user expectation. Need of best quality results are the main reason in innovation of different page ranking algorithms, HITS, PageRank, Weighted PageRank, Distance Rank, Dirichlet Rank Algorithm , Page content ranking are different examples of page ranking used in different scenario. Since GOOGLE search engine has great importance now days and this affect many web users now days, so page rank algorithm used by GOOGLE become very important to researches. A. Page rank Algorithm

IFRSA International Journal of Data Warehousing & Mining |Vol 4|issue1|Feb. 2014

43

Nidhi K.Chitroda |Improve Page Rank Algorithm using Normalized Technique PageRank was proposed by S. Brin and L. Page at Stanford University. PageRank is used by very popular search engine of today, GOOGLE. The basic concept behind GOOGLE is, it combined the actual page rank value of page with the matching text value of query and find the overall score of a page [6] [5]. PageRank deals with the link structure of web page. It gives more importance to back links of a web page. If a page has important back links that means this page automatically gets more importance and provide more importance to other pages which this page is pointing. [1] Basic formula of a page rank is, ( ) = ∑ ∈ ( ) ( ) / ( ) (1) But in some research paper it was given differently ( )=c∑ ∈ ( ) ( )/ ( ) (2) Where pr (u) = page rank of page u that we want to find out B (u) = Back link set of page u pr (v) = Page rank value of page v that is pointing to page u L (v) = Number of link present in page v C = Normalization factor Since these formula basically holds for those users which follow the link structure of page. But for WWW user do not follow the link structure; they randomly click web pages, so for that the modified page rank formula is ( ) = 1− + ∑ ∈ ( ) ( )/ ( ) (3) Where; D=damping factor (Whose value is 0.85(approx.)) In Equation (3) we can see that (1-d+d) =1 so it simply multiply 1 with the equation (1) and (2). Here d means user actually follows the link structure and (1-d) means user is randomly select any page on WWW. This formula actually calculated in iterative manner, And after conversation, the actual value of page rank is calculated. PageRank Value converges roughly in logon [1]. B. Weighted PageRank algorithm Weighted page rank was proposed by Wenpu Xing and Ali Ghorbani [1] [5]. It is the modified version of traditional PageRank algorithm. In this technique, it does not divide page rank value equally in between number of page links available in the page. Unlike the traditional Pagerank algorithm it gives more value to the important pages. Each out link page gets a value proportional to its popularity. Popularity of web pages can be determine by the help of outgoing links and in coming links to the web pages, so this algorithm determine this value by the help of two functions, namely Win(v,u) = weight of the link (v , u) calculated based upon number of in links of page u and the number of in links of all references pages of page v. Win (v, u) = u / ∑ ∈( ) p Where, Iu = Number of inlinks of page u.

Ip = Number of inlinks of page p. Wout(v,u) = weight of the link (v,u) calculated based upon number of out links of page u and the number of out links of all references pages of page v. Wout (v, u) = Ou / ∑ ∈( ) Op Where; Ou = Number of inlinks of page u. Op = Number of inlinks of page p. So the modified PageRank algorithm is as given below ( ) = 1− + ∑ ∈ ( ) ( ) in (v, u) Wout (v , u) Where; wpr (u) = page rank of page u that we want to find out wpr (v) = Page rank value of page v that is pointing to page u C. PageRank based on Visits of Links (vol) This algorithm was proposed by Gyanendra Kumar, Neelam Duhan, A. K. Sharma in 2011 at International Conference on Computer & Communication Technology (ICCCT)-2011 [9]. Unlike traditional PageRank algorithm, it does not divide page rank value equally between outgoing links. Instead of this it assign more rank value to the outgoing links which is most visited by users. So in this manner page rank is calculated based on visits of inbound links. The formula behind this algorithm is given below. ( ) = 1− + ∑ ∈( ) ( ( )/ ( ) ) Lu Where; d = damping factor, u = web page, B (u) = set of pages that point to u. PR(u) and PR(v) are the rank score of page u and page v. Lu = Number of visits of links which are pointing page u from page v. TL (v) = Denotes total number of visits of all links present on page v. D. Weighted PageRank based on Visits of Links (vol) It was proposed by Neelam Tyagi, Simple Sharma in 2012 at International Journal of Soft Computing and Engineering (IJSCE).In traditional weighted page rank algorithm, page rank value was assigned to its outgoing links according to its popularity and this popularity can be recorded by two functions namely Win and Wout [11]. But this algorithm did not consider the popularity of outlinks. This new algorithm do consider the weight age of page as well as visits of links (i.e. user behavior). In this algorithm it assigns more rank value to the outgoing links which is most visited by users and received higher popularity from number of inlinks. The main purpose behind this algorithm is to provide relevant information to users when they search on search engine and get result based on user conduct, that’s why the searching time gets reduced. The formula behind this algorithm is WPRvol (u) = (1-d) +d ∑ ∈( ) ( u WPRvol (v) Win (v, u)) / TL (v)

IFRSA International Journal of Data Warehousing & Mining |Vol 4|issue1|Feb. 2014

44

Nidhi K.Chitroda |Improve Page Rank Algorithm using Normalized Technique Where; d = damping factor, u = web page, B (u) = set of pages that point to u. WPRvol (u) and WPRvol (v) are the rank score of page u and page v. Lu = Number of visits of links which are pointing page u from page v. TL (v) = Denotes total number of visits of all links present on page v. Traditional Page rank Algorithm 1) Calculate page ranks of all pages by following formula: PR (A) = (1-d) + d (PR(T1)/C(T1) + ...... + PR(Tn)/C(Tn)) Where PR (A) is the PageRank of page A, C (Ti) is the number of outbound links on page Ti PR (Ti) is the PageRank of pages Ti which link to page A and d is a damping factor which can be set between 0 and 1, but it is usually set to 0.85 2) Repeat step 1 until values of two consecutive iterations match. Normalized Page rank Algorithm 1) Initially assume PAGE RANK of all web pages to be any value, let it be 1. 2) Calculate page ranks of all pages by following formula PR(A) = 0.15 + 0.85 (PR(T1)/C(T1) + PR(T2)/C(T2) +……. + PR(Tn)/C(Tn)) Where T1 through Tn are pages providing incoming links to Page A PR(T1) is the Page Rank of T1 PR(Tn) is the Page Rank of Tn C(Tn) is total number of outgoing links on Tn 3) Calculate mean value of all page ranks by following formula:Summation of page ranks of all web pages / number of web pages 4) Then new page rank of each page N PR (A) = PR (A) / mean value Where N PR (A) is New Page Rank of page A and PR (A) is page rank of page A 5) Assign PR(A)= N PR (A) 6) Repeat step 2 to step 4 until page rank values of two Consecutive iterations are same. The pages which have the highest page rank are more significant pages. So, number of iterations for calculating the page ranks in the Proposed Page Rank algorithm are reduced. 4.

Figure 3.1 Web Graph [2] The proposed algorithm is implemented on www.icare.manit.ac.in. Page rank of individual web page by this implementation is shown in Table 3.2. Table 3.2 Final Page Ranks for different web pages Home Page 6.5 About_MANIT

0.5804

Announcement

0.5804

Conference_Brochure

0.5804

Objective

0.5804

Themes

0.5804

Important_Dates

0.5804

Call_for_paper

0.5804

Registration_Form

0.5804

Contact_Person

0.5804

Accommodation

0.5804

Co_organizers

0.5804

Exhibition

0.5804

ProgrammeofConference

0.5804

EXPERIMENTS

IFRSA International Journal of Data Warehousing & Mining |Vol 4|issue1|Feb. 2014

45

Nidhi K.Chitroda |Improve Page Rank Algorithm using Normalized Technique

Figure 3.2 Page rank Vs. No. of iterations Another example:

Figure 3.3 Webpage connection Table 3.3 Pagerank Web pages Page Rank Page 1.com 2 Page 2.com 1.5609 Page 3.com

1.0954

Page 4.com

0.7687

formula. But when a simple calculation is applied hundreds (or billions) of times over the results can seem complicated. Page Rank is, in fact very simple apart from one scary seeming formula. But when a simple calculation is applied hundreds (or billions) of times over the results can seem complicated. Time on site and other factors like one’s links seem more likely to be bigger factors in the ranking algorithms. The average Actual PR of all pages in the index is 1.0. So if one ad pages to a site we are building the total PR will go up by 1.0 for each page but the average will remain the same. By comparison one can conclude that pagerank and pagerank (VOL) gives high result as compare to both weighted pagerank and weighted pagerank (VOL) for d=0.5. But when we compare Pagerank and Pagerank (VOL), these two algorithms may pass one another depending upon condition, sometimes Pagerank gives higher value, sometimes Pagerank (VOL). One can also observe from table that for small value of d that means if users often visit web pages randomly (d=0.25 means probability of user to follow link structure is 0.25) page rank of web page converge quickly as compare to large value of d. Comparative study of the previous research work regarding the techniques used in SEO offers pinpoints certain gaps in the known search engine optimization techniques. But still improvement is algorithm by reducing number iteration can be done which is as future work and efficiency of algorithm can be increased by reducing time complexity of algorithm. REFERENCES [1]

[2]

[3]

[4]

Figure 3.4 Page rank Vs. No. of iterations 5.

[5]

CONCLUSION & PROPOSED WORK

SEO basically relies on user behavior, social engagements, visitors and other publishers. Page Rank is, in fact very simple apart from one scary seeming

[6]

Kaushal K., Abhaya, Fungayi D., 2013: 'PageRank algorithm and its variations: A Survey report', IOSR Journal of Computer Engineering (IOSR-JCE), 14. Hema D. ,Prof. B. N. Roy, 2011: 'An Improved Page Rank Algorithm based on Optimized Normalization Technique', (IJCSIT) International Journal of Computer Science and Information Technologies, 2(5). Nursel Y, Utku K., 2010: ‘What is search engine optimization: SEO?’, 1877-0428 © 2010 Published by Elsevier Ltd. Ali H. Al-Badi, Ali O. Al Majeeni, Pam J. Mayhew and Abdullah S. Al-Rashdi, 2011: ‘Improving Website Ranking through Search Engine Optimization’, Journal of Internet and e-business, 2011, Article ID 969476. Meng Cui, Songyun Hu, 2011: ‘Search Engine Optimization Research for Website Promotion’, IEEE computer society. ‘Google Page Rank Algorithm’, http://www.sirgroane.net/google-page-rank/

IFRSA International Journal of Data Warehousing & Mining |Vol 4|issue1|Feb. 2014

46

Nidhi K.Chitroda |Improve Page Rank Algorithm using Normalized Technique [7] [8]

[9]

[10]

Michael David, 2013: 'Search Engine Optimization in 2013'. Dr. Khanna S., Omprakash V., 2011: 'CONCEPT OF SEARCH ENGINE OPTIMIZATION IN WEB SEARCH ENGINE', IJAERS, 1,235-237. Gyanendra K., Duhan N., and A. Sharma, 2011: ‘Page Ranking Based on Number of Visits of Links of Web Page’, proc. India International Conference on Computer & Communication Technology (ICCCT). Tyagi N., and Sharma S., 2012: ‘Weighted Page Rank Algorithm Based on number of visits of links (VOL)’, International Journal of Soft Computing and Engineering (IJSCE), 3(2), 2231-2307.

[11]

[12]

[13] [14]

[15] [16]

Sharma K., Shrivastwa G., and Kumar V., 2011: ‘Web mining: Today and Tomorrow’, IEEE, 978-1-4244-8679/11 Sharma D, and A. K. Sharma, 2010: ‘A Comparative Analysis of Web Page Ranking Algorithms’, International Journal on Computer Science and Engineering, 1, 26702676. SEO Tips, 2011: ‘Backlinks’, accessed January 2013, http://www.seotipsy.com/backlinks/ ‘Earnings per Click (EPC)’, http://www.seoadept.com/glossary/earnings_pe r_click.html http://www.searchenginejournal.com/category/ search-engine-optimization/ http://en.wikipedia.org/wiki/PageRank

IFRSA International Journal of Data Warehousing & Mining |Vol 4|issue1|Feb. 2014

47

Suggest Documents