Search Results Optimization

Divakar Yadav, Apeksha Singh, and Vinita Jain
Jaypee Institute of Information Technology, A-10, Sector-62, Noida (India)
[email protected], [email protected], [email protected]

Abstract. In this paper, we put forward a technique for optimizing the search results obtained in response to an end user's query. With the enormous volume of data present on the web, it is relatively easy to find matched documents containing the given query terms. The difficult part is to select the best from the possible myriad of matching pages. Moreover, most web search engines perform very well for a single-keyword query but fail to do so for multiple terms. Using the concept of meta search engines, we propose a suitable query processing and optimization algorithm that returns the best possible results for multiple-term queries in ranked order.

Keywords: Search Optimization, Query Processing, Re-ranking, tf-idf, Inverted Indexing.

1 Introduction

The basic idea of search results optimization is to pick precisely the matched documents and present them to the end user in order of relevance with respect to the user's query. It aims at giving the most suitable, optimized results for the query, saving users both time and effort. According to [1], optimization involves comparison of a variety of parameters such as keyword density, link analysis, anchor text content, Meta tags and so on. A web crawler, which takes a seed URL and crawls web pages accordingly, is used to retrieve the pages [2]. Re-ranking involves calculating the overall scores of the documents based on their weighing schemes and positioning them in the order of their ranks [5].

The aim of this work is to develop a suitable weighing scheme, taking care of the factors mentioned in [1], that helps to optimize the matched web pages when a multiple-term query is provided as search input. While optimizing the results, various other factors such as page structure, the frequency of keywords and their position in a document are also considered.

The paper is organized as follows: the current section introduced search results optimization and its key problems. Section 2 describes the research work that has already been done in the field; various research papers have been studied and examined for this purpose. Section 3 contains the algorithm proposed for multi-keyword search results optimization. Implementation and results are discussed in section 4. Finally, in section 5 we conclude the work with some future aspects, followed by the references, which have been the source of our knowledge.

2 Literature Survey

Extensive research has been done in the field of search results optimization in the past. We studied various research papers to gain knowledge in the field, which motivated us to begin work on this topic. A summary of a few selected papers relevant to the topic is given hereafter.

The impact of search engine ranking and result optimization for websites is discussed in [10]. The fundamental methods involved in the optimization of websites are described in [1]. It discusses the relevance of the page title in deciding the rank of web pages; keyword optimization and link analysis are also stated as important criteria for making a web page search engine friendly and optimized.

The PageRank algorithm given by the Google founders, which is based on link structure, is discussed in [2]. It also describes the initial architecture behind the Google search engine. This paper helped us to learn about the components and system structure involved in search engines, such as web crawlers, repositories, web servers etc. A detailed explanation of the page ranking scheme is also given in [6], which describes the advantage of the ranking scheme over the citation counting method.

The inefficiency of web search engines for multiple-term queries is discussed in [3]. It states that while most web search engines perform very well for a single-keyword query, their precision is not as good for queries involving two or more keywords. This is primarily because the search results usually contain a large number of pages with weak relevance to the query keywords. It proposes a method to improve the precision of web retrieval based on proximity and density of keywords for two-keyword queries.

An algorithm for better optimized search results is discussed in [4]. It introduces a query processing scheme which forms the basis of our algorithm; a suitable ranking criterion, weighing scheme and clustering method are also proposed. A re-ranking method which filters out unrelated documents by employing document comparison, link extraction and comparison of anchor text contents is described in [5]. Various crawling algorithms, such as the shark-search algorithm, are discussed in [7], [8], while the inverted index scheme and its applications are discussed in detail in [9].

3 Proposed Algorithm

In this paper, we propose a new algorithm for optimizing the search results returned in response to a user's query. The major challenge was to devise an efficient method to handle multiple query terms and to incorporate the various optimization strategies together with it. The retrieval of web pages for the algorithm is done by a web crawler which works on the principle of a seed URL. For query processing we break the query into different subsets, process them and calculate a probability factor associated with each subset according to its relevance. Further, we apply optimization strategies such as keyword density and inverse document frequency (idf) to calculate the scores of all documents. Finally, the documents are re-ranked based on their respective scores. Given below is the sequential description of the proposed technique.

1. Retrieve web pages for the query term with the help of a web crawler from various search engines and store them.

2. To handle multiple query terms and ensure better optimization, every query is broken down into sets containing all possible combinations of its terms: single terms, double terms and so on, up to the maximum length of the original query. Afterward, a probability factor is attached to these sets which assigns the highest probability to the set having the maximum term combinations. The reasoning is that web pages containing most of the query terms are more relevant than those containing just one or two terms.

3. Normalize the query terms. For this, a weight is associated with each set of query terms; in our case the probability becomes the weight. This weight serves as the main multiplying factor while calculating the ranks of the web pages. The other factors taken into consideration are as follows:

   (a) Different weights are assigned to different tags according to their relative importance in a web page. For example, title and Meta tags present in the header of a page are considered more important than the other contents of the page and hence are assigned higher weights. Therefore, a query term occurring in them fetches the highest score.

   (b) Another factor taken into account is the relative position of keywords in a web page. A page which has the query term in its abstract or beginning carries higher relevance than a page which has the same query term towards its end. Thus query terms found in different paragraphs contribute different scores, based on the different weights assigned to the paragraphs.

   (c) Anchor tags are assigned a weight lower than title and Meta tags. The score is calculated by counting the frequency of all n-grade query terms in anchor tags and multiplying the frequencies with the weight of the respective terms.

   (d) The term frequency and inverse document frequency of the query terms are computed, and tf-idf is also assigned a relevant weight. The inverse document frequency accounts for the anomaly which arises when certain pages are short while others are rather long.


Based on the above factors, a score for each web page is calculated using the following formula:

Score(q, d) = frequency of the term (in each section) * weight (of the same section)    ----- (1)

Incorporating the term frequency and inverse document frequency into the above equation, the final score is computed using the following formula:

Final Score = Score(q, d) + tf-idf(q, d)    ----- (2)

where q = query, d = document, and tf-idf = term frequency * inverse document frequency. Using the above formula the score for each web document is computed, and the documents are then arranged in decreasing order of score. The web page having the highest score is the most optimized result for the user's query.
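To make equations (1) and (2) concrete, here is a minimal sketch of how such a score could be computed. It is written in Python purely for illustration (the paper does not show its implementation), and the section names and weight values are our own assumptions, not values prescribed by the paper.

# Illustrative sketch of equations (1) and (2); section names and
# weights are assumptions, not values prescribed by the paper.
SECTION_WEIGHTS = {"title": 5.0, "meta": 4.0, "anchor": 3.0, "body": 1.0}

def score(query_terms, doc_sections):
    """Equation (1): term frequency in each section times that section's
    weight, summed over sections. `doc_sections` maps a section name to
    its text; single-word terms are assumed for simplicity."""
    total = 0.0
    for section, text in doc_sections.items():
        words = text.lower().split()
        freq = sum(words.count(term.lower()) for term in query_terms)
        total += freq * SECTION_WEIGHTS.get(section, 1.0)
    return total

def final_score(query_terms, doc_sections, tf_idf_value):
    """Equation (2): section score plus the document's tf-idf score."""
    return score(query_terms, doc_sections) + tf_idf_value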

4 Implementation and Results

We implemented a meta search engine to optimize the web pages retrieved for a user's query. The following procedures were followed while implementing it.

4.1 Retrieval of the Web Pages

As soon as the user enters the query term and hits the search button, a web crawler crawls the search results of various existing search engines, retrieves the links of the web pages and stores them in our database. For every keyword around 10-25 links are retrieved and stored. The crawler then iterates through the links, retrieves the corresponding web pages and stores them in the database.

4.2 Ranking Procedure

The web pages are then ranked on the basis of the following factors. Instead of searching for the entire query entered by the user, for better optimization results we break the query term into a number of subsets. For example, a 5-word query is broken down into sets containing all possible combinations of single terms, double terms and so on, up to the length of the query.

E.g. query term: "I like icecream"

Single phrases
● I
● Like
● Icecream


Double phrases
● I like
● Like icecream
● I icecream

Triple phrase
● I like icecream

The main reason for this is that pages containing the entire query phrase are often more relevant than those containing only single terms of the query. The combinations of all the words are stored in a 2D array for easier access, and also because different weights have to be assigned to different sets. For example, the query term "how to construct a triangle" is broken down and all the combinations are stored in an array as shown in fig. 1.

Array of combinations of 5 words
Array ([0] => how to construct a triangle)

Array of combinations of 4 words
Array ([0] => to construct a triangle [1] => how construct a triangle [2] => how to a triangle [3] => how to construct triangle [4] => how to construct a)

Array of combinations of 3 words
Array ([0] => construct a triangle [1] => to a triangle [2] => to construct triangle [3] => to construct a [4] => how a triangle [5] => how construct triangle [6] => how construct a [7] => how to triangle [8] => how to a [9] => how to construct)

Array of combinations of 2 words
Array ([0] => a triangle [1] => construct triangle [2] => construct a [3] => to triangle [4] => to a [5] => to construct [6] => how triangle [7] => how a [8] => how construct [9] => how to)

Array of combinations of 1 word
Array ([0] => triangle [1] => a [2] => construct [3] => to [4] => how)

Fig. 1. Combinations of query term phrases
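The arrays in fig. 1 appear to be PHP print_r output; the paper does not show the generating code. A minimal Python sketch producing the same groupings is given below (the ordering within each group may differ from the figure).

from itertools import combinations

def query_subsets(query):
    """Return all term combinations of the query, grouped by length,
    mirroring the grouping of Fig. 1."""
    terms = query.split()
    return {n: [" ".join(c) for c in combinations(terms, n)]
            for n in range(len(terms), 0, -1)}

# Example: reproduces the groups shown in Fig. 1.
for n, combos in query_subsets("how to construct a triangle").items():
    print(f"Combinations of {n} word(s): {combos}")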

4.3 Calculation of Relevancy Factor

The relevancy factor (probability) assigned to each query set is given by

p(i) = p_i / p    ----- (3)

where p_i is the number of permutations for the i-th grade query term and p is the total number of permutations (the sum of the numbers of permutations over all grades).


This probability p(i) is the main multiplying factor while calculating the ranks. It serves as a kind of weight, having the highest value for the set containing the maximum term combinations and the lowest for single-term combinations. This is in line with the fact that single words are always more likely to occur in a document: the frequency of occurrence of single-term combinations in a document is higher than that of n-term combinations, hence the least weight or relevance is assigned to single-term combinations. On the same basis, the set that contains all the words of the query has the highest weight, as a document containing the full query term is always more relevant. In this way the score of a more relevant page is higher even if the page is shorter in length. This turns the probability factor into a relevancy factor for each query term subset.

To perform a more accurate ranking of the documents we also calculate the inverse document frequency, which takes into account the fact that documents vary in size. The tf-idf calculated for each document thus gives a rough idea of its relevancy.
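Under the assumption that p_i counts permutations P(n, i) of the n query terms taken i at a time, which is what equation (3) and the text above suggest, the relevancy factors could be sketched as follows (Python, for illustration only):

from math import perm  # Python 3.8+

def relevancy_factors(num_terms):
    """Equation (3): p(i) = p_i / p, where p_i is the number of
    permutations of the query terms taken i at a time and p is the
    total over all grades; higher grades receive larger weights."""
    counts = {i: perm(num_terms, i) for i in range(1, num_terms + 1)}
    total = sum(counts.values())
    return {i: counts[i] / total for i in counts}

# For a 5-term query the counts are 5, 20, 60, 120, 120 (total 325),
# so the highest grades carry the largest weights, as required above.
print(relevancy_factors(5))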

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} is the number of occurrences of the considered term (t_i) in document d_j and the denominator is the sum of the numbers of occurrences of all terms in document d_j.

idf_i = log( |D| / |{d : t_i ∈ d}| )

where |D| is the total number of documents in the corpus and |{d : t_i ∈ d}| is the number of documents in which the term t_i appears.

Values of tf-idf are calculated, using equation (4), for every query term subset derived from the query during normalization.

(tf-idf)_{i,j} = tf_{i,j} * idf_i    ----- (4)
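A minimal sketch of these tf-idf definitions (Python, for illustration; the guard for unseen terms is our own addition, and the corpus in the example is hypothetical):

import math

def tf(term, doc_words):
    """tf_{i,j}: occurrences of the term in d_j over total terms in d_j."""
    return doc_words.count(term) / len(doc_words)

def idf(term, corpus):
    """idf_i: log of corpus size over the number of documents containing
    the term (guarding against division by zero for unseen terms)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

def tf_idf(term, doc_words, corpus):
    """Equation (4): (tf-idf)_{i,j} = tf_{i,j} * idf_i."""
    return tf(term, doc_words) * idf(term, corpus)

# Tiny usage example with a three-document corpus.
corpus = [["taj", "mahal", "india"], ["taj", "hotel"], ["india", "gate"]]
print(tf_idf("taj", corpus[0], corpus))  # (1/3) * log(3/2)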

4.4 Location in a Document

The third step of the optimization is to find the term frequency of each query term set in the different parts of a document, such as the title, Meta tag and anchor tag. The term frequency of the query terms is calculated in all the tags, and the respective tags are assigned weights. Next, for each set of query terms we multiply these weights with the relevancy factor calculated above, as shown in table 1. Here F_it, F_im and F_ia denote the frequencies of the i-th grade query terms in the title, Meta and anchor tags respectively, and p_i is the relevancy factor of equation (3).

Table 1. Parameters and Weights

ID   Item         Weight   Relevancy Parameter (λ)
1    Title tag    W1       Σ F_it * p_i * W1
2    Meta tag     W2       Σ F_im * p_i * W2
3    Anchor tag   W3       Σ F_ia * p_i * W3
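A small sketch of the λ computation of table 1, reusing the query subsets and relevancy factors from the earlier sketches; the tag weight values and the naive substring counting are our own assumptions:

# Hypothetical tag weights with W1 > W2 > W3, as described in section 3.
TAG_WEIGHTS = {"title": 5.0, "meta": 4.0, "anchor": 3.0}

def tag_relevancy(tag_texts, subsets, p):
    """λ per tag, per Table 1: sum over all query-term subsets of
    (frequency of the subset in the tag) * p_i, times the tag weight.
    `subsets` maps grade i to its phrases, `p` maps i to p(i)."""
    scores = {}
    for tag, text in tag_texts.items():
        lowered = text.lower()
        lam = sum(lowered.count(phrase.lower()) * p[i]  # naive counting
                  for i, phrases in subsets.items()
                  for phrase in phrases)
        scores[tag] = lam * TAG_WEIGHTS[tag]
    return scores

# Illustrative usage with made-up subsets and relevancy factors.
subsets = {2: ["taj mahal"], 1: ["taj", "mahal"]}
p = {2: 0.8, 1: 0.2}
tags = {"title": "Taj Mahal, Agra", "meta": "taj mahal india tour", "anchor": ""}
print(tag_relevancy(tags, subsets, p))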

Fig. 2. Keyword Frequency in Meta tags (chart omitted)

Fig. 3. Keyword Frequency in Title tags (chart omitted)


4.5 Position in a Document

Another factor that has been kept in mind is the position of a particular query term inside a document, such as its paragraph position. For example, documents whose abstract or introduction contains the query term are often more relevant than documents in which the query terms appear at the end. Terms in paragraphs are therefore assigned weights in decreasing order according to the paragraph they are found in. The term frequency and scores are calculated for every subset of the query term. Finally, the relevancy factor calculated as above is multiplied with the weights, and the final score is decided by adding the tf-idf score of each document, as shown in equation (5).

Final Score = Score(q, d) + tf-idf(q, d)    ----- (5)

where q = query and d = document.

4.6 Re-ranking

The scores of all the pages are calculated and arranged in descending order, and the pages are re-ranked accordingly, as sketched below.
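A minimal sketch of this re-ranking step (Python; the page identifiers and score values in the example are hypothetical):

def rerank(pages, final_scores):
    """Section 4.6: order pages by their final scores, highest first.
    `pages` is a list of page identifiers; `final_scores` maps a page
    to its score from equation (5)."""
    return sorted(pages, key=lambda page: final_scores[page], reverse=True)

# Usage: pages re-ranked so the highest-scoring page comes first.
scores = {"page A": 7.4, "page B": 12.1, "page C": 3.9}  # illustrative
print(rerank(list(scores), scores))  # ['page B', 'page A', 'page C']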

4.7 Comparison Analysis

We ran the optimization algorithm on the query "taj mahal india" and calculated the various optimization factors. The results obtained are depicted in fig. 4.

Fig. 4. Comparison of pages for each relevance factor (bar chart omitted: per-page scores of the meta, title, anchor, paragraph and tf-idf factors for pages 1-9)


The above graph depicts the frequencies of the various tags in the pages obtained using the Yahoo search engine. Here page 1, page 2 and so on are the pages as ranked by Yahoo. According to this, page 4 has the highest frequency and page 3 the second highest. Fig. 5 is a pie chart representation of the final scores of all these pages based on our algorithm.

Fig. 5. Full score for every page (pie chart omitted: final scores of pages 1-9)

According to the results obtained, the highest rank is that of page 4, followed by page 6, and so on. Given below are the first five ranks of our results.

Rank 1 (Page 4) – http://www.tahmahaltours-india.com
Rank 2 (Page 6) – http://www.tajmahalindia.net
Rank 3 (Page 3) – http://www.tajmahalagra.in
Rank 4 (Page 7) – http://www.tajmahalorg.uk
Rank 5 (Page 5) – http://en.wikipedia.org/wiki/Taj_Mahal

Fig. 6. Yahoo links ranked by the implemented algorithm

The Yahoo ranking order of these pages is listed in brackets. Our ranking order differs because it depends on the relevance of the content of each page.


5 Conclusions and Future Work

The above-stated algorithm has been successfully implemented: we took web pages retrieved for users' queries and optimized the results on the basis of the proposed algorithm. Optimization of search results can be taken a step further if the concept of semantic search, which takes a user's thought process into consideration, is also taken into account. The future scope would be to merge the above algorithm with a semantic search process to give the best possible optimized search results.

References

1. Chengling, Z., Jiaojiao, L., Fengfeng, D.: Application and Research of SEO in the Development of Web2.0 Site. In: Second International Symposium on Knowledge Acquisition and Modeling, pp. 236–238 (2009)
2. Brin, S., Page, L.: The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: WWW7: Proceedings of the Seventh International Conference on World Wide Web, pp. 491–495 (1998)
3. Tian, C., Tezuka, T., Oyama, S., Tajima, K., Tanaka, K.: Web Search Improvement Based on Proximity and Density of Multiple Keywords. In: Proceedings of the 22nd International Conference on Data Engineering Workshops, p. 133 (April 2006)
4. Zhang, Y.: Result Optimization Returned by Multiple Chinese Search Engines Based on XML. In: International Conference on Computational Intelligence and Software Engineering, pp. 1–3 (2009)
5. Kumar, S., Madhan, R.P., Vijayalakshmi, K.: Implementation of Two-Tier Link Extractor in Optimized Search Engine Filtering System. In: IEEE International Conference on Internet Multimedia Services Architecture and Applications (IMSAA), pp. 1–4 (2009)
6. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford University InfoLab (1999)
7. Hersovici, M., Jacovi, M., Maarek, Y., Pelleg, D., Shtalhaim, M., Ur, S.: The Shark-Search Algorithm – An Application: Tailored Web Site Mapping. In: Computer Networks and ISDN Systems, Special Issue on the 7th WWW Conference, Brisbane, Australia, vol. 30(1–7) (1998)
8. Ozel, S.A., Sarac, E.: Focused Crawler for Finding Professional Events Based on User Interests. In: 23rd International Symposium on Computer and Information Sciences, pp. 1–4 (2008)
9. Jo, T.: Clustering News Groups Using Inverted Index Based NTSO. In: First International Conference on Networked Digital Technologies, pp. 1–7 (2009)
10. Ozel, S.A., Sarac, E.: Search Engine Marketing as Key Factor for Generating Quality Online Visitors. In: Proceedings of the 33rd International Convention, p. 1193 (2010)