A Prototype of an Intelligent Search Engine Using Machine Learning Based Training for Learning to Rank

Piyush Rai, Shrimai Prabhumoye, Pranay Khattri, Love Rose Singh Sandhu, and S. Sowmya Kamath

Department of Information Technology, National Institute of Technology Karnataka, Surathkal, India
{piyushrocks.rai,shrimai19,prakhattri92,lovesingh25}@gmail.com, [email protected]
Abstract. Learning to Rank focuses on the application of supervised or semi-supervised machine learning techniques to develop a ranking model from training data. In this paper, we present a learning based search engine that uses two supervised machine learning techniques, selection based and review based, to construct a ranking model. Information retrieval techniques are used to retrieve relevant URLs by crawling the Web in a breadth-first manner; these URLs are then used as training data for the supervised and review based machine learning techniques to train the crawler. We used the gradient descent algorithm to compare the two techniques and to analyse the results.
1 Introduction
Ranking is a crucial functionality inherent to any application offering search features to users. Hence, a lot of research has been carried out in the area of ranking. However, it is also well known that it is difficult to design effective ranking functions for free text retrieval. Often, a ranking function that works very well with one application requires major modifications to achieve the same level of quality with another. Given a query, the documents relevant to it have to be ranked according to their degree of relevance to the query. The data retrieved by an algorithm should be ranked appropriately irrespective of the data set it has been extracted from; hence the system needs to be taught to rank, rather than relying on hand-designed intuition, which makes static ranking functions unsuitable for most database systems. Learning to rank is a relatively new field in which machine learning algorithms are used to generate an effective ranking function. It can be employed in a wide variety of applications in Information Retrieval (IR) and Natural Language Processing (NLP). A Learning to Rank algorithm should incorporate various tags/criteria to rank the links in order of their appropriateness, such as user ratings, time stamps, and the associated frequencies/weights of the information to be retrieved. Since the same criteria cannot be used for all kinds of data sets, such as political/sports news, advancements in IT, or historical events, more than one algorithm should be taken into account and the results of the most appropriate one should be used for building the raw data set. In this paper, we present the prototype of a Learning to Rank system wherein machine learning is used not only to teach the system to rank different data sets but also to teach it to choose the most appropriate algorithm. The paper is organized as follows: Section 2 presents a discussion of literature relevant to this field of work, Section 3 discusses the proposed system in detail, and Section 4 presents experimental results. Section 5 presents a comparative case study with conventional search engine results; conclusions and future work follow in Section 6, followed by references.

M.K. Kundu et al. (eds.), Advanced Computing, Networking and Informatics - Volume 1, Smart Innovation, Systems and Technologies 27, DOI: 10.1007/978-3-319-07353-8_9, © Springer International Publishing Switzerland 2014
2 Related Work
In any ranking system, the ranking task is performed by using a ranking model f(q, d) to sort the documents, where q denotes a query and d denotes a document. Traditionally, the ranking model f(q, d) is created without training. In the well known Okapi BM25 model [3], for example, it is assumed that f(q, d) is represented by a conditional probability distribution P(r | q, d), where r takes the value 1 or 0, denoting relevant or irrelevant, and q and d denote a query and a document respectively. In the Language Model for IR (LMIR), f(q, d) is represented as a conditional probability distribution P(q | d). These probability models can be calculated from the words appearing in the query and document, and thus no training is needed (only tuning of a small number of parameters is necessary) [1]. It is a well-established fact that most users are uninterested in more than the first few results on a search engine results page (SERP). Taking this into consideration, the learning-to-rank-from-user-feedback model [2] considers each query independently. Radlinski et al. [2] proposed a model in which the log files provide implicit feedback on only a few results at the top of the result set for each query. They referred to this as "a sequence of reformulated queries or query chains", and state that these query chains, available in search engine log files, can be used to learn better retrieval functions. Several other works present techniques for collecting implicit feedback from click-through logs [4, 8, 11]. All are based on the click-through rate (CTR), which exploits the fact that documents clicked on in search results are highly likely to be relevant to the query. This can be treated as a form of implicit feedback from users and used to improve the ranking function. Kemp et al. [4] present a learning search engine that is based on actually transforming the documents.
They too use the fact that results clicked on are relevant to the query, and append the query to these documents. However, other works [9, 10] showed that implicit click-through data is sometimes biased, as it is relative to the quality and ordering of the retrieval function; hence it cannot be considered absolute feedback. Some studies [13, 14] have attempted to account for the position bias of clicks. Carterette and Jones [12] proposed modeling the relationship between clicks and relevance so that clicks can be used to evaluate search engine performance without bias when editorial relevance judgments are lacking. Other research [5, 6, 14] attempted to model user click behavior during search so that future clicks may be accurately predicted from observations of past clicks. RankProp [7] is a neural net based ranking model. It uses two processes: an MSE regression on the current target values, and an adjustment of the target values themselves to reflect the current ranking given by the net. The end result is a mapping of the data to a large number of targets which reflect the desired ranking. RankProp has the advantage that it is trained on individual patterns rather than pairs; however, the authors do not discuss the conditions under which it converges, nor does it provide a probabilistic model.
3 Proposed System
Fig. 1 depicts the architecture of the proposed system. The Swappers search engine API is built using HTML5 and AJAX. The database is created using MySQL, and PHP is used to connect the API to the database.
Fig. 1. Proposed Training based Search Engine Prototype System
The search engine is trained using two supervised machine learning algorithms, namely selection based and review based. Tags/weights are calculated to rank the links in the training data set, with each algorithm incorporating different heuristics for this purpose. The weight of a link is determined by the frequency of the keyword in the content of the link and the position where it occurs. Heuristics such as whether the keyword is written in bold or italics; the position where it occurs, e.g. in the page title, headings, metadata etc.; and the number of outgoing links having the keyword in the URL are also considered while calculating the weight of the link.
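The weighting heuristics above can be sketched as follows. This is a minimal illustration only: the individual tag boosts (title, headings, bold/italic, metadata) and the outgoing-link bonus are assumed values chosen for the example, not the system's actual constants, and the `page` dictionary stands in for a parsed HTML page.

```python
# Illustrative link-weighting sketch. TAG_WEIGHTS values are assumptions,
# not the constants used by the actual prototype.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "b": 1.5, "i": 1.5, "meta": 2.0}

def link_weight(keyword, page):
    """page: dict with 'text', per-tag text, and a 'links' list of URLs."""
    kw = keyword.lower()
    # Base signal: keyword frequency in the page content.
    score = page.get("text", "").lower().count(kw)
    # Boost occurrences inside emphasised / structural tags.
    for tag, boost in TAG_WEIGHTS.items():
        score += boost * page.get(tag, "").lower().count(kw)
    # Bonus for outgoing links carrying the keyword in the URL.
    score += 0.5 * sum(kw in url.lower() for url in page.get("links", []))
    return score
```

A page whose title and outgoing links contain the keyword thus scores higher than one with the same raw keyword frequency in body text alone.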
In the review based module, the user selects the links he wants to train the search engine with and also rates those selected links. The review based module then normalizes the two weights: one given by the user and the other calculated from the keyword density. The weighting algorithm module finds a best-fit line using the gradient descent technique and assigns weights to the links which are relevant to the query.
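The normalization step can be sketched as below. The paper does not specify the exact scheme, so min-max scaling with an equal-weight average of the two signals is an assumed, plausible choice, not the prototype's actual formula.

```python
# Hedged sketch of the review based module's normalization: combine the
# computed keyword-density weight and the user rating into one training
# weight. Min-max scaling and the 50/50 blend are assumptions.
def normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)  # all equal: map to the midpoint
    return [(v - lo) / (hi - lo) for v in values]

def combined_weights(density_weights, user_ratings):
    d = normalize(density_weights)
    u = normalize(user_ratings)
    return [(a + b) / 2 for a, b in zip(d, u)]
```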
Fig. 2. The System’s Training Algorithm
In the gradient descent algorithm we are trying to find the minimum of some function f(x). Given some initial value x0 for x, we can change its value in many directions (proportional to the dimension of x: with only one dimension, we can make it higher or lower). To figure out the best direction in which to minimize f, we take its gradient ∇f (the derivative along every dimension of x). Intuitively, the gradient gives the slope of the curve at that x, and its direction points towards an increase in the function. So we change x in the opposite direction to lower the function value [6]. xk+1 = xk − λ ∇f(xk)
(1)
The step size λ > 0 is a small number that forces the algorithm to make small jumps, which keeps the algorithm stable; its optimal value depends on the function. Given stable conditions (a suitable choice of λ), it is guaranteed that f(xk+1) ≤ f(xk). The algorithm shown in Fig. 2 is used to train the system to rank the relevant documents. In this method, the training set is given as input, and a best-fit line passing through as many of the training set points as possible is found by minimizing the distance between the ordinate of each training set point and the line. After obtaining the equation of the line, the next time a ranking of a data set is required, the testing value is given as input and the system returns the rank of the testing set. The method to calculate the weight of a link is explained above. This weight is the x-coordinate; substituting it into the equation of the line yields the y-coordinate, which is the rank of the document to be returned. The link with the highest weight is labeled rank 1, and so on. The equation of the line is given by formula (2), where m is the slope of the line and c is a constant; both values are calculated by the gradient descent technique from the training set inputs. Rank = m * weight + c
(2)
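The training and prediction steps above can be sketched as follows. This is a minimal illustration that fits equation (2) by gradient descent on a mean-squared-error objective, which is an assumed (though standard) choice of distance; the training pairs in the test are illustrative, not the paper's data.

```python
# Sketch of the training step: fit Rank = m * weight + c by gradient
# descent on mean squared error, following the update rule of eq. (1).
def fit_line(weights, ranks, lr=0.1, epochs=5000):
    m, c = 0.0, 0.0
    n = len(weights)
    for _ in range(epochs):
        # Gradient of MSE with respect to m and c.
        grad_m = sum(2 * (m * w + c - r) * w for w, r in zip(weights, ranks)) / n
        grad_c = sum(2 * (m * w + c - r) for w, r in zip(weights, ranks)) / n
        # Update step x_{k+1} = x_k - lambda * grad f(x_k), eq. (1).
        m -= lr * grad_m
        c -= lr * grad_c
    return m, c

def predict_rank(m, c, weight):
    # Eq. (2): substitute the link's weight to obtain its rank.
    return m * weight + c
```

Since higher-weighted links receive lower (better) rank numbers, the fitted slope m is negative on typical training data.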
Once the line and its slope are computed, the Graph Visualization module takes as input the values of 'm' and 'c' calculated by the gradient module and draws the graph. The rank of the link is plotted on the Y-axis and the weight of the link on the X-axis, using equation (2). The graph module draws the graph using the HTML canvas element and JavaScript. Graph visualization helps in understanding the training process better and in comparing the two training techniques.
4 Experimental Results
The training of the selection based and the review based systems uses the same set of links, as selected by the user. Our database consists of a few thousand seed URLs, from which more links are discovered by crawling these web pages. Since the crawler uses the BFS technique, it fetches the links of relevant web pages up to a particular average depth h, providing an average of 2^(h+1) links, where h depends on the total number of links extracted from each seed URL; to fetch around 20-25 links per seed, h varies from 3 to 4 depending on the keyword. A thousand of these relevant URLs were chosen to be appended to the training dataset. A database query is used to fetch keyword-related links from the seed URLs. The system extracts links from the table named "training_data" in the database, which is generated based on the user's choices during training. Each of these links is then crawled by the various systems connected via a wireless network. The keyword is then searched for in the content of the page given by the link. Processing is done based on the places where it is present and the other parameters explained earlier. The result is updated back into the table to be looked up during further processing.
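The breadth-first expansion from seed URLs described above can be sketched as follows. The `fetch_links` argument is a hypothetical placeholder for the real page fetcher/parser; the depth cap corresponds to the h of 3-4 mentioned above.

```python
# Minimal BFS crawl sketch: starting from seed URLs, expand breadth-first
# up to max_depth, collecting every discovered URL once. fetch_links is a
# placeholder for the actual fetch-and-parse step.
from collections import deque

def bfs_crawl(seeds, fetch_links, max_depth=3):
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    collected = []
    while queue:
        url, depth = queue.popleft()
        collected.append(url)
        if depth >= max_depth:
            continue  # do not expand beyond the depth cap
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return collected
```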
Fig. 3. Rank and Weight of Links given by Selection based training
Fig. 4. Feedback form of the review based technique
Fig. 5. User rating and weight of Links given by Review based technique
In the selection based training, the ranking of the links was derived by taking only the weights of the links (including the frequency of the keywords in the pages and the tags in the HTML source code of the page) as found by the weighting algorithm. In the review based training, the ranking of the links was derived from both the weights of the links, as in the selection based technique, and the user-assigned weights (user ratings) of the links. The user rating is collected through the feedback form provided to the user. Finally, the values from these criteria are normalized to get the final weight for each link in the training set, which is then used to plot the graph using the gradient descent algorithm. It plots all the points (weights) and finds the best-fit line for each algorithm. The characteristics of each line (slope and constant) formed during training are stored and used afterwards. Fig. 6 shows the best-fit line produced by the selection based technique. The line corresponds to the data of ranks and weights in Fig. 3, and covers a short range of ranks. Fig. 7 shows the best-fit line produced by the review based technique. The line corresponds to the data of ranks and weights in Fig. 5, and covers a wide range of ranks. This makes the review based technique more useful than the selection based technique for training to rank a wide range of sets.
Fig. 6. Best-fit line from the selection based technique
Fig. 7. Best-fit line from the review based technique

5 Case Study
As part of this case study, we compared the relative ranking of the links given by the selection based and review based techniques with the search results of Google. Google produces its search results based on many parameters, so we have not compared the absolute ranks of the links. By comparing relative ranks, we compare the rank of a link given by our system with its rank in the Google search results, relative to the ranks of the other links. The links and ranks given in Fig. 3 and Fig. 4 are compared with the links and their relative ranks in Fig. 5. The relative ranking of the review based technique matches the relative ranking of the Google search, as seen in Fig. 8.
Fig. 8. Google search results for keyword “graph” (as on 01-Sept-2013)
6 Conclusions and Future Work
In this paper, we presented a learning based search engine prototype that uses machine learning techniques to infer an effective ranking function. The experimental results show that the review based training technique performs better and is more accurate than the selection based technique, because it considers and normalizes two weights: the weight assigned by the user and the keyword-density weight. We also tried to match the rankings given by the selection based and review based learning systems with Google's ranking. Google has many more results to offer, based on different parameters, so we compared the ranking of the links given by our system with the ranking of the same links on Google; the ranking produced by the review based technique matched that of Google. The system currently uses supervised training, which could be extended to semi-supervised or unsupervised training. Considering more parameters in the weight of a document would also give better results. As future work, we are focusing on using these trained systems to determine the rank of a new link. The values of the slope 'm' and the constant 'c' in equation (2) will be stored in the database for each query. When the rank of a new link is to be determined, its weight will be calculated by the weighting algorithm, and its rank obtained by substituting this weight into equation (2) using the values of 'm' and 'c' retrieved from the database.
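The planned per-query lookup can be sketched as below. The in-memory dictionary stands in for the MySQL table mentioned above, and all names are illustrative.

```python
# Sketch of the planned ranking step: per-query (m, c) pairs are stored
# after training and reused to rank a new link via eq. (2). The dict is a
# stand-in for the database table; names are hypothetical.
model_store = {}  # query -> (m, c)

def save_model(query, m, c):
    model_store[query] = (m, c)

def rank_new_link(query, link_weight):
    m, c = model_store[query]  # retrieve the trained line for this query
    return m * link_weight + c  # eq. (2)
```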
A Prototype of an Intelligent Search Engine Using Machine Learning Based Training
75
References

1. Croft, W.B., Metzler, D., Strohman, T.: Search Engines - Information Retrieval in Practice. Pearson Education (2009)
2. Radlinski, F., Joachims, T.: Query Chains: Learning to Rank from Implicit Feedback. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 239–248 (2005)
3. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley (1999)
4. Kemp, C., Ramamohanarao, K.: Long-term Learning for Web Search Engines. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) PKDD 2002. LNCS (LNAI), vol. 2431, pp. 263–274. Springer, Heidelberg (2002)
5. Caruana, R., Baluja, S., Mitchell, T.: Using the Future to "Sort Out" the Present: Rankprop and Multitask Learning for Medical Risk Evaluation. In: Advances in Neural Information Processing Systems, pp. 959–965 (1996)
6. Gradient Descent Methods, http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node87.html
7. Joachims, T., Granka, L., Pan, B., Hembrooke, H., Gay, G.: Accurately Interpreting Clickthrough Data as Implicit Feedback. In: Annual ACM Conference on Research and Development in Information Retrieval, pp. 154–161 (2005)
8. Tan, Q., Chai, X., Ng, W., Lee, D.-L.: Applying Co-training to Clickthrough Data for Search Engine Adaptation. In: Lee, Y., Li, J., Whang, K.-Y., Lee, D. (eds.) DASFAA 2004. LNCS, vol. 2973, pp. 519–532. Springer, Heidelberg (2004)
9. Carterette, B., Jones, R.: Evaluating Search Engines by Modeling the Relationship between Relevance and Clicks. In: Advances in Neural Information Processing Systems, vol. 20, pp. 217–224 (2008)
10. Craswell, N., Zoeter, O., Taylor, M., Ramsey, B.: An Experimental Comparison of Click Position-Bias Models. In: Proceedings of the International Conference on Web Search and Web Data Mining, pp. 87–94 (2008)
11. Dupret, G., Piwowarski, B.: A User Browsing Model to Predict Search Engine Click Data from Past Observations. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2008)
12. Richardson, M., Dominowska, E., Ragno, R.: Predicting Clicks: Estimating the Click-through Rate for New Ads. In: Proceedings of the 16th International Conference on World Wide Web, pp. 521–530 (2007)
13. Zhou, D., Bolelli, L., Li, J., Giles, C.L., Zha, H.: Learning User Clicks in Web Search. In: International Joint Conference on Artificial Intelligence (2007)
14. Ponte, J.M., Croft, W.B.: A Language Modeling Approach to Information Retrieval. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 275–281 (1998)