Improved Web Search Engine by New Similarity Measures

Vijayalaxmi Kakulapati¹, Ramakrishna Kolikipogu², P. Revathy³, and D. Karunanithi⁴

¹,² Computer Science Department, JNT University, Hyderabad, India
[email protected], [email protected]
³ Education & Research, Infosys Technologies Limited, Mysore, India
[email protected]
⁴ Information Technology Department, Hindustan University, Chennai, India
[email protected]
Abstract. Information retrieval (IR) is the process of managing a user's information needs. An IR system dynamically crawls items, stores and indexes them in repositories; this dynamic process facilitates retrieval of the needed information through a search process and customized presentation in the visualization space. Search engines play a major role in finding relevant items in huge repositories, and different methods are used to select the items to be retrieved. Surveys of search engines show that naive users are not satisfied with current search results; one reason for this is the machine's failure to capture the intention of the user. Artificial intelligence is an emerging area that addresses this problem by training the search engine to understand the user's interests from a training data set. In this paper we attack the problem with a novel approach using new similarity measures. The learning function we use maximizes the user's preferred information in the search process. The proposed function exploits the query log by considering the similarity between the ranked item set and the user's preferred ranking. The similarity measure facilitates risk minimization and is also feasible for large sets of queries. We demonstrate the framework by comparing algorithm performance, particularly for the identification of clusters using a replicated clustering approach. In addition, we provide an analysis of how clustering performance is affected by different sequence representations, different distance measures, the number of actual web user clusters, the number of web pages, similarity between clusters, minimum session length, the number of user sessions, and the number of clusters to form.

Keywords: Search engines, ranking, clustering, similarity measure, information retrieval, click-through data.
1 Introduction

Web users expect accurate search results. Most naive users are not fluent in expert terminology and therefore fail to build the right query for the search engine; this is one reason why search engines are limited in their ability to provide accurate results. Google, Yahoo, Bing, Ask, and the other search engines are, in this respect, still at a nascent stage, and research continues on giving end users better results in one click. Query expansion is one dimension of the search problem: new terms are added to the base query to form a new query that the search engine can understand better. We surveyed query expansion techniques in our previous work [7], and we also found it difficult to improve search results by adapting WordNet for term selection in query reformulation [8]. With this experience [7][8] we propose a novel technique to improve search results. One basic idea is to record the user's interaction with the search engine; this information, known as click-through data, can be used as feedback on the base results. It helps to learn the similarity between or among query keywords. First-hand information is always needed to decide the relevance of search results.

A similarity measure is a function that computes the degree of similarity between two vectors [6]. Different similarity measures are designed so that the function's output increases as an item becomes more similar to the query. Query-term-based query expansion measures the similarity between the terms of a query using the similarity propagation of clicked web pages [9], while document-term-based query expansion measures the similarity between or among document terms and search queries, based primarily on the search engine's query log [9]. The underlying idea is that web pages are similar if they are visited by users issuing related queries, and queries are similar if the corresponding users visit related pages.

The problem of web personalization has become popular and critical with the rapid growth in the number of WWW users. The process of customizing the web to meet the needs of specific users is called web personalization [10]: meeting users' needs with the aid of knowledge obtained from their navigation behavior. User visits are essentially sequential in nature, which calls for efficient clustering techniques; sequential data set similarity measures (S3M) capture both the order in which visits occur and the content of the web pages. We discuss how click-through information is used in Section 3, and we explore the importance of similarity measures in Section 2 as related work.
2 Related Work

Similarity measures (SM) are used to calculate the similarity between documents (or web items) and a search query pattern. An SM helps to rank the resulting items in the search process, providing the flexibility to present the more relevant retrieved items in the desired order. SMs are also used for item clustering and term clustering, as in statistical indexing and similarity weighting [11].
2.1 Similarity Measure as Inner Dot Product

The similarity SM between an item I and a query Q is measured as the inner dot product of their term vectors:

SM(I_j, Q) = \sum_{i=1}^{t} w_{ij} \, w_{iq}
where w_{ij} is the weight of term i in item j and w_{iq} is the weight of term i in the query q. The weights can be defined as

w_{ij} = \frac{TF_{ij}}{TOTF_i}, \qquad TF_{ij} = \frac{F_{ij}}{\max_k F_{kj}}

where TF_{ij} is the normalized frequency of occurrence of term i in item j, F_{ij} is its raw count, and TOTF_i is the total term frequency of term i over all items in the database. Sometimes less frequent terms in an item are more important than frequent ones; in that case the inverse item frequency (IIF) is taken into consideration, i.e., TF-IIF weighting:

w_{ij} = TF_{ij} \cdot IIF_i = TF_{ij} \cdot \log\left(\frac{N}{IF_i}\right)

where N is the total number of items and IF_i is the item frequency of term i (the number of items in which it occurs). Items and queries can be represented as binary vectors or weighted vectors. For binary vectors the inner dot product is the number of query terms matched in the item; for weighted vectors it is the sum of the products of the weights of the matched terms. The inner product is also used for clustering similar items:

SM(I_i, I_j) = \sum_k w_{ki} \, w_{kj}

The inner dot product is unbounded, which favors larger items with many unique terms. One drawback of this technique is that it measures how many query terms match the item terms but not how many do not match; when inverse similarity is used for relevance calculation, it therefore fails to provide good results.

2.2 Cosine Similarity Measure

The inner dot product similarity measure can be normalized by the cosine of the angle between the two vectors. The cosine similarity measure (CSM) is defined as

CSM(I_j, Q) = \frac{\vec{\imath}_j \cdot \vec{q}}{|\vec{\imath}_j|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2}\, \sqrt{\sum_{i=1}^{t} w_{iq}^2}}
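As an illustration of the measures above, the following Java sketch (illustrative only, not our system's actual code) computes the inner dot product, the cosine normalization, and the TF-IIF weight; items and queries are represented as sparse term-to-weight maps, a representation we assume here for convenience.

import java.util.Map;

/** Illustrative sketch of the similarity measures defined above. */
public class SimilarityMeasures {

    /** Inner dot product: sum of weight products over matched terms only. */
    public static double innerProduct(Map<String, Double> item, Map<String, Double> query) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            Double w = item.get(e.getKey());
            if (w != null) sum += w * e.getValue();
        }
        return sum;
    }

    /** Cosine similarity: inner product normalized by the vector lengths. */
    public static double cosine(Map<String, Double> item, Map<String, Double> query) {
        double normI = 0.0, normQ = 0.0;
        for (double w : item.values())  normI += w * w;
        for (double w : query.values()) normQ += w * w;
        if (normI == 0.0 || normQ == 0.0) return 0.0;
        return innerProduct(item, query) / (Math.sqrt(normI) * Math.sqrt(normQ));
    }

    /** TF-IIF weight: tf * log(N / itemFrequency), with N the number of items. */
    public static double tfIif(double tf, int n, int itemFrequency) {
        return tf * Math.log((double) n / itemFrequency);
    }

    public static void main(String[] args) {
        Map<String, Double> item  = Map.of("cse", 0.8, "faculty", 0.6, "experience", 0.4);
        Map<String, Double> query = Map.of("cse", 1.0, "faculty", 1.0);
        System.out.printf("dot=%.3f cosine=%.3f%n", innerProduct(item, query), cosine(item, query));
    }
}

Note that for non-negative weights the cosine value is bounded in [0, 1], which addresses the unboundedness of the plain inner product noted above.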
Fig. 1. Cosine Similarity Measure
Fig. 1 depicts the similarity between the terms of query Q and of items I1 and I2 as the angles θ1 and θ2, respectively. If the item term vector and the query term vector coincide on the same line, i.e., the angular distance is zero, the two vectors are similar [12].

Many similarity measures such as those above are used to match the terms of a user search against the repository item set. We use the same similarity measures for comparison, but the information being compared is not taken from the base search pattern alone: we extend the initial search pattern with the user's personalized information and other sources of information to match items, which improves the search results.

2.3 User Personalization

To improve the relevance of user queries, query logs and profiles are maintained as user logs. User personalization can be achieved by adapting the user interface or by adapting the content delivered to a specific user. Users have no common mechanism for judging the relevance of search results; ordering results by user interest gives a better understanding of query results for future analysis. In domain-specific search tools the relevance is closer to the ranking order and easier to judge. Ranking quality measures have been used to capture user behavior for future prediction [13]. With implicit feedback, whether a user was satisfied can be predicted through learning from indicative features, including the way a search session terminates and the time spent on result pages [14]. The behavior of the engine is observed by measuring the quality of ranking functions and observing natural user interactions with the search engine [15].
3 Click-through Data

The similarity of search queries can be observed by mining the growing amount of click-through data recorded by web search engines, which maintain logs of
the interactions between users and the search engines [16]. The quality of training data judged by humans has a major impact on the performance of learning-to-rank algorithms [17]. Employing human experts to judge the relevance of documents is the traditional way of generating training examples, but in practice it is difficult, time-consuming, and costly. Several observations [6][7][8][11][12][14][15] suggest that simple relevance judgments and ordinary personalization of user queries have little effect on improving search results. In this paper we propose a novel approach that selects an alternative source of user behavioral information: click-through data. Click-through data captures similar features from past user navigation and searches for alternative items to retrieve. This approach carries significant information for deciding whether the user's relevance feedback improves search results. We use different similarity measures for matching the click-through data added to the personalized query logs or simple query logs.

3.1 Click-through Data Structure

We used a manually collected dataset for the implementation setup. Our document collection consists of 200 faculty profiles with standardized attributes given as good meta-data. We begin by ranking our document set using a coarse-grain ranking algorithm; coarse-grain ranking works well when the items contain the required query terms. This algorithm scores each document by computing a sum of the matches between the query and the following document attributes: name of faculty, department or branch, qualification summary, experience track, subjects handled, publication details, references, and other details (a sketch of this attribute scoring follows Table 1). Submitting a query to the user interface returned the following results:

Query: "CSE Faculty with minimum of 5 years experience"

Table 1. Ranking order of retrieved results for the above query
1. Dr. B. Padmaja Rani, 16 years of teaching experience. http://www.jntuh.ac.in
2. Satya K., CSE Faculty. http://www.cmrcet.ac.in
3. Ramakrishna Kolikipogu, CSE Faculty, 5 years experience in teaching. http://www.jntuh.ac.in
4. Indiravathi, having 20 years experience, not working in CSE Department.
5. Prof. K. Vijayalaxmi, Faculty of CSE. http://www.jntu.ac.in
6. Megana Deepthi Sharma, studying CSE, having 5 years experience in computer operation.
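As referenced above, here is a minimal Java sketch of the coarse-grain scoring. It assumes each profile is a map from attribute name to text and counts one match per query token found in an attribute; the attribute keys and the token-overlap matcher are our illustrative assumptions, not the exact scoring function.

import java.util.List;
import java.util.Map;

/** Hedged sketch of coarse-grain ranking: sum of per-attribute query matches. */
public class CoarseGrainRanker {

    // Assumed attribute keys mirroring the profile fields listed above.
    static final List<String> ATTRIBUTES = List.of(
            "name", "department", "qualification", "experience",
            "subjects", "publications", "references", "other");

    /** Number of query tokens that occur in the attribute text. */
    static int match(String attributeText, String[] queryTokens) {
        int hits = 0;
        String text = attributeText.toLowerCase();
        for (String t : queryTokens) {
            if (text.contains(t.toLowerCase())) hits++;
        }
        return hits;
    }

    /** Coarse-grain score: sum of matches over all profile attributes. */
    static int score(Map<String, String> profile, String query) {
        String[] tokens = query.split("\\s+");
        int total = 0;
        for (String attr : ATTRIBUTES) {
            total += match(profile.getOrDefault(attr, ""), tokens);
        }
        return total;
    }
}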
From the profile document set used to test the model, Table 1 shows the results obtained on the first attempt of the query "CSE Faculty with minimum of 5 years experience". We found that results 1, 3, and 5 are relevant to the query while 2, 4, and 6 are not. Because of the blind similarity measure, the results are not fruitful, so we need user judgment to decide the relevance of the search results.
The user clicks are preserved for the future search process: if the user clicks the 3rd result first, that result should receive the first rank in the relevance list. To capture such click-through data, we built a click-through data structure as a triplet (q, r, c), consisting of the query q, the ranking r presented to the user, and the set c of links that the user clicks during each navigation.

3.2 Capturing and Storing Click-through Data

Click-through data can be captured with little overhead and without compromising the functionality and usefulness of the search engine; in particular, it adds no burden for the user compared with explicit feedback. The query q and the returned ranking r are recorded easily when the ranking (the result list) is displayed to the user, and a simple system can keep a log of clicks. The following system was used for the experiments in this paper. We recorded submitted queries as well as clicks on search results; each record included the experimental condition, the time, IP address, browser, a session identifier, and a query identifier. We define a session as a sequence of navigations (clicks or queries) between a user and the search engine in which less than 10 minutes passes between subsequent interactions. When an attribute in the query results is clicked, we record only clicks occurring within the same session as the query; this is important to eliminate clicks that appear to come from stored or re-retrieved search results. If the user continues searching for more than 10 minutes, the system is built so that it continues the recording process.

To capture the click-through data we used a middle (proxy) server that records the users' click information; it adds no overhead for the user during search. To give faster results we must keep processing time, i.e., overhead, low; recording generally increases overhead, but in our approach recording the click-through data and ranking information has no effect on operational cost. The click-through data is stored in a triplet-format data structure. The query q and rank order r are recorded when the search engine returns the initial results to the user. To record clicks, the middle server maintains a data store of the log file. User queries are given unique IDs; during search, these IDs are stored in the log file along with the query terms and the rank information r. The user need not think about storing the links displayed on the results page: the links direct the user to the proxy server and encode the IDs of the queries and the URLs of the suggested items. The query, ranking order, and URL address are recorded automatically through the proxy server whenever the user clicks a link, after which the server redirects the user to the clicked URL over HTTP. All of this is done with no extra operating cost, so the search engine presents results to the user without much extra time.
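To make the triplet and the session rule concrete, here is a minimal Java sketch; the class and field names are our own choices, and the actual system stores equivalent records (condition, time, IP address, browser, session and query identifiers) in the proxy-server log described above.

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of the click-through triplet (q, r, c) and session rule. */
public class ClickThroughLog {

    /** One triplet: query q, ranking r shown to the user, clicked links c. */
    public static class Triplet {
        final String query;                             // q: the submitted query
        final List<String> ranking;                     // r: URLs in presented order
        final List<String> clicks = new ArrayList<>();  // c: clicked URLs

        Triplet(String query, List<String> ranking) {
            this.query = query;
            this.ranking = ranking;
        }
    }

    private static final Duration SESSION_GAP = Duration.ofMinutes(10);

    private final List<Triplet> log = new ArrayList<>();
    private Instant lastInteraction = Instant.MIN;
    private long sessionId = 0;

    /** Record a query and its ranking; a gap over 10 minutes starts a new session. */
    public Triplet recordQuery(String query, List<String> ranking, Instant now) {
        if (Duration.between(lastInteraction, now).compareTo(SESSION_GAP) > 0) {
            sessionId++;  // new session after a 10-minute gap
        }
        lastInteraction = now;
        Triplet t = new Triplet(query, ranking);
        log.add(t);
        return t;
    }

    /** Record a click only if it falls within the same session as the query
     *  (a simplification of the same-session check described in the text). */
    public void recordClick(Triplet t, String url, Instant now) {
        if (Duration.between(lastInteraction, now).compareTo(SESSION_GAP) <= 0) {
            lastInteraction = now;  // ongoing interaction extends the session
            t.clicks.add(url);
        }
    }
}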
4 Ranking and Re-ranking

The ranking rule sets the rank score of an item equal to the number of times the same item was selected in the past. Starting from the initial ranking, we propose a new ranking algorithm to redefine the user's choice list. We use a probabilistic similarity measure and the cosine similarity measure for item selection and ranking in the base search.
1. Algorithm: Ranking (Relevant Item Set RIS)
Input: Relevant Item Set RIS
Output: Ordered item list with ranking r
Repeat
    if (Rel_i > Rel_j) then
        Swap(I_i, I_j)
    else
        Return item set I in ranking order
Until (no more items in RIS)
2. Algorithm: Re-ranking (Ranked Item Set S)
Input: Ranked Item Set S
Output: Ordered item list with re-ranking r
Repeat
    if (CTDRel_i > CTDRel_j) then
        Swap(I_i, I_j)
    else
        Return item set I in re-ranking order
Until (no more items in S)
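Both algorithms amount to ordering the item set by a relevance score: the base relevance Rel for Algorithm 1 and the click-through relevance CTDRel for Algorithm 2. A minimal Java reading is sketched below, assuming, per the ranking rule above, that CTDRel is the count of past selections of an item recorded in the click-through table; the method and map names are ours.

import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Sketch of Algorithms 1 and 2 as sorts by descending relevance score. */
public class Ranker {

    /** Algorithm 1: order items by descending base relevance Rel. */
    public static void rank(List<String> items, Map<String, Double> rel) {
        items.sort(Comparator.comparingDouble(
                (String i) -> rel.getOrDefault(i, 0.0)).reversed());
    }

    /** Algorithm 2: re-order items by descending click-through relevance,
     *  so an item clicked first in past sessions rises to the top. */
    public static void rerank(List<String> items, Map<String, Integer> ctdRel) {
        items.sort(Comparator.comparingInt(
                (String i) -> ctdRel.getOrDefault(i, 0)).reversed());
    }
}

The swap-based loops in the pseudocode perform the same reordering; a library sort simply makes the intent explicit.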
5 Experimental Setup

We implemented the above concept in Java. We took the item set S of 200 faculty profiles and created a click-through data set in a table: whenever the user clicks a choice among the items retrieved to the visualization space, we record the click-through data in the click-through data table. Using base Algorithm 1 we rank the items in the initial search process. We ran the search tool more than 100 times to build the click-through data table. To test Algorithm 2, we ran the search process again a number of times and observed that the results were more accurate than in the initial search. This process has several advantages: it is effortless to execute while covering a large collection of items, and the underlying search engines provide a foundation for comparison. The Striver meta-search engine works in the following way. The user types a query into the Striver interface; the query is then forwarded to MSN Search, Google, Excite, AltaVista, and HotBot. The pages returned by the search engines are analyzed, and the top 50 suggested results are extracted. For every link, the system displays the name of the page along with its uniform resource locator (URL). The results of our experiment are shown in Table 2.
Table 2. Experimental Results

Q. No | Query | Average Relevance | Average Improvement | Recommended Query from Click-through Data Table (Personalized Queries)
1 | CSE Faculty | 50.00% | 82.00% | CSE Faculty + 5 year experience
2 | Faculty with 5 years experience | 25.00% | 98.00% | CSE Faculty with min. of 5 years experience
3 | Experience Faculty in Computers | 60.00% | 79.00% | Experienced CSE Faculty
4 | Experience | 10.00% | 18.00% | Minimum Experience
5 | Computer Science Engineering | 15.00% | 50.00% | Computer Science Engineering Faculty
6 | Teaching Faculty | 40.00% | 66.00% | Teaching Faculty for CSE
7 | CSE | 20.00% | 50.00% | CSE Faculty
8 | Faculty | 12.00% | 50.00% | CSE Faculty
9 | CSE Faculty with good experience | 80.00% | 50.00% | CSE Faculty
6 Conclusion and Future Work

With our proposed model we measure the similarity of the query terms against the click-through data log table instead of directly comparing the whole data set. This new similarity gave positive results and improved recall along with precision. Query suggestion based on user click-through logs required little computational cost to implement. This paper enhances re-ranking and the suggestion of items for users to judge the results. Moreover, the algorithm does not rely on the particular terms appearing in the query and item set. Our experiments show that click-through data yields more closely related suggested queries, as seen in Table 2; queries 2 > 8 > 5 > 1 > 7 > 6 > 3 > 4 > 9 is the order of improved relevance. We also observed that when the feedback is not judged correctly it can even give negative results, as we experienced with query 9 in Table 2. To overcome such negative impact from the click-through history, we plan to enhance this base model more carefully by appending a semantic network and ontology, as our future research direction.
References
1. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Amsterdam (1999)
2. Beitzel, S.M., Jensen, E.C., Chowdhury, A., Grossman, D., Frieder, O.: Hourly analysis of a very large topically categorized Web query log. In: Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 321–328 (2004)
3. Shen, X., Dumais, S., Horvitz, E.: Analysis of topic dynamics in Web search. In: Proceedings of the International Conference on World Wide Web, pp. 1102–1103 (2005)
4. Kumar, P., Bapi, R., Krishna, P.: SeqPAM: A sequence clustering algorithm for Web personalization. Institute for Development and Research in Banking Technology, India
5. Cohen, W.W., Schapire, R.E., Singer, Y.: Learning to order things. Journal of Artificial Intelligence Research 10 (1999)
6. Shen, H.-z., Zhao, J.-d., Yang, Z.-z.: A Web mining model for real-time webpage personalization. ACM, New York (2006)
7. Kolikipogu, R., Padmaja Rani, B., Kakulapati, V.: Information retrieval in Indian languages: Query expansion model for Telugu language as a case study. In: IITA, IEEE, China, vol. 4(1) (November 2010)
8. Kolikipogu, R.: WordNet based term selection for PRF query expansion model. In: ICCMS 2011, vol. 1 (January 2011)
9. Vojnović, M., Cruise, J., Gunawardena, D., Marbach, P.: Ranking and suggesting popular items. IEEE Transactions on Knowledge and Data Engineering 21 (2009)
10. Eirinaki, M., Vazirgiannis, M.: Web mining for Web personalization. ACM Transactions on Internet Technology 3(1), 1–27 (2003)
11. Robertson, S.E., Sparck Jones, K.: Relevance weighting of search terms. Journal of the American Society for Information Science 27(3) (1976)
12. Salton, G., Fox, E.A., Wu, H.: Extended Boolean information retrieval. Communications of the ACM 26(12), 1022–1036 (1983)
13. Kelly, D., Teevan, J.: Implicit feedback for inferring user preference: A bibliography. ACM SIGIR Forum 37(2), 18–28 (2003)
14. Fox, S., Karnawat, K., Mydland, M., Dumais, S., White, T.: Evaluating implicit measures to improve web search. ACM Transactions on Information Systems (TOIS) 23(2), 147–168 (2005)
15. Radlinski, F., Kurup, M., Joachims, T.: How does clickthrough data reflect retrieval quality? In: CIKM 2008, Napa Valley, California, USA, October 26–30 (2008)
16. Zhao, Q., Hoi, S.C.H., Liu, T.-Y.: Time-dependent semantic similarity measure of queries using historical click-through data. In: Proceedings of the 15th International Conference on World Wide Web. ACM, New York (2006)
17. Xu, X.F.: Improving quality of training data for learning to rank using click-through data. In: Proceedings of WSDM 2010. ACM (2010)