Mining Web Graph For Query Recommendation - IJETTCS

4 downloads 10812 Views 525KB Size Report
diffusion which will create a web graph diffusion model. Then, query suggestion algorithm is applied to test the queries and to generate recommendation.
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: [email protected], [email protected] Volume 2, Issue 3, May – June 2013 ISSN 2278-6856

Mining Web Graph For Query Recommendation Mrs. Shweta Shinde1, Prof.Dr.V.H.Patil2 1,2

Matoshri College Of Engineering ,University Of Pune, Nashik-422013

Abstract: Recommender systems which comes under web content mining have become extremely important as user generated information is more free style and unstructured, that creates difficulties in mining important information from data sources. So, in order to satisfy the information requirements of Web users and to expand the user experience in many Web applications, recommendation system has been studied in academia and widely deployed in industry. This paper presents the system, in which data sources can be modeled in the form of various types of web graphs using DRec algorithm. These web graphs can be used for various recommendation systems. This framework is built upon heat diffusion which will create a web graph diffusion model. Then, query suggestion algorithm is applied to test the queries and to generate recommendation. The work is extended with personalized query recommendation and comparative analysis of algorithm and proves the results in the terms of accuracy. And this system can be used to most of the web graphs for query suggestions, image recommendation, and social as well as personalized recommendation.

Keywords: Web Mining, Recommendation, Heat Diffusion, Query Suggestion, Web Graph, DRec, Personalized Recommendation

1 Introduction Web mining is the important application of data mining technique to discover patterns from the Web. Considering to analysis targets, the web mining basically divided into three different types, which can be Web usage mining, Web content mining and Web structure mining. Content mining is the mining information, extraction and integration of valuable data, Knowledge and information from Web page contents. The heterogeneity and the nonexistence of structure that allows most of the ever expanding information and knowledge sources on the World Wide Web, like hypertext documents, creates automated discovery and organization, search and indexing tools of the Internet and the World Wide Web such as Web Crawler. A web graph describes the directed links between pages of the World Wide Web. The graph, in general, consists of some vertices, several pairs connected by edges. In a directed graph, edges are directed lines or arcs. A web graph is a directed graph, its vertices relates to the pages of the World Wide Web, the directed edge joins page X to page Y if there exists a hyperlink on page X, referring to page The applications of web graph can be as follows: Volume 2, Issue 3 May – June 2013

1. The web graph is used for computing the Page Rank of the WWW pages. 2. The web graph is used for computing the personalized Page rank. 3. The web graph can be used for detecting web pages of same topics, over graph theoretical properties. Recommender systems, is the system that predict the ’rating’ or the ’preference’ that a user can give to an item like music, books or social element for e.g. people. Recommender systems have become tremendously important for Web 2.0 related applications as user generated information is more free style and less structured , which creates difficulties in mining important information from data sources [8]. Recommender systems are based on Collaborative filtering approach. It is a technique that automatically gives interest of all active users. It collects rating information from other similar users. There are different challenges for a recommender system. 1. It is very difficult to recommend latent semantically relevant results to user. 2. It is difficult to use personalization features. 3. It is time consuming as well as inefficient to design different recommendation and applying these algorithms to different recommendation tasks.

2 Related Work In present mining system, different recommendation algorithms for different recommendation tasks have proposed. There are several approaches related to recommendation system, including collaborative filtering approach, query suggestion techniques and image recommendation methods, click through data analysis. G. Linden et.al. proposed traditional collaborative filtering algorithms that scales independently of the number of customers and also to the number of items in the product catalogue. This algorithm produces recommendations in real time, which can be scales to huge data sets, and generates high quality recommendations. These recommendations are computationally expensive. It is O (MN) in the worst case, Here M is the number of customers and N gives the number of product catalogue Items, since it examines M customers and up to N items for each customer [1].There are basically two types of collaborative filtering techniques.

Page 316

International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: [email protected], [email protected] Volume 2, Issue 3, May – June 2013 ISSN 2278-6856 a. Neighborhood based approach P. Resnick et.al.[2] proposed user based approach that predicts the ratings of active users based on the ratings of their similar users. It uses Pearson Correlation Coefficient algorithm (PCC) and the Vector Space Similarity algorithm (VSS) as the similarity computation methods. But, its disadvantage is accuracy of recommendation is poor. It suffers with scalability problem. In item based item-to-item collaborative filtering matches each of the users purchased and rated items to similar items, after that it combines those same items into a recommendation list. In order to determine the most-similar match for a given item, the algorithm constructs a same items table by finding the items that customers tend to purchase together. It is well with limited user data, producing high quality recommendation [2]. b. Model based approach A. Kohrs et.al. [3] proposed an model based approach in which clustering is used as it overcomes sparsity. Here, user and item both clustered independently into two cluster hierarchies tried to balance robustness and correctness of predictions, particularly when few data were available [3]. These methods all focus on fitting the user item rating matrix using low-rank approximations, and use it in making further predictions. But, these methods require the user-item rating matrix. Though, on the Web, in lots of cases, rating data is not always available since information on the Web is less structured and more diverse [2]. Hence, collaborative filtering algorithms cannot be applied to most of the recommendation tasks on the Web directly. Query suggestion is nearly related to query expansion or query substitution which generate extended query with new search terms to narrow down the scope of search [2]. P.A Chirita et.al. [4] proposed query expansion techniques that fall into two categories according to the way of implementation. First thing is before searching, add new terms to an original query, and the other is to formulate a new query based on retrieved documents of the previous search. Query suggestion aims to suggest or recommend full queries that have been formulated by previous users so that query integrity as well as coherence are preserved in the suggested queries. Main disadvantage is that they ignore the rich information and it considers only queries that appear in the query logs, possibly losing the opportunity to recommend highly semantically related queries to users [2]. H. Cao et.al. [5] proposed context aware query suggestion, by observing the context in the concept sequence suffix tree, this approach suggests queries to the user in a context-aware manner. It tests user approach on a large scale search logs of a commercial search engine containing 1:8 billion search queries, and 2:6 billion clicks, and almost 840 million query sessions. The practical result clearly says that this approach Volume 2, Issue 3 May – June 2013

outperforms two baseline methods as coverage as well as the quality of suggestions required [2]. It may recommend some long tail queries (infrequent queries) to users [8]. However, this is also the disadvantage of query expansion approach since sometimes it may accidentally rank the infrequent queries mostly in the results and potentially downgrades the ranks of the highly related queries [8]. Several different ranking methods using random walks can also be carried out into the query suggestion tasks. HITS Page Rank algorithms are used to compute the stationary distribution of a smoothed Markov chain [9]. Personalized Page Rank generalizes page Rank by smoothing the Markov chain with a query-specific jumping. It uses probability vector instead of a uniform vector, therefore, it is often used for query-dependent ranking [10]. HITS is an alternative query dependent ranking algorithm. It is used to compute hub and authority scores in an iterative way [8]. But, main disadvantage of this algorithm is they are generally unavailable. Click through data analysis is used for optimizing web search results or rankings. It determines generalization or specialization technique. E. Agichtein et.al. [6] proposed an approach, In that web search logs are used to organize the clusters of search results effectively. This approach utilizes click through data for training, that is the query-log used for search engine in connection with the log of links the users clicked on in the presented ranking [6]. Such click through data is available heavily and can be recorded at very low cost .But there is a possibility of malicious noisy clicks in this method and it’s applicable to small data sets. Y.H.Yang et.al. [7] proposed method that asks users to rate some images to the user based on tastes of the users. But, as it is a context-based technique, the computational complexity is high and it cannot scale to large data sets [2].So, most of these recommendation problems have some of the common features, so a general framework is necessary to unify the recommendation tasks on the Web. Moreover, most of the existing methods are complicated and require tuning a large number of parameters. So, Hao Ma, Irwin King et.al. [2] proposed a general framework for the recommendation on the web based upon the heat diffusion on both directed and undirected graph. Web graph is a graph induced by web pages as vertices and hyperlinks as edges. Different data sources used for recommendation system can be modeled in the form of various types of graphs using DRec algorithm which means recommendation by diffusion. A graph diffusion model based on heat diffusion can be applied to the web graphs [2]. Query Suggestion is a technique widely employed by commercial search engines to provide related queries to users information need. Query suggestion graph can be generated based on the click through data of the AOL search engine [8]. Click through data record the activities of various Web users, which reflect their interests. The latent semantic relationships between users and queries as Page 317

International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: [email protected], [email protected] Volume 2, Issue 3, May – June 2013 ISSN 2278-6856 well as queries and clicked Web documents [2]. After this query suggestion algorithm is applied to generate recommendation that suggests queries which are literally similar to the test queries. It also delivers latent semantically relevant recommendations [2]. The designed recommendation algorithm is scalable to very large datasets. The main aim of the project is to know more about recommendation system, its need and how to generate a general framework for mining Web graphs for recommendations. A novel framework for mining web graphs and to generate recommendation is designed. The method is comprised of two steps: 1. A novel graph diffusion based on heat diffusion 2. Query suggestion algorithm after that Query suggestion algorithm is applied to test the queries and to generate recommendation. Required operating environment will be Windows based OS which includes IDE as Net Beans and JDK 1.6 and MySQL.

3 Programmer’s Design The system architecture is shown in figure 1.It consist of two main module, first is a novel graph diffusion based on heat diffusion and second is query suggestion algorithm. In first module, a general framework for the recommendation on the web based upon the heat diffusion on both directed and undirected graph is designed. Web graph is a graph induced by web pages as vertices and hyperlinks as edges. Different data sources used for recommendation system can be modeled in the form of various types of graphs using DRec algorithm which means recommendation by diffusion. A graph diffusion model based on heat diffusion can be applied to the web graphs.

Figure1 System Architecture In second module Query suggestion graph can be generated based on the click through data. Click through data record the activities of Web users, which reflect their interests and the latent semantic relationships between users and queries as well as queries and clicked Web documents. After this query suggestion algorithm is applied to generate recommendation that suggests queries which are literally similar to the test queries .But it also provides latent semantically relevant recommendations. The designed recommendation algorithm is scalable to very large datasets. Input: A query to search engine Volume 2, Issue 3 May – June 2013

Outcome: 1. Increasing accuracy for query suggestion. 2. Providing semantically relevant query recommendations. 3. With increasing size of sub graph query recommendation accuracy increases, hence increases the performance. 4. Graph showing accuracy of query recommendation. Success Definition of work: 1. Data sources can be modelled in the form of web graphs using heat diffusion process[2]. is provided. 5. Query-URL can be ranked related to the original query using heat values of URL. 6. Results of comparison of heat ranking method with other web search results ranking approaches can be shown [2]. A novel graph diffusion model based on heat diffusion can be applied to both undirected graphs and directed graphs. This chapter includes system architecture and how to infer the parameter based on the graph structure is proposed. 3.1 Heat Diffusion Heat diffusion is a physical phenomenon. In a medium, heat always flows from a position with high temperature to a position with low temperature. Heat diffusion-based approaches have been successfully applied in various domains such as classification and dimensionality reduction problems [9]. In proposed system it uses heat diffusion to model the similarity information propagation on Web graphs. In Physics, the heat diffusion is always performed on a geometric manifold with initial conditions [9]. However, it is very difficult to represent the Web as a regular geometry with a known dimension. This motivates us to investigate the heat flow on a graph. The graph is considered as an approximation to the underlying manifold, and so the heat flow on the graph is considered as an approximation to the heat flow on the manifold. 3.2 Diffusion on undirected graphs The Web graphs are directed, especially in online recommender systems or knowledge sharing sites. Every user in knowledge sharing sites typically has a trust list. The users in the trust list can influence this user deeply [8]. These relationships are directed since user a is in the trust list of user b, but user b might not be in the trust list of user a. At the same time, the extent of trust relations is different since user ui may trust user uj with trust score 1 while trust user uk only with trust score 0.2. Hence, there are different weights associated with the relations. Based on this consideration, we modify the heat diffusion model for the directed graphs as follows. Consider a directed graph G{V,E,W}where V is the vertex set, and V={v1, v2,….Vn} . W={Wij| where wij is the probability that edge(vi,vj) exists } or the weight that is associated with this edge. E={(vi, vj) there is an edge from vi to vj and wij > 0g is the set of all edges. On a directed graph G(V,E), in the pipe vi, vj), heat flows only from vi to vj. Suppose at time t, each node vi receives RH Page 318

International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: [email protected], [email protected] Volume 2, Issue 3, May – June 2013 ISSN 2278-6856 =RHi, j, ∆t) amount of heat from vj during a period of ∆t. We make three assumptions: 1) RH should be proportional to the time period ∆t ; 2) RH should be proportional to the heat at node vj; and 3) RH is zero if there is no link from vj to vi. As a result, vi will receive amount of heat from all its neighbors that point to it. At the same time, node vi diffuses DH (I, t, ∆t) amount of heat to its subsequent nodes. 1. The heat DH(I, t, ∆t) should be proportional to the time period ∆t. 2. The heat DH(I, t, ∆t) should be proportional to the heat at node vi. 3.Each node has the same ability to diffuse heat. 4. The heat DH(I, t, ∆t) should be proportional to the weight assigned between node vi and its subsequent nodes. In the case that the out degree of node vi equals zero, we assume that this node will not diffuse heat to others. To sum up, the heat difference at node vi between time t+ ∆t and t will be equal to the sum of the heat that it receives, deducted by what it diffuses. This is formulated as

Solving it,

Where

The values on the edges in specify how many times a query is clicked on a URL. We cannot simply employ the bipartite graph extracted from the click through data into the diffusion processes since this bipartite graph is an undirected graph, and cannot accurately interpret the relationships between queries and URLs [8][9]. Hence, we convert this bipartite graph. In this converted graph, every undirected edge in the original bipartite graph is converted into two directed edges. The weight on a directed query-URL edge is normalized by the number of times that the query is issued, while the weight on a directed URL-query edge is normalized by the number of times that the URL is clicked. After the conversion of the graph, we can easily design the query suggestion algorithm. 3.3 Query Suggestion Algorithm 1: A converted bipartite graph G= (V+UV*,E) consists of query set V+ and URL set V* The two directed edges are weighted using the method introduced in previous section. 2: Given a query q in V+, a sub graph is constructed by using depth-first search in G. The search stops when the number of queries is larger than a predefined number. 3: As analysed above, set ⍺=1 and without loss of generality, set the initial heat value of query qfq(0)=1 (the choice of initial heat value will not affect the suggestion results). Start the diffusion process using

4: Output the Top-K queries with the largest values in vector f (1) as the suggestions. ,

,

,

i=j,

otherwise

3.3 Graph Construction For the query-URL bipartite graph, consider an undirected bipartite graph Bql=(Vql,Eql), where Vql=Q U L, Q ={q1,q2,… qn}, and L={l1, l2, . . .. lp}. Eql= (qi lj) there is an edge from qi to lj is the set of all edges. The edge (qj,lk) exists if and only if a user ui clicked a URL lk after issuing a query qj.

Query suggestion algorithm not only suggests queries which are literally similar to the test queries, but also provides latent semantically relevant recommendations. For instance, if the test query is a technique, such as “java,” we recommend “virtual machine” and “sun micro systems.” The latter suggestion is the company who owns the Java Platform, and the former suggestion is a key feature of the Java programming language. They both have high latent semantic relations to the query “java.” This chapter shows different results of Query suggestion algorithm.

4 RESULTS This chapter shows different results of Query suggestion algorithm.

Figure 2 Graph construction for query suggestion. (a) Query-URL bipartite graph. (b) Converted query-URL bipartite graph. Volume 2, Issue 3 May – June 2013

4.1 Impact of Parameter α The parameter α plays an important role in our method [8]. It controls how fast heats will propagation on the graph.

Page 319

International Journal of Emerging Trends & Technology in Computer Science (IJETTCS) Web Site: www.ijettcs.org Email: [email protected], [email protected] Volume 2, Issue 3, May – June 2013 ISSN 2278-6856

Figure 3 Impact of α We can observe that the best setting is 1. If we choose relatively smaller thermal conductivity, the performance will drop since some relevant nodes cannot get enough heat. On the other hand, if we choose relatively larger value of α, the performance will also decrease. This is because if the heat transfers very fast, some irrelevant nodes will gain more heat, hence will hurt the performance. 4.2 Impact of the Size of Sub graph Web graphs are normally very huge; it will perform on query suggestion algorithm on a sub graph extracted the original graph [8]. Hence, it is necessary to evaluate how the size of this sub graph affects the recommendation accuracy.

Figure 4 Impact of the size of sub graph (α= 1) Figure 4 shows the performance changes with different sub graph sizes. We observe that when the size of the graph is very small, like 500, the performance of our algorithm is not very good since this sub graph must ignore some very relevant nodes. When the size of sub graph is increasing, the performance also increases.

KNOWLEDGE AND DATA ENGINEERING VOL. 24, NO. 6, JUNE 2012 [3] Resnick, N. Iacovou, M. Suchak, P. Bergstrom,and J. Riedl , ”Grouplens:An Open Architecture for Collaborative Filtering of Netnews”,CSCW ’94: Proc.ACM Conf. Computer Supported Cooperative Work,1994 [4] A. Kohrs and B. Merialdo, ”Clustering for Collaborative Filtering Applications,”,Proc.Computational Intelligence for Modelling, Controland Automation (CIMCA), 1999 [5] A. Chirita, C.S. Firan, and W. Nejdl, ”Personalized Query Expansion for the Web”,SIGIR ’07:Proc.30th Ann Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 7-14, 2007. [6] Cao, D. Jiang, J. Pei, Q. He, Z. Liao, E. Chen, and H. Li,”Context Aware Query Suggestion by Mining Click-Through and Session Data”KDD ’08: Proc. 14th ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining, pp. 875-883, 2008. [7] E. Agichtein, E. Brill, and S. Dumais ”Improving Web Search Ranking by Incorporating User Behaviour Information”, SIGIR ’07: Proc. 29th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 19-26, 2006. [8] H. Ma, H. Yang, M.R. Lyu, and I. King, ”,SoRec:Social Recommendation Using Probabilistic Matrix Factorization”,CIKM ’08: Proc. 17th ACM Conf. Information and Knowledge Management,pp. 931- 940,2008. [9] H. Yang, I. King, and M.R. Lyu, ”Diffusion Rank:Possible Penicillin for Web Spamming”,SIGIR ’07:Proc. 30th Ann. Int’l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 431438, 2007. [10] B.Velez, R.Weiss, M.A. Sheldon, and D.K. Gifford,”Fast and Effective Query Refinement”,ACM IGIR Forum, vol. 31(SI) pp. 6-15, 1997. [11] Q. Mei, D. Zhou, and K. Church, ”Query Suggestion Using Hitting Time”, CIKM ’08: Proc. 17th ACM Conf. Information and Knowledge Management, pp.469-477, 2008

5. REFERENCES [1] G. Linden, B. Smith, and J. York, ”Amazon.com Recommendations: Item-to-Item Collaborative Filtering”,IEEE Internet Computing pp. 76-80, 2003. [2] Hao Ma, Irwin King, ”Mining web graph for recommendations”, IEEE TRANSACTIONS ON

Volume 2, Issue 3 May – June 2013

Page 320

Suggest Documents