Tree-Based Ontological User Profiling for Web Search Optimization

Sachin R. Joglekar¹ and Dr. Mangesh Bedekar²

¹ BITS-Pilani, K. K. Birla Goa Campus, Goa, India
Email: [email protected]
² Maharashtra Institute of Technology, Kothrud, Pune, India
Email: [email protected]
Abstract--Today, the internet represents the most important source of knowledge for the average user. As the number of users turning to search engines for information keeps increasing, web search personalization is becoming an important domain of research in information retrieval. It is essential to adapt the process of web search to the needs of every individual, without explicit actions on the user's side. In this paper, we demonstrate an implicit client-side method for user profiling using an ontological tree, for efficient contextualization of a user's interests. Our system builds 'part-profiles' of a user over time, given his web usage data. Each of these part-profiles pertains to one domain of interest of the said user. We also demonstrate the usage of these profiles for search result re-ranking by exploiting a tree-traversing algorithm, to ensure faster knowledge gain from the user's perspective. The methods described in this paper can also be used for other forms of web search optimization, such as query expansion.

Index Terms--User Profiling, Web Search, Ontology
I. INTRODUCTION

A knowledge worker can be defined as someone “whose paid work involves significant time spent in gathering, finding, analyzing, creating, producing or archiving information” [13]. Essentially, every web user today is a knowledge worker, using search engines extensively to obtain the required knowledge from the web. Considering the ever-increasing amount of text data being added to the internet, it is important to make it easier for the user to reach the information he wants, as efficiently as possible. With this aim in mind, the one-size-fits-all approach [14] is no longer recommended, especially considering the polysemous nature of words in the English vocabulary. The need to model the web experience of an individual as per his specific interests is more important than ever. The best possible way to understand what a user needs from his web experience is relevance feedback for web pages from the user himself [4]. However, a user is
usually reluctant to explicitly provide such information on a regular basis [15]. Hence, implicit user modeling is required, to avoid the extra hassle on the user's side. Moreover, fast processing of a user's interests is only possible if it is achieved at the individual client side, helping distribute the necessary computations instead of burdening the server-side infrastructure. To achieve the aim of implicit client-side search optimization, we take the help of the DMOZ (directory.mozilla.org) ontology. An ontology is defined as “an explicit specification of a conceptualization: the objects, concepts, and other entities that are assumed to exist in some area of interest and the relationships that hold among them” [6, 16]. With the idea of accurately representing a user's information needs and search interests, we propose a system to implicitly build an ontological tree of 'part-profiles'. The ontology used for this purpose is the DMOZ/ODP (Open Directory Project) ontology [1]. Every part-profile, as constructed by this method, denotes an area of the user's interest and helps in optimizing web search pertaining to it. The aforementioned tree of ontological part-profiles is updated dynamically as the user visits various web pages. Such dynamic generation of user profiles has been explored in the UCAIR system [3]. However, UCAIR requires the definition of 'logical sessions' of a user's web activities, where every distinct session pertains to a unique profile. Our method overcomes this shortcoming by extracting important keywords from every web page viewed by a user, and exploiting the ODP hierarchy to understand the field of study of a web search [12]. The keywords extracted from a web page form a weighted vector in the bag-of-words representation of the page, which is then categorized into one of the user's profiles. Ref. [17] explored a similar method for ontological categorization of visited pages. We improve upon their method by using a tree-based algorithm over the profile tree for fast categorization. This avoids the need to compare a page vector with every category in the tree. When the similarity of a page vector to its predicted category falls below a predefined threshold, a new part-profile is initialized for the user.
Personalization of web search is attained by constructing an expanded version of the user's original query vector using the summaries of the top search results from a search engine. This vector is then either classified to a part-profile of the user (if the similarity is above the threshold) or added to a newly generated profile based on the DMOZ ontology. Then, result re-ranking is done on the top search results [18] by computing their final rank based on an algorithm inspired by [11]. The remaining sections are organized as follows. In Section 2, we discuss the related work. In Section 3, we present our complete methodology for dynamic user profiling using the ODP ontology tree. In Section 4, we demonstrate the experimental results obtained from implementing the framework described in Section 3. In Section 5, we present discussions and critique points pertaining to our work. Section 6 concludes the discussion of our research.

II. RELATED WORK
Ref. [2] very appropriately defines the two pillars of personalized search: contextualization and individualization. Contextualization refers to the definition and representation of the background information regarding a user's work and the nature of his search interests. Individualization refers to the distinguishing factors and data pertaining to the user's own unique information needs. Usage of an ontology for user profiling satisfies both these requirements: the extensive vocabulary of DMOZ aids adequate representation of a user's background (contextualization), while the hierarchy of domains in the ontology tree helps in accurate descriptions of the topics of interest (individualization). Ref. [9] recognized the four main types of context in web applications: domain, location, data, and user. With our system, we exploit data context (mining the ODP wealth of knowledge for content) and user context (focusing on the web pages viewed by the user). Ref. [2] uses the ODP ontology to classify every page clicked by the user into one of its categories, leading to the construction of individual domain profiles, similar to our work. However, their method differs from ours in the algorithms followed for keyword extraction and categorization. Ref. [3] proposed the development of UCAIR, a decision-theoretic framework for implicit user modeling based on query expansion and click-through information. As search engine queries are usually short, the user models based on them are understandably impoverished. Hence, query expansion is utilized to enrich the notion of what the user desires. The need to perform the user modeling at the client side is also stressed, to reduce the server load drastically [3]. However, as mentioned before, to exploit previous queries and the corresponding
click-through data, UCAIR needs to judge whether two adjacent queries belong to the same logical session. Our framework overcomes this hurdle by making use of the DMOZ ontology to define the 'topic' of a search session, and by extracting keywords from every document to automatically group together information from various visited web pages. Moreover, UCAIR focuses on user modeling based on short-term contexts, while our framework remembers the user's interests over a long period of time. Ref. [19] evaluates various ontology-based methods to model user interests, primarily by changing the type of inputs provided for mapping interests onto the ODP ontology. The two best inputs to consider while understanding the user's information needs are (i) the text content of the web pages dwelled upon and frequently visited by the user, and (ii) the queries input by the user, in an expanded format [19]. We utilize the page content data for building the part-profiles and the expanded-query approach to aid search result re-ranking. Ref. [20] suggested building a user profile comprising data from previous search query terms. However, since the interpretation of a query by a search engine may be erroneous, we focus on web pages viewed by a user, weighted by the amount of attention given to each of them. A reasonably good method to expand a search query is to derive additional terms from the summaries of the top 10-50 search results, depending on the available resources, as explained in [3, 19]. Query expansion not only resolves the problem of the poor vocabulary of the original query, but also avoids the issues occurring due to word mismatch [19]. For categorizing a vector using the DMOZ category tree, the OBIWAN system maps every visited web page to five different categories [5]. Categorization is done by comparing a test vector with the vector corresponding to every single category [17]. Our system improves upon this by using a tree-traversing algorithm to reduce the number of comparisons and make the system more efficient. To improve accuracy, we classify every page to only one unique category. Cosine similarity is generally used while mapping a page or query vector onto the appropriate ODP category [3, 11, 17, 19]. In this work, we use a modified version of the Tanimoto coefficient, also known as the extended Jaccard coefficient [22], as an indicator of vector-vector similarity. The method usually used for re-ranking of search results is sorting in descending order of cosine similarity. UCAIR utilizes expanded queries and viewed document summaries to dynamically re-rank unseen search results [3]. Ref. [11] proposed computing the final rank of every search result as a linear combination of the contextual rank (computed using cosine similarity with a context vector) and the keyword rank (the original rank provided by the search engine). This allows flexibility in
assigning weightage to both types of ranks to obtain a 'hybrid' rank, focusing on keyword as well as contextual similarity.

III. THE PROFILING MODEL
A. Single Document Keyword Extraction

We focus on the extraction of keywords from every web page/document read by the user instead of grouping web pages into sessions. To achieve this aim, we use a modified version of the single-document keyword extraction algorithm proposed in [12]. Ref. [10] summarizes the five groups of keyword weighting methods: (i) a word which appears in a document is likely to be an index term, (ii) a word which appears frequently in a document is likely to be an index term, (iii) a word which appears only in a limited number of documents is likely to be an index term for these documents, (iv) a word which appears relatively more frequently in a document than in the whole database is likely to be an index term for that document, and (v) a word which shows a specific distributional characteristic in the database is likely to be an index term for the database. Ref. [12] focuses on keywords which show a specific distributional characteristic in the database. This is done by weighting them according to the degree of bias of their co-occurrences with the frequently appearing keywords in the document. The co-occurrence bias can be measured quantitatively by calculating the statistical value of χ² as an index of bias [12]. This value is given by the equation

χ²(w) = Σ_{g ∈ G} (freq(w, g) − n_w p_g)² / (n_w p_g),   (1)
where w is a certain keyword, g is a keyword from G (the set of most commonly occurring keywords), freq(w, g) denotes the number of co-occurrences of keywords w and g, n_w denotes the number of keywords present in the sentences in which w occurs, and p_g denotes the percentage of keywords that occur in sentences in which g is found. We construct G by considering the top one-third of the most frequently occurring keywords in a given page. Ref. [12] suggested using only G for co-occurrence measures, since only the frequently occurring words in a document strongly determine the occurrence distributions. In (1), n_w p_g denotes the 'expected' co-occurrence of w and g, while freq(w, g) is the actual value of this quantity. Hence, larger values of χ²(w) indicate a stronger bias in the co-occurrence of w with the keywords in G. According to experimental results, this method proves comparable to the popular tf-idf (term frequency-inverse document frequency) scale for domain-independent keyword extraction. Since the domain of knowledge a web page belongs to is not known beforehand, this weighting scheme is very useful in our methodology.
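To make the computation in (1) concrete, the following is a minimal Python sketch of the χ² co-occurrence measure. It assumes the page has already been split into pre-processed sentences (stop words removed, terms stemmed), counts freq(w, g) at the sentence level, and uses illustrative helper names such as chi_squared_scores that are not part of the original implementation.

    from collections import Counter

    def chi_squared_scores(sentences):
        # `sentences` is a list of lists of pre-processed terms, one per sentence.
        term_freq = Counter(t for sent in sentences for t in sent)
        # G: the top one-third most frequently occurring terms on the page.
        ranked = [t for t, _ in term_freq.most_common()]
        G = set(ranked[:max(1, len(ranked) // 3)])
        total_terms = sum(len(sent) for sent in sentences)
        # p_g: fraction of all terms that lie in sentences containing g.
        p = {g: sum(len(s) for s in sentences if g in s) / total_terms for g in G}
        scores = {}
        for w in term_freq:
            # n_w: number of terms in the sentences in which w occurs.
            n_w = sum(len(s) for s in sentences if w in s)
            chi2 = 0.0
            for g in G:
                if g == w:
                    continue
                freq_wg = sum(1 for s in sentences if w in s and g in s)
                expected = n_w * p[g]
                if expected > 0:
                    chi2 += (freq_wg - expected) ** 2 / expected
            scores[w] = chi2
        return scores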
However, a strong co-occurrence bias alone may not be sufficient to measure the importance of a keyword in a document. For example, words that occur very rarely in a document may also end up having very biased co-occurrence distributions, yet may not be important in defining its context. Therefore, we have modified the importance index by also considering the weighting methods defined in [10], namely term frequencies. Hence, the quantity we propose for measuring the importance of keywords in a single document is

I(w) = tf(w) · χ²(w),   (2)
where tf(w) denotes the term frequency of word w in the given document, and χ²(w) is given by (1). This ensures that only those keywords which not only occur frequently but also show a specific distributional characteristic in the document are given the most importance. The keywords considered for this algorithm were extracted from a text document in the form of unigrams and bigrams using the Apriori algorithm [21], after preliminary pre-processing such as stop-word removal followed by stemming [7]. Experimental evidence shows that this method is very effective in extracting keywords from single web pages, without requiring a corpus. We thus construct a bag-of-words representation of any document, with the weightage given to every keyword equal to I(w) from (2).
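As an illustration of this step, the sketch below builds the I(w)-weighted bag-of-words vector from pre-processed sentences, reusing the chi_squared_scores helper from the previous sketch. For brevity it scores unigrams only, whereas our implementation also mines frequent bigrams with the Apriori algorithm and treats them the same way; the final normalization to unit length is an assumption made so that pages of different sizes remain comparable.

    import math
    from collections import Counter

    def page_vector(sentences):
        # Term frequencies of the pre-processed unigrams.
        tf = Counter(t for s in sentences for t in s)
        # Co-occurrence bias scores from (1), via the earlier sketch.
        chi2 = chi_squared_scores(sentences)
        # Importance index I(w) = tf(w) * chi^2(w) from (2).
        vector = {w: tf[w] * chi2.get(w, 0.0) for w in tf}
        # Normalize to unit length.
        norm = math.sqrt(sum(v * v for v in vector.values())) or 1.0
        return {w: v / norm for w, v in vector.items()}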
B. Construction of the ODP Tree

The ODP ontology, also known as DMOZ, is the largest human-edited directory of internet sites [1]. We build a hierarchical tree of vectors from this ontology, with each node denoting one category and its children denoting its sub-categories. Ref. [5] showed that, while generating representative vectors for DMOZ categories, 5-60 documents' worth of data provided quite accurate results. The method we follow is similar to the one used by [17] for generating individual category profiles, but with an added recursive step at the end. We first construct a tree data structure based on the DMOZ ontology. Then, for every category, we append all its URL descriptions together (not considering its sub-categories). This piece of text can now be considered a 'document' describing that particular topic. We then apply the single-document keyword extraction algorithm described in Section 3.A on the category document to get a bag-of-words vector representation for it. Before storing it in the corresponding tree node, the vector is normalized to avoid bias towards categories with a larger number of associated URL descriptions. After this initialization, we run a recursive algorithm to update every node's vocabulary based on that of its children. This method can be expressed as

V_{N,final} = normalized(V_{N,initial} + Σ_{n ∈ C(N)} ω(n) · V_{n,final}),   (3)

where N denotes a particular node, V_{N,initial} denotes the initial vector of node N constructed from its URL descriptions, V_{N,final} denotes the final vector of N, C(N) denotes the set of its children, and ω(n) is the weightage factor for the vector of child n. For our experiments, we take ω(n) = 1. Thus, we construct the DMOZ ontology tree used for user profiling, with each category node assigned an appropriate bag-of-words representation in the vector space model.
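A minimal sketch of this construction is given below, assuming each category's initial vector has already been built with the keyword extraction of Section 3.A. The CategoryNode class, the sparse-dictionary vector representation, and the normalize helper are our own illustrative choices; the bottom-up pass implements (3), with ω(n) supplied as a function (ω(n) = 1 by default).

    import math

    def normalize(vec):
        # Scale a sparse bag-of-words vector to unit length.
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        return {w: v / norm for w, v in vec.items()}

    class CategoryNode:
        def __init__(self, name, initial_vector):
            self.name = name
            self.vector = initial_vector   # built from the category's URL descriptions
            self.children = []

    def propagate_vocabulary(node, omega=lambda n: 1.0):
        # Bottom-up pass implementing (3): fold every child's final vector,
        # weighted by omega(child), into its parent, then re-normalize.
        combined = dict(node.vector)
        for child in node.children:
            child_final = propagate_vocabulary(child, omega)
            for w, v in child_final.items():
                combined[w] = combined.get(w, 0.0) + omega(child) * v
        node.vector = normalize(combined)
        return node.vector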
C. Vector Categorization

We now describe the tree-traversing vector categorization method used in our framework. Consider a normalized document vector Vtest constructed using the method of Section 3.A. To classify it into a category of the DMOZ ontology, we use the algorithm described below (given in the form of Python code):

    # Greedily descend from the root towards the child most similar to vtest.
    currentnode = Top
    current_similarity = Tsim(Top.vector, vtest)
    while True:
        if currentnode.number_of_children == 0:
            break
        similarities = {}
        for child in currentnode.children:
            similarities[child] = Tsim(child.vector, vtest)
        max_child = max(similarities, key=similarities.get)
        max_similarity = similarities[max_child]
        if max_similarity > current_similarity:
            currentnode = max_child
            current_similarity = max_similarity
        else:
            break

The above algorithm, at every node, traverses to the child with the highest similarity to the test vector, provided that similarity is greater than the similarity with the current node. This ensures that the tree traversal stops at the category node with the highest similarity to the test vector. The measure of similarity used in this algorithm is a modified version of the extended Jaccard coefficient, also known as the Tanimoto coefficient, given by
Tsim(V_N, V_test) = (φ(V_N, V_test) · V_test) / (|φ(V_N, V_test)|² + |V_test|² − φ(V_N, V_test) · V_test),   (4)

where

φ(V_N, V_test) = {w | w ∈ V_N and w ∈ V_test}.   (5)

This form of the Tanimoto coefficient proves to be better suited for vector categorization. The basic idea is to consider only that component of the node's vector which spans the dimensions (keywords) the test vector comprises. Since a category vector may potentially contain concepts from various other sub-domains apart from the user's interest, considering all of them in the calculation is futile.
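Assuming bag-of-words vectors are stored as sparse Python dictionaries mapping keywords to weights, as in the earlier sketches, a direct sketch of (4) and (5) is:

    def Tsim(v_node, v_test):
        # phi: the node vector restricted to keywords that also occur in the
        # test vector, as in (5).
        phi = {w: v_node[w] for w in v_node if w in v_test}
        dot = sum(phi[w] * v_test[w] for w in phi)
        phi_sq = sum(v * v for v in phi.values())
        test_sq = sum(v * v for v in v_test.values())
        denom = phi_sq + test_sq - dot
        return dot / denom if denom else 0.0

This Tsim can be plugged directly into the traversal algorithm above; the zero-denominator guard is our own addition for empty or disjoint vectors.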
D. Building of Part-Profiles

Consider a user with part-profile tree P, who views a document with normalized vector d (extracted using the method of Section 3.A). Suppose it gets classified to the part-profile P_i, based on the algorithm discussed in Section 3.C. It is to be noted that this classification is done using the part-profile tree, not the original ODP ontology tree. If Tsim(d, V_Pi) > ε, where ε is a pre-defined threshold value, then the part-profile P_i is enriched using the formula

P_{i,new} = normalize(Ω(d, n_i) · d + (1 − Ω(d, n_i)) · V_Pi),   (6)

where n_i is the total number of web pages visited by the user pertaining to part-profile P_i, and Ω(d, n_i) is the 'importance' function of the total time a user spends on the document and of n_i. In our framework, we take Ω(d, n_i) = 1/n_i, to ensure that recently viewed documents affect the category profile strongly enough to yield short-term contextualization. Finally, the value of n_i is incremented by 1 to denote the visit to one more web page. If Tsim(d, V_Pi) < ε, a new part-profile is initialized for the user, using the original DMOZ ontology tree and the categorization algorithm presented in Section 3.C. The category vector for the new part-profile is again initialized using (6) (with n_j = 1, where j is the computed category). We empirically find the optimum value of the threshold ε by considering the similarities of DMOZ categories with irrelevant test vectors. Thus, we construct a part-profile tree for every user based on the web documents read by him. Initially, the profile tree is empty. This tree grows and is enriched based on the concepts present in the DMOZ ontology and the data extracted from the web pages visited by the user.
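The following sketch summarizes this update logic under the representations used in the earlier sketches (CategoryNode, normalize, Tsim). The classify_into_tree helper stands in for the traversal of Section 3.C, the add method is a hypothetical helper for attaching a new part-profile, and the numeric value of the threshold ε is purely illustrative, since it is tuned empirically.

    EPSILON = 0.3   # illustrative value; the threshold is tuned empirically

    def update_part_profiles(profile_tree, odp_tree, d):
        # d: normalized bag-of-words vector of the page just viewed.
        p = classify_into_tree(profile_tree, d)        # traversal of Section 3.C
        if p is not None and Tsim(p.vector, d) > EPSILON:
            omega = 1.0 / p.n                          # Omega(d, n_i) = 1 / n_i
            keys = set(d) | set(p.vector)
            p.vector = normalize({
                w: omega * d.get(w, 0.0) + (1 - omega) * p.vector.get(w, 0.0)
                for w in keys
            })
            p.n += 1                                   # one more page for this profile
        else:
            # Start a new part-profile from the best-matching ODP category;
            # with n_j = 1, equation (6) reduces to the page vector itself.
            category = classify_into_tree(odp_tree, d)
            new_profile = CategoryNode(category.name, normalize(dict(d)))
            new_profile.n = 1
            profile_tree.add(new_profile)              # hypothetical insertion helper
        return profile_tree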
E. Search Result Re-ranking

For search result re-ranking, we first employ query augmentation on the original query to enrich it. This is done based on the keyword-similar search results provided by the search engine. We append the summaries of the top 10 results yielded by the search engine into a document, and then apply our single-document keyword extraction algorithm on it to form an expanded query vector q. This query vector is categorized into the user's part-profile tree using the algorithm described in Section 3.D. Let the calculated part-profile be P_i. If Tsim(q, V_Pi) > ε, then the categorical vector used for result re-ranking is given as
V_C = normalize(q + V_Pi).   (7)
If Tsim(q, V_Pi) < ε, a new part-profile is generated using the original ODP tree, and V_C is again calculated using (7). Let the bag-of-words vector representation of the i-th search result be V_i. Here, i is the keyword rank of the particular search result, as given by the search engine. The 'category ranks' of the search results are calculated by sorting them in decreasing order of Tsim(V_i, V_C). Let the category rank of the i-th search result be j. We calculate the resultant rank of this page by computing a linear combination of the keyword rank and the category rank [11]. Thus, the final rank R is given by

R = α · i + (1 − α) · j,   (8)

where α is the weightage given to the keyword rank, and 0 ≤ α ≤ 1.
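To tie the pieces together, the sketch below re-orders a result list with the hybrid rank of (8), using the normalize and Tsim helpers from the earlier sketches. The representation of results as a list of bag-of-words vectors in engine order, and the value of α, are illustrative assumptions.

    ALPHA = 0.5   # illustrative weightage for the keyword rank

    def rerank(results, q, profile_vector):
        # results: bag-of-words vectors of the top results, in engine order,
        # so the keyword rank of results[idx] is i = idx + 1.
        keys = set(q) | set(profile_vector)
        v_c = normalize({w: q.get(w, 0.0) + profile_vector.get(w, 0.0)
                         for w in keys})               # categorical vector of (7)

        # Category rank j: position after sorting by decreasing Tsim with V_C.
        by_similarity = sorted(range(len(results)),
                               key=lambda idx: Tsim(results[idx], v_c),
                               reverse=True)
        category_rank = {idx: j + 1 for j, idx in enumerate(by_similarity)}

        # Final rank R = alpha * i + (1 - alpha) * j; smaller means better.
        final_rank = {idx: ALPHA * (idx + 1) + (1 - ALPHA) * category_rank[idx]
                      for idx in range(len(results))}
        return [results[idx] for idx in sorted(final_rank, key=final_rank.get)]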