implemented in a Personalized Abstract Search Services (PASS) system, a Web-based, domain-specific search engine for searching abstracts of research ...
A Fuzzy Ontology-based Abstract Search Engine and Its User Studies Dwi H. Widyantoro
John Yen
Department of Computer Sciences Texas A&M University College Station, TX 77843-3112, USA
School of Information Sciences and Technology Pennsylvania State University University Park, PA 16801-3857, USA
Abstract- Query refinement can help users find information on the Internet more effectively. This feature has been implemented in a Personalized Abstract Search Services (PASS) system, a Web-based, domain-specific search engine for searching abstracts of research papers. The system uses a fuzzy ontology of term associations to support the feature. The ontology is automatically built in two stages using information obtained from the system's collection. A preliminary user study reveals that query refinement is one of the most important features of the system.
information services, the system fully utilizes the current state-of-the-art of search, personalization and recommendation technology. Currently the system maintains 584 abstracts of paper from four IEEE Transactions in its collection. Figure 1 depicts the general architecture of PASS.
1.
INTRODUCTION
Users of a search engine are often required to precisely articulate their query terms when locating a piece of information of their interests on the Internet The most appropriate terms to use in the search query are the ones that will pull the most relevant information to the top list of search results. However, these terms are often difficult to find and in many cases they do not even exists. One of the solutions to this problem is providing a query refinement mechanism [1]. A system adopting this technique responds to a user's initial query by suggesting a list of related, broader or narrower terms. The user can then select one of the suggested terms as an alternate query to replace the old one so that it will improve the likely to find the information needed. This paper presents our work in developing PASS system (http://nnc.cs.tamu.edu) whose one of its features addresses the above problem. The system uses a fuzzy ontology of term associations as one of the sources of its knowledge to suggest alternative query terms. The fuzzy ontology is automatically generated from the information contained in its collection. In current implementation, our system allows its users to refine their query, to easily navigate through document contents and links, and to ask suggestions to a list of documents relevant to their interests. Additionally, the users can select the appropriate presentation of search results and manage the documents they find. Section 2 describes the overview of PASS application. The sequence of processes to acquire various types of knowledge needed by the system will be presented in Section 3. We then describe in Section 4 our two-stage approach to automatically generate a fuzzy ontology of term associations for query refinement. Section 5 discusses the deployment of PASS and reports some of the results, followed by conclusions in Section 6. 2. SYSTEM OVERVIEW PASS is a domain-specific search engine providing abstracts of paper from mostly IEEE Transactions sponsored by the IEEE Neural Network Council (NNC). Initially launched as a pilot effort by NNC to provide Web-based
User Query
Standard Keywordbased Abstract Retrieval
Search Results Clustering
Collaborative Abstract Retrieval Users’ Searching Activities Term associations Ontology
Collaborative Query Refinement Standard Query Refinement Recommender
Search Results Presentation
Content-based Recommender
Fig. 1. PASS General Architecture.
In response to a user’s query, the system displays search results based on a standard keyword-based retrieval and provides sets of terms lists for query refinement. If requested by the user, PASS will also provides suggestions to a list of most relevant documents related to the user's query and a set of related terms based on a collaborative filtering, a technique that recommends papers that might be of interests to the user based on the user's interests and other users with similar interests [2]. The system's suggestions are indicated by the star icons to the left of paper’ titles. The system can group the search results according to the similarity of their content upon the user's request. PASS will display the abstract of a paper if the user clicks the title of corresponding paper in the search results. If available, the system will recommend a list of related papers, judged based on the content similarity of papers in the collection. The system also identifies keywords that occur in the abstract body and provides hyperlinks including the author-supplied keywords, if available, to search documents based on these keywords. These hyperlinks can be viewed as other means of providing query refinement that is based on the content of documents. The authors of a paper are also hyper-linked, allowing users to search all papers written by the corresponding authors. Other features provided by PASS include personalized folder and shopping-cart-like facilities.
Document Features Extractor
Keywords Tagger
Document Features Vectors
Documents Collection
Documents Similarity Networks Constructor
Documents Similarity Networks
Tagged Document Collection
Keywords List
Keywords Extractor
Word Net
Fuzzy Ontology Constructor Term Associations Ontology
Fig. 2. PASS Knowledge Acquisition.
3. PASS KNOWLEDGE ACQUISITIONS In order to provide the above features, PASS requires two types of knowledge: (1) the structure of the domain and (2) the knowledge of user judgment about the relevance of articles to user queries. Figure 2 describes the processes in the system to acquire the first type of knowledge. As mentioned earlier, the collection used by the system contains abstracts of paper from several IEEE Transactions. These abstracts have been manually typed and tagged based on their title, authors, publication date, abstract body, and author-supplied keywords if provided. Simply stated, the original abstracts in the collection are exact copies of their corresponding hardcopies version. In addition to keywords provided partly by the original collection, PASS extracts additional keywords from the abstract bodies and from the titles of paper. More specifically, the system initially extracts a set of two or three consecutive words exhibiting the following grammar patterns: (noun | adj) (noun | noun noun) or adj adj noun. The WordNet dictionary [3] is used to tag each word during this process. The system then eliminates a two- or three-word phrase that includes at least one word not contained in a control list. The control list has a set of authorsupplied words and a set of pre-defined words in the domain. An exception is given to hyphenated words if they do not appear in the control list since this type of words is a good indicator of a keyword component. The keywords list and the original abstract collection are the central knowledge for PASS to extract other knowledge. For example, the keywords list is then used to generate fuzzy ontology of term associations for query refinement and features vectors of documents for clustering documents and constructing document similarity networks. In current implementation, the system uses scatter-gather algorithm [4]
for documents clustering and employs cosine similarity measure [5] for constructing document similarity networks. The content-based filtering in PASS is based on the networks of document similarity. The keywords list is also used to identify and to tag keywords occurring within a document. The second type of knowledge is gathered from users' implicit feedback. In general, the system uses this knowledge to provide personalized recommendation and to build collaborative filtering data. Due to the space and the paper scope limitations, we will not discuss any further detail about this knowledge. 4. FUZZY ONTOLOGY OF TERM ASSOCIATIONS While many techniques are adopted by PASS, this paper only describes the construction of fuzzy ontology and its use for query refinement. The fuzzy ontology provides information about sets of terms with broader and narrower meaning. A term u is narrower than a term v if the semantic meaning of v subsumes or covers that of u. For example, fuzzy controller has narrower meaning than fuzzy logic, while the former term contains a broader meaning than non-linear system. The definition of broader term is the inverse of narrower term definition. This section describes our approach in constructing a fuzzy ontology based on fuzzy narrower and broader term relations. A more detail version of the technique is described elsewhere [6]. 4.1 Narrower and Broader Term Relations The basic construct needed to build a fuzzy ontology of term associations is knowledge about relations between two terms. More specifically, we use the definition of fuzzy narrower term relation as described in [7] to automatically extract the fuzzy relations between two terms from a set of text documents collection. Let C = (a1, a2, … an) be a collection of articles ai, where each article a = (t1, t2, … tm) is represented by a set of terms tj. Let occur(tj,a) denote the occurrence of tj in article a. The membership degree of occur(tj,a) is defined by µoccur(tj,a) = f(|tj|), which in general is a function of term's frequency of occurrence. In the information retrieval community, the function f can be viewed as the normalized within document term weighting method. Let NT(ti, tj) denote that ti is narrower than tj. The membership degree of NT(ti, tj), represented by µNT(ti, tj), is defined by µoccur (ti, a) ⊗ µoccur (tj, a)
µNT(ti, tj) = a ∈ C
a∈C
µoccur (ti, a)
(1)
where ⊗ denotes a fuzzy conjunction operator. In current implementation, we use a binary function for the f function so that µoccur(tj,a) = 1 if the occurrence frequency of tj > 0, or µoccur(tj,a) = 0 otherwise. Using the binary function will turn Equation 1 into the same equation regardless the selection of fuzzy conjunction operator.
Let BT(ti, tj) denote that ti is broader than tj. Because the notion of broader term is basically the inverse of narrower term notion, the membership value of BT(ti, tj) is derived from the membership value of NT(ti, tj) as follows [7]: µBT(ti, tj) = µNT(tj, ti)
(2)
4.2 Fuzzy Ontology Construction In general, the fuzzy ontology construction can be grouped into two stages. The first stage is to create a full ontology from fuzzy narrower term relations. The full fuzzy ontology is then pruned by eliminating unnecessary relations in the second stage. 4.2.1
Building Fuzzy Ontology from Fuzzy Narrower Terms
A fuzzy ontology can be constructed by first calculating the membership values of two NT relations for each pair of two distinct terms (e.g., µNT(ti, tj) and µNT(tj, ti) ). A set of tests is then applied to select an NT relation that will be incorporated in the fuzzy ontology. The selection process at this stage will eliminate redundant, less meaningful and unrelated term relations. For each pair of terms ti and tj we can have µNT(ti, tj) and µNT(tj, ti) where it is high likely that µNT(ti, tj) ≠ µNT(tj, ti). The meaning of both membership values is basically equivalent so that whenever one computes µNT(ti, tj) and µNT(tj, ti), one will get a redundant information. Eliminating one of the term relations will not reduce the information conveyed but will reduce by half the size of the storage needed to maintain the same amount of information. In constructing the fuzzy ontology, we retain the fuzzy narrower term relation that has higher membership value, and delete the relation with lower membership degree. This decision strategy, in general, will choose a positive relation if one of the membership values is far apart from the other (e.g., 0.2 and 0.9). In the case where the two membership values are close to each other (e.g., 0.8 and 0.9), the strategy will choose a stronger relation. After the removal of redundant term relations, lots of potential less meaningful information intact, and this is indicated by narrower term relations whose membership degrees are very small (negative relation). Since a negative term relation might confuse the meaning of term relations in the fuzzy ontology, the stronger NT relation between two terms whose membership value is small should not be included in the ontology. Applying α-cut to the stronger term relation can exclude this type of term relation and will also automatically eliminate unrelated terms. Two terms are unrelated if the membership degree of their relation is zero (e.g., both terms never co-occur). 4.2.2 Fuzzy Ontology Pruning In the first stage of fuzzy ontology construction, the elimination of an NT relation is based on an analysis between two NT relations generated by two distinct terms. Although
the relation between two terms in the ontology is strong enough, the resulting ontology is still large and may contain unnecessary relations. For example, the more general terms might still contain links to many or almost all other narrower terms, which many of the links are unnecessary since they can be indirectly connected through intermediate nodes. The second stage of fuzzy ontology creation attempts to reduce the excessive relations by conducting an analysis over the set of relations involving more than two distinct terms. For each NT(ti, tj), a search procedure is performed to find an indirect path connecting terms ti and tj (e.g., the sequence of NT(ti, tm1), NT(tm1, tm2), …, NT(tmn, tj) ). Let P be a set of NT relations representing the indirect path for NT(ti, tj), and NT(P) represents an alternate NT relation of (ti, tj) through the indirect path. The membership degree of NT(P) can be defined as the minimum membership value of NT relations in P. The idea of redundant information elimination as used during the first stage of ontology construction can now be applied to determine whether or not NT(ti, tj) should be removed. If P for a given NT(ti, tj) exists and the NT(ti, tj) relation is not stronger than NT(P), then NT(ti, tj) relation could be removed from the ontology description. Unlike in the first stage that removes NT(tj, ti) whenever NT(ti, tj) is stronger, during the second stage, however, all NT relations in P remain in the ontology description if NT(ti, tj) is stronger than NT(P). Figure 3 depicts the algorithm summarizing the ontology creation described above. The corresponding fuzzy ontology of broader term relations can be derived directly from the output of algorithm in Figure 3 using Equation 2. 1. Definition and Initialization • T = {t1, t2, … tm} is a list of distinct terms extracted from a collection C. • 0 ≤ α ≤ 1 is the alpha cut. • Ontology = {} is an empty ontology description. 2. First Stage. For each ti, tj ∈ T and ti≠ tj • Calculate µNT(ti, tj) and µNT(tj, ti) using Equation 1. • Select NT(tp, tq) subject to (tp, tq) = arg max {µNT(ti, tj), µNT(tj, ti)} µNT(tp, tq)α or µNT(tp, tq) ≥ α • Add {NT(tp, tq), µNT(tp, tq)} into Ontology.
3. Second Stage. For each NT(ti, tj) ∈ Ontology • Find P = { NT(ti, tm1), NT(tm1, tm2), …., NT(tmn, tj) } • (tp, tq) = arg min NT {µNT(tm, tn), NT(tm, tn) ∈ P} • if µNT(ti, tj) ≤ µNT(tp, tq) then remove NT(ti, tj) from the Ontology. Fig. 3. The algorithm of automatic fuzzy ontology generation.
4.3 Partial Results This sub-section reports some facts about a fuzzy ontology of term associations generated by the algorithm in Figure 3. From the 584 abstracts in the collection, the system extracted
3443 keywords. The ontology generated using these keywords contained 3438 terms that had links to broader terms, which meant that there were five broadest terms. Two of the broadest terms were fuzzy systems and systems man cybernetics, which were the names of the IEEE Transactions. Note that the names of Transactions were included in the abstract files. The phrase fuzzy systems turned out to be one of the broadest terms because most abstracts in the collection (48%) came from the IEEE Transactions on Fuzzy Systems. Although the portion of abstracts from the IEEE Transactions on Systems, Man and Cybernetics was relatively small (9%), the uniqueness of the name of this journal caused the corresponding keyword one of the broadest terms. About 796 keywords had links to narrower terms, leaving the rest 2647 keywords became terminal nodes in the fuzzy ontology of narrower terms. These results are expected since there must be much more keywords that are more specific and appear less frequently in abstract files. As another example of the partial results, keywords such as pattern recognition, set theory, and classification problem were found to be the list of broader terms for fuzzy clustering, while the list of its narrower terms included fuzzy c-means method, fuzzy clustering PCM, fuzzy set theory, gradient descent, input space and objective function. The complete ontology involved the 3443 keywords. V. USER STUDIES Within a year since its first release in November 1999, PASS had attracted 360 users to register with the system and more than 50% of them had been active users. Active users were those who searched using the system and explored the search results so that enabled the system to record at least one complete trail of searching activities. On the average, each active user left three complete trails, which was significant enough considering the domain of PASS that was very restrictive. Non-active users might have used the system but they did not explore the search results so that the system did not record their activities. Table I illustrates the composition of queries submitted to PASS. Among the search queries that lead to the finding of documents, 29% of them come from query refinement using keywords provided by the system. This indicates that query refinement is a frequently used feature and thus must be a useful feature. Table II describes the break down of the query refinement. The usage percentage of keyword links in the abstract body (60%) is much larger than that of authorsupplied keyword links (14%). It shows that users tend to use keywords contained in the abstract body for query refinement. This also indicates that identifying keywords beyond the ones given by the author is a very important task. Table III breaks down further the sources of query refinement from the ontology of term associations. We believe that the results shown in Table III are highly correlated with the content and the size of collection rather than with a specific system and users behavior. The more frequent use of broader terms indicates that more queries initially generate a few search results and these results might be affected by the small
size of current PASS collection. The few search results then might have encouraged users to broaden their search scope using one of the broader terms suggested by the system. TABLE I THE COMPOSITION OF QUERY SUBMITTED TO PASS. Source of Query
User-Supplied Query
Query Refinement
71%
29%
TABEL II THE COMPOSITION OF QUERY REFINEMENT. The source of query refinement Keywords links in abstract body Ontology of term associations Author-supplied keywords links Author name links
Percentage 60% 18% 14% 8%
TABLE III THE COMPOSITION OF QUERY REFINEMENT FROM FUZZY ONTOLOGY The source of ontology Percentage Related terms* 37% Broader terms 47% Narrower terms 16% * Related terms refer to any terms that contain the original query terms.
VI. CONCLUSIONS This paper has described PASS system and an automatic technique to build a fuzzy ontology of term associations for query refinement in the system. In constructing the fuzzy ontology, the notions of fuzzy narrower term adopted by the system and the technique developed to prune the ontology appear to work well. By evaluating the users' searching activities since its first release, we find that each feature in the system significantly contributes in the searching process showing at least the potential of its usability. We believe that the effectiveness of users' searching activities can be significantly improved by combining the use of PASS features.
REFERENCES [1]
Vèlez, B., Weiss, R., Sheldon, M.A. and Gifford, D.K. (1997). Fast and Effective Query Refinement. In Proc. of the 20th ACM SIGIR Conf. on Research and Development in Information Retrieval, 6 – 15.
[2]
Billsus, D. & Pazzani, M. (1998). Learning Collaborative Information Filters. In Proc. of the Intl. Conference on Machine Learning. Morgan K. Publishers, Madison, WI.
[3]
Fellbaum, C. (1998). WordNet, an Electronic Lexical Database, MIT Press.
[4]
Cutting, D.R., Krager, D.R., Pederson, J.O. and Tukey, J.W. (1992). Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections. In Proceedings of the 15th ACM SIGIR Conf. on Research and Development in Information Retrieval, 318-329.
[5]
Salton, G., and McGill, M.J. (1983). Introduction to Modern Information Retrieval. New York, McGraw-Hill.
[6] Widyantoro, D.H., and Yen, J. (2001). Using Fuzzy Ontology for Query Refinement in a Personalized Abstract Search Engine. Proc. of Joint 9th IFSA World Congress and 20th NAFIPS International Conference, July 25-28, Vancouver, Canada. [7]
Miyamoto, S. (1990). Fuzzy sets in information retrieval and cluster analysis. Boston : Kluwer Academic Pub.