A Search Interface for Context Based Categorization of Result Sets

Korinna Grabski and Andreas Nürnberger
Otto-von-Guericke University of Magdeburg
Universitätsplatz 2, 39106 Magdeburg, Germany
Phone: +49-391-67-11309, Fax: +49-391-67-12018
email: {kgrabski,nuernb}@iws.cs.uni-magdeburg.de
ABSTRACT: Besides search and indexing algorithms, the user interface is a key component of a search engine. If the interface allows the user to express his or her query appropriately, the query can be defined more precisely and thus the user's search process requires fewer steps. Moreover, additional user specific information that describes or classifies the retrieved documents can significantly reduce the amount of time a user needs to search for relevant documents in the obtained result set. In this paper, we present an interface of a search engine that adapts to specific categories of interest of a user. The main idea is that a user always searches in a specific context. If the system has detailed information about this context, the search can be supported better, as results can be presented in terms of categories that are defined within this context. KEYWORDS: web search, interface design, text classification, adaptation
MOTIVATION
The World Wide Web is the largest public database of documents and is still growing. It contains a vast amount of information. Unfortunately, most of it is stored in text files that are designed for human use. These files usually do not provide structured machine readable information that can directly be used for further processing. Therefore, finding specific pieces of information is quite difficult. To overcome these problems, search engines have been developed since the 1990s. While different search algorithms and indexing methods (e.g. PageRank [1]) were developed, the interface has basically stayed the same: The user has to enter some keywords – possibly connected with Boolean operators or formulated as regular expressions – by which the search is performed. The matching web sites are returned as a ranked list of documents. In the following, we present an approach that has been developed as part of our research on this topic: the automatic categorization of documents based on user defined contexts and categories.
RELATED WORK
Adapting to a user always requires some kind of user profile, in which additional information about the user is stored. Most studies on creating user profiles have been done in the field of intelligent agents. The main goal here is the user specific adaptation of web content and the filtering of relevant from non-relevant information. These agents, often called profiling agents, try to identify user interests based on, among other things, the web sites viewed and the links followed by the user (see, e.g., [2]). A further method to obtain information about the user's interests is relevance feedback, i.e., the user marks the results of a query as relevant or non-relevant. This information can then be used to improve the original query. Methods following this idea are mainly based on Rocchio's method [3], which adapts term weights based on the feedback, or on the method proposed in [4] for modifying term weights in probabilistic models. A model that uses relevance feedback to create a user profile is presented in [5]. It uses a Bayesian network for learning the user profile. The method presented in [6] tries to learn categories of interest for a user with incremental clustering. Splitting result sets of documents into groups of similar documents is also done in [7]. An integrated tool that provides a visualization of a text document or image collection by a clustering process was presented in [8]. This tool allows the reassignment of documents to different clusters and utilizes this user feedback in order to compute a user specific similarity measure, which can be interpreted to gain some insight into the user's interests. The search engine Teoma provides keywords describing different topics for query refinement. WebWatcher [9] is a system that supports the user when navigating the web, e.g. by highlighting potentially interesting links. Among other techniques, reinforcement learning methods are applied.
One main problem of all these approaches is that the user has no direct influence on the information stored in his or her profile. Furthermore, most of the available methods can only be used to support the user during one search session for specific information, but fail if the user switches the search context. In order to avoid these problems and to design an interface that can support context sensitive searching, we restricted ourselves – in the work presented in this paper – to the development of an interface that automatically classifies the web pages retrieved by a search engine with respect to user specific categories. The underlying ideas are presented in the following.
IMPROVING THE SEARCH – 'INTELLIGENT BOOKMARKS'
The main idea is to use user and/or search specific features to adapt the presentation of results to the user's interests. For this, we assume that a user searches in different contexts. Some queries are related to his or her work, others to private matters like planning the next vacation. Each context is described by the queries performed in it. In addition, each context is split into different user defined categories. The interface provides methods for the user to store valuable results in category folders, either permanently or just for the search session. Thus the user is able to store links in a more structured way than is possible with the bookmark concept of current web browsers. These category folders describe groups of web sites that implicitly define the categories for the support system of the search interface. The more pages are assigned to a category, the better the system "knows" the category. The search contexts with their categories can be used to annotate or structure the search results. Once retrieved documents are stored in the folders, a classifier can be learned and the support system can predict the n best fitting categories of the current search context for each document in the search result set. If the user is only interested in results of a certain category – which should be true most of the time – he or she can hide other documents from view with the help of the predicted categories. This is especially useful when searching for a very specific topic, in contrast to a very popular area of interest. In this case, the information needed can only be found on very few pages on the internet. If the keywords also match sites in other areas, it is usually hard or almost impossible to find the few interesting pages, because their rank in a result list might be very low. Using category information to filter the data could help to find these pages, as the others belong to another category (or to none). If the query and its results do not fit the current context at all, or if another context fits better, the system can suggest switching to that context. The user can also join and split contexts, if he or she accidentally created more than one context about the same topic or if one context has grown too large. Another possibility to structure the results is finding groups of similar documents without looking at categories or contexts. This is especially useful when a user searches in a rather new context, where categories are not well defined yet. If the user finds an interesting document, he or she can choose to view only documents that are similar to it.
A SEARCH SYSTEM FOR CONTEXT BASED CATEGORIZATION
In the work presented in this paper, we focus on the design and functionality of an interface for searching the web. Our Interface is connected to an existing Search Engine by web services, which are used to exchange queries and result sets. In the current implementation we use Google; however, the Search Engine can easily be replaced. Once a query is submitted, the results are sent to the Classifier and optionally to the Clusterer, which both annotate the results by adding category or group labels. The Classifier is trained with the categories and their corresponding web sites belonging to the current User Profile as described above. The User Profile is defined by the user when he or she alters contexts and categories and assigns documents to them. The Clusterer provides an opportunity to group the results with an unsupervised clustering algorithm, without the need for the user to provide information about his or her interests. To be able to deal with natural language text, a Preprocessor is needed that maps the web sites into a representation that can be used for further processing by the Classifier and Clusterer. In Figure 1, an overview of the system architecture is given.

Figure 1: System Structure (Interface, Search Engine, Preprocessor, Classifier, Clusterer, and User Profile folders).

In the following, we present the individual components of the system in more detail.

THE INTERFACE
A screen shot of the interface of our system is shown in Figure 2. The interface basically consists of three parts: the query area, the contexts and categories area, and the result list area. The query area can be found on the upper left side. Here, the user can enter his or her queries or can scan through the results from the current or previous queries by selecting a specific result list.
As all previous queries performed by the user and the corresponding results are stored by the interface, the user is able to switch arbitrarily between his or her results. This is based on the idea that searching the web for a specific piece of information usually involves a series of queries until the sought information is found. In current search engines, results from earlier queries get lost as they are overwritten by the new results. Although the user can retrieve them using the browser functionality of returning to the previous page (back button), this only works for a linear sequence of queries. When the search process becomes a tree structure, e.g. by refining a query in several ways, results get lost. The contexts and categories area is located underneath the query area. Here, the contexts and categories of the user are presented, which define his or her user profile. The user can browse through it, create and delete contexts and categories in a hierarchical tree structure, assign single web sites to certain categories, and view the web sites contained in each category. There is always one context selected as the current one. Based on this selection, search results are categorized. The result list area is on the right side. Here, a list of web sites is presented. This can either be a list of search results or a list of web sites contained in a selected category. Each web site is represented by its title, its hyperlink, and a snippet, as in standard search engines. In addition, the n best fitting categories from the current context, as determined by the trained classifier, are presented for each search result. If desired by the user, it is also possible to display categories found by automated clustering over the search results, independent of user defined categories. For this, automatically created labels are displayed to give the user some idea of the content of the clustered elements. By selecting certain categories or clusters, the user can filter the displayed results. Documents belonging to different categories will then be hidden.

THE USER PROFILE
The user profile consists of the user defined contexts, their categories, and the documents assigned to each category. This can be seen as a folder structure, in which contexts are the upper level folders, categories are subfolders, and the documents are files in the subfolders. This hierarchical folder structure is currently created and maintained completely and exclusively by the user with the interface, which gives him or her total control over the profile. However, adding documents to categories can also be seen as a kind of relevance feedback, as the user only assigns documents to categories if they are relevant to him or her. This implicit user behavior is exploited by the support system in order to train the document classifier. We also assume that documents not assigned to a category are not necessarily irrelevant. Therefore, the classifier is not used as a filter but to annotate result sets. Documents that do not fit into a given class are marked as such.
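To make the profile structure more concrete, the following minimal Python sketch shows one possible in-memory representation of contexts, categories, and assigned documents. All class and field names are our own illustration, not taken from the actual implementation:

from dataclasses import dataclass, field

@dataclass
class Document:
    title: str
    url: str

@dataclass
class Category:
    name: str
    documents: list = field(default_factory=list)   # assigned web sites, i.e. training examples

@dataclass
class Context:
    name: str
    queries: list = field(default_factory=list)     # queries performed in this context
    categories: dict = field(default_factory=dict)  # category name -> Category

@dataclass
class UserProfile:
    contexts: dict = field(default_factory=dict)    # context name -> Context
    current: str = ""                               # name of the currently selected context

# Assigning a search result to a category folder of the current context
# yields a labeled training example for that context's classifier:
profile = UserProfile()
profile.contexts["vacation"] = Context("vacation")
profile.current = "vacation"
cats = profile.contexts[profile.current].categories
cats.setdefault("hotels", Category("hotels")).documents.append(
    Document("Hotel example page", "http://www.example.org/hotel"))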
Figure 2: Screenshot of the Search Interface.
THE PREPROCESSOR
Since we are dealing with natural language text, the documents have to be preprocessed in order to store their information in a data structure more appropriate for further use. The currently predominant approaches for this are the vector space model [10], the probabilistic model [11], the logical model [12], and the Bayesian net model [13]. Despite its simple data structure without any explicit semantic information, the vector space model enables very efficient analysis of huge document collections and is therefore still used in most of the currently available document retrieval systems, including ours. For the representation of the documents, we use a tf×idf representation ([14], [15]). This means that each element in the vector corresponds to a term i in the document, while the size of the vector is defined by the number of words n occurring in the considered document collection (dictionary). The weights of the elements depend on the term frequency tf_{i,d} and the inverse document frequency idf_i of the corresponding term. In our implementation, the weight w_{i,d} for the term i in the document vector of document d is calculated as follows (with N being the total number of documents and df_i being the document frequency of i):

$$ w_{i,d} = \frac{w'_{i,d}}{\sqrt{\sum_{j=1}^{n} (w'_{j,d})^{2}}} \quad \text{with} \quad w'_{i,d} = tf_{i,d} \cdot \log\left(\frac{N+1}{df_i}\right) $$
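For illustration, this weighting scheme can be implemented directly. The following minimal Python sketch (our own, not the system's code) computes unit-normalized tf×idf vectors for a small collection, including a toy stop word list (term selection by stop word filtering is discussed below):

import math
from collections import Counter

STOP_WORDS = {"is", "and", "the", "a", "of"}   # toy list for illustration

def tokenize(text):
    # simplistic tokenizer with stop word filtering; no stemming here
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def tfidf_vectors(documents):
    tokenized = [Counter(tokenize(d)) for d in documents]
    n = len(documents)                       # N, total number of documents
    df = Counter()
    for tf in tokenized:
        df.update(tf.keys())                 # df_i, document frequency of term i
    vectors = []
    for tf in tokenized:
        # raw weight w'_{i,d} = tf_{i,d} * log((N+1)/df_i), then length normalization
        w = {t: f * math.log((n + 1) / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(v * v for v in w.values()))
        vectors.append({t: v / norm for t, v in w.items()} if norm else w)
    return vectors

docs = ["the weather network forecast", "sports network scoreboard",
        "weather forecast and maps"]
print(tfidf_vectors(docs)[0])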
Before vectors can be created, one has to decide which terms will be used to define the vector space. Usually, it is not reasonable to transfer every occurring term into the model. A high vector space dimensionality can have negative effects on computing time and classification quality. In order to reduce the number of terms used for indexing, term selection and term extraction methods can be used. Term selection methods select a subset of – statistically or semantically relevant – terms for further use. The most widely known term selection method is stop word filtering [16]. It removes all words that carry little information about the contents of the document, e.g. terms like 'is' and 'and'. Other term selection methods based on statistics and information theory are discussed in [17]. Term extraction methods generate new, usually more general terms for the model. Thus several syntactically different terms are
mapped to a single term, which represents a semantically more general representation. Very prominent is stemming (see e.g. [18]), which reduces each word to its stem, e.g. 'going' and 'goes' are mapped to 'go'. Other prominent term extraction methods are term clustering ([17], [19]) and latent semantic indexing [20]. In the current implementation, some simple term selection and term extraction methods are used in order to reduce the dimensionality of the document vectors and thus the complexity of the classification algorithm. Term selection is done by stop word filtering. In addition, all words that occur less than two times in the document collection are removed, as they are not statistically relevant. Words including numbers are also filtered out. Term extraction is done by standard stemming algorithms for English and German documents. Indexing methods provide more efficient access to a document collection if information about specific terms is needed, such as in which documents a term can be found or how often it occurs in a document or in the complete collection. Different indexing methods have been proposed, such as inverted files, suffix trees, and signature files [21]. For our purposes, an inverted file is a good choice, since it allows efficient computation of a vector space representation of the documents, which is required for the classification and clustering methods used. Here, a dictionary is created containing each term occurring in the document collection. To each term a list of occurrences is assigned, defining the documents in which the term occurs and the exact positions of the term in these documents. We currently use the Lucene indexer implementation provided by the Apache Jakarta project.

THE CLASSIFIER
The classifier classifies the documents of a result set according to the predefined categories of the current search context. This is a supervised learning problem: the categories are our classes, and the documents assigned to each category serve as learning examples. For each context, we can learn a classifier based on the sample documents in all of its categories. In our current implementation, we are using a Naïve Bayes classifier ([22], [23]) that classifies the documents based on term frequencies. Naïve Bayes classifiers are an old and well-known type of classifier. They classify data using a probabilistic approach, i.e., they try to compute conditional class probabilities and then predict the most probable class. The classifier is "naïve" because it assumes that the probabilities concerning different features are mutually independent. In our scenario, the documents d are represented as a collection of words t_1, …, t_n. This means that we assume that the occurrence of a word in a document is independent of the occurrences of other words in the document. Although this assumption is violated in natural language text, the classifier works well for this kind of classification problem [24]. Using the Naïve Bayes equations, we can state our problem as follows, with the naïve assumption applied in the last step:

$$ P(cat \mid d) = P(cat \mid t_1 \wedge \ldots \wedge t_n) = \frac{P(t_1 \wedge \ldots \wedge t_n \mid cat) \cdot P(cat)}{P(t_1 \wedge \ldots \wedge t_n)} = \frac{\prod_{i=1}^{n} P(t_i \mid cat) \cdot P(cat)}{P(t_1 \wedge \ldots \wedge t_n)} $$

For each category cat, the probability that the document d belongs to that category is calculated. The category with the highest probability is then returned. The probabilities P(t_i | cat) and P(cat) are estimated by the relative frequencies in the learning examples. P(t_1 ∧ … ∧ t_n) can be seen as a normalizing factor over all categories that makes the probabilities sum up to one.
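A minimal sketch of this classification rule over term frequencies might look as follows. This is our own illustration; in particular, the Laplace smoothing used to avoid zero probabilities is our assumption, as the paper does not state how the estimates are smoothed:

import math
from collections import Counter

def train(folders):
    """folders: {category: [token list per assigned document]}"""
    n_docs = sum(len(docs) for docs in folders.values())
    vocab = {t for docs in folders.values() for d in docs for t in d}
    priors, term_probs = {}, {}
    for cat, docs in folders.items():
        priors[cat] = len(docs) / n_docs          # P(cat) as relative frequency
        counts = Counter(t for d in docs for t in d)
        total = sum(counts.values())
        # P(t|cat) from term frequencies, with Laplace smoothing (our assumption)
        term_probs[cat] = {t: (counts[t] + 1) / (total + len(vocab)) for t in vocab}
    return priors, term_probs

def classify(tokens, priors, term_probs):
    scores = {}
    for cat in priors:
        # log space avoids underflow; terms outside the vocabulary are skipped
        scores[cat] = math.log(priors[cat]) + sum(
            math.log(term_probs[cat][t]) for t in tokens if t in term_probs[cat])
    return max(scores, key=scores.get)            # the best fitting category

folders = {"science": [["lab", "research"], ["museum", "science"]],
           "sports": [["tennis", "scoreboard"]]}
print(classify(["science", "museum"], *train(folders)))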
THE CLUSTERER
In contrast to the previous problem, finding groups of similar documents without categories is an unsupervised setting. Therefore, other learning methods – so-called clustering methods – have to be used. Unfortunately, standard clustering methods are usually computationally complex. As our setting requires online clustering in real time, so that the user does not notice the clustering process through long response times, most standard methods are not appropriate for this type of application. Incremental clustering methods are an interesting alternative: they access each data item only once during the clustering process and are thus quite fast. However, this usually results in decreased cluster quality. Besides, the order in which the data elements are presented also influences the final cluster structure.
In our first implementation, we decided to use the Doubling Algorithm as described in [25], due to the encouraging theoretical analysis presented in that work. A pseudocode description of the algorithm is shown in Figure 3. The algorithm has three parameters: α = e/(e−1) and β = e are used for controlling the merging and creation of clusters, and k is the maximum number of clusters that should be found. The cluster generation is mainly influenced by a threshold value for the cluster diameter. This threshold decides which clusters should be merged and whether a data element is added to an existing cluster or a new cluster is opened.

DoublingAlgorithm(α, β, k)
1. Let d be the diameter threshold.
2. Initialize the (k+1) clusters:
   a. Cluster centers are set to the first (k+1) data elements.
   b. Each cluster gets assigned the corresponding data element.
3. Set d = r · min_{i,j,i≠j}(dist(c_i, c_j)), where r is a random number from [1/e; 1] drawn according to the probability density function 1/x.
4. Loop while the number of clusters is k+1:
   a. Merging:
      i. d = β·d
      ii. Merge clusters whose centers are closer together than d.
   b. Updating: loop while the number of clusters is less than k+1 and there are still new data elements:
      i. Determine the cluster center cc that is closest to the new data element e.
      ii. If dist(cc, e) ≤ α·d: assign e to cc. Else: create a new cluster with e as center.

Figure 3: Doubling Algorithm.
At the beginning of the algorithm, the k clusters plus an additional one are initialized with the first (k+1) data elements. The threshold value is set based on the minimum distance between all cluster centers. Then, the main loop of the algorithm starts. It consists of two steps, merging and updating. Merging tries to reduce the number of clusters by merging all clusters that are closer together than the current threshold value. This means that after the merging step the number of clusters can be anywhere between 1 and k+1. For more details on the merging step and on the rest of the algorithm see [25]. In the update step, the algorithm handles new data points by either adding them to an existing cluster or opening a new cluster. Data points are added until the number of clusters first exceeds k. When the algorithm stops, all data points are assigned to k or fewer clusters. The Doubling Algorithm requires a distance measure between the data elements and assumes that distances can take values anywhere in the interval [0, ∞). In our application, we need to determine the distance between tf×idf vectors. This is usually done via the cosine similarity, which ranges between 0 and 1. Therefore, a transformation between these two intervals has to be performed. This is achieved by the following equation, as proposed in [6]:

$$ dist(v_1, v_2) = \frac{1}{cosSim(v_1, v_2)} - 1 $$
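For illustration, the following Python sketch combines the pseudocode of Figure 3 with this distance. It is our own simplified rendering; in particular, the merging step here simply keeps one representative center per merged group, whereas [25] treats this step in more detail:

import math
import random

def cos_sim(v1, v2):
    # vectors are dicts term -> weight, assumed unit length (normalized tf-idf)
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    return max(dot, 1e-9)          # clamp so dist stays finite for disjoint vectors

def dist(v1, v2):
    # maps cosine similarity in (0, 1] to a distance in [0, inf), cf. [6]
    return 1.0 / cos_sim(v1, v2) - 1.0

def doubling(vectors, k, alpha=math.e / (math.e - 1), beta=math.e):
    # clusters as (center, members) pairs; step 2: first k+1 data elements
    clusters = [(v, [v]) for v in vectors[:k + 1]]
    # step 3: d = r * minimal pairwise center distance, r ~ density 1/x on [1/e, 1]
    d_min = min(dist(a[0], b[0])
                for i, a in enumerate(clusters) for b in clusters[i + 1:])
    d = math.exp(random.uniform(-1.0, 0.0)) * d_min
    rest = list(vectors[k + 1:])
    while len(clusters) == k + 1:
        d *= beta                                  # merging phase: grow threshold
        merged = []
        for center, members in clusters:
            for m_center, m_members in merged:
                if dist(center, m_center) < d:     # centers closer than d: merge
                    m_members.extend(members)
                    break
            else:
                merged.append((center, list(members)))
        clusters = merged
        # update phase: place new elements until k+1 clusters exist again
        while len(clusters) < k + 1 and rest:
            e = rest.pop(0)
            center, members = min(clusters, key=lambda c: dist(c[0], e))
            if dist(center, e) <= alpha * d:
                members.append(e)                  # close enough: assign to cluster
            else:
                clusters.append((e, [e]))          # too far: open a new cluster
    return clusters

vecs = [{"weather": 1.0}, {"weather": 0.9, "map": 0.44},
        {"tennis": 1.0}, {"tennis": 0.8, "score": 0.6}]
for center, members in doubling(vecs, k=2):
    print(len(members), "element(s), center terms:", sorted(center))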
After all documents are clustered, labels are automatically assigned to the clusters. These labels should help the user when browsing through the results by giving him or her an idea of the content of the documents in each cluster. To create the labels, we select keywords that describe a cluster, as proposed in [26]. This approach assumes that a good label term t should be prominent within the cluster compared to the other words in the cluster, as well as prominent in the cluster compared to its occurrence in the whole collection. The first characteristic can be described by the relative frequency F_j(t) of t in a certain cluster j. For the second characteristic, this frequency can be set in relation to the "background frequency", i.e. the sum of the relative frequencies of t over all clusters i. Therefore, the quality (goodness) of a term t as a label can be determined by:

$$ G(t, j) = G_{cluster}(t, j) \cdot G_{collection}(t, j) = F_j(t) \cdot \frac{F_j(t)}{\sum_i F_i(t)} $$
In our implementation, we determine the three best terms for each cluster by this equation, which are then used as cluster labels.
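This selection can be sketched as follows. The code is our own illustration of the scoring above, with F_j(t) computed as the relative frequency of t among all term occurrences in cluster j:

from collections import Counter

def relative_freqs(cluster_docs):
    # cluster_docs: list of token lists; returns F_j(t) for every term t
    counts = Counter(t for doc in cluster_docs for t in doc)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def labels(clusters, n_labels=3):
    # clusters: {cluster_id: [token lists]}; returns the n best label terms each
    freqs = {j: relative_freqs(docs) for j, docs in clusters.items()}
    result = {}
    for j, f_j in freqs.items():
        def goodness(t):
            # G(t, j) = F_j(t) * F_j(t) / sum_i F_i(t)
            background = sum(f.get(t, 0.0) for f in freqs.values())
            return f_j[t] * f_j[t] / background
        result[j] = sorted(f_j, key=goodness, reverse=True)[:n_labels]
    return result

clusters = {0: [["weather", "forecast", "map"], ["weather", "pollen"]],
            1: [["tennis", "scoreboard"], ["sport", "tennis"]]}
print(labels(clusters))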
DISCUSSION
Our main goal of the work up to this point was the implementation of a system for searching the web that integrates different aspects of information retrieval to improve the search for the user. This system is meant to be a starting point, providing basic functionality for all aspects. Each aspect can then be analyzed more thoroughly and improved within the scope of the whole system. The main parts here are the interface, the user profile, classification, and clustering. As a good user interface is supposed to support the user in performing his or her task – in our case searching the web – the best evaluation for the interface itself is a user study. However, this was not done at this point, since some further refinements of the interface are still necessary. Nevertheless, some aspects of our work are based on user studies already performed elsewhere: In [27], different methods of presenting category information are shown and compared to the standard list interface, especially in terms of tracked search times but also by user ratings. This strongly suggests that using categories improves the search, and it gives hints about which presentation of search results might be the best. Therefore, as a first step, our system presents the determined category information integrated in the list interface to provide further information about the content of each web site. However, a flexible interface can use this information to allow the user to reorganize the presentation of search results, e.g. by grouping the results by category name or by filtering the view on the results. The classifier provides categorical information based on the analysis of documents implicitly classified by the user. A screenshot of the integration of this technique in the interface is shown in Figure 2. Studies on the performance of the applied naïve Bayes classification for text can be found in, e.g., [24] and [28]. In order to give an impression of the performance of the Doubling Algorithm and the automatic labeling method, we applied them to a set of web sites. For this purpose, we entered the keyword "network" in Google and took the first three result pages (30 hits) as our data set. Some of these pages could not be parsed by the HTML parser and were therefore left out. Table I shows an overview of the used pages as presented by Google, and Table II shows the learned clusters. As we use stemming, the computed labels are also stemmed. Although they are not perfect, they seem to be quite good at giving an idea of the web sites' contents. The clusters themselves provide a good separation of the data, e.g. the science links were put together in cluster 5 and the weather link was separated into cluster 4. The links concerning environmental issues are all found in cluster 3. Nevertheless, this cluster also contains web sites that do not fit well to the class labels. A hierarchical search strategy could relax this problem: Once the user has selected one category, the system determines new sub clusters for further separation. "Wrong" cluster assignments could be due to the fact that some of the used web sites rely heavily on scripting, leaving few words for the parser to find. This is a general problem that makes the grouping and categorizing of web sites difficult. Besides the quality of clustering and labeling, the run time is a very important issue. For a test run with 150 documents, clustering took less than a second; labeling only takes a few milliseconds. So in terms of run time, the Doubling Algorithm is very suitable for the problem at hand.
CONCLUSIONS
In this paper we have described the structure and the algorithms used in an adaptive interface for web searching. The interface supports a user by providing techniques to store and maintain his or her (intermediate) search results in a structured way. Furthermore, this structural information provided by the user is used to classify the results of subsequent search steps. Thus, the interface provides additional user specific information about the retrieved web pages in order to support the user in his or her search process. In addition, the system can also determine clusters of documents without any knowledge about the user's interests and label these groups automatically. Even though we have not yet performed appropriate user studies, the results so far are very encouraging. Therefore, we will continue our work in the proposed direction.
#   Web Site (title and URL)
1   Your homepage for domain name registration, web site design and ... – www.networksolutions.com/
2   CNN.com – www.cnn.com/
3   Welcome to MSN.com – www.msn.com/
4   Rainforest Action Network – www.ran.org/
5   Fedworld Homepage – www.fedworld.gov/
6   The EnviroLink Network – www.envirolink.org/
7   Environmental News Network – ENN.com – www.enn.com/
8   Jumbo: Free & Shareware MP3 files, Games, Screen Savers & Computer ... – www.jumbo.com/
9   CNET.com – www.cnet.com/
10  MadSciNet: The 24-hour exploding laboratory. – www.madsci.org/
11  The Sports Network – www.sportsnetwork.com/
12  American Civil Liberties Union – www.aclu.org/
13  NationJob, Employment Job Search Engine & Careers – www.nationjob.com/
14  Science Learning Network: Home – www.sln.org/
15  Science Learning Network: Home – www.sln.org/ (duplicate hit)
16  Gospelcom.net :: Home – www.gospelcom.net/
17  ::: Welcome to FOX.com ::: – www.fox.com/
18  The Weather Network – www.theweathernetwork.com/
19  Make a WHOIS search on any domain on the Web | Network Solutions – www.networksolutions.com/cgi-bin/whois/whois
20  scholastic.com – www.scholastic.com/
21  Network Computing | Evaluating Enterprise Technology | For IT By ... – www.networkcomputing.com/

Table I: Used Web Sites
#   Label                            Web Sites
1   domain, solut, transfer          1, 19
2   compani                          17
3   environment, digit, envirolink   7, 6, 8, 9, 13, 16, 4
4   forecast, weather, pollen        18
5   scienc, resourc, madsci          10, 14, 15
6   sport, tenni, scoreboard         11, 2, 3
7   govern, technolog, platform      12, 20, 21, 5

Table II: Clusters Computed by the Doubling Algorithm
ACKNOWLEDGEMENTS
The work presented in this article was supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG), project number NU 131/1-1, and the European Network of Excellence on Intelligent Technologies for Smart Adaptive Systems (EUNITE).
REFERENCES
[1] Page, L.; Brin, S.; Motwani, R.; Winograd, T., 1998, "The PageRank Citation Ranking: Bringing Order to the Web", Stanford Digital Library Technologies Project, Stanford University.
[2] Klusch, M., 1999, "Intelligent Information Agents", Springer Verlag, Berlin.
[3] Rocchio, J. J., 1971, "Relevance Feedback in Information Retrieval", In: Salton, G., "The SMART Retrieval System", pp. 313-323, Prentice Hall, Englewood Cliffs, NJ.
[4] Robertson, S. E.; Spärck Jones, K., 1976, "Relevance Weighting of Search Terms", Journal of the American Society for Information Science, 27(3), pp. 129-146.
[5] Wong, S. K. M.; Butz, C. J., 2000, "A Bayesian Approach to User Profiling in Information Retrieval", Technology Letters, 4(1), pp. 50-56.
[6] Somlo, G. S.; Howe, A. E., 2001, "Incremental Clustering for Profile Maintenance in Information Gathering Web Agents", In: Proc. of the 5th International Conference on Autonomous Agents (AGENTS '01), pp. 262-269, ACM Press.
[7] Roussinov, D. G.; Chen, H., 2001, "Information Navigation on the Web by Clustering and Summarizing Query Results", Information Processing & Management, 37(6), pp. 789-816.
[8] Nürnberger, A.; Detyniecki, M., 2003, "Weighted Self-Organizing Maps: Incorporating User Feedback", In: Artificial Neural Networks and Neural Information Processing – ICANN/ICONIP 2003, Proc. of the joint 13th International Conference, LNCS Vol. 2714, pp. 883-890, Springer-Verlag.
[9] Joachims, T.; Freitag, D.; Mitchell, T. M., 1997, "WebWatcher: A Tour Guide for the World Wide Web", In: Proc. of the International Joint Conference on Artificial Intelligence (IJCAI 97), pp. 770-777, Morgan Kaufmann Publishers, San Francisco, USA.
[10] Salton, G.; Wong, A.; Yang, C. S., 1975, "A vector space model for automatic indexing", Communications of the ACM, 18(11), pp. 613-620.
[11] Robertson, S. E., 1977, "The probability ranking principle", Journal of Documentation, 33, pp. 294-304.
[12] Rijsbergen, C. J. van, 1986, "A non-classical logic for Information Retrieval", The Computer Journal, 29(6), pp. 481-485.
[13] Turtle, H.; Croft, W., 1990, "Inference Networks for Document Retrieval", In: Proc. of the 13th Int. Conf. on Research and Development in Information Retrieval, pp. 1-24, ACM, New York.
[14] Salton, G.; Buckley, C., 1988, "Term Weighting Approaches in Automatic Text Retrieval", Information Processing & Management, 24(5), pp. 513-523.
[15] Salton, G.; Allan, J.; Buckley, C., 1994, "Automatic structuring and retrieval of large text files", Communications of the ACM, 37(2), pp. 97-108.
[16] Frakes, W. B.; Baeza-Yates, R., 1992, "Information Retrieval: Data Structures & Algorithms", Prentice Hall, New Jersey.
[17] Sebastiani, F., 2002, "Machine learning in automated text categorization", ACM Computing Surveys (CSUR), 34(1), pp. 1-47.
[18] Porter, M., 1980, "An algorithm for suffix stripping", Program, 14(3), pp. 130-137.
[19] Dhillon, I.; Mallela, S.; Kumar, R., 2002, "Enhanced word clustering for hierarchical text classification", In: Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), ACM Press.
[20] Deerwester, S. C.; Dumais, S. T.; Landauer, T. K.; Furnas, G. W.; Harshman, R. A., 1990, "Indexing by latent semantic analysis", Journal of the American Society for Information Science, 41(6), pp. 391-407.
[21] Baeza-Yates, R.; Ribeiro-Neto, B., 1999, "Modern Information Retrieval", Addison-Wesley.
[22] Duda, R.; Hart, P., 1973, "Pattern Classification and Scene Analysis", John Wiley & Sons, New York.
[23] Good, I., 1965, "The Estimation of Probabilities: An Essay on Modern Bayesian Methods", MIT Press, Cambridge, MA, USA.
[24] Mitchell, T. M., 1997, "Machine Learning", McGraw-Hill International Editions.
[25] Charikar, M.; Chekuri, C.; Feder, T.; Motwani, R., 1997, "Incremental Clustering and Dynamic Information Retrieval", In: Proc. of the 29th Symposium on Theory of Computing, pp. 626-635.
[26] Lagus, K.; Kaski, S., 1999, "Keyword Selection Method for Characterizing Text Document Maps", In: Proc. of the International Conference on Artificial Neural Networks, Vol. 1, pp. 371-376.
[27] Dumais, S.; Cutrell, E.; Chen, H., 2001, "Optimizing Search by Showing Results in Context", In: Proc. of CHI 2001, pp. 277-284.
[28] Joachims, T., 1996, "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization", CS Technical Report CMU-CS-96-188, Carnegie Mellon University.