Hierarchical Classification of Web Search Results Using Personalized Ontologies
Amrish Singh and Keiichi Nakata
School of Information Technology, International University in Germany
Campus 2, D-76646 Bruchsal, Germany
[email protected],
[email protected]
Abstract In this paper, we propose an approach to presenting web search results that supports personalization, taking into consideration users’ perspectives. We developed a post-retrieval algorithm which uses document classification techniques to organize search results into a meaningful hierarchy of topics, based on the perspective of the user performing the search, represented as a taxonomic ontology. A demonstration system called WEBCLUSTERS was implemented to interface with a number of existing search engines to retrieve search results for keyword queries and present them in user defined topic hierarchies. The classification of retrieved documents to a concept in an ontology is based on the multinomial variant of the naïve Bayes classifier. Experiments were performed to measure the accuracy of the system in organizing documents based on user-defined ontologies. Accuracy was measured by estimating the degree of correlation between two different ontologies which described the same information domain. Experiments show that where the retrieved documents are heterogeneous in nature, the WEBCLUSTERS system is capable of classifying search results into the user specified hierarchy with a reasonable level of accuracy.
1 Introduction
Web-based search engines enable fast and easy access to documents that relate to users’ information needs. These engines present search results in a rank-ordered list, where the rank of each document is determined by its relevance to the corresponding query. This relevance measure depends predominantly on the users’ ability to describe their information need suitably as query text. Unfortunately, most queries are short, and unconscious assumptions are made regarding the context of query terms, making the query ambiguous or vague. This leads to low precision in the retrieved results, and users are forced to sort through the list manually to find relevant documents. This would be unproblematic if users could easily separate irrelevant documents from relevant ones. However, the ranked-list presentation used by most search engines does not make this easy. Document ranking becomes virtually obsolete when documents lower in the list are more relevant to the user than those with higher ranks. Therefore, a major challenge for efficient web search is to make search results helpful to the user even if the query is poorly formulated.

Search Result Clustering (SRC) has often been explored as a solution to this problem. Online SRC systems like vivisimo.com¹ and kartoo.com² present search results by contextually organizing similar web documents into groups (or clusters), and labeling each group with a characteristic word or phrase. This alleviates the problem with ranked ordering by enabling users to find relevant documents more easily. Existing online clustering engines, however, still suffer from two basic drawbacks. First, the generated document clustering is objective in nature and does not consider the user’s perspective.
This forces users to look at the set of results from different perspectives (e.g., a technical perspective versus a business perspective), which sometimes causes difficulty, or may not make any sense to the user (Hotho et al., 2002). Second, the automatically generated keywords or key phrases for each cluster often tend to be vague, ambiguous and unrepresentative of the documents in the cluster (Hotho et al., 2002).

In this paper, we propose an approach to presenting web search results that supports personalization, taking into consideration users’ perspectives. We developed the OHC (Ontology-based Hierarchical Classification) algorithm: a post-retrieval algorithm which uses document classification techniques to organize search results into a meaningful hierarchy of topics based on the perspective of the user performing the search. User perspectives are represented by ontologies. Ontologies provide an explicit structured representation and organization of concepts which users apply in making sense of the information space. This formal representation replaces the cluster hypothesis of SRC systems: in our approach, a cluster is a set of similar documents that can be attributed to a particular concept in an ontology, and each cluster is labeled with that concept. The hierarchical nature of the ontology also provides for a natural hierarchical organization of the search results.

Personalized user-specific search result classification is achieved by allowing users to provide personal ontologies while performing a web search. Different ontologies produce completely different hierarchical organizations of web search results for the same query, thereby overcoming the “one-size-fits-all” paradigm (all users get the same result for the same query) of traditional document classification and web search systems. In addition, the use of large-scale personalized ontologies like the dmoz³ and Yahoo! web directories⁴ was explored. These categorical ontologies are created by thousands of individuals working together to define information domains and the respective concepts underlying each domain. This collaborative nature allows for producing detailed ontologies which provide comprehensive coverage of the domain and lead to a concise and general classification of search results.

The main aim of this work is to enhance the presentation of web search results rather than to improve the actual process of information retrieval. A demonstration system called WEBCLUSTERS was implemented to interface with a number of existing search engines to retrieve search results for keyword queries and present them in user-defined topic hierarchies. In short, we created a search interface with powerful visual information overlays to ease and enhance the browsing of search results arranged in a hierarchical structure.

¹ http://www.vivisimo.com
² http://www.kartoo.com
2 Background
2.1 Document Organization
Dumais and Chen highlight the use of structural information, document clustering and document classification as the three general techniques employed in organizing documents into thematic categories (Chen & Dumais, 2000; Dumais & Chen, 2000). Structural information refers to the special properties or metadata associated with each document. Structural information can be gathered, for example, by analyzing the link structure of the retrieved web search results (Chen et al., 1999). Kleinberg (1999) exploits the link structure of web pages by analyzing the collection of pages relevant to a broad search topic, and discovers the most authoritative pages on the search topic. The retrieved search results are then organized into groups where each group is identified by an authoritative source. One of the principal ideas behind the utilization of structural information is hyper-information (Marchiori, 1997). Rather than analyzing the text in the retrieved documents, document organization is achieved by analyzing the links in and between the retrieved documents. Organizing documents in this manner has two consequences. First, the document groups produced by such systems are often found to be obscure and difficult to understand. Second, applying it to large-scale applications can be difficult, since some of the systems require pre-retrieval calculation of the entire link structure, which does not scale up to the WWW’s document corpus.

The second approach, document clustering, organizes documents by placing them into groups based on their overall similarity to one another. It is typically an unsupervised learning task where unlabeled documents are categorized into unknown and unpredicted categories automatically, i.e., neither the documents nor the labels of each cluster (or group) are known prior to the clustering task. This premise makes document clustering an ideal approach to organizing search results.
In contrast to using structural information, similarity measures in document clustering systems are based on the text of the provided documents. Usually, a document is represented as a collection of words (both ordered and unordered collections have been used), and the similarity between two documents is determined by comparing their word collections. These similarity measures, coupled with the clustering algorithms, form the basis of any document clustering system. To generate hierarchical clusters, Agglomerative Hierarchical Clustering (AHC) algorithms are often used (Zamir & Etzioni, 1998; Zamir & Etzioni, 1999). These algorithms follow a bottom-up approach to organize documents into hierarchical clusters. Although such clustering systems have proved to be quite useful, current online clustering engines, which label clusters with shared phrases, are often unable to produce meaningful phrases, leaving the user to discern the contents of the cluster.

³ http://www.dmoz.org
⁴ http://www.yahoo.com
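The comparison of word collections described above for document clustering can be illustrated with a minimal sketch. It assumes an unordered bag-of-words representation and uses cosine similarity, which is one common choice and is not named in this paper; the function names `bag_of_words` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Represent a document as an unordered collection of lower-cased words."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-frequency vectors (assumed measure)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = bag_of_words("java virtual machine bytecode")
doc2 = bag_of_words("java compiler emits bytecode")
doc3 = bag_of_words("linux kernel scheduler")
similar = cosine_similarity(doc1, doc2)    # two shared words out of four each
unrelated = cosine_similarity(doc1, doc3)  # no shared words
```

A clustering algorithm such as AHC would repeatedly merge the pair of documents (or clusters) with the highest similarity under a measure of this kind.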
The third approach to document organization is classification. In contrast to clustering, document classification is a technique for assigning documents to predefined categories or classes. It is usually performed in two stages: 1) the training phase and 2) the testing phase. During the training phase, sample documents are provided to the document classifier for each predefined category. The classifier uses machine learning algorithms to learn a class prediction model based on these labeled documents. In the testing phase, unlabeled documents are provided to the classifier, which applies its classification model to determine the categories or classes of the unseen documents. This training-testing approach makes the process of document classification a supervised learning task where unlabeled documents are categorized into known categories. Therefore, document classification can only be used when the domain of the retrieved documents is already known or the set of predefined categories can effectively cover most domains. In this context, it is also important to note that document classification algorithms focus on the text of the documents (like document clustering) rather than the hyper-information (like structural information) available in them. They usually represent a document as a collection (ordered or unordered) of words.

A large number of machine learning techniques have been applied to document classification. The Naïve Bayes (NB), Support Vector Machine (SVM), K-Nearest Neighbor, and Decision Tree algorithms are the most common algorithms used for document classification. Among the four, NB is the most frequently used algorithm for real-time classification systems (Lewis, 1998). This is because 1) NB is a linear-time algorithm and 2) it is easy to implement. NB, however, has been found to perform worse than several other classification algorithms (Yang & Liu, 1999; Rennie et al., 2003), with SVM receiving good reviews (Joachims, 1997; Shih et al., 2002). However, the quadratic complexities of other algorithms make them unsuitable for speed-critical systems.
2.2 Hierarchical Classification
Document classification is normally considered a flat classification technique, i.e., documents are classified into predefined categories where there is no relationship specified between the categories. This approach is suitable when a small number of categories are defined. However, in areas such as search result classification, where the retrieved documents can belong to several different categories, flat classification becomes inefficient and hierarchical classification is preferred. Hierarchical classification is the process of classifying documents into a hierarchical organization of classes. The assumption behind the hierarchical structure is that each class node in the hierarchy is a special type of its parent class node and a general type of its child nodes, thereby implying a hierarchical relationship. Web directories like Yahoo! Directory or dmoz are good examples of such class hierarchies. In fact, most online hierarchical classification systems utilize existing web directories as their predefined class hierarchies.

The literature on hierarchical classification reports two basic approaches, namely, 1) the big bang approach and 2) the top-down approach. In the big bang approach, documents are classified in one single step to internal nodes or leaf nodes in the hierarchy. In the top-down approach, flat classifiers are created at each node of the hierarchy. A document is classified by traversing it down the tree hierarchy and applying a sequence of classifiers, from the root node to the leaf node. Dumais and Chen (2000) used the linear SVM classifier to automatically classify search results from MSN’s Search Engine into a hierarchical structure. An interesting usability study showed that users preferred their approach to the usual ranked presentation. Mladenic (1998) used the Naïve Bayes classifier along with the Yahoo! Science hierarchy to classify text documents. Dhillon et al. (2002) used a combination of word clustering measures and Naïve Bayes classification along with the dmoz web directory for top-down hierarchical classification.
2.3 Ontologies, User Profiles and Personalized Search
Gruber (1993) defined an ontology as an explicit specification of a shared conceptualization. Simply put, an ontology is a formal way of representing concepts and their relationships. Taxonomical class hierarchies like web directories are an example of how ontologies can be used in the context of the WWW. In terms of semantic annotation of web pages, SHOE (Simple HTML Ontology Extensions)⁵ can be used to provide semantic metadata corresponding to concepts in predefined ontologies. This allows automated software agents to understand web site content on a more semantic level. An example of using ontologies to represent user profiles is OBIWAN (Chaffee & Gauch, 2000), which facilitates personalized web navigation. Users build their profiles by creating a hierarchical organization of concepts (personalized ontologies), which are then mapped to a core reference ontology. Software agents use this information to categorize websites into the user’s personal hierarchy of concepts.

Although personalization of search results has been explored, most research does not focus on the explicit use of ontologies in this context. Google Personalized⁶ allows users to rank search results according to a collection of topics. The Profusion Search Engine⁷ also provides this facility of organizing search results into topics. Several other companies like AltaVista (with geotracking services), Yahoo! (MyYahoo!) and MSN are looking into tapping the personalized search market. An interesting addition to the idea of personalized search is the Stuff I’ve Seen project (SIS) (Dumais et al., 2003) at Microsoft Research. SIS collects information on files or other text data which users see while working on their computers. The search interface allows users to search for these text documents at a later time.

⁵ SHOE. Simple HTML Ontology Extensions. http://www.cs.umd.edu/projects/plus/SHOE/
3 Ontology-based Hierarchical Classification
In this section we describe OHC (Ontology-based Hierarchical Classification), an algorithm for classifying documents into a user-defined hierarchy of classes. It is a top-down classification algorithm based on the Naïve Bayes classifier, and it uses the dmoz Computers hierarchy as its default class hierarchy.
3.1 Ontologies representing user profiles
We define an ontology in our framework in the following way⁸: An ontology is a sign system O := (C, D, F, H) consisting of
• A set of concepts C.
• A set of documents D.
• A mapping function F, which maps a document d ∈ D to a concept c ∈ C. A document can only be mapped to one concept; however, several documents can be mapped to the same concept.
• A hierarchy H = (N, E) defined as a directed simple graph with no cycles (i.e., a tree). H consists of a set of nodes N and a set of ordered pairs called edges (n_p, n_c) ∈ E ⊆ N × N. The direction of an edge (n_p, n_c) is defined from the parent node n_p to the child node n_c, specified by the relational operation n_p → n_c (after Granitzer, 2003, p. 14). Each node n_i ∈ N corresponds to a unique concept c_i ∈ C, i.e., n_i ≡ c_i. Thus, the ordered relationship n_p → n_c or c_p → c_c describes an is-a relationship between a parent concept and a child concept.

In this framework, an ontology is defined as a hierarchical organization of concepts where the textual documents assigned to each concept explain its semantics and the directed hierarchical structure provides an understanding of the relationships between them. We use this explicit conceptualization to represent user profiles. User profiles are created by identifying concepts which can relate to user perspectives. Textual samples describing each concept are provided in the form of web documents. The hierarchy is created by identifying the various is-a (class-subclass) relationships between the different concepts.

Creating such ontologies can be very tedious. In our demonstration system (WEBCLUSTERS) we provide users with two options: 1) create a personalized ontology using a provided tool (the Ontology Editor) or 2) use an existing ontology representing the users’ domain to customize their personal ontology. The second option makes the process of defining ontologies a little easier. Since concepts, their relationships and existing web documents are already available, users can customize these ontologies to fit their perspective. However, an important question in the second case is whether it is possible to preconceptualize most information domains (and thereby most user perspectives) using existing ontologies. We believe it is; since we define an ontology as a weighted topic hierarchy, existing web directories like dmoz and Yahoo! can be used for this purpose. As these directories cover most, if not all, information domains, we can use them as the foundations for representing user perspectives and for creating new personalized ontologies.
⁶ http://labs.google.com/personalized
⁷ http://www.profusion.com/
⁸ This ontology definition is based on Hotho, Maedche, and Staab’s formal representation (Hotho et al., 2002).
Figure 1: dmoz “Computers” hierarchy (tree diagram; node labels shown: Computers, Multimedia, Programming, Methodologies, Mobile Computing, Hardware, Languages, Java, C and C++, Software, Databases, Python, Internet, Security, Operating Systems, Linux, Unix, Windows)
Figure 1 shows a small portion of the dmoz “Computers” hierarchy used to represent the computer science domain. Subsets of this ontology can be used to symbolize different user perspectives. For example, a computer programmer can be represented by the “Programming” sub hierarchy. Similarly, the “Linux” sub-hierarchy can be used to represent a Linux enthusiast’s profile.
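The ontology definition and the idea of using a sub-hierarchy as a user profile can be sketched as a small tree data structure. This is a minimal illustration, not the WEBCLUSTERS implementation: the `Concept` class and its methods are hypothetical names, and the exact parent-child arrangement of the example nodes is an assumption, since only the node labels are given here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Concept:
    """One node n_i of the hierarchy H, identified with a concept c_i."""
    name: str
    documents: List[str] = field(default_factory=list)       # documents mapped here by F
    children: List["Concept"] = field(default_factory=list)  # is-a edges (n_p -> n_c)

    def add_child(self, child: "Concept") -> "Concept":
        self.children.append(child)
        return child

    def find(self, name: str) -> Optional["Concept"]:
        """Depth-first search for a concept by name within this subtree."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

    def concept_names(self) -> List[str]:
        """All concept names in this subtree, parent before children."""
        names = [self.name]
        for child in self.children:
            names.extend(child.concept_names())
        return names


# Assumed arrangement of a few nodes from the "Computers" hierarchy.
root = Concept("Computers")
programming = root.add_child(Concept("Programming"))
programming.add_child(Concept("Java"))
programming.add_child(Concept("Python"))
operating_systems = root.add_child(Concept("Operating Systems"))
operating_systems.add_child(Concept("Linux"))

# A programmer's profile is simply the "Programming" sub-hierarchy.
programmer_profile = root.find("Programming")
```

Because each document is stored on exactly one node, the constraint that F maps a document to a single concept holds by construction in this representation.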
3.2 Ontology-based Hierarchical Classification
The process of hierarchical classification of documents into the user-defined ontology is performed in two steps. In the first step, preprocessing and feature extraction, documents are first transformed from textual form into a bag of words. Stop words are then removed and the remaining words are stemmed. The feature vector is generated from the bag of words by using word frequencies. The feature vector provides the logical view of the document (Baeza-Yates & Ribeiro-Neto, 1999).

In the second step, once the user ontology has been defined and a logical view of the documents has been established, a hierarchical classifier is created and used to classify unlabeled documents. Ontology-based hierarchical classification of documents is realized by using the definition of an ontology and the theory of Multinomial Naïve Bayes (MNB) classification (cf. McCallum et al., 1998) to create a hierarchical classification model. We employ the top-down approach to hierarchical classification mentioned in Section 2.2. Flat MNB classifiers are created at each non-leaf node of the hierarchy to predict the most suitable child class of the node. A document is classified by traversing it down the tree hierarchy and applying a series of MNB classification tests at each internal node until a leaf node is reached. The ontology, along with the collection of flat MNB classifiers, forms an Ontology-based Hierarchical Classifier (OHC).

Note that before a document can be classified into the hierarchy, the MNB classifier at each node of the ontology has to be trained. The MNB classifiers are trained on sample documents belonging to the MNB classifier’s corresponding node or to its children. The sample documents are provided by the user while constructing the ontology or are generated automatically as described in Section 4.2.
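The top-down scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: documents are assumed to be already tokenized (stop-word removal and stemming are omitted), Laplace (add-one) smoothing is assumed, each classifier is trained on the documents found anywhere beneath each child, and the names `MultinomialNB`, `train_ohc`, `classify`, `children` and `samples` are hypothetical.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Flat multinomial naive Bayes over word counts, with add-one smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of training docs
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, tokens):
        best_label, best_score = None, -math.inf
        vocab_size = len(self.vocab)
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs / self.total_docs)  # log prior
            for word in tokens:
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def subtree_docs(node, children, samples):
    """Sample documents attached to a node or to any of its descendants."""
    docs = list(samples.get(node, []))
    for child in children.get(node, []):
        docs.extend(subtree_docs(child, children, samples))
    return docs


def train_ohc(root, children, samples):
    """Train one flat classifier per non-leaf node; returns node -> classifier."""
    classifiers, stack = {}, [root]
    while stack:
        node = stack.pop()
        kids = children.get(node, [])
        if not kids:
            continue
        docs, labels = [], []
        for kid in kids:
            kid_docs = subtree_docs(kid, children, samples)
            docs.extend(kid_docs)
            labels.extend([kid] * len(kid_docs))
        classifiers[node] = MultinomialNB().fit(docs, labels)
        stack.extend(kids)
    return classifiers


def classify(tokens, root, classifiers):
    """Walk from the root to a leaf, choosing the best child at each node."""
    path = [root]
    while path[-1] in classifiers:
        path.append(classifiers[path[-1]].predict(tokens))
    return path


children = {"Computers": ["Programming", "Hardware"], "Programming": ["Java", "Python"]}
samples = {
    "Java": [["java", "jvm", "class", "bytecode"]],
    "Python": [["python", "interpreter", "script", "pip"]],
    "Hardware": [["cpu", "motherboard", "memory", "disk"]],
}
ohc = train_ohc("Computers", children, samples)
java_path = classify(["jvm", "bytecode"], "Computers", ohc)
hardware_path = classify(["cpu", "disk"], "Computers", ohc)
```

Each classification touches only one classifier per level of the tree, which is what keeps the top-down approach cheap compared with scoring a document against every class in the hierarchy.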
4 The WEBCLUSTERS System
We have brought the Ontology-based Hierarchical Classification algorithm and the personalized search result organization paradigm together in a proof-of-concept demonstration system called WEBCLUSTERS⁹. WEBCLUSTERS is designed to interface with existing search engines to retrieve search results for keyword queries and organize them into user-defined topic hierarchies. At present we have used WEBCLUSTERS to organize search results from Google, Yahoo! Search, Ask Jeeves, MSN Search, LookSmart and Overture. WEBCLUSTERS consists of two main components: 1) a search interface, which allows users to specify search queries along with personalized ontologies and to visualize the classified search results, and 2) an ontology editor called WOE (WEBCLUSTERS Ontology Editor), which facilitates the creation of users’ personal ontologies using a visual representation.
⁹ The term “CLUSTERS” is used purely for symbolic reasons and represents the search result organization paradigm. WEBCLUSTERS is a classification engine, not a clustering engine.
Figure 2: WEBCLUSTERS demonstrator
4.1 User Interface
A WEBCLUSTERS session starts when the user enters a search query and chooses the user perspective (ontology) to be used for the search and classify process. The search query is typically a word or a set of words which succinctly describes the user’s information need. The user ontology is either a generic ontology like the dmoz “Computers” or a personal ontology created using WOE. Once the user submits the query, the system queries one of the 6 search engines mentioned above to retrieve approximately 30-100 search results. This variance in the number of search results is based on the search engine chosen. After the search results are returned from the search engine, they are classified into the user ontology and presented on the user’s browser (Figure 2).
4.2 Ontology Construction
WOE is a desktop application which allows users to construct personalized ontologies and save them to our ontology database. Figure 3 shows an ontology created using WOE’s visual tree editor. Using WOE, users create ontologies by defining the concept hierarchy and assigning web pages relevant to each topic. The hierarchy is shown as a tree in the pane on the left. The toolbar at the top and the localized popup menu provide most of the ontology editing functionality. The properties pane in the top right-hand corner, and the XML pane below it, show contextual information for the selected node. All of these features have been designed to make the process of constructing ontologies easy and intuitive.

WOE can also be used to create classifiers for concepts in ontologies semi-automatically. This semi-automatic process is useful when the user has defined the concept hierarchy but does not want to spend time specifying the sample pages used for training each concept. The automate feature in WOE can be used to automatically assign web pages to each leaf concept in the hierarchy. The automate program adds sample pages for each node in the hierarchy by using the node’s path (from the root to the node) as a query to a search engine and the retrieved search results as the training data for the MNB classifier associated with this concept. This is illustrated in Figure 4.
Figure 3: WEBCLUSTERS Ontology Editor
The semi-automatic process relies heavily on the search engine to retrieve relevant results. High precision in the search results can be expected from the search engine because the queries are quite detailed (“Computers Programming Java” for the node Java in Figure 4). Once an ontology has been created manually or semi-automatically, the user can use the “Export” button to inform WOE that the ontology creation process is complete and that the ontology is ready to be saved. At this stage, a new Ontology-based Hierarchical Classifier (OHC) is created and trained on the ontology provided by the user. This OHC, together with the user’s ontology, is then saved to the ontology database.
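The construction of the detailed query from a node's path can be sketched in a few lines. This is only an illustration of the path-to-query step; the `parent_of` mapping and the `node_path_query` function are hypothetical names, and the call to an actual search engine is deliberately left out.

```python
def node_path_query(parent_of, node):
    """Join the concept names on the path from the root to `node` into a query."""
    path = [node]
    while node in parent_of:  # the root has no parent entry
        node = parent_of[node]
        path.append(node)
    return " ".join(reversed(path))


# Child -> parent links for the example path mentioned in the text.
parent_of = {"Programming": "Computers", "Java": "Programming"}
query = node_path_query(parent_of, "Java")
```

Here `query` is the string "Computers Programming Java", matching the example given for the node Java; the retrieved results for such a query would then serve as training pages for that node.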
Figure 4: Using search engines to obtain sample pages for ontology nodes
5 Experimental Evaluation
In this section we discuss the empirical techniques used to evaluate the classification performance of our system. The purpose of our experiments was to address three key questions: 1) what is the accuracy of an OHC trained on a large ontology like dmoz, where a large number of training samples are available; 2) how does an OHC perform on a user-defined ontology, where training samples are limited; and 3) what is the accuracy in cases 1) and 2) when only the summary of each document is used for classification, rather than the entire text.
5.1 Datasets
We used the Yahoo! Computers and Internet (YCI) ontology as a reference for defining our testing ontology dataset. A total of 175 topics were selected from YCI to represent our ontology’s class hierarchy. Textual samples for each class were also taken from the corresponding leaf classes in the YCI ontology. Table 1 shows the top-level classes in the testing dataset.

Table 1: Top Level Categories of the Yahoo! Computers and Internet Testing Dataset

Category Name                   Test Pages
Programming and Development     1434
Software                        1926
Multimedia                      603
Security and Encryption         758
Hardware                        687
Total                           5408
For our experiments, we also needed to create two OHCs: One trained on an ontology where a large number of web pages are available for each topic in the hierarchy and another representing a personalized user defined ontology where training documents are limited. We used the default dmoz classifier from WEBCLUSTERS as the former. The latter (personalized ontology) was created by using only the topic hierarchy of the YCI testing dataset and populating it with web pages using the semi-automatic population feature of the WOE and some manual editing. In this ontology each topic was only assigned around 10 training web pages.
5.2 Methodology
The two training sets (dmoz and a personal ontology) and the testing set (YCI) shared the same concept hierarchy but contained text documents from different data sources organized by different sets of users. This approach was used to represent the different perspectives of users for the same concept hierarchy. Our evaluation was aimed at measuring the correlation between views on the same hierarchy given three different perspectives. This correlation was determined by measuring the accuracy of the OHC when classifying the web pages from YCI first into the dmoz hierarchy and second into the Personal hierarchy. This also reflects the real-world usage scenario of WEBCLUSTERS, where heterogeneous documents are classified based on the user’s perspective.

As described in Section 3.2, an OHC uses flat Multinomial Naïve Bayes (MNB) classifiers at each non-leaf node in the hierarchy to classify documents to the lower child nodes. A document is classified by traversing it down the tree hierarchy and applying a series of MNB classification tests at each internal node until a leaf node is reached. To test whether a document has been classified correctly, we cross-checked the predicted class in dmoz and the personal ontology with the document’s actual class in YCI. To measure the performance of the approach we used the following definition of accuracy: the accuracy of a classification algorithm is defined as the ratio of documents which are correctly classified (correct class prediction) to the total number of test documents which are classified (correct and incorrect predictions)¹⁰, i.e.,

\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \]
¹⁰ Accuracy is also often referred to as precision.
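The accuracy measure defined above amounts to a simple ratio, sketched below. The function name `accuracy` and the example class labels are hypothetical and chosen only for illustration; they are not taken from the experimental data.

```python
def accuracy(predicted, actual):
    """Fraction of documents whose predicted class equals the actual class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one prediction is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(predicted)


# Hypothetical labels: three of the four predictions match the reference classes.
predicted_classes = ["Java", "Linux", "Java", "Unix"]
actual_classes = ["Java", "Linux", "Python", "Unix"]
score = accuracy(predicted_classes, actual_classes)  # 3 correct out of 4
```

In the evaluation setting described here, `predicted` would hold the leaf class chosen by the OHC for each YCI test page and `actual` the page's class in YCI.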
5.3 Experiments and Results
Four separate experiments were conducted to address the questions described at the beginning of this section. Experiment 1 tested the WEBCLUSTERS dmoz ontology on the YCI dataset using the entire text of the test documents. Experiment 2 did the same but used the short summaries available for each document instead of the entire text. Experiment 3 used the Personal ontology (which had a limited number of training documents) to classify the YCI dataset’s documents using the entire text to represent each document. Experiment 4 did the same as Experiment 3 but used short summaries in place of the entire text. Accuracy was calculated by determining the correct predictions at each leaf node of the dmoz and Personal ontologies. The results are summarized in Figure 5.

[Bar chart; y-axis: Accuracy (0% to 100%). Dmoz-FullText: 78.9%; Dmoz-Summary: 69.9%; Personal-FullText: 74.6%; Personal-Summary: 67.6%]
Figure 5: Performance of two OHC using Full-Text and Summary Classification
The dmoz OHC operating on the actual text of a text document was found to be the most accurate at 78.9%. The dmoz OHC working on small summaries of each document was less accurate in predicting the right classes (at 69.9%). Interesting results were produced from the experiments on the personalized ontology. The Personal ontology OHC operating on the testing document’s text was the second most accurate classifier at 74.6%. Even though very limited training web pages were assigned, the model was able to predict unlabeled documents with acceptable accuracy. The Personal OHC operating on the testing document’s summaries was found to be least accurate at 67.6%. This showed that the OHC can be quite accurate in predicting the right classes even if a limited number of training samples are provided. Using these results we can argue that in a real world scenario where the testing documents are quite heterogeneous in nature, the WEBCLUSTERS system is able to classify search results into the user specified hierarchy with a reasonable amount of accuracy. Accuracy is at a maximum when a large number of training documents are supplied for each concept and the entire text is used for classification.
6 Conclusion and Future Work
We have shown in this paper that our approach to dynamic hierarchical organization of search results addresses issues in web search and document clustering paradigms. With the help of user-defined ontologies we have also demonstrated the ease of personalizing search results based on user perspectives. The WEBCLUSTERS demonstrator is a proof of concept of a real-world application which utilizes this approach to enhance existing search services.

With the emergence of new and powerful techniques in the field of web search and personalized information retrieval, we believe that there are a number of improvements that can be made. Here we focus on the issue of building personalized ontologies. At present, the actual process of building personalized ontologies is quite tedious. Average users, who are only concerned with retrieving relevant search results in as little time as possible, may find it difficult to identify concept hierarchies and build ontologies. The semi-automatic process of obtaining sample documents as described in the paper is a step in the right direction. However, in order to make personalized web search truly useful, a fully dynamic ontology creation process is required. One such approach would be to (non-invasively) monitor users’ activity on their computers while they browse the web. This way, the system would be able to build a concrete understanding of the user’s perspective without requiring any explicit work from the user. Another approach is to support a group of people who share the same interest or work context in collaboratively building ontologies and sample document collections, as described in Nakata et al. (1998). Such an approach would also lead to effective sharing and reuse of ontologies.
References
Baeza-Yates, R. A. and Ribeiro-Neto, B. A. (1999). Modern Information Retrieval. ACM Press/Addison-Wesley.
Chaffee, J. and Gauch, S. (2000). Personal Ontologies for Web Navigation. In Proc. of the 9th Intl. Conf. on Information and Knowledge Management (CIKM'00), 227-234.
Chen, H. and Dumais, S. T. (2000). Bringing order to the web: Automatically categorizing search results. In Proc. of CHI'00, Human Factors in Computing Systems, 145-152.
Chen, M., Hearst, M., Hong, J. and Lin, J. (1999). Cha-Cha: A System for Organizing Intranet Search Results. In Proc. of the 2nd USENIX Symposium on Internet Technologies and Systems (USITS).
Dhillon, I. S., Mallela, S. and Kumar, R. (2002). Enhanced Word Clustering for Hierarchical Text Classification. In Proc. of the Eighth ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), 191-200.
Dumais, S. T. and Chen, H. (2000). Hierarchical Classification of Web Content. In Proc. of SIGIR-00, 23rd ACM International Conf. on Research and Development in Information Retrieval, 256-263.
Gruber, T. R. (1993). A translation approach to portable ontology specifications. Tech Report Logic-92-1, Department of Computer Science, Stanford University.
Granitzer, M. (2003). Hierarchical Text Classification using Methods from Machine Learning. PhD thesis, Graz University of Technology, Austria.
Hotho, A., Maedche, A. and Staab, S. (2002). Ontology-based Text Document Clustering. Künstliche Intelligenz, 16(4), 48-54.
Joachims, T. (1997). Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proc. of ECML-98, 10th European Conf. on Machine Learning, 137-142.
Kleinberg, J. M. (1999). Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46(5), 604-632.
Lewis, D. D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. In Proc. of ECML-98, 10th European Conf. on Machine Learning, Lecture Notes in Computer Science, 1398, 4-15.
McCallum, A. and Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification. In Proc. of the AAAI-98 workshop on Learning for Text Categorization, 137-142.
Marchiori, M. (1997). The Quest for Correct Information on the Web: Hyper Search Engines. In Proc. of WWW6 (Sixth International World Wide Web Conf.), 265-276.
Mladenic, D. (1998). Machine learning on non-homogeneous, distributed text data. PhD thesis, University of Ljubljana, Faculty of Computer and Information Science.
Nakata, K., Voss, A., Juhnke, M. and Kreifelts, T. (1998). Collaborative Concept Extraction from Documents. In Proc. of PAKM 98, the Second Int. Conf. on Practical Aspects of Knowledge Management.
Rennie, J. D. M., Shih, L., Teevan, J. and Karger, D. R. (2003). Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In Proc. of the Twentieth International Conf. on Machine Learning, 616-623.
Shih, L., Chang, Y., Rennie, J. and Karger, D. (2002). Not Too Hot, Not Too Cold: The Bundled-SVM is Just Right! In Proc. of the ICML-2002 Workshop on Text Learning.
Yang, Y. and Liu, X. (1999). A Re-Examination of Text Categorization Methods. In Proc. of SIGIR-99, 22nd ACM International Conf. on Research and Development in Information Retrieval, 42-49.
Zamir, O. and Etzioni, O. (1998). Web document clustering: a feasibility demonstration. In Proc. of the 19th International ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'98), 46-54.
Zamir, O. and Etzioni, O. (1999). Grouper: A Dynamic Clustering Interface to Web Search Results. Computer Networks, 1361-1374.