Altering Document Term Vectors for Classification: Ontologies as Expectations of Co-occurrence
Meenakshi Nagarajan, Amit Sheth
LSDIS Lab, Dept. of Computer Science, University of Georgia, Athens, GA, USA
{bala,amit}@cs.uga.edu
ABSTRACT
Document Classification, the process of classifying documents into pre-defined categories, is one of the most popular tasks aimed at grouping and retrieving similar documents. Like many Information Retrieval (IR) tasks, classification techniques rely on using content-independent metadata, e.g., author, creation date, etc., or content-dependent metadata, i.e., the words in the document. One of the most common challenges is that classifiers tend to be inherently limited by the information that is present in the documents. While past research has made use of external dictionaries and topic hierarchies to augment the information that classifiers work with, there is still considerable room for improvement. This work is an investigative effort towards exploring the use of external semantic metadata available in Ontologies, in addition to the metadata central to documents, for the task of supervised document classification. The aim is to go beyond 'word co-occurrence / synonym / hierarchical' similarity based classification systems to one in which the semantic relatedness between terms explicated in Ontologies is exploited. We study the effectiveness of our approach on categories in the national security domain, trained using documents from the Homeland Security Digital Library. Preliminary evaluations and results indicate that the technique improves the precision and recall of classification in some cases, and illustrate when the use of Ontologies for such a task does not add significant value.
Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Abstracting methods, Dictionaries, Indexing methods, Linguistic processing, Thesauruses
Keywords Supervised Document Classification, Background domain knowledge, Ontology, Vector Space Models, Ranking semantic relationships
1. INTRODUCTION The web contains a vast amount of information being produced continuously over time by many sources, in many different formats and styles. To make this information more usable, a number of different services have emerged, such as search engines, portals, information crawlers and aggregators, knowledge and shopping bots, and other autonomous agents that collect, classify, and organize web pages for human consumption. Many of these services rely on document classification. Document classification refers to the process of organizing a set
of documents, typically a large set, into a set of predefined classes or categories. To perform its function automatically, a document classifier typically relies on some training data, which is often a small but significant fraction of the documents together with their correct categories as specified by a human (or some other external, expensive but very accurate means of classification). Such document classifiers operate by automatically inferring statistically significant patterns that differentiate each category. As a result, classifiers do not require humans to carefully define how and when a document should fall into a given category. As far as the classifier is concerned, a category is defined by the examples provided in the training data. This approach is attractive because it reduces the necessary human work to providing effective training data. Unfortunately, this approach also introduces a problem: classification is inherently limited by the information that can be inferred from the training data. So, if the words 'Abu Bakar' and 'Abu Sayyaf' did not co-occur frequently in the training corpus, statistical techniques will fail to make the correlation that they are somehow related. Recognizing this drawback, a number of prior research efforts have proposed augmenting the training data with external information to help the classifier learn the different categories better. For example, one could use a database of terms and their synonyms, or dictionaries, to map together equivalent words [36]. One could also use information about subclass and superclass relationships to refine the learning process [29]. These approaches have been reasonably successful, but they are limited to very specific forms of outside information and, correspondingly, in the kinds of relatedness between terms that they exploit, which may or may not be available. In this paper, we propose a more general framework for using background knowledge in classification, based on Ontologies.
An Ontology is a way to represent knowledge within a specific domain [18]. For example, in the domain of terrorist organizations, an Ontology contains entities and relationships between them; for example, 'Abu Bakar' (a person) is a member of 'Abu Sayyaf' (a terrorist organization). Traditional retrieval systems using this information can therefore discover that two documents, each containing one of these terms, are related. Ontological knowledge is currently available in a number of domains, including the biomedical domain and the homeland/national security domain that we experiment with in this paper [4, 37]. We believe that the information contained in such Ontologies can be incorporated into many classification schemes and algorithms. In this paper, we focus on a particular classification scheme based on vector space models [35, 32], which represent documents as vectors of their most important terms. In a typical supervised setting, the process of classification consists of (a) using training data to find the most important terms or a representative term
vector for each category (category term vector); entries in the category term vector are weighted based on their discriminatory nature; (b) generating a term vector for the documents to be classified by indexing and weighting their most representative terms; and (c) classifying a document by employing a similarity measure between the vector of the document and the category term vector of each category [16, 39]. As one can imagine, the representative terms in a document's and a category's term vectors, and the weights assigned to those terms, determine to a large extent the result of a classification. Our approach for using Ontologies to affect supervised document classification primarily works by altering document term vectors. The motivation is to understand how useful such external domain knowledge is to the process of classification, what the trade-offs are, and when it makes sense to bring such background knowledge to bear. To do this, we intuitively alter basic TFIDF [34] weighted document term vectors (syntactic term vectors) with the help of a domain Ontology to generate semantic term vectors for all documents to be classified. The contribution of this work is therefore not in improving any of the current classification algorithms, but in affecting the document term vectors in an intuitive fashion and measuring the effect on the results of existing classifiers. We demonstrate the effectiveness of our approach in the national security domain. Our background knowledge is obtained from a high quality Ontology built for a previous semantic analytics application in the Intelligence domain [4]. We train our 60 categories (created for the same application) using documents from the Homeland Security Digital Library [21]. Documents to be classified are obtained from sources listed in [17].
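The (a)-(c) process above is, in essence, nearest-centroid classification over weighted term vectors. The sketch below is an illustration of that general scheme only, not the authors' implementation; all names are ours.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (term -> weight dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Category term vector: the average of the training documents' vectors."""
    c = {}
    for vec in vectors:
        for t, w in vec.items():
            c[t] = c.get(t, 0.0) + w
    n = len(vectors)
    return {t: w / n for t, w in c.items()}

def classify(doc_vec, category_centroids):
    """Assign the document to the category whose centroid is most similar."""
    return max(category_centroids,
               key=lambda c: cosine(doc_vec, category_centroids[c]))
```

A document is thus labeled purely by similarity to the training examples, which is exactly the limitation the paper sets out to relax.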
Using our approach, we enhance the basic syntactic term vectors of documents to generate semantic term vectors, and test a linear-time centroid-based document classification algorithm [20] to assess the effectiveness of the altered vectors. Preliminary results indicate a positive effect on the recall, precision and confidence metrics in some cases. We also observed clear patterns in the cases where this approach was not as effective. The remainder of this paper illustrates the problem with an example (Section 2); explains the formal basis for our hypothesis in affecting term vectors (Section 3); gives an overview of our approach with examples (Sections 4, 5 and 6); provides some details on our implementation (Section 7); presents evaluations and discusses results (Section 8); and closes with some concluding discussions on the challenges we faced, our experiences and ideas for future work (Sections 9 and 10).
2. ILLUSTRATIVE EXAMPLE To understand the kind of document space this work is hoping to affect, we present a simple example in this section. In our experiments with retrieving and ranking relevant documents [4, 2], we have often found that statistical techniques over document terms are insufficient in the following two cases: 1. When the representative terms of a document do not satisfy a simple exact / synonym / co-occurring relationship with terms of other documents, but are related by more complex named relationships; e.g., the only relationship between the terms 'Abu Bakar' and 'Abu Sayyaf' is that the former is a member of the latter (Abu Bakar -memberOf-> Abu Sayyaf). 2. When the most representative terms in a document offer little or no evidence to state conclusively what the document is actually talking about. Below is an excerpt of a document relating to the second case that was a contender for being classified as a 'terrorist event' or about
the 'terrorist organization' without enough evidence for either. Further analysis of the document, in addition to some background knowledge (the Shoe Bomber was a member of Al-Qaeda), provided more evidence to justify either classification. "Case of the Shoe Bomber: Lessons in counter-terrorism--this time at no cost: On the morning of December 23, 2001, the world did not wake up to the headline 'Passenger Plane Blown up in Midair by Suicide Bomber.' This chilling development was narrowly averted on the previous day, due to a vigilant stewardess and some determined passengers on an American Airlines flight from Paris to Miami... Over the past decade, particularly in the six years since the Taliban seized power in Afghanistan, bin Laden succeeded in developing a veritable industry of terrorism in Afghanistan. The seeds of this effort will likely continue to bear fruit in the foreseeable future..." In cases like these, when the contents of the document alone prove insufficient, the use of external background knowledge comes in very handy. Such cases also indicate the need for exploiting the more complex, yet common, named relationships between terms in documents, beyond the simple 'equivalence / synonym / hierarchical' relationships that have been exploited so far. In this work, we make use of such semantic background knowledge available in Ontologies to intuitively alter and add to the contents of a document's term vector, thereby providing classifiers with more information than what is exemplified in the document.
3. ALTERING TERM VECTORS: ONTOLOGIES AS EXPECTATIONS OF CO-OCCURRENCE In this section, we describe the rationale behind our hypothesis that semantic background knowledge from Ontologies can be used to augment traditional syntactic term vectors. A fundamental drawback of Vector Space Models is that they treat a document as a bag of words and ignore the dependence between terms, i.e., they assume that terms in a document occur independently of each other. One kind of dependence that has been captured and utilized is the order in which terms appear in the document [8, 15]. In reality, however, there is a strong interdependence between terms imposed by the context of discourse. For example, if we spot the word Miami in a document, there is a very high probability that we will also see the words Florida or FL. Topic-based vector space models [11] incorporate these dependency notions by considering equivalence and synonym based relationships between terms. Capturing this implicit dependence in terms of co-occurrences has been successfully attempted by the use of statistical techniques, in addition to lexical and morphological normalization of term vectors. Figure 1 shows the space of a document's discriminatory terms (i.e., terms that uniquely identify a document). Some of the most important contributions in normalizing term vectors have been to use heuristics and external information like dictionaries and topic hierarchies to move the terms in the ambiguous (grey) regions to one of the 4 quadrants, where one can tell with higher certainty what the discriminatory nature of a term is. For example, if t1 and t2 in the figure are synonyms or co-occur frequently, combining their TFIDF scores moves t2 to a space where it can be considered more representative of the document. However, there are cases when terms do not co-occur very often and are also not related in a way that lexical or morphological
techniques can help. For example, terms t1 and t2 may be related only by a named relationship. In such cases, current techniques will fail to identify a possible correlation between the terms and therefore will not normalize the term vector effectively. Our heuristic therefore extends the intuition of exploiting relatedness between terms to normalize term vectors, by using named semantic relationships between terms in addition to 'synonym' / 'class hierarchy' / 'statistical co-occurrence' based relationships.
Figure 1. Discriminatory nature of terms in a document
4. ALTERING TERM VECTORS: OVERVIEW OF OUR APPROACH The core of our approach lies in altering document term vectors in three simple steps; see Figure 2. There is a direct analogy between this and traditional query expansion techniques, with the important difference being the use of 'semantic relatedness' between terms, as opposed to the statistical importance of terms alone, to augment term vectors. 1. The syntactic term vector Vsyn: We start with a state-of-the-art indexing system, Lucene [7], to generate document term vectors that order terms in a document by the importance of their occurrence in the document and the corpus, i.e., by a normalized TFIDF score. An additional component that indexes phrases (e.g., Abu Sayyaf) instead of words alone (e.g., Abu) was built on top of Lucene. This step is analogous to how most IR systems index their documents. To be fair to the state of the art in indexing documents, we also normalize the term vectors using synonyms from WordNet [28] at this stage. We call this vector of terms the syntactic term vector Vsyn.
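Step 1 can be sketched as a minimal TFIDF vector builder with synonym folding. This is an illustration only, not the Lucene-based implementation the paper uses; the function names and the canonical-synonym map are assumptions.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, corpus_df, n_docs):
    """TFIDF-weighted term vector for one document.
    corpus_df maps a term to the number of documents containing it."""
    tf = Counter(doc_tokens)
    return {t: (f / len(doc_tokens)) * math.log(n_docs / (1 + corpus_df.get(t, 0)))
            for t, f in tf.items()}

def merge_synonyms(vec, canonical):
    """Fold synonyms onto one canonical term, summing their TFIDF scores.
    canonical maps a term to its canonical form (e.g., derived from WordNet)."""
    merged = {}
    for t, w in vec.items():
        key = canonical.get(t, t)
        merged[key] = merged.get(key, 0.0) + w
    return merged
```

In a real pipeline the tokens would be phrases as well as words, per the phrase-indexing component described above.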
3. The enhanced semantic term vector Venh-sem: For every term Ti in the semantic term vector, we use instantiations of the term in the Ontology to obtain the most relevant terms (Trs) connected to Ti. Cases where multiple matches for a term are found in the Ontology are disambiguated using our past efforts in entity disambiguation [1]. The relationships considered when obtaining relevant Trs encompass all the ones modeled in the Ontology: aliases, subclass/superclass, named relationships, etc. For example, see part 3 of Figure 2, which shows the term 'tr1', appearing in the syntactic and semantic term vectors, connected to six other terms in the Ontology. To avoid adding too many related terms from the Ontology and thereby artificially inflating the recall of the classification process, the selection of the most relevant Trs is determined by several factors. One of them is the importance of the relationship in the Ontology that semantically relates Ti and Tr. Weights on relationships indicate the importance of the relationship in the domain and/or for the particular classification setting. A term Tr (in the Ontology) that is related to Ti could already be present in the term vector of the document, i.e., it was already present in the document, or it could be a new term being added to the term vector, i.e., it was not originally present in the document. If the term Tr is already present in the document, its relationship to Ti in the Ontology is considered a corroboration of the co-occurrence of the terms in the document. In the latter case, Tr is a term that is highly relevant to the document, the addition of which to the term vector might affect classification. In either case, a weighting function determines the weights that the terms Ti and Tr take in the new term vector. We call this weight-adjusted term vector the enhanced semantic term vector Venh-sem.
Open source Ontology storage and querying systems like Jena [25] and BRAHMS [24] were used at steps 2 and 3 for creating and enhancing the semantic term vector. Whereas steps 1 and 2 are relatively straightforward, step 3, the process of enhancing the term vector to include more terms from an Ontology than those explicitly present in a document, is the most crucial step towards affecting the classification process. The next section explains in detail how relevant classes and instances from the Ontology are identified to enhance the vector, and the weight functions that are used to intuitively alter the importance of terms depending on their relevance to the document.
5. THE ENHANCED SEMANTIC TERM VECTOR Venh-sem
Meaningfully extending the semantic term vector to include terms that are not explicitly mentioned in the document, or to corroborate the ones already present in the document, involves two critical steps:
1. Quantifying the weights of relationships in the Ontology, i.e., ranking semantic relationships, so that only the most relevant terms related by the most important relationships are added; and
2. Formalizing a weight function that intuitively alters the weights of old and new terms in the term vector, thereby reflecting their relative importance in the document.
Figure 2. Altering document term vectors
2. The semantic term vector Vsem: We then construct a new term vector, which we call the semantic term vector Vsem, for all the documents. This vector consists of the terms that are in Vsyn and are also instances in a domain Ontology. The weights for terms in Vsem are the same TFIDF scores as in Vsyn. This step guarantees a 'meaningful' reduction in term vector dimensionality and also establishes a semantic grounding of the terms in the document that overlap with instances in the Ontology. Although there is a risk of obtaining a very sparse vector depending on the modeling of the domain Ontology, for now we assume the existence of a relatively complete model.
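Step 2 amounts to intersecting the syntactic vector with the Ontology's instance set while carrying the TFIDF weights over unchanged. A minimal sketch (names are illustrative, not from the authors' system):

```python
def semantic_vector(v_syn, ontology_instances):
    """Keep only the terms of Vsyn that are also instances in the domain
    Ontology; TFIDF weights carry over unchanged from Vsyn."""
    return {t: w for t, w in v_syn.items() if t in ontology_instances}
```

The sparsity risk noted above corresponds to this filter discarding most of Vsyn when the Ontology models the domain thinly.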
To illustrate with an example, consider the document in Figure 3 whose syntactic and semantic term vectors are also shown in the figure. The term 'Isnilon Hapilon' (Ti) that belongs to both Vsyn and Vsem (i.e. is present in the document as well as the Ontology) is related to two terms 'Abu Sayyaf' and 'Iraq' (Trs) in the
Ontology via some named relationships. The questions one would hope to answer while trying to enhance the term vector are the following: a. In the Ontology, does the term 'Iraq's' relationship to the term 'Isnilon Hapilon' (which is in the document) make 'Iraq' important enough to this document to add it to the term vector? b. 'Abu Sayyaf' already co-occurs with 'Isnilon Hapilon' in the document. Does the fact that an Ontology corroborates their relatedness strengthen their importance in the document? c. If more than one term in the document ('Abu Sayyaf' and 'Isnilon Hapilon') is related to a term in the Ontology ('Iraq'), does that mean there is more evidence for 'Iraq's' importance to this document?
Figure 3. Weighting semantic relationships between terms
One of the fundamental requirements for being able to answer these questions is to be able to assess the importance of relationships that relate two terms in the Ontology. For example, among the three relationships / associations seen in Figure 3, which one of them is considered most important by the classification task / end-user? In the following sections, we formalize these intuitions on identifying related terms and the relative altering of weights in term vectors.
5.1 Ranking semantic relationships in the Ontology In using domain models for identifying semantically relevant entities, it is expected that the number of relationships between entities in a knowledge base will be much larger than the number of entities themselves. Using semantic relationships to find related entities would therefore produce an overwhelming number of results, elevating the need for appropriate ranking schemes. Our past work on ranking semantic relationships, SemRank [6], uses a blend of semantic and information-theoretic techniques, along with heuristics, to determine the rank of semantic relationships in an Ontology. In addition to assigning a numerical score to a simple or complex schematic relationship (also called an association), as shown in Figure 4, SemRank also assesses the relative importance of entities participating in a relationship. For example, in the schematic relationship between a Terrorist Group and a Person, not all instances Al-Qaeda {Mohammed Atta, Waleed al-Shehri, Abdulaziz al-Omari...} are equally relevant; new members might be considered less important than older ones. Statistical techniques on the instance base give a numerical indication of which entity is more important. An alternate approach to ranking semantic associations on the Semantic Web [19, 3] uses a human-specified context and statistical metrics like the rarity, popularity and association length of the relationships. Our system uses ranks assigned by SemRank and additional human input to establish final numerical scores on schema-level relationships and/or associations (see example in Figure 4). The idea behind having a human intervene here is to let the importance of relationships reflect the classification requirement at hand. If the analyst is not interested in relationships between a city and a terrorist event, he should be able to rank those lower compared to other relationships.
Figure 4. Ranking semantic relationships - example
Copyright is held by the author/owner(s). WWW 2007, May 8--12, 2007, Banff, Canada.
At the end of this process, for every entity Ti in Vsem, we know, with respect to the underlying Ontology, the important relationships that Ti participates in and which of the related terms Tr (instances in the Ontology related via the important relationships) are important. This information is used in the weight function, in addition to other parameters, to affect the content of the term vector. Even after this ranking there might be cases where an entity is related to several other entities, e.g., Al-Qaeda {Mohammed Atta, Abdulaziz al-Omari...}. To place an upper threshold on how many entities to include in the term vector, we pick the top 6 most strongly related entities (the top 6 important members in this case), based on evaluations conducted and knowledge of the domain.
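The thresholding described above (rank related entities by relationship strength, keep the top 6) can be sketched as follows; the edge-list layout and function name are our assumptions, not the BRAHMS/SemRank representation.

```python
def top_related(entity, ontology_edges, k=6):
    """Pick the top-k entities most strongly related to `entity`.
    ontology_edges: list of (source, relation, target, weight) tuples,
    where weight is a SemRank-style score of the relationship."""
    related = [(tgt, w) for src, rel, tgt, w in ontology_edges if src == entity]
    related.sort(key=lambda p: p[1], reverse=True)
    return [tgt for tgt, _ in related[:k]]
```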
5.2 The Weight Function Weights associated with terms in a term vector indicate their importance relative to other terms in a document. So, if a document's term vector is ordered by decreasing term weights, the first term in the vector is the most important / discriminatory term in the document, and so on. The idea behind the intuitive TFIDF weighting scheme (and its variations) was to amplify the importance of the most discriminatory terms and weed out noise by assigning lower weights to less discriminatory terms. Given the varying lengths of documents and therefore of their term vectors, algorithms usually employ a threshold to ignore terms below a certain degree of importance. An obvious result of this is that if there are terms that are actually important to a document but have a low TFIDF score (perhaps because the document represents slightly different content than the rest of the corpus), there is a risk of those terms being excluded when the threshold is applied to prune the term vector. We believe it is important to identify not only statistically important terms but terms that are statistically and contextually or semantically relevant to a document. Our hypothesis for weighting terms in a vector therefore proceeds by assessing the relatedness between terms in a document, in addition to their statistical co-occurrence, to assign weights. It only seems fair that a term be representative of a document both because of its own TFIDF score in the document and because of how it is related to other terms in the document. For example, in a sentence like "The influential Sunni Association of Muslim Scholars, a group of top religious leaders for the nation's Sunni minority, has called for the dissolution of government after an arrest warrant was issued for its leader.", words like 'dissolution' and 'government' that occurred very frequently in the national security corpus understandably got lower weights. But the term 'Sunni Association', which occurred more frequently in the corpus than 'Muslim scholars', was assigned a far lower weight than the latter term. It only makes sense that the term 'Sunni Association' be just as important as 'Muslim scholars' because of the implicit strong relationship between them (Muslim scholars are members of Sunni Associations). To summarize, we believe that a term gets ranked higher if (a) it has a high TFIDF score and (b) many terms in the document are related to it, or some terms with high weights are related to it. A closer look reveals a direct analogy between this hypothesis and the basic PageRank [13] intuition, the difference being the kind of relationships that are exploited: inter-page structural hyperlinks vs. named relationships between terms. Making the analogy, a term has a high weight if many terms in the document are related to it or some terms with high weights are related to it. In other words, a term is representative of a document because of its own TFIDF score in the document and a function of its relation to other terms in the document. Going back to extending our semantic term vector, we are interested in seeing how the addition of new highly related terms from the Ontology, or the presence of terms in the Ontology that corroborate co-occurrence in the document, affects the weights of terms in the vector. Clearly there are two cases to consider here, as shown in Figure 3.
Case 1: When Ti (a term in the document) is related to a Tr (in the Ontology) and Tr does not already exist in the document (new term). Relating to the example in Figure 3, the weight of a new term Tr (Iraq) being added to the term vector is a function of:
- the importance (TFIDF score) of the source terms Ti in the document (Abu Sayyaf, Isnilon Hapilon);
- the importance of the relationship between the terms (operates in, born in); and
- the importance of the term being added (Iraq).
The term is not added if it does not pass a certain threshold. The new weights for the terms Ti (Abu Sayyaf) and Tr (Iraq) in this case are calculated using the following equations:
Tr' = TFIDF(Tr) + Σ(all related Ti) [ TFIDF(Ti) * R_TiTr ]
Ti' = weight of Ti (unchanged)
In the above equation, R_TiTr is the normalized strength of the relationship in the Ontology between Ti and Tr. We normalize the weight by dividing the individual relationship strength by the total reach of the term; i.e., if Ti (Abu Sayyaf) is related to two other new terms Tr1 (Iraq) and Tr2 (Afghanistan), then R_TiTr1 = strength of the relationship between 'Abu Sayyaf' and 'Iraq' in the Ontology / total reach of the term 'Abu Sayyaf'. By total reach we mean the total strength of all the relationships (incoming and outgoing) that Ti (Abu Sayyaf) participates in. Not changing the weight of Ti is in line with our intuition that the weights of terms are affected only by terms that are in the document.
Case 2: When Ti (a term in the document) is related to a Tr (in the Ontology) and Tr is already present in the document (corroborating textual co-occurrence). Relating to the example in Figure 3, the weights of a term Ti (Abu Sayyaf) and a term Tr (Isnilon Hapilon) already in the document get boosted as a function of:
- their importance (TFIDF score) in the document (Abu Sayyaf, Isnilon Hapilon);
- the importance of the relationship between the terms in the Ontology (leader of); and
- the strength of their co-occurrence in the document (terms 5 sentences apart imply a weaker co-occurrence than terms in the same sentence).
The new weights for the terms Ti and Tr in this case are calculated using the following equations:
Tr' = TFIDF(Tr) + Σ(all related Ti) [ TFIDF(Ti) * R_TiTr + CoOccurrence_TiTr ]
Ti' = TFIDF(Ti) + Σ(all related Tr) [ TFIDF(Tr) * R_TiTr + CoOccurrence_TiTr ]
where CoOccurrence_TiTr is the co-occurrence strength between the two terms, quantified using the term offset positions that Lucene generates. Term positions refer to the relative position of the terms in a document, e.g., the term occurs in the 3rd sentence as the 5th word, and so on. It is important to note that the proportional increase of term weights in the first case (when Tr does not already exist in the document) will be smaller than the increase in the second (when Tr is already present in the document). This enforces our hypothesis that terms that are already IN the document contribute to a relatively higher increase of term weights. More importantly, the weight functions reflect our belief that Ontologies can act as an expectation of term co-occurrences in documents. In the following section, we show sample altered term vectors affected using the ranked semantic relationships and weight functions.
6. WALK THROUGH EXAMPLE For a document about the 'Abu Sayyaf' terrorist group, Figure 5 shows the syntactic (Vsyn), semantic (Vsem) and enhanced semantic (Venh-sem) term vectors.
Figure 5. Example showing the three term vectors
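The two weighting cases can be sketched as follows. This is a minimal illustration of the equations, with our own function and parameter names; the actual Rule Engine works over Lucene term positions and Ontology relationship weights.

```python
def case1_new_term_weight(ti_tfidf, rel_strength, ti_reach):
    """Case 1: Tr is not in the document (its own TFIDF contribution is zero),
    so its weight comes entirely from the in-document terms Ti it relates to.
    Tr' = sum over Ti of TFIDF(Ti) * R_TiTr, where
    R_TiTr = relationship strength / total reach of Ti in the Ontology.
    ti_tfidf, rel_strength, ti_reach are parallel dicts keyed by Ti."""
    return sum(ti_tfidf[ti] * (rel_strength[ti] / ti_reach[ti])
               for ti in ti_tfidf)

def case2_boosted_weight(t_tfidf, related):
    """Case 2: both terms occur in the document; each related pair
    contributes TFIDF(other) * R + co-occurrence strength.
    related: list of (other_tfidf, r, cooc) triples."""
    return t_tfidf + sum(o * r + c for o, r, c in related)
```

In Case 1 the weight of Ti itself is left unchanged, matching the intuition that only in-document terms can raise each other's weights.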
7. SEMANTIC DOCUMENT CLASSIFIER – SYSTEM ARCHITECTURE
This section describes the system we have implemented to evaluate our approach. Much of the operations on Ontologies covered in the previous sections, such as the weighting and ranking of semantic relationships, are performed separately at the pre-processing stage. Inputs to the classifier system include the trained categories, the documents to be classified, and the weighted semantic relationships in the Ontology. Once the syntactic (Vsyn), semantic (Vsem) and enhanced semantic (Venh-sem) term vectors are computed for all documents, the classifier system uses them to classify documents into target categories. The final component is a human-intensive stage where the results are analyzed and the weights of Ontology relationships tuned (if needed) to affect future classifications. Although our system allows us to plug-and-play classifier algorithms, for the evaluations in this paper we assess the effectiveness of the term vectors on the centroid-based document classification algorithm [20].
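The plug-and-play arrangement can be sketched as a thin harness that runs any classifier over each of the three vectors; the interfaces below are our assumptions, not the actual system's API.

```python
def run_classifier(documents, build_vectors, classifier, categories):
    """Plug-and-play harness: build the three term vectors for each
    document and classify with whichever algorithm is plugged in.
    build_vectors(text) -> (v_syn, v_sem, v_enh);
    classifier(vec, categories) -> category label."""
    results = {}
    for doc_id, text in documents.items():
        v_syn, v_sem, v_enh = build_vectors(text)
        results[doc_id] = {
            "syntactic": classifier(v_syn, categories),
            "semantic": classifier(v_sem, categories),
            "enhanced": classifier(v_enh, categories),
        }
    return results
```

Comparing the three labels per document is what lets the evaluation isolate the effect of the altered vectors from the classifier algorithm itself.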
The terms in italics and blue are some of the terms present in all three vectors and indicate the increase in weight proportional to their importance to the document. For example, the weight of the term 'Abu Sayyaf' in Vsyn and Venh-sem increased from 0.004137 to 0.255734. In Venh-sem, the terms in grey and bold are entities in the Ontology that are related to the terms in the document (but not already present in the document) by highly ranked semantic relationships. The enhanced semantic term vector shows a boost in the weights of the terms in the document and also in the weights of the new terms that were added. Some of the semantic relationships and their weights that are responsible for the enhanced semantic term vector are also shown in the figure.
8. DATASET, EVALUATION, RESULTS
One of the key requirements for evaluating this approach is the availability of a relatively complete domain Ontology. As a result of our prior work on a semantic analytics application in the Intelligence domain [2, 5], we developed a high quality metadata knowledge base [23] focused on terrorist events, organizations and people in the Middle East. We use this Ontology in our evaluations for this work.
8.1 Dataset - categories, training and testing documents
Our dataset for evaluation in this paper is primarily in the national security space. Figure 7 shows parts of the taxonomy that we classified the documents into. These categories were created by domain experts for the intelligence analytics application [4]. Numbers on top-level categories indicate the number of sub-categories. Although the classification was performed on the entire category set (60 categories), we chose a small subset (the 12 categories shown in Figure 7b) in order to analyze and explain the results clearly. The reason for choosing these categories was the significant overlap in their contents, which makes precise classification a challenge. One of our goals was to see if our approach makes a significant impact on the precise classification of contents that are hard to distinguish.
Figure 6. Semantic Document Classifier - System Architecture
Figure 6 illustrates the main processing steps in our approach and describes the system architecture. The main components of the system include:
• Lucene, a text search engine library, to index documents.
• A programmatic environment for Ontologies expressed in RDF [33] / OWL [10], e.g. Jena [27], Sesame [14], BRAHMS [24] etc. In our implementation we chose to use BRAHMS.
• A Rule Engine that constructs and alters the term vectors of a document based on the weight functions explained in Section 5.
• A Classifier algorithm plugin meant to evaluate the altered term vectors against existing classifier algorithms.
Training and testing documents: We trained all 60 of our categories with documents obtained by querying the HSDL. At this stage, because of the focused nature of the domain and the goal of this work, the training data comprised only positive examples. We hope to extend this evaluation to include both positive and negative training sets in the near future. Testing documents were obtained from the sources listed at [17]. All our test documents (on average 75 per category) were manually pre-labeled and available from our previous work in this domain. Given the use of additional information and subjective notions of ranking semantic relationships, we performed a close, human-intensive evaluation. After performing the classification, we picked a subset of the classified documents from each category to test for precision and recall. The subset was created by randomly sampling 3 sets (indicated as Set1, 2 and 3 in the results) of about 15 documents each, for all 12 categories. The complete taxonomy structure, a subset of the training and testing documents, and results of the evaluations not shown in this paper due to lack of space are available at [30].
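The sampling procedure above can be sketched as follows; the document identifiers, set sizes and fixed seed are illustrative assumptions:

```python
import random

def sample_sets(docs, n_sets=3, size=15, seed=0):
    """Draw n_sets random evaluation sets (Set1, Set2, Set3) of ~size
    documents each from the documents classified into a category."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return [rng.sample(docs, min(size, len(docs))) for _ in range(n_sets)]

# hypothetical pool of ~75 classified documents for one category
docs = [f"doc{i}" for i in range(75)]
sets = sample_sets(docs)
```

Each set is then evaluated manually for precision and recall, which is what limits the evaluation to a modest number of documents per category.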
It is easy to see that altering the weights of semantic relationships in the Ontology affects the contents of the semantic term vectors we construct, because of the nature of the weight function (Section 10 has more discussion on this topic). Below, we define the two metrics our evaluation uses.
Recall: Of all the documents that should have been classified in a category, how many of them were actually classified, given the application's ranking of semantic relationships in the Ontology.
Precision: Of all the documents classified in this category, how many of them were correctly classified, given the application's ranking of semantic relationships in the Ontology.
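Both metrics reduce to simple set arithmetic over the classified and ground-truth document sets; the document ids below are placeholders:

```python
def precision_recall(classified, relevant):
    """classified: set of docs the system put in a category;
    relevant: set of docs that truly belong to that category."""
    correct = classified & relevant
    precision = len(correct) / len(classified) if classified else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall

classified = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}
p, r = precision_recall(classified, relevant)  # p = 0.5, r = 2/3
```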
8.2 Evaluation
The core aspect of our evaluation is to measure the effectiveness of the altered term vectors. The question we are trying to answer is the following: does our intuition behind adding terms and boosting weights of terms in a term vector meaningfully amplify important terms and weed out less important ones? The comparison is therefore between the three vectors - the syntactic, semantic and enhanced semantic term vectors; specifically, the following combinations: syntactic term vector Vsyn vs. semantic term vector Vsem vs. enhanced semantic term vector Venh-sem vs. [Vsyn U Vsem] vs. [Vsyn U Venh-sem].
Why the combinations? An Ontology models entities in a domain and the relationships between them. Words like 'bombing', 'government' etc. are not common occurrences in a domain model. Using only the semantic term vector for classification (as opposed to the union of the semantic and syntactic vectors) would mean performing the classification only on the basis of the terms in the document that overlap with instances in the Ontology. In the case of a limited domain model, or the absence of Ontology instances in a document, one can imagine that the semantic term vector would be very sparse. Since the goal of this work is to augment what we already know from a document, we use a combination of the vectors. A union of the syntactic and semantic term vectors implies including all the terms present in both vectors, but replacing the weights of terms occurring in both vectors with the weights from the semantic term vector.
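The union just described can be sketched as a small merge over sparse term-weight maps; the term names and weights below are illustrative, not from the paper's dataset:

```python
def union_vectors(v_syn, v_sem):
    """Union of syntactic and semantic term vectors: keep every term from
    both; where a term appears in both, the semantic weight replaces the
    syntactic one."""
    merged = dict(v_syn)
    merged.update(v_sem)  # semantic weights win on overlap
    return merged

v_syn = {"government": 0.2, "hamas": 0.05}
v_sem = {"hamas": 0.31, "gaza": 0.12}
v = union_vectors(v_syn, v_sem)
# v == {"government": 0.2, "hamas": 0.31, "gaza": 0.12}
```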
The classification algorithm: As mentioned in earlier sections, the focus of this work is not on improving classification algorithms. Our system uses the altered term vectors as inputs to various classification algorithms; specifically, we use the centroid-based classification algorithm [20] for the evaluations. In the following section, we present the results of our approach and give details on the kinds of classification patterns we have observed.
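A minimal sketch of centroid-based classification in the spirit of [20], under the assumption of sparse dict-based term vectors and cosine similarity as the confidence measure; the category names, documents and weights are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(doc_vectors):
    """Mean vector of a category's training documents."""
    terms = {t for d in doc_vectors for t in d}
    return {t: sum(d.get(t, 0.0) for d in doc_vectors) / len(doc_vectors)
            for t in terms}

def classify(doc, centroids):
    """Assign doc to the category with the most similar centroid;
    the cosine score doubles as the confidence in the fit."""
    return max(((c, cosine(doc, v)) for c, v in centroids.items()),
               key=lambda x: x[1])

centroids = {
    "Shootings": centroid([{"shooting": 1.0, "gunman": 0.5}]),
    "Bombings": centroid([{"bombing": 1.0, "explosion": 0.5}]),
}
cat, conf = classify({"gunman": 0.8, "shooting": 0.3}, centroids)
```

A confidence of 0 means the document and centroid vectors are orthogonal; values near 1 indicate a strong fit, which is how the confidence scores discussed in the results can be read.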
Additionally, because the aim of this work is to weight more important terms highly, an accurate evaluation of our approach should not answer only the question: did the document get classified under the categories it should have been classified under? It should also attempt to answer the question: was the explanation provided for the classification accurate? Answering the latter question requires close human evaluation and helps us identify which terms (and their weights) contributed most to the classification, thereby giving us a feel for how the weight functions performed. Although such a close examination of results helps identify when the approach presented in this paper works, it also places a limitation on the number of classification results one can manually evaluate.
What are we measuring? Our metrics of evaluation are the traditional notions of precision and recall. However, since information external to the document is used in an 'application / task / user' sensitive manner, the correctness of the classification also becomes a subjective issue. What is a satisfactory classification for an application setting that has weighted ontological semantic relationships a certain way might be completely unacceptable for other classification settings. The importance of relationships between terms therefore becomes an additional, independent and tunable component that affects the precision and recall metrics.
Figure 7. Categories and evaluation
8.3 Results
For ease of evaluation, we restricted ourselves to the 12 categories shown in Figure 7b. We start by presenting some overall statistics and then discuss some of the success and failure patterns we observed in correlation with the results of the classification. Average recall and precision values for 8 categories using all five vector combinations are shown in Figure 8 and Figure 9 (we did not show results for all 12 categories for the sake of clarity of the graphs).
Figure 8. Overall Recall Values for 8 categories
Key observations (see Figure 8, Figure 9):
• Using only the semantic (Vsem) and enhanced semantic (Venh-sem) vectors was not favorable in terms of recall or precision results.
• Using the document and semantic combination vectors (i.e. Vsyn U Vsem and Vsyn U Venh-sem) always did better in terms of precision and recall than the base syntactic vector.
• Confidence in accurate classifications goes up with the use of external knowledge.
• Increased recall with the use of additional background knowledge was expected but should be kept in check; high recall scenarios are not always favorable.
Figure 9. Overall Precision Values for 8 categories
As a result of looking more closely at some categories in order to understand the above results better, we discovered interesting patterns showing when the use of this approach added value and when it did not.
Observation 1: It is possible for a domain Ontology to have nothing to do with the classification. The goal is to do no worse than the base method whether the Ontology is relevant or irrelevant. Our evaluation indicated that quite a few documents had minimal or no overlapping terms with the Ontology instances, mostly because of an incomplete domain model (clearly, further investment in extending the Ontology knowledge base can address this issue to some extent). The result was therefore a sparse semantic and enhanced semantic term vector. In most cases, however, there was an overlap in the universe of terms of these two worlds. All the results (see Figure 8, Figure 9) supported our belief that the altered term vectors in combination with the base syntactic term vector would always do better than the base vector alone. The reason for this belief is obvious from the kind of positive enhancements we make to the syntactic term vector. In every case, the use of a domain Ontology along with the document contents contributed to higher precision, recall and confidence in the classification; see the graphs for the Vsyn vs. (Vsyn U Vsem) vs. (Vsyn U Venh-sem) points.
Observation 2: Confidence in a classification, i.e. the fit, goes up with the use of external information. The evidence for the classification is also more meaningful. The confidence in a classification is directly proportional to the common representative terms between the document and category vectors, and the weights assigned to those terms. For example, when using the cosine dot product similarity of two vectors, a confidence of 0 means the two vectors are orthogonal, while a confidence closer to 1 indicates a high match between the vectors.
In almost all cases when all five vectors classified a document correctly, the confidence in the fit was highest for the enhanced semantic vector. This is in line with our intuition that if one identified semantically or contextually important terms in a document in addition to the statistically important terms, and gave them higher weights, the confidence in the particular classification would be higher. Figure 10 shows the relative confidence scores of 10 random documents from the classification results of the "Hizbollah" terrorist organization category. It should be easy to see that, given Observation 1, the confidence in the classification for the combined vectors (Vsyn U Vsem and Vsyn U Venh-sem) was also higher than that of the base syntactic vector.
Figure 10. Ontologies improve confidence in classification
Closely aligned to this observation is the fact that the explanations generated for a classification (i.e. the most highly weighted terms in the document term vectors) were also more meaningful in the case of the semantic vectors. Additionally, our approach gives us the option of presenting classification evidence in a more intuitive fashion, beyond just the important terms in a document. Figure 11 shows an example of evidence for a classification, presenting not only one of the most important terms in a document, but also the semantic relatedness between terms that was used to identify the term's importance in the document.
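The kind of evidence presentation described above can be sketched as follows; the provenance map, term names and weights are hypothetical illustrations, not the system's actual data structures:

```python
def evidence(v_enh, provenance, k=3):
    """Report the top-k weighted terms of an enhanced semantic vector,
    attaching the named ontology relationship that justified any term
    that was added or boosted via the ontology.
    provenance: term -> human-readable relationship string."""
    top = sorted(v_enh.items(), key=lambda x: x[1], reverse=True)[:k]
    return [(t, w, provenance.get(t, "appears in document")) for t, w in top]

v_enh = {"abu sayyaf": 0.26, "janjalani": 0.09, "manila": 0.01}
prov = {"janjalani": "leader_of -> Abu Sayyaf"}  # hypothetical named relationship
for term, weight, why in evidence(v_enh, prov, k=2):
    print(term, round(weight, 2), "|", why)
```

Pairing each heavily weighted term with the relationship that earned it its weight is what makes the evidence more intuitive than a bare list of important terms.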
Figure 11. Presenting evidence for a classification
Observation 3: The need to look more closely at precision and recall numbers. The overall results of our evaluations generally indicate an improvement in precision and recall when using the combination vectors (see Figure 8, Figure 9). The individual semantic vectors always did poorly in terms of recall, while the combination of syntactic and semantic vectors increased recall in almost every case. Figure 12 shows such recall trends for the 'Abu Sayyaf' category. Although an increase in recall is generally considered unfavorable, given the additional knowledge one has used, it is very possible that the increase in the number of documents classified in the category is actually accurate. Evaluating the documents that contributed to the increase in the recall numbers is, however, more challenging. It is also easy to see that tuning the weights on semantic relationships will affect the recall metrics. The more information one includes in a term vector, the higher the risk of a wide recall scenario. One of our near-future plans is to evaluate precision for constant recall values at different thresholds.
Ontologies help improve the precision of a classification: Our dataset for evaluation intentionally considered several categories that had minor characteristic differences. For example, contents in the 'Shootings' and 'Bombings' categories have many similar predictor variables or terms, which makes classifying documents between the categories a challenge. Syntactic term vectors that rely solely on document and dictionary contents can rarely classify such a document precisely as falling in one category or the other. Figure 13, for the 'Hizballah shootings' category, aligns with our goal of boosting terms highly relevant to the document, thereby increasing the precision in classifying such usually hard-to-classify content. In every case the combination of the syntactic and semantic term vectors improves the classification precision.
Figure 12. Ontologies and documents improve recall of a classification
Figure 13. Ontologies and documents help in the precision of a classification
9. ISSUES, CHALLENGES & EXPERIENCES
Although this work is a first important step towards understanding how useful semantic domain knowledge is to the task of document classification, there are several issues that need to be addressed before one can see a Web-scale deployment of such solutions. In this section, we present some of the challenges we faced, our experiences and plans for future work.
Domain-specific document classifications might benefit the most from this approach: Given the need for domain-level metadata created independent of a document corpus, it is easy to see that bringing in more information while treating a document for very high-level web directory classifications might not yield a significant improvement in the precision of classification. Domains that are characterized by complex relationships between their entities, for example the bio-medical space, are best suited for tapping external knowledge to unearth the implicit relationships between terms.
Avoiding high recall scenarios: Entities found in a reasonably good domain model are almost always inter-connected. In expanding a term vector by bringing in connected / related entities, there is a risk of introducing more than what is needed and casting a high recall scenario for the task at hand. To give a simple example, expanding a search query term to add more than what is most relevant will result in the user being presented with more documents than he otherwise would have seen. Although the relatedness between a term and what is added is important and still very subjective, it is important to set a threshold for how many important entities one will include. In our work, given a certain degree of domain expertise, we chose to include in the term vector the top six entities most strongly connected to the term under consideration. We suspect, however, that reaching this magic number is not a simple process and might require a thorough analysis of different approaches to suit the application.
Generating multiple feature vectors for documents and categories: In this work, we limited ourselves to generating three different feature vectors or term vectors: the syntactic, semantic and enhanced semantic feature vectors. It is conceivable that one alters the ontological relationship weights and comes up with yet another feature vector to perform the classification. Another thought that crossed our mind was to generate the same three term vectors for the 60 categories before the classification process. Unfortunately, categories typically house a large number of documents and generate several predictor variables. Although it is almost impractical to handle such large term vectors on a Web or large application scale, we hope to study document classification against smaller-scale semantic category vectors in the near future.
10. RELATED WORK AND CONCLUDING REMARKS
In this work we presented a vector-based approach to utilizing additional domain knowledge for the task of supervised document classification. Models based on extending vector space models are not new; some of them include the Generalized Vector Space Model [9], the Topic-based Vector Space Model [11], Latent Semantic Analysis [38] etc. All of these models have proven quite effective for the task of classification. They mainly rely on the explicit co-occurrence of terms and other lexical and morphological normalizations of term vectors. More recently, with the inception of semantic domain models, there have been efforts to harness the information in vocabularies like WordNet [28] to affect the term vectors for text clustering [22] and web document classification [26]. Other approaches exploited the structural subclass and superclass information in Ontologies to modify term vectors for classification and clustering [31, 12]. The main difference between the approaches above and ours is that we harness the semantic named relationships between terms that Ontologies model. The intuition behind our work was to alter term vectors by strengthening the discriminative terms in a document in proportion to how related they are to other terms in the document (where relatedness includes all possible relationships modeled in an Ontology). A side effect of this process was the weeding out of the less important terms. Since Ontologies model domain knowledge independent of a corpus, there is also the possibility of introducing terms into the term vector that are highly related to the document but are not explicitly present in it. Our model for enhancing term vectors was therefore based on a combination of statistical information and semantic domain knowledge. We also presented an implementation of this idea and evaluated the effectiveness of the altered vectors by classifying documents over 60 categories in the national security domain.
Preliminary results indicate that in general, use of background knowledge from Ontologies to affect term vectors improved the precision, recall and confidence metrics of a classification. It is important to note that since the use of domain ontologies brings in an additional subjective notion of what relationships are important for the task at hand; increase in recall cannot always be dismissed as unfavorable. Since the goal of this work was to accurately assess
the value of this approach, our evaluations were rigorous and manual, and therefore on a relatively small scale. Future work includes evaluating this approach on larger benchmark datasets, subject to the availability of an Ontology in the domain.
11. ACKNOWLEDGMENTS
The majority of this effort was carried out at HP Labs in summer 2006. Partial support for subsequent work was funded by NSF-ITR-IDM Award #0325464, titled 'SemDIS: Discovering Complex Relationships in the Semantic Web'.
12. REFERENCES
[1] Aleman-Meza, B. et al., Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection. 2006.
[2] Aleman-Meza, B. et al., An Ontological Approach to the Document Access Problem of Insider Threat. IEEE International Conference on Intelligence and Security Informatics (ISI-2005), Atlanta, Georgia, USA, 2005.
[3] Aleman-Meza, B., Halaschek-Wiener, C., Arpinar, I. B., Ramakrishnan, C. and Sheth, A. P., Ranking Complex Relationships on the Semantic Web. IEEE Internet Computing, 9 (3), 2005, 37-44.
[4] Aleman-Meza, B. et al., Semantic Analytics in Intelligence: Applying Semantic Association Discovery to Determine Relevance of Heterogeneous Documents. In Keng L. Siau, ed., Advanced Topics in Database Research, Idea Group Publishing, 2006 (In Print).
[5] Anderson, R. and Brackney, R., Understanding the Insider Threat. RAND Corporation, Rockville, MD, USA, 2004.
[6] Anyanwu, K., Sheth, A. P. and Maduko, A., SemRank: Ranking Complex Relationship Search Results on the Semantic Web. 14th International World Wide Web Conference, Chiba, Japan, 2005, 117-127.
[7] Apache Lucene. http://lucene.apache.org/java/docs/index.html
[8] Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information Retrieval. Addison-Wesley, 1999.
[9] Baeza-Yates, R. and Ribeiro-Neto, B., Modern Information Retrieval. Addison-Wesley Longman Publishing Company, 1999.
[10] Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D. L., Patel-Schneider, P. F. and Stein, L. A., OWL Web Ontology Language Reference. W3C Proposed Recommendation. http://www.w3.org/TR/owl-ref/
[11] Becker, J. and Kuropka, D., Topic-based vector space model. Business Information Systems, 2003.
[12] Breaux, T. D. and Reed, J. W., Using Ontology in Hierarchical Information Clustering. International Conference on System Sciences, 2005.
[13] Brin, S. and Page, L., The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 1998.
[14] Broekstra, J., Kampman, A. and van Harmelen, F., Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. International Semantic Web Conference, Sardinia, Italy, 2002.
[15] Cavnar, W. B. and Trenkle, J. M., N-Gram-Based Text Categorization. In Proceedings of SDAIR, 1994.
[16] Cosine Similarity and Term Weights. http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
[17] Data sources. http://lsdis.cs.uga.edu/semdis/DocumentClassification.html
[18] Gruber, T., What is an Ontology? http://www-ksl.stanford.edu/kst/what-is-an-Ontology.html
[19] Halaschek, C., Aleman-Meza, B., Arpinar, I. B. and Sheth, A. P., Discovering and Ranking Semantic Associations over a Large RDF Metabase. VLDB, Canada, 2004.
[20] Han, E.-H. and Karypis, G., Centroid-Based Document Classification: Analysis and Experimental Results. Principles of Data Mining and Knowledge Discovery, 2000.
[21] Homeland Security Digital Library. https://www.hsdl.org/
[22] Hotho, A., Staab, S. and Stumme, G., Ontologies Improve Text Document Clustering. Third IEEE International Conference on Data Mining, 2003.
[23] Insider threat Ontology schema. http://lsdis.cs.uga.edu/~aleman/data/2005-0827__insider_schema.rdf
[24] Janik, M. and Kochut, K., BRAHMS: A WorkBench RDF Store and High Performance Memory System for Semantic Association Discovery. 4th International Semantic Web Conference, Galway, Ireland, 2005, Springer.
[25] Jena. http://www.hpl.hp.com/semweb/jena.htm
[26] De Luca, E. W. and Nürnberger, A., Ontology-Based Semantic Online Classification of Documents: Supporting Users in Searching the Web. European Symposium on Intelligent Technologies, 2004.
[27] McBride, B., Jena: A Semantic Web Toolkit. IEEE Internet Computing, 2002.
[28] Miller, G. A., WordNet: A Lexical Database for English. Communications of the ACM, 38 (11), 1995, 39-41.
[29] Mladenic, D. and Grobelnik, M., Feature selection for classification based on text hierarchy. Conf. on Automated Learning and Discovery, 1998.
[30] Document Classification using additional domain knowledge. http://lsdis.cs.uga.edu/semdis/DocumentClassification.html
[31] Prabowo, R., Jackson, M., Burden, P. and Knoell, H., Ontology-Based Automatic Classification for the Web Pages: Design, Implementation and Evaluation. International Conference on Web Information Systems Engineering, 2002.
[32] Raghavan, V. V. and Wong, S. K. M., A critical analysis of vector space model for information retrieval. Journal of the American Society for Information Science, 37 (5), 1986, 279-287.
[33] RDF. http://www.w3.org/RDF/
[34] Salton, G. and Buckley, C., Term Weighting Approaches in Automatic Text Retrieval. Technical Report TR87-881, Cornell University, 1987.
[35] Salton, G., Wong, A. and Yang, C. S., A Vector Space Model for Automatic Indexing. Communications of the ACM, 18 (11), 1975, 613-620.
[36] Scott, S. and Matwin, S., Text Classification Using WordNet Hypernyms. Use of WordNet in Natural Language Processing Systems, 1998.
[37] Sheth, A. P. et al., Semantic Association Identification and Knowledge Discovery for National Security Applications. Journal of Database Management, 16 (1), 2005, 33-53.
[38] Sun, J., Chen, Z., Zeng, H., Lu, Y., Shi, C. and Ma, W., Supervised Latent Semantic Indexing for Document Categorization. International Conference on Data Mining (ICDM'04), 2004.
[39] Zhao, Y. and Karypis, G., Criterion Functions for Document Clustering: Experiments and Analysis. Machine Learning, 2003.