Discovering Concepts in Raw Text: Building Semantic Relationship Graphs

Boris Gelfand, Marilyn Wulfekuhler, William Punch

Genetic Algorithms Research and Applications Group, the GARAGe
A714 Wells Hall, Michigan State University, E. Lansing MI 48824
{bg, wulfekuh, [email protected]
http://isl.cps.msu.edu/GA

Abstract

We describe a system for extracting concepts from unstructured text. We do this by clustering document words and then assembling a structure which relates these words semantically. The clustering process identifies words which co-occur across a set of documents and creates groups of words which suggest a semantic context common across the document set. This context is formalized by identifying semantic relationships between the cluster words using a lexical database to build a Semantic Relationship Graph (SRG). This SRG is a directed graph which conveys a robust representation of the sub- and super-class relationships between the correct word senses. We show how this process can be applied to a user-selected set of HTML documents; the SRGs can subsequently aid in searching the World Wide Web for documents which are similar.


1 Introduction

Mining textual information presents challenges over data mining of relational or transaction databases because there are no predefined fields or features and no standard formats. However, mining information from unstructured textual resources such as the World Wide Web has great potential payoff and has received much recent attention [1]. One approach for effectively mining relevant information from raw text is based on finding common "themes" or "concepts" in a set of documents. Even if we have pre-defined keywords available (given by the user, for example), keywords do not convey the context of their usage, and do not adequately encompass the concept of interest. To find relevant information, we must be able to capture conceptual ideas independently, based only on some examples of what a user has already labeled as relevant. In this paper we describe a method for extracting some potential key features from raw text which are then linked together in a structure which represents the semantic content. We call this structure a Semantic Relationship Graph (SRG). No category information about the documents is required, and there is no a priori identification of keywords. Once constructed, the SRG relates words according to the appropriate word sense used in the documents, leading to a coherent, cohesive structure of a single idea.

2 Problem Description

Documents are grouped together into categories (for example, World Wide Web documents are grouped in a user's bookmark file) according to an individual's preferences; there is no absolute ground truth of classification except what is defined by a particular user. These documents may be grouped into subsets explicitly in a hierarchy (as in a bookmark file), or only implicitly in the user's mind. The criteria that an individual uses to classify documents are related to some underlying semantic concepts which are typically not directly measurable as features of the documents. However, the words that compose the documents are measurable, and we use them as the raw features of the documents. Our goal is to automatically discover some higher level features, some function of the raw features, that represent the underlying semantic concepts which led to the user's particular classification and interest in this document set. Clues that will help lead to discovery of these semantic concepts are in the user's chosen set of examples. We first use a statistical clustering technique to get the initial groups of semantically related features, then incorporate a lexical database to create the Semantic Relationship Graphs in order to determine the linkage between key terms. New documents can then be evaluated for relevance based on how well they fit the SRG.

3 Statistical Processing

We use statistical pattern recognition techniques to identify candidate keywords from among all the words in the document set. It is natural to think of the documents as patterns and the words contained in them as features. However, since we have a very small training set of examples compared to the number of raw features (typically less than 100 documents, with thousands of unique words), trying to group the documents based on some selection of words does not work well. Instead, since the features (words) are correlated, we group the words in the document space to obtain sets of semantically related words which can later be used to build an SRG.

3.1 Preprocessing

For World Wide Web documents, we begin with a list of documents that the user has designated as "interesting" in some way. This can be in the form of a bookmark file from a browser, or simply a list of URLs (Uniform Resource Locators). We retrieve and store the documents locally; these form our document training set. We then parse each document for words, where a word is defined as a contiguous string of letters. We ignore HTML tags, digits, punctuation, and common English words (such as the, of, and, etc.) from a pre-defined stop list. Our stop list is the one given by Fox [2]. Note that we do not stem the words. All unique words from the entire training set of documents form the global feature space.
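To make this preprocessing step concrete, the sketch below shows one way it could be implemented in Python. It is an illustration rather than the system's actual code: the stop list here is a tiny placeholder for the full list from Fox [2], and the letter-run tokenizer is simply a direct reading of the description above.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop list; the paper uses the full list given by Fox [2].
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "on", "that"}

class TextExtractor(HTMLParser):
    """Collects text content while ignoring HTML tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def words_of(html):
    """Return the content words of one document: contiguous strings of
    letters, lower-cased, with stop words removed (no stemming)."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    tokens = re.findall(r"[A-Za-z]+", text)   # letters only: drops digits and punctuation
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

def global_vocabulary(documents):
    """All unique words across the training set form the global feature space."""
    vocab = set()
    for html in documents:
        vocab.update(words_of(html))
    return sorted(vocab)
```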

3.2 Feature Clustering

We form an n × d pattern matrix M, where n is the number of documents and d is the number of words in the document set. In our previous work reported in [3], the value M_{i,j} represented the number of times word j occurred in document i. Using occurrence counts as values without normalization has the effect of lending more emphasis to longer documents when computing distance between vectors. However, a pure normalization for document length loses important contextual information. A full discussion of normalization issues can be found in [3]. Instead of the actual count of word occurrences, we currently use a modified ordinal number ranking system within a single document where 0 values and ties are allowed. We have found that this mitigates the problem of long documents getting more emphasis, yet still retains the important relationships between occurrences of words which make unusual contexts stand out from common usages. To group the features, we used Hartigan's K-means partitional clustering algorithm [4] as implemented by S-PLUS (a registered trademark of Statistical Sciences, Inc.), where K points are chosen randomly as the means of K groups in the n-dimensional space. Each of the d column vectors is then assigned to the group whose mean is closest in Euclidean distance. After each point is assigned, the group means are recomputed, and the process repeats until either there is no change, or after a fixed number of iterations. The K-means process is not guaranteed to find a global optimum grouping, so we run it several times with different random seeds in order to increase our confidence in the resulting clusters. We choose a value of K = 2(c + 1), where c is the number of original document categories.
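The paper's clustering was carried out with Hartigan's K-means in S-PLUS. Purely as an illustration, the sketch below approximates the same procedure in Python with scikit-learn's KMeans (which implements Lloyd-style K-means rather than Hartigan's variant) and scipy's rankdata. The exact form of the modified ordinal ranking is our reading of the description above (absent words keep the value 0 and tied counts share a rank), so treat it as an assumption.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.cluster import KMeans

def pattern_matrix(documents, vocab):
    """n x d matrix of raw word counts; documents is one token list per
    document (e.g. the output of words_of from the preprocessing sketch)."""
    index = {w: j for j, w in enumerate(vocab)}
    M = np.zeros((len(documents), len(vocab)))
    for i, doc_words in enumerate(documents):
        for w in doc_words:
            if w in index:
                M[i, index[w]] += 1
    return M

def rank_transform(M):
    """Replace counts with within-document ordinal ranks; words that do not
    occur keep the value 0 and tied counts share a rank (one reading of the
    modified ranking scheme described above)."""
    R = np.zeros_like(M)
    for i in range(M.shape[0]):
        present = M[i] > 0
        R[i, present] = rankdata(M[i, present], method="dense")
    return R

def cluster_words(M, n_categories, runs=10, seed=0):
    """Cluster the d column vectors (words) in document space with K = 2(c + 1),
    keeping the best of several random restarts as in the paper."""
    K = 2 * (n_categories + 1)
    R = rank_transform(M)
    best = None
    for r in range(runs):
        km = KMeans(n_clusters=K, n_init=1, random_state=seed + r).fit(R.T)
        if best is None or km.inertia_ < best.inertia_:
            best = km
    return best.labels_   # cluster label for each word (column)
```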

3.3 Feature Clustering Results

Our sample problem comes from the manufacturing domain, with web documents from the Network for Excellence in Manufacturing (NEM Online, http://web.miep.org:80/miep/index.html). The sample data set consists of 85 web documents which were classified by human experts into the four categories labor, legal, government, and design. These 85 documents contain a total of 7633 unique words, so the clustering algorithm has n = 85, d = 7633, and c = 4. We ran the algorithm 10 times with different initial random seeds, and results were consistent. One of the useful results was that there was always a single group which contained an average of 80% of the words. These are the features which are useless for discriminating among the documents. Discarding this largest group, we are left with smaller, more manageable groups to consider. The sizes of all 10 clusters for a typical run were 5746, 988, 40, 72, 106, 94, 55, 80, 34, 418. The smallest clusters are particularly interesting in that they contain words which to a human seem to be semantically related. The 3 smallest clusters in the above clustering are shown in Table 1.

Cluster 1: asic based cad cadmazing circuit company connx consultants consulting custom customer customers design designs development eda electronic experience expertise firms hardware ic implementation inc independent management network planning printed process project projects quality service services software solutions systems technology tools

Cluster 2: act active additive additives adulterated agency bear body cause cfr color conditions considered consumers container containing contaminated contamination cosmetic cosmetics drug drugs dye dyes except externally eye fd fda federal hair human including information ingredient ingredients intended label labeling manufacturing materials misbranded nail panel prevent product products red regulation regulations safe safety skin statement warning

Cluster 3: action affirmative america american americans applause believe business chance college discrimination economic education employment equal government job jobs lives loans middle minorities opportunity people percent president programs qualified re rights set white women wrong

Table 1: Three sample word clusters

The word clusters by themselves do not identify characteristics of the original document categories (labor, legal, government, design), but they do suggest semantic concepts within those categories (for example, discrimination and affirmative action, dyes and additives in cosmetics and drugs). We can now more fully define the relationships between the words in the clusters to better reflect the semantic content of the documents.

4 Acquiring Semantic Meaning

So far, we have discussed how to identify a cluster of words which corresponds to the relevant informational content of a user-defined set of documents. This word set is the output of the feature clustering process described in the previous sections. The next step in our process is to identify the underlying conceptual information or structure of the word list in order to express the ideas in the documents in a computer-understandable form. First, we note that a given word list as output by the clustering algorithm will usually consist of closely related words. This is understandable because the documents which are used to generate these words have been chosen for a certain purpose - their "commonality" as determined by the user. While a human typically has the cultural and linguistic experience to comprehend such a word list, a computer requires a considerable amount of pre-defined knowledge to perform the same task. The knowledge base that we have selected to use is WordNet (http://www.princeton.edu/~wn), a lexical database consisting of cross-referenced sub- and super-class relationships for most common nouns and verbs [5]. Sub- and super-class relationships are a natural way of relating concepts, and provide for much greater flexibility than a dictionary or synonym-based thesaurus. WordNet also provides different senses of a word. For example, the word "club" has four distinct senses: as a suit in cards, as a weapon, as a social group, and as a golf tool. Using this (or a similar) knowledge base, we can construct a Semantic Relationship Graph (SRG) of the word list and connect each word to other words in the list, adding augmenting words not in the original list when necessary for a minimally connected graph. A word is related to another word by either a sub- or super-class relationship if and only if there is a directed edge between the two vertices representing the words in the SRG. The direction of the edge is determined by the type of relationship.
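As a small illustration of this kind of lookup (not the original system's code), the snippet below uses NLTK's WordNet interface, a convenient stand-in for direct access to the WordNet database, to list the noun senses of "club" and the hypernyms and hyponyms of one of those senses.

```python
from nltk.corpus import wordnet as wn   # requires nltk and its 'wordnet' corpus

# Each synset returned is one distinct sense of the word (for "club",
# senses such as the social group, the weapon, the card suit, the golf tool).
for synset in wn.synsets("club", pos=wn.NOUN):
    print(synset.name(), "-", synset.definition())

# Hypernyms (super-classes) and hyponyms (sub-classes) of the first sense:
sense = wn.synsets("club", pos=wn.NOUN)[0]
print("hypernyms:", [s.name() for s in sense.hypernyms()])
print("hyponyms: ", [s.name() for s in sense.hyponyms()])
```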

4.1 Building the SRG

Building an SRG starts by examining each word in the initial cluster (called core words) and determining which other core words are directly related to it. We then examine all of the words which occur in a single lookup of each core word, and recursively search further and further out along each of the words acquired in successive lookups for the other core words until a threshold depth is reached. We keep track of the parents of each word, and thus know all the paths from each core word to other core words for every depth. We also keep track of the sense of the word and require that all edges coming to or from each word refer to the same sense of that word. Any words that link core words, even if they are not core words themselves, are important augmenting words because by linking the core words, they form a part of the underlying concept which would otherwise be incomplete. These augmenting words are then added to the graph at each search depth level, creating a structure which tries to connect all of the original core words either directly or through a set of augmenting words. Once a certain iteration depth is reached, words not connected to any other word are thrown out as outliers. For example, the words "club" and "diamond" may be related by the intermediate word "suit" ("suit" is a generalization of both "club" and "diamond"), and so "suit" would be added to the graph in order to connect "club" and "diamond". If the word "tuxedo" also occurred in the core word set, it may have a connection to the word "suit", but since the senses of "suit" are different, "tuxedo" would not be related in the SRG to "club" or "diamond" via "suit".

The following is an outline of the SRG building algorithm (a short code sketch is given at the end of this subsection). Starting with an empty SRG, we do the following for each core word:

1. Set depth = 0 and set wordlookuplist to be the core word.
2. Look up hypernyms and hyponyms of all words in wordlookuplist, keeping track of the sense and parent of each word.
3. If we hit another core word, add all components of the path (both vertices and edges) which are not already in the graph to the graph.
4. Set wordlookuplist to be all of the new words which occurred in the lookup of the hypernyms and hyponyms of the old wordlookuplist.
5. Increment the depth and, if below the cutoff, go back to step (2).

In this way, we arrive at a structure which relates the words to each other and robustly characterizes the documents that the words were extracted from. Concepts that were missing from the word list are filled in, words only peripherally related are excluded, appropriate word senses are identified, and we have a relatively complete structure that encompasses and gives order to a conceptualization of what the documents in question are really concerned with. This process is and must be robust since we cannot assume perfect clustering of the initial word set nor can we assume a perfectly complete lexical database. There will typically be a number of redundant paths between two connected words, which we use to our advantage. If one path is not found due to a missing entry in the lexical database, the system can gracefully compensate for this: it will likely find several other paths to link the words. In fact, words that are linked through only one path are likely to not be very strongly related. A hierarchical tree approach to this problem, for example, has exactly this weak point; the tree branching is too dependent on the completeness of the knowledge base.
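The sketch referred to above is given below: a minimal reading of the five-step loop, using NLTK WordNet synsets as the lexical database and a networkx directed graph for the SRG. The function names, the breadth-first bookkeeping, and the edge orientation (from the more specific sense towards the more general one) are our own choices for illustration; the original system's handling of senses and parents is richer than what is shown here.

```python
import networkx as nx
from nltk.corpus import wordnet as wn

def neighbors(synset):
    """Directed sub-/super-class neighbours of one word sense."""
    return [(h, "hypernym") for h in synset.hypernyms()] + \
           [(h, "hyponym") for h in synset.hyponyms()]

def add_path(srg, end, parents):
    """Walk back through the parent pointers, adding each edge oriented from
    the more specific sense towards the more general one."""
    node = end
    while parents[node] is not None:
        parent, rel = parents[node]
        if rel == "hypernym":        # node is a super-class of parent
            srg.add_edge(parent, node)
        else:                        # node is a sub-class of parent
            srg.add_edge(node, parent)
        node = parent

def build_srg(core_words, max_depth=2):
    """Breadth-first search outward from every core word over WordNet
    hypernym/hyponym links; whenever another core word is reached, the whole
    path (vertices and edges) is added to the graph.  Senses stay distinct
    because the graph nodes are synsets, not plain strings."""
    core_synsets = {s for w in core_words for s in wn.synsets(w)}
    srg = nx.DiGraph()
    for start in core_synsets:
        parents = {start: None}      # synset -> (parent synset, relation)
        frontier = [start]
        for depth in range(max_depth):
            next_frontier = []
            for node in frontier:
                for nbr, rel in neighbors(node):
                    if nbr in parents:
                        continue
                    parents[nbr] = (node, rel)
                    next_frontier.append(nbr)
                    if nbr in core_synsets:   # path between two core words found
                        add_path(srg, nbr, parents)
            frontier = next_frontier
    # Core words for which no path to another core word was found never enter
    # the graph, which corresponds to discarding them as outliers.
    return srg

# Example usage: srg = build_srg(["club", "diamond", "tuxedo"], max_depth=2)
```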

4.2 Example SRGs

hamburger knife club creature folder holmium hodgepodge sentence bunny chorus musculature track trend stoneware phrase scan spawn can metal apple government person coffee hop dropout bubble screen hat volley siren sulphur paper

Table 2: Random Word Set from /usr/dict/words

We have illustrated the SRG building process by showing the SRG on cluster 3 from Table 1, which consists of 32 words, and on a set of 32 words selected randomly from a Unix system's /usr/dict/words file. The random words are listed in Table 2.

[Figure 1: SRG of depth 1 on cluster 3 words from Table 1. Node labels in the figure: program, action, president, employment, job, women, business, college, white, people.]

[Figure 2: SRG of depth 1 on random words from Table 2. Node labels in the figure: person, creature, metal, holmium, can, coffee.]

Figure 1 shows an SRG of depth = 1 on the cluster 3 words. At this depth, we are only connecting words in the original set which are separated by a single lookup. Recall that the document set described previously is relevant to manufacturing business, and note that the depth = 1 SRG shows the words connected around a central business theme. Figure 2 shows an SRG of depth = 1 of the random word set. It is clear that there is very little commonality between the words; the only connections are three disjoint pairs of words. Figure 3 shows a partial SRG at depth = 2. Not all depth = 2 core and augmenting words are shown for the sake of clarity. The complete graph at depth = 2 links 26 of the 32 words using at most one augmenting word per path. The words re, set, affirmative, believe, applause, and wrong were eliminated as outliers at this depth. In many cases, a single augmenting word connects multiple other words - another sign of how closely tied the words are. For example, note that when "discrimination" is added, a large number of the concepts become tied together. This may suggest that the word "discrimination" is a key sub-idea within the overall SRG.

[Figure 3: SRG of depth 2 on words from Cluster 3. Node labels in the figure: agenda, program, action, process, president, employment, train, job, business, college, education, class, legislation, women, empire, people, discrimination, white, government, control, racism, wasp, middle.]

5 Application to Mining the World Wide Web

Now that we have described constructing an SRG from example WWW pages, we can use this SRG to search for other documents similar to the concepts found in the original document set. Current web searching is based on traditional information retrieval techniques and is typically based on Boolean operations of keywords. Major web search services such as Lycos [6], Alta Vista [7], WebCrawler [8], and Open Text [9] employ software robots (also called spiders) which traverse the web and create an index based on the full text of the documents they find. A user submits a keyword query to one of these services, and receives the location of any document found in the index which matches the query. Different services vary in their methods of constructing the index, which is why identical queries to different search engines produce different results. Query format is also different for various services. If a user is skilled enough to formulate an appropriate query, most search engines will retrieve pages with adequate recall (the percent of the relevant pages retrieved among all possible relevant pages), but with poor precision (the ratio of relevant pages to the total number of pages retrieved).

Most users are not experts in information retrieval and cannot formulate queries which narrow the search to the context they have in mind. Furthermore, typical users are not aware of the way the search engine designers have constructed their document indexes. Indexes are constructed to be general and applicable to all; they are not tailored to an individual's preferences and needs. They necessarily include all senses of a key search term's usage, which often results in irrelevant information being presented.

Instead of the traditional information retrieval approach, we search for other documents in a large search space which match the original set of documents as described by their word set and resulting SRG. This matching can now take place on a conceptual level based on our understanding of the documents' concepts. Thus, we need to determine if a given document conceptually matches a list of words which represents a group of interesting or desirable documents. For example, we can take a user's HTML document collection and discover features which will lead us to other relevant web pages whose content is "similar" to those which have already been seen. We do this without any specification of keywords from the user; the only input required is example documents which are of interest. First, we apply our statistical clustering algorithm and then build an SRG from the output set of words. We will now be able to examine a new document and see how well it matches the SRG. We can acquire these documents through a conventional keyword search by searching on the disjunction of all words in the SRG. Using the SRG, we can see how words from a new document fit into this structure. We can also see how the words in the new document which are close in meaning to the original SRG word set relate to each other.

We can do the comparison as follows. Suppose the initial SRG is S. We then create S' from S by doing a low-depth add of all the words in the new document. Even though all words in the new document are potential candidates for inclusion in S', only words that are closely related to S (according to the search depth level threshold) will be added. We then remove S from S' to yield a structure which consists of just the words in the new document that relate to S. We can then perform various graph property tests in order to determine how cohesive S' - S is on its own. One such metric is as follows: for each pair of words in S' - S which are connected in S', find the number of vertices which are missing along this same path in S' - S. Compute the total of all such path values, and normalize by the number of paths. This gives us a rough estimate of how well the new document spans the original SRG. We are presently experimenting with these and other approaches that compare an SRG and a document to determine the best way to evaluate a match.
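Purely as a sketch of this metric (not the authors' implementation), the function below computes the normalized missing-vertex count with networkx. It assumes S and S' are already built as graphs, treats "the same path" as a shortest path in S', and takes S' - S as a node difference; all of these are our readings of the description above.

```python
import networkx as nx
from itertools import combinations

def span_score(S, S_prime):
    """Normalized count of vertices missing from S' - S along the paths that
    connect pairs of new-document words in S'.  S is the original SRG and
    S_prime is S after a low-depth add of the new document's words."""
    new_nodes = set(S_prime.nodes()) - set(S.nodes())   # words contributed by the new document
    G = S_prime.to_undirected()
    total, n_paths = 0, 0
    for a, b in combinations(new_nodes, 2):
        if nx.has_path(G, a, b):
            path = nx.shortest_path(G, a, b)
            total += sum(1 for v in path if v not in new_nodes)
            n_paths += 1
    return total / n_paths if n_paths else float("inf")
```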

6 Conclusions

We have shown that statistical word clustering of a document collection can define a set of semantically related words. Using a set of such grouped words together with the lexical database, we can construct a Semantic Relationship Graph. This SRG directly represents the semantic relationship between the clustered words, either by a direct connection or by the addition of key augmenting words. The approach is effective since an SRG built from a set of random words proves to be very disconnected, while an SRG built from the clustered words is cohesive even at low search depths. The SRG gives a robust graph-theoretic representation of semantic ideas in unstructured documents which can be applied to data mining of the World Wide Web.

Acknowledgments

We would like to thank the members of the MSU GARAGe for providing helpful discussions and valuable advice, and gratefully acknowledge the financial support of the Network for Excellence in Manufacturing Online (NEM Online).

References

[1] O. Etzioni, "The world-wide web: Quagmire or gold mine?," Communications of the ACM, vol. 39, pp. 65-68, November 1996.
[2] C. Fox, "Lexical analysis and stoplists," in Information Retrieval: Data Structures and Algorithms (W. B. Frakes and R. Baeza-Yates, Eds.), pp. 102-130, Englewood Cliffs, New Jersey: Prentice Hall, 1992.
[3] M. Wulfekuhler and W. Punch, "Finding salient features for personal web page categories," in Sixth International World Wide Web Conference, (Santa Clara, CA), April 1997.
[4] J. A. Hartigan and M. A. Wong, "A k-means clustering algorithm," Applied Statistics, vol. 28, pp. 100-108, 1979.
[5] G. Miller, "WordNet: A lexical database for English," Communications of the ACM, pp. 39-41, November 1995.
[6] Lycos, http://www.lycos.com.
[7] Alta Vista, http://altavista.digital.com.
[8] WebCrawler, http://www.webcrawler.com.
[9] Open Text, http://index.opentext.net.

