COMMUNITY RELATION DISCOVERY BY NAMED ENTITIES

JIANHAN ZHU¹, ALEXANDRE L. GONÇALVES², VICTORIA S. UREN¹, ENRICO MOTTA¹, ROBERTO PACHECO², DAWEI SONG¹, STEFAN RÜGER¹

¹ Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom
² Stela Institute, Florianópolis, Brazil
E-MAIL: {j.zhu, v.s.uren, e.motta, d.song, s.rueger}@open.ac.uk, {a.l.goncalves, pacheco}@stela.ufsc

Abstract: Discovering who works with whom, on which projects and with which customers is a key task in knowledge management. Although most organizations keep models of organizational structures, these models do not necessarily accurately reflect the reality on the ground. In this paper we present a text mining method called CORDER which first recognizes named entities (NEs) of various types from Web pages, and then discovers relations from a target NE to other NEs which co-occur with it. We evaluated the method on our departmental Website. We used the CORDER method to first find related NEs of four types (organizations, people, projects, and research areas) from Web pages on the Website and then rank them according to their co-occurrence with each of the people in our department. 20 representative people were selected and each of them was presented with ranked lists of each type of NE. Each person specified whether these NEs were related to him/her and changed or confirmed their rankings. Our results indicate that the method can find the NEs with which these people are closely related and provide accurate rankings.

Keywords: Relation discovery, clustering, named entity recognition, similarities, ranking

1. Introduction

Typical questions for knowledge managers are: what do your employees know about, which of your customers do they have contacts with, and who works well together in teams? However, the knowledge represented in organizational ontologies and other resources is often static, reflecting management's design of what should happen in the organization and not necessarily the real situation on the ground. The real situation is often characterized better by communities of practice [10], the groupings of people who collaborate on shared tasks, rather than by institutional divisions. We argue that the documents an organization produces mirror what people do and who they work with. In this paper a text mining method is presented for automatically processing Web pages in order to discover relations that indicate communities of practice, and its first evaluation within our own department is reported.

A Website is the product of an organization or virtual community changing over time, and it has a number of Web pages. These pages often mention named entities (NEs) such as people, organizations, and projects, and describe the relations between them. Relations may be explicit, e.g., in the sentence "Prof. Applebaum is the principal investigator of project Bluebird", an "is-principal-investigator-of" relation is explicitly stated between the person Professor Applebaum and the project Bluebird. Alternatively, relations may be implicit, e.g., Dr Chang listing Bluebird on her homepage implies that there is an "is-member-of" relation between the two. These relations may be embedded in a number of Web pages, e.g., there may be several Web pages talking about Prof. Applebaum and Bluebird. Thus the number of co-occurrences of two NEs is an indicator of their relatedness. Co-occurrence may indicate different levels of relatedness between NEs, e.g., one Web page may talk about Prof. Applebaum and Bluebird together in one sentence, implying a strong connection, while several paragraphs away it mentions Dr Chang, implying she is more loosely related to Prof. Applebaum. By combining co-occurrences and the distance between co-occurring NEs into a relevance measure, we can find NEs which have different levels of relatedness to each other. Thus, given an NE, such as Prof. Applebaum, which has a number of co-occurring NEs, we can use the relevance measure to rank them in terms of importance. We can further set a threshold on the relevance measure and assume that co-occurring NEs whose relevance measure with the given NE is above the threshold are closely related to it.

Our text mining method CORDER (COmmunity Relation Discovery by Named Entity Recognition) works on NEs extracted from the Web pages of a community. The CORDER method is unsupervised, i.e., the method does not need the richly annotated corpora required for supervised learning or the instances of relations used as initial seeds for weakly supervised learning. The Web pages can be retrieved using a Web spider, in our case the Verity Ultraseek search engine (http://www.verity.com). NEs of various types are extracted using ESpotter [12], a named entity recognizer built upon standard NER techniques which preprocesses Web pages and can adapt to different domains on the Web in order to achieve high precision and recall in NER. Variants of the same NE are prevalent, e.g., a person can be referred to by full name (Andreas Applebaum), first name (Andreas), last name (Applebaum), nickname (Pippin), title followed by last name (Prof. Applebaum), etc. We propose a clustering method to align these variants for relation discovery. Given an NE, such as a person, CORDER discovers NEs which are closely related to it, for example lists of the people, projects, organizations, and research areas related to Prof. Applebaum. In the user evaluation reported here, people in our department were asked to judge lists of NEs related to themselves, considering their relevance, the types of relations that had been found, and their rankings. Our analysis shows that CORDER can present relevant lists of related NEs, evaluated in terms of precision, recall, and ranking accuracy, indicating the usefulness of the model.

The rest of this paper is organized as follows. We present previous work in Section 2. We describe the CORDER method in Section 3. We report experiments on our departmental Website in Section 4 and a user evaluation in Section 5. Finally we conclude and propose future work in Section 6.

2. Previous work

The concept of relation extraction was introduced as part of the information extraction tasks in the Sixth Message Understanding Conference (MUC-6) [4]. Some previous work has adopted a supervised learning approach, such as kernel methods [11], and needs richly annotated corpora which are tagged with relation instances. The limitation of this approach is that it takes a great deal of effort to prepare annotated corpora large enough to apply supervised learning. Other previous work has adopted a weakly supervised learning approach. This approach has the advantage of not needing large tagged corpora. Brin [2] proposed DIPRE, a bootstrapping method for relation discovery. DIPRE finds patterns for a particular relation between NEs from a small set of training data, and uses these patterns for finding the relation between new NEs on test data. Snowball [1] improved on DIPRE by adopting the constraint of using a named entity tagger. KNOWITALL [5] uses patterns for relation extraction by taking advantage of the scale and redundancy of the Web. DIPRE and Snowball, however, need a small set of training data. It is also unclear how training data should be selected and how much data is needed. DIPRE, Snowball, and KNOWITALL work well on relations embedded in patterns but cannot spot relations shown in the context of the text and layout of Web pages.

The relation discovery method most similar to ours is by Hasegawa et al. [7]. They proposed a method which discovers relations among NEs from large corpora by clustering pairs of NEs according to the similarity of the context words occurring between the NEs. Two NEs are considered to co-occur if they appear within the same sentence and are separated by at most a fixed number of words. According to their report, experiments on a newspaper corpus showed that relations between NEs can be detected, and appropriate labels can be automatically generated for them. However, their method is designed for explicit relations specified between NEs in the same sentence by context words, and it cannot find relations between less frequently co-occurring NE pairs in the same sentence. Their method works well on newspaper text, which usually consists of well-formed sentences and follows a "house style". The problem is that Web pages, even in a single domain, are often written in individual styles and often contain information in lists, such as research interests and publications. We want to detect relations which are not explicitly specified by context words. Their method also does not address the problem of ranking relations in terms of their level of relevance to the given NEs.

3. CORDER

3.1 Overview

CORDER discovers relations from the Web pages of a community. Its approach is based on co-occurrences of NEs and the distances between them. For a given NE, there are a number of co-occurring NEs. We assume that NEs that are closely related to each other tend to appear together more often and closer to each other in Web pages. We calculate a relation strength for each co-occurring NE based on its co-occurrences with, and distances from, the given NE. The co-occurring NEs are ranked by their relation strengths. The process of CORDER is as follows.

Data Selection. First, we find Web pages from an organization's Website. We can use a Web spider to retrieve Web pages from a Website. Web pages which contain noisy data, e.g., out-dated information and irrelevant information, are removed. Web pages which are linked from Web pages on the Website should be taken into account if they contain relevant information.

Named Entity Recognition. Second, ESpotter recognizes NEs of various types from the Web pages and aligns variants of the same NE. In order to improve the quality of discovered relations, we need high precision and recall in NER. Domain knowledge needs to be taken into account to improve precision and recall, and to align variants in NER. We describe this step in detail in Section 3.2.

Relation Strength and Ranking. Third, for each NE that occurs in these Web pages, co-occurring NEs are ranked by taking into account their co-occurrences and distances from the given NE in co-occurring Web pages. Given an NE, related NEs are ranked and divided into separate lists for each type of NE. We describe this step in detail in Section 3.3.

Labeling Relations. Fourth, since relations on Web pages are mostly implicit, domain knowledge, such as an ontology, is needed to label these relations. In this study, an ontology describing academic life was used to label relations that already exist in the ontology. For relations that do not exist in the ontology, we use the ontology to get a list of possible relations between two types of NEs and ask users to select appropriate relations from this list during their evaluation.

Evaluation. Finally, results were evaluated in terms of the precision, recall, and ranking accuracy of ranked lists of NEs related to a particular user, who also judged their relevance.

3.2 Named Entity Recognition

Named entity recognition is a well studied area [3]. In our previous work, we have developed ESpotter [12], an NER system which uses standard NER techniques and adapts to various domains on the Web by taking into account domain knowledge. We have used ESpotter for NER in relation discovery. ESpotter recognizes NEs of standard types, e.g., people's and organizations' names. Users can configure ESpotter using new lexicon entries and patterns. Domain knowledge, taken from sources such as ontologies, is represented as lexicon entries, e.g., the names of projects in our department from our organizational ontology, and research areas from the ACM Computing Classification System (http://www.acm.org/class/1998/). Mark-up tags in Web pages are removed before NER. ESpotter recognizes terms which match, or are similar to, these lexicon entries.

Variants of the same NE are prevalent on different Web pages of a Website, e.g., a person's name can be referred to in many ways. We propose a clustering method which groups similar NEs together in order to find these variants and align them by taking into account the string similarity of two NEs. String similarity is defined as the length of the longer NE divided by the Levenshtein distance¹ of the two NEs. Two NEs judged similar by their string similarity are more likely to be variants of the same NE if they appear on the same Web page or on two Web pages which link to each other. The two NEs may appear on multiple Web pages, and we define the contextual distance between two NEs as the minimum number of links, regardless of link direction, between two Web pages where these two NEs appear. The contextual distance is zero if the two NEs both appear on the same Web page. We define the similarity between two NEs, E1 and E2, Sim(E1, E2), by taking into account their string similarity, StrSim(E1, E2), and contextual distance, ConDis(E1, E2), as

Sim(E1, E2) = StrSim(E1, E2) / (1 + ConDis(E1, E2)).
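To make the variant-alignment measure concrete, the sketch below implements StrSim and Sim as we read the definitions above (length of the longer NE divided by the Levenshtein distance, damped by the link-based contextual distance). The contextual distance is assumed to be computed elsewhere from the site's link graph and passed in; this is an illustrative reading, not the ESpotter/CORDER code itself.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b (edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def str_sim(e1: str, e2: str) -> float:
    """StrSim: length of the longer NE divided by the Levenshtein distance."""
    dist = levenshtein(e1, e2)
    # Identical strings are not covered by the definition; treat them as maximally similar.
    return float("inf") if dist == 0 else max(len(e1), len(e2)) / dist

def sim(e1: str, e2: str, con_dis: int) -> float:
    """Sim(E1, E2) = StrSim(E1, E2) / (1 + ConDis(E1, E2))."""
    return str_sim(e1, e2) / (1 + con_dis)

# Two surface forms of the same person found on the same page (ConDis = 0):
print(sim("Prof. Applebaum", "Andreas Applebaum", con_dis=0))
```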

A hierarchical clustering algorithm based on the single-linkage method was used to group different variants based on their similarity. Clusters were presented to a domain expert to check the alignment by removing falsely clustered NEs and specifying an NE to represent all the NEs in a cluster. The algorithm can also help distinguish different entities with the same name. For example, suppose there are two entities with the same name "John". Based on their similarity with co-occurring entities, the algorithm can cluster them with the two entities "John Smith" and "John White", respectively. The method still needs to be improved in order to better tackle this "identity" problem, which is prevalent on the Web.

3.3 Relation Strength and Ranking

Given an NE which occurs in different Web pages, there are a number of NEs of various types which all co-occur with the given NE in these Web pages. We propose a relation ranking algorithm which ranks co-occurring NEs using the relation strength between the given NE and its co-occurring NEs. Thus NEs which have strong relations with the given NE can be identified. The relation strength between two NEs takes into account four aspects of the two NEs.

Co-occurrence. Two NEs are considered to co-occur if they appear in the same Web page. Generally, if an NE is closely related to another NE, they tend to co-occur more often. For two NEs, E1 and E2, we use Resnik's method [9] to compute a relative frequency of co-occurrences of E1 and E2 as

p̂(E1, E2) = Num(E1, E2) / N,

where Num(E1, E2) is the number of co-occurring Web pages for E1 and E2, and N is the total number of Web pages.

Distance. Two NEs which are closely related tend to occur close to each other. If two NEs, E1 and E2, both occur only once in a Web page, the distance between E1 and E2 is the difference between the offsets of E1 and E2. If E1 occurs once and E2 occurs multiple times in the Web page, the distance of E1 from E2 is the difference between the offset of E1 and the offset of the closest occurrence of E2. When both E1 and E2 occur multiple times in the Web page, we average the distance from each occurrence of E1 to E2 and define the logarithmic distance between E1 and E2 in the ith Web page as

d_i(E1, E2) = [ Σ_j (1 + log2(min(E1_j, E2))) ] / Freq_i(E1),

where Freq_i(E1) is the number of occurrences of E1 in the ith Web page and min(E1_j, E2) is the distance between the jth occurrence of E1, E1_j, and E2.
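A minimal sketch of the per-page logarithmic distance d_i(E1, E2), assuming NE occurrences are available as token offsets within a page; the single-occurrence cases described above fall out of the same formula. This is our reading of the definition, not the original implementation.

```python
import math

def log_distance(offsets_e1: list[int], offsets_e2: list[int]) -> float:
    """d_i(E1, E2): for each occurrence E1_j, take 1 + log2 of the offset
    difference to the nearest occurrence of E2, then average over Freq_i(E1).
    Assumes the two NEs never share an offset (distance is at least 1)."""
    total = 0.0
    for o1 in offsets_e1:
        nearest = min(abs(o1 - o2) for o2 in offsets_e2)  # min(E1_j, E2)
        total += 1 + math.log2(nearest)
    return total / len(offsets_e1)

# E1 occurs at token offsets 10 and 250, E2 at offsets 18 and 400 on one page:
print(log_distance([10, 250], [18, 400]))  # (1 + log2(8) + 1 + log2(150)) / 2
```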

¹ Levenshtein distance of two strings is the length of the shortest sequence of edit commands that transform one string to the other.

Frequency. An NE is considered to be more important if it has more occurrences in a Web page. Consequently, a more important NE on a Web page tends to have strong relations with other NEs which also occur on the Web page.

Page relevance. Given an NE, E1, each Web page is given a weight indicating its relevance in associating E1 with the other NEs that co-occur with E1 on the page; e.g., for a person, a high relevance weight might be set for their homepage and a low relevance weight for their blog page.

Relation strength. Given an NE, E1, we calculate the relation strength between E1 and another NE, E2, by taking into account their co-occurrences, distances, and frequencies in co-occurring Web pages. The relation strength, R(E1, E2), between E1 and E2 is defined in Equation 1:

R(E1, E2) = p̂(E1, E2) × Σ_i [ w_i × f(Freq_i(E1)) × f(Freq_i(E2)) / d_i(E1, E2) ]    (1)

where w_i is the weight showing the relevance of the ith Web page to E1, f(Freq_i(E1)) = 1 + log2(Freq_i(E1)), f(Freq_i(E2)) = 1 + log2(Freq_i(E2)), and Freq_i(E1) and Freq_i(E2) are the numbers of occurrences of E1 and E2 in the ith Web page, respectively.
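Putting the four aspects together, the following sketch evaluates Equation 1 over a small in-memory collection. The page representation (a relevance weight w and a map from NE name to its token offsets) and the toy data are illustrative assumptions, not CORDER's actual data structures; the per-page distance reuses the logarithmic distance defined above.

```python
import math

def f(freq: int) -> float:
    """Occurrence weight used in Equation 1: f(Freq) = 1 + log2(Freq)."""
    return 1 + math.log2(freq)

def relation_strength(pages: list[dict], e1: str, e2: str) -> float:
    """R(E1, E2) = p_hat(E1, E2) * sum_i w_i * f(Freq_i(E1)) * f(Freq_i(E2)) / d_i(E1, E2)."""
    n = len(pages)
    co_pages = [p for p in pages if e1 in p["offsets"] and e2 in p["offsets"]]
    p_hat = len(co_pages) / n  # relative co-occurrence frequency over all pages
    total = 0.0
    for page in co_pages:
        o1, o2 = page["offsets"][e1], page["offsets"][e2]
        d_i = sum(1 + math.log2(min(abs(a - b) for b in o2)) for a in o1) / len(o1)
        total += page["w"] * f(len(o1)) * f(len(o2)) / d_i
    return p_hat * total

# Toy collection: a homepage weighted 2 and a news page weighted 1.
pages = [
    {"w": 2.0, "offsets": {"Prof. Applebaum": [12, 90], "Bluebird": [15]}},
    {"w": 1.0, "offsets": {"Prof. Applebaum": [40], "Dr Chang": [300]}},
]
print(relation_strength(pages, "Prof. Applebaum", "Bluebird"))
```

Ranking the co-occurring NEs of a given NE then amounts to sorting them by this value and keeping those above the threshold discussed next.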

Given an NE, E1, the relation strength between E1 and each of its co-occurring NEs is calculated. We rank the co-occurring NEs of E1 in terms of their relation strengths with E1. Since these co-occurring NEs are of different types, we divide the ranked list into a set of ranked lists for each type, e.g., lists of related people and related organizations. The higher the relation strength between two NEs, the more closely they are related to each other. We set a relation strength threshold, so that only significant relations having relation strengths above the threshold are selected. Relations having relation strengths below the threshold are considered to result from noise in our data and are ignored. We can trade off precision against recall by setting the relation strength threshold: higher thresholds give high precision and low recall, and vice versa. The threshold may depend on the nature of the data used in the analysis. In our study we set the threshold as the value at which two NEs co-occur with only one occurrence each in only one Web page, and their distance in the Web page is a certain value D. Two NEs with their relation strength above this threshold are considered to be related.

4. Experiments

We applied the CORDER method to the Website of our department, the Knowledge Media Institute (KMi) (http://kmi.open.ac.uk). We used the Verity Ultraseek search engine to get a list of Web pages which are linked from the KMi homepage and whose URL hostnames are from a list of URLs of sub-sites of the KMi Website, e.g., the URL of the departmental news site, PlanetNews (http://news.kmi.open.ac.uk). Web pages containing noisy data, including obsolete Web pages, were removed using the patterns in their URLs and templates in their content. We got 503 Web pages, of which 122 are official pages from the KMi Website, 202 are from personal homepages, 111 are from the PlanetNews site, and the rest are from other relevant sources. We used an ontology describing academic life in our department (http://webonto.open.ac.uk) to get the people's names, project names, research area names, and organization names which are specific to our department, and input them to ESpotter for NER. In addition, ESpotter uses its built-in lexicon entries and patterns, such as patterns for English names, to recognize NEs which are not covered by the ontology. Four types of NEs, i.e., people (PeoNE), organizations (OrgNE), projects (ProNE), and research areas (ResNE), are used in our study.

To align variants of NEs from these Web pages, we calculated a similarity threshold of 0.833 by setting minimum string similarity of 2.5 and minimum contextual distance of 2 (i.e., 2.5 / (1 + 2) ≈ 0.833 under the similarity definition in Section 3.2), and used this similarity threshold to find clusters of variants (see Section 3.2). We found 93 clusters for a domain expert to align manually by removing falsely clustered NEs and specifying an NE to represent all the NEs in a cluster. To estimate the precision, P_NER, and recall, R_NER, of NER, we randomly selected 15 pages from the 503 pages and asked a human evaluator to annotate NEs of the four types. The human annotations were used as a gold standard to compare with the NER results produced by ESpotter on the 15 pages; we got P_NER and R_NER² of 91% and 88%, respectively. For the 503 pages, the numbers of unique PeoNEs, OrgNEs, ProNEs, and ResNEs are 860, 526, 21, and 273, respectively. The average numbers of PeoNEs, OrgNEs, ProNEs, and ResNEs on each page are 4.87, 2.95, 0.41, and 1.68, respectively. The numbers of novel PeoNEs, OrgNEs, ProNEs, and ResNEs, i.e., those not in the ontology, are 763, 481, 1, and 215, respectively.

The CORDER method was used to discover relations between people working in our department and NEs of the four types. Equation 1 calculates the relation strength between a person and each co-occurring NE of the four types. A relevance weight of 1 is set for Web pages of general relevance to the person and 2 for Web pages of special relevance to the person. We use the URLs of Web pages to identify Web pages from the person's homepage and set their relevance weight to 2, and set the relevance weight of all other Web pages to 1.

² P_NER = N_{ESpotter,Correct} / N_{ESpotter} and R_NER = N_{ESpotter,Correct} / N_{User}, where N_{ESpotter,Correct} is the number of correct NEs produced by ESpotter for the four types on the 15 Web pages, N_{ESpotter} is the total number of NEs produced by ESpotter for the four types on the 15 Web pages, and N_{User} is the number of NEs annotated by the evaluator on the 15 pages.

To set the relation strength threshold, three people were selected and we looked at ten Web pages each where their names co-occur with other NEs. We found that by setting a distance threshold of 10 between their names and these NEs, 92% of the NEs within the distance threshold were judged by a human evaluator to be related to the three people. We calculated a relation strength threshold from Equation 1 of 4.6×10⁻⁴. Only relations stronger than this threshold were set as relevant (see Figure 1).

There are 60 researchers, including PhD students, research staff, and teaching staff, working in our department. For the 503 pages, the maximum, minimum, and average numbers of NEs of each type which co-occur with their names, and which do so with relation strengths above the threshold, are shown in Table 1.

Table 1. Statistics of numbers of related NEs per person from 503 Web pages on our departmental Website.

                 Co-occur                  Co-occur and above threshold
          Max      Min      Aver           Max      Min      Aver
  OrgNE   146      4        45.83          47       1        15.36
  PeoNE   413      12       110            106      2        27.92
  ProNE   16       4        8.27           11       1        3.64
  ResNE   182      6        48.33          54       2        14
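As a sanity check on the quoted threshold (our reading of the authors' setting, not a computation reported in the paper): for two NEs that co-occur once each on a single page of general relevance (w = 1) at distance D = 10, with N = 503 pages, Equation 1 gives approximately the value above.

```python
import math

N, D, w = 503, 10, 1.0        # corpus size, distance threshold, general page weight
p_hat = 1 / N                 # the pair co-occurs in exactly one of the 503 pages
f_single = 1 + math.log2(1)   # f(Freq) = 1 for a single occurrence
d = 1 + math.log2(D)          # logarithmic distance for one occurrence pair
threshold = p_hat * w * f_single * f_single / d
print(f"{threshold:.1e}")     # ~4.6e-04, matching the threshold used above
```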

There are 3655 relations found by CORDER between these 60 researchers and the four types of NEs. Among these relations, 588 relations which already exist in the ontology were labeled; the remaining 3067 novel relations, i.e., those not in the ontology, were not labeled and need to be evaluated.

5. Evaluation

In order to evaluate CORDER, 20 people were selected out of the 60 researchers working in our department; none of them were from our relation discovery research group. They represent a range of experience from PhD students to professors. They were asked to independently evaluate an online form (such as the one in Figure 1) showing ranked lists of relevant NEs produced by CORDER. In Figure 1, the user can select one of the four types (Organization, Person, Project, and Research Area) at the top of the form to show a list of NEs of the corresponding type ranked by their relation strengths. The names of the NEs are shown in Section B. The relation strengths of the NEs are visualized in Section C. In Section A, NEs with relation strengths above the threshold are set as relevant by default, and those below the threshold are set as not relevant. The type of the relations between an NE and the user is shown in Section D. Some of the relations are obtained from the ontology. The user can select one or multiple types of relations from the dropdown list and can also specify new types of relations. The rankings of the NEs are shown in Section E. An NE is given a ranking on the basis that it is judged as relevant by the user. The user can change the relevance values, types of relations, and rankings of the NEs in Sections A, D, and E, respectively, based on his/her own opinions. To help the user judge the relevance, relation types, and ranking of each NE, he/she can click on the NE in Section B to view the context in which the NE and the user's name have been mentioned together in Web pages. Users can make comments, such as NEs that are missing from the list, in Section F.

Figure 1. User evaluation form.

Herlocker et al. [8] review the evaluation of collaborative filtering (CF) systems, which make recommendations to individuals using the opinions of a community of users. Precision and recall, the two most popular metrics for evaluating information retrieval systems, measure the ability of a CF system to retrieve relevant items. F-measure combines precision and recall. Ranking accuracy metrics measure the ability of a CF system to produce a recommended ordering of items that matches how the user would have ordered the same items. We have used precision, recall, and F-measure to measure the ability of CORDER to discover relevant NEs, and ranking accuracy to measure the ability of CORDER to provide rankings for relevant NEs. For a list of NEs of type T, the number of NEs judged as relevant by CORDER is N_{CORDER,Relevant}. After user evaluation, the total number of NEs judged as relevant by the user is N_{User,Relevant}. The number of NEs judged as relevant by both the user and CORDER is N_{User,CORDER,Relevant}. Precision, P_{T,User}, and recall, R_{T,User}, are defined as in Equation 2.

P_{T,User} = N_{User,CORDER,Relevant} / N_{CORDER,Relevant},   R_{T,User} = N_{User,CORDER,Relevant} / N_{User,Relevant}    (2)
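For one NE type and one user, Equation 2 is just set arithmetic over the relevance judgements; a minimal sketch assuming the judgements are given as sets of NE names.

```python
def precision_recall(corder_relevant: set[str], user_relevant: set[str]) -> tuple[float, float]:
    """P_{T,User} and R_{T,User} of Equation 2 for a single NE type and user."""
    agreed = len(corder_relevant & user_relevant)  # N_{User,CORDER,Relevant}
    precision = agreed / len(corder_relevant) if corder_relevant else 0.0
    recall = agreed / len(user_relevant) if user_relevant else 0.0
    return precision, recall

# Toy judgements for one user's list of related projects:
print(precision_recall({"Bluebird", "Dot.Kom"}, {"Bluebird", "AKT"}))  # (0.5, 0.5)
```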

CORDER and the user provide two sets of rankings for the list of N_{User,CORDER,Relevant} NEs judged as relevant by both. To measure how well CORDER's rankings match the user's rankings, we define the ranking accuracy, RA_{T,User}, as the Spearman coefficient of rank correlation [6] between the two sets of rankings, as in Equation 3:

RA_{T,User} = 1 − 6 Σ_i (r_{i,User} − r_{i,CORDER})² / (N_{User,CORDER,Relevant}³ − N_{User,CORDER,Relevant})    (3)

where r_{i,User} and r_{i,CORDER} (1 ≤ i, r_{i,User}, r_{i,CORDER} ≤ N_{User,CORDER,Relevant}) are the two rankings provided by the user and CORDER, respectively, for the ith NE in the list. There are no ties in a set of rankings, i.e., for any two NEs, E_i and E_j (i ≠ j), r_{i,User} ≠ r_{j,User} and r_{i,CORDER} ≠ r_{j,CORDER}. We have -1 ≤ RA_{T,User} ≤ 1, where RA_{T,User} = 1 when the two sets of rankings by CORDER and the user are in perfect agreement and RA_{T,User} = -1 when they are in perfect disagreement.

Equations 2 and 3 were used to calculate the precision, recall, and ranking accuracy of the four lists of NEs, i.e., OrgNEs, PeoNEs, ProNEs, and ResNEs, for each of the 20 users, and the results are shown in Figure 2.

Figure 2. Precision, recall, and ranking accuracy of evaluation.

In Figure 2, we can see that the precision of all types of NEs for all 20 users ranges between 70% and 100%, recall ranges between 70% and 100%, and ranking accuracy ranges between 0 and 1.0. The total number of NEs for all four types for all 20 users judged as relevant by CORDER is Total_{CORDER,Relevant}, the total number of NEs judged as relevant by the 20 users is Total_{User,Relevant}, and the total number of NEs judged as relevant by both the 20 users and CORDER is Total_{CORDER,User,Relevant}. We define the overall precision, P_Total, recall, R_Total, and F-measure, F_Total, of the evaluation as in Equation 4:

P_Total = Total_{CORDER,User,Relevant} / Total_{CORDER,Relevant},   R_Total = Total_{CORDER,User,Relevant} / Total_{User,Relevant},   F_Total = 2 × R_Total × P_Total / (R_Total + P_Total)    (4)

We got P_Total, R_Total, and F_Total of 0.905, 0.882, and 0.904, respectively, for our evaluation. We used different relation strength thresholds to trade P_Total against R_Total, and we found that when the threshold is set to 4.6×10⁻⁴, i.e., the maximum distance between pairs of NEs is set to 10, we get the highest F_Total of 0.904. We average the ranking accuracies of the four ranked lists over the 20 users to get an overall ranking accuracy, RA_Total, of 0.769.

Given the four options "useless", "occasionally useful", "useful", and "very useful", 15 users rated the results produced by CORDER as "very useful", and the remaining 5 users rated the results as "useful". They suggested that the system should group NEs which are related to each other, such as the two ResNEs "text mining" and "data mining", to help user evaluation. They found that it was hard to rank certain types of NEs, such as PeoNEs, because their personal view of the levels of importance of these NEs is hard to quantify. For NEs which were judged as relevant by CORDER but irrelevant by users, we found that most of them are just above the threshold, some were caused by noise in Web pages, and some were mistakes produced by named entity recognition. For NEs which were judged as irrelevant by CORDER but relevant by users, we found that most of them are just below the threshold, some were the result of a lack of information in Web pages about related NEs, and some were NEs not recognized by ESpotter. In terms of user evaluation of NE rankings, we found that users typically only made regional changes rather than global changes, i.e., if we evenly divide a ranked list into three bands for high-ranked, medium-ranked, and low-ranked NEs, most user changes are inside each band rather than across different bands. Despite various limitations such as bias in Web pages, mistakes in NER, and user subjectivity in ranking, CORDER produced good rankings.
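A sketch of the ranking accuracy of Equation 3 and the F-measure of Equation 4, assuming, as stated above, that the two rankings cover the same jointly relevant NEs and contain no ties; the example values are illustrative, not the evaluation data.

```python
def ranking_accuracy(user_ranks: list[int], corder_ranks: list[int]) -> float:
    """Spearman rank correlation (Equation 3) between two tie-free rankings,
    each giving the rank of the ith jointly relevant NE."""
    n = len(user_ranks)
    if n < 2:
        return 1.0  # a single item can only be ranked one way
    d_squared = sum((u - c) ** 2 for u, c in zip(user_ranks, corder_ranks))
    return 1 - 6 * d_squared / (n ** 3 - n)

def f_measure(precision: float, recall: float) -> float:
    """F_Total of Equation 4: harmonic mean of overall precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# CORDER ranks five NEs 1..5 and the user swaps the last two:
print(ranking_accuracy([1, 2, 3, 5, 4], [1, 2, 3, 4, 5]))  # 0.9
print(f_measure(0.9, 0.85))
```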

6. Conclusions and future work

The CORDER method discovers relations between NEs relevant to a community based on co-occurrences of these NEs. The relation discovery process is fully automatic, except that a domain expert was asked to align variants of NEs. Discovered relations were evaluated by human experts, who in this evaluation were members of our department. Initial experiments show that CORDER can discover relations with high precision, recall, and ranking accuracy. CORDER's running time increases linearly with the size and number of Web pages it examines. CORDER can incrementally evaluate existing relations and discover new relations by taking into account new Web pages. Thus CORDER can scale well to large datasets. We are currently experimenting with CORDER on other data sources. Initial experimental results on the BBC Website, which has a much larger number of Web pages than our departmental Website, are promising. However, the evaluation of a large number of relations in the BBC domain is difficult since there is no gold standard or domain expert to evaluate them.

First, the continuing refinement of the method to deal with noise and variants from the named entity recognizer, and to introduce more sophisticated distance and relation strength metrics, will be an important thread of future work. Second, we intend to use CORDER for ontology maintenance, aiming to overcome the disconnection that we see between static organizational ontologies, as designed by managers, and the real situation, as experienced by communities of practice. Finally, we plan to integrate CORDER with current information extraction methods in order to discover both implicit and explicit relations, e.g., CORDER can be applied to plain text to complement Hasegawa et al.'s method [7]. Integration of CORDER with Hasegawa et al.'s method [7] and other natural language processing methods can help label relations not in the ontology.

ACKNOWLEDGEMENTS

This research was partially supported by the Designing Adaptive Information Extraction from Text for Knowledge Management (Dot.Kom) project, Framework V, under grant IST-2001-34038, and by the Advanced Knowledge Technologies (AKT) project. AKT is an Interdisciplinary Research Collaboration (IRC) sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. The AKT IRC comprises the Universities of Aberdeen, Edinburgh, Sheffield, Southampton and the Open University. This research was also partially supported by the Brazilian National Research Council (CNPq) through a doctoral scholarship held by Alexandre Gonçalves.

REFERENCES

[1] Agichtein, E., and Gravano, L. 2000. Snowball: Extracting Relations from Large Plain-Text Collections. In Proc. of the 5th ACM International Conference on Digital Libraries, 85-94.
[2] Brin, S. 1998. Extracting Patterns and Relations from the World Wide Web. In Proc. of the WebDB Workshop at the 6th International Conference on Extending Database Technology, 172-183.
[3] Cunningham, H. 2002. GATE: a General Architecture for Text Engineering. Computers and the Humanities, 36(2):223-254.
[4] DARPA (Defense Advanced Research Projects Agency). 1995. Proc. of the Sixth Message Understanding Conference. Morgan Kaufmann.
[5] Etzioni, O., Cafarella, M., Downey, D., Popescu, A., Shaked, T., Soderland, S., Weld, D. S., and Yates, A. 2004. Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison. In Proc. of AAAI 2004, 391-398.
[6] Gibbons, J. D. 1976. Nonparametric Methods for Quantitative Analysis. Holt, Rinehart and Winston.
[7] Hasegawa, T., Sekine, S., and Grishman, R. 2004. Discovering Relations among Named Entities from Large Corpora. In Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics, 415-422.
[8] Herlocker, J. L., Konstan, J. A., Terveen, L. G., and Riedl, J. T. 2004. Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems, 22(1):5-53.
[9] Resnik, P. 1999. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 11:95-130.
[10] Wenger, E. 1998. Communities of Practice: Learning, Meaning, and Identity. Cambridge University Press.
[11] Zelenko, D., Aone, C., and Richardella, A. 2002. Kernel Methods for Relation Extraction. In Proc. of the Conference on Empirical Methods in Natural Language Processing, 71-78.
[12] Zhu, J., Uren, V., and Motta, E. 2005. ESpotter: Adaptive Named Entity Recognition for Web Browsing. In Proc. of the 3rd Conference on Professional Knowledge Management (WM 2005).
