Annotating documents by Wikipedia concepts∗

Peter Schönhofen
Computer and Automation Research Institute, Hungarian Academy of Sciences
E-mail: [email protected]

∗ Support from OTKA 72845.

Abstract

We present a technique which reliably labels words or phrases of an arbitrary document with the Wikipedia articles (concepts) best describing their meaning. First it scans the document content, and when it finds a word sequence matching the title of a Wikipedia article, it attaches the article to the constituent word(s). The collected articles are then scored based on three factors: (1) how many other detected articles they semantically relate to, according to the Wikipedia link structure; (2) how specific the concept they represent is; and (3) how similar the title by which they were detected is to their "official" title. If a text location refers to multiple Wikipedia articles, only the one with the highest score is retained. Experiments on 24,000 randomly selected Wikipedia article bodies showed that 81% of the phrases annotated by article authors were correctly identified. Moreover, of the 5 concepts deemed most important by our algorithm during a final ranking, on average 72% were indeed marked in the original text.

1 Introduction

For a machine, understanding the exact intent of a document downloaded from the Internet or retrieved from a corporate archive is unfortunately still a challenging task, despite the continuous progress made in natural language processing and related fields. However, we will show that the next best thing, identifying the meaning of important words and phrases mentioned in its text (knowledge which can later be utilized to determine the document topic), is feasible, thanks to Wikipedia, a collaboratively edited, rapidly expanding, freely available on-line encyclopedia. In fact, Wikipedia can be viewed as an ontology if we regard article bodies as concept definitions, article titles as word sequences through which these concepts are mentioned in documents, and hyperlinks between articles as semantic connections. Although "proper" ontologies like WordNet, OpenCyc or SUMO are more structured and carry much richer information about concept relationships, Wikipedia has its own advantages, namely wider coverage, very specific technical terms, a large number of proper names (persons, places, books, products etc.) and up-to-date content.

The basic idea is simple: recognize Wikipedia concepts mentioned in the text, and perform disambiguation when a word or phrase might refer to more than one concept. Obviously, for each phrase we should retain only the single concept which relates most closely to the document subject; but because this topic is not known beforehand, we have to guess it from the set of all recognized concepts. Fortunately, there are some straightforward heuristics helping us determine whether a concept was meant by the author or not:

• concepts linking to or linked by several other concepts recognized in the same document are probably valid;

• recognition through longer titles is more reliable, since these titles can typically refer only to a few concepts;

• concepts mentioned by their "official" title are stronger candidates than those appearing in the text through some informal name (New York as "Big Apple");

• if a concept is semantically related to only a few other concepts, but they are frequently mentioned in the document, then it is worth as much as a concept which is related to many but individually rare concepts.

Recognized concepts are thus scored according to the factors listed above (the exact formulae will be presented in Sect. 3.3), and if a word at a given text location has more than one concept attached, we keep only the concept with the highest score. Finally, the remaining concepts can be ranked to estimate their importance in the document.

The proposed method is novel in two ways. First, the annotation is not limited to named entities; it involves every phrase having a corresponding Wikipedia article. Second, disambiguation does not rely on an extensive set of textual contexts in which the given concept occurs, which is not always available (in the case of WordNet, for example), but instead on the network of semantic relationships between concepts.

2 Related work

Labeling strongly resembles word sense disambiguation ([2, 12] give an overview, among others); however, there are significant differences. Here we focus on nouns and noun phrases, as they are more characteristic of the document topic than other parts of speech. We also have to decide which word sequences should be annotated at all; for instance, inside "operating system process", either [Operating System] or [System Process] can be emphasized. Similarly, sometimes annotations of a phrase and of its constituent words are equally valid; e.g. "skeletal muscle" might represent [Skeletal], [Muscle] and [Skeletal Muscle].

Word sense disambiguation and named entity recognition (another related field) are often aided by ontologies. Navigli et al. [19] and [21] utilized WordNet, examining the efficiency of various measurements of semantic distance; [20] disambiguated with pattern matching, again employing WordNet. In [15], word senses identified in documents were exploited to build a search index; [16] explored how OWL could improve disambiguation accuracy. Cucerzan [8] mined surface forms of named entities from Wikipedia, then used both their textual context and the category tags of the containing Wikipedia articles to identify them in documents; [4] followed a similar approach, comparing contexts by SVM. In [18], hyperlinks inside Wikipedia article texts representing references to strictly non-named entities were used as training data for a machine learning algorithm which later performed disambiguation on SENSEVAL.

The idea of converting Wikipedia into an ontology is not new, either [10, 11]; [23] established a mapping between it and WordNet, and [5] gives an algorithm enriching its semantic content by correlating Wikipedia categories with each other. Due to the emergence of Web 2.0, semantic annotation has received a lot of attention; [22, 25] give an overview of the field, and [13] outlines a generic framework where annotation is part of a larger system. To mention only two approaches: [3] applies linguistic processing and machine learning to identify entity (sub)classes in documents (persons, places, sporting activities etc.); [9] uses a taxonomy constructed from concepts and the words frequently appearing with them in web pages, and during disambiguation it selects the concept most compatible with the context of the given phrase. Papers [6, 7] identified noun phrases in documents and formed hypotheses about their interpretation ("New York" is a city or a hotel), then tested their validity by submitting the appropriate linguistic structures ("New York hotel") to Google. In [14], sophisticated natural language processing techniques were used to detect concepts in texts and recognize their relationships; the described method covers not only proper nouns but also dates, verbs and common nouns.

Sect. 1 mentioned that our method estimates the importance of concepts retained in the final annotation, which is similar to link prediction [1, 17], especially since in Sect. 4 the performance of our proposed algorithm will be measured by how well it is able to pinpoint phrases of Wikipedia articles on which authors placed hyperlinks.

Of the approaches listed above, [8, 18] are the most similar to ours; however, two important differences remain. By exploiting links between ontology concepts rather than relying on the comparison or analysis of textual contexts, we obviate the need for deep natural language processing and for an extensive set of training documents. Moreover, although we do not focus solely on the annotation of proper nouns but extend our coverage to common nouns and verbs as well, the performance of our algorithm remains at the level shown in [18].

3 Labeling algorithm

To make the discussion more succinct, we introduce the following terminology. The document to be annotated will be called the target document. A recognized concept is a Wikipedia article whose title occurs inside the document; the title becomes the surface form of the concept, and the words constituting the title at the specific text locations will have the concept as their candidate concept. The candidate concepts actually selected for annotation are final concepts.
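To make the later illustrations concrete, here is the minimal data layout that the code sketches in the rest of this section assume; it is our own illustration of the terminology above, not code from the paper, and all names are ours.

```python
# Shared data layout used by the sketches below (our illustration, not the paper's code).
Concept = int        # a Wikipedia article / disambiguation-page fragment, identified by id
Location = int       # a text location: the index of a word in the target document

candidates: list[dict[Concept, str]] = []         # per location: candidate concept -> surface form
locations_of: dict[Concept, set[Location]] = {}   # M_c: locations covered by c's surface forms
neighbours: dict[Concept, set[Concept]] = {}      # concepts linked to or by c in Wikipedia
official_words: dict[Concept, set[str]] = {}      # W_c: words of c's official (pre-processed) title
```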

3.1 Preparing Wikipedia

Wikipedia contains four kinds of entries. A regular article is composed of a title, a body (possibly with tables, images and hyperlinks to other regular articles or to external web sites) and a list of relevant categories. Redirection pages merely redirect users from an alternative title to the official one (from "ANOVA" to "Analysis of Variance"). Disambiguation pages enumerate the interpretations of a phrase ("Java" as an island and as a programming language). Category pages specify the sub- and super-categories of the given category, and also list the regular pages assigned to it.

To extract a concept network from Wikipedia, first disambiguation pages were split into pieces discussing a single concept, each inheriting the original title. Redirection pages were eliminated, and their titles were registered as non-official titles of the articles they pointed to. Next, from titles we discarded supplementary words placed between parentheses or after a comma, semicolon or hyphen (simplifying "Superior, Arizona" to "Superior"). Characters were mapped to the Latin alphabet by removing diacritics, or ignored if that was not possible. Finally, we detected paragraph boundaries, performed stemming [24], and deleted stopwords. Stemming and stopword removal were applied to article titles as well, so some originally distinct titles became identical. Category pages were not processed.

We end up with a set of concepts c (representing Wikipedia articles or disambiguation page fragments), each having a set of titles tc (one of them official), connected in an undirected fashion based on the hyperlinks present in their texts.
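The title pre-processing can be sketched as follows. This is only an illustration of the steps described above: NLTK's Porter stemmer and English stopword list stand in for the stemmer [24] and stopword list actually used (the stopword corpus must be downloaded once via nltk.download('stopwords')).

```python
import re
import unicodedata
from nltk.stem import PorterStemmer       # stand-in for the stemmer of [24]
from nltk.corpus import stopwords         # requires nltk.download('stopwords')

STEMMER = PorterStemmer()
STOPWORDS = set(stopwords.words("english"))

def normalize_title(title: str) -> str:
    """Simplify a Wikipedia title in the spirit of Sect. 3.1."""
    # Drop supplementary parts: parenthesized text, and anything after a comma,
    # semicolon or hyphen ("Superior, Arizona" -> "Superior").
    title = re.sub(r"\(.*?\)", " ", title)
    title = re.split(r"[,;-]", title, maxsplit=1)[0]
    # Map to the Latin alphabet by removing diacritics; drop characters that cannot be mapped.
    title = unicodedata.normalize("NFKD", title)
    title = title.encode("ascii", "ignore").decode("ascii")
    # Stem and remove stopwords.
    words = [STEMMER.stem(w) for w in title.lower().split() if w not in STOPWORDS]
    return " ".join(words)
```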


3.2 Preparing documents

Target documents were pre-processed in the same way as Wikipedia itself: we mapped characters to the Latin alphabet (if possible), split the text into paragraphs, stemmed words and discarded stopwords. The only difference was that words not present in any Wikipedia title were discarded.

3.3 Annotation

During annotation, we look for the concept titles tc inside the target document text, collect the concepts c behind them, and attach these as candidate concepts to each word w of tc, or more precisely to each text location l where w is part of tc. After the candidate concepts are scored, for each l we usually retain only the highest scoring c. The algorithm is outlined in Fig. 1; a sample annotation is shown in Fig. 2.

In the first step, we examine the whole text of the target document, searching for word sequences which exactly match a concept title. If a title is longer than 5 words or contains only a single number, it is ignored. While titles cannot cross paragraph boundaries, their words may be separated by punctuation. Paying attention to punctuation would make the search more complex and slower; and as Sect. 4 will show, multiword titles are identified fairly reliably.

In the second step, we gather the concepts pertaining to the previously identified titles, attaching them to the text locations occupied by the title words. This way, in "home computer" the concept [Home Computer] is bound to both "home" and "computer", because it will compete not only with other concepts represented by the whole phrase, but also with concepts referred to by the two words individually. Note that due to stemming, stopword removal and the break-up of disambiguation pages, the same title might represent several concepts, occasionally as many as 700 (Fig. 3). Titles might overlap ("operating system" and "system administrator"), which of course does not mean that only one of them should be a basis for annotation. Finally, a word appearing at various text locations sometimes refers to different sets of titles (as its meaning is indeed different in each case), if it is not always surrounded by the same words.

Figure 1. Outline of the annotation algorithm:
1. Identify concept titles present in the target document.
2. Gather concepts connected to these titles; attach them to the text locations occupied by the appropriate titles.
3. Score the collected concepts: Sc = Tc Ic Hc.
4. At each text location, rank concepts according to their scores, retaining only the top three of them.
5. At each text location, keep the highest ranked concept c; retain the other concepts only if their surface forms are longer than c's.
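Steps 1 and 2 of the outline amount to a sliding-window scan over each paragraph, trying every word sequence of up to five words against the title index. A minimal sketch under the data layout introduced earlier; the index titles_to_concepts, mapping a normalized title to the set of concept ids sharing it, is assumed to have been built as in Sect. 3.1, and the names are ours.

```python
MAX_TITLE_WORDS = 5   # titles longer than 5 words are ignored (step one)

def gather_candidates(paragraph_words, titles_to_concepts):
    """Steps 1-2 for one paragraph (titles cannot cross paragraph boundaries).

    paragraph_words    -- stemmed, stopword-filtered words of the paragraph
    titles_to_concepts -- dict: normalized title -> set of concept ids sharing that title
    Returns one dict per text location, mapping candidate concept id -> surface form.
    """
    candidates = [dict() for _ in paragraph_words]
    n = len(paragraph_words)
    for start in range(n):
        for length in range(1, MAX_TITLE_WORDS + 1):
            if start + length > n:
                break
            surface = " ".join(paragraph_words[start:start + length])
            if length == 1 and surface.isdigit():
                continue  # titles consisting of a single number are ignored
            for concept in titles_to_concepts.get(surface, ()):
                for pos in range(start, start + length):
                    # if a concept is reachable through several overlapping titles,
                    # keep the longest surface form at this location
                    prev = candidates[pos].get(concept, "")
                    if length >= len(prev.split()):
                        candidates[pos][concept] = surface
    return candidates
```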

Figure 2. Annotated paragraph from the Wikipedia article about the electron microscope (stemmed document words labeled with concepts such as [Ernst Ruska], [Max Knoll], [Electron] and [Microscope]).

In the third step, we score candidate concepts according to the following three criteria: (1) how specific or general the concept is; (2) how strongly it is connected to other candidate concepts; and (3) how similar its surface form is to its official title. We expect that concepts with high scores will be relevant to the primary document topic. The formula is:

S_c = T_c I_c H_c    (1)

Let us now examine the above factors in detail. Tc is the length (in words) of the longest surface form of concept c anywhere in the target document. Multiword titles are more dependable than those consisting of a single word (Fig. 4): the probability that their words were written after each other by pure chance is low, and since they refer to rather specific concepts, few (if any) alternative interpretations exist.

Ic, which attempts to capture how deeply concept c is integrated into the document content, is calculated as:

I_c = \sum_{p \in L_c} \frac{1}{|C_p|}, \qquad L_c = \{\, l \mid l \in M_r,\ r \in N_c \,\}    (2)

where Nc is the set of other candidate concepts semantically connected to c (linking to c or linked by c); Mr denotes the set of text locations attached to concept r; and Cp is the set of concepts attached to text location p.
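Under these definitions, Ic can be computed directly from the candidate assignments. A sketch under our earlier data layout (neighbours[c] gives the concepts linked to or by c; because locations_of only covers concepts recognized in the document, the loop is implicitly restricted to Nc):

```python
def integration_score(c, neighbours, locations_of, candidates):
    """I_c as in (2): related text locations, each weighted by 1/|C_p|."""
    related_locations = set()                        # L_c: union of M_r over r in N_c
    for r in neighbours.get(c, ()):
        related_locations.update(locations_of.get(r, set()))
    score = sum(1.0 / len(candidates[p]) for p in related_locations)
    return score if score > 0 else 0.001             # never let I_c cancel T_c and H_c
```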

To better understand the formula, for a moment ignore Cp and assume that each text location has merely one assigned concept, so that Ic specifies the number of text locations where concepts connected to c appear. Counting related text locations instead of concepts has several advantages. Frequently, other concepts having the same surface form as c pertain to the same domain, and thus link to each other (the concepts [B-Tree], [B*-Tree] and [R-Tree] all have the title "tree"). Moreover, a concept is significant no matter whether it is associated with only a few but frequently mentioned concepts, or with several but individually rare ones. Lastly, this way we prefer concepts with longer surface forms (as they cover more text locations), whose recognition is more reliable.

However, the "support" provided for c by some concept r attached to a text location p along with 500 other concepts is not as strong as if r were the only concept present there. In the latter case r does not have to compete, and it will become the final concept for p; but in the former, the probability of this outcome is roughly inversely proportional to |Cp|, so text locations should be weighted with its reciprocal. When c is not connected to any other candidate concept, we do not allow Ic to become zero, which would cancel the effects of Tc and Hc; instead, a small value (0.001) is assigned to it.

Also note that Sc does not include the number of times concept c occurs in the document. This is because the goal of Sc is to help decide which candidate concept is correct at a text location l; however, the candidates attached to l usually have the same surface forms throughout the target document, and therefore their occurrence counts are equal.

Hc represents the average degree of similarity between the official title and the surface forms of concept c:

H_c = 1 + \frac{1}{|M_c|} \sum_{p \in M_c} \frac{|F_{p,c} \cap W_c|}{|F_{p,c}|}    (3)

The meaning of Mc is the same as in (2), that is, the set of text locations covered by the surface forms of c; Wc is the set of words in the official title of c; and Fp,c is the set of words in the surface form of c at text location p. Hc expresses, on average, what percentage of the words referring to c throughout the document can be found inside Wc. If c is always mentioned by its official title, Hc is 2; if never, then 1. The idea behind Hc is that the official title, theoretically the best available phrase for c, is worth more than alternative ones. However, Hc does not penalize partial official titles: "franklin" and "benjamin franklin" support [Benjamin Franklin] equally strongly, as the intersection operation will not decrease the size of Fp,c.

In the fourth step, candidate concepts at each text location are ranked by Sc, and only the top three are retained. The fifth step decides which candidate concepts will become final: at each text location l we keep only the highest ranked concept, c1. However, if the surface form of the second and/or third best candidate is longer than that of c1, they will also be retained; as "skeletal muscle" demonstrated in Sect. 2, multiple annotations are not always wrong. By limiting the scope of selection to the top three candidates, we ensure that document words will never be labeled by more than three Wikipedia articles. This number might seem large, but consider that the average number of candidate concepts for a given text location is 35.73. Moreover, as Table 3 will illustrate, multiple final annotations are rare.

In the last, optional step, we re-score the final concepts, not only to estimate their importance with respect to the primary document topic, but also to provide a measurement of annotation quality (Sect. 4). The score is computed as:

Q_c = S_c \cdot \frac{1}{|M_c|} \sum_{p \in M_c} \prod_{w \in F_{p,c}} \log\frac{N}{f_w}    (4)

where N is a sufficiently large number (e.g. 1,000,000) and fw is the number of Wikipedia articles whose bodies contain word w. For each surface form c assumes in the target document, we compute the product of the inverse document frequencies of its words (with respect to Wikipedia, not the target document collection), then calculate the average of these products, multiplying Sc by the result.
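Putting the three factors, the top-three selection and the final re-scoring together, a compact sketch follows; it continues the notation of the previous sketches (reusing integration_score), reads Hc with the per-location fraction of surface-form words found in the official title, and assumes idf[w] = log(N / fw) has been precomputed over Wikipedia. All names are ours.

```python
import math

def official_title_similarity(c, locations_of, candidates, official_words):
    """H_c as in (3): one plus the average fraction of surface-form words in the official title."""
    locs = locations_of[c]                               # M_c
    total = 0.0
    for p in locs:
        surface_words = set(candidates[p][c].split())    # F_{p,c}
        total += len(surface_words & official_words[c]) / len(surface_words)
    return 1.0 + total / len(locs)

def concept_score(c, locations_of, candidates, neighbours, official_words):
    """S_c = T_c * I_c * H_c as in (1)."""
    t_c = max(len(candidates[p][c].split()) for p in locations_of[c])  # longest surface form
    i_c = integration_score(c, neighbours, locations_of, candidates)   # from the previous sketch
    h_c = official_title_similarity(c, locations_of, candidates, official_words)
    return t_c * i_c * h_c

def select_final(candidates, scores):
    """Steps 4-5: keep the top-3 candidates per location, then the best one plus any
    lower-ranked candidate whose surface form is longer than the best one's."""
    final = []
    for per_location in candidates:
        top = sorted(per_location, key=lambda c: scores[c], reverse=True)[:3]
        if not top:
            final.append([])
            continue
        best = top[0]
        keep = [best] + [c for c in top[1:]
                         if len(per_location[c].split()) > len(per_location[best].split())]
        final.append(keep)
    return final

def rank_score(c, s_c, locations_of, candidates, idf):
    """Q_c as in (4): S_c times the average, over M_c, of the product of idf values
    of the words of the surface form used at each location."""
    locs = locations_of[c]
    products = [math.prod(idf[w] for w in candidates[p][c].split()) for p in locs]
    return s_c * sum(products) / len(locs)
```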

4 Evaluation

We performed experiments on the Wikipedia snapshot taken in August 2006. It contained 3,166,967 entries: 1,281,395 (40.5%) redirects, 101,848 (3.2%) disambiguation pages, 156,385 (4.9%) category pages, and 1,627,339 (51.3%) regular articles. However, since the titles of most redirects differed only in case or syntactic style from the title of the target article, and due to the stemming and stopword removal performed as part of pre-processing (Sect. 3.1), we ended up with 1,728,641 concepts, accessible through only 2,184,226 titles. The average number of hyperlinks in article bodies referring to other articles was 6.3.

To evaluate annotation quality, we had to find a sizable document collection covering a wide range of topics, richly and consistently annotated by domain experts utilizing a large ontology; unfortunately, only Wikipedia itself comes close to fulfilling these requirements. It may seem unfair to work with documents which are part of the corpus from which the concepts are derived, therefore we took several precautions:

• When mining the concept network from Wikipedia, we extracted only the titles of articles; their bodies and the anchor texts of links to other articles were ignored.

• The target documents contained only Wikipedia article bodies; article titles and self-references were removed.

• Target documents were never matched against the titles of the Wikipedia article from which they were generated.

Table 1. Accuracy measurements.
Measurement   Macro average   Micro average
Al            0.9349          0.9326
Ar            0.9407          0.9475
As            0.9241          0.9057
At            0.8127          0.7995

Because of the widely ranging characteristics of Wikipedia content, not all regular articles were suitable for testing. Some articles were too short or carried too few links, while others discussed too many subjects at the same time, for example "Kennedy (surname)", listing people with the family name "Kennedy", or "1492", enumerating the events of 1492. Therefore we stipulated two conditions. First, the text/link ratio, the total number of words in the article divided by the number of words inside anchor texts, should be at least 5. Second, the article should have at least 10 valid links, that is, links whose anchor text (after stemming and stopword removal) is neither empty nor a single number. From the approximately 315,000 articles fulfilling these two criteria, 24,000 were randomly selected and annotated. On average, they contained 554 words, 35 paragraphs, and 34 references to other Wikipedia articles.

Some phrases in the original text were anchors for hyperlinks pointing to other Wikipedia articles, an ideal gold standard. Annotation accuracy, At, was thus the percentage of such anchors whose targets were correctly identified; note that we did not check the validity of non-anchor phrases. To gain better insight into the causes of erroneous or missing annotations, we introduced additional accuracy measurements, corresponding to the main processing stages:

• Al (location): percentage of anchor phrases which were detected as concept titles at all;

• Ar (recognition): percentage of these titles to which the correct concept was attached as a candidate;

• As (selection): percentage of these titles for which the correct concept was chosen as annotation.

If percentage values are represented in [0; 1], At can be computed as Al × Ar × As. Wikipedia articles usually place anchors only on the first occurrence of a concept, so we considered an annotation correct if its underlying phrase was an anchor for the given concept anywhere in the text. Table 1 shows the observed results, using micro (per document) and macro averaging. The algorithm achieved fairly high precision at each internal step, consistently above 90%, which led to a final annotation accuracy of around 81%.

Table 2 lists the candidate concepts attached to a text location inside the article about electron microscopes (Fig. 2) carrying the word "max" (followed by the word "knoll"), along with their scores and surface forms. Concept [Max Knoll] got the highest rank, with a far larger score than any of its competitors, because its surface form contained two words, and since it was linked to two other concepts appearing in the same document, [Ernst Ruska] and [Electron Microscope].

Table 2. Candidate concepts, with their scores and surface forms, attached to a text location containing the word "max" (followed by "knoll").
Sc        Concept                    Surface form
72.9598   Max Knoll                  max knoll
0.8947    Max (software)             max
0.7931    RE/MAX                     max
0.7201    Max, North Dakota          max
0.6584    Max Keeping                max
0.6229    MAX (band)                 max
0.4458    Max (Pokémon)              max
0.4344    MAX (Linux distribution)   max
0.4333    Max (24 character)         max
0.3665    MAX                        max

Accuracy values are probably better than what would be observed if we used documents fully annotated by their authors. Obviously, only anchor phrases can be checked for correct annotations, and because they typically consist of two or more words, their recognition reliability is higher. In addition, authors are familiar with the related Wikipedia articles, and therefore almost always mention these articles through their official titles, artificially increasing Al. For this reason we performed another experiment (described at the end of this section), measuring how well our method was able to predict important concepts, to ascertain that the high accuracy values are not the result of indiscriminate annotation.

Fig. 3 illustrates the required degree of disambiguation: the number of text locations where the final concept had to be selected from a given number of candidates. Locations inside anchor phrases are shown as crosses, and locations anywhere in the document as circles. Roughly 12% of words did not require any disambiguation, for 8% only two candidates existed, for 4% three, and for the remaining 76%, four or more. Although locations inside anchor phrases need slightly less disambiguation than generic ones, the difference is not relevant and quickly diminishes.

Annotation accuracy depends neither on the document length nor on the number of anchor phrases present, but it is strongly influenced by the size of the surface forms (Fig. 4). While for single-word titles it was 77%, an additional word increases the accuracy to 86%. Three-word titles performed worse, at 82%; the deterioration can be attributed to a very low Al, and is thus the fault of the recognition, not the disambiguation phase. Despite its simplicity, disambiguation performs surprisingly well: in the case of single-word titles it is responsible for 11% of the mistakes (roughly half of the total number), for two-word titles merely for 3%, and for longer titles it operates almost perfectly.
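Returning to the relation At = Al × Ar × As stated above: as a quick arithmetic check (ours, not the paper's), the macro-averaged values in Table 1 indeed multiply out to the reported overall accuracy:

A_t = A_l \times A_r \times A_s = 0.9349 \times 0.9407 \times 0.9241 \approx 0.8127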

Figure 3. Percentage of text locations (in the document and inside anchor phrases, represented by crosses and circles, respectively) with a given number of candidate concepts.

Figure 4. Accuracy vs. title length: percentage of correctly/incorrectly annotated anchor phrases as a function of their length in words, showing not located (see Al), not recognized (see Ar) and not selected (see As) concepts separately.

If we look at the frequency of n-word titles (Fig. 5), n being the most visible factor influencing annotation accuracy, we see that the distribution strongly differs for anchor and regular phrases. This explains why we would not get the same results if the target documents were fully hyperlinked. While for anchor phrases the numbers of single- and two-word titles are roughly equal, among generic phrases there are far fewer two-word titles, so the "lift" provided by longer titles probably would not be felt as powerfully.

Table 3 shows how extensively our method covered the target documents with annotations, and also how often the annotations overlapped. 11% of text locations did not receive any label, since the Wikipedia snapshot included mostly higher-level concepts; basic verbs and nouns ("put", "answer") and common grammatical elements ("consistently") were missing.

Table 3. Number/percentage of text locations annotated by a given number of concepts.
Level of coverage       Num. of text loc.   Perc.
not covered             1,479,003           11.06%
covered by 1 concept    11,178,061          83.55%
covered by 2 concepts   626,346             4.68%
covered by 3 concepts   94,859              0.71%

In the second experiment, we processed the same 24,000 documents as in the first, but now we also ranked the final concepts according to the scoring formula (4). We examined what percentage of the top N concepts were anchor phrases in the original Wikipedia article (Fig. 6). For N = 5 we achieved a precision of 72%, meaning that the concepts chosen to label the word sequences in the text are indeed strongly characteristic of the document topic. Recall values are low for two reasons. First, N ranged from 1 to 20, the upper limit still significantly smaller than the average number of hyperlinks, 34. Second, our method required a strict matching between concept titles and word sequences inside the target document (in order to reduce the number of concept candidates competing with each other), and therefore missed several hyperlinks.
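The measurement of the second experiment can be reproduced per document with a sketch like the following (names are ours): ranked_concepts is the list of final concepts for one document ordered by Qc, anchor_targets the set of concepts the article's authors actually linked, and the reported 72% corresponds to precision at n = 5, presumably averaged over the 24,000 documents.

```python
def precision_at_n(ranked_concepts, anchor_targets, n=5):
    """Fraction of the top-n ranked concepts that the authors linked in the original article."""
    top = ranked_concepts[:n]
    if not top:
        return 0.0
    return sum(1 for c in top if c in anchor_targets) / len(top)

def recall_at_n(ranked_concepts, anchor_targets, n=5):
    """Fraction of the authors' links recovered among the top-n ranked concepts."""
    if not anchor_targets:
        return 0.0
    return sum(1 for c in ranked_concepts[:n] if c in anchor_targets) / len(anchor_targets)
```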

5 Conclusion and future work

We described an algorithm which, by exploiting the Wikipedia link structure, is able to reliably label phrases of a document with the appropriate Wikipedia articles. As could be increased by exploiting Wikipedia category assignments, the supplementary descriptions in article titles, or part-of-speech tags in the target documents. If As becomes sufficiently high, the set of candidate concepts can be expanded by allowing partial title matches ("nobel" retrieving "Nobel Prize"), improving Ar. Our vision is to be able to trace the author's train of thought phrase by phrase in a concept space.

References

[1] S. F. Adafre and M. de Rijke. Discovering missing links in Wikipedia. In Proc. of the 3rd Int'l Workshop on Link Discovery, pages 90–97, 2005.

[2] E. Agirre and D. Martinez. Knowledge sources for word sense disambiguation. In Proc. of the 4th Int'l Conf. on Text, Speech and Dialogue, pages 1–10, London, United Kingdom, 2001. Springer-Verlag.

[3] P. Buitelaar and S. Ramaka. Unsupervised ontology-based semantic tagging for knowledge markup. In Proc. of the Workshop on Learning in Web Search at the Int'l Conf. on Machine Learning, 2005.

[4] R. Bunescu and M. Pasca. Using encyclopedic knowledge for named entity disambiguation. In Proc. of the 11th Conf.
