Using Structured Text for Large-Scale Attribute Extraction ∗

∗ Contributions made during an internship at Google.
Sujith Ravi
University of Southern California, Information Sciences Institute, Marina del Rey, California 90292
[email protected]

Marius Paşca
Google Inc., 1600 Amphitheatre Parkway, Mountain View, California 94043

ABSTRACT

We propose a weakly-supervised approach for extracting class attributes from structured text available within Web documents. The overall precision of the extracted attributes is around 30% higher than with previous methods operating on Web documents. In addition to attribute extraction, this approach also automatically identifies values for a subset of the extracted class attributes.
Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; I.2.7 [Artificial Intelligence]: Natural Language Processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Experimentation

Keywords
Weakly-supervised information extraction, class attribute extraction, knowledge acquisition, structured text collections
1. INTRODUCTION
Information extraction systems typically exploit unstructured text within Web documents to acquire data for various applications. Despite the unstructured nature of documents available on the Web, some inherent structure often exists within the text. For example, a website listing telephone directory information contains recognizable fields such as a person's name and phone number. Web documents from encyclopedic resources such as Wikipedia [13] pertaining to specific entities (e.g., Italy) contain emphasized fields within the text (e.g., capital, official languages, population,
culture, etc.) that signal important information regarding the entity. If the same set of fields appears within many documents related to multiple instances (e.g., Italy, France, Japan, etc.) representing the same class (e.g., Country), then these fields may be important properties of the particular class, and they may be potential candidates for attribute extraction. Sometimes, the text may also contain field tuples (e.g., ⟨capital, Rome⟩ for Italy), in which case they may provide information for extracting the value (e.g., Rome) of an attribute (e.g., capital) of a particular instance (e.g., Italy) [16].

Existing extraction systems use specific hand-crafted patterns [8] or schemas [6] to discover relational instances or structured tuples like ⟨class, attribute⟩ pairs from unstructured text. Most of these approaches are constrained to specific domains or particular target relations [3] specified in advance, and therefore do not gracefully scale to more generic classes and applications for other domains. A previous approach for attribute extraction from Web document collections looks at the immediate textual vicinity of instances (e.g., Austria) from a target class (e.g., Country) to identify potential attributes of the class [10]. However, the attributes that can be extracted using this technique are limited to specific types, following certain patterns (e.g., X-of-Y patterns [15]).

Sources such as Wikipedia or the CIA FactBook constitute appealing resources for obtaining documents related to specific instances of a class, and these often contain more than just attribute information. Some of these documents may provide additional information, such as relations between different entities associated with the particular instance. For example, the Wikipedia document pertaining to Austria also specifies that its capital is Vienna. Together, the pair ⟨capital, Vienna⟩ constitutes a piece of factual information for the instance Austria of the class Country. If we can find many such matching tuples for various instances of the class, then this would help us in mining facts related to the class on a large scale.

But open-domain value extraction is a difficult problem, especially from noisy data like that available on the Web. There have been approaches in the past [3, 2] for extracting factual relations between objects, but many of them are geared towards extracting facts that correspond to answers to factual questions (e.g., "What is the capital of Austria?"), often following hand-crafted patterns. For any attribute extracted from a Web document relevant to a specific class instance, it is difficult to predict whether the source document also contains a matching value for the particular attribute. Even if the document contains this information, finding the actual value within the text is non-trivial.
Input:
  target class C, available as a set of instances {I}
  small set of seed phrases {K} for the target class C, expected in output
Output:
  ranked list of attributes for C
Variables:
  {D} = document collection for target class C
  {P} = pool of candidate phrases
  {P_I} = set of candidates P ∈ {P} that were extracted for instance I
  {P_D} = set of candidates P ∈ {P} that were extracted from a specific document D ∈ {D}
  V_P = signature vector for candidate phrase P
  {V} = hierarchical pattern vectors, one for every candidate phrase P ∈ {P}
  SCORE(P) = score for candidate phrase P ∈ {P}
  V_ref = reference vector containing patterns from seed attributes

0.  {P} = ∅; {V} = nil; V_ref = nil
1.  For each instance I ∈ {I}:
2.      Query a search engine using the phrase I and add documents from the top N search results to {D}
3.  For each document D ∈ {D}:
4.      Scan D and select candidate phrases occurring within emphasized HTML tags; add these to {P_D}
5.      For each candidate phrase P ∈ {P_D}:
6.          If ({P_D} ∩ {K} != nil) ∧ (P is not a long phrase):
7.              Add P to the candidate pool {P}
8.              Retrieve the corresponding HierarchicalPattern(P) from document D
9.              V_P = find/create entry for P in {V} using the pattern
10. For each candidate phrase P ∈ {P}:
11.     For each seed attribute K ∈ {K}:
12.         If (P == K):
13.             V_P = find entry for P in {V}
14.             Merge elements (patterns) from V_P into the reference vector V_ref
15. For each candidate phrase P ∈ {P}:
16.     V_P = find entry for P in {V}
17.     SCORE(P) = ComputeSimilarity(V_P, V_ref)
18. Return sorted list {P, SCORE(P)}
Figure 1: Weakly-Supervised Attribute Extraction Algorithm using Structured Text

In this paper, we present a new approach for extracting attributes from Web documents, making use of the existing structure (HTML tag hierarchy) within text. This method scales to a large number of classes, and the attribute extraction is not restricted to any pre-determined pattern types (e.g., X-of-Y patterns). Instead, we automatically learn a wide variety of patterns for certain seed attributes of the target class provided as input, and use these to extract other attributes of the same class. The use of structured text for attribute extraction has another advantage: the weakly supervised algorithm used for attribute extraction can be extended to also perform value extraction, thus yielding ⟨attribute, value⟩ tuples for various class instances.
2. WEAKLY-SUPERVISED EXTRACTION FROM STRUCTURED TEXT
2.1 Attribute Extraction

The weakly supervised attribute extraction algorithm used in our approach is shown in Figure 1. Given a target class C, the algorithm retrieves documents from the Web that are relevant to instances representative of the class in Steps 1-2, then selects candidate attributes (in Steps 3-7) and corresponding patterns (in Steps 8-9) from the retrieved documents. Steps 10-14 are aimed at extracting patterns corresponding to a small set of seed attributes (5 known attributes per class), which guide the extraction of other attributes from the same class. The algorithm finally ranks the selected candidate attributes for the particular class (Steps 15-18).

[Figure 2: Fetching Web documents relevant to attribute extraction for a given class. For a target class C given as a set of instances {I} (e.g., Actor = {Mel Gibson, Sharon Stone, Tom Cruise, Will Smith, Johnny Depp, ..., Marlon Brando}): (1) each class instance I is provided as a query to a search engine; (2) the top N search results are filtered to HTML documents; (3) the documents are added to the collection {D} for C; the process then repeats with the next class instance.]
2.1.1 Collecting Data from the Web

Each target class C is available as a set of representative entities or instances {I}. A large collection of instances (e.g., Honda Accord, Audi A4, Mini Cooper, Ford Mustang, etc.) sharing the characteristic properties of a particular class (e.g., CarModel) can be mined from text collections available on the Web using various techniques that have been explored before [1, 14]. Given a particular class, a general-purpose search engine retrieves Web documents that are relevant to instances from the given class. In Steps 1-2 of the algorithm from Figure 1, each instance phrase I from the set representing the target class C is provided as a search query, and the top N documents returned for the query are fetched locally. Since we wish to exploit the structure within the document text, we limit our extraction to documents containing HTML tag information; documents with extensions such as .pdf, .doc, .txt, etc. are ignored. Figure 2 illustrates via an example how relevant documents are collected from the Web to perform attribute extraction for a particular target class (Actor).
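The data-collection loop of Steps 1-2 can be sketched as follows. The search engine is abstracted behind an injected search_top_urls callable (a hypothetical helper, since the paper does not name a specific search API), and the HTML-only restriction is approximated by an extension check plus a Content-Type check:

# Steps 1-2: for each class instance, fetch the top-n search results and keep
# only HTML documents. search_top_urls is a hypothetical injected helper that
# returns result URLs for a query; plug in whatever search API is available.
from typing import Callable
import requests

def collect_documents(instances: list[str],
                      search_top_urls: Callable[[str, int], list[str]],
                      n: int = 200) -> dict[str, list[str]]:
    docs_by_instance: dict[str, list[str]] = {}
    for instance in instances:
        html_docs = []
        for url in search_top_urls(instance, n):
            # Skip documents with non-HTML extensions (.pdf, .doc, .txt, ...).
            if url.lower().endswith((".pdf", ".doc", ".txt")):
                continue
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            if "text/html" in resp.headers.get("Content-Type", ""):
                html_docs.append(resp.text)
        docs_by_instance[instance] = html_docs
    return docs_by_instance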
2.1.2 Selecting Candidate Attributes

Steps 3 and 4 in Figure 1 scan the retrieved documents and collect a pool of candidate phrases that could be potential attributes of the given class. For this purpose, we identify fields that are recognizable within a particular document, utilizing the structure information available in the form of HTML tags. For example, a document related to a particular entity (e.g., Oracle) from the given class (e.g., Company) might contain certain phrases (such as ceo, stock price, headquarters, etc.) which convey important information regarding the particular entity. These phrases are usually emphasized within the structured text using certain HTML tags such as bold (<b>), italics (<i>), table columns (<td>), table headers (<th>), list elements (<li>), etc. We extract such emphasized phrases from each document and add them to our pool of candidate attributes {P}. Figure 3 shows how the candidates headquarters and net income (phrases enclosed within HTML tags <th>...</th>) are extracted from a Wikipedia document related to Intel, which is an instance of the class Company.

[Figure 3: Data flow diagram for attribute extraction and value extraction. Instances of the class Company (Delphi, Apple Computer, Honda, Motorola, Google, Coca Cola, Time Warner, Intel, ...) drive document collection (Steps 1-2); an example document D retrieved for the instance Intel feeds candidate selection (Steps 3-7) and the hierarchical pattern vectors (one per candidate phrase P); the value "Santa Clara, California" is extracted for the candidate headquarters.]

Filtering Out Spurious Candidates: Due to the ambiguous nature of some of the instance phrases, some of the retrieved documents may be irrelevant. For example, the instance Chicago is associated with multiple classes (City and Movie), and therefore attributes relevant for the class Movie might be added to the candidate pool for the class City, and vice versa. To avoid extracting many such undesirable attributes from random documents, Step 6 filters out documents that do not contain fields matching at least one of the seed attributes in {K}. We also restrict our candidate selection to relatively short phrases, i.e., containing five words or less.

Extracting Hierarchical Patterns: For every candidate phrase generated, Step 8 of the algorithm from Figure 1 extracts the corresponding hierarchical tag pattern associated with it from the document. There have been many approaches in the past which utilize the HTML structure within documents for information extraction, but most of these approaches tend to focus only on specific elements like HTML tables within Web documents [4]. For a particular candidate P selected from document D, its hierarchical tag structure is composed of the HTML elements corresponding to its tag, parent tag, grandparent tag, and so on (all the way to the top of the document, within 10 previous tag levels). This tag structure, along with other information (the domain name of the website from which D was retrieved), constitutes a pattern. Figure 3 illustrates how the pattern corresponding to the candidate headquarters is extracted from a Wikipedia document for the instance Intel.
All such patterns extracted for candidate P are then collected into a pattern vector V_P (Step 9 in Figure 1). The pattern vector represents the fingerprint or signature for a particular candidate phrase, and overlap between different pattern vectors indicates that the candidates corresponding to these vectors share common patterns.
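A minimal sketch of Steps 4 and 8-9 using BeautifulSoup is shown below; the exact pattern serialization (here, the source domain plus the tag path from the node through up to 10 ancestor levels, joined with '/') is our assumption, as the paper does not spell out an encoding:

# Steps 4 and 8-9: collect emphasized phrases as candidate attributes and
# record each one's hierarchical tag pattern.
from collections import defaultdict
from urllib.parse import urlparse
from bs4 import BeautifulSoup

EMPHASIS_TAGS = ["b", "i", "td", "th", "li"]
MAX_PHRASE_WORDS = 5   # "relatively short phrases" (Step 6)
MAX_TAG_LEVELS = 10    # tag, parent, grandparent, ... (Step 8)

def add_candidates(html: str, url: str,
                   pattern_vectors: defaultdict[str, set]) -> None:
    """Merge one document's (phrase, pattern) pairs into the vectors {V}."""
    soup = BeautifulSoup(html, "html.parser")
    domain = urlparse(url).netloc
    for node in soup.find_all(EMPHASIS_TAGS):
        phrase = node.get_text(" ", strip=True).lower()
        if not phrase or len(phrase.split()) > MAX_PHRASE_WORDS:
            continue
        ancestors = [p.name for p in node.parents if p.name != "[document]"]
        pattern = domain + ":" + "/".join([node.name] + ancestors[:MAX_TAG_LEVELS])
        pattern_vectors[phrase].add(pattern)

# Usage: vectors = defaultdict(set), then call add_candidates(html, url, vectors)
# once per retrieved document, accumulating the pattern vectors {V}.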
Class             Examples of Instances
Actor             Mel Gibson, Sharon Stone, Tom Cruise, Sophia Loren, Will Smith, Johnny Depp, Kate Hudson
AircraftModel     Boeing 747, Boeing 737, Airbus A380, Embraer 170, ATR 42, Boeing 777, Douglas DC 9, Dornier 228
Award             Nobel Prize, Pulitzer Prize, Webby Award, National Book Award, Prix Ars Electronica, Fields Medal
BasicFood         fish, turkey, rice, milk, chicken, cheese, eggs, corn, beans, ice cream
CarModel          Honda Accord, Audi A4, Subaru Impreza, Mini Cooper, Ford Mustang, Porsche 911, Chrysler Crossfire
CartoonCharacter  Mickey Mouse, Road Runner, Winnie the Pooh, Scooby-Doo, Homer Simpson, Bugs Bunny, Popeye
CellPhoneModel    Motorola Q, Nokia 6600, LG Chocolate, Motorola RAZR V3, Siemens SX1, Sony Ericsson P900
ChemicalElem      lead, silver, iron, carbon, mercury, copper, oxygen, aluminum, sodium, calcium
City              San Francisco, London, Boston, Ottawa, Dubai, Chicago, Amsterdam, Buenos Aires, Paris, Atlanta
Company           Adobe Systems, Macromedia, Apple Computer, Gateway, Target, Netscape, Intel, New York Times
Country           Canada, France, China, Germany, Australia, Lichtenstein, Spain, South Korea, Austria, Taiwan
Currency          Euro, Won, Lire, Pounds, Rand, US Dollars, Yen, Pesos, Kroner, Kuna
DigitalCamera     Nikon D70, Canon EOS 20D, Fujifilm FinePix S3 Pro, Sony Cybershot, Nikon D200, Pentax Optio 430
Disease           asthma, arthritis, hypertension, influenza, acne, malaria, leukemia, plague, tuberculosis, autism
Drug              Viagra, Phentermine, Vicodin, Lithium, Hydrocodone, Xanax, Vioxx, Tramadol, Allegra, Levitra
Empire            Roman Empire, British Empire, Ottoman Empire, Byzantine Empire, German Empire, Mughal Empire
Flower            Rose, Lotus, Iris, Lily, Violet, Daisy, Lavender, Magnolia, Tulip, Orchid
Holiday           Christmas, Halloween, Easter, Thanksgiving, Labor Day, Independence Day, Lent, Yule, Hanukkah
Hurricane         Hurricane Katrina, Hurricane Ivan, Hurricane Dennis, Hurricane Wilma, Hurricane Frances
Mountain          K2, Everest, Mont Blanc, Table Mountain, Etna, Mount Fuji, Mount Hood, Annapurna, Kilimanjaro
Movie             Star Wars, Die Hard, Air Force One, The Matrix, Lord of the Rings, Lost in Translation, Office Space
NationalPark      Great Smoky Mountains National Park, Grand Canyon National Park, Joshua Tree National Park
NbaTeam           Utah Jazz, Sacramento Kings, Chicago Bulls, Milwaukee Bucks, San Antonio Spurs, New Jersey Nets
Newspaper         New York Times, Le Monde, Washington Post, The Independent, Die Welt, Wall Street Journal
Painter           Leonardo da Vinci, Rembrandt, Andy Warhol, Vincent van Gogh, Marcel Duchamp, Frida Kahlo
ProgLanguage      A++, PHP, C, C++, BASIC, JavaScript, Java, Forth, Perl, Ada
Religion          Islam, Christianity, Voodoo, Buddhism, Judaism, Baptism, Taoism, Confucianism, Hinduism, Wicca
River             Nile, Mississippi River, Hudson River, Colorado River, Danube, Amazon River, Volga, Snake River
SearchEngine      Google, Lycos, Excite, AltaVista, Baidu, HotBot, Dogpile, WebCrawler, AlltheWeb, Clusty
SkyBody           Earth, Mercury, Saturn, Vega, Sirius, Polaris, Pluto, Uranus, Antares, Canopus
Skyscraper        Empire State Building, Sears Tower, Chrysler Building, Taipei 101, Burj Al Arab, Chase Tower
SoccerClub        Chelsea, Real Madrid, Juventus, FC Barcelona, AC Milan, Aston Villa, Real Sociedad, Bayern Munich
SportEvent        Tour de France, Super Bowl, US Open, Champions League, Bundesliga, Stanley Cup, FA Cup
Stadium           Stade de France, Olympic Stadium, Wembley Stadium, Soldier Field, Old Trafford, Camp Nou
TerroristGroup    Hezbollah, Khmer Rouge, Irish Republican Army, Shining Path, Tupac Amaru, Sendero Luminoso
Treaty            North Atlantic Treaty, Kyoto Protocol, Louisiana Purchase, Montreal Protocol, Berne Convention
University        University of Oslo, Stanford, CMU, Columbia University, Tsing Hua University, Cornell University
VideoGame         Half Life, Final Fantasy, Grand Theft Auto, Warcraft, Need for Speed, Metal Gear, Gran Turismo
Wine              Port, Champagne, Bordeaux, Rioja, Chardonnay, Merlot, Chianti, Cabernet Sauvignon, Pinot Noir
WorldWarBattle    D-Day, Battle of Britain, Battle of the Bulge, Battle of Midway, Battle of the Somme, Battle of Crete

Table 1: Target classes used for attribute extraction and examples of instances

2.1.3 Ranking Class Attributes

Every candidate extracted for an instance I represents a potential attribute of the particular class C to which I belongs. In Figure 3, the candidates headquarters and net income extracted for the instance Intel are attributes of the class Company. The selection process builds a large pool of such candidate attributes for each target class. Ranking the attributes within the candidate pool can be done in multiple ways:

• Frequency-based ranking: One method is to rank the selected candidates for a particular class by frequency [10], as shown in Equation 1. Each candidate P extracted for a class C is assigned a score based on the number of instances I from C that produced P. A candidate that is produced by more instances from the same class receives a higher score.

SCORE_freq(P) = |I : (I ∈ C) ∧ (I produces P)| / |I : (I ∈ C)|    (1)

• Hierarchical pattern-based ranking: Pattern vectors V_P are populated for every phrase P in the candidate pool {P}, for a particular class C. The individual patterns within each vector are weighted elements (the frequency of occurrence within documents in the collection). Patterns from vectors corresponding to the candidate phrases that match any of the seed attributes (e.g., headquarters) are then aggregated in Steps 10-14 (Figure 1) to form a reference vector V_ref for the class (e.g., Company). Similarly to [9], the relevance of each candidate attribute P for the class is then computed as the similarity score of its pattern vector V_P with respect to the reference vector V_ref for the class (Steps 15-17 in Figure 1). The similarity score is obtained by computing the Jaccard coefficient between the two vectors, as shown in Equation 2. A higher similarity score indicates that it is more likely for patterns within V_P to match those of the seed attribute patterns, which in turn suggests that the candidate phrase P could potentially be a good attribute for the given class.

SCORE_hier(P) = |V_P ∩ V_ref| / |V_P ∪ V_ref|    (2)
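A minimal sketch of Steps 10-18 follows, under the simplifying assumption that each pattern vector is reduced to a set of patterns; the frequency weights mentioned above would call for a generalized (weighted) Jaccard rather than the plain set version of Equation 2:

# Steps 10-18: build the reference vector from the seed attributes' patterns,
# then rank every candidate by Jaccard overlap with it (Equation 2).
def score_candidates(pattern_vectors: dict[str, set[str]],
                     seed_attributes: set[str]) -> list[tuple[str, float]]:
    # Steps 10-14: aggregate the patterns of candidates matching seed attributes.
    v_ref: set[str] = set()
    for seed in seed_attributes:
        v_ref |= pattern_vectors.get(seed, set())
    # Steps 15-17: Jaccard coefficient of each candidate's vector against V_ref.
    scored = []
    for phrase, v_p in pattern_vectors.items():
        union = v_p | v_ref
        scored.append((phrase, len(v_p & v_ref) / len(union) if union else 0.0))
    # Step 18: return the candidates ranked by score.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)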
The framework described in Figure 1 (and illustrated via an example in Figure 3) is used to build two separate systems for attribute extraction: (1) using frequency-based ranking (which is the baseline for our comparison), and (2) using hierarchical pattern-based ranking. All the previous steps in attribute extraction (including candidate selection) are the same for both systems. In addition, the candidates may also pass through some post-processing steps to obtain the final ranked list of class attributes. Attributes that are added simultaneously to the candidate pools for multiple classes are less useful in identifying characteristic properties that are unique to a particular class. These may include generic attributes like history, definition, etc., or phrases that commonly occur in boilerplate text within Web documents (e.g., contents, contact information, help, etc.). Consequently, if a candidate is extracted simultaneously for more than half the classes from the target class set, the candidate is removed from consideration, as in the sketch below.
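The cross-class filter can be sketched as follows, assuming the per-class candidate pools are held in a dictionary:

# Drop any candidate extracted for more than half of the target classes;
# such phrases (history, contents, help, ...) rarely characterize one class.
def drop_generic_candidates(pools: dict[str, set[str]]) -> dict[str, set[str]]:
    """pools maps each class name to its pool of candidate attribute phrases."""
    threshold = len(pools) / 2
    all_phrases = set().union(*pools.values()) if pools else set()
    generic = {
        phrase for phrase in all_phrases
        if sum(phrase in pool for pool in pools.values()) > threshold
    }
    return {cls: pool - generic for cls, pool in pools.items()}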
2.2 Value Extraction as Part of Attribute Extraction

The approach presented above is generic and can be used to extract attributes for a large number of classes, with minimal supervision. We propose to leverage the attribute extraction framework (described in Section 2.1) to perform value extraction, using a simple extension. For a particular document D that was retrieved for class instance I, we find the most frequent pattern (HTML hierarchical tag pattern within the pattern vector V_P) corresponding to every attribute P that was extracted from D. This information is subsequently used for extracting matching values for P.

For many class instances, structured text retrieved from encyclopedic sources contains many recognizable field tuples, such as ⟨attribute, value⟩ pairs associated with the particular instance or entity described in the document. Usually, the fields constituting each such tuple occur in the vicinity of one another within the document. Therefore, documents containing attributes as fields might also contain potential matching values in nearby context. Given an attribute P, the document D from which it was extracted, and the corresponding instance I ∈ C, the value extraction process collects the text VAL_P (from the tag element) immediately following the hierarchical tag pattern for P inside D, and returns the triplet ⟨I | P ⇒ VAL_P⟩ as factual information for the class instance (C: I). In Figure 3, for the instance Intel of the class Company, the text "Santa Clara, California" is matched as the value corresponding to the candidate attribute headquarters, by first locating the position of the attribute using the hierarchical pattern, and then selecting the immediately following element (in this case, <td>...</td>). Using this integrated approach, we can inexpensively identify matching values for some of the correctly identified attributes, if this information is present in the document. Some example values extracted in this manner are shown below.
Country: ⟨Austria | population ⇒ 2007 estimate = 8,316,487⟩
Wine: ⟨Pinot Gris | color ⇒ White⟩
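A sketch of this value extraction step is given below, simplified to the tabular case of Figure 3; the full method first pins down the attribute occurrence using its most frequent hierarchical tag pattern, which is omitted here in favor of a direct text match on the emphasized cell:

# Simplified value extraction: find the cell whose text matches the attribute,
# then take the text of the element immediately following it.
from typing import Optional
from bs4 import BeautifulSoup

def extract_value(html: str, attribute: str) -> Optional[str]:
    """Return the text of the element immediately following the attribute's cell."""
    soup = BeautifulSoup(html, "html.parser")
    for cell in soup.find_all(["th", "td", "b", "li"]):
        if cell.get_text(strip=True).lower() == attribute.lower():
            nxt = cell.find_next_sibling()   # e.g., the <td> right after a <th>
            if nxt is not None:
                return nxt.get_text(" ", strip=True)
    return None

# E.g., on a Wikipedia-style row <th>Headquarters</th><td>Santa Clara,
# California</td>, extract_value(doc, "headquarters") yields the value text,
# producing the triplet ⟨Intel | headquarters ⇒ Santa Clara, California⟩.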
3. EXPERIMENTAL SETTING
3.1 Attribute Extraction

Target Classes: A total of 40 target classes [9] are provided as input in the form of sets of instances {I} in the experiments. Table 1 illustrates target classes and example instances used in the attribute extraction experiments. The number of instances per class varies from 25 (for SearchEngine) to 1500 (for Actor). The classes also vary widely with respect to the domain or genre (e.g., Entertainment for Movie, Championship Leagues for SportEvent) and the types of instances (e.g., Boeing or Airbus for AircraftModel) comprising the particular class.

Seed Attributes: As part of the weak supervision required for this approach, we provide a small set of 5 seed (known) attributes per class. These attributes are chosen independently, and are provided as input at the beginning of attribute extraction. An example of a seed attribute set is {headquarters, stock price, ceo, location, chairman} for the class Company.

Evaluation Methodology: Regardless of the differences in methodologies used for attribute extraction, all experiments produce a ranked list of candidate attributes for each of the 40 target classes. Multiple lists of attributes (one list per class) from all experiments are evaluated in the same manner, using precision as the metric. To avoid any undesirable bias towards higher-ranked attributes during evaluation, each list is sorted alphabetically into a merged list. Each attribute in this merged list is assigned a correctness label by a human judge, as shown in Table 2. The strategy for assigning correctness labels is similar to the assessment method used in previous work [10]. An attribute is "vital" if it must be present in an ideal list of attributes for the target class; "okay" if it provides useful but non-essential information; and "wrong" if it is incorrect.
Label   Value   Examples of Attributes
vital   1.0     Actor: date of birth, Flower: botanical name
okay    0.5     Company: vision, NationalPark: reptiles
wrong   0.0     BasicFood: low carb, Movie: fiction
Table 2: Correctness labels for attribute quality assessment

All attributes extracted for each of the 40 target classes are assigned correctness labels. These correctness labels are then mapped to quantitative values as shown in Table 2. The overall precision score at some rank N for a particular class C is then computed as the sum of the correctness values of the first N candidates in the ranked list of attributes extracted for C, divided by N.

Parameter Settings: There are a few parameters in the attribute extraction pipeline that can be used to tweak different aspects of the system. Varying one or more of these parameters may have a bearing on the final results (i.e., class attributes) generated by the system:

Top Documents [N = 50 or 200 (default = 200)]: During data collection, we use a search engine to retrieve relevant documents (corresponding to the top N search results) for member instances of the target class. We can vary the number of top results retrieved per class instance by adjusting this parameter, and observe its effect on the attribute extraction results.

WordNet Pruning [WN = on/off (default = on)]: Since the candidate attributes are extracted from noisy Web data, spurious entries (e.g., "1989", "3.2-mp", etc.) could make it into the final ranked list of class attributes. During the post-processing stage in the pipeline, we can filter out candidates that do not contain meaningful English words or terms, using a general-purpose lexical resource such as WordNet [7].

Comparative Attribute Extraction Runs: We compare three different systems for attribute extraction in our experiments:

(B) Baseline system with structured text: The baseline system is implemented using the extraction framework shown in Figure 1, and uses the structured text within relevant Web documents (N = top 200 documents per class instance) to select candidate attributes. The selected candidates are then ranked using frequency-based ranking (Equation 1 from Section 2.1.3). Generic attributes are removed and WordNet pruning is applied during the post-processing stage.

(S_hier) System using hierarchical patterns with structured text: The second system is identical to the baseline system, except that it uses the hierarchical tag patterns within the structured text to rank selected candidates (Equation 2 from Section 2.1.3). WordNet pruning and generic attribute removal are similarly applied. This system is also capable of performing value extraction.

(D_patt) Previous approach with unstructured text: We also implemented a third system using the attribute extraction strategy described in previous work [10]. This approach uses hand-crafted patterns (such as X-of-Y patterns) to extract attributes from unstructured text within Web documents. To ensure a faithful comparison, we used the same normalized ranking function introduced in [10] to rank the candidate attributes extracted for each class, since it was shown to perform better than a frequency-based ranking function.
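The precision-at-N metric defined above can be computed directly from the judged labels; a minimal sketch, with the label values taken from Table 2:

# Precision at rank N for one class: the sum of the correctness values of the
# first N candidates in the ranked attribute list, divided by N.
CORRECTNESS = {"vital": 1.0, "okay": 0.5, "wrong": 0.0}

def precision_at(labels: list[str], n: int) -> float:
    """labels: human judgments for one class's ranked attribute list."""
    return sum(CORRECTNESS[label] for label in labels[:n]) / n

# Example: precision_at(["vital", "okay", "wrong", "vital"], 4)
# evaluates to (1.0 + 0.5 + 0.0 + 1.0) / 4 = 0.625.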
The attributes extracted by all these systems are evaluated in the same manner, following the methodology described earlier in this section.

[Table 4: Precision (at various ranks N) of attributes extracted by the three systems (B, D_patt, S_hier) for a couple of classes, as well as average precision over all classes; the relative precision improvement and error reduction (S_hier over D_patt) at various ranks are shown in the last two rows.]

Class    Top Extracted Attributes
Actor    awards, height, career, age, birthplace, born, biography, quotes, birth name, personal life
Painter  biography, paintings, nationality, bibliography, birthplace, works, exhibitions, born, died, prints

Table 3: Top attributes extracted (using system S_hier) from structured text in Web documents for a few target classes
3.2 Value Extraction

The value extraction procedure runs simultaneously with attribute extraction, as described earlier in Figure 3. For every candidate attribute P extracted from text for a particular instance I ∈ class C, we obtain some matching text VAL_P that is extracted as the value for P.

Target Classes: Since the evaluation of value extraction results is time-consuming, we consider a smaller set of classes and instances than in the earlier experiments - i.e., only 20 out of the 40 target classes that were used for attribute extraction are used for value evaluation. For each of these 20 classes, 5 instances are selected randomly from its member set to represent the class.

Attributes: The text within the extracted element VAL_P would constitute a meaningful value for the candidate P only if P can be considered a valid attribute of the class C for which it was extracted. This is a minimum requirement for our value extraction approach to work. Consequently, from the ranked list of candidates extracted for a target class, we pick a set of 5 high quality attributes for that class, and use only these in our value evaluation experiments. For example, candidates such as climate, currency, population, president, and religion are considered high quality attributes for the class Country.

Evaluation Methodology: Each instance (I ∈ C) is combined with (1) an attribute P (one among the 5 attributes selected for class C), and (2) the value VAL_P extracted for the attribute P from a document relevant to instance I, generating a triplet ⟨I | P ⇒ VAL_P⟩ for the class instance (C : I). A total of 500 such triplets are generated for the 20 classes (5 attributes per class, 5 instances per class, 1 value per attribute-instance pair). Each of these triplets is manually evaluated by a human judge and assigned a correctness label. A triplet is assigned the label "correct" if it forms a valid tuple - i.e., if the real value matching attribute P for the class instance I exists within VAL_P; otherwise the triplet is labeled "incorrect". For example, the triplet ⟨Italy | capital ⇒ Rome⟩ is a valid tuple for the class Country. The precision scores for value extraction are then obtained by computing the number of triplets marked correct, divided by the total number of triplets.
4. EVALUATION RESULTS

4.1 Precision of Extracted Attributes

Every system used in the attribute extraction experiments produces a ranked list of class attributes corresponding to each target class. Table 3 shows some of the top attributes extracted for a few classes, using the seed-based approach with hierarchical tag patterns (S_hier). Many candidates ranked among the top list, such as awards, height, age, birthplace, etc. for the class Actor, and atomic number, symbol, atomic weight, etc. for the class ChemicalElement, represent vital characteristic properties of the particular class. But the list also contains a few wrong attributes, e.g., stage and hurricane for the class Hurricane, and prints for the class Painter. The list may also contain a few rather generic attributes, such as career or personal life, mixed with other more specific attributes of the same class (Actor).

Table 4 displays some of the results from our experiments with respect to precision scores for the top N attributes extracted by the different systems. Individual precision scores for a couple of the target classes (Actor, BasicFood), as well as the overall average precision (combining scores from all the target classes) for each system, are shown for comparison. It can be seen from Table 4 that the precision of the different systems at a particular rank N varies with respect to the class. For some classes, like Actor, precision at lower ranks (N=5) is higher for the baseline system (B) when compared to the previous approach (D_patt), whereas for other classes, like BasicFood, the phenomenon is reversed. However, the system S_hier introduced in this paper, which uses hierarchical tag patterns from structured text, consistently outperforms the other two systems (B and D_patt), both in terms of individual precision scores for most of the classes and in terms of average precision over all the target classes, across a wide range of ranks. For instance, the system S_hier exhibits a 24% improvement in precision (which maps to around a 17% reduction in the error rate) at rank 50, when compared directly to results from the previous approach. Secondly, the results also indicate that ranking using hierarchical tag patterns (S_hier) produces much better attributes than a frequency-based ranking approach (baseline system B).

[Figure 4: Relative system performance for attribute extraction, in terms of average precision over all the classes, at ranks 1 through 50.]

The graph in Figure 4 further illustrates these results, by showing a head-to-head comparison of the different systems in terms of overall attribute precision at ranks 1 through 50. Figure 5 shows the individual precision results for some of the target classes. The quality of the extracted attributes varies from class to class. In general, the algorithm is able to extract good quality attributes for many of the classes (e.g., Actor, AircraftModel, Country, etc.), especially at lower ranks.

[Figure 5: Individual precision scores (at ranks 1 through 50) for attributes extracted by the system S_hier for a few target classes: Actor, AircraftModel, Award, BasicFood, City, and Country.]

[Table 5: Effect of individually varying different parameters of the system S_hier on average precision of attributes.]