XAR: An Integrated Framework for Information Extraction Naveen Ashish, Sharad Mehrotra and Pouria Pirzadeh UC Irvine, Calit2 and Information and Computer Sciences 2074 Bren Hall, Irvine CA 92697
[email protected] Abstract We present an information extraction system for automated information extraction from free text. The system, XAR, available open-source to the community, abstracts an extraction application developer from low level text processing details, permitting developing applications with high-level extraction rules. The system also incorporates semantic information in the form of integrity constraints to improve the quality of the extraction. We describe the XAR system and present a summary of experimental results demonstrating the effectiveness of our approach.
1. Introduction

The capability of automated information extraction from text is of significant interest in many application domains, such as Web search, enterprise information integration and mining, and intelligence analysis. Despite state-of-the-art tools and systems for automated extraction, significant effort is still required to build any new extraction application, and achieving high extraction accuracy remains a challenge. Systems and approaches that reduce extraction application development effort and improve extraction accuracy and quality thus continue to be of interest. XAR1 is a step in this direction. Our focus in information extraction is at the level of slot-filling from free text (as opposed to semi-structured data). As an example, consider an application where we have to extract various details about a researcher, such as her name, current job title, employer, academic degrees, associated alma maters, degree dates, etc., from her free-text Web page bio. Such extraction tasks can also be considered "relation extraction", where we think of the extraction task as populating a database relation with the extracted values. For developing a new extraction application, the XAR system provides the user with 1) the capability
1 System and documentation available online at http://www.ics.uci.edu/~ashish/xar
of writing extraction rules in a Datalog [10] style declarative extraction language, and 2) the capability of specifying semantic information about the data to be extracted, stated in terms of integrity constraints [12] over the relation to be extracted. While rule-driven extraction systems are not in themselves new, XAR brings several capabilities that are attractive to a user developing new extraction applications: (i) the user can write extraction rules at a high level, abstracted from the details, or even knowledge, of the underlying lower-level text processing and analysis tools; (ii) the extraction rule language allows representing and exploiting different kinds of properties or features, at different "richness" levels, for extraction; and (iii) the extraction rules can be integrated with the information in integrity constraints, resulting in an improved quality of information extraction.
Application Development with XAR

Any extraction task is defined by a relation that we intend to extract (for instance, the researcher-bios relation described above) and a corpus of text documents from which the data is to be extracted. Given a new corpus of text documents, the system sends it through a feature generation process, which automatically identifies and generates features of various kinds for the tokens and entities in the corpus documents. The features include the identification of tokens, their types (for instance, whether a person, organization, etc.), their position in the text, and so on. For every new relation to be extracted over this corpus, the user provides the following three items:

a) A schema, which is an SQL [12] style schema describing the relation that we intend to extract. The schema specifies the various slots or attributes in the relation, the types and classes of those slots, whether they are single- or multi-valued, etc. Below we illustrate a schema for the researcher-bios domain.

b) A set of declarative extraction rules that specify how tokens are to be assigned to slots. An example of such an extraction rule is:
phd-alma-mater(X) ← university(X), phd-degree(D), insamesentence(D,X)
which should be read as follows: "any token in the text that is of type university and is in the same sentence as another token of type phd-degree is a value for the phd-alma-mater slot". Note that the extraction rule uses several features, for instance which tokens are of type university or degree, whether tokens are in the same sentence in the text, etc.

c) Semantic information about the relation to be extracted, in the form of integrity constraints. For instance, in the researcher-bios domain we know that the first computer science degrees were awarded beginning only in the sixties. This is an example of an attribute-level constraint, constraining the value that the phd-date attribute can take in the researcher-bios relation; we specify it in the schema below. As another example, we know that the year in which a person was awarded a doctoral degree must be greater (later) than the year in which he was awarded a bachelor's degree (at least for the same major). This is an example of a tuple-level constraint, where we specify semantics between two extracted values; it too is stated as a constraint in the schema below. Finally, we can also have constraints at the level of the relation, called relation constraints, which ascertain properties that the collection of tuples in a relation must satisfy as a whole.

XAR Schema and Constraints

create table researcher-bios (
  name: person,
  job-title: title,
  employer: organization,
  phd-degree: degree,
  phd-alma-mater: organization,
  phd-date: date,
  master-degree: degree,
  master-alma-mater: organization,
  master-date: date,
  bachelor-degree: degree,
  bachelor-alma-mater: organization,
  bachelor-date: date,
  previous-employers: organization
)
check phd-date > 1959
check phd-date > bachelor-date
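To make the rule semantics concrete, here is a minimal sketch (in Python, over invented feature facts; XAR itself evaluates such rules with a Prolog engine, not this code) of how the phd-alma-mater rule fires against the feature predicates generated for a bio:

```python
# Hypothetical feature facts, as the feature-generation step might emit them:
# university(X), phd-degree(D), insamesentence(D, X).
university = {"Stanford University", "MIT"}
phd_degree = {"Ph.D."}
insamesentence = {("Ph.D.", "Stanford University")}

# The rule: phd-alma-mater(X) <- university(X), phd-degree(D), insamesentence(D, X)
def phd_alma_mater():
    return {x for x in university
            for d in phd_degree
            if (d, x) in insamesentence}

print(phd_alma_mater())  # {'Stanford University'}
```

Here only "Stanford University" qualifies, since "MIT" does not co-occur in a sentence with a phd-degree token.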
XAR then uses the provided schema, extraction rules, and integrity constraints to perform the actual information extraction. Section 2 presents the system's technical details, Section 3 summarizes experimental results, and Section 4 provides a comparison with related work and concludes.
2. XAR Technical Details

A schematic overview of the XAR system architecture is provided in Fig 1. The system comprises two primary components: a) a feature generation component, where various kinds of text analyzers are applied to the input data to extract significant tokens and entities and their properties, and b) an inference component, where we perform the actual extraction, i.e., assign tokens and entities to different slots in a systematic fashion. The extraction rules and the other semantic information specified (such as the schema and constraints) guide this inference. We describe these components in more detail.

Feature Generation

For slot-filling extraction, a system needs at least a "shallow" analysis of the input text, i.e., the identification of significant tokens, their types, entities, etc. Deeper, language-level analysis, such as a complete semantic parse of the sentences in the text, may also be required in some cases. In the current implementation of XAR, we generate features using two analyzers. The first is GATE [3], an open-source framework for text analysis that we use for the identification of named entities, other significant tokens, parts of speech, etc. Many types of important entities (such as person names, locations, and organizations) can be recognized with reasonable accuracy. Let us consider an example to illustrate the kinds of features that are identified.
Consider a sentence such as: "He was awarded the university teaching excellence award in 1996 for exemplary undergraduate teaching." An analysis of this sentence by GATE yields information about tokens in what is called a GATE annotation:

AnnotationImpl: id=81; type=Token; features={category=NN, kind=word, orth=lowercase, length=5, string=award}; start=NodeImpl: id=80; offset=226; end=NodeImpl: id=81; offset=233

In the annotation above, for instance, the token "award" has been identified as a noun, and other properties, such as its position and offset in the text, are also identified.
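A simple illustration (not GATE's actual API, and not XAR's wrapper code) of how the textual form of such an annotation could be turned into logical feature facts:

```python
import re

# Hypothetical wrapper: parse the textual rendering of a GATE annotation
# and emit one logical predicate per feature of the annotated token,
# e.g. category(award, NN).
def annotation_to_facts(ann: str):
    feats = re.search(r"features=\{([^}]*)\}", ann).group(1)
    pairs = dict(p.split("=", 1) for p in feats.split(", "))
    token = pairs.get("string")
    return [f"{k}({token}, {v})" for k, v in pairs.items() if k != "string"]

ann = ("AnnotationImpl: id=81; type=Token; features={category=NN, kind=word, "
       "orth=lowercase, length=5, string=award}; start=NodeImpl: id=80; "
       "offset=226; end=NodeImpl: id=81; offset=233")
for fact in annotation_to_facts(ann):
    print(fact)  # category(award, NN), kind(award, word), ...
```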
Fig 1. XAR System
We extract the information from such GATE annotations (using wrappers) and represent it in logical predicates. We call such features 'shallow' features. Optionally, we can use a second text analysis tool, a natural language parser, for "deep" analysis of the text. The particular parser we have used is the StanfordParser (http://www-nlp.stanford.edu/downloads/lex-parser.shtml), also an open-source system; it is a complete (statistical) natural language parser and, like GATE, can be used "as is", i.e., without any additional user input or training. The StanfordParser parses the same sentence and provides an output such as:

nsubjpass(awarded-3, He-1)
auxpass(awarded-3, was-2)
det(award-8, the-4)
nn(award-8, university-5)
nn(award-8, teaching-6)
nn(award-8, excellence-7)
dobj(awarded-3, award-8)
prep_in(award-8, 1996-10)
amod(teaching-14, exemplary-12)
amod(teaching-14, undergraduate-13)
prep_for(awarded-3, teaching-14)

which is a collapsed typed-dependencies representation of the parse tree of this sentence. The typed-dependencies representation is essentially a relational representation of the relationships in the parse tree of a sentence. We extract important information about actions of interest from such typed dependencies; for instance, in this example the action "awarded" is of interest. From the typed dependencies
we can (in many cases) extract who awarded what, and to whom. In fact such associations (subject, object, etc.) are typical of essentially any action, i.e., verb. As with GATE annotations, the extraction of such information from a typed-dependency representation is done through a wrapper for that representation. We refer to such features as 'deep' features.

Inference for Extraction

The inference component assigns tokens and entities to slots in a systematic fashion. The inference process is driven by a set of logical rules, complemented with additional semantics specified about the relation to be extracted. We start with the collection of features and relationships, represented as logical predicates, provided by the feature generation step. As part of the inference we (i) apply the XAR extraction rules, which form the basis for the intensional database (IDB); it is the realization of the intensional database that leads us to infer possible values for the various slots; and (ii) apply semantic integrity constraints to restrict or eliminate unlikely (or impossible) extraction instances.

XAR Extraction Rules

The XAR extraction rules are essentially Datalog rules with syntactic sugar. Each rule is a horn clause of the form:

S(X) ←C B1, B2, …, Bm

where S and the Bi are atoms. S, the head, corresponds to a slot to be extracted; the Bi are predicates corresponding to conditions based on which tokens are assigned to slots; and C is an (optional) confidence
value in [0,1] that is a measure of the precision of the extraction rule: C reflects the maximum confidence with which we can state that a value inferred for the head predicate S by that rule is actually a value for the slot corresponding to S. The following are the key features of this rule language: (i) the rules are essentially horn-clause style rules; (ii) the predicates in the body may be either slot predicates or feature predicates; (iii) a token is assumed to be "consumed" with the application of a rule (i.e., it is not subsequently available to another extraction rule), unless stated otherwise; (iv) multiple rules are permitted for the same slot; (v) a precision value is (optionally) associated with each rule; (vi) negation is permitted in the rule body; and (vii) some predefined predicates are provided for the user's convenience in writing rules. The probabilistic framework for the rules is adapted, with simplification, from a general probabilistic logic framework developed in [5]. That work proposes both belief and doubt rules that capture, in horn-clause form, why a certain fact should (or should not) be true; a notion of probabilistic confidence (or rather a range defined by a lower and an upper bound) is associated with each fact (predicate) and rule.

Application of Rules

A set of rules in the extraction language above can be translated to a regular Datalog program in a straightforward fashion; we do not provide the translation details here. (Bottom-up) inference in regular Datalog (including Datalog with stratified negation) is polynomial in the number of rules and/or base predicates [6], which makes the extraction rule inference tractable. The realization of the intensional database provides us, for each head predicate, a set of (zero or more) values that satisfy that predicate, each associated with a confidence value. Each such value is then treated as a possible value for the slot that the head predicate corresponds to.
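A toy illustration of this bottom-up step, under one simple, conservative confidence combination that we assume here for concreteness (the rule confidence caps the confidence of any fact it derives; XAR's actual semantics, adapted from [5], is richer):

```python
# Minimal bottom-up evaluation sketch for a ground rule of the form
#   head <-C body1, body2, ...
# Facts map predicate atoms to confidences. A derived fact's confidence is
# the rule confidence times the minimum confidence of its matched body facts.
def apply_rule(facts, head, body, conf):
    if all(b in facts for b in body):
        c = conf * min(facts[b] for b in body)
        facts[head] = max(facts.get(head, 0.0), c)  # keep the best derivation
    return facts

facts = {"university(MIT)": 0.9, "phd-degree(PhD)": 0.8,
         "insamesentence(PhD, MIT)": 1.0}
apply_rule(facts,
           head="phd-alma-mater(MIT)",
           body=["university(MIT)", "phd-degree(PhD)",
                 "insamesentence(PhD, MIT)"],
           conf=0.7)
print(round(facts["phd-alma-mater(MIT)"], 2))  # 0.7 * min(0.9, 0.8, 1.0) = 0.56
```

A real evaluator would iterate rule application to a fixpoint over non-ground rules; the sketch shows only a single rule firing.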
At the end of such logical inference, each slot has associated with it a set of possible values, each with a probabilistic confidence. For each slot, then, instead of a single extracted value we arrive at a set of possible values with an associated probability distribution, based on the extraction rule probabilities as well as other factors such as the feature identification accuracy. This uncertainty in slot values makes the extracted relation lend itself naturally to an uncertain database representation [16], in which the value of each attribute in each tuple is, in general, a set of possible values with an associated probability distribution.

Application of Constraints

The final step in inference is the application of integrity constraints. The inference
process up to the point when the extraction rules have been applied results in an uncertain relation representing the extracted data, as described above. The application of integrity constraints is essentially a process of refining this uncertain extracted relation with the knowledge in the integrity constraints. The refinement results in a recalibrated uncertain extraction relation in which the possibilities inconsistent with the integrity constraints are eliminated. In general, refining uncertain relations with integrity constraints is a topic by itself; while we have addressed this problem with an effective and scalable solution, we do not present our approach in this paper but refer the reader to our recently compiled technical report [13], which describes the solution in detail.

System Availability and Implementation

An open-source version of XAR has been made available for community use under a Creative Commons License. We encourage potential users interested in either developing extraction applications or researching information extraction to consider using this system. We also encourage the reader to look at the documentation provided at the system site, which gives more details on particular aspects of extraction rules and provides several extraction application examples. The current version of the system is implemented in Java, using off-the-shelf open-source text analysis tools: the GATE text analyzer for feature extraction and the StanfordParser for natural language analysis. The deductive inference has been implemented in tuProlog [14], a Java-based Prolog engine.
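As a concrete illustration of the constraint-refinement step described above, here is a minimal sketch assuming independent per-slot distributions: possible tuples that violate a constraint are dropped and the remaining probability mass is renormalized (XAR's actual algorithm, described in [13], is more involved):

```python
from itertools import product
import math

# Uncertain extracted slots: each maps candidate value -> probability.
slots = {"phd-date":      {1958: 0.3, 1975: 0.7},
         "bachelor-date": {1970: 1.0}}

# Tuple-level integrity constraint: check phd-date > bachelor-date.
def consistent(tup):
    return tup["phd-date"] > tup["bachelor-date"]

# Enumerate possible tuples, keeping only the consistent ones.
names = list(slots)
worlds = []
for values in product(*(slots[n].items() for n in names)):
    tup = {n: v for n, (v, _) in zip(names, values)}
    p = math.prod(p for _, p in values)
    if consistent(tup):
        worlds.append((tup, p))

# Renormalize the surviving possibilities.
total = sum(p for _, p in worlds)
worlds = [(t, p / total) for t, p in worlds]
print(worlds)  # phd-date=1958 (pre-1970 PhD) is eliminated; 1975 keeps all mass
```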
3. Evaluation Case Studies

We conducted a detailed application case study of XAR as an information extraction system. Our specific aims were (i) assessing the effort and complexity required to develop new extraction applications and the quality of extraction that can be achieved, (ii) evaluating whether (and if so, to what extent) access to an integrated feature space benefits extraction, and (iii) assessing the effectiveness of exploiting semantic integrity constraints in extraction. We conducted these experiments over three different real-world datasets and extraction tasks, namely a) extracting a researcher-bios relation over a corpus of 500 free-text bios of computer science researchers from their homepages, b) the MUC-6 task [11] of extracting management succession events (close to 100 such instances) over a corpus of WSJ news stories, and c) extracting details about instances of
aid or relief (about 385 such instances) being, or having been, dispatched by a country or organization to another in the event of a disaster, over a corpus of online news stories related to the S. E. Asian Tsunami disaster. We refer to these as the researcher-bios, management-succession, and aid-dispatched tasks respectively. Due to space limitations, we refer the reader to [4], an extended technical report version of this paper that we have made available online and that contains the details of all the experiments, execution traces, and experimental results. Here we summarize the key highlights of the various experiments:

• Using declarative extraction rules, we were able to achieve fairly high extraction accuracy, i.e., precision and recall of 0.7 and above for all the extraction tasks. This compares well with extraction accuracies for systems at the ACE extraction competitions [15].

• An average of 2.5 extraction rules per slot is required to achieve an extraction accuracy of 0.7; accuracy increases as the number of extraction rules increases.

• Shallow and deep features can be combined in extraction rules. The use of deep features results in a more succinct rule set but carries a higher cost for generating those features; shallow features are generated much faster but require a larger rule set. An adaptive approach can balance rule set size against feature generation time.

• The use of integrity constraints significantly improves extraction accuracy. We were able to specify over 40 integrity constraints for the researcher-bios domain, resulting in an average accuracy improvement of as much as 41%.
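For reference, precision and recall here are the standard set-based measures over extracted versus gold slot values; for a single slot they can be computed as in this small helper (illustrative only, not part of the XAR codebase):

```python
# Precision: fraction of extracted values that are correct.
# Recall: fraction of gold values that were extracted.
def precision_recall(extracted: set, gold: set):
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

extracted = {"MIT", "Stanford", "Yale"}
gold = {"MIT", "Stanford", "CMU", "UCI"}
print(precision_recall(extracted, gold))  # ≈ (0.67, 0.5)
```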
4. Related Work and Conclusion

Rule-based extraction systems are not in themselves new. The work in [1] presents Xlog, a variant of Datalog extended with procedural predicates, as a language for information extraction; its focus is on optimized Datalog execution. Lixto [2] is focused on semi-structured data extraction on the Web. DIAL [7] is also a logic-based extraction language that permits constraints, albeit at a low (language and syntactic) level. AVATAR [8] is a project on building high-precision information extractors, and consequently search engines and retrieval systems over extracted data; the applications of that framework described so far address relatively lower-level semi-structured data extraction rather than the more general free-text extraction tasks that we have considered. We refer to [9] for a comprehensive survey of extraction systems. In comparison, we consider XAR the first effort to comprehensively integrate semantics, as integrity constraints, into the extraction process. It also abstracts application developers from low-level feature generation details. Moreover, XAR is an implemented system available for community use.
References

[1] W. Shen, A. Doan, J. Naughton and R. Ramakrishnan, "Declarative information extraction using Datalog with embedded extraction predicates," in ACM SIGMOD 2007.
[2] G. Gottlob, C. Koch, R. Baumgartner, M. Herzog and S. Flesca, "The Lixto data extraction project - back and forth between theory and practice," in PODS 2004.
[3] H. Cunningham, "GATE: A General Architecture for Text Engineering."
[4] N. Ashish, S. Mehrotra and P. Pirzadeh, "XAR: An Integrated Framework for Information Extraction" (extended version), http://www.ics.uci.edu/~ashish/pubn.htm
[5] L. Lakshmanan and F. Sadri, "Probabilistic deductive databases," in SLP 1994.
[6] J. D. Ullman, "Bottom-up beats top-down for Datalog," in ACM PODS 1988.
[7] R. Feldman, Y. Aumann, M. Finkelstein-Landau, E. Hurvitz, Y. Regev and A. Yaroshevich, "A comparative study of information extraction strategies," in ACL 2002.
[8] T. S. Jayram, R. Krishnamurthy, S. Raghavan, S. Vaithyanathan and H. Zhu, "Avatar Information Extraction System," IEEE Data Engineering Bulletin, 2006.
[9] M. Kayed, M. R. Girgis and K. F. Shaalan, "A Survey of Web Information Extraction Systems," IEEE Transactions on Knowledge and Data Engineering, vol. 18, 2006.
[10] S. Ceri, G. Gottlob and L. Tanca, "What you always wanted to know about Datalog (and never dared to ask)," IEEE TKDE 1(1), 1989, pp. 146-166.
[11] Proceedings of the 6th Message Understanding Conference (MUC-6), Columbia, MD, 1995.
[12] J. D. Ullman and J. Widom, A First Course in Database Systems, 3rd ed., Prentice Hall, 2008.
[13] N. Ashish, P. Pirzadeh and S. Mehrotra, "Incorporating Integrity Constraints in Uncertain Databases," http://www.ics.uci.edu/~ashish/pubn.htm
[14] E. Denti, A. Omicini and A. Ricci, "tuProlog: A Light-Weight Prolog for Internet Applications and Infrastructures," in PADL 2001, Las Vegas, NV.
[15] ACE Automatic Content Extraction, http://www.nist.gov/speech/tests/ace/
[16] N. Dalvi and D. Suciu, "Foundations of Probabilistic Answers to Queries," tutorial, ACM SIGMOD 2005.