Diderot: TIPSTER Program, Automatic Data Extraction from Text ...

Diderot: T I P S T E R Program, A u t o m a t i c D a t a Extraction from Text Utilizing Semantic Analysis Y. Wilks, J. Pustejovsky t, J. Cowie C o m p u t i n g R e s e a r c h L a b o r a t o r y , New M e x i c o S t a t e U n i v e r s i t y , Las C r u c e s , N M 88003

& Computer Science t, Brandeis University, Waltham, MA 02254

PROJECT GOALS

these were then hand translated to equivalent Japanese patterns. The Japanese patterns were confirmed using a phrasal concordance tool. A simple reference resolving module was also written. The system contained large lists of company names and human names derived from a variety of online sources. This system handled a subset of the joint venture template definition and was evaluated at twelve months into the project.

The Computing Research Laboratory at New Mexico State University, in collaboration with Brandeis University, was one of four sites selected to develop systems to extract relevant information automatically from English and Japanese texts. When we started, neither site had been involved in message understanding or information extraction. CRL had extensive experience in multilingual natural language processing and in the use of machine readable dictionaries for system building, Brandeis had developed a theory of lexical semantics and preliminary methods for deriving this lexical information from corpora. Thus, our approach focused on applying new techniques to the information extraction task. In the last two years we have developed information extraction software for 5 five different subject area/language pairs.

Attention was then focused on the micro-electronics domain. Much of the semantic information here was d o rived from the extraction rules for the domain. A single phrase in micro-electronics can contribute to several different parts of the template, to allow for this a new semantic unit the factoid was produced by the parser. This produced multiple copies of a piece of text, each marked with a key showing how the copy should be routed and processed in subsequent stages of processing. This routing was performed by a new processing module, which transformed the output from the parser. The statistical based recognition of text relevance was used for microelectronics only, as a much higher percentage of articles in the corpus were irrelevant. This system was evaluated at 18 months.

The system, Diderot, was to he extendible and the techniques used not explicitly tied to the two particular languages, nor to the finance and electronics domains which are the initial targets of the Tipster project. To achieve this objective the project had as a primary goal the exploration of the usefulness of machine readable dictionaries and corpora as source for the semi-automatic creation of data extraction systems.

RECENT RESULTS The first version of the system was developed in five months and was evaluated in the 4th Message Understanding Conference (MUC-4) wher e it extracted information from 200 texts on South American terrorism. At this point the system depended very heavily on statistical recognition of relevant sections of text and on the ability to recognize semantically significant phrases (e.g. a car bomb) and proper names. Much of this information was derived from the keys. The next version of the system used a semantically based parser to structure the information found in relevant sentences in the text. The parsing program was derived automatically from semantic patterns. For English these were derived from the Longman Dictionary of Contemporary English, augmented by corpus information and

464

Finally the improvements from micro-electronics were fed back to the joint venture system. An improved semantic unit recognizer was added to the parser. This handles conjunctions of names, possessives and bracketing. An information retrieval style interface to the Standard Industrial Classification Manual was linked into the English system. The reference resolving mechanism was extended to handle a richer set of phenomenon (e.g. plural references). This version was evaluated at 24 months.

PLANS FOR THE COMING YEAR CRL is participating in Tipster Phase 2. This will involve participation in the development of the architecture for the Phase 2 system, user interfaces to the system, software to handle document markup and multilingual information retrieval. Brandeis are continuing work on tuning lexical entries using information from corpora.

Diderot: TIPSTER Program, Automatic Data Extraction from Text ...

Diderot: TIPSTER Program, Automatic Data Extraction from Text ...

Suggest Documents

Automatic Extraction of Hierarchical Relations from Text

AViTExt: Automatic Video Text Extraction

AViTExt: Automatic Video Text Extraction

Morphological Lexicon Extraction from Raw Text Data

Information extraction from text messages using data

The TIPSTER SUMMAC Text Summarization Evaluation

Ontology reuse with Automatic Text Extraction from ... - CiteSeerX

Automatic Extraction of Useful Facet Hierarchies from Text ... - CiteSeerX

automatic extraction of spatio-temporal information from arabic text ...

Automatic Thai Keyword Extraction from Categorized Text Corpus

Automatic Extraction of Synonymous Collocation Pairs from a Text

Automatic Keyword Extraction from Spoken Text. - Semantic Scholar

Automatic Extraction of Archaeological Events from Text - Informatics ...

Automatic Extraction of Useful Facet Hierarchies from Text ... - CiteSeerX

Automatic Keyword Extraction From Any Text Document ... - CiteSeerX

The Diderot Information Extraction System - Semantic Scholar

Automatic Object Extraction from Electrical

Automatic Keyword Extraction for Text Summarization ...

Experiences Regarding Automatic Data Extraction From Web Pages

Automatic Data Extraction from Lists and Tables in Web Sources

automatic extraction of mangrove vegetation from optical satellite data

automatic extraction of buildings from lidar data and aerial ... - CiteSeerX

the automatic extraction of roads from lidar data - Cartesia.org

Automatic Data Extraction from Lists and Tables in Web ... - CiteSeerX