Using language models for generic entity extraction

Ian H. Witten, Zane Bray, Malika Mahoui, W.J. Teahan
Computer Science, University of Waikato, Hamilton, New Zealand
[email protected]

Abstract

This paper describes the use of statistical language modeling techniques, such as are commonly used for text compression, to extract meaningful, low-level information about the location of semantic tokens, or "entities," in text. We begin by marking up several different token types in training documents—for example, people's names, dates and time periods, phone numbers, and sums of money. We form a language model for each token type and examine how accurately it identifies new tokens. We then apply a search algorithm to insert token boundaries in a way that maximizes compression of the entire test document. The technique can be applied to hierarchically-defined tokens, leading to a kind of "soft parsing" that will, we believe, be able to identify structured items such as references and tables in HTML or plain text, based on nothing more than a few marked-up examples in training documents.

1. INTRODUCTION

Text mining is about looking for patterns in text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Compared with the kind of data stored in databases, text is unstructured, amorphous, and difficult to deal with. Nevertheless, in modern Western culture, text is the most common vehicle for the formal exchange of information. The motivation for trying to extract information from it is compelling—even if success is only partial.

Text mining is possible because you do not have to understand text in order to extract useful information from it. Here are four examples. First, if only names could be identified, links could be inserted automatically to other places that mention the same name—links that are "dynamically evaluated" by calling upon a search engine to bind them at click time. Second, actions can be associated with different types of data, using either explicit programming or programming-by-demonstration techniques. A day/time specification appearing anywhere within one's email could be associated with diary actions such as updating a personal organizer or creating an automatic reminder, and each mention of a day/time in the text could raise a popup menu of calendar-based actions. Third, text could be mined for data in tabular format, allowing databases to be created from formatted tables such as stock-market information on Web pages. Fourth, an agent could monitor incoming newswire stories for company names and collect documents that mention them—an automated press clipping service.

In all these examples, the key problem is to recognize different types of target fragments, which we will call tokens or "entities". This is really a kind of language recognition problem: we have a text made up of different sublanguages (for personal names, company names, dates, table entries, and so on) and seek to determine which parts are expressed in which language. The information extraction research community (of which we were, until recently, unaware) has studied these tasks and reported results at annual Message Understanding Conferences (MUC). For example, "named entities" are defined as proper names and quantities of interest, including personal, organization, and location names, as well as dates, times, percentages, and monetary amounts (Chinchor, 1999).

The standard approach to this problem is manual: tokenizers and grammars are hand-designed for the particular data being extracted. Looking at current commercial state-of-the-art text mining software, for example, IBM's Intelligent Miner for Text (Tkach, 1997) uses specific recognition modules carefully programmed for the different data types, while Apple's data detectors (Nardi et al., 1998) use language grammars. The Text Tokenization Tool of Grover et al. (1999) is another example, and a demonstration version is available on the Web. The challenge for machine learning is to use training instead of explicit programming to detect instances of sublanguages in running text.



THE COMPUTISTS' COMMUNIQUE                    AI IS CS
Vol. 8, No. 25.1                              August 18, 1998
"Careers beyond programming."
1> Politics and policy.  2> Career jobs.  3> Book and journal calls.  4> Silicon Valley jobs.
_________________________________________________________________

1> Politics and policy: The President's Information Technology Advisory Committee has issued an Aug98 Interim Report about future research needs. It's online at . [Maria Zemankova , IRList, 10Aug98.]

2> Career jobs (in our CCJ 8.25 digest this week): Fraunhofer CRCG (Providence, RI): MS/PhD researcher for digital watermark agents. Case Western Reserve U. (Cleveland): ESCES dept. chair. UOklahoma (Norman): CS dept. director. Santa Fe Institute (NM): postdocs in complex, adaptive systems.

3> Book and journal calls: CRC Press is seeking proposals for future volumes in its International Series on Computational Intelligence, or for chapters in such volumes. Lakhmi C. Jain . [connectionists, 13Aug98.] Kluwer Academic Publishers has a new Genetic Programming book series, starting with Langdon's "Genetic Programming and Data Structures: Genetic Programming + Data Structures = Automatic Programming!". Book ideas may be sent to John R. Koza

4> Silicon Valley jobs: Wired ran an article last year about headhunting in Silicon Valley, by Po Bronson, author of "The First $20 Million Is Always the Hardest" (Random House). Bronson says there is tremendous demand for programmers, computer operators, and marketing people -- so much so that Nohital Systems had acceded to the demands of a programmer who brought his 8-foot python to work and [temporarily] a night-shift operator

The US created 20K new computer services jobs in Jul98, plus 3K in computer manufacturing, out of just 66K new US jobs total. The Bureau of Labor Statistics characterized the computer field as having "strong long-term growth trends." [TechWeb, 08Aug98. EduP.]

The peak year for female CS graduates was 1983-4, when women earned 37% (32,172) of BSCS degrees. It dropped to 28% in 1993-4

"Our entire culture has been sucked into the black hole of computation, an utterly frenetic process of virtual planned obsolescence. But you know -- that process

Figure 1. Masthead and beginning of each section of the electronic newsletter


This paper explores the ability of the kind of adaptive language models used in text compression to locate patterns in text, patterns of the kind typically sought by text mining systems.

The sheer quantity of different information items in Table 1 is impressive: 175 items, in ten categories, from a mere four pages. The volume of such information items is highly dependent on the kind of text—one would expect far fewer in a novel, for example—but this example is not atypical of the factual, newsy writing that information workers have to scan on a regular basis.

2. STRUCTURED INFORMATION IN TEXT

Looking for patterns in text is really the same as looking for structured data inside documents. According to Nardi et al. (1998), "a common user complaint is that they cannot easily take action on the structured information found in everyday documents ... Ordinary documents are full of such structured information: phone numbers, fax numbers, street addresses, email addresses, email signatures, abstracts, tables of contents, lists of references, tables, figures, captions, meeting announcements, Web addresses, and more. In addition, there are countless domain-specific structures, such as ISBN numbers, stock symbols, chemical structures, and mathematical equations."

Nardi et al. (1998) define structured information as “data recognizable by a grammar,” and go on to describe how such data can be specified by an explicit grammar and acted on by “intelligent agents.” Despite the intuitive appeal of being able to define data detectors that trigger particular kinds of actions, there are practical problems with this approach that stem from the difficulty of enabling an ordinary user to specify grammars that recognize the kind of tokens he or she is interested in. First, practical schemes that allow users to specify grammars generally presuppose that the input is somehow divided into lexical tokens, and although “words” delimited by non-alphanumeric characters provide a natural tokenization for many of the items in Table 1, such a decision will turn out to be restrictive in particular cases. For example, generic tokenization would not allow the unusual date structure in this particular document (e.g. 30Jul98) to be recognized. In general, any prior division into tokens runs the risk of obscuring information.

As an example, Figure 1 shows the masthead and beginning of each section of a 4-page, 1500-word, weekly electronic newsletter, and Table 1 shows information items extracted (manually) from it—items of the kind that readers might wish to take action on. They are classified into generic types: people's names; dates and time periods; locations; sources, journals, and book series; organizations; URLs; email addresses; phone numbers; fax numbers; and sums of money. Identifying these types is rather subjective. For example, dates and time periods are lumped together, whereas for some purposes they should be distinguished. Personal and organizational names are separated, whereas for some purposes they should be amalgamated—indeed it may be impossible for a person (let alone a computer) to distinguish them. The methodology we develop here accommodates all these options: unlike AI approaches to natural language understanding, it is not committed to any particular ontology.

Second, practical grammar-based approaches that allow users to specify the structure use deterministic grammars rather than probabilistic ones. These paint the world black and white. Yet the situation exemplified by Table 1 reeks of ambiguity. A particular name might be a person's name, a place name, a company name—or all three. A name appearing at the beginning of a sentence may be indistinguishable from an ordinary capitalized word that starts the sentence—and it is not even completely clear what it means to "start a sentence". Determining what is and what is not a time period is not always easy—the notion of "time period" in natural language text is ill-defined. Text, with its richness and ambiguity, does not necessarily support hard and fast distinctions. Email addresses and URLs are exceptions: but, of course, they are not "natural" language.


Table 1 Generic data items extracted from a 4-page electronic newsletter

People's names (n): Al Kamen; Barbara Davies; Bill Park; Bruce Sterling; Ed Royce; Eric Bonabeau; Erricos John Kontoghiorghes; Heather Wilson; John Holland; John R. Koza; Kung-Kiu Lau; Lakhmi C. Jain; Lashon Booker; Lily Laws; Maria Zemankova; Mark Sanford; Martyne Page; Mike Cassidy; Po Bronson; Randall B. Caldwell; Robert L. Park; Robert Tolksdorf; Sherwood L. Boehlert; Simon Taylor; Sorin C. Istrail; Stewart Robinson; Terry Labach; Vernon Ehlers; Zoran Obradovic

Dates/time periods (d): 30Jul98; 31Jul98; 02Aug98; 04Aug98; 05Aug98; 07Aug98; 08Aug98; 09Aug98; 10Aug98; 11Aug98; 13Aug98; 14Aug98; 15Aug98; August 18, 1998; 01Sep98; 15Sep98; 15Oct98; 31Oct98; 10Nov98; 01Dec98; 01Apr99; Nov97; Jul98; Aug98; Mar99; August; July; Spring 1999; Spring 2000; 1993-4; 1999; 120 days; eight years; eight-week; end of 1999; late 1999; month; twelve-year period

Locations (l): Beaverton; Berkeley; Britain; Canada; Cleveland; Italy; Montreal; NM; Norman; Providence, RI; Quebec; Silicon Valley; Stanford; US; Vienna; the Valley

Sources, journals, book series (s): Autonomous Agents and Multi-Agent Systems Journal; Commerce Business Daily (CBD); Computational Molecular Biology Series; DAI-List; ECOLOG-L; Evolutionary Computation Journal; Genetic Programming book series; IRList; International Series on Computational Intelligence; J. of Complex Systems; J. of Computational Intelligence in Finance (JCIF); J. of Symbolic Computation (JSC); J. of the Operational Research Society; Parallel Computing Journal; Pattern Analysis and Applications (PAA); QOTD; SciAm; TechWeb; WHAT'S NEW; Washington Post; Wired; comp.ai.doc-analysis.ocr; comp.ai.genetic; comp.ai.neural-nets; comp.simulation; Dbworld; Sci.math.num-analysis; Sci.nanotech

Organizations (o): ACM; Austrian Research Inst. for AI; Bureau of Labor Statistics; CRC Press; Case Western Reserve U.; Fraunhofer CRCG; Ida Sproul Hall; Kluwer Academic Publishers; NSF; Nohital Systems; Oregon Graduate Inst.; Permanent Solutions; Random House; Santa Fe Institute; UOklahoma; UTrento

URLs (u): http://cbdnet.access.gpo.gov/; http://ourworld.compuserve.com/homepages/ftpub/call.htm; http://www.ccic.gov/ac/interim/; http://www.cs.man.ac.uk/~kung-kiu/jsc; http://www.cs.sandia.gov/~scistra/DAM; http://www.cs.tu-berlin.de/~tolk/AAMAS-CfP.html; http://www.elsevier.nl/locate/parco; http://www.santafe.edu/~bonabeau; http://www.soc.plym.ac.uk/soc/sameer/paa.htm; http://www.wired.com/wired/5.11/es_hunt.html

Email addresses (e): [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]

Phone numbers (p): 650-941-0336; (703) 883-7609; +44 161 275 5716; +44-1752-232 558

Fax numbers (f): 650-941-9430 fax; +44 161 275 6204 Fax; (703) 883-6435 fax; +44-1752-232 540 fax

Sums of money (m): $1K; $24K; $60; $65K; $70; $78K; $100


[Figure 2: the extract of Figure 1 with each information item enclosed in XML-style markup tags; the tagged text is not reproduced here.]

Figure 2. Marked-up version of Figure 1 (some lines truncated for display)

3. LANGUAGE MODELS FOR TEXT COMPRESSION

Statistical language models are well developed in the field of text compression. Compression methods are usually divided into symbolwise and dictionary schemes (Bell et al., 1990). Symbolwise methods, which generally make use of adaptively generated statistics, give excellent compression—in fact, they include the best known methods. Although the dictionary methods such as the Ziv-Lempel schemes perform less well, they are used in practical compression utilities such as Unix compress and gzip because they are fast.

Third, recognition should degrade gracefully when faced with real text, that is, text containing errors. Traditional grammars decide whether a particular string is or is not recognized, whereas in this application it is often helpful to be able to recognize a string as belonging to a particular type even though it is malformed, or a name even though it is misspelled. (The newsletter used for Table 1 is unusual in that it is virtually error free.)

In our work we use the Prediction by Partial Matching (PPM) symbolwise compression scheme (Cleary and Witten, 1984), which has become a benchmark in the compression community. It generates “predictions” for each input symbol in turn. Each prediction takes the form of a probability distribution that is provided to an encoder. The encoder is usually an arithmetic or Huffman coder; fortunately the details of coding are of no relevance to this paper.

Fourth and most important, text mining will require incremental, evolutionary development of grammars. The problems are not fully defined in advance. Grammars will have to be modified to take account of new data. This is not easy: the addition of just one new example can completely alter a grammar and render worthless all the work that has been expended in building it. Finally, many of the items in Table 1 cannot be recognized by conventional grammars. Names are a good example. Some will have been encountered before; for them, table lookup is appropriate—but the lookup operation should recognize legitimate variants. Others will be composed of parts that have been encountered before, say John and Smith, but not in that particular combination. Others will be recognizable by format (e.g. Randall B. Caldwell). Still others—particularly certain foreign names—will be clearly recognizable because of peculiar language statistics (e.g. Kung-Kiu Lau). Others will not be recognizable except by capitalization, which is an unreliable guide—particularly when only one name is present.

PPM uses finite-context models of characters, where the previous few (say three) characters predict the upcoming one. The conditional probability distribution of characters, conditioned on the preceding few characters, is maintained and updated as each character of input is processed. This distribution, conditioned on the actual value of the preceding few characters, is used to predict each upcoming symbol. Exactly the same distributions are maintained by the decoder, which updates the appropriate distribution as each character is received. This is what we call "adaptive modeling": both encoder and decoder maintain the same models—not by communicating the models directly, but by updating them in precisely the same way. Rather than using a fixed context length (three was suggested above), the PPM method chooses a maximum context length and maintains statistics for this and all shorter contexts. For example, in most of the experiments below the maximum context length was five, and statistics were maintained for models of order five, order four, order three, order two, order one, and order zero. These are not stored separately; they are all kept in a single trie structure.
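As a concrete illustration, here is a minimal sketch in Python (our own, not the implementation used in the experiments) of the statistics involved: it simply accumulates, in a dictionary rather than a trie, the counts of which character follows each context of length zero up to the maximum order.

    from collections import defaultdict

    MAX_ORDER = 5  # maximum context length used in most of the experiments

    def train_counts(text, max_order=MAX_ORDER):
        """Count, for every context of length 0..max_order, the characters that follow it."""
        counts = defaultdict(lambda: defaultdict(int))  # context -> {next character -> count}
        for i, ch in enumerate(text):
            for order in range(max_order + 1):
                if i - order < 0:
                    break
                counts[text[i - order:i]][ch] += 1
        return counts

    counts = train_counts("the cat sat on the mat")
    print(dict(counts["th"]))  # characters observed after the context "th"
    print(dict(counts[""]))    # order-0 counts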


Table 2 Confusion matrix: tokens are identified by language models. (a) Tokens in isolation; (b) tokens in context. Rows give the correct label and columns the computed label (d, n, s, o, l, m, e, u, p, f, and t for plain text) for the 192 tokens of Table 1. [Cell values not reproduced here.]

To encode the next symbol, PPM starts with the maximum-order model (say order five). If it contains a prediction for the upcoming character, the character is transmitted according to the order-five distribution. Otherwise, both encoder and decoder "escape" down to order four. There are two possible situations. If the order-five context—that is, the preceding five-character sequence—has not been encountered before, then the escape to order four is inevitable, and both encoder and decoder can deduce that fact without requiring any communication. If not—that is, if the preceding five characters have been encountered in sequence before, but not followed by the upcoming character—then only the encoder knows that an escape is necessary. In this case it must signal the fact to the decoder by transmitting an "escape event," and room for this event must be made in each probability distribution that the encoder and decoder maintain.

Once any necessary escape event has been transmitted and received, both encoder and decoder agree that the upcoming character will be coded by the order-four model. Of course, this may not be possible either, and further escapes may take place. Ultimately, the order-zero model may be reached; in this case the character can be transmitted if it is one that has occurred before. Otherwise, there is one further escape (to an "order −1" model), and the 8-bit ASCII representation of the character is sent.

The only remaining question is how to calculate the escape probabilities. There has been much discussion of this, and several different methods have been proposed. Our experiments use method D (Howard, 1993), which calculates the escape probability in a particular context as d/(2n), where n is the number of times that context has appeared and d is the number of different symbols that have directly followed it. The probability of a character that has occurred c times in that context is (c − 1/2)/n. Since there are d such characters, and their counts sum to n, it is easy to confirm that the probabilities in the distribution (including the escape probability) sum to 1.

One slight further improvement to PPM is incorporated in the experiments: deterministic scaling (Teahan, 1997). Although it probably has negligible effect on our overall results, we record it here for completeness. Experiments show that in deterministic contexts, for which d = 1, the probability that the single character seen before reappears is greater than the 1 − 1/(2n) implied by the above estimator. Consequently, in this case the probability is increased, in an ad hoc manner, to 1 − 1/(6n).
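A small sketch of the estimator, again our own simplification rather than the code used in the experiments, makes the arithmetic concrete: given the follower counts for one context, it returns the method D probabilities together with the escape probability, applying deterministic scaling when only one follower has been seen.

    def ppmd_distribution(follower_counts):
        """follower_counts: {character: count} for a single context.
        Returns {character: probability} plus an 'ESC' entry for the escape event."""
        n = sum(follower_counts.values())   # times the context has appeared
        d = len(follower_counts)            # distinct characters that followed it
        if d == 1:
            # Deterministic scaling: raise the single character's probability
            # from 1 - 1/(2n) to 1 - 1/(6n).
            ch = next(iter(follower_counts))
            p = 1.0 - 1.0 / (6.0 * n)
            return {ch: p, "ESC": 1.0 - p}
        probs = {ch: (c - 0.5) / n for ch, c in follower_counts.items()}
        probs["ESC"] = d / (2.0 * n)        # method D escape probability
        return probs

    print(ppmd_distribution({"e": 3, "a": 1}))
    # {'e': 0.625, 'a': 0.125, 'ESC': 0.25} -- the probabilities sum to 1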

4. USING LANGUAGE MODELS TO RECOGNIZE TOKENS

Character-based language models provide a good way to recognize lexical tokens. Tokens can be compressed using models derived from different training data, and classified according to which one supports the most economical representation.


Table 3 Confusion matrices averaged over 20 test documents (leave-one-out training). (a) Tokens in isolation; (b) tokens in context. As in Table 2, rows give the correct label and columns the computed label, including t for plain text. [Cell values not reproduced here.]

Table 3a shows the average confusion matrices over twenty test documents, derived using leave-one-out training: for each issue, the compression models were trained on the other nineteen issues. The single issue that we will discuss in detail is fairly representative of this data set.

To test this, several issues of the same newsletter used for Table 1 were analyzed manually to extract all people's names; dates and time periods; locations; sources, journals, and book series; organizations; URLs; email addresses; phone numbers; fax numbers; and sums of money. The issues were marked up to identify these items using an XML-style markup; Figure 2 shows the marked-up version of the extract in Figure 1.

Of the 192 tokens in Table 1, 174 are identified correctly and 18 incorrectly. In fact, 69 of them appear in the training data (with the same label) and 123 are new; all of the errors are on new symbols. Three of the “old” symbols contain line breaks that do not appear in the training data: for example, in the test data Parallel Computing Journal is split across two lines as indicated. However, these items were still identified correctly. The 18 errors are easily explained; some are quite understandable.

Various experiments were carried out to determine the power of language models to discriminate these tokens, both out of context and within their context in the newsletter. Throughout this work, we use the PPM text compression scheme as described above, with order five unless otherwise mentioned.

Beaverton, a location, was mis-identified as a name. Compressed as a name, it occupies 3.18 versus 3.25 bits/char as a location. Norman (!), Cleveland, Britain and Quebec were also identified as names. Conversely, the name Mark Sanford was mis-identified as a location. Although Mark appears in five different names, this is outweighed by eighteen appearances of San in locations (e.g. San Jose, San Mateo, Santa Monica, as well as Sankte Augustin). The name Sorin C. Istrail was also identified as a location, as was the organization Fraunhofer CRCG.

4.1 DISCRIMINATING ISOLATED TOKENS

Lists of names, dates, locations, etc. in 19 issues of the newsletter were input to PPM separately to form ten language models labeled n, d, l, s, o, u, e, p, f, m. In addition, a plain text model, t, was formed from the full text of all 19 issues. These models were used to identify each of the tokens in Table 1 on the basis of which model compresses them the most. The results are summarized in the form of a confusion matrix in Table 2a, where the rows represent the correct label and the columns the computed one. Notice that although t never appears as the correct label, it could be assigned to a token of a different type because it compresses it best.
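The classification rule itself is simple to sketch. The fragment below is a deliberately simplified stand-in for PPM (an order-2 model with a crude 8-bit escape for unseen contexts and our own smoothing), but it shows the essential step: build one character model per token type and label a token with whichever model encodes it in the fewest bits.

    import math
    from collections import defaultdict

    ORDER = 2

    def build_model(training_strings, order=ORDER):
        counts = defaultdict(lambda: defaultdict(int))
        for s in training_strings:
            padded = "\x00" * order + s
            for i in range(order, len(padded)):
                counts[padded[i - order:i]][padded[i]] += 1
        return counts

    def bits(text, model, order=ORDER):
        """Approximate cost, in bits, of encoding `text` with `model`."""
        total = 0.0
        padded = "\x00" * order + text
        for i in range(order, len(padded)):
            ctx, ch = padded[i - order:i], padded[i]
            followers = model.get(ctx)
            if followers and ch in followers:
                n = sum(followers.values())
                total += -math.log2((followers[ch] + 0.5) / (n + 1.0))
            else:
                total += 8.0   # crude escape: fall back to 8 bits per character
        return total

    def classify(token, models):
        return min(models, key=lambda label: bits(token, models[label]))

    models = {"d": build_model(["30Jul98", "10Aug98", "15Sep98", "01Dec98"]),
              "n": build_model(["John Holland", "Po Bronson", "Maria Zemankova"])}
    print(classify("13Aug98", models))       # -> 'd'
    print(classify("John R. Koza", models))  # -> 'n'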

ACM, an organization, was mis-identified as a source. In fact, the only place these letters appear in the training data is in ACM Washington Update, which is a source. Sources are the most diverse category and swallow up many foreign items: the dates eight-week, eight years and Spring 2000 (note: comp.software.year-2000 is a source); the name Adnan Amin; the organizations Commerce Business Daily (CBD) and PAA; and the email address [email protected] (because genetic-programming appears in the training data as both a source and plain text). Some sources were also mis-identified: ECOLOG-L was identified as an organization, and sci.nanotech as an email address.

We will discuss these results below, looking at individual errors to get a feeling for the kinds of mistakes that are made by compression-based token identification. These results are for a single issue of the newsletter.


[Figure 3: bar charts of the number of errors (0–100) against model order (0–5); (a) isolated tokens, (b) tokens in context.]

Figure 3 Effect of model order on token identification (light areas represent tokens mis-identified as plain text)


4.2 DISTINGUISHING TOKENS IN CONTEXT

Our quest is to identify tokens in the newsletter. Here, contextual information is available which provides additional help in disambiguating them. However, identification must be done conservatively, so that strings of plain text are not misinterpreted as tokens—and since there are many strings of plain text, there are countless opportunities for error.

Table 2b shows the confusion matrix that was generated. (Again, cross-validation results for the entire data set are shown in Table 3b.) The number of errors has increased from 18 to 26; however, 24 of these “errors” are caused by failure to recognize a token as being different from plain text, and only two are actual mis-recognitions (Berkeley is identified as a name and Mark Sanford as a location).

For example, email addresses in the newsletter are always flanked by angle brackets. Many sources are preceded by a [ or ,• and followed by a ,•. (Bullets are used to make spaces visible.) These contextual clues often help in identifying tokens. Conversely, identification may be foiled in some cases by misleading context. For example, some names are preceded by Rep.•, which reduces the weight of the capitalization evidence because capitalization routinely occurs following a period. But by far the most influential effect of context is that to mark up any string as a token requires the insertion of two extra symbols: begin-token and end-token. Unless the language models are good, tokens will be misread as plain text to avoid the overhead of these extra symbols.

There is a small overall improvement in compression through the use of tokens. To code the original test file using a model generated from the original training files takes a total of 28,589 bits for 10,889 characters, an average of 2.63 bits/char. To code the marked-up test file using the appropriate models generated from the marked-up training files takes 841 fewer bits (despite the fact that there are 364 additional begin-token and end-token symbols). This represents a 2.9% improvement for the file as a whole, or 0.077 bits per character of the original file. Of course, the improvement is diluted by the presence of a large volume of plain text. When the savings are averaged over the affected characters—namely the 2945 characters that are present in the tokens alone—the improvement seems more impressive, 0.28 bits/char.
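The arithmetic behind these figures can be checked directly (numbers taken from the text above):

    baseline_bits = 28589   # original test file coded with the original models
    chars_total   = 10889   # characters in the test file
    saved_bits    = 841     # bits saved by the marked-up models
    token_chars   = 2945    # characters that lie inside tokens

    print(baseline_bits / chars_total)       # ~2.63 bits/char baseline
    print(100 * saved_bits / baseline_bits)  # ~2.9 % improvement overall
    print(saved_bits / chars_total)          # ~0.077 bits/char over the whole file
    print(saved_bits / token_chars)          # ~0.28 bits/char over the token characters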

In order to evaluate the effect of context, all tokens were replaced by a symbol that was treated by PPM as a single character. The training data used for the plain-text model was transformed in this way, a new model was generated from it, and the test article was compressed by this model to give a baseline entropy figure of e0 bits. The first token in Table 1 was restored into the test article as plain text and the result recompressed to give entropy e bits. The net space saved by recognizing this token as belonging to model m is e − (e0 + em) bits, where em is the entropy of the token with respect to model m. This was evaluated for each model to determine which one classified the token best, or whether it was best left as plain text. The entire procedure was repeated for each token individually.
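A minimal sketch of this decision rule, under our own naming, is shown below; the non-zero threshold mentioned later for trading errors against identification failures appears as an explicit parameter.

    def best_label(e, e0, token_entropies, threshold=0.0):
        """e:  bits to code the article with this token restored as plain text.
        e0: bits to code the article with the token reduced to a single symbol.
        token_entropies: {model label: bits to code the token with that model}."""
        savings = {m: e - (e0 + em) for m, em in token_entropies.items()}
        label, gain = max(savings.items(), key=lambda kv: kv[1])
        return label if gain > threshold else "plain text"

    # Illustrative numbers only:
    print(best_label(e=120.0, e0=95.0, token_entropies={"d": 18.0, "n": 31.0}))
    # -> 'd' (net saving of 7 bits); with threshold=10.0 the token would stay plain text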

4.3 EFFECT OF MODEL ORDER

To investigate the effect of model order, the token discrimination exercise was re-run using PPM models of order 0, 1, 2, 3, 4, and 5, using the same training and test data. Figure 3 shows the number of errors in token identification observed, for both isolated tokens and tokens in context.


[Figure 4: number of errors against the number of training files (0–40), with curves for identification failures (in context), actual errors (in isolation), and actual errors (in context).]

Figure 4 Effect of quantity of training data on token identification

The dark bars show actual errors, and the light ones indicate failure to identify a token as something other than plain text. The number of actual errors is very small when tokens are taken in context, even for low-order models. However, the number of identification failures is enormous when the models are poor ones.

Although the lowest error rates are observed with the highest-order models, the improvement as model order increases beyond 2 is marginal. Note that we have used models of the same order for all information items (including the text model itself); better results may be obtained by choosing the optimal order for each token type individually. For example, order-0 models identify every single sum of money, email address, URL, phone and fax number, with no errors—even in context.

The tradeoff, with tokens in context, between actual errors and failures to identify is not a fixed one. It can be adjusted by using a non-zero threshold when comparing the compression for a particular token with the compression when its characters are interpreted as plain text. This allows us to control the error rate, sacrificing a small increase in the number of errors for a larger decrease in identification failures.

4.4 EFFECT OF QUANTITY OF TRAINING DATA

Next we examine how identification accuracy depends on the amount of training data. We varied the training data from one to 38 issues of the same newsletter, marked up manually. Figure 4 plots the number of errors in token identification, for the same test file, against the amount of training data. The middle line corresponds to isolated tokens (as in Figure 3a). The other two correspond to tokens in context, the lower one to the number of actual errors made (as in the dark bars of Figure 3b), and the upper one to the number of failures to identify a token as anything but plain text (the light bars). Notice that there is a very satisfactory reduction in errors as the amount of training data increases, but the number of identification failures stabilizes at an approximately constant level. Our choice of 19 training documents for all other tests seems to be a sensible one.

5. LOCATING TOKENS IN CONTEXT

We locate tokens in context by considering the input as an interleaved string of information from different sources. This model has been studied by Reif and Storer (1997), who consider optimal lossless compression of nonstationary sources produced by concatenating finite strings from different sources. However, they assume that the individual strings are long, and grow without bound as the input increases; this assumption allows an encoding method to be derived with asymptotically optimal expected length. Volf and Willems (1997) study the combination of two universal coding algorithms using a switching method. They devise a dynamic programming structure to control switching, but rather than computing the probability of a single transition sequence, they weight over all transition sequences.

Our work derives from Teahan et al.'s (1997) method for correcting English text using PPM models. Suppose every token is bracketed by begin-token and end-token symbols; the problem then is to "correct" text by inserting such symbols appropriately. Begin- and end-token symbols identify the type of the token in question—thus we have begin-name-token, end-name-token, and so on, written as <n>, </n>, etc. The innovation in the present paper is that whenever a begin-token symbol is encountered, the encoder switches to the language model appropriate to that type of token, initialized to a null prior context. And whenever an end-token symbol is encountered, the encoder reverts to the plain text model that was in effect before, replacing the token by the single symbol representing that token type.
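The scoring of one candidate markup can be sketched as follows. This is our own simplification: bits() is assumed to be a per-model entropy estimator such as the one sketched earlier, token_symbol maps each token type to its single-symbol stand-in, and the cost of the begin-token and end-token symbols themselves is ignored.

    def markup_bits(segments, models, bits, token_symbol):
        """segments: list of (label, text) pairs in document order; label is None
        for plain text.  Each token is coded by its own model from a null prior
        context; in the outer stream it is replaced by a single per-type symbol,
        which is then coded (with the surrounding plain text) by the t model."""
        outer = []
        total = 0.0
        for label, text in segments:
            if label is None:
                outer.append(text)
            else:
                total += bits(text, models[label])
                outer.append(token_symbol[label])
        total += bits("".join(outer), models["t"])
        return total

    # Example shape of the input for "In 1998, $2." :
    # segments = [(None, "In "), ("d", "1998"), (None, ", "), ("m", "$2"), (None, ".")]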


5.1 ALGORITHM FOR MODEL ASSIGNMENT

Our algorithm takes a string of text and works out the optimal sequence of models that would produce it, along with their placement. As a small example, the input string In•1998,•$2. produces the output In•<d>1998</d>,•<m>$2</m>. Here, t is a model formed from all the training text with every token replaced by a single-symbol code. The characters 1998 have been recognized as a date token, because the date model compresses them best; $2 has been recognized as a money token, because the money model compresses them best. The remainder has been recognized as plain text.

The algorithm works by processing the input characters to build up a tree in which each path from root to leaf represents a string of characters that is a possible interpretation of the input. The paths are alternative output strings, and begin-token and end-token symbols appear on them. The entropy of a path can be calculated by starting at the root and coding each symbol along the path according to the model that is in force when that symbol is reached. The context is re-initialized to a unique starting token whenever begin-token is encountered, and the appropriate model is entered. On encountering end-token, it is encoded and the context reverts to what it was before. For example, the character that follows In•<d>1998</d> will be predicted by the four characters In•<d>, where <d> here stands for the single-symbol code representing the occurrence of a date. This context is interpreted in the t model.

What causes the tree to branch is the insertion of begin-token symbols for each possible token type, and the end-token symbol for the currently active token type (in order that nesting is properly respected). To expand the tree, a list of open leaves is maintained, each recording the point in the input string that has been reached and the entropy value up to that point. The lowest-entropy leaf is chosen for expansion at each stage.

Unless the tree and the list of open leaves are pruned, they grow very large very quickly. Currently, three separate pruning operations are applied that remove leaves from the list and therefore prevent the corresponding paths from growing further. First, if two leaves are labeled with the same character and have the same preceding k characters, the one with the greater entropy is deleted. Here, k is the model order (default 5). This would be a "safe" pruning criterion that could not possibly eliminate any path that might turn out to be the best one, were it not for the fact that a different sequence of unterminated models might exist in the two paths, some of which, when terminated, might cause the contexts to differ between the paths. At present, we do not check for this eventuality. The second pruning operation is to delete any leaf that (a) has a larger entropy than the best path so far, and (b) represents a point that lags more than k symbols behind that best path. This pruning heuristic is not guaranteed to be safe: it may delete a path that would ultimately turn out to be best. The reason is that the price in bits to enter any model is unbounded; yet so is the benefit that may eventually accrue from using the model. The third is to restrict the open-leaves list to a predetermined length.

These pruning strategies cause a small number of identification errors (discussed below); other strategies are under investigation.

5.2 RESULTS WHEN LOCATING TOKENS IN CONTEXT

To evaluate the procedure for locating tokens in context, we used the training data from the same 19 issues of the newsletter that were used previously, and the same single issue for testing. Error counting is complicated by the presence of multiple errors on the same token. Counting all errors on the same token as one, there are a total of 47 errors: 2 identification errors noted in Section 4.2 for in-context discrimination, 24 failures to recognize a token as being different from plain text, 5 incorrect positive identifications, 9 boundary errors, 3 phone/fax absorption errors, and 4 pruning errors.

In addition, a further 9 "errors" occurred which were actually errors in the original markup, made by the person who marked up the test data.

The two identification errors and 24 failures to recognize are those noted in the confusion matrix of Table 2b. The five incorrect positive identifications picked out bonus as a date, son and Prophet as names, and field and Ida as organizations. Boundary errors are more interesting. Many involved names: Wilson, Kung-Kiu, Lashon and Sorin C were identified by the algorithm as names whereas it was Heather Wilson, Kung-Kiu Lau, Lashon Booker and Sorin C. Istrail that were marked up. Similar mistakes occurred with other token types: year for twelve-year, CBD for CBDNet, International Series and Computational Intelligence (separately) for International Series on Computational Intelligence, koza and .org (separately) for [email protected], J. of Symbolic Computation and "JSC)" separately for J. of Symbolic Computation (JSC), and comp.ai for


comp.ai.genetic. Phone/fax absorption errors are a special case of boundary errors involving phone and fax numbers: for example, +44-1752-232 540 was identified as a phone number whereas it was +44-1752-232 540 fax that was marked up, as a fax number. Finally, pruning errors are caused by the pruning strategy described in the previous section.
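The optimization can also be approximated without the tree search. The following dynamic-programming sketch is a simplification that handles only flat (non-nested) tokens, rather than the best-first search with pruning described above, but it pursues the same objective: choose token boundaries and types so that the whole string codes in the fewest bits. Here bits() is an entropy estimator such as the one sketched earlier, and overhead stands in for the cost of the begin-token and end-token symbols.

    def segment(text, models, bits, overhead=8.0, max_token_len=40):
        """Dynamic programming over token boundaries (flat tokens only).
        best[i] holds the cheapest cost of coding text[:i]; back[i] records how."""
        INF = float("inf")
        best = [0.0] + [INF] * len(text)
        back = [None] * (len(text) + 1)
        labels = [m for m in models if m != "t"]
        for i in range(1, len(text) + 1):
            # Option 1: treat the next character as plain text (context handling simplified).
            cost = best[i - 1] + bits(text[i - 1], models["t"])
            best[i], back[i] = cost, (i - 1, None)
            # Option 2: end a token of some type at position i.
            for j in range(max(0, i - max_token_len), i):
                for label in labels:
                    cost = best[j] + bits(text[j:i], models[label]) + overhead
                    if cost < best[i]:
                        best[i], back[i] = cost, (j, label)
        # Trace back the chosen segmentation (adjacent plain-text characters
        # come back as separate one-character segments in this sketch).
        segments, i = [], len(text)
        while i > 0:
            j, label = back[i]
            segments.append((label, text[j:i]))
            i = j
        segments.reverse()
        return segments, best[len(text)]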


6. USING LANGUAGE MODELS TO RECOGNIZE STRUCTURES

So far we have not taken advantage of the potentially hierarchical nature of the tokens that are inferred. We use the term "soft parsing" to denote inference of what is effectively a grammar from example strings, using exactly the same compression methodology. After analyzing the errors noted above, we refined the markup of the training documents to use forename, initial, and surname for names; username, domain, and top-level domain for email addresses; and embedded phone numbers for fax numbers. Examples are

To summarize the success of token identification, 449 out of a total of 535 names were recognized correctly in isolation (Table 3a gives these figures as per-document averages, quantized into integers), for a success rate of 83.9%. This figure improved to 478 correctly recognized names, that is 89.4%, when context was taken into account (Table 3b). Other token types fared similarly: the recognition rate for phone numbers, for example, increased from 75% without context to 90% in context. However, while 84% of the 456 sources were correctly identified without context, this dropped to 60% when context was taken into account. The reason is that many sources (37%) were misrecognized as plain text, an option that was unavailable in the experiments on tokens in isolation. Thus in this case only 3% of sources were misidentified as other token types.

the name Ian•H.•Witten marked up into forename, initial, and surname; the email address [email protected] marked up into username, domain, and top-level domain; and the fax number +64-7-856-2889•fax with the phone number marked up inside it. (Of course, different symbols were used for f, s, u, d, and t to avoid confusion with the other models.) Then, during training, models are built for each component of a structured item, as well as the item itself, and when the test file is processed to locate tokens in context, these new tags are inserted into it too. The algorithm described above accommodates nested tokens without any modification at all.

The success rate for token location is harder to quantify, because some errors involve slight misplacement of the boundary, and a significant number of other errors can be attributed to mistakes made in the markup. We observed a fairly small increase in actual misidentification of the token type, but a significant number of further instances where tokens in the text were missed; however, these have not yet been quantified.

The results were mixed. Some errors were corrected (e.g. Kung-Kiu Lau and Sorin C. Istrail were correctly marked), but other problems remained (e.g. the fax/phone number mix-up) and a few new ones were introduced. Some of these are caused by the pruning strategies used; others are due to insufficient training data.

The examples we have given also highlight some weaknesses of the use of compression models for identifying and locating tokens. The difficulty we observed with a particular email address, for instance, highlights the fact that for some lexical items (in this case an artificial rather than a natural one), the appearance of a certain character ("@") is very strong evidence of the token type that should not easily be outvoted by unrepresentative language statistics.

Despite these inconclusive initial results, we believe that soft parsing will be a valuable technique in situations with stronger hierarchical context (e.g. references and tables).

Applications of text mining based on language modeling are legion; we have just begun to scratch the surface. Identifying references in documents, locating information in tables (such as stock prices) expressed in either HTML or plain text, inferring document structure, finding names, addresses, and phone numbers on Web pages, data detectors of any kind—all of these could be accomplished without any explicit programming.

7. CONCLUSIONS

In this paper we have, through an extended example, argued the case that statistical language modeling techniques are valuable for text mining. Different kinds of tokens in text can be classified because different models compress them better. Good results are achieved for tokens in isolation. Taking each token's context into account reduces the error rate at the expense of an increase in the number of symbols that are mistaken for plain text—a tradeoff that is adjustable. The dynamic programming method allows this technique to be used to identify tokens in running text with no clues as to where they begin and end. The methodology works with hierarchically-defined tokens, where each token can contain subtokens. No explicit programming is required for token identification: rather, machine learning methodology is used to acquire identification information automatically from a marked-up set of training documents. The result is automatic location and classification of the items contained in test documents.


Great power is gained from the hierarchical nature of the representation. A reference, for example, contains tokens that represent names, year of publication, title, journal, volume, issue number, page numbers, month of publication. These tokens will be separated by short fillers involving spaces, punctuation, quotation marks, the word “and”, etc. The fillers will be quite regular, and although the tokens that appear, and the order in which they appear, will vary somewhat, the number of different possibilities is not large.

In order to investigate the application of language modeling to text mining in a constrained context, the experiments reported here have been self-contained. All training and testing has taken place on issues of a particular electronic magazine—we have eschewed the use of any additional information. However, in practice, it is easy and very attractive to prime models from external sources—lists of names, organizations, geographical locations, information sources, even randomly generated dates, sums of money, and phone numbers. Priming will greatly reduce the volume of training data that needs to be marked up manually, making text mining practical even with small amounts of training data.
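As an illustration only (the lists here are hypothetical, not from our experiments), priming amounts to nothing more than pooling counts from an external list with whatever marked-up training data is available; build_model() is the simplified count-based estimator sketched earlier.

    external_gazetteer = ["Hamilton", "Auckland", "Wellington", "Christchurch"]  # hypothetical external list
    marked_up_locations = ["Providence, RI", "NM", "Silicon Valley"]             # examples from marked-up documents

    def primed_model(external_list, marked_up_examples, build_model):
        # Counts from both sources are simply pooled; the external list supplies
        # useful statistics even before any documents have been marked up.
        return build_model(list(external_list) + list(marked_up_examples))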

ACKNOWLEDGMENTS

We are very grateful to Stuart Inglis and John Cleary, who provided valuable advice and assistance.

REFERENCES

Bell, T.C., Cleary, J.G. and Witten, I.H. (1990) Text compression. Prentice Hall, Englewood Cliffs, NJ.

Chinchor, N.A. (1999) "Overview of MUC-7/MET-2." Proc. Message Understanding Conference MUC-7.

Cleary, J.G. and Witten, I.H. (1984) "Data compression using adaptive coding and partial string matching." IEEE Trans. on Communications, Vol. 32, No. 4, pp. 396–402.

Grover, C., Matheson, C. and Mikheev, A. (1999) "TTT: Text Tokenization Tool." http://www.ltg.ed.ac.uk/

Howard, P.G. (1993) The design and analysis of efficient lossless data compression systems. PhD thesis, Brown University, Providence, RI.

Nardi, B.A., Miller, J.R. and Wright, D.J. (1998) "Collaborative, programmable intelligent agents." Comm. ACM, Vol. 41, No. 3, pp. 96–104.

Reif, J.H. and Storer, J.A. (1997) "Optimal lossless compression of a class of dynamic sources." Proc. Data Compression Conference, edited by J.A. Storer and J.H. Reif. IEEE Computer Society Press, Los Alamitos, CA, pp. 501–510.

Teahan, W.J. (1997) Modelling English text. PhD thesis, University of Waikato, NZ.

Teahan, W.J., Inglis, S., Cleary, J.G. and Holmes, G. (1997) "Correcting English text using PPM models." Proc. Data Compression Conference, edited by J.A. Storer and J.H. Reif. IEEE Computer Society Press, Los Alamitos, CA, pp. 289–298.

Tkach, D. (1997) Text mining technology: Turning information into knowledge. IBM White Paper.

Volf, P.A.J. and Willems, F.M.J. (1997) "Switching between two universal source coding algorithms." Proc. Data Compression Conference, edited by J.A. Storer and J.H. Reif. IEEE Computer Society Press, Los Alamitos, CA, pp. 491–500.