Automatic Extraction and Linking of Person Names in Legal Text

Christopher Dozier & Robert Haschart
Computer Science Research Department
West Group
610 Opperman Drive
Eagan, MN 55123, USA
[email protected]
Abstract

This paper describes an application that creates hypertext links in text from named individuals to personal biographies. Our system creates these links by extracting MUC-style templates from text and linking them to biographical information in a relational database. The linking technique we use is based on a naïve Bayesian inference network. In particular, our application involves the extraction of attorney and judge names from American caselaw and the creation of links between the names and a file containing their biographies. It is a real world commercial application that involves the automatic creation of millions of reliable hypertext links in millions of documents. The techniques described in this paper could be applied to other domains besides law. Our experiments show that, by combining information extraction and Bayesian record linkage techniques, we can automatically extract and match attorney and judge names in caselaw to biographies with an accuracy rivaling that of a human expert.
Introduction

At West Group, we provide our customers with online access to millions of American caselaw documents and to the biographies of approximately 1,000,000 U.S. attorneys and 20,000 judges. To add value to these data, we have built a system that automatically creates hypertext links from attorney and judge names in caselaw to their biographies. Linking attorney and judge names in this way gives users the ability to access biographical information about a person immediately and accurately. It also gives system developers the means to build highly precise inverted indexes for specific individuals; these indexes can be used to retrieve the cases associated with a particular attorney or judge. Without these links, a user seeking biographical information about an attorney or judge would have to search the biographical database him/herself using a combination of first name, last name, and other information. Our system accomplishes this automatically. Figure 1 shows a sample link between an attorney name in U.S. caselaw and the attorney's biography page, as generated by the system.

Our system combines information extraction techniques with a record linkage technique that makes use of a Bayesian network. Although many systems have been implemented to extract data from text and many systems have been built to link relational database records, we know of no other system that combines these techniques to automatically generate hypertext links, nor do we know of any other system that uses Bayesian networks to match person name entities.

The extraction portion of our system is similar to template extraction systems described in the Message Understanding Conference proceedings (MUC-6, 1995) and elsewhere (Appelt et al., 1993; Grishman, 1997). Our extraction process relies on a finite state machine that identifies paragraphs in
caselaw containing attorney and judge names, and a semantic parser that extracts attorney and judge template information from those paragraphs.

The record linkage portion of our system uses a Bayesian network to match and link attorney and judge templates to biographical records. This network computes the probability that a given biographical record matches the person specified in an extracted attorney or judge template. To compute this match likelihood, we treat first name, middle name, last name, firm, city, state, and court information as independent pieces of match evidence. We compute the prior probability of a match by calculating the probability that a randomly selected biographical record will match a template. We then compute conditional match probabilities for each piece of evidence using a manually tagged training set. For each piece of evidence, we compute the conditional probability that a biographical record matches a template when that piece of evidence matches exactly, matches in a strong fuzzy way, matches in a weak fuzzy way, is unknown, or mismatches. We define strong fuzzy and weak fuzzy matches in later sections; roughly, a piece of match evidence matches in a fuzzy way when it is compatible with the corresponding piece of evidence but does not match it exactly.

(Newcombe, 1988) describes a record linkage technique similar to ours that relies on odds calculations. Newcombe applies his method to the problem of matching records in structured databases. Our system differs in that we create a structured record from text in the form of a template and use a Bayesian method rather than an odds calculation. (Borgman & Siegfried, 1992) gives a good overview of name matching and record linkage systems, and (Pearl, 1988) provides a good description of Bayesian networks.

Using these automatic methods, we are able to link attorney names in attorney paragraphs with 0.99 precision and 0.92 recall when compared with links created manually, and we are able to link judge names with 0.98 precision and 0.90 recall. In the sections that follow, we describe our system and the experiments we conducted to measure its precision and recall.
Overview of System

Our system consists of four main modules. The first module extracts attorney and judge templates from caselaw. The second matches the extracted templates to biography records, which have id numbers. The third inserts hypertext links into caselaw using the id numbers of the people identified by the matching module. And the fourth loads the biographical records from the relational database into West's IR system as text. Figure 2 shows a diagram of the components of our system.

The extraction module performs three steps. First, it identifies paragraphs within caselaw documents that contain attorney names, judge names, dates, and court names. Second, it parses the sentences in the attorney and judge paragraphs into constituency trees. Third, it creates a template for each attorney and judge name found. The attorney template contains the attorney's name, state, city, and firm as well as the date on which the case was decided. The judge template contains the judge's name and court as well as the case's date.

The matching module finds the biography records that most probably match the extracted templates. For each template, the matching module reads candidate biography records and computes the probability that a given candidate record refers to the same person referenced in the template. The module then selects the biography record with the highest match probability and inserts the biography id into the template record.
[Diagram: the four modules, (1) Extract Templates, (2) Match Templates, (3) Insert Links, and (4) Load Documents, connecting caselaw documents, extracted templates, the relational database of biographies, and the hyperlinked caselaw and biography documents.]

Figure 2: Overview of Hypertext Tagging System

The extraction and matching modules are discussed in the following sections.
Extraction of Templates

In the extraction module, the relevant paragraphs are identified, the paragraphs are parsed, and templates are generated for each attorney and judge name found.

Identifying Relevant Paragraphs

To identify paragraphs that contain attorney names, judge names, date, and court information, the extraction module uses a sophisticated set of programs that rely on cue phrases and the position of paragraphs within the document. The paragraph types we are interested in are attorney, judge, date decided, and court. The attorney paragraphs identify the attorneys litigating a case, the parties the attorneys represent, the attorneys' firms, and the cities in which the attorneys practice. The paragraph in figure 3 is an example of an attorney paragraph.
H. Patrick Weir, Jr., Lee Hagen Law Office, ltd., Fargo, N.D., Jeffrey J. Lowe, Gray & Ritter, P.C., St. Louis, MO, and Joseph P. Danis and John J. Carey, Carey & Danis, LLC, St. Louis, MO, for plaintiff and appellant.
Figure 3: Example Attorney Paragraph
The judge paragraphs list the judges deciding a case and often include a verb phrase indicating whether a particular judge agrees with or dissents from the majority opinion. The paragraph in figure 4 is an example of a judge paragraph.
Caselaw excerpt:

23 Fla. L. Weekly D1564 BROWN & WILLIAMSON TOBACCO CORPORATION v. Grady CARTER and Mildred Carter, Appellees. No. 96-4831. District Court of Appeal of Florida, First District. June 22, 1998. An appeal from the Circuit Court for Duval County. Brian J. Davis, Judge. Robert P. Smith and Robert A. Manning, Tallahassee; J.W. Prichard, Jr. and Robert B. Parrish of Moseley, Warren, Prichard & Parrish, Jacksonville, for Appellant. Thomas E. Bezanson, Thomas E. Riley and Steven L. Vollins of Chadbourne & Parke, LLP, New York, NY, of Counsel.

Linked biography record:

Name: Smith, Robert P. Jr.
Law Firm: Hopping, Green, Sams & Smith
Position: Shareholder
City: Tallahassee
County: Leon
State: Florida
Address: P.O. Box 6526, 123 S. Calhoun St., Tallahassee, FL 32301-1517
Country: U.S.A.
Phone: (904) 222-7500
Fax: (904) 224-8551
Education: University of Florida College of Law, Gainesville, Florida
Born: 1932
Admitted: Florida 1957
Practice: Appellate; Litigation

Figure 1: Example Link Between Caselaw and Biography
Justice STEVENS announced the judgment of the Court and delivered the opinion of the Court with respect to Parts I, II, and VI, and an opinion with respect to Parts III, IV, and V, in which Justice SOUTER, Justice GINSBURG, and Justice BREYER join.
Figure 4: Example Judge Paragraph

Date and court paragraphs contain the date on which the case was decided and the name of the court in which it was decided. Examples of a court paragraph and a date paragraph are shown below.

Court of Appeal, Second District, Division 7, California.
Feb. 24, 1999.
Figure 5: Court Name and Case Date Paragraphs

By convention, attorney, judge, date, and court paragraphs have distinctive syntax and predictable ordering within particular jurisdictions. At West, we have for many years been tagging these paragraphs automatically to support fielded searches in caselaw, and for the name linking application we reused these routines to handle the paragraph segmentation task. The finite state machine shown in figure 6 provides a simplified representation of how we use paragraph ordering to segment and label paragraphs in caselaw. To determine whether a paragraph belongs to a given category, such as attorney or judge paragraph, we rely on cue words and phrases as well as paragraph order. For example, if a paragraph contains a firm name phrase and the phrase "representing the plaintiff" and is followed by a date paragraph, the extraction program labels it an attorney paragraph. Similarly, if a paragraph contains a single sentence ending in "concur" and is followed by an attorney paragraph, it labels it a judge paragraph. (Wasson, 1998) describes a system that segments the front matter of news documents in a manner somewhat analogous to our caselaw document segmentation.
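To make the cue-phrase approach concrete, the sketch below shows how cue patterns and one-paragraph lookahead might combine to label paragraphs. It is a minimal illustration in Python, not the production code: the patterns and the label_paragraphs function are our own inventions, and the real routines use far richer, jurisdiction-specific phrase lists plus the full state machine of figure 6.

```python
import re

# Illustrative cue patterns only; assumed for this sketch.
ATTORNEY_CUE = re.compile(
    r"\bfor (the )?(plaintiffs?|defendants?|appellants?|appellees?|respondents?)\b", re.I)
JUDGE_CUE = re.compile(r"\b(concur(s|red)?|dissent(s|ed)?|delivered the opinion)\b", re.I)
DATE_CUE = re.compile(
    r"\b(January|February|March|April|May|June|July|August|September|October|"
    r"November|December)\s+\d{1,2},\s+\d{4}\b")

def label_paragraphs(paragraphs):
    """Assign a coarse label to each paragraph using cue phrases plus a
    one-paragraph lookahead, mimicking the ordering constraints of the FSM."""
    labels = ["other"] * len(paragraphs)
    for i, para in enumerate(paragraphs):
        nxt = paragraphs[i + 1] if i + 1 < len(paragraphs) else ""
        if DATE_CUE.search(para) and len(para) < 80:
            labels[i] = "date"
        elif ATTORNEY_CUE.search(para):
            labels[i] = "attorney"
        elif JUDGE_CUE.search(para) and ATTORNEY_CUE.search(nxt):
            # A short "... concur." paragraph followed by an attorney
            # paragraph is labeled as a judge paragraph.
            labels[i] = "judge"
    return labels
```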
[Diagram: a finite state machine whose states include Start, Cite, Docket, Court, Party, Versus, Attorney, Judge, Date, Opinion, and End.]

Figure 6: Finite State Machine for Paragraph Segmentation

Parsing Sentences in Attorney and Judge Paragraphs

The second step in our extraction process involves parsing the paragraphs identified in the first step. Our purpose in parsing court, date, attorney, and judge paragraphs is to extract attorney and judge names, along with information related to the names, so that we can match each name to a particular attorney or judge biography.
Input Paragraph: Robert P. Smith and Robert A. Manning, Tallahassee; J.W. Prichard, Jr. and Robert B. Parrish of Moseley, Warren, Prichard & Parrish, Jacksonville, for Appellant.
Output tree (built by three successive FSMs: tokens, then simple phrases, then complex phrases):

Sentence
    Counsel phrase
        Attorney: Robert P. Smith
        Attorney: Robert A. Manning
        City: Tallahassee
    Counsel phrase
        Attorney: J.W. Prichard, Jr.
        Attorney: Robert B. Parrish
        Firm: Moseley, Warren, Prichard & Parrish
        City: Jacksonville
    Appellant phrase: for Appellant
Figure 7: Example of Parsed Attorney Paragraph

Parsing information out of the attorney paragraphs is not trivial, but we have found that the attorney paragraphs are amenable to parsing with a semantic parser. (McDonald, 1996) provides a good discussion of the advantages of semantic parsing for information extraction purposes. Figure 7 shows a sentence within an attorney paragraph and the constituency tree built for it by three successive finite state machines: the first FSM tokenizes the sentence and labels names, cities, and firms; the second collects these tokens into simple phrases; and the third groups the simple phrases into more complex phrases. To parse judge paragraphs, we apply a method similar to the one we use for attorney paragraphs. For the date and court paragraphs, we convert the parsed date and court names into normalized forms that can be placed in the template records and used for name matching purposes.

We should note that general-purpose name recognition software that simply identifies person names is not adequate for our application. We need to identify attorney names and judge names and be able to differentiate these names from the names of plaintiffs and defendants. A special-purpose grammar and name ontology is essential for our application.

Template Generation

Once our system has identified and parsed attorney and judge paragraphs, it generates a template record for each judge and attorney name within the paragraphs. Attorney templates contain fields that identify first, middle, and last name as well as firm, city, state, date, and character offset information. The character offset information is used by the insertion module to place hypertext links at the correct spot within the attorney paragraph. Judge templates contain the same information as attorney templates except that court name is used and firm, city, and state information are not. An example attorney template is shown below.

Field             Value
Last name         Parrish
First name        Robert
Middle name       B.
Name Suffix       Blank
Firm              Moseley, Warren, Prichard & Parrish
City              Jacksonville
State             FL
Date              June 22, 1998
Paragraph offset  75
Name Length       17

Table 1: Example Attorney Template
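To make the template record concrete, one can think of it as a small fielded data structure. The following dataclass is our own illustration, not the production schema; the example instance reproduces the values of Table 1.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AttorneyTemplate:
    """One extracted attorney mention; fields mirror Table 1."""
    last_name: str
    first_name: Optional[str] = None
    middle_name: Optional[str] = None
    name_suffix: Optional[str] = None
    firm: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    date: Optional[str] = None     # date the case was decided
    paragraph_offset: int = 0      # character offset used by the insertion module
    name_length: int = 0           # length of the name span in characters

parrish = AttorneyTemplate(
    last_name="Parrish", first_name="Robert", middle_name="B.",
    firm="Moseley, Warren, Prichard & Parrish", city="Jacksonville",
    state="FL", date="June 22, 1998", paragraph_offset=75, name_length=17)
```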
To attach the correct firm and city-state information to an attorney name, we look for firm, city, and state names that have been attached to the attorney name under counsel phrases. If no city or firm name is present, we leave these fields blank. If no state information is present in the attorney paragraph, we look for state information in the court name. For example, if the court name is the Supreme Court of California and no state name evidence can be found in the attorney paragraph itself, we set the state name in the template to California. To attach court name information to a judge name in a judge template, we need only attach the court name information collected from the court paragraph.
Matching Names

The job of the matching module is to find the biography record that most probably matches each template record created by the extraction module. The process of matching one fielded record (such as the template) to another fielded record (such as a biography record) is often referred to as record linkage. To accomplish its job, the matching module reads candidate biography records and, using a Bayesian belief network, computes the probability that a given candidate record refers to the same person referenced in the template. The belief network for attorneys and the belief network for judges are shown below.
[Diagram: evidence nodes for first name, middle name, name suffix, last name, firm, and city-state, each feeding the attorney match node.]

Figure 8: Attorney Matching Network
[Diagram: evidence nodes for first name, middle name, last name, name suffix, and court, each feeding the judge match node.]

Figure 9: Judge Matching Network

The processing steps of the match module for attorneys are as follows. For each template record, we read the set of all biography records whose last names match or are compatible with the last name in the template; we call this set the candidate records. For each candidate record, we determine how well the first name, middle name, last name, name suffix, firm, and city-state match the template fields. Using the degree to which each piece of evidence matches, we compute a match probability score for the linkage. The candidate record with the highest match probability is the record we use to build our hypertext link.

The processing steps of the judge match module are the same as those of the attorney match module except that the pieces of match evidence are first name, middle name, last name, name suffix, and court name. Of course, the conditional probabilities pertaining to judges are derived from judge training data and are distinct from the attorney name probabilities.

Besides having the highest match probability, a candidate record must meet three additional criteria before we link it to the template. First, the date on the candidate record must be earlier than the template record date. Second, the highest scoring record must have a probability that exceeds a minimum threshold. Third, there must be only one candidate record with the highest probability; if two or more records share the highest score, no linkage is made.
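The selection logic just described can be summarized in a short sketch. Here bayes_score, admitted_date, and case_date are hypothetical names standing in for the real scoring function and date fields, and retrieval of candidates with compatible last names is assumed to happen upstream.

```python
def link_template(template, candidates, bayes_score, threshold):
    """Pick the single best-scoring candidate biography, applying the three
    acceptance criteria described above."""
    scored = [(bayes_score(template, c), c) for c in candidates
              if c.admitted_date < template.case_date]   # criterion 1: date order
    if not scored:
        return None
    best_score = max(score for score, _ in scored)
    if best_score < threshold:                           # criterion 2: minimum threshold
        return None
    best = [c for score, c in scored if score == best_score]
    if len(best) != 1:                                   # criterion 3: unique maximum
        return None
    return best[0]
```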
Bayesian Network for Matching

To evaluate how well a template matches a biography candidate record, we set up two Bayesian networks. One network uses six pieces of evidence to match attorneys; the other uses five pieces of evidence to match judges. Within each network, we consider each piece of evidence to be independent of the others. That is, we assume that the state of any match variable does not affect the state of any other variable; for example, we assume that the fact that last names match exactly has no effect on whether the first names match exactly. This assumption of independence seems harmless and has the advantage of simplifying our match probability calculation.

We allow each match variable to have from three to five states, determined using the following heuristic. When we compare an element of a template and a candidate, our comparison can result in one of four mutually exclusive determinations. First, the elements match exactly. Second, the elements mismatch. Third, we do not have enough information to tell whether they match or mismatch. Fourth, the elements are compatible but fall short of matching exactly. We applied this heuristic to each of our variables. For some of the variables (namely, first name, middle name, suffix, court, city-state, and firm) we split the fourth determination into strongly compatible and weakly compatible, arriving at five states. For three of the variables (namely, last name, suffix, and court), we drop the third determination because it is never the case that we lack enough information to determine whether the candidate and template match or mismatch with respect to this evidence. Consequently, the last name variable reduces to three states, and the court and suffix variables reduce to four.

So, for the first name, middle name, city-state, and firm variables, we establish five match states: exact, strong fuzzy, weak fuzzy, unknown, and mismatch. For name suffix and court name, we establish four states: exact, strong fuzzy, weak fuzzy, and mismatch. And for last name, we establish three states: exact, strong fuzzy, and mismatch. We compute the belief in the correctness of a match using the following form of Bayes' rule:
P(M|E) = [ P(M) ∏i P(Ei|M) ] / [ P(M) ∏i P(Ei|M) + P(¬M) ∏i P(Ei|¬M) ]
P(M|E) is the probability that a template matches a candidate record given a certain set of evidence. P(M) is the prior probability that a template and biography record match (i.e., refer to the same person), and P(¬M) is the prior probability that they do not. For attorneys, P(M) is 0.000001 and P(¬M) is 0.999999, since there are approximately 1,000,000 attorney records in the biography database. For judges, P(M) is 0.00005 and P(¬M) is 0.99995, since there are approximately 20,000 judge records.

P(Ei|M) is the conditional probability that Ei takes on a particular value given that a template matches a biography record. For example, if we let E1 stand for first name match evidence, the probability that the first names in the attorney template and candidate record match exactly, given that the template and candidate match, is P(E1=exact|M); Table 2 shows that P(E1=exact|M) = 0.94573. P(Ei|¬M) is the conditional probability that Ei takes on a particular value given that a template does not match a biography record; Table 2 shows that P(E1=exact|¬M) = 0.00925.
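Under the independence assumption, the posterior is simply a product over the evidence variables, normalized across the match and non-match hypotheses. A minimal Python rendering of this computation (our own illustration; the names are invented):

```python
from math import prod

def match_probability(evidence_states, cond_probs, prior_match):
    """Posterior P(M|E) under the naive Bayes independence assumption.

    evidence_states: e.g. {"first": "exact", "middle": "unknown", ...}
    cond_probs[var][state]: a pair (P(state|M), P(state|not M)) read off
    rows like those of Table 2.
    """
    p_m, p_not_m = prior_match, 1.0 - prior_match
    like_m = prod(cond_probs[v][s][0] for v, s in evidence_states.items())
    like_not_m = prod(cond_probs[v][s][1] for v, s in evidence_states.items())
    return (p_m * like_m) / (p_m * like_m + p_not_m * like_not_m)
```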
We determine the conditional probabilities for attorneys and judges by using a manually tagged training set of 7,186 attorney names and 5,323 judge names. We compute the conditional probability for first names matching exactly when the attorney template and candidate match by using the following formula:
P(E1 = exact | M) = (1 − a)/x + a(y/z)

where:
x = the number of evidence states (five for first name)
y = the number of attorney match pairs in which the first names matched exactly
z = the total number of attorney match pairs
a = a smoothing constant (we used 0.999999)

We compute all other conditional probabilities in a similar manner.

First Name Evidence

Some of the rules for the first name evidence variable are as follows. If the template and candidate first name strings match exactly, the degree of match is exact. If one name is a nickname of the other, or if one name is an initial only and matches the first letter of the other name, the degree of match is strong fuzzy. If one first name is within an edit distance of two of the other, the degree of match is weak fuzzy. If either the template first name or the candidate first name is unspecified, the degree of match is unknown. Otherwise, the degree of match is mismatch. An example of a strong fuzzy match would be a template first name of "Bob" and a candidate "Robert"; an example of a weak fuzzy match would be a template "Roberto" and a candidate "Robert".

Table 2 below shows the conditional probability matrix for the attorney and judge first name match variables. Each row contains the probability distribution associated with a match or mismatch conditional assumption; for example, the row labeled "Attorney M" shows the conditional probabilities under the condition that the attorney template matches the attorney candidate. Each column contains one of the five mutually exclusive states the first name match variable can assume, and each cell contains the conditional probability of a given state under a given conditional assumption. All of these probabilities derive from the training data. Tables 3 through 8 for the other variables are analogous to Table 2.
              Mismatch    Unknown     Weak Fuzzy  Strong Fuzzy  Exact
Attorney M    0.00070     0.00195     0.00390     0.04773       0.94573
Attorney ¬M   0.97872     0.00028     0.00184     0.00991       0.00925
Judge M       0.00301     0.89367     0.00094     0.01071       0.09168
Judge ¬M      0.103513    0.893848    0.000146    0.001426      0.001067

Table 2: First Name Evidence Matrix
Note that the great difference between the attorney and judge probabilities is due to the fact that it is far more likely for an attorney to have his or her first name mentioned in caselaw than it is for a judge; judges are usually referenced only by last name, as in the phrase "Justice Smith".

Middle Name Evidence

The rules for the middle name evidence are essentially the same as for first name, with two differences. If both middle names are unspecified, the degree of match is exact. And if the template contains a blank middle name and the candidate middle name contains an initial, the degree of match is weak fuzzy.
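A rough sketch of how the first name rules might be coded is shown below; the nickname table and the Levenshtein routine are tiny stand-ins for the real resources, and the state names match those used in the tables.

```python
# Tiny illustrative nickname table; the real resource is much larger.
NICKNAMES = {("bob", "robert"), ("bill", "william"), ("dick", "richard")}

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    d = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(b) + 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (a[i - 1] != b[j - 1]))
    return d[len(b)]

def first_name_state(template_name, candidate_name):
    t = (template_name or "").lower().rstrip(".")
    c = (candidate_name or "").lower().rstrip(".")
    if not t or not c:
        return "unknown"
    if t == c:
        return "exact"
    if (t, c) in NICKNAMES or (c, t) in NICKNAMES:
        return "strong fuzzy"                # nickname match
    if (len(t) == 1 or len(c) == 1) and t[0] == c[0]:
        return "strong fuzzy"                # initial-only match
    if edit_distance(t, c) <= 2:
        return "weak fuzzy"
    return "mismatch"
```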
              Mismatch    Unknown     Weak Fuzzy  Strong Fuzzy  Exact
Attorney M    0.00404     0.00654     0.13485     0.06763       0.78695
Attorney ¬M   0.68003     0.12242     0.12862     0.03921       0.02972
Judge M       0.00263     0.09299     0.71013     0.08210       0.11215
Judge ¬M      0.085929    0.095118    0.718558    0.004860      0.095536

Table 3: Middle Name Evidence Matrix
Note that the difference in probability distributions for judges and attorneys is mostly due to the fact that judge middle names are not usually specified in caselaw.

Last Name Evidence

Some of the rules for the last name evidence are the following. If the template and candidate last name strings match exactly, the degree of match is exact. If part of a two-part name matches a single name, the degree of match is strong fuzzy. Otherwise, the degree of match is mismatch. An example of a strong fuzzy match would be a template last name of "Smith-Turner" and a candidate "Smith".
              Mismatch    Strong Fuzzy  Exact
Attorney M    0.0000003   0.01225       0.98775
Attorney ¬M   0.99910     0.00165       0.00025
Judge M       0.01803     0.02160       0.95998
Judge ¬M      0.999201    0.000138      0.000571

Table 4: Last Name Evidence Matrix
Name Suffix Evidence

Some of the rules for name suffix matching are the following. If a name suffix such as "Jr." is specified in both the template and the candidate record and the suffixes are the same, the degree of match is exact. If both the template and the candidate have no suffix, the degree of match is strong fuzzy. If one suffix is specified and the other suffix is blank, the degree of match is weak fuzzy. And if both suffixes are specified and mismatch, the degree of match is mismatch.
              Mismatch     Weak Fuzzy  Strong Fuzzy  Exact
Attorney M    0.00000025   0.01809     0.91080       0.07111
Attorney ¬M   0.00367      0.11430     0.87166       0.01037
Judge M       0.00000025   0.10013     0.89029       0.00958
Judge ¬M      0.000715     0.127851    0.870471      0.000963

Table 5: Name Suffix Evidence Matrix
City State Evidence

Some of the rules for city-state matching are the following. If both the city and state names match exactly, the degree of match is exact. If the states match and the cities are in the same county, the degree of match is strong fuzzy. If the states match but the city is unspecified in the template, the degree of match is weak fuzzy. If the city and state are unspecified in the template, the degree of match is unknown. If the states do not match, the degree of match is mismatch.
              Mismatch   Unknown   Weak Fuzzy  Strong Fuzzy  Exact
Attorney M    0.02588    0.00153   0.21291     0.02714       0.73253
Attorney ¬M   0.95303    0.00400   0.03547     0.00155       0.00595

Table 6: City State Evidence Matrix
Firm Name Evidence

Some of the rules for firm matching are the following. If all of the tokens in the template firm name match the tokens in the candidate firm name, the degree of match is exact. If all but one of the tokens match, the degree of match is strong fuzzy. If all but two of the tokens match, the degree of match is weak fuzzy. If one or both of the firm names is unspecified, the degree of match is unknown. If more than two of the firm tokens mismatch, the degree of match is mismatch.
              Mismatch   Unknown   Weak Fuzzy  Strong Fuzzy  Exact
Attorney M    0.08043    0.60785   0.02087     0.02950       0.26134
Attorney ¬M   0.40850    0.58870   0.00187     0.00091       0.00002

Table 7: Firm Name Evidence Matrix
Court Name Evidence

The court match variable can take on one of four values: exact, strong fuzzy, weak fuzzy, and mismatch. The rules for court name comparison are complex due to two factors. The first is that a single court can often be referred to in a number of different ways. The second is that some courts allow judges from other courts to serve in the capacity of visiting judge. To compare court names, we convert the names to a normalized form that identifies the court's system, the court's level of jurisdiction, and the court's division within jurisdiction. If the systems, jurisdiction levels, and divisions match, the degree of match is exact. If the systems and jurisdiction levels match, the degree of match is strong fuzzy. If only the systems match, the degree of match is weak fuzzy. In all other cases, the degree of match is mismatch.
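A sketch of the court comparison, assuming court names have already been reduced to (system, jurisdiction level, division) triples; the NormalizedCourt type is our own illustration.

```python
from typing import NamedTuple

class NormalizedCourt(NamedTuple):
    system: str     # e.g. "Florida"
    level: str      # e.g. "appellate"
    division: str   # e.g. "First District"

def court_state(t: NormalizedCourt, c: NormalizedCourt) -> str:
    if t.system != c.system:
        return "mismatch"
    if t.level != c.level:
        return "weak fuzzy"      # only the systems match
    if t.division != c.division:
        return "strong fuzzy"    # systems and jurisdiction levels match
    return "exact"
```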
           Mismatch   Weak Fuzzy  Strong Fuzzy  Exact
Judge M    0.06669    0.09525     0.30434       0.53372
Judge ¬M   0.877317   0.085656    0.031814      0.005213

Table 8: Court Name Evidence Matrix
Example Calculation

As an example of how we compute match probabilities for template-biography matches, consider the attorney template for "Robert B. Parrish" and the three candidate biography records shown in Table 9: Robert B. Parrish of Jacksonville, Robert Parrish of Tampa, and Bob B. Parrish, Jr. of Jacksonville. We compute the match probabilities for these candidates to be 0.999, 0.026, and 0.659, respectively. The best match for our template lawyer is therefore Robert B. Parrish of Jacksonville, and this is the one we select as a match. The exact calculation for Robert B. Parrish of Jacksonville is shown in figure 10.
0.999 = (0.000001 * 0.94573 * 0.78695 * 0.98775 * 0.91080 * 0.73253 * 0.26134) /
        ((0.000001 * 0.94573 * 0.78695 * 0.98775 * 0.91080 * 0.73253 * 0.26134) +
         (0.999999 * 0.00925 * 0.02972 * 0.00025 * 0.87166 * 0.00595 * 0.00002))

Figure 10: Calculation of Match Probability for Robert B. Parrish of Jacksonville
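As a quick check (our own verification, not part of the original figure), plugging the prior and the conditional probabilities from Tables 2 through 7 into the formula reproduces the figure's result:

```python
prior = 0.000001
like_m = 0.94573 * 0.78695 * 0.98775 * 0.91080 * 0.73253 * 0.26134
like_not_m = 0.00925 * 0.02972 * 0.00025 * 0.87166 * 0.00595 * 0.00002
p = prior * like_m / (prior * like_m + (1 - prior) * like_not_m)
print(p)  # 0.99999994..., reported as 0.999 in Figure 10
```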
               Last Name  First Name    Middle Name  Name Suffix   CityState        Firm              Match Prob
Template Data  Parrish    Robert        B.           Blank         Jacksonville FL  Moseley & Warren
Cand 1         Parrish    Robert        B.           Blank         Jacksonville FL  Moseley & Warren  0.999
Evid 1         Exact      Exact         Exact        Strong Fuzzy  Exact            Exact
Cand 2         Parrish    Robert        Blank        Blank         Tampa FL         Smith & Jones     0.026
Evid 2         Exact      Exact         Unknown      Strong Fuzzy  Weak Fuzzy       Mismatch
Cand 3         Parrish    Bob           B.           Jr.           Jacksonville FL  Brown & Young     0.659
Evid 3         Exact      Strong Fuzzy  Exact        Weak Fuzzy    Exact            Mismatch

Table 9: Example Calculations for Attorney Match
Experiments

To evaluate our program, we compared its performance against the performance of a human expert on 600 caselaw documents. We took the links generated by the human expert to be the gold standard. In the 600 documents, our expert successfully linked 3,838 attorney names and 3,849 judge names to unique biographies. We then automatically linked attorney names and judge names using match probability thresholds of 0.99, 0.90, 0.75, 0.50, 0.25, 0.10, and 0.01 and compared the results. Our results for attorney names are shown below in Table 10.

Threshold  Manually Matched  Auto Matched  Auto Matched Correct  Precision  Recall  F-Measure
0.99       3838              2549          2542                  0.997      0.662   0.796
0.90       3838              3265          3254                  0.996      0.848   0.916
0.75       3838              3408          3393                  0.996      0.884   0.937
0.50       3838              3477          3458                  0.995      0.901   0.946
0.25       3838              3540          3515                  0.993      0.916   0.953
0.10       3838              3674          3599                  0.980      0.938   0.959
0.01       3838              3816          3654                  0.958      0.977   0.967

Table 10: Attorney Name Linking Test Results

For attorney names, precision is very high at all levels and increases as the threshold rises. Recall is high at lower thresholds and declines as the threshold rises.
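The measures in Tables 10 through 13 follow the standard definitions; a small sketch computing the threshold-0.25 attorney row:

```python
def precision_recall_f(n_gold, n_auto, n_correct):
    """Precision, recall, and F-measure as used in Tables 10 through 13."""
    p = n_correct / n_auto
    r = n_correct / n_gold
    return p, r, 2 * p * r / (p + r)

print(precision_recall_f(3838, 3540, 3515))  # approx. (0.993, 0.916, 0.953)
```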
Our results for judge names are shown below in Table 11.

Threshold  Manually Matched  Auto Matched  Auto Matched Correct  Precision  Recall  F-Measure
0.99       3849              238           238                   1.000      0.062   0.117
0.90       3849              616           616                   1.000      0.160   0.276
0.75       3849              2780          2728                  0.981      0.709   0.823
0.50       3849              2781          2729                  0.981      0.709   0.823
0.25       3849              3522          3456                  0.981      0.898   0.938
0.10       3849              3536          3458                  0.978      0.898   0.936
0.01       3849              3655          3555                  0.973      0.924   0.948

Table 11: Judge Name Linking Test Results

For judge names, precision is also very high, and recall is fairly high at thresholds of 0.25 and below. The very low recall at thresholds of 0.90 and above is due to the fact that most judges are specified only by their last name; to reach match probabilities above 0.90, we must have a first or middle name specified in the caselaw, and this is often not the case.

To get an idea of how our method performs against plausible automatic alternatives, we compared it to three other matching techniques for both attorney and judge names. For attorney names, we measured the precision and recall we would get (1) if we link attorney names only when the first name, middle name, last name, and city-state match exactly, (2) if we link attorney names only when the first, middle, and last names match exactly without regard to city-state or firm information, and (3) if we link attorney names only when the first and last names match exactly without regard to middle name, city-state, or firm. The results of this test are shown below and compared with the Bayesian matching method at threshold 0.25.
Method                                                          Precision  Recall  F-Measure
Bayesian, threshold 0.25                                        0.993      0.916   0.953
Exact match on first name, middle name, last name, city-state   0.994      0.422   0.592
Exact match on first, middle, and last name                     0.950      0.613   0.745
Exact match on first and last name only                         0.939      0.590   0.725

Table 12: Attorney Link Method Comparisons
For judge names, we measured the precision and recall we would get (1) if we link judge names only when the first name, last name, and court match exactly, (2) if we link judge names only when the last name and court name match exactly, and (3) if we link judge names only when the last names match exactly without regard to court. The results of this test are shown below and compared with the Bayesian matching method at threshold 0.25.
Method                                                Precision  Recall  F-Measure
Bayesian, threshold 0.25                              0.981      0.898   0.938
Exact match on first name, last name, and court name  1.000      0.047   0.090
Exact match on last name and court name               0.979      0.665   0.792
Exact match on last name only                         0.955      0.361   0.524

Table 13: Judge Link Method Comparisons
Discussion

This work shows that the constituent pieces of a name can be used as independent pieces of match evidence for the purpose of name matching, and that other pieces of match evidence may be attached to a name through semantic parsing. Our work suggests that one can use a relatively small number of pieces of evidence from a document to identify a person uniquely, provided one selects the evidence carefully. This type of approach could complement somewhat more coarse-grained approaches such as clustering paragraphs (Bagga & Baldwin, 1998). What makes this system viable is the existence of a fairly comprehensive biographical database and the fact that caselaw documents have predictable structures and syntax. Certainly not all domains have these characteristics. But if a domain does have them, or if one can somehow impose them on the domain, one can build hypertext links automatically with great precision and, as an added benefit, resolve cross document coreferences for the linked names.

Precision errors for attorney names usually resulted from the program assigning an attorney to a biography in situations where the human expert deemed the name unmatchable. In these cases, the names matched by the program were not incompatible, but the human expert thought there was too much uncertainty in the evidence to make a match reliably. For example, the program matched the biography of a James Jackson of Palm Springs, California to a James P. Jackson practicing in Sacramento, California in 1990; the human expert deemed the name James P. Jackson unlinkable. It is interesting to note that in these cases the names involved usually consisted of common first and last names. When the names involved were rare, the human expert often deemed a match correct in circumstances where she would have marked a common name unlinkable.

Recall errors for attorney names resulted from a variety of causes. In some cases, the attorney had gotten married and was now using her spouse's last name. In others, the last name had been misspelled in the caselaw or in the biography database. And in still others, the biography database contained two records for the same individual; in this last case, the program would find two records sharing the top score and would discard the link.

Precision errors for judges usually resulted from situations in which two judges with the same last name presided over compatible courts at the same time but only one of the judges had a biography. The solution, of course, is to add the second biography. But this does show that, when there is a minimal amount of match evidence available from the document, the precision of the program depends on the comprehensiveness of the biography database.

Recall errors for judges were mostly due to a limitation in the biographical data we used for matching: we did not incorporate past position information. So, for example, if a judge had been promoted from a lower court to a higher court, we were able to link his/her name only in the higher court decisions. When the human expert knew, through world knowledge or through the past position information in the biography itself, that the judge had served on the lower court, the expert could make the link and the program could not.
An interesting problem surfaced with respect to matching first names. When we first developed our algorithm, we used edit distance alone to assess whether a name matched in a weak fuzzy way. This allowed us to match names such as "Robert" and "Roberto", but it also caused us to match "Mark" with "Mary". To improve our matching routine, we changed our algorithm to count such cross-sex comparisons as mismatches, as sketched below.

We should also note that our Bayesian match method allowed us to link names properly even when the extraction module incorrectly attached a firm name but attached city-state correctly, and vice versa.
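A sketch of the revised weak fuzzy test (reusing the edit_distance helper from the earlier first name sketch; the gender table is a tiny illustrative stand-in for a real name-gender resource):

```python
# Assumed name-gender lookup; names absent from it simply bypass the check.
GENDER = {"mark": "m", "mary": "f", "robert": "m", "roberto": "m"}

def weak_fuzzy(template_name, candidate_name):
    t, c = template_name.lower(), candidate_name.lower()
    gt, gc = GENDER.get(t), GENDER.get(c)
    if gt and gc and gt != gc:
        return False             # cross-sex pairs count as mismatches
    return edit_distance(t, c) <= 2
```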
Conclusion

Using MUC-style information extraction techniques and a Bayesian record linkage technique, we have built a system that can create hypertext links automatically. Specifically, we built a system that can link attorney and judge names in caselaw to biography records with a precision of 0.99 and recall of 0.92 for attorney names, and a precision of 0.98 and recall of 0.90 for judge names. By creating this system, we have made it feasible to build hypertext links for the millions of attorney and judge names occurring in American caselaw. And as an added benefit, we have made it possible to search these names with unprecedented accuracy.

To expand this method to other domains, one would need to be able to predict the structure of the domain documents and to have an authority file of the named individuals or entities to which one is building links. If these two conditions exist or can be created, one should be able to create links automatically and thereby resolve cross document coreferences for these names. It is quite conceivable that the techniques described here could be applied to other domains, such as linking medical reports to the doctors who authored them or to the patients to whom they apply, or linking the names of public figures in news articles to biographical information.
Bibliographical References

Appelt, D., Hobbs, J., Bear, J., Israel, D., & Tyson, M. (1993). FASTUS: A finite-state processor for information extraction from real-world text. Proceedings of IJCAI-93, Chambery, France.

Bagga, A., & Baldwin, B. (1998). Entity-based cross-document coreferencing using the vector space model. Proceedings of the COLING-ACL '98 Conference (pp. 79--85), Montreal, Quebec, Canada.

Borgman, C. L., & Siegfried, S. L. (1992). Getty's Synoname and its cousins: A survey of applications of personal name-matching algorithms. Journal of the American Society for Information Science, 43 (pp. 459--476).

Grishman, R. (1997). Information extraction: Techniques and challenges. In Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, International Summer School, Frascati, Italy (pp. 13--27). Springer Verlag.

McDonald, D. (1996). The interplay of syntactic and semantic node labels in parsing. In H. Bunt & M. Tomita (Eds.), Recent Advances in Parsing Technology. Kluwer Academic Publishers.

Newcombe, H. (1988). Handbook of Record Linkage. New York: Oxford University Press.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers.

Proceedings: Sixth Message Understanding Conference (MUC-6) (1995). Columbia, MD, USA: Morgan Kaufmann.

Wasson, M. (1998). Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. Proceedings of the COLING-ACL '98 Conference (pp. 1364--1368), Montreal, Quebec, Canada.