Evaluating Document Retrieval in Patent Database: a ... - CiteSeerX


j from the ambiguous noun phrase x1 ... xn. The score assigned to a candidate pair is the sum of the scores for each occurrence of this pair in any compound nominal within the training corpus. For each occurrence, the score is maximum when the words xi and xj are the only words in the phrase, i.e., we have an unambiguous nominal xj xi, in which case the score is 1. For longer phrases, for non-adjacent words, and for pairs anchored at words toward the left of the compound, the score decreases proportionately.

Figure 1. Referencing Citations in Patents


2. For each set Xj = {xi + xj | i > j} of candidate pairs, rank alternative pairs by their scores.

3. Disambiguate by selecting the top choice from each set such that its score is above an empirically established global threshold, it is significantly higher than the second best choice from the set, and it is not significantly lower than the scores of pairs selected from other sets Xi.
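
The scoring and selection steps above can be sketched as follows. This is only an illustration: the text does not give the exact decay formula, so `occurrence_score` uses a hypothetical product of penalties that equals 1 for a two-word nominal, and the `threshold`, `margin` and `ratio` parameters (and all function names) are assumptions standing in for the empirically established values.

```python
from collections import defaultdict


def occurrence_score(phrase_len, head_idx, mod_idx):
    """Hypothetical score for one occurrence of modifier x_j with head x_i (j < i).

    Returns 1.0 for a two-word nominal and decays for longer phrases,
    non-adjacent words, and pairs anchored further to the left.
    The decay form is an assumption, not the formula used in the paper.
    """
    length_penalty = phrase_len - 1             # 1 for a two-word phrase
    gap_penalty = head_idx - mod_idx            # 1 when the words are adjacent
    left_penalty = 1 + (phrase_len - 2 - mod_idx)  # 1 for the rightmost anchor
    return 1.0 / (length_penalty * gap_penalty * left_penalty)


def train_pair_scores(nominals):
    """Sum occurrence scores of every (head, modifier) pair over the corpus."""
    scores = defaultdict(float)
    for words in nominals:
        n = len(words)
        for j in range(n - 1):
            for i in range(j + 1, n):
                scores[(words[i], words[j])] += occurrence_score(n, i, j)
    return scores


def disambiguate(phrase, scores, threshold=1.0, margin=2.0, ratio=0.5):
    """Pick at most one head for each modifier position j in the phrase."""
    picks = []
    for j in range(len(phrase) - 1):
        # Rank the candidate heads x_i (i > j) for the anchor word x_j.
        ranked = sorted(
            ((scores.get((phrase[i], phrase[j]), 0.0), phrase[i])
             for i in range(j + 1, len(phrase))),
            reverse=True,
        )
        best_score, head = ranked[0]
        second_score = ranked[1][0] if len(ranked) > 1 else 0.0
        # Above the global threshold and clearly ahead of the runner-up.
        if best_score > threshold and best_score >= margin * second_score:
            picks.append((head, phrase[j], best_score))
    if not picks:
        return []
    # Drop picks that score far below the pairs selected from other sets.
    top = max(score for _, _, score in picks)
    return [(head, mod) for head, mod, score in picks if score >= ratio * top]
```

With a toy corpus in which "patent database" and "database retrieval" each occur three times as unambiguous nominals, `disambiguate(["patent", "database", "retrieval"], scores)` selects `database+patent` and `retrieval+database`.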

The effectiveness of this algorithm can be measured in terms of recall (the proportion of all valid head+modifier pairs extracted from ambiguous nominals) and precision (the proportion of valid pairs among those extracted). The evaluation was done on a small sample of randomly selected phrases, and the algorithm's performance was compared to manually selected correct pairs. The following numbers were recorded: recall 66% to 71%; precision 88% to 91%, depending on the size of the training sample. In terms of the total number of pairs extracted unambiguously from the parsed text (i.e., those obtained by the procedure described in the previous section), the disambiguation step recovers an additional 10% to 15% of pairs, all of which were previously thrown out as unrecoverable.
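
As a small illustration of the two measures defined above, the following sketch computes precision and recall for a set of extracted pairs against a manually selected reference set. The function name and the sample pairs are made up for the example and are not data from the evaluation.

```python
def precision_recall(extracted, valid):
    """Return (precision, recall) of extracted pairs against the valid pairs."""
    extracted, valid = set(extracted), set(valid)
    correct = extracted & valid
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(valid) if valid else 0.0
    return precision, recall


# Illustrative (invented) head+modifier pairs.
valid_pairs = {("database", "patent"), ("retrieval", "database"),
               ("system", "retrieval"), ("scheme", "weighting")}
extracted_pairs = {("database", "patent"), ("retrieval", "database"),
                   ("system", "patent")}

precision, recall = precision_recall(extracted_pairs, valid_pairs)
```

Here two of the three extracted pairs are valid (precision 2/3) and two of the four valid pairs were found (recall 1/2).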


4.0 Queries and relevancy judgments

During a normal patent application examination process, a list of previously granted patents (both U.S. and foreign) related to the current invention is compiled by the patent examiner, and subsequently recorded in section Patents Cited (UREF) of the newly accepted patent document. A patent application thus serves as a query for searching for "prior art" in the database of granted patents (and outside of it). We simulated this situation by selecting patents from the test collection to serve as queries for finding other patents. These documents were subsequently removed from the collection. The citations of other patents found in UREF sections of query-patents provided the relevancy judgments information. On average there were approximately 6 relevant documents per query.
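
The construction of queries and relevancy judgments from citations can be sketched as below. The record layout (a dict with a `uref` list per patent), the function name, and the citation from 4745056 to 4513086 are assumptions for illustration; only direct citations that remain in the collection are treated as relevant.

```python
def build_test_set(patents, query_ids):
    """Split patents into a searchable collection and citation-based judgments.

    patents:   dict mapping patent id -> {"uref": [cited patent ids]}
               (a hypothetical layout, not the SGML format itself)
    query_ids: ids of the patents chosen to act as queries
    """
    query_ids = set(query_ids)
    # Query patents are removed from the collection being searched.
    collection = {pid: doc for pid, doc in patents.items() if pid not in query_ids}
    judgments = {}
    for qid in query_ids:
        # Only patents cited directly in the query's UREF section count.
        judgments[qid] = {cited for cited in patents[qid]["uref"] if cited in collection}
    return collection, judgments


# Hypothetical records; the forward citation from 4745056 is assumed.
patents = {
    "4513086": {"uref": ["4332898", "4332900", "4338400"]},
    "4332898": {"uref": []},
    "4332900": {"uref": []},
    "4338400": {"uref": []},
    "4745056": {"uref": ["4513086"]},
}
collection, judgments = build_test_set(patents, ["4513086"])
```

In this sketch a later patent that cites the query (here 4745056) stays in the collection but is not counted as relevant.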


3.0 Experimental collection


The experimental collection was derived from an approximately 1 Gbyte (13,747 patents) subset of the U.S. PTO patent database covering patents in classes 395 and 437 issued in the years 1972 to 1992 inclusive. All were uniformly preformatted into SGML to allow for easy parsing by the retrieval system. The documents within the collection showed a wide variation in length. The majority of free-text within a patent document is contained within the following four sections (ordered in terms of increasing average length): title (TTL), abstract (ABST), brief summary (BSUM) and detailed description (DETD). The DETD section typically constituted the bulk of the free-text in a single patent. The remaining sections of the patent contain highly

The patent application process requires examiners to search granted patents in deciding whether to approve any particular patent application. The relevancy judgements stored in the UREF field are therefore highly reliable indicators of relevance that would be impossible to judge except by experts familiar with the domain of the patent application and the legal terminology used in writing that patent. Patents should be viewed as a valuable resource, comprising many gigabytes of text, with predetermined relevancy judgements.

Figure 1 shows the relationships between patents as described by the UREF section. The selection of Patent 4513086 as a query would require that patents 4332898, 4332900 and 4338400 be marked as relevant, since they are directly cited. Patent 4745056 could also be added to the collection since it may be reasonable to expect that the relationship between a cited patent and its referencing document is symmetrical, and therefore this patent could be considered relevant. Note that patent numbers reflect a temporal ordering: 4745056 could not have been originally cited by any of the above patents, since it did not exist at the time the others were created. For the purposes of the experiments described here these potential forward links were not considered, which may have resulted in underrating of our system's performance.

The transitive closure of the set of patents referenced in the direction of the arrows in Figure 1 cannot be assumed to be relevant, although there may be cases where this would be true, because the part of the patent that is relevant via one link may not be the part that is relevant in the next link. Only immediate citation in the UREF section of the patent acting as a query is recorded as relevant.

5.0 Experimental design

To provide a basis for the evaluation of the natural language indexing enhancements, a series of initial retrieval runs was undertaken in an attempt to determine the optimal SMART configuration for the experimental collection. Various combinations of section indexed, stemming method and weighting scheme were used.
Following the initial base-line retrieval runs, a further series of retrieval runs were performed using the pattern-matching indexing method. Again, various combinations of stemming methods, section selection and weighting schemes were used in an effort to maximize the retrieval effectiveness of the new indexing methods.
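
As a rough illustration of indexing by pattern matching, the sketch below extracts phrases from part-of-speech tagged text and joins each one into a single index term. The tag names, the `ADJ* NOUN+` pattern, the underscore-joined term format and the function name are all assumptions made for the example; the patterns actually used in the experiments are not reproduced here.

```python
import re

# Map each tag to one character so a tag sequence can be matched with a regex.
TAG_CODES = {"ADJ": "A", "NOUN": "N"}

# Hypothetical phrase pattern: any adjectives followed by one or more nouns.
PHRASE_PATTERN = re.compile(r"A*N+")


def extract_phrases(tagged_tokens):
    """Return multi-word phrases matching the pattern as single index terms."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_tokens)
    phrases = []
    for match in PHRASE_PATTERN.finditer(codes):
        words = [word.lower() for word, _ in tagged_tokens[match.start():match.end()]]
        if len(words) > 1:  # single words are already covered by word indexing
            phrases.append("_".join(words))
    return phrases


sentence = [("The", "DET"), ("optimal", "ADJ"), ("weighting", "NOUN"),
            ("scheme", "NOUN"), ("improves", "VERB"),
            ("retrieval", "NOUN"), ("effectiveness", "NOUN")]
phrases = extract_phrases(sentence)
```

For the sample sentence this yields the terms `optimal_weighting_scheme` and `retrieval_effectiveness`, which would be indexed alongside the single-word terms.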

Retrieval runs were also performed to determine the effect of increasing the complexity of the patterns, and thus approach a finite-state approximation of a simple and very limited parser. In addition, terms bound during the match were rearranged before indexing to attempt to normalize the generated term. For example, the bindings resulting from matching the following pattern (* indicates 0 or more occurrences):

ADJ1 NOUN1 NOUN2 of ADJ2 NOUN3* NOUN4

would be normalized by concatenating the bindings in the following order:

for indexing as a single term. Further experiments were run to examine the performance of the pair extraction technique. Again, attempts were made to optimize the returns by varying the application of the method on different sections of the document.

6.0 Results

Several small- to medium-scale evaluations are reported. The main purpose was to see if indexing and retrieval using phrases obtained through simple pattern matching can lead to overall performance improvement in retrieval, and if these improvements can be sustained. Small-scale experiments involved the following runs: (1) a collection of 120 documents (approx. 10 MBytes) and 20 queries with an average of 6.3 relevant documents per query; and (2) a collection of 1200 documents (approx. 100 MBytes) and 230 queries with an average of 1.3 relevant documents per query. Medium-size evaluations with a collection of about 6000 patents (about 500 MBytes) and nearly 2000 queries (with an average of 1.8 relevant documents per query) posted much lower average precision, but showed similar performance gains.

The retrieval performance was measured using Cornell's SMART system with the ltc.ltc weighting scheme, which was optimal for the patent data. Only three sections within the patent document were used for indexing: Claims (CLMS), Brief Summary (BSUM) and Detailed Description (DETD). Indexing of additional sections produced a less effective representation, which resulted in reduced retrieval performance. Simple suffix stemming (Porter) was applied to terms in both BSUM and DETD sections. No stemming was applied to terms in CLMS. The baseline, single-word term performance is shown in rows marked BASE.

The performance of baseline SMART augmented with phrase indexing is reported in rows marked PM1 and PM2. Again, only sections CLMS, BSUM and DETD were indexed, while stemming was applied to sections BSUM and DETD. PM1 shows the performance when phrase extraction is applied to CLMS and BSUM, whereas PM2 shows the case when the indexing phrases are also extracted from section DETD.

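
For readers unfamiliar with the ltc.ltc weighting scheme named in the results, the following is a minimal sketch under the usual reading of the SMART notation: logarithmic term frequency (l), inverse document frequency (t) and cosine normalization (c), applied to both documents and queries. It is not SMART's own code, and the function names and toy documents are assumptions.

```python
import math
from collections import Counter


def ltc_vector(tokens, doc_freq, num_docs):
    """Weight terms by (1 + ln tf) * ln(N / df), then cosine-normalize."""
    weights = {}
    for term, tf in Counter(tokens).items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # a term unseen in the collection contributes nothing
        weights[term] = (1.0 + math.log(tf)) * math.log(num_docs / df)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return {}
    return {term: w / norm for term, w in weights.items()}


def similarity(query_vec, doc_vec):
    """Inner product of two normalized vectors (their cosine similarity)."""
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())


# Toy collection of already tokenized documents (invented for the example).
docs = [["patent", "claim", "claim"], ["patent", "retrieval"], ["retrieval", "index"]]
doc_freq = Counter(term for doc in docs for term in set(doc))
doc_vecs = [ltc_vector(doc, doc_freq, len(docs)) for doc in docs]
query_vec = ltc_vector(["claim", "patent"], doc_freq, len(docs))
ranking = sorted(range(len(docs)), key=lambda d: similarity(query_vec, doc_vecs[d]),
                 reverse=True)
```

On this toy collection the query ranks the first document highest, because it contains both query terms and the rarer term "claim" carries more weight.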
