correspond to a newspaper clipping and a supplementary memorandum, respectively. As in other existing test collections, search topics also contain title, description, and narrative fields. For each topic, the description field generally includes the terms (words) used in the title field. Intuitively, a combination of the article and supplement fields is as informative as the description field; however, newspaper articles often contain sentences irrelevant to the search topic. Each topic also contains a set of example terms associated with the topic and one or more patent application IDs relevant to the newspaper article, each marked in a dedicated field. All 31 topics were initially written in Japanese and were manually translated into English, Korean, and traditional and simplified Chinese. In the Patent Retrieval Task, for the purpose of cross-system evaluation in a standard framework, each participant group was obliged to submit at least one retrieval result obtained with a combination of the article and supplement fields. In principle, however, any fields can be used to formulate queries, depending on the purpose of evaluation.

Figure 2: An example search topic in the NTCIR-3 Patent Retrieval Collection (topic ID: P004; purpose: technology survey; title: "Device to judge relative merits by comparing codes such as barcodes with each other").
2.4 Relevance Assessment
Relevance assessment was performed as follows:

1. During or after producing the topics, the JIPA members performed manual searches to collect as many relevant patents as possible. We call the set of patents assessed during this manual search "PJ". The members were allowed to use any systems and resources, so that we obtained a patent document set retrieved under the circumstances of their daily patent searching.

2. Participant groups submitted retrieval results, from each of which we extracted the top 30 patent documents and produced a pool of patent documents for each of the 31 topics. We call this pool "PS".

3. The JIPA members assessed the relevance of the patent documents in "PS−PJ", that is, the patents they had not seen during their preliminary manual search.

The grades of relevance were "A (relevant)", "B (partially relevant)", "C (irrelevant, judged by reading the content of a document)", and "D (irrelevant, judged by looking at only the title of a document)". When judging A, B, and C, claims and other fields were considered equally important. The average numbers of A, B, C, and D documents per topic were 45.2, 29.3, 141.4, and 411.4, respectively.

To produce the first six topics, the 12 members were divided into six groups of two, and each group produced one topic and performed the corresponding relevance assessment. If the two members in a group disagreed, they negotiated with each other until the disagreement was resolved. For the remaining 25 topics, however, each of the 12 members independently produced one or more topics and performed relevance assessment, because negotiation (cross-checking) was time-consuming. In all cases, members created queries in the domain of their professional work.

Unlike the relevance assessment methods used in previous NTCIR workshops, where manual searches were performed after pooling to increase the exhaustiveness of the relevant documents, we performed manual searches before pooling so that the search abilities of human experts and participant systems could be compared. The average numbers of A documents in "PJ−PS", "PS−PJ", and "PJ∩PS" were 14.2, 11.0, and 20.0, respectively. In other words, the numbers of relevant documents obtained by participating systems were fairly comparable with those obtained by the JIPA members.
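As a rough sketch of steps 2 and 3 above, the following Python fragment shows how a pool PS could be built from the submitted runs and how the set of documents still to be judged (PS − PJ) could be derived. The function names and data layout are illustrative only, not part of the NTCIR tooling.

```python
def build_pool(runs, depth=30):
    """runs: list of {topic_id: [doc ids ranked by score]} dicts, one per submitted run.
    Returns {topic_id: set of doc ids}: the pool PS built from the top-'depth'
    documents of every run."""
    pool = {}
    for run in runs:
        for topic_id, ranking in run.items():
            pool.setdefault(topic_id, set()).update(ranking[:depth])
    return pool


def to_assess(ps, pj):
    """Documents the assessors still have to judge: PS - PJ per topic,
    where PJ holds the documents already seen during the manual search."""
    return {topic_id: docs - pj.get(topic_id, set()) for topic_id, docs in ps.items()}
```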
3. EXPERIMENTATION
3.1 Overview
The purpose of our experiments was to re-examine existing retrieval models from different perspectives in the context of patent retrieval. For this purpose, we performed experiments on retrieving both patent documents and newspaper articles, and compared the results to find any knowledge specific to patent retrieval. Thus, we used the NTCIR-3 Patent Retrieval Collection (described in Section 2) and the NTCIR-3 CLIR (cross-lingual information retrieval) test collection, which consists of two years' worth of Japanese "Mainichi" newspaper articles, to perform comparative experiments.

It may be argued that such knowledge could be obtained by analyzing the official results of the systems participating in the NTCIR-3 Workshop. However, because fundamental modules, such as morphological analyzers and term weighting methods, differed from system to system, it is difficult to conduct a glass-box evaluation. In view of this problem, we implemented different retrieval models (systems) ourselves and performed comparative experiments independently of the NTCIR-3 Workshop.

We used the ChaSen morphological analyzer (version 2.2.9) combined with the IPA dictionary (version 2.4.4) (http://chasen.aist-nara.ac.jp/) to extract nouns, verbs, adjectives, and out-of-dictionary words (i.e., words identified as "unknown" by ChaSen) as index terms from target documents, and performed word-based indexing. Out-of-dictionary words are often technical terms. We used the same method to extract terms from search topics. We used GETA (Generic Engine for Transportable Association, http://geta.ex.nii.ac.jp/), which includes C/Perl libraries for typical retrieval modules, and implemented the different retrieval models in a distributed environment consisting of two PCs (CPU: dual Xeon 1.7 GHz; RAM: 2 GB). Because all the software toolkits and test collections we used are publicly available, our experiments can easily be reproduced. We used the following contents as target documents independently:

• the entire contents (full texts) of unexamined patent applications (Full),
• the author abstracts in unexamined patent applications (Abs),
• the claims in unexamined patent applications (Claim),
• a combination of Abs and Claim (Abs+Claim),
• the JAPIO Patent Abstracts in 1998 and 1999 (Jsh), and
• two years of Mainichi newspaper articles in 1998 and 1999 (Newspaper).
Here, the unexamined patent applications are the target documents in the NTCIR-3 Patent Retrieval Collection (i.e., two years of Japanese patent applications published in 1998 and 1999). Table 2 shows statistics of the document length, in words, for the different target document types. The table shows that the average length of Full is approximately 24 times that of Newspaper, and the standard variance of Full is approximately 20 times that of Newspaper. In other words, the length of a patent application in words varies considerably from document to document. Additionally, the maximum number of unique words (word types) contained in a single application is approximately 30,000, which is 20 times as large as that of newspaper articles. Figure 3 shows the distribution of the document length in words for the different target document types. In this figure, the distributions of the document length for Abs and Jsh are roughly normal.

Table 2: Statistics of document lengths in words.

                 Full     Abs   Claim   Abs+Claim  Jsh   Newspaper
  Mean           3886     101   319     416        165   165
  Std. variance  3198     32    386     392        35    160
  Max            251,695  283   58,166  58,187     930   4004
  Max (unique)   29,189   133   3752    3775       229   1397
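To make the indexing step described above concrete, the following sketch outlines word-based indexing with a morphological analyzer. The analyze() stub stands in for ChaSen with the IPA dictionary, and the part-of-speech labels are simplified placeholders; this is an illustration, not the GETA-based implementation used in the experiments.

```python
from collections import defaultdict

# Placeholder for ChaSen + the IPA dictionary: a real analyzer should return
# (surface form, part of speech) pairs for a Japanese text.
def analyze(text):
    raise NotImplementedError("plug in a morphological analyzer here")

# Keep nouns, verbs, adjectives, and out-of-dictionary ("unknown") words as index terms.
CONTENT_POS = {"noun", "verb", "adjective", "unknown"}

def extract_terms(text):
    return [surface for surface, pos in analyze(text) if pos in CONTENT_POS]

def build_inverted_index(documents):
    """documents: {doc_id: text}. Returns {term: {doc_id: within-document frequency}}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in documents.items():
        for term in extract_terms(text):
            index[term][doc_id] += 1
    return index
```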
Figure 4 shows the retrieval models compared in our experiments; they form a sample of typical models [9]. In this figure, "log(tf).idf.dl" is equivalent to a metric used in the SMART system [7]. Unlike the other seven models, SMART and BM25 [5] use a document length factor, which has been shown to be effective in the IR literature [8]. (Although the log((N − nt + 0.5) / (nt + 0.5)) component of BM25 can take an anomalous negative value when nt is very large [6], we kept the original formula in our experiments.) It should be noted that because the nine retrieval models were implemented on top of common software modules, differences in retrieval accuracy among the models are attributable to the models themselves. To formulate queries for patent retrieval, we used the following (combinations of) topic fields independently: the description field (D), the description and narrative fields (DN), and the article and supplement fields (AS).
3.2 Results
Tables 3 and 4 show the mean average precision (MAP) values targeting patent documents and newspaper articles, respectively. In both tables, only A documents were regarded as relevant (i.e., rigid relevance). Figure 5 shows essentially the same information as Tables 3 and 4, with the x-axis corresponding to the retrieval models and the y-axis to the average precision values.
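For reference, mean average precision with rigid relevance (only A documents counted as relevant) can be computed as in the following sketch; the data layout is our own illustration, not the format of the official evaluation tools.

```python
def average_precision(ranking, relevant):
    """ranking: doc ids ordered by decreasing score; relevant: set of A-judged doc ids."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(rankings, qrels):
    """rankings: {topic_id: ranking}; qrels: {topic_id: set of relevant doc ids}."""
    scores = [average_precision(ranking, qrels[topic_id])
              for topic_id, ranking in rankings.items()]
    return sum(scores) / len(scores)
```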
Figure 3: Distribution of the document length in words.
Figure 4: Retrieval models (RSV_{q,d} = Σ_t w_t) compared in our experiments. The term weight w_t for each model is defined as follows:

  hits:            w_t = b_{q,t} × b_{d,t}
  baseline:        w_t = f_{q,t} × b_{d,t}
  tf:              w_t = f_{q,t} × f_{d,t} / dlf_d
  idf:             w_t = f_{q,t} × idf_t
  tf.idf:          w_t = f_{q,t} × idf_t × f_{d,t} / dlf_d
  log(tf):         w_t = (1 + log f_{q,t}) × (1 + log f_{d,t}) / (1 + log avef_d)
  log(tf).idf:     w_t = (1 + log f_{q,t}) × idf_t × (1 + log f_{d,t}) / (1 + log avef_d)
  log(tf).idf.dl:  w_t = (1 + log f_{q,t}) × idf_t × (1 + log f_{d,t}) / (1 + log avef_d) × 1 / (avedlb + S × (dlb_d − avedlb))
  BM25:            w_t = f_{q,t} × log((N − n_t + 0.5) / (n_t + 0.5)) × ((K + 1) × f_{d,t}) / (K × {(1 − b) + b × dlf_d / avedlf} + f_{d,t})

where
  q, d, t   query, document, and term, respectively
  N         number of documents in the collection
  n_t       number of documents in which term t exists
  b_{x,t}   existence (1) or absence (0) of term t in x
  f_{x,t}   frequency of term t in x
  idf_t     inverse document frequency of term t (i.e., 1 + log(N / n_t))
  dlb_x     number of unique terms in x
  dlf_x     sum of term frequencies in x (i.e., Σ_{t∈x} f_{x,t})
  avef_x    average term frequency in x (i.e., dlf_x / dlb_x)
  avedlb    average of dlb_x over the collection
  avedlf    average of dlf_x over the collection

Normalized scores for f_{d,t} provided better results than unnormalized scores. The values of the constants (S = 0.2, K = 2.0, b = 0.8) were determined through preliminary experiments.
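The following sketch implements two of the weighting schemes in Figure 4 (log(tf).idf.dl and BM25), assuming that the statistics defined above (N, n_t, dlf_d, dlb_d, avef_d, avedlb, avedlf) have already been computed from the index; it illustrates the formulas and is not the GETA-based implementation used in the experiments.

```python
import math

# Constants reported in Figure 4, tuned in the preliminary experiments.
S, K, B = 0.2, 2.0, 0.8

def idf(N, n_t):
    """Inverse document frequency as defined in Figure 4: 1 + log(N / n_t)."""
    return 1.0 + math.log(N / n_t)

def w_logtf_idf_dl(f_qt, f_dt, N, n_t, avef_d, dlb_d, avedlb):
    """log(tf).idf.dl (SMART-style) term weight with pivoted length normalization."""
    return ((1.0 + math.log(f_qt)) * idf(N, n_t)
            * (1.0 + math.log(f_dt)) / (1.0 + math.log(avef_d))
            / (avedlb + S * (dlb_d - avedlb)))

def w_bm25(f_qt, f_dt, N, n_t, dlf_d, avedlf):
    """BM25 term weight; the log component may go negative when n_t is very large [6]."""
    rsj = math.log((N - n_t + 0.5) / (n_t + 0.5))
    return f_qt * rsj * ((K + 1.0) * f_dt) / (K * ((1.0 - B) + B * dlf_d / avedlf) + f_dt)

def rsv(query_terms, doc_terms, weight):
    """RSV_{q,d} = sum of term weights over terms shared by query and document;
    'weight' is a callable supplying w_t for a given term."""
    return sum(weight(t) for t in set(query_terms) & set(doc_terms))
```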
Table 3: Mean average precision values over the 31 topics targeting patent information (rigid relevance).

                    Full                     Abs                      Claim
Retrieval model     D      DN     AS         D      DN     AS         D      DN     AS
hits                .1050  .0534  .0166      .0727  .0429  .0120      .0516  .0128  .0025
baseline            .0931  .0725  .0292      .0732  .0813  .0304      .0566  .0572  .0168
tf                  .0156  .0227  .0046      .0132  .0158  .0036      .0136  .0172  .0047
idf                 .1515  .1577  .0744      .1197  .1272  .0755      .0941  .0935  .0367
tf.idf              .0277  .0390  .0231      .0239  .0284  .0197      .0279  .0337  .0155
log(tf)             .1642  .1255  .0337      .0579  .0723  .0237      .0669  .0527  .0099
log(tf).idf         .2230  .2132  .1082      .0884  .1151  .0781      .0978  .1029  .0380
log(tf).idf.dl      .2272  .2660  .1790      .0887  .1169  .0844      .1028  .1182  .0752
BM25                .2280  .2503  .0875      .0838  .0997  .0707      .1039  .1129  .0557

                    Abs+Claim                Jsh
Retrieval model     D      DN     AS         D      DN     AS
hits                .0840  .0242  .0045      .1171  .0547  .0373
baseline            .0854  .0741  .0274      .1066  .1138  .0538
tf                  .0151  .0183  .0042      .0113  .0166  .0025
idf                 .1278  .1265  .0665      .1730  .1682  .1271
tf.idf              .0298  .0353  .0227      .0222  .0258  .0166
log(tf)             .0917  .0787  .0186      .0821  .0899  .0457
log(tf).idf         .1237  .1306  .0725      .1226  .1465  .1223
log(tf).idf.dl      .1215  .1419  .1062      .1184  .1501  .1271
BM25                .1302  .1426  .0786      .1356  .1474  .1015

D: description, DN: description + narrative, AS: article + supplement.
Figure 5: Mean average precision values for different retrieval models.
Table 4: Mean average precision values over the 50 topics targeting Mainichi newspaper articles (rigid relevance).

Retrieval model     D      DN
hits                .1397  .1063
baseline            .1436  .1865
tf                  .0755  .1054
idf                 .1914  .2443
tf.idf              .1041  .1279
log(tf)             .2266  .2124
log(tf).idf         .2940  .2853
log(tf).idf.dl      .2746  .3212
BM25                .2759  .3346

D: description, DN: description + narrative.
3.3 Discussion
We now discuss observations derived from Tables 3 and 4 and Figure 5 from several perspectives.
Term frequencies. In almost all runs, the MAP value of the model relying solely on term frequencies (i.e., the tf model) was the smallest of all the models compared; the tf model was even worse than the hits and baseline models. Additionally, the MAP value of the idf model often decreased when it was combined with the tf model (i.e., the tf.idf model); this problem was also pointed out at the first ACM SIGIR 2000 Workshop on Patent Retrieval (http://research.nii.ac.jp/ntcir/sigir2000ws/). We used the paired t-test for statistical testing, which investigates whether a difference in performance is meaningful or simply due to chance [1, 4]. Table 5 shows that, for patent retrieval, the MAP values of the idf and tf.idf models were significantly different in all 15 runs at the 5% level, and in 12 of those runs at the 1% level. Likewise, when retrieving newspaper articles, the MAP values of the idf and tf.idf models were significantly different in both runs at the 1% level. These results suggest that a simple (naive) use of term frequencies was not effective and even decreased the MAP value. However, the logarithmic formulation of term frequencies (i.e., the log(tf) models) was effective when combined with the idf model: the MAP value of the log(tf).idf model was greater than that of the idf model, especially when the target documents were full texts. In Table 5, the MAP values of the log(tf).idf and idf models were significantly different (at the 1% level) in three patent runs and in the two newspaper runs. At the same time, when retrieving abstracts, we found no significant differences in MAP between the log(tf).idf and idf models. In other words, the log(tf) model was effective in retrieving longer documents.
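As an illustration of the statistical testing used here, a paired t-test over per-topic average precision values can be run as follows; this generic SciPy-based sketch is not the exact procedure or tooling used in the paper.

```python
from scipy import stats

def compare_models(ap_x, ap_y, alpha=0.05):
    """ap_x, ap_y: per-topic average precision values for two models, aligned by topic.
    Returns the t statistic, the p-value, and whether the difference is
    significant at the given level."""
    t_stat, p_value = stats.ttest_rel(ap_x, ap_y)
    return t_stat, p_value, p_value < alpha
```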
Inverse document frequencies. Inverse document frequencies were generally effective. For Jsh and Abs, the MAP value of the idf model was greater than those of the SMART and BM25 models. For Jsh, the MAP value of the idf model was greater than those of all the other models, irrespective of the topic fields used, and the difference between the idf model and the second-best model was statistically significant at the 1% level (not shown in the tables). Additionally, for Abs, the MAP value of the idf model was greater than those of the other models except in the AS runs, and the difference between the idf model and the second-best model was significant at the 5% level (not shown in the tables). One possible explanation is that because abstracts (Jsh and Abs) are standardized (normalized) in terms of document length, the effect of the document length factor in SMART and BM25 was diminished.
Comparison between SMART and BM25. The MAP values of SMART and BM25 were generally greater than those of the other models, and the difference between SMART and BM25 in MAP was marginal: in Table 5, their MAP values were significantly different at the 5% level in only five runs. However, for cross-genre retrieval (i.e., the runs in which queries were formulated from a combination of the article and supplement fields), SMART outperformed BM25 (compare the AS columns in Table 3).
Cross-genre retrieval. By comparing the MAP values obtained with the description and narrative fields (the D and DN runs) and those obtained with combinations of the article and supplement fields (the AS runs) in Table 3, one can see that the MAP values for cross-genre retrieval were generally lower, irrespective of the retrieval model and target document type. In other words, queries formulated from newspaper articles were less effective than queries written specifically for the search topic.
Comparison between full texts and abstracts. The MAP values obtained when targeting full patent texts were generally greater than those obtained when targeting abstracts; for SMART and BM25 in particular, the differences between full texts and abstracts were statistically significant at the 1% level (not shown in the tables). Comparing Jsh and Abs in Table 3, the MAP value of Jsh was greater than that of Abs for every model except tf, and Table 6 shows that the differences were statistically significant at the 1% level in most cases. This suggests that, for retrieval purposes, professional abstracts are generally of higher quality than author abstracts.
Comparison between patents and newspapers. Table 5 shows that the relative superiority of the different retrieval models did not differ substantially across document genres (i.e., patents and newspaper articles). This may seem counter-intuitive, because patent documents differ from newspaper articles in a number of respects, such as document length and terminology; however, these differences did not affect the results of our experiments. In other words, existing standard models were relatively effective even for patent retrieval. One possible explanation is that our experiments simulated a technology survey, which is relatively close to the conventional retrieval scenario, compared with an invalidity search. To further explore this issue, in the NTCIR-4 Patent Retrieval Task we plan to run an invalidity search task, in which each participant group searches five years' worth of patent applications for those that could invalidate the demand in an existing claim.
4. CONCLUSION
Given the growing number of large test collections for information retrieval (IR) since the 1990s, extensive comparative experiments have been performed to explore the effectiveness of various retrieval models. However, most of these collections consist of newspaper articles and abstracts of technical publications. While a number of commercial patent retrieval systems and services have been in operation for a long time, patent retrieval has not received much attention in the IR community.
Table 5: t-test results of the differences between retrieval models (idf vs. tf.idf, idf vs. log(tf).idf, and log(tf).idf.dl vs. BM25) for each run: the patent runs (Full, Abs, Claim, Abs+Claim, and Jsh, each with D, DN, and AS queries) and the newspaper runs (D and DN). "≪"/"≫": significant at the 0.01 level; "<"/">": significant at the 0.05 level; blank: not significantly different.

Table 6: t-test results of the difference between Jsh and Abs for each retrieval model and query type (D, DN, AS). "≪"/"≫": significant at the 0.01 level; "<"/">": significant at the 0.05 level; blank: not significantly different.
One of the major reasons is the lack of test collections targeting patent information. This background motivated us to promote research and development in patent information retrieval by providing a test collection consisting of patent documents.

In this paper, we described the NTCIR-3 Patent Retrieval Collection and comparative experiments performed using this collection. First, we described the process of producing the NTCIR-3 Patent Retrieval Collection, which includes two years of Japanese patent applications. The collection also includes five years of Japanese patent abstracts and their English translations. Since the collection was produced to evaluate IR systems for technology surveys, the search topics were produced on the basis of technical newspaper articles. Human experts in patent searching produced 31 topics and also performed the relevance assessment. Second, we reported experimental results obtained using the collection. For these experiments, we used open software toolkits to implement nine existing retrieval models and re-examined the effectiveness of those models in the context of patent retrieval. To investigate the knowledge inherent in patent retrieval, we also used the NTCIR-3 CLIR test collection, which consists of two years of newspaper articles, and compared the results obtained with the different genres of documents. Through our experiments, we re-validated past experimental results (e.g., those concerning the effectiveness of term frequencies, inverse document frequencies, and document length) in the context of patent retrieval. We also found that existing state-of-the-art retrieval models (i.e., SMART and BM25) were effective for patent retrieval.

Future work will include investigating the effectiveness of various indexing methods, such as character- and phrase-based methods, for patent retrieval. To further explore patent retrieval from a scientific point of view, in the NTCIR-4 Patent Retrieval Task we plan to run an invalidity search task, in which users search five years' worth of patent applications for those that can invalidate the demand in an existing claim.
5. REFERENCES
[1] D. Hull. Using statistical testing in the evaluation of retrieval experiments. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 329–338, 1993.
[2] H. Itoh, H. Mano, and Y. Ogawa. Term distillation for cross-DB retrieval. In Proceedings of the Third NTCIR Workshop on Research in Information Retrieval, Automatic Text Summarization and Question Answering, 2003.
[3] M. Iwayama, A. Fujii, N. Kando, and A. Takano. Overview of patent retrieval task at NTCIR-3. In Proceedings of the Third NTCIR Workshop on Research in Information Retrieval, Automatic Text Summarization and Question Answering, 2003.
[4] E. M. Keen. Presenting results of experimental retrieval comparisons. Information Processing & Management, 28(4):491–502, 1992.
[5] S. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 232–241, 1994.
[6] S. Robertson and S. Walker. On relevance weights with little relevance information. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 16–24, 1997.
[7] G. Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, 1971.
[8] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21–29, 1996.
[9] J. Zobel and A. Moffat. Exploring the similarity space. ACM SIGIR Forum, 32(1):18–34, 1998.