Experiments with Geographic Evidence Extracted from Documents

Nuno Cardoso, Patrícia Sousa and Mário J. Silva

Faculty of Sciences, University of Lisbon, LASIGE
{ncardoso,csousa,mjs}@xldb.di.fc.ul.pt
Abstract. For the 2008 GeoCLEF participation, we focused on improving the extraction of geographic signatures from documents and on optimising their use for GIR. The results show that detecting explicit geographic named entities and including their terms in a tuned, weighted index field significantly improves retrieval performance compared to classic text retrieval.
1 Introduction
This paper presents the participation of the XLDB Team from the University of Lisbon at the 2008 GeoCLEF task. Following a thorough analysis of the results achieved in the 2007 participation [1], we identified the following improvement points:

Experiment with new ways to handle thematic and geographic criteria. Our previous methodology was moulded on the assumption that the thematic and geographic facets of documents and queries were complementary and non-redundant [2]. Previous GIR prototypes handled the thematic and geographic subspaces in separate pipelines. As the evaluation results did not show significant improvements over classic IR, an alternative GIR methodology should be tested.

Capture more geographic evidence from documents. The text mining module, based on shallow pattern matching of placenames, often failed to extract geographic evidence that is essential for geo-referencing many relevant documents [1]. We therefore considered reformulating our text annotation tools, making them capable of capturing more geographic evidence from the documents. As people describe sought places in many ways other than providing explicit placenames (e.g., “Big Apple”, “Kremlin” or “EU Headquarters”), these named entities (NE) can be captured and grounded to their locations, playing an important role in defining the geographic area of interest of documents.

Smooth query expansion. Query expansion (QE) is known to improve IR performance for most queries, but often at the cost of degrading the performance of other queries. We did not assign weights to query terms, so the expanded terms had the same weight as the initial query terms. This means that we did not control the impact of QE on some topics, which causes query drifting [3]. This year, we wanted to use QE with automatic re-weighting of text and geographic terms, to soften the undesired effect of query drifting.
To address these topics, we made the following improvements to our GIR prototype:

Query Processing: We now handle placenames both as geographic terms and as plain query terms. In fact, placenames proved to be good retrieval terms, and were frequently ranked at the top in the query expansion step [1]. While placenames may appear in other unrelated contexts, such as proper names, they seem to help recall when used as plain terms. In addition, their geographic content can be used afterwards to refine the ranking scores, promoting documents whose placenames are referred to in a geographic context.

Text mining: We developed REMBRANDT, a new named entity recognition module that identifies and classifies all kinds of named entities in the CLEF collection [4]. Used as a text annotation tool, REMBRANDT generates more comprehensive geographic document signatures (Dsig). We first introduced Dsig in last year's participation [1] as a means to capture the geographic scope of documents: lists of geographic concepts, corresponding to the grounded names in the documents, used for computing the geographic similarity of documents and queries. The Dsig comprises two kinds of geographic evidence: i) explicit geographic evidence, consisting of grounded placenames that designate locations, and ii) implicit geographic evidence, consisting of other grounded entities that do not explicitly designate geographic locations but are strongly related to one (e.g., monuments, summits or buildings).

Document Processing: To cope with the new query processing approach, we needed a simple ranking model that elegantly combined the text and geographic similarity models, eliminating the need for merging text and geographic ranking scores, while still allowing us to assign a weight to each term. We extended MG4J [5] to suit our requirements for this year's experiments, and we chose the BM25 [6] weighting scheme to compute a single ranking score for documents using three index fields: a text field for standard term indexes, an explicit local field for geographic terms labelled as explicit geographic evidence, and an implicit local field for geographic terms associated with implicit geographic evidence.
2 System Description
Figure 1 presents the architecture of the assembled GIR prototype. In a nutshell, the CLEF topics are pre-processed by the QE module, QuerCol, which generates term-weighted query strings in MG4J syntax. The CLEF collection is annotated by REMBRANDT, which generates the geographic signatures of the documents (Dsig). The text and Dsig of the documents are indexed by MG4J, which uses an optimised BM25 weighting scheme. For retrieval, MG4J receives query strings and generates results in the trec_eval format. A geographic ontology assists QuerCol in its geographic term expansion. REMBRANDT and MG4J use other geographic knowledge resources, as described further in this section.
2.1 REMBRANDT
REMBRANDT is a language-dependent named-entity recognition (NER) system that uses Wikipedia as a knowledge resource, and explores its document structure to classify all kinds of named entities (NE) in the text.
[Fig. 1 diagram: GeoCLEF topics are expanded by QuerCol (assisted by the Geographic Ontology) into query strings; GeoCLEF documents are annotated by the REMBRANDT text annotation tool; MG4J indexes and ranks the annotated documents and produces the GeoCLEF runs.]
Fig. 1. Architecture of the GIR prototype used in GeoCLEF 2008.

Through Wikipedia, REMBRANDT obtains additional knowledge about every NE that is also a Wikipedia entry, which can be useful for understanding the context, detecting relationships with other NEs, and contextualising and classifying surrounding NEs in the text. One example of the use of this additional knowledge is deriving implicit geographic evidence for each NE from the categories of its Wikipedia page. REMBRANDT handles category strings as text sentences and searches for placenames in them just as it does on normal text, generating a list of captured placenames that are considered implicit geographic evidence for the given NE.

REMBRANDT currently classifies NEs using the categorization defined by HAREM, a NER evaluation contest for Portuguese [7,8]. The main categories of HAREM are: PERSON, ORGANIZATION, PLACE, DATETIME, VALUE, ABSTRACTION, EVENT, THING and MASTERPIECE. REMBRANDT can handle vagueness and ambiguity by tagging a NE with more than one category or sub-category.

REMBRANDT's strategy relies on mapping each NE to a Wikipedia page and subsequently analysing its structure, links and categories, searching for suggestive evidence. REMBRANDT also uses manually crafted rules for capturing internal and external NE evidence, for classifying NEs that were not mapped to a Wikipedia page or were mapped to a page with insufficient information, and for contextualising NEs that have a different meaning (for example, in “I live in Portugal Street”, “Portugal” designates a street, not a country).

The classification is best illustrated by following how an example NE, “Empire State Building”, is handled: the English Wikipedia page of the Empire State Building (en.wikipedia.org/wiki/Empire_State_Building) is labelled with 10 categories, such as “Skyscrapers in New York City” and “Office buildings in the United States”. With this information, REMBRANDT classifies the NE as a PLACE/HUMAN/CONSTRUCTION. In the hypothetical case that this NE could not be mapped to a Wikipedia page, the presence of the term “Building” at the end (internal evidence) gives a hint for PLACE/HUMAN/CONSTRUCTION. External evidence rules check the context of the NE, ensuring that it is not referred to in a different sense (for example, as a hypothetical movie, street or restaurant name). Finally, the categories “Skyscrapers in New York City” and “Office buildings in the United States” are handled by REMBRANDT as additional text, and the placenames “New York City” and “United States” are treated as implicit geographic evidence associated with the NE.
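The category-scanning step can be illustrated with a small sketch. The gazetteer, function and example values below are hypothetical stand-ins for REMBRANDT's placename recognition, not its actual code:

```python
# Hypothetical sketch: derive implicit geographic evidence for a NE by
# scanning its Wikipedia category strings for known placenames.

KNOWN_PLACES = ["New York City", "United States", "South Africa"]  # toy gazetteer

def implicit_geo_evidence(categories):
    """Treat each category string as a sentence and collect the placenames
    it contains; these become implicit geographic evidence for the NE."""
    evidence = []
    for category in categories:
        for place in KNOWN_PLACES:
            if place in category and place not in evidence:
                evidence.append(place)
    return evidence

categories = ["Skyscrapers in New York City", "Office buildings in the United States"]
print(implicit_geo_evidence(categories))
# ['New York City', 'United States']
```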
Table 1. Classification of HAREM categories and sub-categories as having explicit, implicit or no geographic evidence for the generation of Dsig.

Explicit geographic evidence
  PLACE/PHYSICAL: {ISLAND, WATERCOURSE, WATERMASS, MOUNTAIN, REGION, PLANET}
  PLACE/HUMAN: {REGION, DIVISION, STREET, COUNTRY}

Implicit geographic evidence
  PLACE/HUMAN: {CONSTRUCTION}
  ORGANIZATION: {ADMINISTRATION, INSTITUTION, COMPANY}
  EVENT: {PASTEVENT, ORGANIZED, HAPPENING}

No geographic evidence
  PERSON: {POSITION, INDIVIDUAL, PEOPLE, INDIV.GROUP, POSIT.GROUP, MEMBER, MEMBERGROUP}
  THING: {CLASS, CLASSMEMBER, OBJECT, SUBSTANCE}
  PLACE/VIRTUAL: {MEDIA, ARTICLE, SITE}
  VALUE: {CURRENCY, CLASSIFICATION, QUANTITY}
  ABSTRACTION: {DISCIPLINE, STATE, IDEA, NAME}
  MASTERPIECE: {WORKOFART, REPRODUCED, PLAN}
  TIME: {GENERIC, DURATION, FREQUENCY, HOUR, INTERVAL, DATE}
From REMBRANDT annotations to geographic document signatures

Each document annotated by REMBRANDT contains a list of NEs that might convey explicit or implicit geographic evidence. We can now generate rich geographic document signatures (Dsig) by adding the NEs that carry explicit geographic evidence, together with the placenames associated as implicit geographic evidence with other NEs. We divide the 47 NE sub-categories into three levels of eligibility, as depicted in Table 1:

1. Sub-categories having explicit geographic evidence: all sub-categories under the main category PLACE, with the exception of PLACE/HUMAN/CONSTRUCTION and PLACE/VIRTUAL/*. The category PLACE mostly spans the administrative and physical domains, but the PLACE/VIRTUAL/* sub-categories span virtual places, such as web sites or TV programmes, and are therefore not eligible. In HAREM, the sub-category PLACE/HUMAN/CONSTRUCTION is included in the PLACE main category precisely because of its strong geographic connotation. However, since it is not an explicit geographic entity, it is included in the next level.

2. Sub-categories having implicit geographic evidence: the main categories ORGANIZATION and EVENT, and the sub-category PLACE/HUMAN/CONSTRUCTION. The category ORGANIZATION spans institutions and corporations, such as city halls, schools or companies, which are normally related to a well-defined geographic area of interest. The category EVENT spans organised events that normally take place in a well-defined place, such as tournaments, concerts or conferences.

3. Sub-categories with no geographic evidence: the remaining categories.

This eligibility mapping of NE classifications into Dsig signatures is a far from consensual simplification. It is questionable whether categories such as PERSON can also convey significant geographic evidence for document signatures. For instance, the NE “Nelson Mandela” is associated by REMBRANDT with “South Africa” as implicit geographic evidence, because the Wikipedia page of Nelson Mandela (en.wikipedia.org/wiki/Nelson_Mandela) contains the category “Presidents of South Africa”. Yet, since not all documents mentioning “Nelson Mandela” have the South African territory as their geographic scope, adding this geographic evidence may produce noisy document scopes. On the other hand, we assume that all captured geographic evidence is relevant for the Dsig, but this is not always true either. For example, the NE “Empire State Building” conveys an implicit location when it is mentioned in the context of office headquarters, but is not relevant as a geographic scope when the text addresses its architectural style.
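As an illustration of this eligibility mapping, the following sketch (hypothetical data structures and category prefixes, not REMBRANDT's code) shows how Table 1 could drive the assembly of a document signature from annotated NEs:

```python
# Hypothetical sketch: build a geographic document signature (Dsig) from
# annotated NEs, following the eligibility levels of Table 1.

EXPLICIT = ("PLACE/PHYSICAL", "PLACE/HUMAN/REGION", "PLACE/HUMAN/DIVISION",
            "PLACE/HUMAN/STREET", "PLACE/HUMAN/COUNTRY")
IMPLICIT = ("PLACE/HUMAN/CONSTRUCTION", "ORGANIZATION", "EVENT")

def build_dsig(annotated_nes):
    """annotated_nes: list of dicts with 'name', 'category' and, optionally,
    the placenames attached to the NE as implicit evidence."""
    explicit, implicit = [], []
    for ne in annotated_nes:
        category = ne["category"]
        if category.startswith(EXPLICIT):
            explicit.append(ne["name"])                       # grounded placename itself
        elif category.startswith(IMPLICIT):
            implicit.extend(ne.get("implicit_places", []))    # e.g. from Wikipedia categories
        # remaining categories contribute no geographic evidence
    return {"explicit": explicit, "implicit": implicit}

nes = [
    {"name": "USA", "category": "PLACE/HUMAN/COUNTRY"},
    {"name": "Empire State Building", "category": "PLACE/HUMAN/CONSTRUCTION",
     "implicit_places": ["New York City", "United States"]},
]
print(build_dsig(nes))
# {'explicit': ['USA'], 'implicit': ['New York City', 'United States']}
```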
[Fig. 2 diagram: for the example query “Tall buildings in the USA”, QuerCol extracts the terms tall, buildings, usa and the geographic term USA; BRF expansion adds terms such as biggest, tallest, skyscrapers, burj, towers, empire, while ontology expansion of USA adds united states, america, california, seattle, washington, los angeles, chicago, san francisco, (...). The resulting query string is: text:tall{1.0} | text:buildings{1.0} | text:usa{1.0} | text:skyscraper{0.9} | text:burj{0.8} | text:towers{0.7} | text:empire{0.6} | text:america{0.5} | local:usa{1.0} | local:california{0.5} | local:(los angeles){0.333} ...]
Fig. 2. QuerCol’s query reformulation strategy.
2.2 QuerCol
QuerCol’s query reformulation has two procedures, illustrated in Figure 2 for the example query “Tall buildings in the USA”. First, it uses blind relevance feedback (BRF) to expand the non-stopwords tall, buildings and usa, and weights the expanded terms with the wt(pt-qt) algorithm [9]. Second, it performs geographic query expansion for geographic terms, by exploring their relationships as described in a geographic ontology [10]. As such, QuerCol recognises the geographic term “USA” with the help of REMBRANDT, and grounds it to the geographic concept ‘United States of America (country)’, triggering the ontology-driven geographic query expansion that searches for other geographic concepts known to be contained within the USA territory. The expanded geographic terms are then re-weighted according to the graph distance in the ontology between the node of the original concept and the node of each expanded (descendant) concept, using the formula 1/(distance + 1). For the given example, USA generates 50 states with a weight of 1/2 and several cities with a weight of 1/3 (considering that the node distance in the ontology between states and countries is 1, and between cities and countries is 2). The final geographic terms represent the query geographic signature (Qsig). In the end, all text and geographic terms are labelled with their target index field and assembled into a final query string that connects them with OR operators (|).
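The weighting and query assembly can be sketched as follows; the ontology lookup is faked with a static table, and the helper names are hypothetical rather than QuerCol's actual implementation:

```python
# Hypothetical sketch of QuerCol's geographic expansion and query assembly.
# The ontology is replaced by a static dict of (concept, node distance) pairs.

CONTAINED_IN_USA = {"california": 1, "washington": 1, "los angeles": 2, "chicago": 2}

def geo_weights(seed_term):
    """Weight each expanded concept by 1 / (distance + 1) from the seed node."""
    weights = {seed_term: 1.0}
    for concept, distance in CONTAINED_IN_USA.items():
        weights[concept] = 1.0 / (distance + 1)
    return weights

def build_query(text_terms, geo_terms):
    """Assemble an OR-connected query string with per-field, per-term weights."""
    parts = [f"text:{t}{{{w:.3g}}}" for t, w in text_terms.items()]
    parts += [f"local:({t}){{{w:.3g}}}" if " " in t else f"local:{t}{{{w:.3g}}}"
              for t, w in geo_terms.items()]
    return " | ".join(parts)

text_terms = {"tall": 1.0, "buildings": 1.0, "usa": 1.0, "skyscraper": 0.9}
print(build_query(text_terms, geo_weights("usa")))
# text:tall{1} | ... | local:usa{1} | local:california{0.5} | local:(los angeles){0.333} | ...
```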
2.3 Indexing and ranking in MG4J
MG4J is responsible for the indexing and retrieval of documents. MG4J indexes the text of the CLEF documents into a text index field, while the Dsig of the documents is divided into two geographic indexes, the explicit local and implicit local index fields, according to the type of geographic evidence. Figure 3 presents an example of REMBRANDT's annotation and the subsequent MG4J indexing steps. We define term similarity as the similarity between query terms and document terms, computed with BM25 on the text index field only. Geographic similarity is the similarity between the geographic signatures of queries and documents (Qsig and Dsig), computed with BM25 on the explicit local and implicit local index fields. MG4J enables the dynamic selection of the indexes to be used in the retrieval, and allows changing the weight of a field before retrieval. Unfortunately, the BM25 implementation included in MG4J does not support term weights, so all term weights were set to the default value of 1 for all the generated runs.
[Fig. 3 diagram: the source sentence “I visited the Empire State Building, on my trip to the USA.” is tagged by REMBRANDT and indexed by MG4J into three fields: text (i, building, empire, my, on, state, the(2), to, trip, usa, visited), expl.local (usa) and impl.local (new york, derived from the Empire State Building NE).]
Fig. 3. REMBRANDT’s text annotation and MG4J’s indexing steps.
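To illustrate how a single score combines the three fields, here is a minimal, simplified sketch of field-weighted BM25 scoring; the document structures and field weights are assumed for illustration, and MG4J's actual BM25 implementation differs:

```python
import math

# Minimal sketch of BM25 combined over weighted index fields (text,
# explicit local, implicit local). Documents and field weights are toy
# values; MG4J's BM25 implementation is more involved.

def bm25_field(query_terms, doc_terms, field_collection, k1=1.2, b=0.75):
    """BM25 score of one document field against the query terms."""
    n_docs = len(field_collection)
    avg_len = sum(len(d) for d in field_collection) / n_docs
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in field_collection if t in d)
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(t)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_len))
    return score

def combined_score(query, doc, collection, field_weights):
    """Sum per-field BM25 scores, each multiplied by its index field weight."""
    return sum(weight * bm25_field(query.get(field, []), doc.get(field, []),
                                   [d.get(field, []) for d in collection])
               for field, weight in field_weights.items())

docs = [
    {"text": ["visited", "empire", "state", "building", "trip", "usa"],
     "expl_local": ["usa"], "impl_local": ["new york"]},
    {"text": ["tall", "buildings", "in", "europe"],
     "expl_local": ["europe"], "impl_local": []},
]
query = {"text": ["building", "usa"], "expl_local": ["usa"], "impl_local": []}
weights = {"text": 2.5, "expl_local": 0.25, "impl_local": 0.0}  # assumed values
print([round(combined_score(query, d, docs, weights), 3) for d in docs])
```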
3 Runs
Before run generation, the BM25 parameters and the index field weights were tuned to fit the GeoCLEF collection. Using the 2007 GeoCLEF topics and relevance judgements, we generated several runs with different parameters and weights and then selected the run with the highest MAP value, the initial run. Afterwards, we generated several final query strings with different blind relevance feedback (BRF) parameters (the number of top-ranked expanded terms, top-k terms, and the number of top-ranked documents, top-k docs) from the initial run, and again generated several runs with different parameters and weights. The run with the highest MAP value, the final run, therefore corresponded to the best BM25 parameters and index field weights for the GeoCLEF collection with the 2007 topics and relevance judgements.

We submitted a total of 12 runs for each subtask, using slight variations of the parameter values from the best optimised runs. Table 2 gives the parameter values used for the official runs. The official runs are composed of initial runs (i.e., runs before BRF, #1 to #3) and final runs (i.e., runs after BRF, #4 to #12). We experimented with different ratios of the text / explicit local index weights, by increasing and decreasing the text index weight by 0.5. We observed that the BM25 optimisation for the Portuguese subtask presents many local optima of MAP, so we used three BM25 configurations for the official runs, to increase the odds of being near a global optimum of the BM25 parameters. For the English subtask, on the other hand, the BRF parameters were more influential for the optimal MAP values than the BM25 parameters, so its runs have different BRF parameter values. The implicit local index field did not improve MAP values in any optimisation scenario, and thus it was turned off in the submitted runs.

Table 2. The configuration parameters used for the official runs.

Portuguese        Initial Run                        BRF           Final Run
Run number        b    k1   text           exp.l. imp.l.  terms docs   b    k1   text           exp.l. imp.l.
#1, #2, #3        0.4  0.9  {2.0,2.5,3.0}  0.25   0.0     -     -      -    -    -              -      -
#4, #5, #6        0.4  0.9  2.5            0.25   0.0     8     5      0.95 0.3  {2.0,2.5,3.0}  0.25   0.0
#7, #8, #9        0.4  0.9  2.5            0.25   0.0     8     5      0.65 0.35 {2.0,2.5,3.0}  0.25   0.0
#10, #11, #12     0.4  0.9  2.5            0.25   0.0     8     5      0.65 0.5  {2.0,2.5,3.0}  0.25   0.0

English
#1, #2, #3        0.65 1.4  {1.5,2.0,2.5}  0.5    0.0     -     -      -    -    -              -      -
#4, #5, #6        0.65 1.4  2.0            0.5    0.0     8     15     0.65 1.4  {1.5,2.0,2.5}  0.5    0.0
#7, #8, #9        0.4  0.9  2.5            0.25   0.0     8     10     0.65 0.35 {1.5,2.0,2.5}  0.5    0.0
#10, #11, #12     0.4  0.9  2.5            0.25   0.0     8     5      0.65 0.5  {1.5,2.0,2.5}  0.5    0.0
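The tuning loop can be sketched as a simple parameter sweep; evaluate_map is a hypothetical placeholder for running retrieval and scoring it with trec_eval against the 2007 relevance judgements, not an actual script of ours:

```python
import itertools

# Hypothetical sketch of the parameter sweep used to pick the "initial run".

def evaluate_map(b, k1, text_w, expl_w):
    """Placeholder: index/retrieve with these parameters and return the MAP
    measured by trec_eval on the 2007 topics and relevance judgements."""
    return 0.0  # replace with a real evaluation

grid = itertools.product([0.3, 0.4, 0.65],   # b
                         [0.9, 1.2, 1.4],    # k1
                         [2.0, 2.5, 3.0],    # text field weight
                         [0.25, 0.5])        # explicit local field weight
best = max(grid, key=lambda params: evaluate_map(*params))
print("best (b, k1, text, exp.local):", best)
```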
Table 3. MAP values and optimising parameters for all GeoCLEF evaluations.

Best GeoCLEF 2008 runs
         Initial Run                                   BRF          Final Run
Run      b    k1   MAP     text exp.l. imp.l. MAP      terms docs   b    k1   MAP     text exp.l. imp.l. MAP
PT3      0.4  0.9  0.2222  2.5  0.25   0.0    0.2234   -     -      -    -    -       -    -      -      -
EN6      0.65 1.4  0.2519  2.0  0.5    0.0    0.2332   8     15     0.65 1.4  -       2.5  0.5    0.0    0.2755
Past GeoCLEF evaluations
         Initial Run                                   BRF          Final Run
Run      b    k1   MAP     text exp.l. imp.l. MAP      terms docs   b    k1   MAP     text exp.l. imp.l. MAP
PT 2006  0.4  0.4  0.1613  2.0  0.25   0.0    0.1810   16    5      0.55 0.9  0.1967  2.0  1.25   0.5    0.2082
PT 2007  0.4  0.9  0.273   2.5  0.25   0.0    0.3037   8     5      0.3  0.95 0.3310  2.5  0.25   0.0    0.3310
PT 2008  0.35 1.2  0.2233  4.0  0.25   0.0    0.2301   12    15     0.5  1.0  0.2069  1.5  0.25   0.0    0.2089
EN 2006  0.3  1.6  0.2158  2.25 0.5    0.25   0.2442   16    5      0.8  0.2  0.2704  0.75 0.25   0.5    0.2714
EN 2007  0.65 1.4  0.2238  2.0  0.5    0.0    0.2713   8     15     0.65 1.4  0.2758  2.0  0.5    0.0    0.2758
EN 2008  0.65 1.6  0.2528  3.5  0.25   0.25   0.2641   12    10     0.75 0.6  0.2809  2.0  0.25   0.0    0.2814
4 Results
Table 3 presents the official GeoCLEF 2008 results (top part) and the results for previous GeoCLEF evaluations with optimised parameters (bottom part). We observe that our best Portuguese run was in fact an initial run (with a MAP of 0.2234). The post-hoc optimisation corroborates the observation that the best MAP values for Portuguese are achieved by initial runs (with a best MAP value of 0.2301), which is somewhat unexpected. For the English subtask, the best run was a final run. It achieved a MAP value of 0.2755, which could be pushed further up to 0.2814 with optimised parameters.

The results show that using the explicit local index field in the retrieval process improves the results in all GeoCLEF evaluations, while the implicit local index field does not contribute at all to the improvement of the retrieval results. This means that the GIR prototype consistently outperforms a classic IR system, but it also contradicts the initial assumption that implicit geographic evidence would improve searches. In fact, we observe that the implicit local index field takes part only in the best MAP values for the GeoCLEF 2006 topics. We believe this can be explained by the fact that the implicit geographic evidence captured by REMBRANDT is mostly grounded to countries and continents, and in GeoCLEF 2006 the geographic scopes of the topics were mostly about countries and continents.

A group of statistical significance tests (Wilcoxon signed-rank, Student's t and randomisation tests) comparing the 2008 official runs and the 2008 post-hoc runs shows that their differences in MAP values are not statistically significant, meaning that the best official runs were obtained with near-optimal BM25 parameter values and index field weights. This also shows that the best parameter values from the GeoCLEF 2007 optimisation were good parameter values for GeoCLEF 2008 as well, as they were not over-fitted to the GeoCLEF 2007 data.
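For reference, a per-topic comparison along these lines can be sketched with SciPy; the AP values below are illustrative placeholders rather than our measured per-topic scores, and the randomisation test is omitted:

```python
from scipy import stats

# Hypothetical sketch of the per-topic significance tests mentioned above,
# comparing average precision (AP) values of an official and a post-hoc run.
ap_official = [0.31, 0.12, 0.45, 0.08, 0.27, 0.33, 0.19, 0.22]  # toy AP values per topic
ap_posthoc  = [0.33, 0.10, 0.47, 0.09, 0.27, 0.35, 0.18, 0.24]

w_stat, w_p = stats.wilcoxon(ap_official, ap_posthoc)   # Wilcoxon signed-rank test
t_stat, t_p = stats.ttest_rel(ap_official, ap_posthoc)  # paired Student's t-test
print(f"Wilcoxon p={w_p:.3f}, paired t-test p={t_p:.3f}")
# p-values above the usual 0.05 threshold indicate no significant difference
```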
5 Conclusions
We participated in GeoCLEF 2008 with the purpose of maturing the ideas that were introduced for the 2007 participation, namely the use of signatures for representing the geographic scopes of queries and documents. We focused on generating more comprehensive document geographic signatures, by capturing and using both explicit and implicit geographic evidence.
The results showed that our GIR prototype is consistently better when the geographic indexes are used in the retrieval, meaning that our GIR approach outperforms classic IR retrieval in every GeoCLEF evaluation scenario since 2006.

For future work, we plan to improve REMBRANDT's strategy for capturing implicit geographic evidence. We believe that its naïve approach generated noisy signatures and was responsible for the futility of the implicit local index field. We also want to develop a new adaptive strategy for QuerCol, as the optimal QE parameters vary for each topic, and using the same configuration set for all topics generates sub-optimal expanded queries. Finally, we intend to optimise term weights in the BM25 implementation available in MG4J for document retrieval.

Acknowledgements

We thank David Cruz and Sebastiano Vigna for the modifications made to MG4J for the experiments, and Marcirio Chaves for updating the geographic ontology. Our participation was jointly funded by the Portuguese government and the European Union (FEDER and FSE) under contract ref. POSC/339/1.3/C/NAC (Linguateca), and partially supported by grants SFRH/BD/29817/2006 and PTDC/EIA/73614/2006 (GREASE II) from FCT (Portugal), co-financed by POSI.
References

1. Cardoso, N., Cruz, D., Chaves, M., Silva, M.J.: Using Geographic Signatures as Query and Document Scopes in Geographic IR. In: Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of CLEF 2007. Volume 5152 of LNCS, Springer (2008) 802–810
2. Cai, G.: GeoVSM: An Integrated Retrieval Model for Geographic Information. In: Proceedings of GIScience 2002. Volume 2478 of LNCS, Boulder, CO, USA, Springer (2002) 65–79
3. Mitra, M., Singhal, A., Buckley, C.: Improving Automatic Query Expansion. In: Proceedings of SIGIR 1998, Melbourne, Australia, ACM (1998) 206–214
4. Cardoso, N.: REMBRANDT - Reconhecimento de Entidades Mencionadas Baseado em Relações e Análise Detalhada do Texto. In Mota, C., Santos, D., eds.: Desafios na avaliação conjunta do reconhecimento de entidades mencionadas: O Segundo HAREM. Linguateca (2009)
5. Boldi, P., Vigna, S.: MG4J at TREC 2005. In: Proceedings of TREC 2005, NIST (2005). http://mg4j.dsi.unimi.it
6. Robertson, S.E., Walker, S., Hancock-Beaulieu, M., Gull, A., Lau, M.: Okapi at TREC-3. In: Proceedings of TREC-3, Gaithersburg, MD, USA (1992) 21–30
7. Santos, D., Seco, N., Cardoso, N., Vilela, R.: HAREM: An Advanced NER Evaluation Contest for Portuguese. In: Proceedings of LREC 2006, Genoa, Italy (2006) 1986–1991
8. Santos, D., Carvalho, P., Oliveira, H., Freitas, C.: Second HAREM: new challenges and old wisdom. In: Proceedings of PROPOR 2008. Number 5190 in LNCS, Aveiro, Portugal, Springer (2008) 212–215
9. Efthimiadis, E.N.: A user-centered evaluation of ranking algorithms for interactive query expansion. In: Proceedings of SIGIR 1993, Pittsburgh, PA, USA, ACM (1993) 146–159
10. Cardoso, N., Silva, M.J.: Query Expansion through Geographical Feature Types. In: 4th Workshop on Geographic Information Retrieval, Lisbon, Portugal, ACM (2007)