Fuzzying GIS Topological Functions for GIR Needs
Christian Sallaberry
Mauro Gaio
Damien Palacio
Julien Lesbegueries
LIUPPA Lab. UPPA University BP 575 64012 PAU Cedex +33 5 59 40 80 67
LIUPPA Lab. UPPA University BP 1155 64013 PAU Cedex +33 5 59 40 75 70
LIUPPA Lab. UPPA University BP 1155 64013 PAU Cedex +33 5 59 40 76 52
LIUPPA Lab. UPPA University BP 1155 64013 PAU Cedex +33 5 59 40 77 93
christian.sallaberry@univ-pau.fr
mauro.gaio@univ-pau.fr
damien.palacio@univ-pau.fr
Julien.lesbegueries@gmail.com
to support semantic processes for geographic information automated indexing and, further, Geographic Information Retrieval (GIR) processes.
ABSTRACT Natural Language “schematizes” space; textual geographic information is usually a selection of certain aspects of a referent scene while neglecting others. Thus, an indexing process relying on such information obviously contains some degree of imprecision and uncertainty. The PIV prototype is a GIR system dedicated to geographic evocations tagging, geo-computing, indexing, querying and visualizing in wide corpora of travel books. The aim of this paper is to focus on the PIV spatial relationships management of vagueness for distance, direction and topology relationships. The proposed approach extends GIS operators with fuzzy spatial relationship functions like proximity and cardinal direction.
To this end, we have promoted two chains for the automatic semantic tagging of spatial aspects [16] and also of temporal ones [20], in order to obtain more realistic geographical information within textual documents. Both chains produce an index where each feature is associated with one or more footprints [22]. We also propose a pattern-based approach for summarizing spatial information [23]. This approach consists in associating “spatial patterns” with text-units. Therefore, we add three new patterns, “point of view”, “itinerary” and “area comparison”, to the core model. In this way, we summarize a text-unit by one (or a few) prevailing spatial patterns. So, at any level of granularity of the document structure, spatial information may be indexed as instances of our core model.
Categories and Subject Descriptors H.3 [Information Storage And Retrieval]: Content Analysis and Indexing, Information Search and Retrieval, Digital Libraries
A textual semantic analysis process supports the interpretation of PIV spatial relationships such as in the vicinity of Laruns (village) and computes the corresponding footprints. In the first version of the PIV prototype, Minimum Bounding Rectangles (MBR) [18] are used to approximate spatial relationships. We ran a first experiment on a set of travel books from a digital library specialized in local cultural heritage: it achieved good precision but rather poor recall [26].
General Terms Algorithms, Experimentation, Standardization, Languages.
Keywords Spatial Feature, Spatial Relationship, Spatial Indexing and Retrieval, Textual Digital Library
Information
1. INTRODUCTION
In order to improve the PIV GIR processes, in this paper we focus on a better computational interpretation (i.e. modeling and reasoning) of spatial relationships for a more accurate approximation of the corresponding footprints. First, we want to better integrate the way that space and the things in it are schematized in natural language. Then, we need better suited GIS functions to approximate such relationships in order to infer more convenient footprints. Note that the geographic aspects of our region of interest involve relatively small-scale and multi-scale information. As noted in [28], there are as yet no approaches dealing with small-scale information in multi-scale contexts.
Textual expression of spatial information is a particular cognitive representation of perceptive world phenomena. Moreover, textual description ways may be multiple and usually contain some context induced information. The PIV (Virtual Itineraries in the Pyrenean Mountains) prototype manages textual repositories and tags the contents of documents. A first version focuses on digital libraries and proposes to extend basic services of existing Library Management Systems with new ones dedicated to geographic Information Extraction and Retrieval. Geographic information in such repositories is composed of a Spatial Feature (SF), a temporal feature and a thematic one. “Musical instruments in the vicinity of Laruns in the 19th century” is an example of a complete geographic feature. We have proposed adaptive core models [22]
The following excerpt shows some examples of spatial syntagms that the PIV GIR system has to deal with: in the Ossau valley, south of Laruns village and forests near Eaux-Chaudes town. These syntagms are characteristic of the ones used in the corpus.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GIR’08, October 29–30, 2008, Napa Valley, California, USA. Copyright 2008 ACM 978-1-60558-253-5/08/10...$5.00.
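“[…] Tout le ciel a ce matin la belle couleur d'ambre des raisins d'automne et la journée est belle, malgré le vent qui souffle perpétuellement dans la vallée d’Ossau que nous empruntons pour nous diriger au sud de Laruns. Par un couloir de roches nues, nous atteignons les forêts près des Eaux-Chaudes ; le vent y est encore violent et rend fort désagréable ce séjour. […]”
(English translation: “[…] This morning the whole sky has the beautiful amber colour of autumn grapes and the day is fine, despite the wind that blows perpetually in the Ossau valley, which we follow to head south of Laruns. Through a corridor of bare rocks, we reach the forests near Eaux-Chaudes; the wind is still violent there and makes this stay most unpleasant. […]”)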
Our proposition is to add to GIS spatial functions a set of parameters in order to be able to compute geometries from various qualitative spatial relationships. The following examples show how qualitative spatial relationships can be inferred from context at different levels: (i) At the syntagm/sentence level, the properties of named entities can be used. For example, the proximity relationship in a textual expression like near Eaux-Chaudes town can be specified thanks to the size, the shape or other known properties of the Eaux-Chaudes town. (ii) A more precise representation can be inferred when the nature of spatial objects is explicitly mentioned in the text. For example, the representation of the forests near Eaux-Chaudes town could consist in aggregating the forest objects present in geographic resources that intersect the first computed representation. (iii) At higher levels, more global contexts can be considered. These different contexts might come from the paragraph, the section, the document or, by default, the whole corpus, and can be used in order to limit the outreach of qualitative relationships. For example, in the excerpt, we can use the presence of the Ossau valley in the document title (document context) or the implicit area of the Pyrenean Mountains, because the corpus comes from a local cultural heritage digital library (corpus context), in order to limit the size of the south of Laruns representation.
and STEWARD systems deal with absolute SF (named entities and addresses) only. Another particular characteristic concerns the size of the managed information units: textual paragraphs of a domain-specific corpus (in our use case, the cultural heritage of the Pyrenees) and web pages in the case of the SPIRIT and STEWARD systems. A refined markup process and a spatial information interpretation are applied both at the PIV information units indexing stage and at its users' query interpretation stage. As we work on specific digital library collections, and as these collections are quite stable and not too large, the heavy back-office spatial process seems to be suitable [22]. Therefore, the cost of such refined spatial-aware indexing is reasonable: first, the PIV process tags absolute SF like Pau and relative SF like near Pau, at about 10 km south of Pau city, between Pau and Biarritz…; then, it evaluates the semantics associated with such spatial relationships and computes the corresponding footprints. The question is how to better manage the resulting footprints, since they are computed from textually expressed spatial information that might be under-specified or context-dependent. From a cognitive point of view, the answer may lie in the way that “natural language structures space”, the title of an article by Leonard Talmy published in 1983 [30]. He considers the foundational role played in linguistic space descriptions by schematization as “a process that involves the systematic selection of certain aspects of a referent scene...” while disregarding the remaining aspects. He also argues that natural language characterizes one object's spatial disposition in terms of another's. The first object's site, path, or orientation is thus indicated in terms of distance from or relation to the geometry of the second object. From a computational modeling and processing point of view, a good response for these particular aspects (i.e. how natural languages represent spatial relationships) can be found in qualitative spatial reasoning (QSR) approaches.
In order to exploit such context semantics, the PIV has to integrate some Qualitative Spatial Reasoning (QSR) in its GIS-based indexing process. We have therefore to determine and to manage vagueness for distance, direction and topology relationships. For these, a set of parameters has to be defined in order to graduate the vagueness; for example the well within, near inside and near outside properties, in case of a distance relationship from the crisp original object.
Cohn [13, 12] and Egenhofer [9, 15, 14] models are well-known answers to manage such forms of QSR. They consist in defining spatial atomic features and relationships between them. Cohn [13] categorizes these relationships according to the following aspects: ontology, topology, modes of overlap and uncertainty. To reach the same levels of meaning of natural language expressions, the topology aspect, in particular, is often associated with metric spatial representations (orientation, distance and size, shape). From this categorization and region-based atomic features, the models define such relationships as: disjoint, meet, equal, covered by, inside, overlap … [14].
The paper is organized as follows. Section 2 briefly presents some existing geographic IR systems and then develops QSR approaches within such systems. In Section 3, we make propositions to interpret spatial relationships tagged in textual documents and associate them with one or more accurate footprints. We detail the specific functions we have developed and illustrate their integration in the PIV system.
2. RELATED WORKS
Various effective geographic processing frameworks for IR engines are proposed in the SPIRIT project, the Geosearch system or the GEO-IR system; they are outlined in [8, 31, 25]. The GRID [32] and GIPSY [34] systems have the same goals. The STEWARD [24] and SPIRIT [19] spatio-textual indexing engines seem to be the most similar to that of the PIV. They use Natural Language Processing techniques, Part-Of-Speech and Named-Entity Recognition taggers.
On the other hand, GIS systems are appropriate for spatial indexing processes and for managing quantitative geographic information. Indeed, indexing methods need an easy and efficient way to store spatial (or spatiotemporal) data. Qualitative representations can only be retrieved using getter functions (as those implementing the “Egenhofer model” [14]). These functions give a state (A meets B, A is inside B) and determine a qualitative relationship between two well-defined quantitative SF.
The main difference of the PIV approach lies in the back-office semantic spatial reasoning used for both absolute Spatial Features (SF) and relative SF interpretation and indexing. Indeed, the PIV spatial and temporal models support complex information formulated in natural language. A SF such as Biarritz district is a well-known named place. We call it an “absolute” SF. Complex SF such as Biarritz vicinity or South of Biarritz district have to be interpreted and, therefore, need some spatial reasoning processes. Such features are called “relative” SF. We associate each relative SF with one or more spatial relationships (adjacency, inclusion, distance, cardinal direction) for a recursive definition [22]. For instance, the SPIRIT
Bennett's [4, 3] proposition consists in storing qualitative relationships in another index, looking like a triplet for each couple of SF: (SF1, SF2, kind of relationship), based on inference rules. With this method, the GIS could give satisfactory answers for qualitative queries without quantitative computing. For example, the rule inside (Lake A, Forest F) returns F for the query Where is the lake A? This proposition is unfortunately not well suited for our Geographic Information Retrieval (GIR) system. In principle, one
could propose to infer such kind of index once and for all because of heavy mechanisms due to inference rules. For example, if we want to know: in which county is the X museum located?, we can imagine the following rules: Include(X, S street), Include(S, T town), Include (T, C county) in order to answer the query. But such a precomputed method is not a suitable process for managing various querying contexts. For example, the proximity relationship between two SF could be considered near for a given context or not for another one. A more convenient process involving GIS operators could be implemented since a footprint is associated to any SF. Therefore an intersection calculus between the counties layer and the buildings one allows giving more suitable results.
Moreover, even if we manage sharp geometries, it could be pertinent to manage indeterminate objects since a named entity can be used within some under-specified context. Referring to the EggYolk proposition [11], an indeterminate boundary could be managed in some cases in our approach. It will be shown in particular how we consider indetermination in the adjacency case. In order to perform an efficient and accurate indexing process, we propose to extend GIS operators with spatial relationship functions like proximity and cardinal direction. These functions' parameters may be valued or not after the analysis of the textual evocations. Anyway, they aim to give more accurate representation approximations of such relationships.
Many Geographic Information Retrieval (GIR) projects manage their spatial indexes using Minimum Bounding Rectangles (MBR) or convex-hull [21] paradigms. Such solutions compute SFs' footprint approximations for large-scale digital library indexes; they address index and query time performance. However, MBR can cause imprecision problems, particularly when complex SFs are managed with more accurate representations at a smaller geographical scale (over-estimated area, misrepresented shape). For example, the syntagm near Laruns needs to be geographically represented by a buffer with a hole inside (see Figure 2(a)). Indeed, an MBR will cause a “false positive” at the information retrieval stage (in Figure 1, Laruns itself would be considered relevant since it is included in the near Laruns MBR).
3. SPATIAL RELATIONSHIPS IN TEXTUAL EVOCATIONS The sample data used for training and testing the PIV system contains 10 OCRised books dealing with the Pyrenean cultural heritage [26]. The books are split into paragraphs constituting about ten thousand document units. The PIV prototype found 9835 candidate SF in these document units. A first statistic survey pointed out that the extracted absolute Spatial Features (SF) may be classified into such main categories as Town, Mountain, Valley, Forest, Habitat, Road, Watercourse, Water layout, etc. Another study of this corpus pointed out an intensive use of topological and cardinal direction spatial relations. We propose hereafter to illustrate two spatial relationships analyses. We are going to combine adjacency (the most used of the topological relations [5] within textual documents) and distance relationships within a larger concept of proximity. We will focus on proximity relationships and give some examples of cardinal direction relationships representations. The aim of this work is to build more accurate footprints corresponding to such relative SF automatically.
Finally, some interesting works propose models to integrate fuzzy information in GIS. T. Beaubouef [2], for example, proposes a rough set modeling for object-oriented databases, defining lower and upper approximations for concepts that can then be brought together. Bordogna [6] presents an ontologically based model and membership functions managing uncertainty, vagueness and errors, which are imperfections occurring when spatial information is used. These functions could be an interesting way to manage qualitative relationships, even if in a GIR process, these relationships have to be indexed. Moreover only one general context is taken into account, based on the size of the considered map and the knowledge on the phenomenon under consideration.
Before calling these proximity and cardinal functions, a semantic analysis of the relative SF is necessary: all the parameters of the function are valued at this stage. Therefore, the process considers: (i) the spatial relation key-words (cf. rsf/label in Figure 1); (ii) the type of the absolute SF (cf. asf/type in Figure 1); (iii) the size of the absolute SF geometry; (iv) the form of the absolute SF geometry (polygon, line, point); (v) the general context of the document part in which the relative SF is mentioned. Then, the relative SF footprint may be computed. Let us note that a GIS and GIS layers support spatial resources, operations and results visualization.
These fuzzy object-based data models propose to manage vagueness at different levels, taking into account: (i) fuzzy objects that represent vague phenomena; (ii) indeterminate objects that represent crisp phenomena, the observation of which introduces indeterminacy, incompleteness or imprecision; (iii) indeterminate spatial relationships that are topological and metric relationships between pairs of objects. Our work addresses such vague or context-dependent spatial relationships since we consider that formulated spatial information might be composed of named entities (absolute SF) that can be geolocalized using GIS resources. These relationships can be managed thanks to Qualitative Spatial Reasoning (QSR). In order to explore such semantics, the PIV has to integrate some qualitative spatial reasoning in its GIS-based indexing process. We have therefore to determine and break up vagueness in these spatial relationships. Bordogna [6] proposes to manage vagueness for distance, direction and topology relationships. For these, a fuzzy membership function is defined in order to graduate the well within, near inside and near outside properties, using a distance from the crisp original object. Such a three-level gradation then seems to be pertinent. In our case, however, we will have to manage it in a simpler way since this gradation must be integrated in an automated information indexing and retrieval process.
Figure 1. Spatial relationships extraction and interpretation.
interpretation area depending on the context. The last parameter “N” is used when we can specify a specific nature for the relative SF representation. Figure 2(b) shows a possible representation for the forests near Laruns. The far relationship (Figure 2(c)) makes the d1 parameter represent the disaffection distance.
3.1 The PIV prototype
Each SF is described within instances of our spatial model: specific algorithms [22] process relationships' semantics analyses and compute corresponding approximated geometries. It is quite easy to ask a GIS to retrieve a footprint for an absolute SF. However, it is more complex to compute the geocoded footprint for a relative SF like in the east of the vicinity of Laruns or nearby Laruns village. The PIV proposes an algorithm that consists in recursively carrying out geometrical transformations on MBR. As a relative SF is always constructed using an absolute SF's footprint, our algorithm begins by retrieving the geometry of the absolute SF included in the relative SF and then transforms it into an MBR. Then, exploring the relationships that define the relative SF recursively, it makes geometrical transformations on the original MBR (translations, homotheties, etc.) (cf. Figure 1). This method may give different representations and it ensures the proportionality between an approximated relative SF and its original absolute SF [27].
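The recursive MBR transformation described above can be sketched as follows. This is only an illustration under stated assumptions: the `MBR` type, the `TRANSFORMS` table, the scaling factor of 2 for "vicinity" and the one-width shift for "east" are hypothetical choices, not the values used by the PIV prototype. The point is the recursion: the innermost relationship is applied first, and every offset is proportional to the current rectangle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MBR:
    """Axis-aligned minimum bounding rectangle (hypothetical helper type)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def mbr_of(points):
    """Smallest axis-aligned rectangle enclosing a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return MBR(min(xs), min(ys), max(xs), max(ys))


def homothety(box, factor):
    """Scale the rectangle about its own center by `factor`."""
    cx, cy = (box.xmin + box.xmax) / 2, (box.ymin + box.ymax) / 2
    half_w = (box.xmax - box.xmin) * factor / 2
    half_h = (box.ymax - box.ymin) * factor / 2
    return MBR(cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def translate(box, dx, dy):
    """Shift the rectangle by (dx, dy)."""
    return MBR(box.xmin + dx, box.ymin + dy, box.xmax + dx, box.ymax + dy)


# Assumed transformations: factors and offsets are illustrative only and
# are kept proportional to the rectangle they are applied to.
TRANSFORMS = {
    "vicinity": lambda b: homothety(b, 2.0),
    "east": lambda b: translate(b, b.xmax - b.xmin, 0.0),
    "south": lambda b: translate(b, 0.0, -(b.ymax - b.ymin)),
}


def interpret(relations, asf_box):
    """Relations are listed outermost first; the innermost is applied first."""
    if not relations:
        return asf_box
    outer, *inner = relations
    return TRANSFORMS[outer](interpret(inner, asf_box))


# "in the east of the vicinity of Laruns", with a toy Laruns footprint.
laruns = mbr_of([(0.0, 0.0), (2.0, 1.0)])
east_of_vicinity = interpret(["east", "vicinity"], laruns)
```

Because each transformation is expressed relative to the rectangle it receives, the resulting footprint stays proportional to the original absolute SF, whatever its size.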
Figure 2. (a) right next to Laruns relative SF general representation, (b) nature specific representation, (c) far from Laruns representation.
If we consider the automated information extraction process of the PIV prototype, let us note that it may return details about the context related to the relative SF. For instance, it tags the proximity spatial relationship where a relative SF context is fully described (e.g. Figure 1). It may also return a relative SF context that is partially described (with no information about the SF's nature or the local, regional, national, and world wide scale context).
The relationship functions described in the following sections manage qualitative information thanks to quantitative data. The qualitative part is interpreted thanks to parameters given to the functions. These parameters are evaluated by a preceding linguistic process that evaluates the spatial context in order to propose a more precise representation. The linguistic process, supported by geographic resources, retrieves the type of the spatial feature (point, line or polygon), its size and its nature (city, forest, river, etc.). This last characteristic can be captured in the text itself whenever it is mentioned (forests near Laruns, cf. Figure 1, or Mont Perdu peak, cf. Figure 3). In this case we call the nature of the SF an "endogenous nature" (line 7 Figure 1). Otherwise, external geographic resources (gazetteers) are necessary to search for such information (e.g. the nature of the Aberouate named entity is refuge, the nature of the Ayous named entity is peak). Here, we call the nature of the SF an "exogenous nature". Sometimes, we may find neither endogenous nor exogenous natures; a larger context analysis is then necessary.
So, the proximity function may be called with default values: the d1 distance and the d2 range are computed automatically, ensuring the proportionality between the parameter values and the original absolute SF's size (e.g. about two fifths of it for d2 at the right-next level). Otherwise, the same context may be applied to a set of documents: d1 and d2 values are fixed during the whole information extraction process of this set of documents. One considers that the set of documents concerns a district, a town, a valley, a county or a country and then fixes d1 and d2 values for the whole extraction process (e.g. about two fifths of a county for d2 at the right-next level). Finally, the proximity function may also be called with dynamically computed parameter values. Here, we consider that the previous steps of the information extraction process have automatically extracted the spatial scopes related to the SF's, its document's and its corpus's contexts. Therefore, the process computes a new scope based on the average of these three. Then, d1 and d2 values are computed according to this information (e.g. about two fifths of this new scope for d2 at the right-next level).
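The three ways of valuing d1 and d2 can be summarized by a short sketch. The function names and the representation of a "size" or "scope" as a single length are assumptions made for illustration; only the two-fifths ratio, the zero default for d1 and the averaging of the three scopes come from the text.

```python
RIGHT_NEXT_RATIO = 2 / 5  # "about two fifths" for d2 at the right-next level


def params_from_size(reference_size, d1=0.0, ratio=RIGHT_NEXT_RATIO):
    """Return (d1, d2) with d2 proportional to a reference size.

    The reference size is the absolute SF's size in the default mode, or
    the size of a fixed context (district, county, ...) shared by a whole
    set of documents. Sizes are plain lengths here, which is an assumption.
    """
    if reference_size <= 0:
        raise ValueError("reference_size must be positive")
    return d1, ratio * reference_size


def dynamic_scope(sf_scope, document_scope, corpus_scope):
    """New scope computed as the average of the three extracted scopes."""
    return (sf_scope + document_scope + corpus_scope) / 3.0


def params_from_contexts(sf_scope, document_scope, corpus_scope):
    """Dynamic mode: derive (d1, d2) from the averaged scope."""
    scope = dynamic_scope(sf_scope, document_scope, corpus_scope)
    return params_from_size(scope)


# Toy values, in kilometres.
default_d1, default_d2 = params_from_size(10.0)
dynamic_d1, dynamic_d2 = params_from_contexts(3.0, 6.0, 9.0)
```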
The following functions address spatial relationships georeferencing. Although they propose basic services, the originality of the approach is that they take place in an automated spatial information indexing and retrieval process. The behavior of these functions depends on the parameter values. For instance, the local spatial context and/or a larger document or corpus spatial context, computed during the linguistic process, could give some more information to limit the vagueness of spatial relationships.
3.2.2 Absolute SF shapes and representations of proximity
3.2 Proximity spatial relationships
Proximity-computed shapes correspond to an enlarging of the embedded absolute SF.
These relationships are managed by a function interpreting three levels of proximity: (i) the right-next level, (ii) the near level and (iii) the far-from level. We now describe the parameters involved in the definition of this proximity function.
3.2.1 Proximity function parameters
This function has four parameters: absolute SF geometry, overlap distance (d1), range (d2) and nature (N). We illustrate them in Figure 2. For instance, the Laruns village is the absolute SF composing the near Laruns relative SF: here, the first parameter is the geocoded footprint of the Laruns absolute SF. Figure 2(a) illustrates the d1 and d2 parameters for the near Laruns relative SF. d1 stands for the potential overlap of the absolute SF, used for the (i) right-next level. Its default value is zero. d2 stands for the
Figure 3. Right-next proximity representations for (a) polygonal, (b) punctual and (c) linear geometries.
The most complex case is when the geometry needing enlarging is a polygon (Figure 3(a)). One needs to take into account the fact that the boundaries of a relative SF are not precisely defined in the document text unit. However, limits have to be deduced. For example, when the context defines the right-next level, an overlap and a range distance are given (Figure 3(a)). For simpler geometries (Figure 3(b) and 3(c)), only the range distance (d2) is used.
For function (ii), the third step of Figure 4 is omitted. For function (iii), the process is quite similar. (d1+d2) is computed to define the upper boundary and a geometric difference removes ASF geometry neighborhood.
3.2.3 Proximity functions implementation
3.3 Integration of new functions within the PIV prototype
In a similar manner, we have defined a cardinal function and parameters in order to automatically interpret cardinal direction spatial relationships.
The implementation of the proximity functions calls standard GIS operators so that they can be easily integrated in a GIS. The main operator used in the proximity functions is the Buffer(geometry, distance) GIS function. It returns a new geometry composed of the points whose distance from the original geometry is less than or equal to distance. It may be combined with the Union(), Intersection() and Difference() area functions. For the three cases of defined “proximity” ((i) right next, (ii) near, (iii) far from), the following algorithms are defined:
We have compared the MBR representations commonly used in geographic information retrieval projects (e.g. the first version of PIV) with the representations computed thanks to these new functions. For instance, Figure 5(a) shows three representations of the near Pau syntagm tagged within a textual document: a first indexation granularity proposes the corresponding MBR (a1); a second interpretation computes the new approximated geometry around the town of Pau (a2 buffer); finally, a third geometry corresponding to the towns related to Pau (nature parameter = “town”) is proposed (a3). We can observe that approximations a2 and a3 seem to be more accurate. Moreover, the nature characteristic (town in Figure 5, forest in Figure 4) associated with the spatial relationship may consequently improve the precision of the corresponding approximated footprint.
(i) Geom = Union( Difference(ASF_geom, Buffer(ASF_geom, -d1)), Buffer(ASF_geom, d2) )
(ii) Geom = Buffer(ASF_geom, d2)
(iii) Geom = Difference( Buffer(ASF_geom, d1+d2), Buffer(ASF_geom, d1) )

Geom is the final geometry and ASF_geom is the geometry of the original absolute SF. We illustrate the main stages of the Proximity function (i) through Figure 4 and the following example: Forests right next to Laruns. From this syntagm (step 1) the system extracts the named entity Laruns village and the relationship of proximity and it calls the adequate function (step 2). Steps 3, 4 and 5 break up the (i) proximity function: a first ring around the named entity geometry (using d1), one other larger ring (d2) and finally the union of the two rings.
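The three proximity levels can be restated as tests on a signed distance to the absolute SF, since Buffer(g, r) is the set of points whose signed distance to g is at most r (negative inside g). The sketch below is a minimal illustration, not the PIV implementation: it assumes a disc-shaped absolute SF so that it needs no GIS library, and it follows the ring reading given in the prose (steps 3 to 5, and the "buffer with a hole" of Figure 2(a)). Under that reading the core of the absolute SF is excluded, whereas a literal Buffer(ASF_geom, d2) would also cover it; this choice is an interpretation.

```python
import math


def signed_distance_to_disc(point, center, radius):
    """Negative inside the disc, zero on its boundary, positive outside."""
    return math.dist(point, center) - radius


def right_next(sd, d1, d2):
    """(i) Inner ring of width d1 overlapping the SF, plus outer ring d2."""
    return -d1 <= sd <= d2


def near(sd, d2):
    """(ii) Outer ring only: the d1 ring step is omitted."""
    return 0.0 < sd <= d2


def far_from(sd, d1, d2):
    """(iii) Band between Buffer(d1) and Buffer(d1 + d2)."""
    return d1 < sd <= d1 + d2


# Toy absolute SF: a disc of radius 5 centred at the origin (assumption).
CENTER, RADIUS = (0.0, 0.0), 5.0
D1, D2 = 1.0, 2.0


def classify(point):
    """Return the proximity levels satisfied by a point."""
    sd = signed_distance_to_disc(point, CENTER, RADIUS)
    return {
        "right_next": right_next(sd, D1, D2),
        "near": near(sd, D2),
        "far_from": far_from(sd, D1, D2),
    }
```

With a real GIS, the same tests would be replaced by the Buffer(), Union() and Difference() calls on the footprint geometry; the signed-distance form is only convenient for checking the intended shape of each level.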
Figure 5. Indexation: (a) near Pau representations. Information retrieval: (b) south of Pau query interpretation.
Figure 5(b) illustrates two representations of the south of Pau spatial query within an information retrieval process: a first query interpretation proposes the corresponding MBR (b1); a second interpretation computes the new approximated geometry south of the town of Pau (b2); finally, a third geometry corresponding to the towns related to the south of Pau could also have been proposed. We can observe that the MBR representation (b1) of the query is the least relevant: it intersects the Laroin town and near Espechede spatial representations (within the PIV index) whereas the new geometry (b2) does not.
The ongoing experimentation gives encouraging results. The first test collection for this study is composed of 10 books and 100 significant selected document units (paragraphs of 1 to 20 sentences): the information extraction process indexed 235 absolute SF and 74 relative SF. The smallest spatial vector of a document unit is composed of 1 SF and the largest one is composed of 15 SF. A SF is described in two different indexes: the PIVv1 index associates SF with MBR footprints and the PIVv2 index associates SF with buffer-like or polygonal footprints (Figure 5). We detail two queries submitted to the PIVv1 and PIVv2 prototypes: Q1 focuses on the around Bayonne area and Q2 on the near Jurançon area. We judged the spatial relevance of the retrieved document units using an approach like the relevance scheme proposed by P-D. Clough [10]: highly relevant; relevant; not relevant.
Figure 4. Computed representation of Forests right next to Laruns (using (i) algorithm).
When we know the nature N of the relative SF (like Forests in this example), we retrieve all the spatial features of nature N that intersect with Geom, within the additional step, step 6.
Table 1. Precision averages for PIVv1 and PIVv2 experimentations with Q1 and Q2 queries PIVv1 GIR
Table 2. Methods for computing spatial similarity
PIVv2 GIR
P@5
P@8
P@14
R@14
P@5
P@8
P@14
R@14
0.70
0.75
0.60
0.64
0.90
0.93
0.87
1.0
Then, we computed precision (P@) and recall (R@) averages for the PIVv1 and PIVv2 GIR prototypes (Table 1). The results support the assumption that the new spatial relationship approximation functions enhance retrieval accuracy by rank-ordering more document units by relevance. For instance, at top 5, precision reaches 90% for PIVv2, whereas it is 70% for PIVv1. Likewise, recall reaches 100% for PIVv2 at top 14. However, to obtain a more detailed evaluation, this experiment should be extended to a larger test collection and use independent judges (at least two or three) to examine the influence of the parameters d1 and d2.

In the context of geographic information retrieval (GIR) for digital libraries, we aim to extend spatial indexes with different levels of precision: we propose to associate fuzzy degrees with spatial relationships.

3.4 Fuzziness in spatial relationships
Traditional information retrieval scores and rankings are based on the statistical properties of terms in a collection. Geographic information retrieval, on the other hand, relies on spatial scores and rankings based on geospatial characteristics such as size, shape, location and distance. A spatial similarity score is derived from the extent of overlap between a candidate SF and the query region: the greater the overlap, the greater the assumed relevance of the candidate SF to the query. A variety of overlap-based spatial scores are discussed in the literature [1, 17, 33, 26] and presented in Figure 6 and Table 2, which are adapted excerpts from [21].

Reference | Formula
Hill, 1990 | Range = $2 \cdot \frac{O}{Q + C}$
Walker et al., 1992 | Range = $\min\left(\frac{O}{Q}, \frac{O}{C}\right)$
Beard and Sharma, 1997 | Case 1: Q contains C - Range = $\frac{C}{Q}$
Beard and Sharma, 1997 | Case 2: Q and C overlap - Range = O Q% (1 − O C%) + 100
Beard and Sharma, 1997 | Case 3: Q contained in C - Range = $\frac{Q}{C}$
Sallaberry et al., 2006, PIV prototype (v1) | Range = $\frac{\frac{O}{C} + \frac{O}{Q}}{2 + \frac{d}{D}}$

Where: Q = area of query region; C = area of candidate SF; O = area of overlap for Q, C; d = distance between Q, O centroids; D = Q area radius.
Range (for all): 0 = no similarity; 1 = identical.
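The overlap-based scores of Table 2 can be illustrated with a minimal sketch. It assumes that the areas Q, C and O and the distances d and D have already been computed by a GIS layer, and that the PIV v1 score reads (O/C + O/Q) / (2 + d/D); the function and parameter names are hypothetical and not taken from the PIV prototype.

```python
def hill_1990(q: float, c: float, o: float) -> float:
    """Hill (1990): twice the overlap area divided by the sum of both areas."""
    return 2.0 * o / (q + c)


def walker_1992(q: float, c: float, o: float) -> float:
    """Walker et al. (1992): the smaller of the two overlap ratios."""
    return min(o / q, o / c)


def piv_v1(q: float, c: float, o: float, d: float, big_d: float) -> float:
    """PIV v1 (assumed reading): (O/C + O/Q) / (2 + d/D).

    d is the distance between the Q and O centroids, big_d the Q area radius.
    """
    return (o / c + o / q) / (2.0 + d / big_d)


if __name__ == "__main__":
    # Illustrative values only: query area 100, candidate area 50, overlap 25.
    print(hill_1990(100.0, 50.0, 25.0))
    print(walker_1992(100.0, 50.0, 25.0))
    print(piv_v1(100.0, 50.0, 25.0, d=5.0, big_d=10.0))
```

With identical regions (Q = C = O and d = 0) all three functions return 1, and with no overlap (O = 0) they return 0, which matches the range stated for Table 2.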
We reuse the following continuous function [29] to compute the relevance weight of any part of a relative SF footprint:

$$W_c(d) = \frac{1 - d}{2c - 2c(1 - d) + (1 - d)} \qquad (1)$$
The weight W depends on the distance d from the SF: the farther a point is from the original SF, the lower its weight. The parameter c influences the shape of the curve (Figure 7). This weighting formula allows us to store a unique footprint.
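As a minimal sketch, function (1) can be evaluated as follows. The sketch assumes that d is a distance normalised to [0, 1] (0 on the original SF, 1 at the outer limit of the footprint) and that c is strictly positive; neither the normalisation nor the name `relevance_weight` comes from the PIV prototype.

```python
def relevance_weight(d: float, c: float) -> float:
    """W_c(d) = (1 - d) / (2c - 2c(1 - d) + (1 - d)).

    d: normalised distance from the original SF, assumed to lie in [0, 1].
    c: shape parameter, assumed strictly positive.
    """
    if not 0.0 <= d <= 1.0:
        raise ValueError("d must lie in [0, 1]")
    if c <= 0.0:
        raise ValueError("c must be strictly positive")
    t = 1.0 - d
    return t / (2.0 * c - 2.0 * c * t + t)


if __name__ == "__main__":
    # Print the weight at a few distances for three shapes of the curve.
    for c in (0.25, 0.5, 2.0):
        print(c, [round(relevance_weight(k / 4.0, c), 3) for k in range(5)])
```

Under these assumptions the weight is 1 on the original SF and 0 at the outer limit. With c = 0.5 the denominator equals 1 and the weight decreases linearly; larger values of c make it fall off faster near the SF.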
Figure 6. Schematic illustration of a PIV overlapping region.

For ill-defined regions, we suggest computing footprints composed of bounded sub-areas, each representing a degree of certainty (Figure 7). These sub-areas affect the evaluation of the spatial scores by adding a degree of relevance when computing the overlap. The most relevant one has the maximum rank level n; the least relevant one has a rank level close to 0. These levels may be translated into terms such as well, nearly, not completely, not at all, etc. [6], spatially related to an SF. This approach integrates fuzziness into indexes and, further, will make it possible to extend a query that returns few results or to limit a large result set to the most relevant ones.
Figure 7. (a) South direction and (b) near adjacency representations and (c) weighting within PIVv2 index.
The PIV prototype takes into account three sub-areas for each footprint, resulting from the discretization of function (1).
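One possible way to discretize function (1) into three fuzzy levels is sketched below. The thresholds (1/3 and 2/3 of the maximum weight), the default c = 0.5 and the normalisation of d to [0, 1] are assumptions made for illustration; the text does not state how the PIV prototype chooses its level boundaries.

```python
def relevance_weight(d: float, c: float) -> float:
    """Function (1), with d assumed normalised to [0, 1] and c > 0."""
    t = 1.0 - d
    return t / (2.0 * c - 2.0 * c * t + t)


def fuzzy_level(d: float, c: float = 0.5, thresholds=(1.0 / 3.0, 2.0 / 3.0)) -> int:
    """Map a normalised distance to a fuzzy level: 3 (most relevant) down to 1.

    The thresholds are hypothetical cut points on the continuous weight.
    """
    low, high = thresholds
    w = relevance_weight(d, c)
    if w >= high:
        return 3
    if w >= low:
        return 2
    return 1


if __name__ == "__main__":
    # Levels obtained at increasing distances from the original SF.
    print([fuzzy_level(k / 10.0) for k in range(11)])
```

The returned levels play the role of the weights 1, 2 and 3 attached to the sub-areas of a footprint.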
area management at the stage of both the indexing and the information retrieval processes.
Therefore, three fuzzy levels (Figure 7, Figure 8) have to be considered in the new algorithm:

$$W_{ASF} = \sum_{i=1..3} W_{Q_i} \times \left(\frac{O}{C_{ASF}} + \frac{O}{Q_i}\right) \times W_A \qquad (2)$$

$$W_{RSF} = \sum_{i=1..3} W_{Q_i} \times \sum_{j=1..3} \left(\frac{O}{C_{SF_j}} + \frac{O}{Q_i}\right) \times W_{SF_j} \qquad (3)$$

where:
$W_A = 3$ is the maximum weight (absolute SF);
$W_{Q_i}$ is the weight (1, 2 or 3) corresponding to the i fuzzy level of the query;
$W_{SF_j}$ is the weight (1, 2 or 3) corresponding to the j fuzzy level of the candidate SF;
$\frac{O}{C_{SF}}$ is the query and candidate SF overlap area divided by the candidate SF area within the i fuzzy level of the query and the j fuzzy level of the SF;
$\frac{O}{Q_i}$ is the query and candidate SF overlap area divided by the area of the query within the i fuzzy level of the query and the j fuzzy level of the SF.

Figure 8 illustrates a basic example of weighting relevant SF. For instance, Wa, Wb, Wc and Wd are respectively about 6, 10, 5.5 and 3: the detailed computations give Wa = 6.2 with formula (2) and Wc = 5.6 with formula (3).

The local spatial context and/or a larger one, coming from a whole document or a corpus, provides useful detail but may sometimes be hard to capture. When the context is difficult to extract automatically, the parameter values of the proximity or cardinal direction functions are computed from the nature and scale of the original SF. Although these functions offer rather basic services, the originality of the approach is that they take place in an automated spatial information indexing and retrieval process. Note that the scale of the spatial features is the region level¹, i.e. towns and forests are represented with polygons; rivers and roads with lines; peaks and bridges with points.

Now, we have to implement other functions like inclusion (in the center) and distance (at 10 miles, at about 5 minutes: on foot, by bike, by car?). GIS may integrate such new operators: they could support new interfaces and scenarios for spatial data querying. Moreover, they could support information indexing and retrieval on corpora of textual, audio, iconographic and video documents.

We also plan to combine our candidate spatial feature weighting algorithm with the logistic regression model of information retrieval evaluated in [21]. Merging the information retrieval paradigms used in traditional GIS and in traditional IR is promoted in [7]. Such an approach aims at accumulating feature frequencies within larger spatial areas and at proposing new weighting methods following statistical information retrieval approaches. Following this, a consistent experiment, comparing MBR and context-aware spatial relationship approximations with different ranking approaches, should be carried out.
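The candidate weighting of formulas (2) and (3) can be sketched as follows. The sketch assumes that the overlap areas between each query fuzzy level and the candidate SF (or each of its fuzzy levels) are already available; the function names, argument layout and sample values are hypothetical and are not taken from Figure 8.

```python
def weight_absolute_sf(overlaps, c_asf, query_areas,
                       query_weights=(1, 2, 3), w_a=3):
    """Formula (2): sum over i of W_Qi * (O_i / C_ASF + O_i / Q_i) * W_A.

    overlaps[i] is the overlap between query level i and the absolute SF.
    """
    return sum(
        w_q * (o / c_asf + o / q) * w_a
        for w_q, o, q in zip(query_weights, overlaps, query_areas)
    )


def weight_relative_sf(overlaps, sf_areas, query_areas,
                       query_weights=(1, 2, 3), sf_weights=(1, 2, 3)):
    """Formula (3): sum over i of W_Qi * sum over j of
    (O_ij / C_SFj + O_ij / Q_i) * W_SFj.

    overlaps[i][j] is the overlap between query level i and SF level j.
    """
    total = 0.0
    for i, w_q in enumerate(query_weights):
        inner = 0.0
        for j, w_sf in enumerate(sf_weights):
            o = overlaps[i][j]
            inner += (o / sf_areas[j] + o / query_areas[i]) * w_sf
        total += w_q * inner
    return total


if __name__ == "__main__":
    # Illustrative areas: an absolute SF of area 10 overlapping three query levels.
    print(weight_absolute_sf((1.0, 4.0, 3.0), 10.0, (100.0, 70.0, 70.0)))
    # Illustrative areas: only query level 1 and SF level 1 overlap.
    grid = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print(weight_relative_sf(grid, (10.0, 10.0, 10.0), (100.0, 100.0, 100.0)))
```

Keeping the level weights as parameters makes it easy to experiment with more or fewer fuzzy levels than the three used here.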
5. ACKNOWLEDGMENTS
Our thanks to Essafi, W., Cazanave B. and Tran Thi H.N., computer science Master's students who mulled over these problems with us.
6. REFERENCES
[1] Beard, K., Sharma, V., 1997. Multidimensional ranking for data in digital spatial libraries. International Journal of Digital Libraries, 1(2):153–160
[2] Beaubouef, T., Petry, F.E. and Ladner, R., 2007. Spatial data methods and vague regions: A rough set approach, Applied Soft Computing, 7(1): 425–440
Figure 8. "South of Laruns" fuzzy query management.
4. CONCLUSION AND FUTURE WORKS
[3] Bennett, B., 1996. The application of Qualitative Spatial Reasoning to GIS, in Proc. of the First Int. Conf. on GeoComputation, 44–47
This work focuses on geo-referencing complex spatial features within an information retrieval process: we propose interpretations of proximity and cardinal direction relationships. The corresponding proximity and cardinal direction functions involve parameters that incorporate the results of a specific context analysis performed by a preceding linguistic process. These functions return representations that are still approximations. However, they compute better footprints than the MBR classically used in GIR systems and in the first version of the PIV geographic information indexing and retrieval prototype. As these functions incorporate context-aware parameters, they compute more accurate footprints for the corresponding relative spatial feature. For example, the distance parameter represents the targeted area radius and depends on the form, size and nature of the original spatial feature; the nature parameter, explicitly or implicitly mentioned, represents the nature of the targeted area(s). Moreover, these functions support fuzzy
[4] Bennett, B., Isli, A. and Cohn, A.G., 1997. A Logical Approach to Incorporating Qualitative Spatial Reasoning into GIS (extended abstract), in Proc. of the Int. Conf. on Spatial Information Theory, 503–504
[5] Blake, L., 2007. Spatial relationships in GIS - Geospatial topology basics. OSGeo Journal, Vol. 2, http://www.osgeo.org/journal/volume2
¹ We use the IGN (French National Geographic Institute) resources to validate named entities and associate footprints with them.
[6] Bordogna, G., Chiesa, S. and Geneletti, D., 2006. Linguistic modelling of imperfect spatial information as a basis for simplifying spatial analysis, Information Sciences, Recent advancements of fuzzy sets, 176(4): 366–389
[7] Cai, G., 2002. GeoVSM: An Integrated Retrieval Model For Geographical Information, GIScience, Eds M.J. Egenhofer and D.M. Mark, LNCS 2478, 65–79
[8] Chen, Y-Y., Suel, T., Markowetz, A., 2006. Efficient query processing in geographic web search engines, Proc. of the ACM SIGMOD, 277–288
[9] Clementini, E., Sharma, J. and Egenhofer, M. J., 1994. Modeling topological spatial relations: Strategies for query processing, Computers and Graphics, 18(6): 815–822
[10] Clough, P.D., Joho, H. and Purves, R., 2006. Judging the Spatial Relevance of Documents for GIR, in Proc. of the 28th European Conf. on IR Research, London, LNCS 3936, 548–552
[11] Cohn, A.G. and Gotts, N.M., 1996. The 'Egg-Yolk' Representation of Regions with Indeterminate Boundaries, in Proc. of GISDATA Specialist Meeting on Geographical Objects with Undetermined Boundaries, 171–187
[12] Cohn, A. G., 1997. Qualitative spatial representation and reasoning techniques, in Proc. of the 21st Annual German Conf. on Artificial Intelligence, London, 1–30
[13] Cohn, A. G., and Hazarika, S. M., 2001. Qualitative spatial representation and reasoning: An overview. Fundamenta Informaticae, 46(1-2):1–29
[14] Egenhofer, M., 1991. Reasoning about Binary Topological Relations, Second Symposium on Large Spatial Databases, Zurich, LNCS 525, 143–160
[15] Egenhofer, M. J., 2002. Toward the semantic geospatial web, in Proc. of the 10th ACM Int. Symposium on Advances in geographic information systems, ACM Press, 1–4
[16] Gaio, M., Sallaberry, C., Etcheverry, P., Marquesuzaà, C., and Lesbegueries, J., 2008. A Global Process to Access Documents' Contents from a Geographical Point of View. JVLC, Elsevier, 19(1):3–23
[17] Hill, L.L., 1990. Access to Geographic Concepts in Online Bibliographic Files: effectiveness of current practices and the potential of a graphic interface. PhD thesis, Pittsburgh
[18] Hill, L. L., 2000. Core elements of digital gazetteers: Placenames, categories, and footprints, in Proc. of the 4th European Conf. on Research and Advanced Technology for Digital Libraries, Springer-Verlag, 280–290
[19] Jones, C.B., Abdelmoty, A.I., Finch, D., Fu, G., Vaid, S., 2004. The spirit spatial search engine: Architecture, ontologies and spatial indexing, in Proc. 3rd Int. Conf. Geographic Information Science, Adelphi, 125–139
[20] Lacayrelle, A., Gaio, M., Sallaberry, C., 2007. La composante temps dans l'information géographique, Revue Document Numérique, Entreposage de documents et données semi-structurées, Hermès-Lavoisier, 10(2): 129–148
[21] Larson, R. R., Frontiera, P., 2004. Spatial Ranking Methods for Geographic Information Retrieval (GIR) in Digital Libraries. R. Heery and L. Lyon (Eds), ECDL 2004, LNCS 3232, 45–56
[22] Lesbegueries, J., Gaio, M., Loustau, P., and Sallaberry, C., 2006. Geographical information access for non-structured data, in Proc. of 21st ACM Symp. on Applied Computing Advances in Spatial and Image based Info. Systems track, Dijon, 83–89
[23] Lesbegueries, J., Sallaberry, C., and Gaio, M., 2006b. Associating spatial patterns to text-units for summarizing geographic information. 29th Annual Int. ACM SIGIR, Geographic Information Retrieval Workshop, 40–43
[24] Lieberman, M. D., Samet, H. and Sankaranarayanan, J., 2007. STEWARD: Architecture of a Spatio-Textual Search Engine, in Proc. of the 15th ACM GIS Conf., Seattle, 186–193
[25] Martins, B., M. Silva, M-J., and Andrade, L., 2005. Indexing and ranking in Geo-IR systems, in Proc. of the 2nd Int. Workshop on Geo-IR (GIR), 31–34
[26] Sallaberry, C., Baziz, M., Lesbegueries, J., Gaio, M., 2007a. Towards an IE and IR system dealing with spatial information in digital libraries - Evaluation Case Study, in Proc. 9th Int. Conf. on Enterprise Information Systems - H-C I Area / Geographical Info. Systems, H-C I Volume, INSTICC, 190–197
[27] Sallaberry, C., Gaio, M., Lesbegueries, J. and Loustau, P., 2007b. A Semantic Approach for Geospatial Information Extraction from Unstructured Documents. Book Chapter in The Geospatial Web book, Advanced Information and Knowledge Processing Series, Springer, 93–105
[28] Schockaert, S. and De Cock, M., 2007. Neighbourhood Restrictions in Geographic IR, in Proc. of the 30th Annual Int. ACM SIGIR Conf., Amsterdam, 167–174
[29] Schlick, C., 1994. A fast alternative to Phong's Specular Model, Graphics Gems, ed. Paul Heckbert, Academic Press, 4:363–366
[30] Talmy, L., 1983. How language structures space. Spatial Orientation: Theory, Research, and Application, ed. by Herbert L. Pick, Jr. and Linda P. Acredolo, Plenum Press, NY, 225–282
[31] Vaid, S., Jones, C. B., Joho, H., and Sanderson, M., 2005. Spatio-textual indexing for geographical search on the web, in Proc. of the 9th Int. Symp. on Spatial and Temporal Databases, LNCS 3633, 218–235
[32] Valcartier, 2006. GRID - Geospatial Retrieval of Indexed Document, Defence R&D Canada Valcartier, http://www.valcartier.drdc-rddc.gc.ca/poolpdf/e/269_e.pdf
[33] Walker, D., Newman, I., Medyckyj-Scott, D. and Ruggles, C., 1992. A system for identifying datasets for GIS users. International Journal of Geographical Information Systems, 6(6):511–527
[34] Woodruff, A.G., Plaunt, C., 1994. GIPSY: Automated Geographic Indexing of Text Documents. Journal of the American Society for Information Science, 45(9): 645–655