An Approach for Spatial Search Using SOLR
Divakar Yadav1, Sonia Sanchez-Cuadrado2, Jorge Morato3, Juan Bautista Llorens Morillo4
1 Department of Computer Science & Engineering, Jaypee Institute of Information Technology, Noida, India
2,3,4 Department of Computer Science & Engineering, Universidad Carlos III, Madrid, Spain
1 [email protected], 2 [email protected]
[email protected]
ABSTRACT
A large fraction of documents on the W3 (World Wide Web) contain geo-spatial context, but conventional search engines treat a place name in a query in the same way as any other keyword and retrieve only documents that include the specified name. Results retrieved through this conventional approach may be sufficient for many users, but on many occasions a user is interested in documents related to the region of space specified in the query that do not actually include the place name. This generally happens when documents use alternative names, or refer to places that are in or near the specified place. In the current work an approach is discussed for indexing documents along with their spatial information, which can help to search spatial data as per one's need. Further, it is explored how Solr, an open source software platform, can be helpful for developing a spatial information retrieval tool.

KEYWORDS
Information retrieval, World Wide Web (W3), inverted indexing, spatial information retrieval (SIR).

1. INTRODUCTION
About 32.7% of the world population uses the internet to access some kind of information [1]; in the year 2000 this figure was 6.2%. So there has been tremendous growth of about 528% in internet users from 2000 to 2012, and a similar growth rate is also expected in the future. Like the number of internet users, the size of the internet is increasing exponentially: there are approximately 50 billion publicly accessible web documents [2], up from around 1 billion in the year 2000. Almost every internet user uses search engines to find the required information [3]. A search engine is an information retrieval tool for locating information, composed of three major components: an indexer, a query processor and web crawlers. Fig. 1 shows the general architecture of a search engine, in which the indexer processes web documents and creates an index to represent them. The inverted index scheme is one of the most popular techniques used to generate such an index.
The query processor receives the user query, processes it to extract keywords and, based on these keywords, returns the matched documents in an order decided by the ranking algorithms. Web crawlers, also known as spiders, robots or wanderers, are software agents that download web documents from the World Wide Web (W3) starting from initial URLs, also known as seed URLs, supplied to them. In fact, search engines use web crawlers to find what is on the web.

McCurley [9] estimated that approximately 10% of web documents on the W3 contain a place name in the form of a zip code, telephone number, etc. In a similar line, Himmelstein [11] states that approximately 20% of documents on the W3 contain one or more easily understandable and unambiguous geographic identifiers, such as postal addresses. A large fraction of user queries also contain place names: Sanderson et al. [10] estimated that 13-15% of web queries submitted to search engines contain a place name in one form or another. Further, Zhang et al. [12], who analyzed approximately four million sampled user queries, found that approximately 12.7% of the sampled queries contain a place name.

In the past, researchers have done some work in the area of geo-spatial search and developed a few products. The Vicinity products and Google's local search engine [13-16] are examples of such products, but not much information has been published about the technology used to develop these systems. However, some literature is available about research efforts related to the development of geo-spatial search engine functionality [17-18]. The Vicinity Company's product is accessible at the Mapblast [19] website and through the Northern Light search engine [14]. To search for information with this system, users are required to enter a US or Canadian address, partially or in full, along with a category of interest and the radius within which they wish to search. The system appears to decipher the address information into corresponding map coordinates and then, with the help of a digital map, expand the search to embrace other places within the specified radius.
specified radius. The Mirago [15] is another such tool which supports a regional web search facility but its focus is limited to four European countries, UK, Germany, France and Spain. This tool provides a search facility such that users can choose the particular region of a particular country to focus their search operation. An experimental geographical search engine is also developed by Egnor, D. [20]. This tool takes the help of US Bureau of Census TIGER/Line digital mapping data for extracting and translating street addresses, present in corpus of text to geographical coordinates. Further these coordinates, along with a conventional keyword index of corpus are indexed in a two dimensional index. With the help of
Fig. 1. Basic architecture of a typical search engine

With the help of the two-dimensional index, a query processor is able to process user queries that request documents matching certain keywords or having addresses within a certain radius of a specified target address. Another work that considers the problem of determining the geographical context of a web document is proposed in [22]; the authors use a gazetteer to detect the presence of place names in a web document before analyzing their frequency. Buyukokkten et al. [18] proposed an approach for location-specific referencing of web data, in which the IP address of a domain name is associated with a telephone area code: the postal address of the web site or network administrator is used to generate a zip code, which can later be mapped to geographical coordinates. A top-level domain mechanism based on geographical referencing was proposed by the Stanford Research Institute, in which a domain name refers to a strict hierarchy of quadrilateral cells defined by latitude and longitude. To make this possible, all existing domain names would first register themselves with a geo domain server; for a set of cells, the domain name server would store all registered websites that relate to each of the given cells. McCurley [9] proposed an experimental system for geographical navigation that applies an array of techniques for extracting geographical context from a web page, based on the occurrence of text addresses, postal codes, place names and telephone numbers. Once the geographical contexts are extracted, they are mapped to one of a limited set of point-referenced map locations. A geographical search is commenced by the user asking to find websites that refer to places in the surrounding area of the currently displayed web site. The Global Atlas search engine [23] indexes maps, images and HTML documents on the web. In this tool, indexes of web documents are maintained as per their geo-print, besides the conventional keywords and categories used by most search engines, and queries are represented as rectangles drawn on a map in addition to the traditional keyword filters. Gazetteers such as the Getty Thesaurus of Geographic Names [24] are helpful for registering document footprints into an Oracle-based spatial database.
One of the major problems still existing with such information retrieval tools is that they do not provide satisfactory results when one is looking for information within a specific geographical location. The key problem is the missing support for the geographical dimension. For example, if one is interested in documents on a "terror attack" in "India", the search engine will retrieve all documents containing the phrase "terror attack" and the word "India". Documents with the word "Delhi" or "Mumbai" will, however, not necessarily be retrieved, since it is difficult for the search engine to establish a relationship between these two cities and the country India.

2. SPATIAL SEARCH METHODOLOGY
Generally, all popular search engines use a pure text indexing scheme to index web documents. Under pure text indexing, an inverted text scheme consisting of a lexical file is used. Each record of the lexical file contains a field for an item/term, sometimes known as a keyword of the text, and a pointer to an entry in a posting file. Each posting file contains the list of documents in which the term has occurred. For example, if N terms are to be indexed, then for each term a dedicated posting file is maintained, containing the documents in which the term occurs, as shown in fig. 2. So, if many terms occur in the same document, the document is replicated in each term's posting list. The basic problem with pure text indexing is that this scheme alone is not much help when one wants to retrieve spatial documents. Spatial search queries are classified under four categories [6], as shown in table 1. For the purpose of spatial search it is necessary to identify the place names in web documents. Here one can take the help of gazetteer-type documents for selecting the candidates for
place names. A gazetteer is a document that provides relevant place names together with their geographical positions. So, while analysing a document for indexing, the geographical position of a place name can be stored along with its other description details.
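As an illustration of the posting-list structure described above, the following is a minimal Java sketch (ours, not the paper's implementation; the toy gazetteer entries and the tokenization are assumptions) of a pure inverted index that also records coordinates for recognized place names at indexing time:

import java.util.*;

// Minimal sketch of a pure inverted index extended with gazetteer lookups.
public class SpatialInvertedIndex {
    // term -> posting list of document ids (the "posting file" per term)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Toy gazetteer (illustrative assumption): place name -> lat/lon
    private static final Map<String, double[]> GAZETTEER = Map.of(
            "noida", new double[]{28.5355, 77.3910},
            "delhi", new double[]{28.6139, 77.2090});
    // place-name term -> coordinates recorded while indexing
    private final Map<String, double[]> termCoordinates = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            double[] coords = GAZETTEER.get(token);
            if (coords != null) termCoordinates.put(token, coords); // store geo position
        }
    }

    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public double[] coordinatesOf(String term) {
        return termCoordinates.get(term.toLowerCase());
    }

    public static void main(String[] args) {
        SpatialInvertedIndex index = new SpatialInvertedIndex();
        index.addDocument(1, "Hospitals in Noida");
        index.addDocument(2, "Schools in Delhi");
        System.out.println(index.search("noida")); // -> [1]
    }
}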
Fig. 2. Pure inverted indexing

Table 1. Spatial query types
Distance: Engineering institutions within 10 km of Parichowk, Noida; Schools near Jaypee Institute in Sect-62, Noida
Topological: Hospitals in Noida
Directional: Marriage resort north of Delhi
Imprecise region: Hospitals in northern India
Fig. 3. Spatial index of documents (document space divided into nine grid cells)

Fig. 4. Pure text indexing scheme: term X occurs in D1, D5, D6, D20, D29

Fig. 5. Text followed by spatial index scheme: term X maps to R1(D1, D6), R2(D5, D20), R5(D29), R6(D29)
In [7], a spatial primary indexing scheme is suggested for indexing spatial information. Under this scheme, the geographical coverage of the place names specified in documents is divided into a set of regular grid cells. Corresponding to each cell, an inverted text index is constructed, similar to the pure text indexing scheme, but the document sets present in the posting list contain only those documents which belong to the cell. If the same document belongs to more than one cell, it is represented in multiple cells' posting lists [7]. For example, suppose a set of 32 documents (D1, D2, D3, ..., D32) is distributed over a document space divided into nine cells (R1, R2, ..., R9), as shown in fig. 3. Let SR represent the whole document space; then the respective subdivisions SR1 to SR9 contain the posting lists:
SR1 = (D1, D6, D12, D17, D23)
SR2 = (D2, D7, D9, D12, D13)
SR3 = (D3, D9, D13, D16, D24)
SR4 = (D4, D11, D17, D10, D23, D28, D30)
SR5 = (D7, D9, D10, D22, D23, D25, D29, D30)
SR6 = (D8, D9, D16, D19, D25, D29, D31)
SR7 = (D10, D15, D18, D26, D28)
SR8 = (D8, D10, D14, D21, D22, D26, D27, D32)
SR9 = (D5, D8, D19, D20, D21)

Spatial primary indexing is good for pure spatial search queries, but it is not a good choice if the search query consists of both spatial and textual terms. For such queries, a text followed by spatial index is preferable. Under this indexing scheme, a two-stage index is constructed [7]: first an inverted index for the text collection is built, and then, for every inverted list, a spatial index is created. For example, for the document space divided into nine spatial cells shown in fig. 3, assume that for a term X the pure text inverted index is as shown in fig. 4, i.e., the term X is present in the documents D1, D5, D6, D20 and D29. Under the text followed by spatial index scheme, the inverted index for the term X can be re-arranged as shown in fig. 5, since the documents (D1, D6), (D5, D20), (D29) and (D29) belong to the regions R1, R2, R5 and R6 respectively. So, when producing results for search queries consisting of both spatial and textual terms, the better approach is a textual search followed by a spatial search using the above-mentioned indexing scheme.
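To make the two-stage arrangement concrete, here is a minimal Java sketch (ours, not the implementation of [7]); the cell identifiers follow figs. 3-5 and the data types are illustrative assumptions:

import java.util.*;

// Sketch of a "text followed by spatial" index: stage 1 keys on the term,
// stage 2 partitions each term's posting list by grid cell (region).
public class TextThenSpatialIndex {
    // term -> (cell id -> documents containing the term that fall in that cell)
    private final Map<String, Map<String, Set<Integer>>> index = new HashMap<>();

    // A document spanning several cells appears in several cell lists.
    public void add(String term, String cell, int docId) {
        index.computeIfAbsent(term, t -> new HashMap<>())
             .computeIfAbsent(cell, c -> new TreeSet<>())
             .add(docId);
    }

    // Textual lookup first, then the spatial restriction to one cell.
    public Set<Integer> search(String term, String cell) {
        return index.getOrDefault(term, Map.of())
                    .getOrDefault(cell, Set.of());
    }

    public static void main(String[] args) {
        TextThenSpatialIndex idx = new TextThenSpatialIndex();
        // The fig. 5 arrangement for term X: R1(D1, D6), R2(D5, D20), R5(D29), R6(D29)
        idx.add("X", "R1", 1);  idx.add("X", "R1", 6);
        idx.add("X", "R2", 5);  idx.add("X", "R2", 20);
        idx.add("X", "R5", 29); idx.add("X", "R6", 29);
        System.out.println(idx.search("X", "R2")); // -> [5, 20]
    }
}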
In the next section we explore Solr from a spatial search point of view. The basic reasons for selecting Solr for this problem are as follows:
• Solr is a well-known and widely used search engine that deals with these types of entities. Indexing and collecting geospatial features is time-consuming; Solr gives us a first solution that is well developed, fast and free, with the support of a large community of developers.
• Solr has good integration capabilities with NLP tools, like ANNIE, that deal with imprecise regions.
• Place names often have aliases; Solr has good capabilities for handling phonetic similarities (Metaphone), linguistic variations and synonyms, as sketched below.
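As an illustrative sketch of that last point (the field type name and synonyms file are our assumptions, not taken from the paper), a Solr analyzer chain for place names might combine the synonym and phonetic filters in schema.xml:

<!-- Hypothetical field type for place names: expands aliases via a synonyms
     file and injects Double Metaphone codes so similar spellings match. -->
<fieldType name="text_place" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="place_aliases.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
  </analyzer>
</fieldType>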
Although some named entities other than geographic ones are not supported in Solr, there are developments in UIMA that address this (the UIMA add-ons OpenCalais and the OpenNLP name finder).

3. DOCUMENT FIELDS ANALYSIS AND GEOSPATIAL FEATURES OF SOLR
Solr is an open source search platform within the Apache Lucene project. It is written in Java and runs as a standalone full-text search server. As a search engine it includes full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Other strengths are scalability and fault tolerance: distributed indexing, replication and load-balanced querying, automated failover and recovery. The concept of Solr is similar to that of any search engine: the more information/documents we feed to it, the better the chance that we can later find the required information with a query. Feeding the information is called indexing, and asking a question to get the information is called querying. In Solr, documents are represented by various fields, which stand for specific pieces of information [8]. Depending on the kind of data, fields can be of different types; for example, a name field is text, while a height field might be a floating point number so that it can store values like 160 and 160.5.
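For instance (a sketch; the field names are illustrative, and the exact type names depend on the schema in use), such fields would be declared in schema.xml as:

<!-- Illustrative field declarations: a text field and a float field. -->
<field name="name"   type="text_general" indexed="true" stored="true"/>
<field name="height" type="float"        indexed="true" stored="true"/>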
One can make clear to Solr what kind of data a field contains by specifying its field type. The field type basically tells Solr how to interpret the field and how to query it. When a document is added, Solr extracts the information in the document's fields and adds it to an index; later this index can quickly be consulted to provide the matching documents as results for a given query. Through field analysis we tell Solr what to do with incoming data while indexing it. For example, an incoming document may contain many words such as 'a', 'an', 'the', 'is', 'are', 'of', 'to', 'for' which are not considered for indexing, so during field analysis all these terms should be discarded. One of the core components of Solr is the schema.xml file, which includes the fields and field types. A field type contains the information that tells Solr what kind of processing is applied to the incoming field values representing the document. One such example is shown in fig. 6, in which the field type is "TextField" and analysis is performed using a white space tokenizer followed by various filters. Under this scheme the output of one tokenizer/filter is used as input for the subsequent filter. The white space tokenizer demarcates words by identifying the white spaces around them. The filters then, respectively: create synonyms with the help of an external file 'synonym.txt'; split tokens at word delimiters; convert the tokens into lower case; apply the Snowball Porter stemming algorithm to convert tokens into their root forms (stems); and remove duplicate tokens. In a similar way one can add more analyzers as required. This kind of analysis of field values is applied in two phases: at indexing time and at query time. The analysis at the two ends, i.e., at indexing and at query time, may or may not be the same, depending on the requirements. Solr is rich in tokenizers and filters. It provides a large number of token analyzers and filters, such as the Standard Tokenizer, Classic Tokenizer, Lower Case Tokenizer, N-Gram Tokenizer, Edge N-Gram Tokenizer, Regular Expression Pattern Tokenizer, UAX29 URL Email Tokenizer, White Space Tokenizer, Shingle Filter, Synonym Filter, Trim Filter, English Minimal Stem Filter, Snowball Porter Stemmer Filter, Numeric Payload Token Filter, etc. For example, if we include the standard tokenizer as shown in fig. 7, then the output received for the input "Please, Email: [email protected] by 01-05-2013, regard: Divakar Yadav" will be "Please", "Email", "[email protected]", "by", "01-05-2013", "regard", "Divakar", "Yadav".
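Returning to the analysis chain of fig. 6, a schema.xml definition along those lines might look as follows (a reconstruction sketched from the description above; the type name and attribute values are our assumptions):

<!-- Sketch of the fig. 6 analysis chain: whitespace tokenizer, then synonym,
     word-delimiter, lowercase, Snowball Porter stemming and duplicate removal. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonym.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>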
Fig. 6. Field type definition in Solr

Fig. 7. Standard tokenizer
Fig. 8. N-gram tokenizer

In a similar way, if we include the n-gram tokenizer as shown in fig. 8, then the output for the input "computer" with an n-gram size range of 4 to 5 is "comp", "ompu", "mput", "pute", "uter", "compu", "omput", "mpute", and "puter". Similarly we can analyze the other tokenizers: the result for the field type "text_en_splitting_tight", produced by Solr at query time, is shown in fig. 9. In the same way, the analysis details for other field types can be generated to study the working of the filters and tokenizers applied to these fields, and one can add or remove these filters and tokenizers as required.
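For reference, the n-gram behaviour of fig. 8 corresponds to a tokenizer configured roughly as follows (a sketch; the field type name is our assumption):

<!-- Emits all 4- and 5-character grams of each input, as in the "computer" example. -->
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
  </analyzer>
</fieldType>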
Fig. 9. Text analysis through Solr at query time for field type "text_en_splitting_tight"

Solr provides features that are helpful for spatial search. With these features one can place spatial information alongside textual information in the index, filter data by location while retrieving it from the index, and sort results by distance. To meet these requirements, Solr has three inbuilt tools: a geospatial filter, a geospatial bounding box and a geospatial distance function. With the geo-filter one can retrieve all relevant documents within a given distance of a given point, e.g., all documents within a 10 km radius of a given latitude/longitude, as shown in fig. 10. The bounding box filter returns results within a square area around a given point; it also includes documents lying outside the circle but inside the square covering the circle, as shown in fig. 11. The geospatial distance function is helpful for sorting the results by the distance of documents from the query point.
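As a sketch of how these three tools are invoked (the field name "location" and the coordinates are our assumptions; the field must be declared with a spatial type such as LatLonType in the schema), typical Solr request parameters look like this:

# Geospatial filter: documents within 10 km of the point
q=*:*&fq={!geofilt sfield=location pt=28.6139,77.2090 d=10}

# Bounding box: the square covering the same 10 km circle
q=*:*&fq={!bbox sfield=location pt=28.6139,77.2090 d=10}

# Distance function: sort results by distance from the query point
q=*:*&sfield=location&pt=28.6139,77.2090&sort=geodist() asc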
Fig. 10. Documents within a 10 km radius

Fig. 11. Documents within the square covering the 10 km radius
4. CONCLUSION
From the above discussion it is clear that the normal indexing techniques used by most general information retrieval tools are not enough to support spatial searching. Spatial information retrieval (SIR) is therefore concerned with improving the quality of spatially specific information retrieval, with a focus on access to the unstructured documents found on the web. There is thus a significant need to improve SIR with respect to the following capabilities:
• To detect geographical references in the form of place names and associated spatial natural language qualifiers within text documents.
• To disambiguate place names to determine which particular instance of a name is intended.
• To clarify vague spatial terminology: the spatial interpretation of indistinct place names such as 'the Midlands' and of indistinct spatial language such as 'near'. Many of the place names that users employ when searching the web are of an informal or vernacular nature, often without precise boundaries.
So, to develop a SIR which incorporates all the above features, Solr is one of the most suitable platforms: apart from being rich in tokenizers and filters, it also supports functions for indexing spatial information.

5. REFERENCES
[1] Internet World Stats: http://www.internetworldstats.com/stats.htm
[2] The size of the World Wide Web (The Internet): http://www.worldwidewebsize.com/
[3] Divakar Yadav, A.K. Sharma, and J.P. Gupta, "Users Search Trends on WWW and Their Analysis", First International Conference on Intelligent Interactive Technologies and Multimedia, IIIT Allahabad, pp. 61-68, 28-30 Dec 2010.
[4] Divakar Yadav, A.K. Sharma, Sonia Sanchez-Cuadrado, Jorge Morato, "An Approach to Design Incremental Parallel WebCrawler", Journal of Theoretical and Applied Information Technology, Vol. 43, No. 1, pp. 8-29, 2012.
[5] Junghoo Cho and Hector Garcia-Molina, "Effective Page Refresh Policies for Web Crawlers", ACM Transactions on Database Systems, Vol. 28, Issue 4, pp. 390-426, December 2003.
[6] Subodh Vaid, Christopher B. Jones, Hideo Joho and Mark Sanderson, "Spatio-Textual Indexing for Geographical Search on the Web", Proceedings of the 9th International Conference on Advances in Spatial and Temporal Databases, pp. 218-235, 2005.
[7] Ross S. Purves et al., "The design and implementation of SPIRIT: a spatially aware search engine for information retrieval on the Internet", International Journal of Geographical Information Science, Vol. 21, Issue 7, pp. 717-745, Jan 2007.
[8] "Apache Solr Reference Guide", Lucid Imagination.
[9] McCurley, K.S., "Geospatial Mapping and Navigation of the Web", Proceedings of the 10th International WWW Conference, Hong Kong, 1-5 May, ACM Press, pp. 221-229, 2001.
[10] Sanderson, M. and Kohler, J., "Analyzing Geographic Queries", Proceedings of the 2004 Workshop on Geographic Information Retrieval, 29 July 2004, Sheffield, UK. Available online at: http://www.geo.uzh.ch/~rsp/gir/abstracts/sanderson.pdf, 2004.
[11] Himmelstein, M., "Local Search: The Internet Is the Yellow Pages", IEEE Computer Society Journal, 0018-9162/05, pp. 26-34, 2005.
[12] Zhang, V.W., Rey, B., Stipp, E. and Jones, R., "Geomodification in Query Rewriting", Proceedings of the 2006 Workshop on Geographic Information Retrieval, Seattle, USA, pp. 23-27, 2006.
[13] Vicinity.com: http://home.vicinity.com/us/mappoint.htm
[14] Northern Light: http://www.northernlight.com/index.html
[15] Mirago, the UK Search Engine: http://www.mirago.co.uk/
[16] Google Local: http://local.google.com/lochp
[17] Subodh Vaid, Christopher B. Jones, Hideo Joho and Mark Sanderson, "Spatio-Textual Indexing for Geographical Search on the Web", Proceedings of the 9th International Conference on Advances in Spatial and Temporal Databases, pp. 218-235, 2005.
[18] Buyukokkten, J. Cho, H. Garcia-Molina, L. Gravano, and N. Shivakumar, "Exploiting Geographical Location Information of Web Pages", Proceedings of the Workshop on Web Databases (WebDB'99), held in conjunction with ACM SIGMOD'99, pp. 91-96, ACM Press, 1999.
[19] Mapblast: http://www.mapblast.com
[20] Egnor, D.: http://www.google.com/programmingcontest/winner.html
[21] GeoURL ICBM Address Server: http://geourl.org/
[22] J. Ding, L. Gravano, and N. Shivakumar, "Computing Geographical Scopes of Web Resources", Proceedings of the 26th Very Large Database (VLDB) Conference, pp. 546-556, Morgan Kaufmann, 2000.
[23] S. Bressan, B.C. Ooi, and F. Lee, "Global Atlas: Calibrating and Indexing Documents from the Internet in the Cartographic Paradigm", Proceedings of the 1st International Conference on Web Information Systems Engineering, Vol. 1, pp. 117-124, 2000.
[24] Getty Thesaurus of Geographic Names: http://www.getty.edu/research/conductingresearch/vocabularies/tgn/index.html