International Journal of Geographical Information Science
ISSN: 1365-8816 (Print) 1362-3087 (Online) Journal homepage: http://www.tandfonline.com/loi/tgis20
A comprehensive methodology for discovering semantic relationships among geospatial vocabularies using oceanographic data discovery as an example
Yongyao Jiang, Yun Li, Chaowei Yang, Kai Liu, Edward M. Armstrong, Thomas Huang, David F. Moroni & Christopher J. Finch
To cite this article: Yongyao Jiang, Yun Li, Chaowei Yang, Kai Liu, Edward M. Armstrong, Thomas Huang, David F. Moroni & Christopher J. Finch (2017): A comprehensive methodology for discovering semantic relationships among geospatial vocabularies using oceanographic data discovery as an example, International Journal of Geographical Information Science, DOI: 10.1080/13658816.2017.1357819
To link to this article: http://dx.doi.org/10.1080/13658816.2017.1357819
Published online: 31 Jul 2017.
Full Terms & Conditions of access and use can be found at http://www.tandfonline.com/action/journalInformation?journalCode=tgis20 Download by: [George Mason University]
Date: 01 August 2017, At: 07:28
INTERNATIONAL JOURNAL OF GEOGRAPHICAL INFORMATION SCIENCE, 2017 https://doi.org/10.1080/13658816.2017.1357819
ARTICLE
A comprehensive methodology for discovering semantic relationships among geospatial vocabularies using oceanographic data discovery as an example
Yongyao Jiang^a, Yun Li^a, Chaowei Yang^a, Kai Liu^a, Edward M. Armstrong^b, Thomas Huang^b, David F. Moroni^b and Christopher J. Finch^b
Downloaded by [George Mason University] at 07:28 01 August 2017
^a NSF Spatiotemporal Innovation Center, Dept. of Geography & GeoInformation Science, George Mason University, Fairfax, VA, USA; ^b Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA
ABSTRACT
It is challenging to find relevant data for research and development purposes in the geospatial big data era. One long-standing problem in data discovery is locating, assimilating and utilizing the semantic context for a given query. Most research in the geospatial domain has approached this problem in one of two ways: building a domain-specific ontology manually or automatically discovering semantic relationships using metadata and machine learning techniques. The former relies on rich expert knowledge but is static, costly and labor intensive, whereas the latter is automatic but prone to noise. An emerging trend in information science takes advantage of large-scale user search histories, which are dynamic but subject to user- and crawler-generated noise. To leverage the benefits of these three approaches while avoiding their weaknesses, a novel methodology is proposed to (1) discover vocabulary-based semantic relationships from user search histories and clickstreams, (2) refine the similarity calculation methods from existing ontologies and (3) integrate the results of ontology, metadata, user search history and clickstream analysis to better determine their semantic relationships. An accuracy assessment by domain experts for the similarity values indicates an 83% overall accuracy for the top 10 related terms over randomly selected sample queries. This research functions as an example for building vocabulary-based semantic relationships for different geographical domains to improve various aspects of data discovery, including the accuracy of the vocabulary relationships of commonly used search terms.
ARTICLE HISTORY
Received 30 June 2016
Accepted 17 July 2017
KEYWORDS
Query augmentation; semantic search; web mining; search history; click behavior; big data
1. Introduction
Recent advances in satellite-based and other sensors have introduced the 5V (i.e. volume, velocity, veracity, variety and value) characteristics of big data into geospatial data (Yang et al. 2017b). These characteristics make it challenging for researchers to both discover and access data (Vatsavai et al. 2012, Yang et al. 2017a). In response, big
CONTACT Chaowei Yang
[email protected]
© 2017 Informa UK Limited, trading as Taylor & Francis Group
Y. JIANG ET AL.
geospatial data have been made available through online web discovery and access applications to improve both accessibility (Lee and Kang 2015) and understandability (Jiang et al. 2016b). For instance, the National Aeronautics and Space Administration (NASA) built the Common Metadata Repository to archive and distribute all NASA Earth Observing System Data and Information System data. In specific domains, polar cyberinfrastructure (Jiang et al. 2016c) and the National Oceanic and Atmospheric Administration’s one-stop discovery framework (Casey 2016) were developed to facilitate data usage. Such efforts have improved the accessibility of geospatial science data, but discovering and accessing the relevant data for geospatial research remains a significant challenge (Li et al. 2014).
Most catalogs and portals rely on keyword-based searches, and at the core of these techniques is Apache Lucene, an open source information retrieval tool (McCandless et al. 2010). Lucene provides powerful full-text search capabilities but fails to address the ‘intent gap’ (i.e. the gap between its representation of users’ queries and their true intent). Query modification, which combines semantically related terms, has been proposed as a solution to bridge this gap (Mangold 2007, Hua et al. 2013). For example, when ‘sea surface temperature’ is queried using Lucene-based technology, the query is executed as the Boolean query ‘sea AND surface AND temperature’. The search results likely contain the terms ‘sea’, ‘surface’ and ‘temperature’ within their textual content but may not include documents containing related concepts such as ‘ocean temperature’. In this case, ‘sea surface temperature’ and ‘ocean temperature’ are related terms in ocean science; therefore, a human would expect that documents containing highly related concepts would be retrieved and ranked. However, the challenge in accomplishing this type of search lies in how and where to obtain the semantic context for a given query.
The objective of this research is to fill the ‘intent gap’ by developing an automated approach for discovering semantic relationships among geospatial vocabulary terms and improving data discovery through these newly discovered semantic relationships.
2. Literature review
Semantic similarity measures have a long tradition in fields such as information retrieval, artificial intelligence and cognitive science (Janowicz et al. 2011). Traditionally, research has focused on either the manual creation/utilization of an ontology or automated document-clustering techniques. Recently, an emerging trend in semantic search is to build knowledge bases by mining user behavior data.
A significant effort in the geospatial community has focused on manual creation of semantic ontologies because general purpose ontologies such as WordNet (Miller 1995) are generally inadequate when applied to specific domains. The concept of a Geospatial Semantic Web was introduced in 2002 (Egenhofer 2002), and GeoSPARQL was developed to explore spatial knowledge and enable spatial reasoning (Battle and Kolas 2012). Geospatial ontologies such as the Semantic Web for Earth and Environmental Terminology (SWEET) (Raskin and Pan 2003) capture concepts and relations in the geospatial domain. In various geospatial subdomains, ontologies, e.g. GeoLink for oceanographic data, are used to facilitate interoperability between data centers (Krisnadhi et al. 2015); these ontologies contain rich geospatial knowledge input by domain
experts. To calculate concept similarities from ontologies, previous work considers only subclasses or equivalent classes and excludes cases where the path involves multiple types of predicates (Liu et al. 2014). We propose an improved method for calculating concept similarity that uses both equivalent-class and subclass relations.
Another approach to this challenge has been applied through document-clustering and dimension-reduction techniques such as Latent Semantic Analysis (LSA) (Dumais 2004) and Latent Dirichlet Allocation (LDA) (Blei et al. 2002). The advantage of these solutions lies in their automaticity and human and language independence. Li et al. (2014) developed a geospatial semantic search algorithm integrating LSA in the broad domain of Earth science, and Hu et al. (2015) performed topic modeling using LDA in geospatial portals. We analyze the geospatial dataset metadata using LSA and incorporate it as an important component of a comprehensive approach.
Mining user behavior data (i.e. web usage mining in the information sciences) has achieved remarkable successes in areas such as website personalization, system improvement and business intelligence (Srivastava et al. 2000). However, only a few research papers have explored the possibility of using large-scale user behavior data for semantic searches. Two types of user behavior data – user search history and clickstream – are considered in this research. User search history refers to a log of past user search requests, whereas clickstream data refer to a user’s mouse click series while visiting a website. A pioneering study, the query augmentation algorithm of AlJadda et al. (2014), which used user search history, performed well with experimental data, demonstrating that web usage mining is a promising avenue for improving semantic searches. The limitation is that this algorithm requires two user groups (i.e. job seeker and recruiter) to achieve high accuracy, a condition that most websites are unable to satisfy. This research addresses this shortcoming by integrating user history analysis results with results from other methods to improve accuracy. Another preliminary study presented an algorithm to discover relationships between a user’s query and search items by capitalizing on clickstream data (Hua et al. 2013). This paper extends the clickstream-based approach to discover the relationships among queries themselves.
There is a need to integrate the results of different methods to address three priorities: (1) supporting all possible user search terms, (2) reducing the unique noise of each method that originates from the hypothesis or the data (Hua et al. 2013, AlJadda et al. 2014, Li et al. 2014, Liu et al. 2014) and (3) better determining the final similarity between queries and search results by comparing the results from multiple sources when results are inconsistent or missing. Most past studies have attempted to solve this problem by applying a linear combination of different component results, notably the weighted sum (Andrea Rodriguez and Egenhofer 2004). However, this technique fails in our approach when similarities can be derived only from certain approaches. For instance, many user-generated queries are not captured in a domain ontology. This research proposes an integrative strategy to address this ‘missing value’ problem by introducing an increment adjustment component that combines the best results from different approaches.
Here, an oceanographic data discovery domain is the test case. Oceanography includes and integrates data from the solid Earth, from surface water and from the atmosphere; a successful approach using these domains is extensible to other disciplines. Moreover, many oceanographic repositories are not only rigorous in their
architecture and richness but also interoperable. The basic data gathering mechanism in oceanography is via surface ships and satellites, which share many unique features and have complex relationships among datasets, people and publications. The remainder of this paper is organized as follows. Section 3 discusses the preprocessing steps for each data source. Section 4 describes the mining and integration method. Section 5 introduces the system architecture and Section 6 presents the mining and evaluation results. Finally, Section 7 offers key findings and possibilities for future research.
3. Data preparation
The data include three different sources: web logs, an existing ontology and metadata (Figure 1).
3.1. Web logs
Web log files from the PO.DAAC website (https://podaac.jpl.nasa.gov/) are used to mine user behavior. PO.DAAC serves oceanographic data to the Earth science community and has more than 500 unique datasets from more than 30 missions and from satellite sensors. PO.DAAC exposes at least two service types: a web service provides searches, while data distribution services (e.g. FTP and OPeNDAP) support downloading data. This experiment uses web logs from 2014 containing 150 million records (30 GB). To retrieve the user search history and clickstream, the raw web logs were preprocessed using four steps: user identification, crawler/robot detection, session identification and structure reconstruction. The details of these data preparation steps are described in Jiang et al. (2016a). Figure 2 shows a few sample queries issued by User A during the study period, taken from the final web log preprocessing results. The clickstream result shows a sample clickstream in which dataset ‘navo-12p-avhrr19_g’ was viewed and downloaded after the query ‘sst’ was executed (‘sst’ is a common abbreviation for ‘sea surface temperature’).
Figure 1. Data preparation workflow.
Figure 2. Sample of web log data preparation results.
3.2. Existing ontology
The SWEET ontology is a publicly available, upper level Earth science ontology developed by NASA’s Jet Propulsion Laboratory. It consists of thousands of terms spanning broad domains of Earth system sciences and related concepts and uses the Web Ontology Language (OWL). SWEET is a foundation for oceanographic and marine ontologies (Bermudez et al. 2006). Here, we extracted ocean-related OWL files from SWEET (e.g. phenOcean.owl, realOcean.owl). Triples with predicates such as ‘subClassOf’ (Hyponymy) and ‘equivalentClass’ (Synonym) were extracted from the OWL files. For example, ‘polynya’ is a subclass of ‘sea ice’ and ‘Littoral Drift’ is an equivalent class of ‘Longshore Drift’. A total of 145 triples were extracted from SWEET and are available as an important supplement to the final results.
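The triple extraction described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the extraction code used in the study. The inline `SAMPLE_OWL` snippet and the function name `extract_triples` are hypothetical, and the snippet is a hand-written stand-in for the structure of an OWL file, not actual SWEET content.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# Hypothetical RDF/XML fragment that mirrors the two examples in the text.
SAMPLE_OWL = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:owl="{OWL}">
  <owl:Class rdf:about="#Polynya">
    <rdfs:subClassOf rdf:resource="#SeaIce"/>
  </owl:Class>
  <owl:Class rdf:about="#LittoralDrift">
    <owl:equivalentClass rdf:resource="#LongshoreDrift"/>
  </owl:Class>
</rdf:RDF>"""

# Predicates of interest, mapped to their fully qualified XML tags.
PREDICATES = (
    ("subClassOf", f"{{{RDFS}}}subClassOf"),
    ("equivalentClass", f"{{{OWL}}}equivalentClass"),
)


def extract_triples(owl_xml):
    """Return (subject, predicate, object) triples for the two predicates."""
    root = ET.fromstring(owl_xml)
    triples = []
    for cls in root.iter(f"{{{OWL}}}Class"):
        subject = cls.get(f"{{{RDF}}}about")
        if subject is None:
            continue  # skip anonymous classes in this simplified sketch
        for name, tag in PREDICATES:
            for child in cls.findall(tag):
                target = child.get(f"{{{RDF}}}resource")
                if target:
                    triples.append((subject.lstrip("#"), name, target.lstrip("#")))
    return triples


for triple in extract_triples(SAMPLE_OWL):
    print(triple)
```

A dedicated RDF library would handle blank nodes, imports and other serializations; the sketch only covers the simple named-class case.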
3.3. Metadata
All the publicly available collection-level metadata (563 files) were harvested from the PO.DAAC website using its web services in Directory Interchange Format (DIF), a standard format in NASA’s Earth Science Data Systems (Devarakonda et al. 2011). A DIF record usually includes more than 10 unique data-descriptive attributes (e.g. ID, Title, Related URL, Summary, Organization, DatasetParameter-Category, DatasetParameter-Topic and DatasetParameter-Term), and each contains rich oceanographic and geospatial concepts. For instance, ‘DatasetParameter-Category’ could be ‘Earth Sciences’, and ‘DatasetParameter-Topic’ could be ‘Ocean’. However, many attributes do not describe the actual content (e.g. ‘Metadata_Name’ and ‘Metadata_version’). To reduce noise, rather than using all the metadata attributes, we selected only a few human-understandable attributes (e.g. Title, Science Keywords, ISO topic category, Instrument, Platform and Project).
3.4. Additional steps
Before addressing the processing section, three additional steps were executed: (1) we normalized words/tokens by transforming each into its lowercase form; (2) we eliminated stop words (Newman 2005), which occur commonly and rarely affect the meaning
of a phrase (e.g. ‘the’, ‘a’ and ‘an’) and (3) we acquired word stems (Korenius et al. 2004) by removing the differences between the inflected forms of a word, reducing each to its root. For example, ‘ocean winds’ was reduced to ‘ocean wind’.
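The three normalization steps can be sketched as follows. This is only an illustration: the stop-word list is limited to the three examples named in the text, and `naive_stem` is a hypothetical, deliberately crude plural stripper standing in for a real stemming algorithm.

```python
# Assumption: only the stop words named in the text; a real list is much longer.
STOP_WORDS = {"the", "a", "an"}


def naive_stem(token):
    """Strip a trailing plural 's' (crude stand-in for a real stemmer)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(phrase):
    """Lowercase, drop stop words, then stem each remaining token."""
    tokens = phrase.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(naive_stem(t) for t in tokens)


print(normalize("Ocean Winds"))                  # ocean wind
print(normalize("The Sea Surface Temperature"))  # sea surface temperature
```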
4. Methodology
Figure 3 presents our approach to analyzing the four different types of input data (i.e. search history, clickstream, ontology and metadata).
4.1. Search history analysis
This approach starts by collecting queries entered by each user from web logs. The hypothesis is that a relationship exists among the various queries searched by each individual user. The more frequently two queries co-occur in an individual user’s search histories, the more similar they are likely to be. To measure this type of similarity, each query is represented as a binary vector. Given a user query $q \in Q$,
$$q = (u_{q1}, u_{q2}, \ldots, u_{qn}), \qquad (1)$$
where $u_{qi} = 1$ if query $q$ has been searched by user $i$; otherwise, $u_{qi} = 0$. Although other methods suggest using the occurrence frequency of query $q$ searched by user $i$ instead of binary weights, this approach is not justifiable in the context of web searches (AlJadda et al. 2014). For example, although both ‘sst’ and ‘sea surface temperature’ are likely to co-occur in a user’s search history, they do not both necessarily have the same
Figure 3. Multisource data processing workflow.
frequency depending on the user’s search habits. This is similar to the conclusion that the amount of time spent on a web page is not a good indicator of a user’s interest in that page (Konstan et al. 1997). Before calculating the similarity, a threshold that represents the minimum number of distinct users is applied to filter rarely searched queries. Such queries are removed because there is no information to capture their characteristics. After the filtering process is complete, the similarity between queries t and s is defined using the following equation:
$$\mathrm{sim}(t, s) = \frac{|t \cap s|}{\sqrt{|t|\,|s|}} \qquad (2)$$
This similarity between two queries is the co-occurrence frequency normalized by the square root of the multiplication of the occurrence of each. The resulting similarity ranges from 0 (orthogonality) to 1 (an exact 1:1 match). This metric is a special form of cosine similarity (Salton et al. 1975) as observed in a binary data case, and the general form is discussed in the following section. A conceptual example of user history analysis with only three distinct users (columns) and three user queries (rows) is presented (Figure 4). ‘Ocean temperature’ and ‘sea surface temperature’ are shared among all three users, while ‘ocean temperature’ and ‘ocean wind’ share only one user. This can be construed as meaning that ‘ocean temperature’ is more similar to ‘sea surface temperature’ than is ‘ocean wind’. According to Equation (2), the similarity of ‘ocean temperature’ (i.e. 1, 1, 1) to ‘sea surface temperature’ (i.e. 1, 1, 1) and ‘ocean wind’ (i.e. 0, 0, 1) is 1.0 and 0.58, respectively.
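The conceptual example can be checked with a short sketch. It assumes Equation (2) is the binary cosine form described above (co-occurrence count divided by the square root of the product of the two occurrence counts) and represents each query by the set of users who searched it. The user identifiers and the function name are hypothetical.

```python
from math import sqrt


def search_history_similarity(users_t, users_s):
    """Binary cosine similarity between two queries given their user sets."""
    if not users_t or not users_s:
        return 0.0
    return len(users_t & users_s) / sqrt(len(users_t) * len(users_s))


# Hypothetical user sets mirroring the three-user conceptual example.
history = {
    "ocean temperature": {"user1", "user2", "user3"},
    "sea surface temperature": {"user1", "user2", "user3"},
    "ocean wind": {"user3"},
}

target = history["ocean temperature"]
for query in ("sea surface temperature", "ocean wind"):
    print(query, round(search_history_similarity(target, history[query]), 2))
# sea surface temperature 1.0
# ocean wind 0.58
```

Storing user sets rather than full binary vectors keeps the representation sparse, which matters when the number of distinct users is large.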
4.2. Clickstream analysis
This method extracts clickstream data from web logs. In this example, the focus is on three types of user behaviors: searching, viewing and downloading. Although a data search of a specific website is used as an example, the method is adaptable to other search engines (e.g. ‘downloading’ could be replaced with ‘add to cart’ or ‘purchase’ for a generic e-commerce site).
Figure 4. Conceptual example of user search history analysis.
The hypothesis is that similar queries result in similar clicking behaviors. The intuition is that if two queries are similar, the viewed and downloaded data will be similar in the context of large-scale user behaviors. To measure this similarity, we took the following steps:
(1) Construct a query–data matrix. As with user search history analysis, rarely searched queries are filtered first; however, instead of using a user as a feature, each query is represented as a data vector. Given a user query $q \in Q$,
$$q = (d_1^q, d_2^q, \ldots, d_n^q), \qquad (3)$$
$$d_i^q = \mathrm{view}_{q d_i} + \alpha \cdot \mathrm{download}_{q d_i} \quad (\alpha \geq 1), \qquad (4)$$
where $d_i^q$ is the frequency of data $i$ clicked after query $q$, $\mathrm{view}_{q d_i}$ is the frequency of data $i$ viewed after query $q$, and $\mathrm{download}_{q d_i}$ is the frequency of data $i$ being downloaded after query $q$. Equation (4) considers the click frequency as a combination of the viewing and downloading frequency. The coefficient $\alpha$, which is usually >1, indicates that the downloading behavior is stronger than viewing, analogous to the user finding the desired data. A query–data matrix is created by putting all the query vectors together, where each row and column represents a query and data, respectively. A conceptual example of clickstream analysis with three distinct data (columns) and three user queries (rows) is presented (Figure 5). The numbers shown are the results of Equation (4) when coefficient $\alpha$ is set to 2. Take the number at row 1, column 2 as an example: assuming that data2 has been viewed once and downloaded twice after a query for ‘ocean temperature’, the click frequency of data2 after the query ‘ocean temperature’ would be 5 based on Equation (4).
(2) Perform LSA with the query–data matrix. In practice, the query–data matrix can be extremely large, and it includes considerable random noise because of the large number of queries and data. LSA is a feature normalization and dimension reduction technique consisting of Term Frequency-Inverse Document Frequency (TF-IDF) normalization, Singular Value Decomposition (SVD) and dimension reduction (Dumais
Figure 5. Conceptual example of clickstream analysis.
2004, Li et al. 2014). The assumption is that LSA produces a set of concepts using the query–data matrix by projecting both the query and data vectors onto a new concept space whose dimensions are independent. One important product of LSA is that the resulting query–concept matrix has fewer dimensions than the original query–data matrix.
(3) Calculate the cosine similarity (Tous and Delgado 2006) between query pairs. Given queries $t$ and $s$, the corresponding vectors $\vec{t}$ and $\vec{s}$ from the query–concept matrix (the result of LSA) are used to calculate the cosine similarity as follows:
$$\mathrm{sim}(t, s) = \frac{\vec{t} \cdot \vec{s}}{\|\vec{t}\|\,\|\vec{s}\|} \qquad (5)$$
Unlike Euclidean distance, cosine similarity measures the angle between two vectors. The resulting values range from −1 (i.e. opposite) to 1 (i.e. identical), where 0 = no correlation. In the conceptual example (Figure 5), the first row (i.e. 2, 5, 3) represents the vector ‘ocean temperature’, and it is more similar to the second row (i.e. 2, 5, 4), which represents ‘sea surface temperature’, than to the third row (i.e. 5, 0, 0). This result means that ‘ocean temperature’ is again more similar to ‘sea surface temperature’ than is ‘ocean wind’. According to Equation (5), the similarity of ‘ocean temperature’ to ‘sea surface temperature’ and ‘ocean wind’ is 0.99 and 0.32, respectively.
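Steps (1) and (3) can be illustrated with a short sketch that reproduces the numbers in the conceptual example. It assumes the click frequency of Equation (4) is views plus $\alpha$ times downloads and, like the conceptual example, applies the cosine similarity directly to the rows of the query–data matrix; the LSA step (2) is omitted. The function names and the `(query, dataset, action)` event format are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def build_query_data_matrix(events, alpha=2.0):
    """Accumulate click frequencies: a view adds 1, a download adds alpha."""
    matrix = defaultdict(lambda: defaultdict(float))
    for query, dataset, action in events:
        if action == "view":
            matrix[query][dataset] += 1.0
        elif action == "download":
            matrix[query][dataset] += alpha
    return matrix


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# data2 viewed once and downloaded twice after 'ocean temperature'.
events = [
    ("ocean temperature", "data2", "view"),
    ("ocean temperature", "data2", "download"),
    ("ocean temperature", "data2", "download"),
]
print(build_query_data_matrix(events, alpha=2.0)["ocean temperature"]["data2"])  # 5.0

# Rows of the conceptual query-data matrix in Figure 5.
ocean_temperature = (2, 5, 3)
sea_surface_temperature = (2, 5, 4)
ocean_wind = (5, 0, 0)
print(round(cosine_similarity(ocean_temperature, sea_surface_temperature), 2))  # 0.99
print(round(cosine_similarity(ocean_temperature, ocean_wind), 2))               # 0.32
```

In the full method the same cosine function would be applied to the rows of the reduced query–concept matrix rather than to the raw click frequencies.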
4.3. Extracting knowledge from an existing ontology
Because the goal of this research is to improve data discovery through query modification, two types of relations are addressed: ‘SubClassOf’ (Hyponymy) and ‘equivalentClass’ (Synonym). These relations are represented as edges in Figure 6, while the nodes represent concepts. To measure the similarity between concepts, the approach of Liu et al. (2014) is revised using the following equations:
$$\mathrm{sim}(X \to Y) = \frac{e}{\mathrm{Dist}(X \to Y) + e}, \qquad (6)$$
$$\mathrm{Dist}(X \to Y) = \sum_i \mathrm{Edge}(\mathrm{Type}_i), \qquad (7)$$
where $e$ is a constant to adjust the final similarity, $\mathrm{Dist}(X \to Y)$ is the distance from $X$ to $Y$ and $\mathrm{Edge}(\mathrm{Type})$ is a function that returns 1 when the predicate type is ‘SubClassOf’ and returns 0 when the predicate type is ‘equivalentClass’. The resulting similarity values range from 0 (i.e. no relation) to 1 (i.e. identical). There are several advantages to this approach. First, the constant $e$ allows us to accommodate ontologies with varying resolutions. For example, in a sparse ontology, the middle level (concepts B and C in Figure 6) might be missing; the constant provides the flexibility to adjust the similarity result depending on the application. Second, the combination of Equations (6) and (7) allows us to calculate the similarity in cases where the path involves multiple types of predicates.
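A minimal sketch of this calculation follows. It assumes that Equation (6) reads $e / (\mathrm{Dist} + e)$ and that Equation (7) sums the edge costs along the path, which matches the stated range (1 when the distance is 0, approaching 0 as the distance grows). A path is given as a list of predicate names; the function names are hypothetical.

```python
# Edge costs as defined for Edge(Type): subclass edges add distance,
# equivalence edges do not.
EDGE_COST = {"subClassOf": 1, "equivalentClass": 0}


def path_distance(predicates):
    """Sum of edge costs along a path, as in Equation (7)."""
    return sum(EDGE_COST[p] for p in predicates)


def ontology_similarity(predicates, e=3.0):
    """Similarity e / (Dist + e), the assumed form of Equation (6)."""
    return e / (path_distance(predicates) + e)


print(ontology_similarity(["equivalentClass"]))  # 1.0
print(ontology_similarity(["subClassOf"]))       # 0.75
# A path that mixes both predicate types is handled the same way.
print(ontology_similarity(["subClassOf", "equivalentClass", "subClassOf"]))
```

A larger `e` flattens the decay, so concepts several subclass steps apart keep a higher similarity; this is how the constant can compensate for a sparse or a dense ontology.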
Figure 6. Ontology structure example.
4.4. Metadata analysis
LSA is also applied to the metadata to discover latent semantic relationships among the metadata terms. The metadata analysis uses the bag-of-words model to simplify the representation of the metadata file where the words are not ordered. The processing steps are similar to those used for the clickstream analysis. The main difference is the initial input. A term–data matrix is constructed in which each row represents a term and each column represents metadata. The value of each cell equals the number of occurrences of the corresponding term in the metadata document. Thereafter, the cosine similarity between any term pair is calculated using the term–concept matrix generated by LSA as in clickstream analysis. For example, given the three distinct metadata files (columns) and three terms (rows) shown in Figure 7, according to Equation (5), the similarity of ‘ocean temperature’ (i.e. 2, 5, 3) to ‘sea surface temperature’ (i.e. 2, 5, 4) and ‘sst’ (i.e. 1, 5, 4) is 0.99 and 0.94, respectively.
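The construction of the term–data matrix can be sketched as follows, assuming the selected metadata attributes of each record have already been concatenated into one text string. The sample documents and the function name are hypothetical, and whole-phrase matching with word boundaries is one possible counting choice, not necessarily the one used in the study.

```python
import re


def build_term_data_matrix(terms, documents):
    """Count whole-phrase occurrences of each term in each metadata text."""
    matrix = {}
    for term in terms:
        # Word boundaries keep 'sst' from matching inside 'ghrsst'.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        matrix[term] = [len(pattern.findall(doc.lower())) for doc in documents]
    return matrix


# Hypothetical metadata texts (rows of the result are terms, columns are files).
documents = [
    "GHRSST Level 4 sea surface temperature analysis; sst from AVHRR",
    "Ocean wind vectors from a scatterometer; ocean wind speed",
]
terms = ["sst", "sea surface temperature", "ocean wind"]

for term, row in build_term_data_matrix(terms, documents).items():
    print(term, row)
```

Each row can then be passed through LSA and compared with the cosine similarity of Equation (5), as in the clickstream analysis.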
4.5. Integration
To leverage the knowledge of data consumers, producers and domain experts to discover and determine the semantic relationships among domain-specific vocabularies, we propose an integration method consisting of the following three steps.
(1) Specify a weight for each of the four methods (e.g. 2, 2, 1 and 2), because the importance of each individual method varies with different systems. For example, some systems have better documented metadata or more active user communities, while others have more comprehensive ontologies.
(2) For a given term (e.g. ocean temperature), select the top-N related terms from each method’s result.
Figure 7. Conceptual example of metadata analysis.
(3) For a given relationship from Step 2, different methods produce different similarity values, and some methods may not produce the top-N results. Therefore, the majority voting rule is adopted to assign the similarity to this term pair using the following equation:
$$\mathrm{sim}(X, Y) = \max(\mathrm{sim}_1, \ldots, \mathrm{sim}_i) + \frac{\sum_i w_i - \theta}{\theta}\,\beta, \qquad (8)$$
where $w_i$ is the weight of method $i$, $\mathrm{sim}_i$ is the similarity of $(X, Y)$ in method $i$, $\theta$ is a threshold representing the minimum sum of the method weights that make the relationship a majority and $\beta$ is a constant that represents the majority-rule change rate. Note that while $w_i$ and $\theta$ are not required to have scales of 0 to 1, they do need to be comparable. The first part of this equation determines the maximum similarity of all the methods in which the relationship exists. The reason for choosing the maximum is that a large similarity appears to be more reliable when different similarity values exist. The second part is an adjustment increment: when the target relationship exists in more than $\theta$ methods as the top-N results, it is positive and strengthens the relationship by making it larger. On the other hand, the increment can degrade the relationship when the relationship exists in fewer than $\theta$ methods. Note that the result will be set to 1 when it becomes larger than 1 and to 0 when it becomes negative.
Given this methodology, the best result from the different methods is combined such that when more of these four approaches reveal a strong relationship, it becomes stronger in the final list. Conversely, the results of different methods can validate each other, because relationships that are strong in only a few methods have weaker relationships in the final list. According to this integration approach, the similarities of ‘ocean temperature’ with the other terms are calculated (Figures 4–7). Assuming (1) that the weights of search history, clickstream, metadata and ontology are set to 2, 2, 1 and 2, respectively, (2) that ‘ocean temperature’ is not found in the ontology and (3) that $\theta$ and $\beta$ are set to 2 and 0.05, respectively, the similarity of ‘ocean temperature’ with ‘sea surface temperature’, ‘sst’ and ‘ocean wind’ is 1.0, 0.89 and 0.63, respectively (Table 1).
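The integration step can be sketched as follows. The sketch assumes Equation (8) reads as the maximum similarity plus $(\sum_i w_i - \theta)/\theta \cdot \beta$, with the sum taken over the methods in which the pair appears among the top-N results, followed by clipping to the range 0 to 1. This reading is an interpretation of the equation, and the function and dictionary names are hypothetical.

```python
# Method weights from the conceptual example.
WEIGHTS = {"search_history": 2, "clickstream": 2, "metadata": 1, "ontology": 2}


def integrate(similarities, weights, theta=2.0, beta=0.05):
    """Combine per-method similarities for one term pair.

    `similarities` maps a method name to its similarity, and contains only
    the methods in which the pair appears among the top-N results.
    """
    if not similarities:
        return 0.0
    best = max(similarities.values())
    weight_sum = sum(weights[method] for method in similarities)
    adjusted = best + (weight_sum - theta) / theta * beta
    return min(1.0, max(0.0, adjusted))  # clip to [0, 1]


# 'ocean temperature' vs 'ocean wind': found by search history and clickstream.
ocean_wind = {"search_history": 0.58, "clickstream": 0.32}
print(round(integrate(ocean_wind, WEIGHTS), 2))  # 0.63

# 'ocean temperature' vs 'sea surface temperature': found by three methods.
sst_long = {"search_history": 1.0, "clickstream": 0.99, "metadata": 0.99}
print(integrate(sst_long, WEIGHTS))  # 1.0
```

Because missing methods are simply absent from the dictionary, a pair that only one method reports still receives a score, which is the ‘missing value’ case that a plain weighted sum cannot handle.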
Table 1. Conceptual example of the integration strategy.
Query: Ocean temperature
Search history: Sea surface temperature (0.99), ocean wind (0.58)
Clickstream: Sea surface temperature (0.99), ocean wind (0.32)
Metadata: Sea surface temperature (0.99), sst (0.94)
Ontology: None
Integrated list: Sea surface temperature (1.0), sst (0.89), ocean wind (0.63)
5. System design and experiment
5.1. System design
The system starts by preprocessing the raw web logs, metadata and ontology (Jiang et al. 2016a). Thereafter, the search history and clickstream data are extracted from the raw logs, selected properties are extracted from the metadata and ocean-related triples are extracted from the SWEET ontology. These four types of processed data are assigned to their corresponding processors (discussed in the previous section). After the processors have completed, the results of the different methods are integrated to produce a final list of most-related terms (Figure 8).
Both the final and intermediate results are stored in Elasticsearch (https://github.com/elastic/elasticsearch), an open source database solution. This is a relatively new database technology that provides a distributed, scalable database and a search engine over the web supporting schema-free JavaScript Object Notation (JSON) documents. In addition, MLlib (http://spark.apache.org/mllib/), Spark’s machine learning library, is adopted to conduct LSA.
Figure 8. System workflow and architecture.
5.2. Experimental design
This study used 1 year of web log data. To filter rarely searched queries, the minimum number of distinct users was set to 10, meaning that at least 10 people had searched for each of the queries in the final list. The $\alpha$ in Equation (4) was set to 3, indicating that downloading behavior is 3 times more important than viewing behavior. The $e$ in Equation (6) was set to 3 to adjust for ontology similarity. In the integration stage, the weights of search history, clickstream, metadata and ontology were set to 2, 2, 1 and 2, respectively (Equation (8)), based on several features. First, the weights of user-behavior generated data (i.e. clickstream and search history) were set to 2 because vocabulary relationships have a relatively high reliability that continues to increase as the data expand in size. Second, the metadata weight was set to 1 because only 563 pieces of metadata were used in the LSA. LSA usually requires a large number of documents; therefore, such a small set of data might lead to relatively low accuracy. Third, the ontology weight was set to 2, equivalent to that of user behavior data, because SWEET is a well-recognized information source in geographical science. Finally, $\theta$ and $\beta$ were set to 2 and 0.05, respectively, to adjust the final integrated similarity based on the majority voting rule.
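For reference, the experimental settings can be gathered in one place, together with a sketch of the rare-query filter. The parameter values are those reported above; the dictionary layout, the key names and the `filter_rare_queries` helper are hypothetical and only illustrate how the settings might be organized.

```python
# Parameter values as reported in Section 5.2; the key names are illustrative.
PARAMS = {
    "min_distinct_users": 10,  # filter threshold for rarely searched queries
    "alpha": 3,                # download vs. view coefficient, Equation (4)
    "e": 3,                    # ontology similarity constant, Equation (6)
    "weights": {"search_history": 2, "clickstream": 2, "metadata": 1, "ontology": 2},
    "theta": 2,                # majority threshold, Equation (8)
    "beta": 0.05,              # majority-rule change rate, Equation (8)
}


def filter_rare_queries(query_users, min_users):
    """Keep only the queries searched by at least `min_users` distinct users."""
    return {q: users for q, users in query_users.items() if len(users) >= min_users}


# Hypothetical input: each query mapped to the set of users who searched it.
query_users = {
    "sst": {f"user{i}" for i in range(12)},
    "rarely searched query": {"user1", "user2", "user3"},
}
kept = filter_rare_queries(query_users, PARAMS["min_distinct_users"])
print(sorted(kept))  # ['sst']
```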
6. Results and evaluation
After the filtering stages, 546 unique terms were added to the final results, including 389 unique user queries, 427 unique terms from the metadata and 47 unique concepts from the SWEET ontology. The total is smaller than the sum of all the sources because there are overlaps among the sources. We derived the accuracy evaluation from the following characteristics: (1) a qualitative analysis of the results of the selected user queries, (2) a calculation of the overall accuracy of the different sample groups and (3) an examination of the precision curve of the same sample groups from the second characteristic listed above.
6.1. Results of the selected user queries
Table 2 reports the top related terms from each individual method and from the integrated list for six selected user queries. Five are topical and are the most popular among all user queries. One is locational, selected to demonstrate the capability to handle locational concepts. The first column lists the six user queries: ‘ocean temperature’, ‘ocean wind’, ‘sea surface topography’, ‘quikscat’, ‘grace’ and ‘pacific’. The last column shows the top five related terms in the integrated list. The final relationship value for each term is calculated from the top 20 related terms of the different methods, but the remaining 15 terms are omitted from the table. Qualitatively, the final integrated list is better than any of the four individual methods in terms of the number of terms and the similarity accuracy (Table 2). For example, in the first row, the relationship between ‘ocean temperature’ and ‘sst’ is 0.94 in the clickstream method and 0.96 in the metadata method. Instead of the maximum similarity (0.96), the final relationship is 1.0 because it is recognized as a strong relationship: the sum of the method weights exceeds 2 (θ = 2; β = 0.05). Similarly, the relationship between ‘grace’ and ‘ocean mass’ is 1.0 in the
Table 2. Top related terms of the four methods and an integrated list of the top five queries.

Query: Ocean temperature
Search history: Sea surface temperature (0.66), sea surface topography (0.56), ocean wind (0.56), aqua (0.49), ocean circulation (0.49)
Clickstream: Sea surface temperature (0.94), sst (0.94), group high-resolution sea surface temperature dataset (0.89), ghrsst (0.87), caspian sea (0.74)
Metadata: sst (0.96), ghrsst (0.77), sea surface temperature (0.72), surface temperature (0.63), reynolds (0.58)
SWEET: None
Integrated list: sst (1.0), sea surface temperature (1.0), ghrsst (1.0), group high-resolution sea surface temperature dataset (0.99), reynolds sea surface temperature (0.74)

Query: Ocean wind
Search history: Quikscat (0.65), ocean temperature (0.56), surface wind (0.54), sea surface topography (0.5), ocean circulation (0.48)
Clickstream: Surface wind (0.94), wind (0.93), wind speed (0.89), quikscat (0.78), climatology monthly mean wind (0.76)
Metadata: Wind speed (0.75), ocean wind vector (0.56), vector (0.56), wind data (0.56)
SWEET: Mackerel breeze (0.75), greco (0.75), gharbi (0.75)
Integrated list: Surface wind (1.0), wind (0.98), wind speed (0.94), quikscat (0.88), climatology monthly mean wind (0.76)

Query: Sea surface topography
Search history: Ocean temperature (0.56), ocean wind (0.5), ocean circulation (0.49), quikscat (0.46), sea surface temperature (0.45)
Clickstream: Sea surface height (0.92), lra (0.81), geogrian bay altimeter data (0.8), jason 1 geodetic (0.8), jason 1 sgdr netcdf (0.79)
Metadata: Ocean wave (0.74), sea level anomaly (0.27), sea surface height anomaly (0.27), ssha (0.27), swh (0.26)
SWEET: None
Integrated list: Sea surface height (1.0), lra (0.81), jason 1 geodetic (0.8), geogrian bay altimeter data (0.8), wind speed 80 m height brazil (0.79)

Query: quikscat
Search history: Ocean wind (0.65), sea surface topography (0.46), ocean temperature (0.46), ascat (0.44), surface wind (0.41)
Clickstream: Seawind (0.9), wind mozambique channel (0.89), global wind (0.89), quikscat 2b (0.89), quikscat reprocess (0.89)
Metadata: Seawind (0.84), scatterometer (0.54), wind (0.52), ipcc (0.51), cmip (0.51)
SWEET: None
Integrated list: Seawind (0.95), wind mozambique channel (0.89), quikscat reprocess (0.89), quikscat 2b (0.89), global wind (0.89)

Query: Grace
Search history: Geodetic gravity (0.55), grace acc (0.36), gravity (0.3), ocean pressure (0.3), grace kbr (0.3)
Clickstream: Gravity (0.99), grace jpl (0.99), grace kbr (0.95), grace l2 (0.95), geodetic gravity (0.95)
Metadata: Ocean mass (1.0), tellus (0.94), eof (0.7), gif (0.7), ocean bottom pressure (0.69)
SWEET: None
Integrated list: Gravity (1.0), grace kbr (1.0), grace gravity (1.0), grace acc (1.0), grace jpl (0.99), ocean mass (0.95)

Query: Pacific
Search history: Australia (0.53), northern atlantic (0.58), seawif (0.58), western pacific ocean (0.41), envisat ra 2 (0.41)
Clickstream: Goes 15 image (0.97), goes 15 (0.97), arctic sst (0.90), avhrr sst metop nar osisaf l3c v1.0 (0.18), avhrr sst metop osisaf l2p v1.0 (0.18)
Metadata: Eastern atlantic (0.95), western pacific ocean (0.91), mtsat (0.92), mtsat2 (0.92), msg (0.91)
SWEET: None
Integrated list: Goes 15 image (0.97), goes 15 (0.97), western pacific ocean (0.94), eastern atlantic (0.90), arctic sst (0.88)
metadata method but is only 0.95 in the integrated list because it occurs in only one method (the second-to-last row of Table 2). Note that the similarity between ‘ocean temperature’ and ‘sea surface temperature’ is close to 1.0 because there are
Table 3. Overall accuracy of sample groups.
Sample group | Overall accuracy (%)
Group 1 (the most popular 10 queries) | 88
Group 2 (the least popular 10 queries) | 61
Group 3 (randomly selected 10 queries) | 83
few deep ocean temperature datasets at PO.DAAC. The proposed method would adapt to the changes if other types of ocean temperature data were added in the future. As for the locational concept ‘pacific’, the first two concepts both refer to ‘goes 15’, the satellite that provides the most popular Pacific-region datasets (e.g. ‘GOES15-OSPO-L2P-v1.0’) according to the web logs. This knowledge is difficult to obtain without user behavior analysis, and presenting this satellite as a highly related concept on the search results page allows users to find a desired dataset more efficiently. The remaining three concepts are ‘western pacific ocean’, ‘eastern atlantic’ and ‘arctic sst’, which are products from locations spatially close to ‘pacific’.
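The two worked examples in this section (‘sst’ and ‘ocean mass’) can be reproduced with a short sketch of the majority voting adjustment. Equation (8) is not reproduced in this section, so the linear form used here (maximum similarity plus β times the amount by which the summed method weights exceed θ, capped at 1.0) is an assumption chosen because it matches both worked values; the function name `integrate_similarity`, the dictionary keys and the lower clamp at 0 are likewise hypothetical.

```python
# Method weights from the experimental design: search history, clickstream,
# metadata and ontology are weighted 2, 2, 1 and 2.
METHOD_WEIGHTS = {"search_history": 2, "clickstream": 2, "metadata": 1, "ontology": 2}


def integrate_similarity(similarities, weights=METHOD_WEIGHTS, theta=2.0, beta=0.05):
    """Combine per-method similarities for one term pair.

    Assumed form (not taken from Equation 8 itself):
        max(sim) + (sum of weights of contributing methods - theta) * beta,
    clamped to the range [0, 1].
    """
    if not similarities:
        raise ValueError("at least one method must report a similarity")
    weight_sum = sum(weights[method] for method in similarities)
    adjusted = max(similarities.values()) + (weight_sum - theta) * beta
    return round(min(1.0, max(0.0, adjusted)), 2)


# 'ocean temperature' vs 'sst': found by clickstream and metadata (weights 2 + 1).
sst_similarity = integrate_similarity({"clickstream": 0.94, "metadata": 0.96})

# 'grace' vs 'ocean mass': found by metadata only (weight 1).
ocean_mass_similarity = integrate_similarity({"metadata": 1.0})

print(sst_similarity, ocean_mass_similarity)
```

Under this assumed form, a pair confirmed by several methods is pushed above its maximum individual similarity, while a pair supported by a single low-weight method is pulled slightly below it. Because the final value is computed from the top 20 terms of each method, pairs may receive support from methods whose contribution is not visible in the five terms listed in Table 2.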
6.2. Overall accuracy of the sample groups
To evaluate the final results, we selected three groups as follows:
● The 10 most popular user queries (Group 1): ocean temperature, ocean wind, sea surface topography, quikscat, cross calibrate multi-platform ocean surface wind vector analysis field, sea surface temperature, grace, aquarius project, ocean circulation and saline density.
● The 10 least popular queries (Group 2): amsr, oscar, suomi npp, altika, dmsp f17, gfo, poseidon 2, gcom w1, msg and noaa 11.
● Ten randomly selected queries (Group 3): sea surface height, noaa 18, grace, seasat, pathfinder, ocean wave, ghrsst, sea ice, avhrr 2 and jason 1 geodetic.
Ocean scientists from NASA’s Jet Propulsion Laboratory evaluated whether the similarity between each term pair was reasonable using a five-point scale: ‘Excellent’, ‘Good’, ‘OK’, ‘Bad’ and ‘Terrible’. To compute the metric, the relationships labeled ‘Good’ or better were considered ‘reasonable’, and labels below that were deemed ‘unreasonable’. The overall accuracy for each group was calculated as the ratio of the number of reasonable term pairs to the total number of term pairs in that group. The overall accuracy of Group 1 was 88%, which is satisfactory (Table 3). The accuracy of Group 2 was the lowest (61%), whereas the accuracy of Group 3 (83%) lay between those of Groups 1 and 2. We suggest that user behavior data are insufficient to describe and represent the least popular queries, meaning that their data vectors are not sufficiently accurate to capture the similarity.
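The overall accuracy defined above is a simple ratio, sketched below for one sample group. The label strings follow the five-point scale in the text; the function name `overall_accuracy` and the sample labels are hypothetical and are not the experts' actual ratings.

```python
# Labels treated as 'reasonable', following the rule stated in the text.
REASONABLE_LABELS = {"Excellent", "Good"}


def overall_accuracy(labels):
    """Ratio of reasonable term pairs to all term pairs in a sample group."""
    if not labels:
        raise ValueError("the sample group contains no labeled term pairs")
    reasonable = sum(1 for label in labels if label in REASONABLE_LABELS)
    return reasonable / len(labels)


# Hypothetical expert labels for ten term pairs.
sample_labels = [
    "Excellent", "Good", "OK", "Good", "Bad",
    "Excellent", "Good", "Terrible", "Good", "Good",
]
accuracy = overall_accuracy(sample_labels)
print(f"{accuracy:.0%}")
```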
6.3. Precision curves of sample groups
‘Precision at K’ is a common metric for evaluating ranked results because it offers insight into the overall accuracy. This measure is calculated as the number of reasonable results within the first K positions divided by K, the position of interest. For example, precision at 3 is the number of results considered to be reasonable in the first three search results (no more than 3) divided by 3. By taking various values of K, a precision curve is produced by plotting the precision at K of each sample group to assess patterns in precision changes (Figure 9).

Figure 9. Precision curves of three sample groups.

The precisions from position 1 to 10 show that for Groups 1 and 3, the precisions at different positions do not change significantly: all fall between 0.8 and 1, indicating that the algorithm performs well. The precisions of Group 2 decrease from the beginning, but the first three positions suggest good results. Group 2 is consistently lower than the other two groups because there is less user behavior data to capture the underlying relationships in the least popular queries. In summary, high similarity values are more accurate than low similarity values, indicating that the proposed methodology finds highly related terms for a given query.
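Precision at K and the resulting precision curve can be computed as in the following sketch. The per-query definition follows the text; averaging the per-query curves to obtain one curve per sample group is an assumption about how a group-level curve would be produced, and the function names and judgments are hypothetical.

```python
def precision_at_k(is_reasonable, k):
    """Number of reasonable results in the first k positions, divided by k."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return sum(is_reasonable[:k]) / k


def precision_curve(is_reasonable, max_k=10):
    """Precision at K for K = 1 .. max_k for one ranked list of related terms."""
    return [precision_at_k(is_reasonable, k) for k in range(1, max_k + 1)]


def group_precision_curve(judgments_per_query, max_k=10):
    """Assumption: a group curve is the mean of the per-query curves."""
    curves = [precision_curve(judgments, max_k) for judgments in judgments_per_query]
    return [sum(values) / len(curves) for values in zip(*curves)]


# Hypothetical expert judgments for the top 10 related terms of one query
# (True means the term pair was rated as reasonable).
judgments = [True, True, False, True, True, True, False, True, True, True]
curve = precision_curve(judgments)
group_curve = group_precision_curve([judgments, [True] * 10])
print([round(p, 2) for p in curve])
```

Dividing by K rather than by the number of available results means a list with fewer than K related terms is penalized for the missing positions, which matches the "no more than 3, divided by 3" wording above.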
7. Discussion and conclusions
Semantic relationships between domain-specific vocabularies are significant for moving beyond keyword-based searches to semantic-based searches. Using oceanographic data discovery as an example, this study offers a comprehensive approach to discovering semantic relationships among geospatial vocabularies from four different sources. The user search history analysis assumes that the more frequently any two queries co-occur among distinct users’ search histories, the more similar they are. Clickstream analysis assumes that if two queries are similar, the viewed and downloaded data will also be more similar. Metadata analysis assumes that semantically related terms appear in the same document more frequently. The ontology concept similarity calculation method addresses the limitation of traditional approaches by accounting for differences in relation types and ontology density. The integration method utilizes the results from the different sources to validate and supplement each other and was shown to be successful by achieving an overall accuracy of 83% over randomly selected samples.
Many geospatial applications would benefit from integrating similarity-based information retrieval techniques (Janowicz et al. 2008). This study has established the groundwork for further improving data discovery in several ways. First, the most related terms (i.e. the top related terms in Table 2) obtained with a large similarity threshold (e.g. 0.9) can augment a regular user query. For instance, when a user searches for ‘ocean temperature’, the user query is expanded to ‘ocean temperature’ (1.0) OR ‘sea surface temperature’ (1.0) OR ‘sst’ (1.0) OR ‘ghrsst’ (1.0) OR ‘group high resolution sea surface temperature dataset’ (0.99). This query expansion approach improves both the search precision (the fraction of retrieved data that are relevant) and the recall (the fraction of relevant data that are retrieved), as determined in many previous studies (AlJadda et al. 2014, Li et al. 2014, Liu et al. 2014). Second, query expansion can also improve the search ranking and data recommendations. In the ‘ocean temperature’ search example, ‘sst’ has the same importance value as the original query, which boosts the rank of desired data in the ranking and recommendation relevance score calculations. Third, the terms with similarity within a defined range (e.g. from 0.7 to 0.9) can be presented as query suggestions on the search results page to help the user find more desirable data.

There are several directions for future research. First, one limitation of the proposed methodology is that each term is regarded as an individual concept, so word sense ambiguities cannot be resolved when a term has multiple meanings. One resolution would be to build semi-supervised classifiers and collaborate with domain experts to create training data (Korayem et al. 2015). A second limitation lies in the clickstream data, where clickstream events are weighted equally regardless of the relative position of the returned data. As noted in previous studies, a result could be more relevant to the query when it is viewed or downloaded despite a low ranking. For metadata analysis, one research avenue is to extract and incorporate concepts from the summary into the mining process, for example by using named-entity recognition techniques from natural language processing (Nothman et al. 2013). Finally, a better geospatial data discovery system is envisioned that combines vocabulary similarity, search ranking and data recommendation supported by user behavior.
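The query expansion and query suggestion uses discussed in this section can be illustrated with a small sketch. The thresholds (0.9 for expansion, 0.7 to 0.9 for suggestions) are the example values given in the text, and the related terms are the integrated list for ‘ocean temperature’ in Table 2. The function name `expand_query`, the handling of the boundary value 0.9 and the plain OR string are assumptions, not the syntax of any particular search engine.

```python
def expand_query(query, related_terms, expand_threshold=0.9, suggest_range=(0.7, 0.9)):
    """Split related terms into weighted expansion terms and query suggestions.

    Assumption: a similarity of exactly `expand_threshold` counts as expansion,
    and suggestions cover low <= similarity < high.
    """
    low, high = suggest_range
    weighted_terms = [(query, 1.0)]  # the original query keeps full weight
    suggestions = []
    # Visit terms from most to least similar (ties keep their input order).
    for term, similarity in sorted(related_terms.items(), key=lambda item: -item[1]):
        if similarity >= expand_threshold:
            weighted_terms.append((term, similarity))
        elif low <= similarity < high:
            suggestions.append(term)
    expanded = " OR ".join(f'"{term}"' for term, _ in weighted_terms)
    return expanded, weighted_terms, suggestions


# Integrated list for 'ocean temperature' as reported in Table 2.
related = {
    "sst": 1.0,
    "sea surface temperature": 1.0,
    "ghrsst": 1.0,
    "group high-resolution sea surface temperature dataset": 0.99,
    "reynolds sea surface temperature": 0.74,
}
expanded, weighted_terms, suggestions = expand_query("ocean temperature", related)
print(expanded)
print(suggestions)
```

The similarity values kept in `weighted_terms` could serve as term weights in a ranking function, which corresponds to the second use described above.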
Acknowledgments
The research was partially carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. An earlier version of this paper was proofread by Prof. George Taylor.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work was supported by the NASA AIST: [NNX15AM85G] and NSF: [ICER-1540998 and IIP1338925].
ORCID Chaowei Yang
http://orcid.org/0000-0001-7768-4066
References
AlJadda, K., et al., 2014. Crowdsourced query augmentation through semantic discovery of domain-specific jargon. In: Big Data (Big Data), 2014 IEEE International Conference on, 2014, Washington DC. New York City, NY: IEEE, 808–815.
Andrea Rodriguez, M. and Egenhofer, M.J., 2004. Comparing geospatial entity classes: an asymmetric and context-dependent similarity measure. International Journal of Geographical Information Science, 18 (3), 229–256. doi:10.1080/13658810310001629592
Battle, R. and Kolas, D., 2012. Enabling the geospatial semantic web with parliament and geosparql. Semantic Web, 3 (4), 355–370.
Bermudez, L., Graybeal, J., and Arko, R., 2006. A marine platforms ontology: experiences and lessons. In: Proceedings of the ISWC 2006 Workshop on Semantic Sensor Networks, 2006 Athens, GA, USA. Aachen, Germany: Sun SITE Central Europe (CEUR).
Blei, D.M., Ng, A.Y., and Jordan, M.I., 2002. Latent dirichlet allocation. Advances in Neural Information Processing Systems, 1, 601–608.
Casey, K., 2016. NOAA onestop data discovery and access framework project: progress, feedback, and alignment with the USGEO common framework on earth observation data [online]. ESIP Summer Meeting 2016: ESIP. Available from: https://2016esipsummermeeting.sched.com/event/6uHL/noaa-onestop-data-discovery-and-access-framework-project-progress-feedback-and-alignment-with-the-usgeo-common-framework-on-earth-observation-data [Accessed 14 May 2016].
Devarakonda, R., et al., 2011. Data sharing and retrieval using OAI-PMH. Earth Science Informatics, 4 (1), 1–5. doi:10.1007/s12145-010-0073-0
Dumais, S.T., 2004. Latent semantic analysis. Annual Review of Information Science and Technology, 38 (1), 188–230. doi:10.1002/aris.1440380105
Egenhofer, M.J., 2002. Toward the semantic geospatial web. In: Proceedings of the 10th ACM International Symposium on Advances in Geographic Information Systems, 2002, McLean, VA. New York City, NY: ACM, 1–4.
Hu, Y., et al., 2015. Metadata topic harmonization and semantic search for linked-data driven geoportals: a case study using ArcGIS Online. Transactions in GIS, 19 (3), 398–416. doi:10.1111/tgis.12151
Hua, X.-S., et al., 2013. Clickage: towards bridging semantic and intent gaps via mining click logs of search engines. In: Proceedings of the 21st ACM International Conference on Multimedia, 2013, Barcelona, Spain. New York City, NY: ACM, 243–252.
Janowicz, K., et al., 2008. Semantic similarity measurement and geospatial applications. Transactions in GIS, 12 (6), 651–659. doi:10.1111/j.1467-9671.2008.01129.x
Janowicz, K., Raubal, M., and Kuhn, W., 2011. The semantics of similarity in geographic information retrieval. Journal of Spatial Information Science, 2011 (2), 29–57.
Jiang, Y., et al., 2016a. Reconstructing sessions from data discovery and access logs to build a semantic knowledge base for improving data discovery. ISPRS International Journal of Geo-Information, 5 (5), 54. doi:10.3390/ijgi5050054
Jiang, Y., Sun, M., and Yang, C., 2016b. A generic framework for using multi-dimensional earth observation data in GIS. Remote Sensing, 8 (5), 382. doi:10.3390/rs8050382
Jiang, Y., Xia, J., and Liu, K., 2016c. Polar CI portal: a cloud based polar resource discovery engine. In: T.C. Vance, et al., eds. Cloud computing in ocean and atmospheric sciences. Amsterdam: Academic Press, 163–185.
Konstan, J.A., et al., 1997. GroupLens: applying collaborative filtering to Usenet news. Communications of the ACM, 40 (3), 77–87. doi:10.1145/245108.245126
Korayem, M., et al., 2015. Query sense disambiguation leveraging large scale user behavioral data. In: Big Data (Big Data), 2015 IEEE International Conference on, 2015, Santa Clara, CA. New York City, NY: IEEE, 1230–1237.
Korenius, T., et al., 2004. Stemming and lemmatization in the clustering of Finnish text documents. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, 2004, Washington DC. New York City, NY: ACM, 625–633.
Krisnadhi, A., et al., 2015. The GeoLink modular oceanography ontology. In: International Semantic Web Conference, 2015, Bethlehem, PA. Amsterdam, Netherlands: Elsevier, 301–309.
Lee, J.-G. and Kang, M., 2015. Geospatial big data: challenges and opportunities. Big Data Research, 2 (2), 74–81. doi:10.1016/j.bdr.2015.01.003
Li, W., Goodchild, M.F., and Raskin, R., 2014. Towards geospatial semantic search: exploiting latent semantic relations in geospatial data. International Journal of Digital Earth, 7 (1), 17–37. doi:10.1080/17538947.2012.674561
Liu, K., et al., 2014. Using semantic search and knowledge reasoning to improve the discovery of Earth science records: an example with the ESIP semantic testbed. International Journal of Applied Geospatial Research (IJAGR), 5 (2), 44–58. doi:10.4018/ijagr.2014040104
Mangold, C., 2007. A survey and classification of semantic search approaches. International Journal of Metadata, Semantics and Ontologies, 2 (1), 23–34. doi:10.1504/IJMSO.2007.015073
McCandless, M., Hatcher, E., and Gospodnetic, O., 2010. Lucene in action: covers Apache Lucene 3.0. Greenwich, CT: Manning Publications.
Miller, G.A., 1995. WordNet: a lexical database for English. Communications of the ACM, 38 (11), 39–41. doi:10.1145/219717.219748
Newman, M.E., 2005. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46 (5), 323–351. doi:10.1080/00107510500052444
Nothman, J., et al., 2013. Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194, 151–175. doi:10.1016/j.artint.2012.03.006
Raskin, R. and Pan, M., 2003. Semantic web for earth and environmental terminology (SWEET). In: Proceedings of the Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, 2003, Sanibel Island, Florida. Aachen, Germany: Sun SITE Central Europe (CEUR).
Salton, G., Wong, A., and Yang, C.S., 1975. A vector space model for automatic indexing. Communications of the ACM, 18 (11), 613–620. doi:10.1145/361219.361220
Srivastava, J., et al., 2000. Web usage mining: discovery and applications of usage patterns from web data. ACM SIGKDD Explorations Newsletter, 1 (2), 12–23. doi:10.1145/846183.846188
Tous, R. and Delgado, J., 2006. A vector space model for semantic similarity calculation and OWL ontology alignment. In: International Conference on Database and Expert Systems Applications, 2006, Kraków, Poland. Berlin, Germany: Springer, 307–316.
Vatsavai, R.R., et al., 2012. Spatiotemporal data mining in the era of big spatial data: algorithms and applications. In: Proceedings of the 1st ACM SIGSPATIAL international workshop on analytics for big geospatial data, 2012, Redondo Beach, CA. New York City, NY: ACM, 1–10.
Yang, C., et al., 2017a. Big data and cloud computing: innovation opportunities and challenges. International Journal of Digital Earth, 10 (1), 13–53. doi:10.1080/17538947.2016.1239771
Yang, C., et al., 2017b. Utilizing cloud computing to address big geospatial data challenges. Computers, Environment and Urban Systems, 61, 120–128. doi:10.1016/j.compenvurbsys.2016.10.010