Ontology-driven data integration and visualization for ...

Accepted Manuscript

Ontology-driven data integration and visualization for exploring regional geologic time and paleontological information

Chengbin Wang, Xiaogang Ma, Jianguo Chen

PII: S0098-3004(17)30518-6
DOI: 10.1016/j.cageo.2018.03.004
Reference: CAGEO 4103
To appear in: Computers and Geosciences
Received Date: 6 May 2017
Revised Date: 27 February 2018
Accepted Date: 5 March 2018

Please cite this article as: Wang, C., Ma, X., Chen, J., Ontology-driven data integration and visualization for exploring regional geologic time and paleontological information, Computers and Geosciences (2018), doi: 10.1016/j.cageo.2018.03.004. This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Ontology-Driven Data Integration and Visualization for Exploring Regional Geologic Time and Paleontological Information

Chengbin Wang1, 2, Xiaogang Ma2*, Jianguo Chen1

1 State Key Laboratory of Geological Processes and Mineral Resources & Faculty of Earth Resources, China University of Geosciences, Wuhan 430074, China

2 Department of Computer Science, University of Idaho, Moscow ID 83844, USA

* Corresponding Author E-mail: [email protected]

Abstract: Open data initiatives promote the online publication and sharing of large amounts of geologic data. How to retrieve information and discover knowledge from such big data is an ongoing challenge. In this paper, we developed an ontology-driven data integration and visualization pilot system for exploring information on regional geologic time, paleontology, and fundamental geology. The pilot system (http://www2.cs.uidaho.edu/~max/gts/) implements the following functions: modeling and visualization of a geologic time scale ontology of North America, interactive retrieval and display of fossil information, and geologic map information query and comparison with fossil information. A few case studies were carried out in the pilot system for querying fossil occurrence records from the Paleobiology Database and comparing them with information from the USGS geologic map services. The results show that, to improve the compatibility between local and global geologic standards, bridge gaps between different data sources, and create smart geoscience data services, it is necessary to further extend and improve the existing geoscience ontologies and use them to support functions for exploring open data.

Key Words: Ontology; Local Geologic Time Standard; Paleontology; Geologic Map; Open Data

1 Introduction

In the past decade, open data and big data approaches have attracted increasing attention, and they are now widely used in different knowledge domains. Many governmental agencies and scientific organizations have published data on the Internet for others to reuse. In the geosciences, data on various subjects (e.g. geologic maps, mineral deposits, fossils, geochemistry, mineralogy and petrology) are already open and accessible online. Due to their disciplinary backgrounds, those open geoscience data are usually stored in repositories that each focus on a single subject and lack inter-connections. There are both challenges and opportunities in making connections among those ‘silo’ data sources, developing interactive data services, detecting patterns in assembled datasets, and proposing new topics for knowledge discovery.

To address challenges caused by the high volume, variety and velocity of big data, Sheth (2014a, 2014b) used semantic perception, agreement and continuous semantics methods to transform big data into structured smart data, that is, data of smaller volume that carry actionable information. He presented successful case studies in personalized and actionable health information. Similar research topics of building smart data and enabling knowledge discovery also exist in the geosciences. The standards developed by the World Wide Web Consortium (W3C) and the Open Geospatial Consortium (OGC), such as XML (eXtensible Markup Language)1, RDF (Resource Description Framework)2, RDFS (RDF Schema)3, SKOS (Simple Knowledge Organization System)4, OWL (Web Ontology Language)5, WFS (Web Feature Service)6, WMS (Web Map Service)7 and WCS (Web Coverage Service)8, provide the fundamental building blocks for adding structure and meaning to online datasets. In recent years, the geoscience community has made remarkable progress on semantic models, ontologies and open data frameworks (Buccella et al., 2009; Ma and Fox, 2013; Sen and Duffy, 2005; Zheng et al., 2015). Several geoscience data frameworks, such as OneGeology9, USGIN (U.S. Geoscience Information Network)10, AuScope11 and USGS (U.S. Geological Survey) Mineral Resources On-Line Spatial Data12, have adopted semantic technologies or similar approaches to improve the discoverability, accessibility and interoperability of data services. Those efforts create space for exploring methods towards smart geoscience data and for conducting studies on information retrieval and knowledge discovery.

1 https://www.w3.org/XML/
2 https://www.w3.org/RDF/
3 https://www.w3.org/TR/rdf-schema/
4 https://www.w3.org/2004/02/skos/
5 https://www.w3.org/OWL/
6 http://www.opengeospatial.org/standards/wfs
7 http://www.opengeospatial.org/standards/wms
8 http://www.opengeospatial.org/standards/wcs
9 http://www.onegeology.org/
10 http://usgin.org/
11 http://www.auscope.org.au/
12 https://mrdata.usgs.gov/

An ontology is a formal specification of the shared conceptualization of a domain of study (Gruber, 1995). In the approach from big data to smart data, ontologies can play an effective role in dealing with data heterogeneity through semantic enrichment and concept mapping (Buccella et al., 2009; Duong et al., 2017; Sheth, 2014b; Sotnykova et al., 2005; Su and Gulla, 2004). Mark et al. (2001) discussed that, from the point of view of geographers, ontologies can be categorized into geographic and conventional topics. In the domain of geosciences, we would say that the focus of the geographic ontologies in Mark et al. (2001) is spatial information (cf. Buccella et al., 2009; Buccella et al., 2011; Ma et al., 2011; Visser, 2005), and that the conventional ontologies are about domain-specific topics in geosciences, such as geologic time scale (Cox, 2011; Cox and Richard, 2005; Ma et al., 2012; Ma and Fox, 2013), geological modelling (Mastella et al., 2007; Perrin et al., 2005), geological structure (Zhong et al., 2009), and rock deformation (Babaie and Davarpanah, 2018). Detailed information about concepts and relationships within a focused domain can lead to innovative functions in smart geoscience data, and will provide solid support to geoscience researchers in data discovery and analysis.

The objective of this research is to build an ontology for a local geologic time standard and use it as middleware to integrate and present information from multiple sources that is hard to retrieve with existing methods. In this work, we first built an ontology for the local geologic time scale of North America. Second, we developed an interactive visualization for the ontology. Third, we deployed the ontology to integrate multi-subject information on fossils, geologic time and geologic backgrounds, and we conducted case studies on focused topics. We developed functions that use the visualized ontology for interactive information query and browsing on the user interface. To our knowledge, this work is the first example of using a local geologic time scale ontology and data visualization to conduct efficient and smart information retrieval. The presented research not only benefits the integration and comparison of geologic time and fossil information, but also provides practical experience on how to analyze and address the gaps between cross-domain open data.

The remainder of this paper is organized as follows: Section 2 describes the key methods and technologies deployed in this research. Section 3 describes the implementation of a pilot system and results. Section 4 analyzes the advantages and limits of this study, and proposes a few topics for future work. Finally, Section 5 gives a brief conclusion.

2 Methods and Technologies

Geoscience is a discipline in which heterogeneous terminologies are used across its various sub-domains (Reitsma and Albrecht, 2005; Ma, 2015). This situation is also reflected in geoscience data. If users are not familiar with the terminology used in a dataset, it is hard for them to understand and use the data. There are already international efforts to coordinate schemas, ontologies and vocabularies in geoscience, such as the work of the Commission for the Management and Application of Geoscience Information within the International Union of Geological Sciences (CGI-IUGS)13. Nevertheless, there is limited work on ontologies and vocabularies of local and regional standards. Meanwhile, legacy geoscience datasets are increasingly made accessible online, and many of them contain conceptual models or terminologies that are not part of global standards. There are implicit connections between the local and global standards, but in most cases we lack a machine-readable model to represent and describe such connections.

Seeing both challenges and opportunities in the situation described above, we designed and implemented a pilot study of ontology-driven data integration and exploration focused on the local geologic time scale in North America. Technological components included an ontology for the local geologic time scale, visualization of the ontology, interfaces for accessing fossil occurrence records and geologic map services, and interactive functions that use the ontology to help users understand and use datasets retrieved from multiple sources.

2.1 An ontology for the local geologic time scale of North America

The geologic time scale, a contiguous framework of time intervals, is a system that uses knowledge from stratigraphy, chronostratigraphy and paleontology to study Earth’s history (Gradstein et al., 2012). A global geologic time scale standard is established and published by the International Commission on Stratigraphy (ICS)14. In that standard, the lower boundaries of time intervals are in the process of being defined by Global Boundary Stratotype Sections and Points (GSSPs). In the global geologic time scale, stage/age is the unit at the lowest level. The global stages are defined by a continuous rock unit that contains biologic, geochemical, magnetic or other markers for global correlation. Once a GSSP is defined, it is marked with a “Golden Spike”.

13 http://www.cgi-iugs.org

Although the global standard has been accepted by geoscience researchers across the world, national and regional geological surveys also refer to local geologic time standards of tectono-stratigraphic divisions. For example, the local geologic time scale in North America is established according to the specific strata sequences and geologic evolution history of the region. There are three major differences between the global and North America standards: (1) Different nomenclature and definitions of intervals at the Epoch and Age levels. The geologic time scale in North America inherits the global standard at the Eon, Era and Period levels, and establishes its own standard at the Epoch and Age levels. Therefore, the divisions of Eon, Era and Period are identical in the global and the North America standards. (2) Different start and end boundaries between high- and low-level intervals in the local standard. In the global standard, boundaries are coordinated, so high-level geologic time intervals usually share start and end boundaries with their low-level intervals (Fig. 1a). Some Epoch and Age intervals in the North America standard do not match exactly with the Period intervals, which results in “cross-boundary” patterns in the local geologic time scale (Fig. 1a, Fig. 4). (3) There are some unnamed intervals in the geologic time scale of North America (Fig. 1b, Fig. 3). There could be many reasons for this situation. One reason could be the absence of strata caused by a sedimentary hiatus during a period of geologic time, such as the absence of Triassic and Jurassic strata (Brenner and Peterson, 1994). In another situation, although strata developed during the geologic time, limited work on nomenclature could also result in unnamed intervals. For example, the Devonian Period has named intervals at the Age level but lacks named intervals at the Epoch level (Fig. 1b, Fig. 4).

14 http://www.stratigraphy.org

Fig. 1 Comparison between global and North America geologic time standards. (a) shows the “cross-boundary” pattern. In the global standard, intervals at the Period, Epoch and Age levels share start and end boundaries in a coordinated framework. The divided intervals of Epoch and Age in the local standard of North America are outside of such a framework and do not share boundaries with intervals in the global standard. (b) shows the interval without named intervals (missing data) on the Epoch level in the local standard of North America. The data used in this figure are from Gradstein et al. (2012).

The geologic time scale has both a hierarchical conceptual structure and an ordinal temporal sequence (Cox and Richard, 2005; Michalak, 2005). Previous studies of geologic time ontologies mostly addressed the semantic representation of the global standard recommended by the ICS (Cox, 2011; Cox and Richard, 2015; Ma and Fox, 2013). In those works, Semantic Web languages and schemas such as OWL, RDF and SKOS were used to encode the hierarchical structure and ordinal sequence (Cox and Richard, 2015; Ma et al., 2011; Ma and Fox, 2013). In this paper, we used the JavaScript Object Notation for Linked Data (JSON-LD)15, a lightweight data-interchange format based on the JavaScript Object Notation (JSON), to serialize Linked Data and encode the hierarchical structure and temporal sequence of the local geologic time scale of North America (Fig. 2). We referred to three major sources (Haq, 2007; Rohde, 2005; TSCreator, 2017) for the list of local geologic time intervals, their time boundaries, and the global-local geologic time interval mappings. In particular, for the Triassic and Jurassic there are no recorded intervals at the Epoch and Age levels in the North America standard. To avoid a big gap in the time scale, we used the Triassic and Jurassic intervals from the global standard. For the encoding part, we wrote the JSON-LD file of the ontology manually. The JSON-LD file is accessible through the GitHub repository of this research16.

15 https://www.w3.org/TR/json-ld/

Fig. 2 Part of the JSON-LD code representing the hierarchical structure and temporal sequence in the local geologic time scale of North America. (a) shows the code of a “cross-boundary” pattern. The geologic time interval that crosses the boundary of two parent nodes was divided into two sub-objects by the boundary. The two sub-objects both inherit the same properties of the geologic time interval. (b) shows the code of an unnamed geologic time interval. It was represented in a record without a node name. Meaning of keywords: oid: object ID; name: node name of a geologic time interval; rank: era rank; base: the start time boundary; top: the end time boundary; mid: the middle of a geologic time interval; interval: time duration of a geologic time interval. (c) shows the definition of “context” in the JSON-LD encoding, which maps the keywords to defined concepts in other existing ontologies and schemas.

16 https://github.com/xgmachina/geotimeNam/blob/master/Northamerica.json

As shown in Fig. 2, braces in the JSON-LD file delimit the node records. The properties are recorded inside the braces and separated by commas. We employed the bracket, the colon, and the default keyword “children” of JSON-LD to encode the hierarchical structure of the geologic time scale. In the JSON-LD grammar, every node has only one parent node. In this research, a geologic time interval crossing the boundary of two parent nodes was therefore divided into two sub-objects by the boundary, and the two sub-objects both inherit the same properties of that interval (Fig. 2a). To create a complete framework for the ordinal-hierarchical structure, the unnamed intervals were encoded by records with an empty node name (Fig. 2b), and they have properties such as the start and end boundaries. To improve the interoperability of the developed ontology, the “context” (Fig. 2c) maps the keywords to IRIs (Internationalized Resource Identifiers) of defined concepts in existing ontologies and schemas.
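The encoding conventions described in Section 2.1 can be illustrated with a minimal Python sketch. It is not the authors' JSON-LD file: the keywords follow the Fig. 2 caption, while the helper names (make_interval, split_at_boundary), the context IRIs, the string form of rank, the sub-object ID suffixes and all numeric values are assumptions chosen only for illustration.

```python
import json

# Hypothetical "context": these IRIs are placeholders, not the mappings used in the paper.
CONTEXT = {
    "name": "http://www.w3.org/2004/02/skos/core#prefLabel",
    "base": "http://example.org/gts#base",
    "top": "http://example.org/gts#top",
    "children": "http://example.org/gts#children",
}


def make_interval(oid, name, rank, base, top, children=None):
    """Build one node record using the keywords listed in the Fig. 2 caption."""
    node = {
        "oid": oid,
        "name": name,                          # empty string for an unnamed interval
        "rank": rank,
        "base": base,                          # start boundary (Ma)
        "top": top,                            # end boundary (Ma)
        "mid": round((base + top) / 2, 3),     # middle of the interval
        "interval": round(base - top, 3),      # duration
    }
    if children:
        node["children"] = children
    return node


def split_at_boundary(oid, name, rank, base, top, boundary):
    """Split a cross-boundary interval into two sub-objects with the same name and rank."""
    if not (top < boundary < base):
        raise ValueError("boundary must lie strictly inside the interval")
    older = make_interval(f"{oid}a", name, rank, base, boundary)
    younger = make_interval(f"{oid}b", name, rank, boundary, top)
    return older, younger


# Illustrative values only: an Age that straddles a Period boundary at 298.9 Ma.
older, younger = split_at_boundary(10, "ExampleAge", "Age", 300.0, 290.0, 298.9)
unnamed = make_interval(11, "", "Epoch", 100.0, 90.0)

document = {"@context": CONTEXT, **make_interval(
    1, "ExamplePeriod", "Period", 298.9, 90.0, children=[younger, unnamed])}
encoded = json.dumps(document, indent=2)
```

Each half of the split interval is attached to a different parent, so the tree keeps a single parent per node while the shared name still identifies the original interval.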

2.2 Retrieving multi-source and multi-disciplinary geoscience data

The Paleobiology Database (PBDB)17 is an open database of paleontological data (Peters and McClennen, 2015; Peters et al., 2014; Varela et al., 2015), which included 184,259 collections, 350,487 taxa and 1,325,725 occurrences as of early March 2017. It provides two types of data services: (1) exploring fossil information through a web browser or a mobile application and (2) retrieving fossil information through the PBDB API (Application Program Interface). The former is for human users and the latter is for machines. In this work, we developed functions that use information about the named intervals in the local geologic time ontology of North America to retrieve fossil information (e.g. fossil occurrence location, accepted name, taxonomy, formation and reference) through the PBDB API. To enable the integration of more information, the retrieved fossil information was displayed together with geologic and geographic base maps. We developed functions to set up connections between the ontology and those data sources so they can be queried and compared interactively.

17 https://paleobiodb.org/

The Web Map Service18 (WMS) is a standard protocol released by the Open Geospatial Consortium (OGC) for building geospatial data services on the Web. It is widely used in web GIS application development for setting up geospatial data services, including geologic data. The Mineral Resources On-Line Spatial Data is an open database developed by the USGS mineral program. It provides data services of mineral resource, geology, geochemistry, geophysics, and more. WMS is one of the many standards applied by that database. In this work, we utilized the WMS “GetMap” and “GetFeatureInfo” functions to obtain the geologic map and feature properties from the USGS database, respectively. The retrieved information was used as a background map for the fossil information retrieved from PBDB. By integrating all those datasets, the developed system enables interested researchers to explore further geologic information of a focused area.

18 http://www.opengeospatial.org/standards/wms
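As a rough illustration of the two WMS operations named above, the following Python sketch assembles GetMap and GetFeatureInfo request URLs from the standard WMS 1.3.0 parameters. The endpoint, the layer name and the bounding box are hypothetical placeholders; the actual USGS service address and layer names used by the pilot system are not stated in the text, and no request is sent.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and layer; replace with the values of a real WMS service.
WMS_ENDPOINT = "https://example.org/wms"
LAYER = "geology"


def getmap_params(bbox, width, height):
    """Standard WMS 1.3.0 GetMap parameters (CRS:84 uses lon/lat axis order)."""
    return {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": LAYER,
        "STYLES": "",
        "CRS": "CRS:84",
        "BBOX": ",".join(str(v) for v in bbox),   # minx, miny, maxx, maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }


def getmap_url(bbox, width=512, height=512):
    """URL that asks the service to render the map image for a bounding box."""
    return WMS_ENDPOINT + "?" + urlencode(getmap_params(bbox, width, height))


def getfeatureinfo_url(bbox, i, j, width=512, height=512, info_format="text/html"):
    """GetFeatureInfo reuses the GetMap parameters and adds the clicked pixel (I, J)."""
    params = getmap_params(bbox, width, height)
    params.update({
        "REQUEST": "GetFeatureInfo",
        "QUERY_LAYERS": LAYER,
        "INFO_FORMAT": info_format,
        "I": i,
        "J": j,
    })
    return WMS_ENDPOINT + "?" + urlencode(params)


# Illustrative bounding box (lon/lat) and click position.
example_bbox = (-117.0, 46.0, -116.0, 47.0)
map_url = getmap_url(example_bbox)
info_url = getfeatureinfo_url(example_bbox, i=100, j=200)
```

A map client would typically issue the GetMap request when the view changes and the GetFeatureInfo request when the user clicks a location on the rendered map.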

2.3 Ontology as an interface between geologic time scale, paleontology and fundamental geology

A specific function enabled by ontologies is semantic inference, which can reveal new information and knowledge in a cross-domain context (Katifori et al., 2007). For each specific science domain, there are characteristic entities, properties and organizational structures that can be used in semantic reasoning and inference. Paleontology covers topics of taxon, position, location, age, fossil strata, source reference, and more. Fundamental geology, as revealed by a geologic map, usually contains geologic units, boundary, age, lithology, strata, color, location and reference information. Geologic time scale includes topics of age, early age, late age, duration time, as well as the hierarchical conceptual structure and the temporal sequence reflected in those concepts.

Fig. 3 The relationship network of geologic time scale, paleontology and geologic map service. Paleo: paleontology; GTS: geologic time scale; GM: geologic map service. The bold lines show the path used in this work to connect the Paleo, GM and GTS together.

With the help of the developed ontology, the common concepts among geologic time scale, paleontology and geology were used to set up connections among the several resources in this study (Fig. 3). Although there are several paths to link them together, we selected the GTS-Age-Paleo-Location-GM route (bold solid line in Fig. 3) to build the pilot system for information retrieval and knowledge discovery in this work. The chosen route contains topics of time, location, fossil and fundamental geology, which makes it a good case study for using all the data resources and conducting further exploration.

3 Implementation, Prototype System and Results


3.1 Interactive visualization for the local geologic time ontology of North America

To visualize a geologic time ontology, an effective method is needed to represent both the hierarchical conceptual structure and the ordinal time sequence (Cox, 2011; Cox and Richard, 2005; Ma et al., 2012; Ma and Fox, 2013). In previous works, both ActionScript and JavaScript have been used to visualize the global geologic time scale ontology (Ma et al., 2012; Ma et al., 2016). To achieve a human-friendly interaction, we adapted the open code of John Czaplewski19 and visualized the geologic time scale ontology of North America as an interactive partition displaying the hierarchical structure, temporal sequence, and annotation of geologic time intervals. D3 (Data-Driven Documents)20, an open-source JavaScript library for producing dynamic, interactive data visualizations, was used as the basic library in the visualization. The resulting visualization was in JavaScript-driven SVG (Scalable Vector Graphics) format, which provided a functional tool to deploy the JSON-LD file of the ontology in an interactive web application (cf. Stefani et al., 2014).

The resulting visualization shows the hierarchical and ordinal relationships of geologic time intervals (Fig. 4). The nodes are laid out from left to right, following Earth’s history from earlier to later. The width of every node represents the time duration of the corresponding geologic time interval, which is read from the JSON-LD records. From top to bottom, the node levels descend from Eon in the first layer to Age in the fifth layer. The name of a node is annotated on the node partition. It is displayed when the node partition is zoomed in and has enough space for the text of the name, and is hidden when the node partition is zoomed out (Fig. 4).

19 http://bl.ocks.org/jczaplew/7546689
20 https://d3js.org/
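The layout rules described for the visualization (horizontal position from the time sequence, width from the duration, row from the hierarchy level, labels shown only when they fit) can be sketched in a few lines of Python. This is an illustrative re-statement of the geometry, not the D3-based implementation used in the pilot system; the function names, the pixel scale, the label-width heuristic and the toy tree are all assumptions.

```python
# Hypothetical toy tree; names and ages are illustrative, not taken from the ontology.
TOY = {
    "name": "Parent", "base": 100.0, "top": 0.0,
    "children": [
        {"name": "Older", "base": 100.0, "top": 40.0},
        {"name": "Younger", "base": 40.0, "top": 0.0},
    ],
}


def layout(node, depth=0, row_height=20.0, px_per_ma=1.0, origin_ma=None, out=None):
    """Return one rectangle per interval: x from the base age, width from the duration."""
    if origin_ma is None:
        origin_ma = node["base"]          # the oldest boundary sits at x = 0
    if out is None:
        out = []
    out.append({
        "name": node["name"],
        "x": (origin_ma - node["base"]) * px_per_ma,      # earlier intervals further left
        "y": depth * row_height,                          # one row per hierarchy level
        "width": (node["base"] - node["top"]) * px_per_ma,
        "height": row_height,
    })
    for child in node.get("children", []):
        layout(child, depth + 1, row_height, px_per_ma, origin_ma, out)
    return out


def label_fits(rect, char_px=7.0):
    """Show a label only when the rectangle is wide enough for its text (rough heuristic)."""
    return rect["width"] >= len(rect["name"]) * char_px


zoomed_in = layout(TOY, px_per_ma=2.0)
zoomed_out = layout(TOY, px_per_ma=0.5)
```

Zooming corresponds to changing px_per_ma: the same interval yields a wider rectangle when zoomed in, which is when its label becomes visible.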

Fig. 4 Interactive visualization of the local geologic time scale ontology of North America.

3.2 System interface, interaction and exploration

We developed a user-friendly pilot system21 that connects elements from geologic time scale, paleontology, and WMS geologic map service together. The interface is shown in Fig. 5. It includes the visualized geologic time scale ontology at the bottom, a main window in the center displaying geologic and geographic base maps and fossil locations, radio buttons at the top left corner for choosing layers for further information query, a dropdown list to move the map window to different states in the U.S., and dynamic pop-up windows for displaying query results.

21 http://www2.cs.uidaho.edu/~max/gts/

Fig. 5 User interface of the developed pilot system. (a) shows the blank area in a geologic map layer; (b) “undefined” formation indicates there is no information about the strata that contains the fossil; (c) and (d) show the information of the same area from the map layer and the fossil occurrence records, respectively. A topic of research interest here is the differences between the information in (c) and (d), which may lead to new studies about the area.

The general workflow in the pilot system includes the following steps. First, a user navigates in the ontology visualization to find an interval of interest. Second, the user double-clicks the node of the selected time interval. The system then retrieves the base and top time boundaries of that interval, sends them to PBDB to retrieve relevant fossil occurrence records within that time coverage, and displays the records in the map window. There is also an information window at the top left corner of the map window that shows the selected interval and its base and top time boundaries. Third, the user can use the radio buttons at the top left corner to choose the object layer (USGS geologic map or Fossil) for querying attribute information. For example, when the ‘USGS’ layer is selected, the user can retrieve the geologic information of a place on the map by a mouse click. When the ‘Fossil’ layer is selected, the user can click spots in the fossil occurrence layer to see the attributes of fossils. The retrieved information is displayed in a mini pop-up window at the mouse click point. During the process, the user can also move the center of the map window to different states in the U.S. by making selections in a dropdown list at the top left.

There are a few specific settings in the current pilot system. Through the PBDB API, abundant attributes of fossil occurrence records could be retrieved, but in the current system we only displayed a short list of information to compare with the geologic information from the USGS open dataset. In the map window, the geologic maps of different states are automatically loaded when the zoom-in level of the window is 8 or higher. The user can also zoom out to see the nationwide distribution of fossil occurrences in the U.S., but the geologic map will then disappear. To see the geologic map in the background, the user needs to zoom in to a focused area until the required zoom-in level is reached. To make the operation convenient, we created a dropdown list of states in the U.S. to help users select areas of interest. The geologic map of the selected state is loaded automatically in the window.

Besides retrieving and integrating information from multiple sources, a more specific goal of this study is to use the ontology for semantic connection, reasoning and knowledge discovery. In other words, we extracted the relevant datasets and then structured and connected them as “smart data” rather than listing them as separate items. Based on the retrieved information, we can use semantic reasoning to obtain further information about an entity and its related entities. For example, we retrieved the fossil information within a period of time by using the start and end boundaries of named intervals from the geologic time ontology, rather than just matching labels. Then, the fossil location information was used to retrieve geologic map information from the USGS Mineral Resources On-Line Spatial Data, such as geologic unit, geologic age, petrology, and more. We were also able to compare the information from the different sources to find similarities and differences (Fig. 5c and d).

In the geologic time scale ontology, each geologic time interval is an entity linked with other entities. We can use the logical relationships in the ontology to find relevant entities within the geologic time scale. For example, Permian is not only a time interval from 298.9 Ma to 252.2 Ma (Gradstein et al., 2012), but is also a child node of Paleozoic and the parent node of the Wolfcampian, Leonardian, Guadalupian and Ochoan Epochs. Those relationships are useful in the development of functions for information retrieval on the Web. For example, suppose a user wants to find some typical records whose occurrence time is within the Permian. If the time information is only recorded with the literal labels of geologic time intervals, then the search must use not only the label ‘Permian’ but also the labels of the child intervals of Permian. The geologic time scale ontology can quickly provide all the child intervals of Permian. Such functions were developed in one of our previous studies (Ma et al., 2012).
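The label expansion described in the Permian example can be sketched as a simple tree traversal. The interval names and the Permian boundaries below are the ones given in the text; the nested dictionary layout, the function names and the omission of the child epochs' boundaries are assumptions of this sketch, which does not reproduce the functions of Ma et al. (2012).

```python
# Names and the Permian boundaries come from the text; child boundaries are omitted on purpose.
PALEOZOIC = {
    "name": "Paleozoic",
    "children": [
        {
            "name": "Permian", "base": 298.9, "top": 252.2,
            "children": [
                {"name": "Wolfcampian"}, {"name": "Leonardian"},
                {"name": "Guadalupian"}, {"name": "Ochoan"},
            ],
        },
    ],
}


def find_node(node, name):
    """Depth-first search for the record whose name matches."""
    if node.get("name") == name:
        return node
    for child in node.get("children", []):
        hit = find_node(child, name)
        if hit is not None:
            return hit
    return None


def search_labels(root, name):
    """Return the interval label plus the labels of all its descendants, skipping unnamed nodes."""
    start = find_node(root, name)
    if start is None:
        return []
    labels, stack = [], [start]
    while stack:
        node = stack.pop()
        if node.get("name"):
            labels.append(node["name"])
        stack.extend(reversed(node.get("children", [])))
    return labels


permian_labels = search_labels(PALEOZOIC, "Permian")
```

A label-based search for Permian records would then query with every entry of permian_labels instead of the single label ‘Permian’.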

In this pilot system, the data source used for queries is the PBDB API, which provides several channels for querying fossil occurrence records within a period of time (see footnote 22). The first is through the input of the names of one or more geologic time intervals. The PBDB system has a collection of geologic time terms and their corresponding start and end time boundaries. If more than one label is input, then the time range used in the query will be the contiguous period from the start of the earliest interval to the end of the latest interval. The second query method in the PBDB API is through the input of a maximum age and/or a minimum age. A key part of our work is the local geologic time scale of North America. If we query with the names of time intervals in the built ontology, some intervals (e.g. Wolfcampian or Leonardian) return no results. A possible reason is that those intervals are not included in the time term collection of PBDB, or that alternative names (e.g. the label ‘Wolfcamp’ in Fig. 5d) are used. To address this issue, we used the second method enabled by PBDB to develop the query function in the pilot system. When an interval is selected by the user (i.e. by double-clicking it in the ontology visualization), a function obtains the base and top boundaries of that interval from the ontology and then uses them as the maximum and minimum ages to query fossil occurrence records. For example, we could retrieve records from PBDB for the Wolfcampian or Leonardian through this function.

22 https://paleobiodb.org/data1.2/occs_doc.html
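A minimal sketch of the boundary-based query is given below. It only builds the request URL and does not contact the service. The endpoint path and the `max_ma` and `min_ma` parameter names follow our reading of the PBDB data service 1.2 documentation cited in footnote 22 and should be checked against it; the function name is hypothetical and this is not the code of the pilot system.

```python
from urllib.parse import urlencode

# Occurrence listing endpoint of the PBDB data service, version 1.2
# (assumed from the documentation cited in footnote 22).
PBDB_OCCS_URL = "https://paleobiodb.org/data1.2/occs/list.json"


def occurrence_query_url(base_ma: float, top_ma: float) -> str:
    """Build a query URL from the base (older) and top (younger) ages in Ma."""
    if base_ma <= top_ma:
        raise ValueError("base (older) age must be greater than top (younger) age")
    # The base boundary is sent as the maximum age, the top as the minimum age.
    params = {"max_ma": base_ma, "min_ma": top_ma}
    return f"{PBDB_OCCS_URL}?{urlencode(params)}"


# Permian boundaries as given in the text (Gradstein et al., 2012).
permian_url = occurrence_query_url(298.9, 252.2)
```

Because the request carries numeric ages rather than interval names, it does not depend on whether a local interval name is present in the PBDB time term collection.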

4 Discussion

In this study, we used JSON-LD to encode the geologic time scale ontology of North America, which demonstrates the utility of JSON-LD as a lightweight data-interchange format readable by both humans and machines. We used the JSON-LD syntax, such as braces, brackets and default keywords, to encode the hierarchical structure and temporal sequence of a geologic time ontology. More specifically, we used the “context” to map the keywords in our ontology to IRIs of defined concepts in other existing ontologies. In previous works, XML, OWL and SKOS have been used to encode geologic time scale ontologies (Raskin and Pan, 2005; Cox and Richard, 2005; Ma et al., 2011; Ma et al., 2012; Ma and Fox, 2013). Compared with these works, the JSON-LD code for the geologic time scale ontology is a concise format for data visualization and can be used directly in the development of the user interface of applications.
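To illustrate how a context maps local keywords to IRIs, the sketch below builds a small JSON-LD document as a Python dictionary and serializes it. The SKOS namespace IRI is the standard one; the `ex` namespace, the property names `baseMa`, `topMa` and `children`, and the node identifiers are hypothetical placeholders and do not reproduce the file shown in Fig. 2c.

```python
import json

# A small JSON-LD document: "@context" maps short keywords to IRIs.
# Everything under the "ex" prefix is a hypothetical placeholder.
interval_doc = {
    "@context": {
        "skos": "http://www.w3.org/2004/02/skos/core#",
        "ex": "http://example.org/gts#",
        "name": "skos:prefLabel",
        "baseMa": "ex:baseMa",
        "topMa": "ex:topMa",
        "children": "ex:hasChild",
    },
    "@id": "ex:Permian",
    "name": "Permian",
    "baseMa": 298.9,  # boundaries as given in the text
    "topMa": 252.2,
    "children": [
        {"@id": "ex:Wolfcampian", "name": "Wolfcampian"},
        {"@id": "ex:Leonardian", "name": "Leonardian"},
        {"@id": "ex:Guadalupian", "name": "Guadalupian"},
        {"@id": "ex:Ochoan", "name": "Ochoan"},
    ],
}

# Braces and brackets carry the hierarchy; list order carries the sequence.
encoded = json.dumps(interval_doc, indent=2)
```

Since the result is plain JSON, a visualization library can consume the nested structure directly, while a JSON-LD processor can still expand the keywords to full IRIs through the context.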

To address the needs of ontology visualization on the user interface, several specific objects were added to the developed JSON-LD file. For example, as described in section 2.1, two sub-objects of a geologic time interval were used to describe the “cross-boundary” pattern, and both inherited the properties of that interval. While this helps lay out the ordinal-hierarchical structure in the resulting visualization, the two sub-object intervals are identical, which may lead to issues in logical reasoning. In this research, the operations on the visualized ontology were limited to a single click for zooming into an interval and a double click for querying fossil occurrences within the time coverage of an interval. The data structure and the content of the JSON-LD file met the needs of those operations. But for some other operations, such as counting the number of intervals at the Epoch level, the above-mentioned specific objects in the JSON-LD file will lead to incorrect results. One approach for addressing this issue is to have separate ontologies to store precise information, and then build the JSON-LD file for visualization as a compatible adaptation from those ontologies (cf. Ma et al., 2016; Ma, 2017).
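The counting problem can be shown with a toy example. The interval names and identifiers below are invented placeholders, and the sketch assumes that the two sub-objects of a cross-boundary interval share one identifier; whether the actual JSON-LD file does so is not stated in the text.

```python
from typing import Dict, List

# Hypothetical Epoch-level entries from a visualization-oriented file.
# "Epoch B" appears twice because it is split into two sub-objects to
# draw a cross-boundary pattern (assumed here to share the same id).
epoch_nodes: List[Dict[str, str]] = [
    {"id": "ex:EpochA", "label": "Epoch A"},
    {"id": "ex:EpochB", "label": "Epoch B"},
    {"id": "ex:EpochB", "label": "Epoch B"},
    {"id": "ex:EpochC", "label": "Epoch C"},
]


def naive_count(nodes: List[Dict[str, str]]) -> int:
    """Count objects as stored, duplicates included."""
    return len(nodes)


def distinct_count(nodes: List[Dict[str, str]]) -> int:
    """Count intervals by identifier so duplicated sub-objects count once."""
    return len({node["id"] for node in nodes})
```

Here the naive count is 4 while the distinct count is 3, which is the kind of discrepancy that a separate, precise ontology would avoid.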

As shown in the context of the JSON-LD file (Fig. 2c), this ontology was built on top of a few existing ontologies developed by Cox and Richard (2015). The focus of this research was the intervals in the local geologic time scale of North America and their mappings to the global geologic time scale. The resulting JSON-LD file contained just the key information that addresses the needs of ontology visualization and fossil information retrieval in the pilot system. In our previous works, we had conducted logical reasoning through the ordinal-hierarchical structure of the geologic time scale (Ma et al., 2012), and we had planned similar work in this research. For example, in the developed pilot system, if a user selects (i.e., double-clicks) ‘Permian’ to find fossil occurrence records in PBDB, there could be a function to find all the child intervals of ‘Permian’ and use them together with ‘Permian’ as input for the query. However, in this research we found that some intervals in the local geologic time scale of North America could not be recognized by PBDB. Therefore, instead of using the logical relationships between the time intervals, we just obtained the base and top time boundaries of a selected interval from the ontology and then used them in the query sent to PBDB.

A driving force of this study is the feedback (personal communication) from geoscientists on our previous works (Ma et al., 2012; Ma et al., 2016). From conversations with colleagues in the fields of paleontology, geobiology and stratigraphy, we found that Web applications with visualized geoscience knowledge and map services are useful in their research and teaching work. A suggestion from Prof. Miriam Katz on the visualization was that a horizontal or vertical layout is more familiar to geoscientists than the sunburst layout in our previous work, and this pilot system realized that layout. The system and a tutorial document were made accessible online (see footnote 21), and the source code of the system was also published (see footnote 23). We will continue to collect feedback and suggestions from colleagues, especially geoscientists, and improve the functions of the system.

23 https://github.com/xgmachina/geotimeNam

A well-organized ontology model is an efficient interface for linking different research domains, which offers considerable potential for geoscience research. In this study, we used the ontology model to link structured data across the geologic time scale, paleontology and fundamental geology, and set up functions for information retrieval and inference. It was not our intention to reinvent the wheel of the existing functions in PBDB, but the intervals collected for the local geologic time scale of North America could complement the time term collection in PBDB. We also plan to encode more local geologic time standards, such as those listed in Haq (2007), into formats for the Semantic Web and use them to build applications for querying open data on the Web.

Among the various open geoscience data on the Web, a topic of interest is to explore the background of gaps among different data sources and study ways to bridge the gaps in data integration. In this research, several gaps in the cross-domain open data were found: (1) Missing data - there are empty properties for the fossil occurrences and blank areas in the WMS geologic map (Fig. 5a and b). (2) Synonymous expressions - there are significant differences in the Epoch and Age divisions and nomenclature of the geologic time scale between the global and North American standards. (3) Paleontological research and geologic mapping services refer to different geologic time scale standards. In this study, we found that the PBDB and USGS databases hold different records describing the same object (e.g. rock age, lithology) (Fig. 5c and d). There could be many reasons for such differences, and a further study of the background information may lead to new topics for research. For example, in Fig. 5c and d, the point on the geologic map is at the edge of eroding Permian sediments, so the fossils in that alluvium can include those from the Permian.

Some data gaps could be caused by the differences between local and global standards. Geological investigation and interpretation is a scientific domain that involves subjectivity. Geologic standards are designed to reduce the subjectivity and enhance the objectivity at a national or local scale. Nevertheless, the large number of separately developed standards poses challenges for data exchange at a global scale. To address this challenge, the standards could be made open and published in machine-readable formats. We urge more researchers to encode local geologic data standards in Web-compatible formats and publish them online for reuse. Once a large number of those local standards are made openly accessible, calibrations and connections among those local standards as well as between local and global standards can be made to promote the interoperability of datasets.

Several concepts can be used to explore the relationships between the geologic time scale, paleontology, and fundamental geology, as shown in Figure 3. They lay out the space for extending and improving the developed system in the future. For example, in the developed pilot system, we used the GTS-Age-Paleo-Location-GM route to implement the functions of information retrieval and reasoning. We can also use other routes to implement similar functions and conduct bidirectional queries.

The current system (http://www2.cs.uidaho.edu/~max/gts/) is in its prototype stage, and several directions for future work can be proposed. The first is to collect more local geologic time standards, match them with the global geologic time standard, and use them to enrich the built ontology. The pilot study in this paper is an example based on the local geologic time standard of North America. If the ontology is enriched with more local geologic time standards, it will be more useful in data search and integration on the Web. The second is to enrich the interactive functions on the user interface. This includes the ontology content, the visualization, the map window operations, and the rendering and display of multisource information, such as geologic maps and fossil occurrence records. We will invite colleagues in the geoscience community to use the pilot system and provide feedback, and then we will revise the system for its next version. Besides user feedback, we also plan a few other updates. For example, in addition to the visualized ontology, we can also build a free-text and time-boundary search, so that a user can input a time term or two numeric time boundaries to search fossil occurrence records.
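One possible shape for the planned search input is sketched below. It is purely illustrative of a feature that has not been built: the accepted input formats (two numbers separated by a hyphen or a comma, or otherwise a free-text term) and the function name are assumptions made here.

```python
import re
from typing import Tuple, Union

# Two non-negative numbers separated by a hyphen or a comma, e.g. "298.9 - 252.2".
BOUNDS_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*[-,]\s*(\d+(?:\.\d+)?)\s*$")

ParsedInput = Tuple[str, Union[str, Tuple[float, float]]]


def parse_search_input(text: str) -> ParsedInput:
    """Classify user input as numeric time boundaries or a time term."""
    match = BOUNDS_PATTERN.match(text)
    if match:
        first, second = float(match.group(1)), float(match.group(2))
        # Return (older, younger) in Ma regardless of the order typed.
        return ("bounds", (max(first, second), min(first, second)))
    term = text.strip()
    if not term:
        raise ValueError("empty search input")
    return ("term", term)
```

A term result could then be resolved to boundaries through the ontology, while a bounds result could be passed to the age-based query directly.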


5 Conclusion

Geologic data have been increasingly made open online, while methods and tools to obtain knowledge from them are underdeveloped. Based on the open geologic data, we employed an ontology model to design and implement a pilot system crossing the domains of the local geologic time scale of North America, paleontology, and fundamental geology. The pilot system (http://www2.cs.uidaho.edu/~max/gts/) realized the following functions: visualization of the geologic time ontology of North America, interactive retrieval and display of fossil information, and query and comparison of fossil information and geologic map information. It is an interactive and integrated system for fossils, geologic map services, and the geologic time scale of North America, and has proved useful in helping researchers explore information of interest and propose new research topics.

The compatibility between local and global geoscience standards and the gaps between different data sources will be a long-term challenge for the application of ontology models in geoscience knowledge discovery. To address these challenges, it is necessary to extend and improve the existing geologic ontology models for broader compatibility, and to develop and connect more ontologies of local data standards (e.g. those listed in Haq, 2007 and Rohde, 2005) to improve the interoperability of datasets from different sources.

Acknowledgement

This work was partly supported by the National Science Foundation (NSF) through the NSF Idaho EPSCoR Program (award number IIA-1301792) and the W. M. Keck Foundation through the grant “The Co-Evolution of the Geo- and Biospheres: An Integrated Program for Data-Driven Abductive Discovery in Earth Sciences”. We thank USGS and the Geophysical Laboratory at the Carnegie Institution for Science for financial support for our attendance at the 2017 USGS-DTDI workshop at Reston, VA. We also thank two anonymous reviewers and the editor Prof. Gregoire Mariethoz for their constructive comments on an earlier version of the manuscript.

References

Babaie, H.A., Davarpanah, A., 2018. Semantic modeling of plastic deformation of polycrystalline rock. Computers & Geosciences, 111, 213-222.

Brenner, R.L., Peterson, J.A., 1994. Jurassic sedimentary history of the northern portion of the Western Interior Seaway, USA. In: Caputo, M.V., Peterson, J.A., Franczyk, K.J. (eds.) Mesozoic Systems of the Rocky Mountain Region: The Rocky Mountain Section SEPM, Denver, pp. 217-232.

Buccella, A., Cechich, A., Fillottrani, P., 2009. Ontology-driven geographic information integration: A survey of current approaches. Computers & Geosciences, 35, 710-723.

Buccella, A., Cechich, A., Gendarmi, D., Lanubile, F., Semeraro, G., Colagrossi, A., 2011. Building a global normalized ontology for integrating geographic data sources. Computers & Geosciences, 37, 893-916.

Cox, S., 2011. OWL representation of the geologic timescale implementing stratigraphic best practice. Proceedings of AGU 2011 Fall Meeting, San Francisco, abstract IN31B-1440.

Cox, S.J.D., Richard, S.M., 2005. A formal model for the geologic time scale and global stratotype section and point, compatible with geospatial information transfer standards. Geosphere, 1(3), 119-137.

Cox, S.J.D., Richard, S.M., 2015. A geologic timescale ontology and service. Earth Science Informatics, 8(1), 5-19.

De Donatis, M., Bruciatelli, L., 2006. MAP IT: The GIS software for field mapping with tablet pc. Computers & Geosciences, 32, 673-680.

Duong, T.H., Nguyen, H.Q., Jo, G.S., 2017. Smart Data: Where the Big Data Meets the Semantics. Computational Intelligence and Neuroscience, 2017: 6925138. doi: 10.1155/2017/6925138

Gradstein, F.M., Ogg, J.G., Schmitz, M., Ogg, G., 2012. The Geologic Time Scale. Elsevier, Kidlington, UK, 1176 pp.

Gruber, T.R., 1995. Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies, 43(5-6), 907-928.

Haq, B.U., 2007. The Geological Time Table, 6th ed. Amsterdam: Elsevier. 1 p.

Ma, X., 2015. Geoinformatics in the Semantic Web. In: Schaeben, J., Delgado, R.T., van den Boogaart, K.G., van den Boogaart, R. (eds.) Proceedings of the IAMG 2015 Annual Conference, Freiberg, Germany, 9 pp.

Ma, X., 2017. Linked Geoscience Data in practice: where W3C standards meet domain knowledge, data visualization and OGC standards. Earth Science Informatics, 10(4), 429-441.

Ma, X., Carranza, E.J.M., Wu, C., van der Meer, F.D., 2012. Ontology-aided annotation, visualization, and generalization of geological time-scale information from online geological map services. Computers & Geosciences, 40, 107-119.

Ma, X., Carranza, E.J.M., Wu, C., van der Meer, F.D., Liu, G., 2011. A SKOS-based multilingual thesaurus of geological time scale for interoperability of online geological maps. Computers & Geosciences, 37, 1602-1615.

Ma, X., Fox, P., 2013. Recent progress on geologic time ontologies and considerations for future works. Earth Science Informatics, 6, 31-46.

Ma, X., Fu, L., Fox, P., Liu, G., 2016. An integrated golden spike information portal enabled by data visualization and semantic web technologies. In: Raju, N.J. (ed.) Geostatistical and Geospatial Approaches for the Characterization of Natural Resources in the Environment. Springer, Cham, Switzerland, pp. 829-833.

Mastella, L.S., Abel, M., De Ros, L.F., Perrin, M., Rainaud, J.F., 2007. Event ordering reasoning ontology applied to petrology and geological modelling. In: Castillo, O., Melin, P., Ross, O.M., Cruz, R.S., Pedrycz, W., Kacprzyk, J. (eds.) Theoretical Advances and Applications of Fuzzy Logic and Soft Computing, Springer, Berlin/Heidelberg, pp. 465-475.

Mark, D.M., Skupin, A., Smith, B., 2001. Features, objects, and other things: Ontological distinctions in the geographic domain. In: Montello, D.R. (ed.) Proceedings of the International Conference on Spatial Information Theory, Morro Bay, CA, pp. 489-502.

Michalak, J., 2005. Topological conceptual model of geological relative time scale for geoinformation systems. Computers & Geosciences, 31, 865-876.

Perrin, M., Zhu, B., Rainaud, J.F., Schneider, S., 2005. Knowledge-driven applications for geological modeling. Journal of Petroleum Science and Engineering, 47(1), 89-104.

Peters, S.E., Zhang, C., Livny, M., Re, C., 2014. A machine reading system for assembling synthetic paleontological databases. PLoS One, 9, e113523. doi: 10.1371/journal.pone.0113523

Raskin, R.G., Pan, M.J., 2005. Knowledge representation in the semantic web for Earth and environmental terminology (SWEET). Computers & Geosciences, 31, 1119-1125.

Reitsma, F., Albrecht, J., 2005. Modeling with the Semantic Web in the Geosciences. IEEE Intelligent Systems, 20(2), 86-88.

Rohde, R.A., 2005. Introduction to the GeoWhen Database. http://www.stratigraphy.org/bak/geowhen/index.html (Accessed on December 15, 2017)

Sen, M., Duffy, T., 2005. GeoSciML: development of a generic geoscience markup language. Computers & Geosciences, 31(9), 1095-1103.

Sheth, A., 2014a. Smart Data - How you and I will exploit Big Data for personalized digital health and many other activities. In: Proceedings of the 2014 IEEE International Conference on Big Data, Washington, DC, pp. 2-3.

Sheth, A., 2014b. Transforming big data into smart data: Deriving value via harnessing volume, variety, and velocity using semantic techniques and technologies. In: Proceedings of the 30th IEEE International Conference on Data Engineering (ICDE), Chicago, IL, pp. 2-2.

Sotnykova, A., Vangenot, C., Cullot, N., Bennacer, N., Aufaure, M.-A., 2005. Semantic mappings in description logics for spatio-temporal database schema integration. In: Spaccapietra, S., Zimanyi (eds.) Lecture Notes in Computer Science (vol. 3534). Springer, Berlin/Heidelberg, pp. 143-167.

Stefani, C., Brunetaud, X., Janvier-Badosa, S., Beck, K.v., De Luca, L., Al-Mukhtar, M., 2014. Developing a toolkit for mapping and displaying stone alteration on a web-based documentation platform. Journal of Cultural Heritage, 15, 1-9.

Su, X., Gulla, J.A., 2004. Semantic enrichment for ontology mapping. In: Meziane, F., Metais, E. (eds.) Proceedings of the 9th International Conference on Application of Natural Language to Information Systems (NLDB 2004), Salford, UK, pp. 217-228.

TSCreator, 2017. Time Scale Creator. https://engineering.purdue.edu/Stratigraphy/tscreator/index/index.php (Accessed on February 22, 2018)

Varela, S., González-Hernández, J., Sgarbi, L.F., Marshall, C., Uhen, M.D., Peters, S., McClennen, M., 2015. paleobioDB: an R package for downloading, visualizing and processing data from the Paleobiology Database. Ecography, 38, 419-425.

Visser, U., 2005. Intelligent information integration for the Semantic Web. Springer.

Zheng, J.G., Fu, L., Ma, X., Fox, P., 2015. SEM+: tool for discovering concept mapping in Earth science related domain. Earth Science Informatics, 8, 95-102.

Zhong, J., Aydin, A., McGuinness, D.L., 2009. Ontology of fractures. Journal of Structural Geology, 31(3), 251-259.

Build and visualize an ontology for the local geologic time scale of North America

Ontology-driven retrieval and display of fossil occurrences and geologic maps

Multi-source information query and comparison to enable exploratory analysis

A successful case study towards smart geoscience data service
