IN51C-1869
Scaling to Diversity: The DERECHOS Distributed Infrastructure for Analyzing and Sharing Data
Michael Lee Rilee1,3, Kwo-Sen Kuo2,3, Thomas Clune3, Amidu Oloso3,4, Paul Geoffrey Brown5
1Rilee Systems Technologies LLC
2Bayesics LLC
3NASA Goddard Space Flight Center
4Science Systems and Applications Inc.
5Paradigm4
❖ Supports data placement alignment and diversity in SciDB.
  ➢ Original Sloan Digital Sky Survey (right-justified) format adapted to new distributed storage environment.
“The Treacherous Last Mile”
❖ Earth Science research often requires analyzing multiple, diverse datasets together.
❖ Existing systems scale only for storing, searching, and providing vast volumes and varieties of data.
  ➢ Researchers as end-users search, order, and download data to their local systems.
❖ Effort and time spent marshaling, downloading, managing, and locally combining data mean increased hardware/software costs and less time for research.
  ➢ Researchers must pay for the expensive “last mile” to make the data useful and obtain scientific results.
  ➢ Automating such data management and eliminating the need for such end-user data preparation increases the resources available for research.
❖ Big-Data database technology can scale to the volume of the data by using distributed, scalable compute and data storage systems, but innovative infrastructure is needed for the variety of Earth Science data.

Why SciDB?
Resource Consumption Advantages
❖ Minimize download and local data management
❖ Free end-user resources for research and science
Performance Advantages
❖ With its array data model, better suited for scientific data than relational databases
❖ Better optimization than Spark with tightly coupled analysis layer and storage layer allowing better
R, Python, Matlab, Julia,…
MPP Database
❖ The hierarchical triangular mesh (HTM) supports integration of diverse data on the SciDB distributed array database.
❖ Using data "in place" on scalable systems increases research productivity by eliminating the costs associated with the treacherous last mile for the end-user.
Current Focuses in Making Earth Science Data Usable
Triangles: N0 (red), N01 (green), N012 (purple), N0123 (cyan).
Big analytics without big hassles
➢ Spatiotemporal indexing is a critical cross-cutting component.
➢ HTM enables the compute and storage affinity needed to efficiently perform co-located and conditional analyses, minimizing data transfers.
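The compute and storage affinity described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that every observation already carries a shared integer spatial index and that the leading bits of that index select a chunk, which is then assigned to a database instance. The constants `INDEX_BITS`, `PREFIX_BITS`, and `N_INSTANCES` and the functions `chunk_of`, `instance_of`, and `place` are hypothetical; this is not SciDB's actual chunk distribution mechanism.

```python
from collections import defaultdict

INDEX_BITS = 62    # hypothetical width of the shared spatial index
PREFIX_BITS = 8    # hypothetical number of leading bits that define a chunk
N_INSTANCES = 4    # hypothetical number of database instances


def chunk_of(spatial_index: int) -> int:
    """Return the chunk id: the leading PREFIX_BITS of the spatial index."""
    return spatial_index >> (INDEX_BITS - PREFIX_BITS)


def instance_of(spatial_index: int) -> int:
    """Assign a chunk to an instance (simple round-robin, for illustration)."""
    return chunk_of(spatial_index) % N_INSTANCES


def place(records):
    """Group (spatial_index, value) records by the instance that stores them."""
    placement = defaultdict(list)
    for spatial_index, value in records:
        placement[instance_of(spatial_index)].append((spatial_index, value))
    return dict(placement)


# Two different datasets observing the same region share the index prefix,
# so their records land on the same instance and can be combined locally.
shift = INDEX_BITS - PREFIX_BITS
swath_obs = [((0x91 << shift) | 5, 281.4)]       # made-up swath sample
grid_cell = [((0x91 << shift) | 99999, 0.37)]    # made-up gridded sample
elsewhere = [((0x92 << shift) | 7, 1.0)]         # a different region

placement = place(swath_obs + grid_cell + elsewhere)
```

Because placement depends only on the shared spatial index and not on each array's native indices, records from both datasets for the same region end up together, and a conditional analysis over that region needs no data transfer between instances.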
Left-justified HTM Bit Format
Array data model
The new left-justified HTM bit format enables multiresolution integer intervals to represent geometries on distributed computers. The right-justified format does not support distribution via an integer index because triangles at different resolution levels are redundantly spread across the number line.
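The difference between the two formats can be sketched in a few lines of Python. The sketch assumes the SDSS-style right-justified convention in which the eight root triangles are numbered 8 to 15 (N0 = 12) and each child is `4 * parent + k`. The left-justified layout used here (path bits in the high-order positions, a 5-bit resolution level in the low-order positions, `MAX_LEVEL = 27`) is a hypothetical layout chosen for illustration and is not necessarily the exact bit format of the poster.

```python
MAX_LEVEL = 27                       # hypothetical deepest resolution level
LEVEL_BITS = 5                       # hypothetical low-order field for the level
LEVEL_MASK = (1 << LEVEL_BITS) - 1   # path field is 3 + 2 * MAX_LEVEL bits wide


def right_to_left(rid: int) -> int:
    """Convert a right-justified id (leading marker bit) to left-justified."""
    n = rid.bit_length()
    if n < 4 or n % 2:
        raise ValueError("not a right-justified HTM id")
    level = (n - 4) // 2
    if level > MAX_LEVEL:
        raise ValueError("level exceeds MAX_LEVEL")
    path = rid ^ (1 << (n - 1))                  # drop the leading marker bit
    prefix = path << (2 * (MAX_LEVEL - level))   # push the path to the top
    return (prefix << LEVEL_BITS) | level


def interval(lid: int):
    """Integer interval covering a triangle and all of its descendants."""
    level = lid & LEVEL_MASK
    free = 2 * (MAX_LEVEL - level)               # path bits below this level
    prefix = lid >> LEVEL_BITS
    lo = prefix << LEVEL_BITS
    hi = ((prefix | ((1 << free) - 1)) << LEVEL_BITS) | LEVEL_MASK
    return lo, hi


# Right-justified ids: N0 = 12, its child N01 = 4 * 12 + 1 = 49, sibling N1 = 13.
# On the right-justified number line N1 sits between N0 and N0's own child.
n0, n01, n1 = 12, 49, 13
lo0, hi0 = interval(right_to_left(n0))
lo01, hi01 = interval(right_to_left(n01))
lo1, hi1 = interval(right_to_left(n1))
```

In this layout the child interval `[lo01, hi01]` lies inside the parent interval `[lo0, hi0]`, while the sibling interval starts above `hi0`. A geometry at any resolution therefore maps to contiguous integer ranges, which is what makes range-based distribution across instances possible.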
Complex analytics
Initial Forays into HTM-based Earth Science Analysis on SciDB
Commodity clusters or cloud
P. Brown, GeoInfo 2014
But… Array indices are “abstract” – every array has its own!
How can diverse datasets be aligned on SciDB's shared-nothing architecture to maximize performance?
Earth Science Data is Diverse – Even in Logical Model
❖ Producing and archiving products
❖ Cataloging metadata for discovery
❖ Browsing
❖ Ordering & download
❖ Common formats through APIs
Common Earth Science Data Models
Query-based analysis
Combined
Hierarchical Triangular Mesh Indexing
Versatile, efficient geometric calculations
Summary
https://developer.earthdata.nasa.gov/cmr/user-guide
LAST MILE
Data Centers
Acknowledgement
Funding for this research is provided by the NASA Earth Science Technology Office (ESTO) through the Advanced Information Systems Technology (AIST) program, for which we are very grateful.
❖ Big Data technologies help eliminate “last mile” data costs to researcher end-users.
❖ Co-location/data placement is required for efficient use of distributed cluster/cloud resources.
❖ HTM is built on efficient geometric calculations, compact representation, and balanced tradeoffs.
❖ Updated HTM lays the foundation for combining diverse data via geometric metadata and distributed “in place” computation using SciDB.