LAST MILE Data Centers

IN51C-1869

Scaling to Diversity: The DERECHOS Distributed Infrastructure for Analyzing and Sharing Data

Michael Lee Rilee1,3, Kwo-Sen Kuo2,3, Thomas Clune3, Amidu Oloso3,4, Paul Geoffrey Brown5

1 Rilee Systems Technologies LLC
2 Bayesics LLC
3 NASA Goddard Space Flight Center
4 Science Systems and Applications Inc.
5 Paradigm4

❖ Supports data placement alignment and diversity in SciDB.

➢ Original Sloan Digital Sky Survey (right-justified) format adapted to new distributed storage environment.

“The Treacherous Last Mile”

Why SciDB?

❖ Earth Science research often requires analyzing multiple, diverse datasets together.
❖ Existing systems scale only for storing, searching, and providing vast volumes and varieties of data.

Resource Consumption Advantages

❖ Minimize download and local data management
❖ Free end-user resources for research and science

Performance Advantages

❖ The array data model is better suited to scientific data than relational databases
❖ Better optimization than Spark, with tightly coupled analysis and storage layers allowing better

➢ Researchers, as end-users, search for, order, and download data to their local systems.

❖ Effort and time spent marshaling, downloading, managing, and locally combining data mean increased hardware/software costs and less time for research.

➢ Researchers must pay for the expensive “last mile” to make the data useful and obtain scientific results.
➢ Automating such data management and eliminating the need for such end-user data preparation increases the resources available for research.

❖ Big-Data database technology can scale to the volume of the data by using distributed, scalable compute and data storage systems, but innovative infrastructure is needed for the variety of Earth Science data.

R, Python, Matlab, Julia,…

MPP Database

❖ The hierarchical triangular mesh (HTM) supports integration of diverse data on the SciDB distributed array database.
❖ Using data "in place" on scalable systems increases research productivity by eliminating the costs associated with the treacherous last mile for the end-user.

Current Focuses in Making Earth Science Data Usable

Triangles: N0 (red), N01 (green), N012 (purple), N0123 (cyan).

Big analytics without big hassles

➢ Spatiotemporal indexing is a critical cross-cutting component.

➢ HTM enables the compute and storage affinity needed to perform co-located and conditional analyses efficiently, minimizing data transfers.
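The affinity idea above can be sketched in a few lines. This is a toy illustration, not SciDB's actual chunk distribution: the instance count, the chunk shift, the function names, and the sample records are all hypothetical. Two datasets are placed with the same function of a shared integer spatial index, so a conditional analysis only ever pairs data that already sit on the same instance.

```python
from collections import defaultdict

NUM_INSTANCES = 4  # hypothetical number of shared-nothing instances
CHUNK_SHIFT = 8    # hypothetical: indices that differ only in the low 8 bits share a chunk


def instance_for(spatial_index):
    """Map a spatial index to an instance via its chunk number."""
    return (spatial_index >> CHUNK_SHIFT) % NUM_INSTANCES


def place(records):
    """Group (spatial_index, value) records by owning instance, then by chunk."""
    placement = defaultdict(lambda: defaultdict(list))
    for index, value in records:
        placement[instance_for(index)][index >> CHUNK_SHIFT].append(value)
    return placement


def cloudy_mean(temperature, cloud_mask):
    """Mean temperature over chunks whose mask has any True flag.

    Each instance consults only its own copy of the mask, so no data
    crosses instance boundaries.
    """
    placed_t, placed_m = place(temperature), place(cloud_mask)
    selected = []
    for inst, chunks in placed_t.items():
        local_mask = placed_m[inst] if inst in placed_m else {}
        for chunk, temps in chunks.items():
            if any(local_mask.get(chunk, [])):
                selected.extend(temps)
    return sum(selected) / len(selected) if selected else None


# Made-up records: (spatial index, value)
temperature = [(0x100, 280.0), (0x101, 282.0), (0x250, 290.0), (0x3A0, 300.0)]
cloud_mask = [(0x102, True), (0x251, False), (0x4FF, True)]

if __name__ == "__main__":
    print(cloudy_mean(temperature, cloud_mask))
```

Because both datasets pass through the same `instance_for`, the chunk holding indices 0x100 to 0x1FF lands on the same instance for temperature and for the mask, which is the property the conditional analysis relies on.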

Left-justified HTM Bit Format

Array data model

The new left-justified HTM bit format lets multiresolution integer intervals represent geometries on distributed computers. The right-justified format does not support distribution via an integer index because triangles are redundantly spread across the number line.
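A small sketch may make the contrast concrete. The right-justified encoder below follows the commonly described convention of a leading 1 bit, the root triangle, and then two bits per level. The left-justified layout is an illustrative assumption only (a 27-level path followed by a 5-bit resolution field) and is not claimed to be the exact DERECHOS bit format; all function names are hypothetical.

```python
# Illustrative only: field widths and helper names are assumptions,
# not the exact DERECHOS bit layout.
ROOTS = ["S0", "S1", "S2", "S3", "N0", "N1", "N2", "N3"]
MAX_LEVEL = 27   # assumed deepest subdivision level
LEVEL_BITS = 5   # assumed width of the resolution field


def parse(name):
    """Split a trixel name such as 'N012' into (root index, child digits)."""
    return ROOTS.index(name[:2]), [int(c) for c in name[2:]]


def right_justified(name):
    """Leading 1 bit, 3 root bits, then 2 bits per level, packed at the low end."""
    root, digits = parse(name)
    value = 0b1000 | root
    for d in digits:
        value = (value << 2) | d
    return value


def left_justified(name):
    """Path bits packed at the high end, resolution level in the low bits."""
    root, digits = parse(name)
    path = root
    for d in digits:
        path = (path << 2) | d
    path <<= 2 * (MAX_LEVEL - len(digits))  # pad the path to full depth
    return (path << LEVEL_BITS) | len(digits)


def interval(name):
    """Closed integer interval holding a trixel and all of its descendants."""
    _, digits = parse(name)
    free_bits = 2 * (MAX_LEVEL - len(digits)) + LEVEL_BITS
    low = (left_justified(name) >> free_bits) << free_bits
    return low, low | ((1 << free_bits) - 1)


if __name__ == "__main__":
    for name in ["N0", "N01", "N012", "N0123", "N1"]:
        print(name, right_justified(name), left_justified(name))
```

Under these assumptions the right-justified values for N0, N1, and N01 are 12, 13, and 49, so the unrelated triangle N1 sits between N0 and its own child. In the left-justified sketch, N0 and every descendant fall inside one contiguous interval, and N1 starts just past its upper end, which is what makes a single integer index usable for partitioning.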

Complex analytics

Initial Forays into HTM-based Earth Science Analysis on SciDB

Commodity clusters or cloud

P. Brown, GeoInfo 2014

But… Array indices are “abstract” – every array has its own!

How can diverse datasets be aligned on SciDB's shared-nothing architecture to maximize performance?

Earth Science Data is Diverse – Even in Logical Model

❖ Producing and archiving products
❖ Cataloging metadata for discovery
❖ Browsing
❖ Ordering & Download
❖ Common formats through APIs

Common Earth Science Data Models

Query-based analysis


Combined

Hierarchical Triangular Mesh Indexing

Versatile, efficient geometric calculations
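As a rough illustration of the geometric calculations involved, the sketch below finds the trixel that contains a point on the unit sphere using only cross and dot products. The vertex ordering and the child numbering follow the commonly described octahedron-based HTM construction and should be read as assumptions of this sketch; the function names are hypothetical.

```python
import math

# Octahedron vertices and root trixels; the ordering is an assumption
# based on the commonly described HTM construction.
V = [(0, 0, 1), (1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0), (0, 0, -1)]
ROOTS = {
    "S0": (V[1], V[5], V[2]), "S1": (V[2], V[5], V[3]),
    "S2": (V[3], V[5], V[4]), "S3": (V[4], V[5], V[1]),
    "N0": (V[1], V[0], V[4]), "N1": (V[4], V[0], V[3]),
    "N2": (V[3], V[0], V[2]), "N3": (V[2], V[0], V[1]),
}


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def unit(a):
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)


def midpoint(a, b):
    """Midpoint of the great-circle arc between two unit vectors."""
    return unit((a[0] + b[0], a[1] + b[1], a[2] + b[2]))


def inside(p, tri, eps=1e-12):
    """True if p lies on the inner side of all three edge planes."""
    a, b, c = tri
    return (dot(cross(a, b), p) >= -eps and
            dot(cross(b, c), p) >= -eps and
            dot(cross(c, a), p) >= -eps)


def trixel_name(point, level):
    """Name of the trixel containing `point` after `level` subdivisions."""
    p = unit(point)
    for name, tri in ROOTS.items():
        if inside(p, tri):
            break
    else:
        raise ValueError("point not found in any root trixel")
    for _ in range(level):
        v0, v1, v2 = tri
        w0, w1, w2 = midpoint(v1, v2), midpoint(v0, v2), midpoint(v0, v1)
        children = [(v0, w2, w1), (v1, w0, w2), (v2, w1, w0), (w0, w1, w2)]
        for i, child in enumerate(children):
            if inside(p, child):
                name, tri = name + str(i), child
                break
        else:
            raise ValueError("point not found in any child trixel")
    return name


if __name__ == "__main__":
    print(trixel_name((0.98, -0.1, 0.1), 3))
```

Each added level appends one digit to the name, so a deeper name always extends the shallower one. Points that fall exactly on a shared edge are assigned to the first matching trixel, a simplification a production index would need to handle more carefully.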

Summary

https://developer.earthdata.nasa.gov/cmr/user-guide


Acknowledgement

Funding for this research is provided by the NASA Earth Science Technology Office (ESTO) through the Advanced Information Systems Technology (AIST) program, for which we are very grateful.

❖ Big Data technologies help eliminate “last mile” data costs to researcher end-users.
❖ Co-location/data placement is required for efficient use of distributed cluster/cloud resources.
❖ HTM is built on efficient geometric calculations, compact representation, and balanced tradeoffs.
❖ The updated HTM lays the foundation for combining diverse data via geometric metadata and distributed “in place” computation using SciDB.