Implementing a Data Quality Strategy to simplify access to data

12 downloads 343 Views 4MB Size Report
Data Profiling. 11. Data Quality Assessment. 12. ... Quality Strategy (DQS). 1. Underlying High Performance Data (HPD) ... Python, R,. MatLab, IDL. Visualisation.
IN43D-07 AGU Fall Meeting 2016

Implementing a Data Quality Strategy to simplify access to data Kelsey Druken, Claire Trenham, Ben Evans, Clare Richards, Jingbo Wang, & Lesley Wyborn National Computational Infrastructure, Canberra, Australia

NCI Australia • NCI hosts one of Australia’s largest repositories (10+ PBytes) of research data collections • Spanning data collections from climate, coasts, oceans and geophysics through to astronomy, bioinformatics and the social sciences

© NCI Australia 2016

[email protected]

nci.org.au

NCI Australia Key to maximizing benefit of NCI’s collections and computational capabilities:  Ensuring seamless interoperable access to these datasets

© NCI Australia 2016

[email protected]

nci.org.au

Data Quality Strategy (DQS) Key to maximizing benefit of NCI’s collections and computational capabilities:  Ensuring seamless interoperable access to these datasets

© NCI Australia 2016

[email protected]

nci.org.au

The Goal • Combining data • Visualising • How can we enable this type of easy access and use?

© NCI Australia 2016

[email protected]

nci.org.au

How data collections are accessed Collections are being accessed and utilised from a broad range of options • Direct access on filesystem • Web and data services • Data portals • Virtual labs (e.g., virtual desktops)

eReefs online analysis portal

© NCI Australia 2016

[email protected]

nci.org.au

Key Challenges •

Application of community-agreed data standards to the broad set of Earth systems and environmental data that are being used



Within these disciplines, data span a wide range of: - Gridded - Non-gridded (i.e., trajectories/profiles, point data) - Coordinate reference projections - Resolutions

© NCI Australia 2016

[email protected]

nci.org.au

Motivation: Data Management Maturity Program http://dataservices.agu.org/dmm/ DMM Capability – 25 Processes to Perform, Manage, Define Shelley Stall Assistant Director, Enterprise Data Management Program

1. Data Management Strategy Process Area 1. 2. 3. 4. 5.

Data Management Strategy Communications Data Management Function Grant Strategy/Business Case Funding

2. Data Governance Process Area 6. 7. 8.

Governance Management Vocabulary/Glossary Metadata Management

3. Data Quality Process Area 9. 10. 11. 12.

© NCI Australia 2016

Data Quality Strategy Data Profiling Data Quality Assessment Data Cleansing and Curation

[email protected]

4. Data Operations Process Area 13. Data Requirements Definition 14. Data Lifecycle Management 15. Contribution / Provider Management

5. Platform and Architecture Process Area 16. 17. 18. 19. 20.

Architectural Standards Architectural Approach Data Management Platform Data Integration / Data Linking Data Archiving and Preservation

6. Infrastructure Support Practices 21. 22. 23. 24. 25.

Measurement and Analysis Process Management Process Quality Assurance Risk Management Configuration Management

nci.org.au

Data Quality Strategy (DQS) Data Quality Strategy (DQS): What does it involve? 1.

Underlying High Performance Data (HPD) file format

2.

Close collaboration with data custodians and managers •

Planning, designing, and assessing the data collections

3.

Quality control through compliance with recognised community standards

4.

Data assurance through demonstrated functionality across common platforms, tools, and services © NCI Australia 2016

[email protected]

nci.org.au

Data Quality Strategy (DQS) Data Quality Strategy (DQS): What does it involve? 1.

Underlying High Performance Data (HPD) file format

2.

Close collaboration with data custodians and managers •

Planning, designing, and assessing the data collections

3.

Quality control through compliance with recognised community standards

4.

Data assurance through demonstrated functionality across common platforms, tools, and services © NCI Australia 2016

[email protected]

nci.org.au

Where to start? 1. Climate/ESS Model Assets and Data Products 2. Earth and Marine Observations and Data Products 3. Geoscience Collections 4. Terrestrial Ecosystems Collections 5. Water Management and Hydrology Collections Data Collections

Approx. Capacity

CMIP5, CORDEX, ACCESS Models

5 Pbytes

Satellite Earth Obs: LANDSAT, Himawari-8, Sentinel, MODIS, INSAR

2 Pbytes

Digital Elevation, Bathymetry Onshore/Offshore Geophysics

1 Pbytes

Seasonal Climate

700 Tbytes

Bureau of Meteorology Observations

350 Tbytes

Bureau of Meteorology Ocean-Marine

350 Tbytes

Terrestrial Ecosystem

290 Tbytes

Reanalysis products

100 Tbytes

© NCI Australia 2016

[email protected]

nci.org.au

NetCDF/HDF Common Data Formats 1. Climate/ESS Model Assets and Data Products 2. Earth and Marine Observations and Data Products 3. Geoscience Collections 4. Terrestrial Ecosystems Collections 5. Water Management and Hydrology Collections Data Collections

Approx. Capacity

CMIP5, CORDEX, ACCESS Models

5 Pbytes

Satellite Earth Obs: LANDSAT, Himawari-8, Sentinel, MODIS, INSAR

2 Pbytes

Digital Elevation, Bathymetry Onshore/Offshore Geophysics

1 Pbytes

Seasonal Climate Bureau of Meteorology Observations Bureau of Meteorology Ocean-Marine

NetCDF common data format

700 Tbytes 350 Tbytes 350 Tbytes

Terrestrial Ecosystem

290 Tbytes

Reanalysis products

100 Tbytes

© NCI Australia 2016

[email protected]

nci.org.au

National Environmental Research Data Interoperability Platform (NERDIP) Biodiversity & Climate Change VL

Climate & Weather Science Lab

eMAST Speddexes

eReefs

AGDC VL

All Sky Virtual Observatory

VGL

Globe Claritas

VHIRL

Open Nav Surface

Workflow Engines, Virtual Laboratories (VL’s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS

Models

Fortran, C, C++, MPI, OpenMP

Python, R, MatLab, IDL

Visualisation Drishti

ANDS/RDA Portal

AODN/IMOS Portal

Tools

TERN Portal

Data. gov.au

AuScope Portal

Digital Bathymetry & Elevation Portal

Data Portals

National Environmental Research Data Interoperability Platform (NERDIP) Open DAP

OGC W*TS

OGC SWE

OGC W*PS

OGC

netCDF-CF

WCS

OGC WFS

OGC WMS

RDF, LD

Data Conventions

Fast “whole-of-library” catalogue

CS-W

Direct Access

Services Layer

Vocab Service

PROV Service

ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL

API Layers

NetCDF-4 Climate

HP Data Library Layer

© NCI Australia 2016

NetCDF-4 Weather

NetCDF-4 Oceans

NetCDF-4 Bathy

NetCDF-4 EO

[HDF4EOS]

[Airborne Geophysics]

[SEG-Y]

[FITS]

[LAS LiDAR]

HDF5 ??

HDF5 Lustre

[email protected]

Other Storage (e.g., HDFS)

nci.org.au

National Environmental Research Data Interoperability Platform (NERDIP) Biodiversity & Climate Change VL

Climate & Weather Science Lab

eMAST Speddexes

eReefs

AGDC VL

All Sky Virtual Observatory

VGL

Globe Claritas

VHIRL

Open Nav Surface

Workflow Engines, Virtual Laboratories (VL’s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS

Models

Fortran, C, C++, MPI, OpenMP

Python, R, MatLab, IDL

Visualisation Drishti

ANDS/RDA Portal

AODN/IMOS Portal

Tools

TERN Portal

Data. gov.au

AuScope Portal

Digital Bathymetry & Elevation Portal

Data Portals

National Environmental Research Data Interoperability Platform (NERDIP) Open DAP

OGC W*TS

OGC SWE

OGC W*PS

OGC

netCDF-CF

WCS

OGC WFS

OGC WMS

RDF, LD

Data Conventions

Fast “whole-of-library” catalogue

CS-W

Direct Access

Services Layer

Vocab Service

PROV Service

ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL

API Layers

NetCDF-4 Climate

HP Data Library Layer

© NCI Australia 2016

NetCDF-4 Weather

NetCDF-4 Oceans

NetCDF-4 Bathy

NetCDF-4 EO

[HDF4EOS]

[Airborne Geophysics]

[SEG-Y]

[FITS]

[LAS LiDAR]

HDF5 ??

HDF5 Lustre

[email protected]

Other Storage (e.g., HDFS)

nci.org.au

National Environmental Research Data Interoperability Platform (NERDIP) Biodiversity & Climate Change VL

Climate & Weather Science Lab

eMAST Speddexes

eReefs

AGDC VL

All Sky Virtual Observatory

VGL

Globe Claritas

VHIRL

Open Nav Surface

Workflow Engines, Virtual Laboratories (VL’s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS

Models

Fortran, C, C++, MPI, OpenMP

Python, R, MatLab, IDL

Visualisation Drishti

ANDS/RDA Portal

AODN/IMOS Portal

Tools

TERN Portal

Data. gov.au

AuScope Portal

Digital Bathymetry & Elevation Portal

Data Portals

National Environmental Research Data Interoperability Platform (NERDIP) Open DAP

OGC W*TS

OGC SWE

OGC W*PS

OGC

netCDF-CF

WCS

OGC WFS

OGC WMS

RDF, LD

Data Conventions

Fast “whole-of-library” catalogue

CS-W

Direct Access

Services Layer

Vocab Service

PROV Service

ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL

API Layers

NetCDF-4 Climate

HP Data Library Layer

© NCI Australia 2016

NetCDF-4 Weather

NetCDF-4 Oceans

NetCDF-4 Bathy

NetCDF-4 EO

[HDF4EOS]

[Airborne Geophysics]

[SEG-Y]

[FITS]

[LAS LiDAR]

HDF5 ??

HDF5 Lustre

[email protected]

Other Storage (e.g., HDFS)

nci.org.au

National Environmental Research Data Interoperability Platform (NERDIP) Biodiversity & Climate Change VL

Climate & Weather Science Lab

eMAST Speddexes

eReefs

AGDC VL

All Sky Virtual Observatory

VGL

Globe Claritas

VHIRL

Open Nav Surface

Workflow Engines, Virtual Laboratories (VL’s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS

Models

Fortran, C, C++, MPI, OpenMP

Python, R, MatLab, IDL

Visualisation Drishti

ANDS/RDA Portal

AODN/IMOS Portal

Tools

TERN Portal

Data. gov.au

AuScope Portal

Digital Bathymetry & Elevation Portal

Data Portals

National Environmental Research Data Interoperability Platform (NERDIP) Open DAP

OGC W*TS

OGC SWE

OGC W*PS

OGC

netCDF-CF

WCS

OGC WFS

OGC WMS

RDF, LD

Data Conventions

Fast “whole-of-library” catalogue

CS-W

Direct Access

Services Layer

Vocab Service

PROV Service

ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL

API Layers

NetCDF-4 Climate

HP Data Library Layer

© NCI Australia 2016

NetCDF-4 Weather

NetCDF-4 Oceans

NetCDF-4 Bathy

NetCDF-4 EO

[HDF4EOS]

[Airborne Geophysics]

[SEG-Y]

[FITS]

[LAS LiDAR]

HDF5 ??

HDF5 Lustre

[email protected]

Other Storage (e.g., HDFS)

nci.org.au

National Environmental Research Data Interoperability Platform (NERDIP) Biodiversity & Climate Change VL

Climate & Weather Science Lab

eMAST Speddexes

eReefs

AGDC VL

All Sky Virtual Observatory

VGL

Globe Claritas

VHIRL

Open Nav Surface

Workflow Engines, Virtual Laboratories (VL’s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS

Models

Fortran, C, C++, MPI, OpenMP

Python, R, MatLab, IDL

Visualisation Drishti

ANDS/RDA Portal

AODN/IMOS Portal

Tools

TERN Portal

Data. gov.au

AuScope Portal

Digital Bathymetry & Elevation Portal

Data Portals

National Environmental Research Data Interoperability Platform (NERDIP) Open DAP

OGC W*TS

OGC SWE

OGC W*PS

OGC

netCDF-CF

WCS

OGC WFS

OGC WMS

RDF, LD

Data Conventions

Fast “whole-of-library” catalogue

CS-W

Direct Access

Services Layer

Vocab Service

PROV Service

ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL

API Layers

NetCDF-4 Climate

HP Data Library Layer

© NCI Australia 2016

NetCDF-4 Weather

NetCDF-4 Oceans

NetCDF-4 Bathy

NetCDF-4 EO

[HDF4EOS]

[Airborne Geophysics]

[SEG-Y]

[FITS]

[LAS LiDAR]

HDF5 ??

HDF5 Lustre

[email protected]

Other Storage (e.g., HDFS)

nci.org.au

Data Quality Strategy (DQS) Data Quality Strategy (DQS): What does it involve? 1.

Underlying High Performance Data (HPD) file format

2.

Close collaboration with data custodians and managers •

Planning, designing, and assessing the data collections

3.

Quality control through compliance with recognised community standards

4.

Data assurance through demonstrated functionality across common platforms, tools, and services © NCI Australia 2016

[email protected]

nci.org.au

Data Quality Strategy (DQS) Data Quality Strategy (DQS): What does it involve? 1.

Underlying High Performance Data (HPD) file format

2.

Close collaboration with data custodians and managers •

Planning, designing, and assessing the data collections

3.

Quality control through compliance with recognised community standards

4.

Data assurance through demonstrated functionality across common platforms, tools, and services © NCI Australia 2016

[email protected]

nci.org.au

Many levels of metadata Collection & dataset-levels

Collection

(e.g., parent-child metadata) ISO-19115, ANZLIC, etc.

Datasets

File (granule)-level Contains 2 types of metadata: (1) Variable-level (CF-Convention)

Datasets

Datasets

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

Data file

(2) Global file-level (ACDD**) **Can link to collection/dataset metadata

© NCI Australia 2016

File-level

Variablelevel(s)

[email protected]

nci.org.au

NCI’s Current NetCDF Data Holdings

FORMAT By collection

© NCI Australia 2016

[email protected]

nci.org.au

NCI’s Current NetCDF Data Holdings

The motivation: reduce ‘none’

FORMAT By collection

© NCI Australia 2016

[email protected]

nci.org.au

Existing Community Standards CF Conventions • Climate and Forecast Conventions and Metadata: http://cfconventions.org/

ACDD • Attribute Convention for Data Discovery: http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery

Together, these two standards define several categories of metadata ensuring: Usage, discoverability, and understanding of the data contents

© NCI Australia 2016

[email protected]

nci.org.au

Compliance checker • Want to adopt or utilise existing community checkers if possible • Two main options: • UK Reading (CF-Convention website links to this one) • IOOS (growing fast, designed to be modified and extended)

Our own modifications • Needed our own wrapper to enable collection-level scans • Tailor our output and reporting

© NCI Australia 2016

[email protected]

nci.org.au

Compliance checker Summarised version on the compliance status.

The break down… compliance scores and also measure of consistency across the collection

Providing attack plan for improvements: Make it easy for data managers to efficiently address and meet baseline compliance © NCI Australia 2016

[email protected]

nci.org.au

The result: “win-win” for all Data Quality Strategy In Action • Progressive improvement in the quality of the data across the different subject domains • Improves the ease by which users can access, utilise and combine the datasets from across NCI's holdings

© NCI Australia 2016

[email protected]

nci.org.au

Data Quality Strategy (DQS) Data Quality Strategy (DQS): What does it involve? 1.

Underlying High Performance Data (HPD) file format

2.

Close collaboration with data custodians and managers •

Planning, designing, and assessing the data collections

3.

Quality control through compliance with recognised community standards

4.

Data assurance through demonstrated functionality across common platforms, tools, and services © NCI Australia 2016

[email protected]

nci.org.au

Functionality tests • Extend to test “usability” across wide spectrum of scientific tools and data services • Commonly used libraries (e.g., netCDF, HDF, GDAL, etc.) • Accessibility by data servers (e.g., THREDDS, Hyrax, GeoServer) • Validation against scientific analysis and programming platforms (e.g., Python, Matlab, R, QGIS) • Visualization tools (e.g., ParaView, IDV, WMS-viewers)

© NCI Australia 2016

[email protected]

nci.org.au

Functionality tests

Primary motivation: Positive experience for our users. Expectation that advertised collections and services are usable.

© NCI Australia 2016

[email protected]

nci.org.au

Bonus results Bonus results: • Feedback to the local and international communities The more we test and test, the more we learn • Functionality tests lead to reference and training material for our user community

© NCI Australia 2016

[email protected]

nci.org.au

Bonus: User reference material

© NCI Australia 2016

[email protected]

nci.org.au

Bonus results Bonus results: • Feedback to the local and international communities The more we test and test, the more we learn • Functionality tests lead to reference and training material for our user community • Benefits of standardised and interoperable data formats © NCI Australia 2016

[email protected]

nci.org.au

Summary/Future Work What’s next? • Automating and extending these measures and tests across our full collection • What about the broader file formats? • Staying connected and working with international communities • E.g., NSF Funded “Advancing netCDF-CF for the Geoscience Community” (EarthCube) https://www.earthcube.org/content/advancing-netcdf-cf-geoscience-community

© NCI Australia 2016

[email protected]

nci.org.au

Suggest Documents