IN43D-07 AGU Fall Meeting 2016
Implementing a Data Quality Strategy to simplify access to data Kelsey Druken, Claire Trenham, Ben Evans, Clare Richards, Jingbo Wang, & Lesley Wyborn National Computational Infrastructure, Canberra, Australia
NCI Australia • NCI hosts one of Australia’s largest repositories (10+ PBytes) of research data collections • Spanning data collections from climate, coasts, oceans and geophysics through to astronomy, bioinformatics and the social sciences
© NCI Australia 2016
[email protected]
nci.org.au
NCI Australia Key to maximizing benefit of NCI’s collections and computational capabilities: Ensuring seamless interoperable access to these datasets
© NCI Australia 2016
[email protected]
nci.org.au
Data Quality Strategy (DQS) Key to maximizing benefit of NCI’s collections and computational capabilities: Ensuring seamless interoperable access to these datasets
© NCI Australia 2016
[email protected]
nci.org.au
The Goal • Combining data • Visualising • How can we enable this type of easy access and use?
© NCI Australia 2016
[email protected]
nci.org.au
How data collections are accessed Collections are being accessed and utilised from a broad range of options • Direct access on filesystem • Web and data services • Data portals • Virtual labs (e.g., virtual desktops)
eReefs online analysis portal
© NCI Australia 2016
[email protected]
nci.org.au
Key Challenges •
Application of community-agreed data standards to the broad set of Earth systems and environmental data that are being used
•
Within these disciplines, data span a wide range of: - Gridded - Non-gridded (i.e., trajectories/profiles, point data) - Coordinate reference projections - Resolutions
© NCI Australia 2016
[email protected]
nci.org.au
Motivation: Data Management Maturity Program http://dataservices.agu.org/dmm/ DMM Capability – 25 Processes to Perform, Manage, Define Shelley Stall Assistant Director, Enterprise Data Management Program
1. Data Management Strategy Process Area 1. 2. 3. 4. 5.
Data Management Strategy Communications Data Management Function Grant Strategy/Business Case Funding
2. Data Governance Process Area 6. 7. 8.
Governance Management Vocabulary/Glossary Metadata Management
3. Data Quality Process Area 9. 10. 11. 12.
© NCI Australia 2016
Data Quality Strategy Data Profiling Data Quality Assessment Data Cleansing and Curation
[email protected]
4. Data Operations Process Area 13. Data Requirements Definition 14. Data Lifecycle Management 15. Contribution / Provider Management
5. Platform and Architecture Process Area 16. 17. 18. 19. 20.
Architectural Standards Architectural Approach Data Management Platform Data Integration / Data Linking Data Archiving and Preservation
6. Infrastructure Support Practices 21. 22. 23. 24. 25.
Measurement and Analysis Process Management Process Quality Assurance Risk Management Configuration Management
nci.org.au
Data Quality Strategy (DQS) Data Quality Strategy (DQS): What does it involve? 1.
Underlying High Performance Data (HPD) file format
2.
Close collaboration with data custodians and managers •
Planning, designing, and assessing the data collections
3.
Quality control through compliance with recognised community standards
4.
Data assurance through demonstrated functionality across common platforms, tools, and services © NCI Australia 2016
[email protected]
nci.org.au
Data Quality Strategy (DQS) Data Quality Strategy (DQS): What does it involve? 1.
Underlying High Performance Data (HPD) file format
2.
Close collaboration with data custodians and managers •
Planning, designing, and assessing the data collections
3.
Quality control through compliance with recognised community standards
4.
Data assurance through demonstrated functionality across common platforms, tools, and services © NCI Australia 2016
[email protected]
nci.org.au
Where to start? 1. Climate/ESS Model Assets and Data Products 2. Earth and Marine Observations and Data Products 3. Geoscience Collections 4. Terrestrial Ecosystems Collections 5. Water Management and Hydrology Collections Data Collections
Approx. Capacity
CMIP5, CORDEX, ACCESS Models
5 Pbytes
Satellite Earth Obs: LANDSAT, Himawari-8, Sentinel, MODIS, INSAR
2 Pbytes
Digital Elevation, Bathymetry Onshore/Offshore Geophysics
1 Pbytes
Seasonal Climate
700 Tbytes
Bureau of Meteorology Observations
350 Tbytes
Bureau of Meteorology Ocean-Marine
350 Tbytes
Terrestrial Ecosystem
290 Tbytes
Reanalysis products
100 Tbytes
© NCI Australia 2016
[email protected]
nci.org.au
NetCDF/HDF Common Data Formats 1. Climate/ESS Model Assets and Data Products 2. Earth and Marine Observations and Data Products 3. Geoscience Collections 4. Terrestrial Ecosystems Collections 5. Water Management and Hydrology Collections Data Collections
Approx. Capacity
CMIP5, CORDEX, ACCESS Models
5 Pbytes
Satellite Earth Obs: LANDSAT, Himawari-8, Sentinel, MODIS, INSAR
2 Pbytes
Digital Elevation, Bathymetry Onshore/Offshore Geophysics
1 Pbytes
Seasonal Climate Bureau of Meteorology Observations Bureau of Meteorology Ocean-Marine
NetCDF common data format
700 Tbytes 350 Tbytes 350 Tbytes
Terrestrial Ecosystem
290 Tbytes
Reanalysis products
100 Tbytes
© NCI Australia 2016
[email protected]
nci.org.au
National Environmental Research Data Interoperability Platform (NERDIP) Biodiversity & Climate Change VL
Climate & Weather Science Lab
eMAST Speddexes
eReefs
AGDC VL
All Sky Virtual Observatory
VGL
Globe Claritas
VHIRL
Open Nav Surface
Workflow Engines, Virtual Laboratories (VL’s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS
Models
Fortran, C, C++, MPI, OpenMP
Python, R, MatLab, IDL
Visualisation Drishti
ANDS/RDA Portal
AODN/IMOS Portal
Tools
TERN Portal
Data. gov.au
AuScope Portal
Digital Bathymetry & Elevation Portal
Data Portals
National Environmental Research Data Interoperability Platform (NERDIP) Open DAP
OGC W*TS
OGC SWE
OGC W*PS
OGC
netCDF-CF
WCS
OGC WFS
OGC WMS
RDF, LD
Data Conventions
Fast “whole-of-library” catalogue
CS-W
Direct Access
Services Layer
Vocab Service
PROV Service
ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL
API Layers
NetCDF-4 Climate
HP Data Library Layer
© NCI Australia 2016
NetCDF-4 Weather
NetCDF-4 Oceans
NetCDF-4 Bathy
NetCDF-4 EO
[HDF4EOS]
[Airborne Geophysics]
[SEG-Y]
[FITS]
[LAS LiDAR]
HDF5 ??
HDF5 Lustre
[email protected]
Other Storage (e.g., HDFS)
nci.org.au
National Environmental Research Data Interoperability Platform (NERDIP) Biodiversity & Climate Change VL
Climate & Weather Science Lab
eMAST Speddexes
eReefs
AGDC VL
All Sky Virtual Observatory
VGL
Globe Claritas
VHIRL
Open Nav Surface
Workflow Engines, Virtual Laboratories (VL’s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS
Models
Fortran, C, C++, MPI, OpenMP
Python, R, MatLab, IDL
Visualisation Drishti
ANDS/RDA Portal
AODN/IMOS Portal
Tools
TERN Portal
Data. gov.au
AuScope Portal
Digital Bathymetry & Elevation Portal
Data Portals
National Environmental Research Data Interoperability Platform (NERDIP) Open DAP
OGC W*TS
OGC SWE
OGC W*PS
OGC
netCDF-CF
WCS
OGC WFS
OGC WMS
RDF, LD
Data Conventions
Fast “whole-of-library” catalogue
CS-W
Direct Access
Services Layer
Vocab Service
PROV Service
ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL
API Layers
NetCDF-4 Climate
HP Data Library Layer
© NCI Australia 2016
NetCDF-4 Weather
NetCDF-4 Oceans
NetCDF-4 Bathy
NetCDF-4 EO
[HDF4EOS]
[Airborne Geophysics]
[SEG-Y]
[FITS]
[LAS LiDAR]
HDF5 ??
HDF5 Lustre
[email protected]
Other Storage (e.g., HDFS)
nci.org.au
National Environmental Research Data Interoperability Platform (NERDIP) Biodiversity & Climate Change VL
Climate & Weather Science Lab
eMAST Speddexes
eReefs
AGDC VL
All Sky Virtual Observatory
VGL
Globe Claritas
VHIRL
Open Nav Surface
Workflow Engines, Virtual Laboratories (VL’s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS
Models
Fortran, C, C++, MPI, OpenMP
Python, R, MatLab, IDL
Visualisation Drishti
ANDS/RDA Portal
AODN/IMOS Portal
Tools
TERN Portal
Data. gov.au
AuScope Portal
Digital Bathymetry & Elevation Portal
Data Portals
National Environmental Research Data Interoperability Platform (NERDIP) Open DAP
OGC W*TS
OGC SWE
OGC W*PS
OGC
netCDF-CF
WCS
OGC WFS
OGC WMS
RDF, LD
Data Conventions
Fast “whole-of-library” catalogue
CS-W
Direct Access
Services Layer
Vocab Service
PROV Service
ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL
API Layers
NetCDF-4 Climate
HP Data Library Layer
© NCI Australia 2016
NetCDF-4 Weather
NetCDF-4 Oceans
NetCDF-4 Bathy
NetCDF-4 EO
[HDF4EOS]
[Airborne Geophysics]
[SEG-Y]
[FITS]
[LAS LiDAR]
HDF5 ??
HDF5 Lustre
[email protected]
Other Storage (e.g., HDFS)
nci.org.au
National Environmental Research Data Interoperability Platform (NERDIP) Biodiversity & Climate Change VL
Climate & Weather Science Lab
eMAST Speddexes
eReefs
AGDC VL
All Sky Virtual Observatory
VGL
Globe Claritas
VHIRL
Open Nav Surface
Workflow Engines, Virtual Laboratories (VL’s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS
Models
Fortran, C, C++, MPI, OpenMP
Python, R, MatLab, IDL
Visualisation Drishti
ANDS/RDA Portal
AODN/IMOS Portal
Tools
TERN Portal
Data. gov.au
AuScope Portal
Digital Bathymetry & Elevation Portal
Data Portals
National Environmental Research Data Interoperability Platform (NERDIP) Open DAP
OGC W*TS
OGC SWE
OGC W*PS
OGC
netCDF-CF
WCS
OGC WFS
OGC WMS
RDF, LD
Data Conventions
Fast “whole-of-library” catalogue
CS-W
Direct Access
Services Layer
Vocab Service
PROV Service
ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL
API Layers
NetCDF-4 Climate
HP Data Library Layer
© NCI Australia 2016
NetCDF-4 Weather
NetCDF-4 Oceans
NetCDF-4 Bathy
NetCDF-4 EO
[HDF4EOS]
[Airborne Geophysics]
[SEG-Y]
[FITS]
[LAS LiDAR]
HDF5 ??
HDF5 Lustre
[email protected]
Other Storage (e.g., HDFS)
nci.org.au
National Environmental Research Data Interoperability Platform (NERDIP) Biodiversity & Climate Change VL
Climate & Weather Science Lab
eMAST Speddexes
eReefs
AGDC VL
All Sky Virtual Observatory
VGL
Globe Claritas
VHIRL
Open Nav Surface
Workflow Engines, Virtual Laboratories (VL’s), Science Gateways Ferret, NCO, GDL, GDAL, GRASS, QGIS
Models
Fortran, C, C++, MPI, OpenMP
Python, R, MatLab, IDL
Visualisation Drishti
ANDS/RDA Portal
AODN/IMOS Portal
Tools
TERN Portal
Data. gov.au
AuScope Portal
Digital Bathymetry & Elevation Portal
Data Portals
National Environmental Research Data Interoperability Platform (NERDIP) Open DAP
OGC W*TS
OGC SWE
OGC W*PS
OGC
netCDF-CF
WCS
OGC WFS
OGC WMS
RDF, LD
Data Conventions
Fast “whole-of-library” catalogue
CS-W
Direct Access
Services Layer
Vocab Service
PROV Service
ISO 19115, ACDD, RIF-CS, DCAT, etc. GDAL
API Layers
NetCDF-4 Climate
HP Data Library Layer
© NCI Australia 2016
NetCDF-4 Weather
NetCDF-4 Oceans
NetCDF-4 Bathy
NetCDF-4 EO
[HDF4EOS]
[Airborne Geophysics]
[SEG-Y]
[FITS]
[LAS LiDAR]
HDF5 ??
HDF5 Lustre
[email protected]
Other Storage (e.g., HDFS)
nci.org.au
Data Quality Strategy (DQS) Data Quality Strategy (DQS): What does it involve? 1.
Underlying High Performance Data (HPD) file format
2.
Close collaboration with data custodians and managers •
Planning, designing, and assessing the data collections
3.
Quality control through compliance with recognised community standards
4.
Data assurance through demonstrated functionality across common platforms, tools, and services © NCI Australia 2016
[email protected]
nci.org.au
Data Quality Strategy (DQS) Data Quality Strategy (DQS): What does it involve? 1.
Underlying High Performance Data (HPD) file format
2.
Close collaboration with data custodians and managers •
Planning, designing, and assessing the data collections
3.
Quality control through compliance with recognised community standards
4.
Data assurance through demonstrated functionality across common platforms, tools, and services © NCI Australia 2016
[email protected]
nci.org.au
Many levels of metadata Collection & dataset-levels
Collection
(e.g., parent-child metadata) ISO-19115, ANZLIC, etc.
Datasets
File (granule)-level Contains 2 types of metadata: (1) Variable-level (CF-Convention)
Datasets
Datasets
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
(2) Global file-level (ACDD**) **Can link to collection/dataset metadata
© NCI Australia 2016
File-level
Variablelevel(s)
[email protected]
nci.org.au
NCI’s Current NetCDF Data Holdings
FORMAT By collection
© NCI Australia 2016
[email protected]
nci.org.au
NCI’s Current NetCDF Data Holdings
The motivation: reduce ‘none’
FORMAT By collection
© NCI Australia 2016
[email protected]
nci.org.au
Existing Community Standards CF Conventions • Climate and Forecast Conventions and Metadata: http://cfconventions.org/
ACDD • Attribute Convention for Data Discovery: http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery
Together, these two standards define several categories of metadata ensuring: Usage, discoverability, and understanding of the data contents
© NCI Australia 2016
[email protected]
nci.org.au
Compliance checker • Want to adopt or utilise existing community checkers if possible • Two main options: • UK Reading (CF-Convention website links to this one) • IOOS (growing fast, designed to be modified and extended)
Our own modifications • Needed our own wrapper to enable collection-level scans • Tailor our output and reporting
© NCI Australia 2016
[email protected]
nci.org.au
Compliance checker Summarised version on the compliance status.
The break down… compliance scores and also measure of consistency across the collection
Providing attack plan for improvements: Make it easy for data managers to efficiently address and meet baseline compliance © NCI Australia 2016
[email protected]
nci.org.au
The result: “win-win” for all Data Quality Strategy In Action • Progressive improvement in the quality of the data across the different subject domains • Improves the ease by which users can access, utilise and combine the datasets from across NCI's holdings
© NCI Australia 2016
[email protected]
nci.org.au
Data Quality Strategy (DQS) Data Quality Strategy (DQS): What does it involve? 1.
Underlying High Performance Data (HPD) file format
2.
Close collaboration with data custodians and managers •
Planning, designing, and assessing the data collections
3.
Quality control through compliance with recognised community standards
4.
Data assurance through demonstrated functionality across common platforms, tools, and services © NCI Australia 2016
[email protected]
nci.org.au
Functionality tests • Extend to test “usability” across wide spectrum of scientific tools and data services • Commonly used libraries (e.g., netCDF, HDF, GDAL, etc.) • Accessibility by data servers (e.g., THREDDS, Hyrax, GeoServer) • Validation against scientific analysis and programming platforms (e.g., Python, Matlab, R, QGIS) • Visualization tools (e.g., ParaView, IDV, WMS-viewers)
© NCI Australia 2016
[email protected]
nci.org.au
Functionality tests
Primary motivation: Positive experience for our users. Expectation that advertised collections and services are usable.
© NCI Australia 2016
[email protected]
nci.org.au
Bonus results Bonus results: • Feedback to the local and international communities The more we test and test, the more we learn • Functionality tests lead to reference and training material for our user community
© NCI Australia 2016
[email protected]
nci.org.au
Bonus: User reference material
© NCI Australia 2016
[email protected]
nci.org.au
Bonus results Bonus results: • Feedback to the local and international communities The more we test and test, the more we learn • Functionality tests lead to reference and training material for our user community • Benefits of standardised and interoperable data formats © NCI Australia 2016
[email protected]
nci.org.au
Summary/Future Work What’s next? • Automating and extending these measures and tests across our full collection • What about the broader file formats? • Staying connected and working with international communities • E.g., NSF Funded “Advancing netCDF-CF for the Geoscience Community” (EarthCube) https://www.earthcube.org/content/advancing-netcdf-cf-geoscience-community
© NCI Australia 2016
[email protected]
nci.org.au