Systems Briefing A: Materials Databases Logan Ward Postdoctoral Scholar Data Science and Learning Division Argonne National Laboratory
Materials Databases Are Not New 2
Resources of Materials Data Have Long History
• CRC Handbook: 1913
• ASM Handbooks: ~1920s
• JANAF Tables: 1964
• (Hultgren) Select Values of the Thermo. Properties of…: 1973
• Pauling File: 1995
• OQMD: 2013
• … and many more
Why Important Now? Why not “solved”?
Why the Sudden Prevalence? 3
Why care now? Data is more useful
Seko et al. PRB (2014), 054303
Chen et al. Chem. Mater. (2012)
Zhang et al. Acta Mat. (2018)
Ward et al. Acta Mat. (2018) Kusne et al. Sci Rep. (2014)
Today’s Overview 4
Key Questions:
1. What are the ongoing trends?
2. What are the major efforts?
3. What are the current challenges/gaps?
Conclusion: What is the dream?
*PSA: I work on the Materials Data Facility (MDF)
Materials Data Pyramid 5
Curation Effort
Interaction Frequency
Different Types of Data == Different Database Requirements! Image Credit: Jim Warren (NIST)
Working Data: Close to the Scientist 6
Everyday Data Needs
[Data] scientists need…
1. Unrestricted access to data
2. Portability
3. To connect data to tools (everyone has their own workflow)
4. Ability to share with collaborators
Creating usable data management systems is a huge and well-studied problem
Laboratory Inventory Management 7
NIST’s CDCS
Materials Commons
AFRL HyperThought
4CeeD
Key Features and Challenges 8
What makes a LIMS system?
1. Guided Data Capture + Schemas
2. User Accounts, Access Control
3. Ability to Query Data
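The three features above can be sketched in a few lines. This is a minimal illustration only: `ToyLIMS`, `SAMPLE_SCHEMA`, the field names, and the sample values are all hypothetical and are not taken from CDCS, Materials Commons, HyperThought, or 4CeeD.

```python
from dataclasses import dataclass, field

# Hypothetical schema: required field name -> expected Python type.
SAMPLE_SCHEMA = {"sample_id": str, "composition": str, "anneal_temp_K": float}


@dataclass
class ToyLIMS:
    """Illustrative in-memory store; not modeled on any real LIMS product."""
    schema: dict
    records: list = field(default_factory=list)
    readers: dict = field(default_factory=dict)  # sample_id -> set of user names

    def capture(self, owner, record):
        # Guided capture: reject records that do not match the schema.
        for key, expected in self.schema.items():
            if key not in record:
                raise ValueError(f"missing field: {key}")
            if not isinstance(record[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        self.records.append(dict(record))
        self.readers[record["sample_id"]] = {owner}

    def share(self, sample_id, user):
        # Access control: grant read access to a collaborator.
        self.readers[sample_id].add(user)

    def query(self, user, predicate):
        # Query: return only matching records that the user may read.
        return [r for r in self.records
                if user in self.readers[r["sample_id"]] and predicate(r)]


lims = ToyLIMS(SAMPLE_SCHEMA)
lims.capture("alice", {"sample_id": "S1", "composition": "Zr50Cu50", "anneal_temp_K": 600.0})
lims.capture("alice", {"sample_id": "S2", "composition": "Zr60Cu40", "anneal_temp_K": 750.0})
lims.share("S2", "bob")
hot = lims.query("bob", lambda r: r["anneal_temp_K"] > 700.0)
```

Here `bob` sees only the sample that was shared with him, while `alice` can query both of her own records.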
Codes already provide this
Will this concept scale? Yes, but…
1. Need integration, training with facilities
2. Adapting tools to large databases (minor)
3. Integration with analysis tools!
Widespread data management will provide basis for sustainable materials databases …but it is the hardest challenge
High-Throughput Materials Collaboratory 9
Shared Repository for Metadata and Data
Samples tracked across multiple experiments
https://mgi.nist.gov/htemc
“Sharable” Data and Publication 10
What about when a project is “done”?
Need: “Publish and Forget”
Requirements:
1. Provenance Information
2. Archival Storage
3. Detailed Descriptions
4. Rewards for Data Publication
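As a rough sketch of what these requirements could look like in a single publication record, the example below bundles a description, provenance, and a checksum. The function name, field names, and sample payload are assumptions made for illustration and do not follow the schema of any particular publication service.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_publication_record(title, description, authors, derived_from, payload):
    """Bundle a dataset description with provenance and a content checksum."""
    return {
        "title": title,
        "description": description,                       # detailed description
        "authors": list(authors),                         # credit for publishing data
        "provenance": {"derived_from": list(derived_from)},
        "sha256": hashlib.sha256(payload).hexdigest(),    # lets an archive verify integrity
        "size_bytes": len(payload),
        "published": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical payload and file names, used only for illustration.
payload = b"temperature_K,resistivity_uohm_cm\n300,1.7\n"
record = make_publication_record(
    title="Example resistivity scan",
    description="Hypothetical two-column CSV used only for illustration.",
    authors=["A. Researcher"],
    derived_from=["raw/scan_001.dat"],
    payload=payload,
)
print(json.dumps(record, indent=2))
```

The checksum and size address archival storage, the `provenance` entry records where the data came from, and the author list is one place where credit for data publication can attach.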
Common Features of All Services
Key Data Publication Services 11
Will these tools work for next generation beamlines?
Scaling Data Publication? Difficult 12
Publishing PBs is possible. http://opendata.cern.ch/
Our big challenge: Federated, distributed data
Is the data too large to be moved?
Large Data Size + Many Sources = >1 Publisher?
What are the questions?
• Where is the published data getting stored?
• How to associate data produced at different beamlines?
• What data is worth publishing?
Publication for Beamlines will Require New Approaches
What Does Published Data Look Like? 13
Basic Provenance Information
Links to Files
Data is Available (!), Only Usable by Humans
Reference Data: What People Want 14
Smallest fraction of data. Typically…
• Extensively curated
• Composed of many experiments
• Specific goal of collection
Accordingly, reference data are most widely used and usable
• Handbooks
• Web Databases
• Web APIs: a.get_in_chemsys(['Ca', 'O'])
Structured Data/APIs are critical for building tools
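To illustrate why structured data makes tool building easy, here is a minimal sketch of a chemical-system query over an in-memory list. The function `entries_in_chemsys`, the entry layout, and the energy values are hypothetical placeholders; this is not the API shown on the slide and the numbers are not real data.

```python
# Hypothetical reference entries; formation energies are made-up placeholders.
ENTRIES = [
    {"formula": "CaO", "elements": {"Ca", "O"}, "e_form": -3.0},
    {"formula": "Ca", "elements": {"Ca"}, "e_form": 0.0},
    {"formula": "O2", "elements": {"O"}, "e_form": 0.0},
    {"formula": "CaTiO3", "elements": {"Ca", "Ti", "O"}, "e_form": -3.5},
]


def entries_in_chemsys(entries, elements):
    """Return entries whose elements all lie inside the requested system."""
    system = set(elements)
    return [e for e in entries if e["elements"] <= system]


ca_o = entries_in_chemsys(ENTRIES, ["Ca", "O"])
formulas = sorted(e["formula"] for e in ca_o)
```

Because each entry stores its elements as a set, the query reduces to a subset test, so CaTiO3 is excluded from the Ca-O system while the elemental entries are kept.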
What is special about “reference data”? 15
• Large amount of data
• Curated metadata
• Link to original source
Data in Tabular Form
Such data is rare, yet a requirement for ML
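A small sketch of what "tabular form" means in practice: nested records are flattened into fixed columns that an ML library could consume. The helper `to_table`, the record layout, and the `glass_forming` labels are assumptions for illustration; the values are placeholders, not measurements.

```python
def to_table(records, elements, target):
    """Flatten records into (header, rows) with one column per element fraction."""
    header = [f"x_{el}" for el in elements] + [target]
    rows = []
    for rec in records:
        total = sum(rec["composition"].values())
        fractions = [rec["composition"].get(el, 0.0) / total for el in elements]
        rows.append(fractions + [rec[target]])
    return header, rows


# Placeholder records; the property values are illustrative, not measured.
records = [
    {"composition": {"Zr": 1, "Cu": 1}, "glass_forming": 1},
    {"composition": {"Zr": 3, "Co": 1}, "glass_forming": 0},
]
header, rows = to_table(records, ["Zr", "Cu", "Co"], "glass_forming")
```

Every row has the same columns regardless of which elements a given record contains, which is the property most ML tools require.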
Reference Data: A Bright Future 16
Commercial/Industrial
Academic/National Laboratory
Reference Databases are Proliferating Rapidly!
Reference Data for Beamlines? 17
Key Question: What is worth the curation?
• What do we have a lot of?
• What can be precisely described?
• Who has a large user base?
• What data is readily usable?
Places to look: What are the “measurements”?
If found: Can we distribute the reference data?
An example: Google Earth Engine 18
// Load the image from the archive.
var image = ee.Image('LANDSAT/LC08/C01/T1/LC08_044034_20140318');

// Define visualization parameters in an object literal.
var vizParams = {bands: ['B5', 'B4', 'B3'], min: 5000, max: 15000, gamma: 1.3};

// Center the map on the image and display.
Map.centerObject(image, 9);
Map.addLayer(image, vizParams, 'Landsat 8 false color');
Another Example: Sloan Sky Survey 19
Publishing Large Reference Datasets is Possible, We Just Need a Good Starting Case
What Are the Trends? 20
1. Data is Getting Published!
2. Repositories are Digital
3. Efforts are Community Driven
What Are the Major Trends? 21
1. Data is Getting Published: Deluge of Data
• Data Management Systems seldom used
• Publication repositories lack metadata
2. Repositories are Digital: APIs are Uncommon
• Tools Do Not Work with Databases
3. Efforts are Community Driven: Many Silos
• Finding Best Dataset Difficult
Current State: Data and Tools are Available
Current Gap: Not Always Usable
22
What is the ideal materials database?
Short Version: One that you barely notice
A Seamless Data Infrastructure 23
[Diagram: “You” connected to Software, Data Resources, and Computing]
Easily Access Data/Software/Compute from Anywhere
Republish New Data/Software Just As Easily
The Materials Data Facility 24
Sources: Databases, Datasets, APIs, LIMS, etc.
Data Publication
• Mint DOIs
• Associate metadata
• Persist datasets
Data Discovery
• Query
• Browse
• Aggregate
Distributed data storage (three nodes labeled “EP” in the diagram)
Publish and find materials data, regardless of size, location, type
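One way to picture discovery over distributed storage is an index that holds only metadata while the files stay at their home sites. The sketch below is a toy under stated assumptions: the site names, dataset IDs, sizes, and function names are hypothetical and do not describe how MDF is implemented.

```python
from collections import Counter

# Hypothetical storage sites, each holding metadata for datasets stored there.
ENDPOINTS = {
    "site_a": [{"id": "a-1", "elements": ["Zr", "Cu"], "size_gb": 2}],
    "site_b": [
        {"id": "b-1", "elements": ["Fe", "O"], "size_gb": 40},
        {"id": "b-2", "elements": ["Zr", "Co"], "size_gb": 1},
    ],
}


def build_index(endpoints):
    """Collect metadata only; the data files stay where they are."""
    return [dict(meta, endpoint=name)
            for name, metas in endpoints.items() for meta in metas]


def query(index, element):
    """Find datasets that contain the requested element."""
    return [m for m in index if element in m["elements"]]


def aggregate_by_endpoint(matches):
    """Count matching datasets per storage site."""
    return Counter(m["endpoint"] for m in matches)


index = build_index(ENDPOINTS)
zr = query(index, "Zr")
counts = aggregate_by_endpoint(zr)
```

Only the small metadata records move into the index, so the same query works whether a dataset is 1 GB or too large to transfer.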
Ian Foster (PI)
Ben Blaiszik
Jonathon Gaff
MDF “Connect” 25
Sharable Software (DLHub) 26
[Diagram: User 1 discovers data, trains a model, sends the model to DLHub, and receives a DOI. User 2 discovers and retrieves data, sends materials in a call to the DLHub model, and receives properties. New science!]
Publish Tools that Work with Data
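The publish-then-call pattern in this diagram can be sketched with an in-process registry. `ToyModelHub`, its methods, and the placeholder model are hypothetical and are not the DLHub API; the returned identifier merely stands in for a DOI.

```python
import uuid


class ToyModelHub:
    """Illustrative in-process registry; not the DLHub API."""

    def __init__(self):
        self._models = {}

    def publish(self, func, description):
        # Return a made-up identifier standing in for a DOI.
        model_id = f"toy-model-{uuid.uuid4().hex[:8]}"
        self._models[model_id] = {"func": func, "description": description}
        return model_id

    def run(self, model_id, inputs):
        # Look up the published model and apply it to the caller's inputs.
        return self._models[model_id]["func"](inputs)


def count_elements(materials):
    # Placeholder "model": returns the number of elements per material.
    return [len(m) for m in materials]


hub = ToyModelHub()
# User 1 publishes a model and receives an identifier.
model_id = hub.publish(count_elements, "Counts elements in each material")
# User 2 sends materials and receives properties.
result = hub.run(model_id, [["Zr", "Co", "V"], ["Cu", "Zr"]])
```

User 2 needs only the identifier and the inputs, not the training code or environment, which is the point of publishing tools alongside data.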
Example: Predicting Metallic Glasses 27
10.1126/sciadv.aaq1566 Predicted glass-forming ability
DLHub
[“Zr”, “Co”, “V”]
Funding: 2018 Argonne Adv. Computing LDRD
Conclusions 28
Take Home Points:
1. Tools exist for data at all stages of curation
2. Scaling to light source data is achievable
Key Gaps:
1. Pushing adoption of data management systems
2. Establishing publication best practices
3. Linking tools together
Thanks to our sponsors!
ALCF DF, Parsl, Globus, DLHub, U.S. Department of Energy, IMaD, Argonne LDRD