Machine Learning for Materials Discovery

Systems Briefing A: Materials Databases
Logan Ward
Postdoctoral Scholar, Data Science and Learning Division
Argonne National Laboratory

Materials Databases Are Not New

Resources of Materials Data Have Long History
• CRC Handbook: 1913
• ASM Handbooks: ~1920s
• JANAF Tables: 1964
• (Hultgren) Select Values of the Thermo. Properties of…: 1973
• Pauling File: 1995
• OQMD: 2013
• … and many more

Why Important Now? Why not “solved”?

Why the Sudden Prevalence?

Why care now? Data is more useful

Seko et al. PRB (2014), 054303

Chen et al. Chem. Mater. (2012)

Zhang et al. Acta Mat. (2018)

Ward et al. Acta Mat. (2018) Kusne et al. Sci Rep. (2014)

Today’s Overview

Key Questions:
1. What are the ongoing trends?
2. What are the major efforts?
3. What are the current challenges/gaps?
Conclusion: What is the dream?

*PSA: I work on the Materials Data Facility (MDF)

Materials Data Pyramid

[Figure: data pyramid; axis labels: Curation Effort, Interaction Frequency. Image Credit: Jim Warren (NIST)]

Different Types of Data == Different Database Requirements!

Working Data: Close to the Scientist

Everyday Data Needs
[Data] scientists need…
1. Unrestricted access to data
2. Portability
3. To connect data to tools
   • Everyone has their own workflow
4. Ability to share with collaborators

Creating usable data management systems is a huge and well-studied problem

Laboratory Inventory Management

NIST’s CDCS

Materials Commons

AFRL HyperThought

4CeeD

Key Features and Challenges

What makes a LIMS system?
1. Guided Data Capture + Schemas
2. User Accounts, Access Control
3. Ability to Query Data

Codes already provide this
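
The three LIMS features listed above can be sketched in a few lines. This is only a toy, in-memory illustration under stated assumptions: the `SCHEMA` fields, the `TinyLIMS` class, and the user names are all hypothetical and do not correspond to CDCS, Materials Commons, HyperThought, or 4CeeD.

```python
from dataclasses import dataclass, field

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"sample_id": str, "technique": str, "temperature_K": float}


@dataclass
class Record:
    owner: str
    data: dict
    shared_with: set = field(default_factory=set)


class TinyLIMS:
    """Toy in-memory store; not any real LIMS product."""

    def __init__(self):
        self._records = []

    def capture(self, owner, data):
        # Guided capture: reject records that do not match the schema.
        for name, expected in SCHEMA.items():
            if name not in data:
                raise ValueError(f"missing field: {name}")
            if not isinstance(data[name], expected):
                raise TypeError(f"{name} must be {expected.__name__}")
        record = Record(owner=owner, data=dict(data))
        self._records.append(record)
        return record

    def query(self, user, **criteria):
        # Access control: only the owner or explicit collaborators see a record.
        visible = (r for r in self._records
                   if user == r.owner or user in r.shared_with)
        return [r for r in visible
                if all(r.data.get(k) == v for k, v in criteria.items())]


lims = TinyLIMS()
rec = lims.capture("alice", {"sample_id": "S-001", "technique": "XRD",
                             "temperature_K": 300.0})
rec.shared_with.add("bob")  # share with one collaborator
print(len(lims.query("bob", technique="XRD")))
```

A real system would add persistence, authentication, and richer schemas; the point here is only that capture, access control, and query are separate, small responsibilities.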

Will this concept scale? Yes, but…
1. Need integration, training with facilities
2. Adapting tools to large databases (minor)
3. Integration with analysis tools!

Widespread data management will provide the basis for sustainable materials databases
…but it is the hardest challenge

High-Throughput Materials Collaboratory

Shared Repository for Metadata Data

Samples tracked across multiple experiments

https://mgi.nist.gov/htemc
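
Tracking a sample across experiments mostly comes down to a shared identifier. The sketch below assumes hypothetical measurement records with a `sample_id` field; the labs, techniques, and IDs are made up for illustration and do not reflect the collaboratory's actual metadata format.

```python
from collections import defaultdict

# Hypothetical measurement metadata contributed by different labs.
measurements = [
    {"sample_id": "S-001", "lab": "lab-a", "technique": "XRD"},
    {"sample_id": "S-002", "lab": "lab-a", "technique": "XRD"},
    {"sample_id": "S-001", "lab": "lab-b", "technique": "resistivity"},
]


def group_by_sample(measurements):
    """Collect every (lab, technique) pair recorded for each sample."""
    history = defaultdict(list)
    for m in measurements:
        history[m["sample_id"]].append((m["lab"], m["technique"]))
    return dict(history)


history = group_by_sample(measurements)
print(history["S-001"])
```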

“Sharable” Data and Publication

What about when a project is “done”?
Need: “Publish and Forget”
Requirements:
1. Provenance Information
2. Archival Storage
3. Detailed Descriptions
4. Rewards for Data Publication
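
As a rough illustration of the first three requirements, a publication record can bundle a description, provenance links, and per-file checksums so that archived bytes can be verified later. All field names, the dataset title, and the file content below are assumptions made for this sketch, not the schema of any particular publication service.

```python
import hashlib
import json
from datetime import datetime, timezone


def describe_file(name, content):
    # A checksum lets a reader verify that archived bytes are unchanged.
    return {"name": name, "bytes": len(content),
            "sha256": hashlib.sha256(content).hexdigest()}


def build_record(title, authors, description, derived_from, files):
    # All field names are illustrative, not a repository's actual schema.
    return {
        "title": title,
        "authors": authors,
        "description": description,
        "provenance": {"derived_from": derived_from},
        "published": datetime.now(timezone.utc).isoformat(),
        "files": [describe_file(n, c) for n, c in files.items()],
    }


record = build_record(
    title="Example diffraction scans",
    authors=["A. Researcher"],
    description="Placeholder dataset used only to illustrate the record layout.",
    derived_from=["sample S-001"],
    files={"scan_001.csv": b"two_theta,intensity\n10.0,152\n"},
)
print(json.dumps(record, indent=2))
```

The fourth requirement, rewards for publication, is usually handled by attaching a citable identifier to such a record, which is outside what this sketch covers.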

Common Features of All Services

Key Data Publication Services

Will these tools work for next generation beamlines?

Scaling Data Publication? Difficult

Publishing PBs is possible. http://opendata.cern.ch/

Our big challenge: Federated, distributed data
Is the data too large to be moved?
Large Data Size + Many Sources = >1 Publisher?
What are the questions?
• Where is the published data getting stored?
• How to associate data produced at different beamlines?
• What data is worth publishing?

Publication for Beamlines will Require New Approaches

What Does Published Data Look Like?

Basic Provenance Information

Links to Files

Data is Available (!), Only Usable by Humans

Reference Data: What People Want

Smallest fraction of data. Typically…
• Extensively curated
• Composed of many experiments
• Specific goal of collection

Accordingly, reference data are most widely used and usable

Handbooks

Web Databases

Web APIs
a.get_in_chemsys(['Ca', 'O'])

Structured Data/APIs are critical for building tools
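
To show why a structured call like this is convenient for tool building, here is a minimal stand-in. It assumes that a chemical-system query returns every entry whose elements all belong to the requested system; the function below reuses the name from the slide but is a hypothetical local implementation over placeholder entries, not the API of any real database client.

```python
# Hypothetical entries; the energies are placeholders, not real data.
ENTRIES = [
    {"formula": "Ca", "elements": {"Ca"}, "energy": -1.0},
    {"formula": "CaO", "elements": {"Ca", "O"}, "energy": -2.0},
    {"formula": "O2", "elements": {"O"}, "energy": -3.0},
    {"formula": "CaTiO3", "elements": {"Ca", "Ti", "O"}, "energy": -4.0},
]


def get_in_chemsys(elements, entries=ENTRIES):
    """Return entries containing only elements from the given system."""
    system = set(elements)
    # Subset test: CaTiO3 is excluded because Ti is outside the Ca-O system.
    return [e for e in entries if e["elements"] <= system]


formulas = [e["formula"] for e in get_in_chemsys(["Ca", "O"])]
print(formulas)
```

Because the result is a list of structured records rather than a web page, downstream tools can filter, sort, or plot it without any scraping.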

What is special about “reference data”?

Large amount of data

Curated metadata

Link to original source

Data in Tabular Form

Such data is rare, yet a requirement for ML
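
As a small sketch of what "tabular form" means in practice, the snippet below flattens nested records into fixed-width rows that keep a link to the source. The records, property values, and source labels are placeholders invented for illustration.

```python
import csv
import io

# Hypothetical curated records; values are placeholders.
records = [
    {"composition": {"Zr": 0.5, "Cu": 0.5}, "property": 1.0, "source": "ref-A"},
    {"composition": {"Zr": 0.6, "Al": 0.4}, "property": 0.0, "source": "ref-B"},
]


def to_table(records):
    # One column per element seen anywhere, so every row has the same width.
    elements = sorted({el for r in records for el in r["composition"]})
    header = elements + ["property", "source"]
    rows = [[r["composition"].get(el, 0.0) for el in elements]
            + [r["property"], r["source"]] for r in records]
    return header, rows


header, rows = to_table(records)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
print(buffer.getvalue())
```

Keeping the `source` column in the table preserves the link to the original reference even after the data has been reshaped for a learning algorithm.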

Reference Data: A Bright Future

Commercial/Industrial

Academic/National Laboratory

Reference Databases are Proliferating Rapidly!

Reference Data for Beamlines?

Key Question: What is worth the curation?
• What do we have a lot of?
• What can be precisely described?
• Who has a large user base?
• What data is readily usable?

Places to look: What are the “measurements”?
If found: Can we distribute the reference data?

An example: Google Earth Engine

// Load the image from the archive.
var image = ee.Image('LANDSAT/LC08/C01/T1/LC08_044034_20140318');

// Define visualization parameters in an object literal.
var vizParams = {bands: ['B5', 'B4', 'B3'], min: 5000, max: 15000, gamma: 1.3};

// Center the map on the image and display.
Map.centerObject(image, 9);
Map.addLayer(image, vizParams, 'Landsat 8 false color');

Another Example: Sloan Sky Survey

Publishing Large Reference Datasets is Possible, We Just Need a Good Starting Case

What Are the Trends?

1. Data is Getting Published!
2. Repositories are Digital
3. Efforts are Community Driven

What Are the Major Trends?

1. Data is Getting Published: Deluge of Data
   • Data Management Systems seldom used
   • Publication repositories lack metadata
2. Repositories are Digital: APIs are Uncommon
   • Tools Do Not Work with Databases
3. Efforts are Community Driven: Many Silos
   • Finding Best Dataset Difficult

Current State: Data and Tools are Available
Current Gap: Not Always Usable


What is the ideal materials database?
Short Version: One that you barely notice

A Seamless Data Infrastructure

[Diagram labels: You; Software; Data Resources; Computing]

Easily Access Data/Software/Compute from Anywhere

Republish New Data/Software Just As Easily

The Materials Data Facility

[Diagram labels: Databases, Datasets, APIs, LIMS, etc.; EP (three times); Distributed data storage]

Data Publication
• Mint DOIs
• Associate metadata
• Persist datasets

Data Discovery
• Query
• Browse
• Aggregate

Publish and find materials data, regardless of size, location, type
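
One way to picture discovery over distributed storage is a central metadata index whose entries point at files that stay where they were produced. The sketch below is an assumption-laden toy: the index entries, location names, and paths are invented, and the functions are not part of any MDF client.

```python
from collections import Counter

# Hypothetical index: metadata is searchable centrally, files stay in place.
INDEX = [
    {"title": "Zr-Cu glass survey", "elements": ["Zr", "Cu"],
     "location": "store-lab-a", "path": "/glass/zrcu"},
    {"title": "Ca-O formation energies", "elements": ["Ca", "O"],
     "location": "store-hpc-b", "path": "/dft/cao"},
    {"title": "Zr-Al diffraction", "elements": ["Zr", "Al"],
     "location": "store-lab-a", "path": "/xrd/zral"},
]


def query(element, index=INDEX):
    """Query: find datasets whose metadata mentions an element."""
    return [d for d in index if element in d["elements"]]


def aggregate(datasets):
    """Aggregate: count matching datasets per storage location."""
    return Counter(d["location"] for d in datasets)


hits = query("Zr")
print([d["title"] for d in hits])
print(aggregate(hits))
```

Only the small metadata records need to be centralized; the `location` and `path` fields tell a client where to fetch the data if it is actually needed.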

Ian Foster (PI)

Ben Blaiszik

Jonathon Gaff

MDF “Connect”

Sharable Software (DLHub)

[Diagram labels. User 1: Discover Data, Train Model, Send Model, Receive DOI. User 2: Discover Data, Retrieve Data, Send Materials, Call DLHub Model, Receive Properties. Outcome: New science!]

Publish Tools that Work with Data
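
The publish-then-call pattern can be sketched with a toy registry: one user publishes a model and receives an identifier, another user calls the model through that identifier. The `ModelHub` class, the identifier format, and the placeholder model are hypothetical; this is not the DLHub SDK and it does not mint real DOIs.

```python
import uuid


class ModelHub:
    """Toy in-memory registry; this is not the DLHub SDK."""

    def __init__(self):
        self._models = {}

    def publish(self, func, description):
        # Stand-in for minting a persistent identifier such as a DOI.
        identifier = f"model-{uuid.uuid4().hex[:8]}"
        self._models[identifier] = (func, description)
        return identifier

    def run(self, identifier, inputs):
        # The caller needs only the identifier, not the model code.
        func, _ = self._models[identifier]
        return func(inputs)


def count_elements(system):
    # Placeholder "model": returns the number of elements in the system.
    return len(system)


hub = ModelHub()
model_id = hub.publish(count_elements, "Toy model for illustration")  # User 1
result = hub.run(model_id, ["Zr", "Co", "V"])                         # User 2
print(model_id, result)
```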

Example: Predicting Metallic Glasses

10.1126/sciadv.aaq1566

[Figure labels: [“Zr”, “Co”, “V”]; DLHub; Predicted glass-forming ability]

Funding: 2018 Argonne Adv. Computing LDRD
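
A request for a ternary system such as Zr-Co-V implies evaluating many compositions. As a sketch of that step only, the function below enumerates a composition grid that could be sent to a model; the 10% step size is an assumption, and the published glass-forming model itself is not reproduced here.

```python
from itertools import product


def ternary_grid(elements, steps=10):
    """Enumerate compositions (a, b, c)/steps whose fractions sum to 1."""
    if len(elements) != 3:
        raise ValueError("expected exactly three elements")
    grid = []
    for a, b in product(range(steps + 1), repeat=2):
        c = steps - a - b
        if c >= 0:  # keep only points inside the composition triangle
            fractions = (a / steps, b / steps, c / steps)
            grid.append(dict(zip(elements, fractions)))
    return grid


grid = ternary_grid(["Zr", "Co", "V"], steps=10)
print(len(grid))  # 66 compositions, including the pure elements and binaries
```

Each dictionary in `grid` maps element symbols to atomic fractions, which is a convenient shape for passing to a property-prediction function.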

Conclusions

Take Home Points:
1. Tools exist for data at all stages of curation
2. Scaling to light source data is achievable

Key Gaps:
1. Pushing adoption of data management systems
2. Establishing publication best practices
3. Linking tools together

Thanks to our sponsors!
ALCF DF, Parsl, Globus, DLHub, U.S. Department of Energy, IMaD, Argonne LDRD