Machine Learning for Materials Discovery

A Practical Tutorial for Simple and Reproducible Materials Informatics

Logan Ward
Postdoctoral Scholar
University of Chicago
17 October 2018

What is Materials Informatics?

Another kind of computational materials science tool
Goal: Accelerate design of materials
Method: Replace experiments with computers

Many Theory-Based Tools:
• Density Functional Theory
• Phase Field
• Finite Element Analysis
• Computational Thermodynamics

Emerging Field: Data-driven Models
Materials Data → Machine Learning → Predictive Model: 𝜎𝑌 = 𝑓(𝑥)

FEA Image: http://www.icams.de/content/research/index.html

Machine Learning: What? Why?

What? Algorithms that generate computer programs
Why? Create software too complex to write manually

Supervised Learning: Given inputs, predict output
𝑦 = 𝑓(𝑥)

Common Algorithms: [Figure: plot of 𝑦 against 𝑥 with a fitted line, 𝑓(𝑥) = 𝑚𝑥 + 𝑏]

Community very new
Best practices are still being learned
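The linear model 𝑓(𝑥) = 𝑚𝑥 + 𝑏 is probably the simplest case of supervised learning: the algorithm picks 𝑚 and 𝑏 from example (𝑥, 𝑦) pairs. A minimal sketch, using ordinary least squares in plain Python; the data are made up for illustration and lie exactly on 𝑦 = 2𝑥 + 1.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Made-up training data (not from the talk) that follow y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

m, b = fit_line(xs, ys)


def f(x):
    """The learned model: predict y for a new input x."""
    return m * x + b


print("m =", m, "b =", b, "f(4) =", f(4.0))
```

In practice a library such as scikit-learn would do the fitting; the point here is only that the "program" 𝑓 is produced from data rather than written by hand.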

Today’s Goal: Explain my Workflow

Metallic Glasses
Collaborators: SLAC, NIST, NU, USC, Citrine

Crystalline Materials
Collaborators: NU [K. Kim]

Stopping Power
Collaborators: UIUC

Deep Learning of Materials Data
Collaborators: NU [D. Jha]

What have I learned from working on these projects?

Objectives

Goal: Pass on “lessons learned” from doing materials informatics

A Few General Questions:
• How Should I Construct a Problem?
• What Tools Are Available?
• How Should I Organize My Project?
• What About Publishing?


How to make/use ML models?

Practical concerns:
• What tools should I use?
• How should I avoid doing unneeded work?
• How to make sure my science is reproducible?

Materials Informatics Workflow

Collect → Process → Represent → Learn

Collect / Process: Δ𝐻𝑓 = −1.0, Δ𝐻𝑓 = −0.5

Represent: inputs 𝑋, outputs 𝑦

𝒁𝑨    𝒁𝑩    𝚫𝐇𝐟
3     4     -1.0
3     5     -0.5

Learn: Δ𝐻𝑓 = 𝑓(𝑍𝐴, 𝑍𝐵)
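The two training entries on the workflow slide (𝑍𝐴, 𝑍𝐵, Δ𝐻𝑓) can be written out as inputs and outputs in a few lines. This is a sketch under assumptions: the dictionary keys are names chosen here, and the nearest-neighbor predictor is only a stand-in for whatever model the "Learn" step would actually fit.

```python
# One record per training entry on the workflow slide
records = [
    {"Z_A": 3, "Z_B": 4, "dH_f": -1.0},
    {"Z_A": 3, "Z_B": 5, "dH_f": -0.5},
]

input_columns = ["Z_A", "Z_B"]   # hypothetical column names
output_column = "dH_f"

# Represent: build the input matrix X and the output vector y
X = [[row[col] for col in input_columns] for row in records]
y = [row[output_column] for row in records]


def predict_nearest(x_new):
    """Stand-in for f(Z_A, Z_B): return the output of the closest training entry."""
    distances = [sum((a - b) ** 2 for a, b in zip(x_new, x)) for x in X]
    return y[distances.index(min(distances))]


print("X =", X)
print("y =", y)
print("prediction for (3, 5):", predict_nearest([3, 5]))
```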

Data Collection and Processing

Collect → Process → Represent → Learn

Collect data, define training set

Challenge 1: Finding/storing structured datasets
• LIMS Tools
• Data Resources
• NLP
There are a variety of tools that can help you (shown: Materials Commons)

Challenge 2: Clearly define inputs and outputs
Δ𝐻𝑓 = −1.0, Δ𝐻𝑓 = −0.5

Data Management Need Not Be Fancy…

Write code to format data (Don’t edit by hand!)

Knowing “what you have” and “where you got it” should not rely on memory
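One way to follow both pieces of advice is a small formatting script that cleans the raw export and records where it came from. The sketch below is illustrative only: the sample names, column names, values, and the file name "hypothetical_export.csv" are all invented here, and the raw text is held in a string so the example is self-contained.

```python
import csv
import io
import json

# Made-up raw export with untidy whitespace (in practice, a file on disk)
RAW_TEXT = (
    "sample , delta_h_f\n"
    " sample_a , -1.0\n"
    "sample_b,-0.5 \n"
)


def clean_rows(raw_text):
    """Strip stray whitespace and convert the property column to floats."""
    reader = csv.reader(io.StringIO(raw_text))
    header = [name.strip() for name in next(reader)]
    rows = []
    for line in reader:
        if not line:  # skip blank lines
            continue
        entry = dict(zip(header, (value.strip() for value in line)))
        rows.append({
            "sample": entry["sample"],
            "delta_h_f": float(entry["delta_h_f"]),
        })
    return rows


def with_provenance(rows, source):
    """Bundle the cleaned rows with a note on where they came from."""
    return {"source": source, "n_entries": len(rows), "entries": rows}


dataset = with_provenance(clean_rows(RAW_TEXT), source="hypothetical_export.csv")
print(json.dumps(dataset, indent=2))
```

Because the cleaning lives in code, rerunning the script on a fresh export reproduces the same dataset, and the "source" field answers "where you got it" without relying on memory.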

…but There Are Modern Tools

Common Data Curation System

Citrination

Organizing your data provides a solid starting point

Data Representation

Collect → Process → Represent → Learn

Encode domain knowledge into inputs

𝑃𝑟𝑜𝑝𝑒𝑟𝑡𝑦 = 𝒇(𝐴𝑡𝑡𝑟𝑖𝑏𝑢𝑡𝑒𝑠)
Attributes: representation of material
Ex: Attributes = g(𝑥𝐻, 𝑥𝐻𝑒, …)

[Figure: LiF, NaPb, and Na2O placed by 𝑥Na and |ΔΧ|]

What does a representation need?
• Completeness: Differentiate materials
• Efficiency: Quick to compute
• Accuracy: Capture important effects

End product: Machine-learning compatible inputs
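As an illustration of Attributes = g(𝑥𝐻, 𝑥𝐻𝑒, …), the sketch below turns the three example formulas into two attributes, the sodium fraction 𝑥Na and the electronegativity difference |ΔΧ|. It is a simplified stand-in, not the representation used in the talk: the parser handles only plain formulas with integer counts, and the electronegativity table holds commonly tabulated Pauling values for just the five elements needed.

```python
import re

# Commonly tabulated Pauling electronegativities (only the elements used below)
ELECTRONEGATIVITY = {"Li": 0.98, "Na": 0.93, "O": 3.44, "F": 3.98, "Pb": 2.33}


def element_fractions(formula):
    """Parse a simple formula such as 'Na2O' into atomic fractions.

    Handles element symbols followed by optional integer counts only
    (no parentheses, no fractional amounts).
    """
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(amount) if amount else 1)
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}


def attributes(formula):
    """Return [x_Na, |delta chi|] for the given compound."""
    fractions = element_fractions(formula)
    chis = [ELECTRONEGATIVITY[symbol] for symbol in fractions]
    return [fractions.get("Na", 0.0), max(chis) - min(chis)]


for formula in ["LiF", "NaPb", "Na2O"]:
    print(formula, attributes(formula))
```

The three compounds get three distinct attribute vectors (completeness), and each is cheap to compute (efficiency); whether two numbers capture the important effects (accuracy) depends on the property being modeled.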

Data Representation: Open Codes

There is a variety of re-usable approaches, codes
• Electronic Structures
• Crystal Structures
• Microstructures (e.g., pyMKS)
• Molecules

There may be a code that does what you need

Pictures from GitHub, Wikipedia

Matminer: Toolkit for Materials ML

https://github.com/hackingmaterials/matminer/

Matminer: Toolkit for Materials ML

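Usable model in 5 lines of code (plus imports)

    from matminer.data_retrieval.retrieve_MDF import MDFDataRetrieval
    from matminer.featurizers.composition import ElementProperty
    from pymatgen.core import Composition
    from sklearn.linear_model import Lasso

    # Retrieve OQMD data, build compositions, compute Magpie features, fit a model
    data = MDFDataRetrieval().get_dataframe(source='oqmd')
    data['composition'] = data['mdf.composition'].apply(Composition)
    f = ElementProperty.from_preset('magpie')
    data = f.featurize_dataframe(data, 'composition')
    m = Lasso().fit(data[f.feature_labels()], data['oqmd.delta_e.value'])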


Matminer: “Featurizer” Library

Matminer has methods from many groups

Machine Learning

Collect → Process → Represent → Learn

Fit a model to the data

There are many algorithms, how do you decide?
1. Identify algorithms with desired properties. Ex: differentiability
2. Use cross-validation to optimize parameters
3. Pick the one with the lowest “loss”

The process and tools for building an ML model are well-established
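The three selection steps can be sketched in plain Python. Everything here is an assumption made for illustration: the candidate models (a constant, a straight line, and k-nearest neighbors with two settings of k), the made-up data, the interleaved five-fold split, and mean squared error as the "loss". A real project would more likely rely on a library such as scikit-learn for each piece.

```python
def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training outputs."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return lambda x: m * x + b


def make_knn(k):
    """k-nearest-neighbor regression; k is the parameter tuned by cross-validation."""
    def fit(xs, ys):
        def predict(x):
            nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
            return sum(y for _, y in nearest) / len(nearest)
        return predict
    return fit


def cross_val_mse(fit, xs, ys, n_folds=5):
    """Mean squared error over held-out points, using interleaved folds."""
    squared_errors = []
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        test = [i for i in range(len(xs)) if i % n_folds == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        squared_errors.extend((model(xs[i]) - ys[i]) ** 2 for i in test)
    return sum(squared_errors) / len(squared_errors)


# Made-up data: a linear trend plus a small alternating perturbation
xs = [0.5 * i for i in range(20)]
ys = [2 * x + 1 + 0.1 * (-1) ** i for i, x in enumerate(xs)]

candidates = {
    "mean": fit_mean,
    "line": fit_line,
    "knn_k1": make_knn(1),
    "knn_k3": make_knn(3),
}
scores = {name: cross_val_mse(fit, xs, ys) for name, fit in candidates.items()}
best = min(scores, key=scores.get)

print(scores)
print("selected:", best)
```

Every candidate is scored only on points it was not trained on, so the comparison reflects predictive error rather than how well each model memorizes the training set.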

Think Hard About Validation

Test Model How You Will Use It
• Evaluating new alloy systems?
• Predicting Future Events?
• Interpolating to new clusters?

[Figure: Exp. vs. ML comparison]

Ward et al. npj Comp Mat. (2016)
Ward et al., in preparation
Meredig et al. MSDE. (2018)
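For the "new alloy systems" case, one way to test the model the way it will be used is to hold out every entry from one system at a time, so the model is always scored on a system it has never seen. The sketch below shows only the splitting logic; the system labels are illustrative, and scikit-learn offers comparable group-based splitters for real use.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Illustrative labels: the alloy system each training entry belongs to
systems = ["Al-Ni", "Al-Ni", "Cu-Zr", "Cu-Zr", "Cu-Zr", "Mg-Zn"]

splits = list(leave_one_group_out(systems))
for held_out, train, test in splits:
    print(held_out, "train:", train, "test:", test)
```

A random split would scatter entries from the same system across training and test sets, which tends to make a model look better than it will be on a genuinely new system.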


How to Organize? [Jupyter]

What is Jupyter? Why care?

• Jupyter is an environment designed for reproducible computational science
• Stores code with outputs and documentation in a single notebook
• Modeled after Mathematica
  https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/

• Why do I care? Jupyter lets me…
  • organize my code
  • easily run on a remote system
  • keep track of results and rationale
  • communicate my research better

How I write a notebook

One notebook per “experiment” or “idea”

Outline:
1. Title and short abstract
   1. “What am I doing here and why”
2. Load in libraries
   1. Put them up front, so it crashes early
3. Load in data from disk
   1. Ex: training data, results from other notebook
4. Each step in its own block
   1. Introduction
   2. Code and explanation
   3. Figure/visualization
   4. Explanation of finding
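The outline can be captured once as a reusable skeleton. A notebook file is JSON, so the sketch below builds one with only the standard library, assuming the version 4 notebook layout (a "cells" list plus "nbformat" fields). The cell text is placeholder wording written for this example, and the helper names are hypothetical.

```python
import json


def markdown_cell(text):
    """A documentation cell."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(code):
    """An unexecuted code cell."""
    return {
        "cell_type": "code",
        "metadata": {},
        "execution_count": None,
        "outputs": [],
        "source": code,
    }


cells = [
    # 1. Title and short abstract
    markdown_cell("# Title\nWhat am I doing here and why?"),
    # 2. Libraries up front, so a missing one fails early
    markdown_cell("## Load in libraries"),
    code_cell("import json"),
    # 3. Data from disk
    markdown_cell("## Load in data\nEx: training data, results from other notebook"),
    code_cell("# load data here"),
    # 4. One block per step: introduction, code, figure, finding
    markdown_cell("## Step 1\nIntroduction: what this step does"),
    code_cell("# code and visualization for this step"),
    markdown_cell("Finding: what the figure shows"),
]

notebook = {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}
notebook_json = json.dumps(notebook, indent=1)
print(notebook_json[:200])
```

Writing notebook_json to a file with an .ipynb extension should give a notebook that opens in Jupyter with the outline already in place.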

Example Step

[Figure: example notebook step, annotated with: Introduction; Explanation; Visualization and conclusion]

Someone should be able to understand this without knowing Python!

The Full Narrative

[A Collection of Many Small Steps]

Pitfalls When Making Notebooks

1. Not including documentation
   Advice: Write what you’re going to do first
2. Notebook not capturing entire process
   Problem: Jupyter lets you execute cells out of order
   Advice: Periodically “Restart Kernel and Run All Cells”
3. Duplicate code between notebooks
   Advice: Make a separate module for common code
4. Library conflicts
   Advice: Run mature projects in a container or on separate machines
   Advice: Make a “requirements.txt” file
5. The “one cell notebooks”
   Advice: