The Computational Materials Repository

Information Storage / Petascale Computing

The possibilities for designing new materials based on quantum physics calculations are rapidly growing, but these design efforts lead to a significant increase in the amount of computational data created. The Computational Materials Repository (CMR) addresses this data challenge and provides a software infrastructure that supports the collection, storage, retrieval, analysis, and sharing of data produced by many electronic-structure simulators.

The design of novel and versatile materials is an issue of central importance for society. This is exemplified by our current focus on discovering new materials for energy conversion and storage to provide a sustainable alternative to the fossil-based fuel economy. Atomic-scale calculations are becoming increasingly important in strengthening our ability to meet this challenge, as they provide an ever-improving alternative to expensive experiments. However, conducting computational atomic-scale materials design requires that we carry out calculations on many materials. This poses a challenge in that we must find a way to systematically store those calculations to enable easy retrieval, comparison, and analysis.

To meet this challenge, we present a software infrastructure, the Computational Materials Repository (CMR), which implements a modular framework in Python that provides tools for collecting, storing, grouping, searching, retrieving, and analyzing data generated by many modern electronic-structure simulators. (For others’ efforts at meeting this challenge, see the sidebar “Availability and Alternative Software.”) CMR is the result of a collaboration under the Quantum Materials Informatics Project (www.qmip.org), which aims to establish the core technology for integrated computational materials design. Our particular focus is on density functional theory (DFT), which represents a favorable tradeoff between speed and accuracy for the treatment of “few-hundred-atom” systems that are highly relevant for understanding the physical and chemical properties of materials.

CMR can be used for single-user projects, but can also make use of a MySQL database (www.mysql.com) for intergroup collaborations and for processing significantly more data. CMR is currently in use in our research groups, and we provide it under an open source license to any group or individual who might find it useful.

Computing in Science & Engineering
1521-9615/12/$31.00 © 2012 IEEE. Copublished by the IEEE CS and the AIP

David D. Landis, Technical University of Denmark

Jens S. Hummelshøj, SLAC National Accelerator Laboratory and Stanford University

Svetlozar Nestorov, University of Chicago

Jeff Greeley, Argonne National Laboratory

Marcin Dułak, Technical University of Denmark

Thomas Bligaard and Jens K. Nørskov, SLAC National Accelerator Laboratory and Stanford University

Karsten W. Jacobsen, Technical University of Denmark

This article has been peer-reviewed.

Availability and Alternative Software

The Computational Materials Repository (CMR) software can be found at http://wiki.fysik.dtu.dk/cmr, and the CMR database can be accessed at http://cmr.fysik.dtu.dk.1 The Atomic Simulation Environment (ASE) can be found at http://wiki.fysik.dtu.dk/ase.

Recently, several projects have appeared that go in a similar direction as CMR; see, for example, AflowLib.org at www.aflowlib.org and Estest at http://estest.ucdavis.edu. Quixote2 is a system to organize, share, and query data for computational chemistry codes. As with CMR, it’s distributed as open source software and implements a similar workflow to process data. The main advantages of CMR over Quixote are scriptable database access, the possibility of adding data to the database without a converter, and being able to create groups of calculations with custom fields and keywords.

Another example is the Materials Project,3 which aims to provide a public database of electronic structure calculations for materials screening, structure prediction, analysis, and data mining. The unique feature of its data generation framework is the combination of known existing compounds, automated density functional theory (DFT) calculations, and analysis to predict novel materials.

References
1. T.R. Munter et al., “Virtual Materials Design Using Databases of Calculated Materials Properties,” Computational Science & Discovery, vol. 2, no. 1, 2009, p. 015006; doi:10.1088/1749-4699/2/1/015006.
2. S. Adams et al., “The Quixote Project: Collaborative and Open Quantum Chemistry Data Management in the Internet Age,” J. Cheminformatics, vol. 3, no. 1, 2011, p. 38; doi:10.1186/1758-2946-3-38.
3. A. Jain et al., “A High-Throughput Infrastructure for Density Functional Theory Calculations,” Computational Materials Science, vol. 50, no. 8, 2011, pp. 2295–2310.

CMR

From experience with the Atomic Simulation Environment (ASE; see https://wiki.fysik.dtu.dk/ase)1 and its interfaces to legacy as well as state-of-the-art electronic structure codes, we’ve learned that Python (www.python.org) is well suited for our purpose; it’s a high-level programming language used in research and industry (due to its intuitive and powerful syntax) that fits the needs of newcomers as well as expert programmers. As with the ASE, we’d like users to benefit from CMR at several levels of complexity. At the simplest level, CMR can be used by individual users who keep their data in an ordinary file system; installing a database isn’t required. Power users who want to access data faster and with more flexibility can install a MySQL database and use one or more of the CMR interfaces to benefit fully from the system.

The database system that we describe here is tailored to store, retrieve, and analyze data related to properties of matter at an atomic scale. A fairly large number of so-called electronic structure codes exist today (see www.psi-k.org/codes.shtml), and it’s the ambition of CMR to work broadly with many of these codes. The different codes can, however, have quite different output file formats, even though the output might contain relatively similar information. Our solution to the file-format issue is to initially convert the data to intermediate db-files (also called cmr-files).


Whenever possible, the variable names used in the db-file are the same as in the original file format, which lets users become quickly familiar with the database’s use. The reasons for using intermediate db-files are that they let users verify content, collect data without directly accessing the database, and exchange data with other parties. The CMR thereby introduces more flexibility in storing different types of calculations, new possibilities for intra- and intergroup collaboration, and better support for third-party analysis tools than the standalone Java-based Virtual Materials Design Platform (VMDF) package.2
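A minimal sketch of the intermediate-file idea follows: native variable names are kept, the file can be inspected without a database, and any third party can read it back. The XML layout (a `db_file` root holding typed `variable` elements) and the helper names `to_intermediate` and `from_intermediate` are assumptions made for illustration; they are not CMR's actual db-file format or API.

```python
import xml.etree.ElementTree as ET


def _type_tag(value):
    """Pick a type label for a value (hypothetical labels for this sketch)."""
    if isinstance(value, bool):
        raise TypeError("booleans are not handled in this sketch")
    if isinstance(value, float):
        return "double"
    if isinstance(value, int):
        return "long"
    if isinstance(value, list) and all(type(v) is int for v in value):
        return "long_array"
    raise TypeError(f"unsupported value: {value!r}")


def to_intermediate(native):
    """Serialize a dict of native-format variables to an XML string."""
    root = ET.Element("db_file")
    for name, value in native.items():
        attrib = {"name": name, "type": _type_tag(value)}
        node = ET.SubElement(root, "variable", attrib)
        if isinstance(value, list):
            node.text = " ".join(str(v) for v in value)
        else:
            node.text = repr(value)
    return ET.tostring(root, encoding="unicode")


def from_intermediate(xml_text):
    """Read the XML string back into a dict, using the stored type labels."""
    parsers = {
        "double": float,
        "long": int,
        "long_array": lambda text: [int(v) for v in text.split()],
    }
    result = {}
    for node in ET.fromstring(xml_text).iter("variable"):
        parse = parsers[node.get("type")]
        result[node.get("name")] = parse(node.text or "")
    return result


# The variable names are kept exactly as a simulator might report them.
native_output = {"TotalEnergy": -586.442, "AtomicNumbers": [6, 8]}
xml_text = to_intermediate(native_output)
round_trip = from_intermediate(xml_text)
```

Because the file carries its own names and types, a collaborator can check or reuse `xml_text` with nothing more than a standard XML parser.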

Accessing the Database

Communication with a database is done by writing queries. For example, SELECT * FROM db.tbl would return all data from a table named tbl that’s contained in a database called db. This isn’t very user-friendly because queries tend to become long and complicated. Moreover, queries depend on the underlying database’s structure, so when the database structure changes, previously written queries become invalid. CMR handles this query-writing burden by providing three interfaces: a Python, a PHP, and an HTML/JavaScript interface.

The most powerful interface is the Python interface, which lets users retrieve a subset of the data by selecting keywords, atomic numbers, or data ranges, and then do further processing. This interface is flexible with respect to the data’s location and can query a database as well as a db-file repository. This ensures that people starting with a db-file repository can easily switch to a database later. Using Python lets users perform advanced operations such as grouping results based on keywords and writing results back to the database.

PHP is a general-purpose scripting language that runs on a webserver. The PHP interface makes it possible to visit a CMR server with a Web browser and provides access to the database in a manner similar to a state-of-the-art search engine; searches are performed by entering keywords, atom names, and data ranges for variables. Additionally, atomic structures and all files (such as calculation scripts and graphics) included in a db-file can be viewed, which turns out to be very useful, especially if the keywords aren’t accurate or it’s not clear how the data were calculated. Results can be downloaded, further analyzed, grouped, and re-uploaded with the Python interface.

Although the Python interface offers a lot of functionality, in many cases it’s convenient to use a graphical user interface where, for example, atomic structures can be visualized, as in the HTML/JavaScript interface silo. Silo runs in state-of-the-art browsers such as Chrome, Opera, Safari, or Firefox (though Firefox doesn’t support local data caching in a SQLite database). Simple Graphical User Interface (SiGUI), a plug-in for silo, lets users create queries visually without the need to know a database query language. In Figure 1, we show a query that retrieves data from a user named “strabo,” selects the results that contain the atom “Li,” and finally picks the results that contain the keyword “halide.” The results are presented in a cover flow showing the atomic structures above a table with corresponding values.

Analyzing Data

Analyzing the data is one of the most important aspects of CMR. Without analysis, the database would just present an overwhelming amount of strings and numbers of little use. It’s therefore important that the analysis be both thorough and flexible.

Figure 1. The silo plug-in SiGUI lets users create queries in a simple graphical user interface. The results are shown as an atomic structure cover flow and in a table.

The first feature to use for analysis is the taxonomy, which consists of system-generated tags and keywords that describe the data. Some of these are quite obvious, such as the identity of an atom being nitrogen and the numbers that describe a given atom’s position in space. These are the kinds of keywords that are already identified with the initial data upload. However, there are also other keywords that can be automatically deduced. If a particular system contains a nitrogen atom surrounded by three hydrogen atoms at typical bond distances, this is an indication that the system contains an ammonia molecule. Thus, it might be appropriate to tag the system with the keyword ammonia, so that later searches can easily find the calculations involving this molecule. Such automatic keyword assignments are typically complex and therefore left to users.

Another feature for analysis purposes is the use of folksonomy, that is, “a system of classification derived from the practices and methods of collaboratively creating and managing tags to annotate and categorize content” (see http://en.wikipedia.org/wiki/Folksonomy). In other words, the database should take advantage of all the useful knowledge available from its users in classifying the content. One user might, for example, carry out studies related to how small molecules bind to a nickel surface and therefore decide to add a keyword “chemisorption” to a set of calculations. This is useful later on for other users who might want to study the same subject, and who (with a simple search) can identify relevant calculations that have already been carried out.

The taxonomy and folksonomy tagging becomes even more dynamic by introducing agents. When we say agent here, we mean a program or process running in the background (or in some cases executed by the user manually) to query the database using taxonomy and folksonomy classification to retrieve data, perform operations, group results, or create tables. An example of a simple agent could be a piece of code identifying ammonia molecules. This code could run automatically every time new data are uploaded to the database and ensure that there would be an up-to-date list of ammonia molecules entering calculations in the database.

An agent to calculate chemisorption energies is a little more involved, as it requires three electronic total energy calculations:
• the adsorbate in the gas phase (X);
• the clean surface (Y); and
• the surface with the adsorbate, such as a molecule bound to the surface (Z).

The chemisorption energy, $E_{\mathrm{chem}}$, is then calculated as $E_{\mathrm{chem}} = E_Z - E_X - E_Y$. The agent first finds all solutions for X, Y, and Z that satisfy the following criteria:
• X.keywords contain adsorbate,
• Y.keywords contain surface,
• Z.keywords contain surface + adsorbate,
• X.ads = Z.ads, and
• Y.surface = Z.surface.

This list can be supplied with more conditions, such as to ensure that the three calculations in question are compatible in terms of system size, convergence, and other parameters. Based on all the discovered triplets of calculations, the agent builds a new table in the database with the chemisorption information. This agent could run periodically and create a table that’s accessible to all users.

An additional available feature for the analysis is the possibility of grouping calculations together. This could connect the three calculations involved in a calculation of the chemisorption energy permanently. Later, we’ll describe how this works.

To illustrate the CMR system’s combined functionality, Figure 2 provides a system overview. The data might come from different electronic structure codes to be uploaded in the repository using the intermediate db-file format. During and after the upload, the data are analyzed by a set of agents based on the available taxonomy and folksonomy classification. The raw data and high-level data generated in tables can be accessed using the Python, PHP, or HTML/JavaScript interface.

Figure 2. CMR system overview. The main steps are the creation of the db-files, the uploading to the database, and the automatic analysis performed by the agents. (Diagram labels: Personal directory, Native format, Calculate design, Calculate on supercomputer, Db-file, Db-file repository, Python interface, Agent, Silo interface, Present data, Database, PHP/HTML interface, Webserver.)
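The chemisorption agent described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: calculations are represented as plain dictionaries rather than database rows, the function name `chemisorption_table` is hypothetical, and the energies are made-up numbers. One extra condition is added beyond the listed criteria: X must not carry the keyword surface and Y must not carry the keyword adsorbate, since otherwise a Z calculation would also qualify as its own X and Y.

```python
def chemisorption_table(calcs):
    """Build rows of E_chem = E_Z - E_X - E_Y from matching triplets.

    Each calculation is a dict with the (assumed) fields:
      keywords: set of str, energy: float in eV,
      ads: adsorbate label or None, surface: surface label or None.
    """
    # X: adsorbate in the gas phase (extra assumption: no surface keyword).
    gas = [c for c in calcs
           if "adsorbate" in c["keywords"] and "surface" not in c["keywords"]]
    # Y: clean surface (extra assumption: no adsorbate keyword).
    clean = [c for c in calcs
             if "surface" in c["keywords"] and "adsorbate" not in c["keywords"]]
    # Z: surface with the adsorbate bound to it.
    combined = [c for c in calcs
                if {"surface", "adsorbate"} <= c["keywords"]]

    rows = []
    for z in combined:
        for x in gas:
            if x["ads"] != z["ads"]:          # criterion X.ads = Z.ads
                continue
            for y in clean:
                if y["surface"] != z["surface"]:  # criterion Y.surface = Z.surface
                    continue
                rows.append({
                    "ads": z["ads"],
                    "surface": z["surface"],
                    "E_chem": z["energy"] - x["energy"] - y["energy"],
                })
    return rows


# Made-up example data; the energies are placeholders, not computed results.
calculations = [
    {"keywords": {"adsorbate"}, "ads": "CO", "surface": None, "energy": -14.5},
    {"keywords": {"surface"}, "ads": None, "surface": "Ni(111)", "energy": -250.0},
    {"keywords": {"surface", "adsorbate"}, "ads": "CO", "surface": "Ni(111)",
     "energy": -266.0},
    {"keywords": {"adsorbate"}, "ads": "NH3", "surface": None, "energy": -19.5},
]

table = chemisorption_table(calculations)
```

With these placeholder values the agent finds a single triplet (CO on Ni(111)); the NH3 gas-phase calculation is ignored because no matching Z exists. Further compatibility checks, such as system size or convergence parameters, would be additional conditions in the inner loops.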

The Inner Workings

We now discuss some of the important details of CMR’s implementation. Most of the material here isn’t necessary for general or casual CMR users to know, but is essential for the superuser who’d like to tweak system performance.

Db-Files and Cmr-Schemas

Db-files are stored in XML format. We chose XML (www.w3.org/standards/xml) for several reasons: XML files can be easily extended and verified, the content can be read by humans, the structure is highly self-explanatory, and there are XML parsers available in almost any programming language. Anyone can therefore read db-files and might relatively easily create an application utilizing them independently of CMR. Here’s an extract of an XML file that contains the output from a Grid-based projector-augmented wave method (GPAW; see https://wiki.fysik.dtu.dk/gpaw) calculation:

    ... -586.442 6 8 ... ...

The content is intuitively obvious: TotalEnergy is equal to −586.442; AtomicNumbers is an array containing 6 and 8; and the variable types are double, long_array, and long. The difference between a db-file containing data from a GPAW calculation and one containing data from a different code, such as a Dacapo calculation, is defined by the schema. The schema declares the names of the variables and types, and whether they’re optional or mandatory. The CMR schemas are used to validate the content of the db-files. This enables early detection of possible problems with the data; a db-file in which a field is expected to be an integer but is a string would fail to upload to the database.

Grouping and Generating Unique IDs

The db-files are useful beyond simply storing converted information from programs. They’re also used to store information about groups of calculations. The difference is only the schema. Using db-files, we’re able to not only identify a group’s members but also to add keywords, a description, and user-defined fields. An example would be calculating a chemisorption energy requiring three calculations. When creating this group, we add a field chemisorption_energy with the calculated result and fields with the names of the surface atoms and the adsorbate.

In a database, it’s easy to know which items belong to a group, because every item has an automatically assigned unique ID. But is it possible to transfer the data to a third party’s database and keep the same ID? No, because the third party has different data in the database, and hence the IDs are most likely already used. To circumvent this issue, we create a hash value (unique identifier) from the content of the calculation. We use the secure hash algorithm SHA-1 (http://en.wikipedia.org/wiki/SHA-1), which produces a 160-bit digest; this is sufficiently long to almost certainly avoid collisions.

The hash and the ability to add user-defined fields to groups open more options for exchanging data. The previously described group containing the chemisorption energy can be transferred to the third party without including the db-files that were used to actually determine the chemisorption energy. This saves memory, and because the group retains the reference (the unique IDs of its members), the third party can identify the omitted parts and request them if necessary at any time.
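The content-based ID and the group record can be illustrated with Python's standard `hashlib` module. The sketch below is an assumption-laden example, not CMR's implementation: how CMR serializes a calculation before hashing is not described here, so a canonical JSON dump is used as a stand-in, and the function name `content_id`, the field names, and the energies are all hypothetical.

```python
import hashlib
import json


def content_id(record):
    """Return a SHA-1 digest (40 hex digits = 160 bits) of a record's content.

    Assumption: the record is JSON-serializable; keys are sorted so that
    the same content always yields the same ID, regardless of key order.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


# Made-up calculations; the energies are placeholders, not computed results.
adsorbate = {"code": "gpaw", "AtomicNumbers": [6, 8], "TotalEnergy": -14.5}
surface = {"code": "gpaw", "AtomicNumbers": [28, 28, 28, 28], "TotalEnergy": -250.0}
combined = {"code": "gpaw", "AtomicNumbers": [28, 28, 28, 28, 6, 8],
            "TotalEnergy": -266.0}

# Same content as `surface`, written in a different key order.
surface_reordered = {"TotalEnergy": -250.0, "AtomicNumbers": [28, 28, 28, 28],
                     "code": "gpaw"}

# A group refers to its members only by their content-based IDs, so it can be
# sent to a third party without the member files themselves.
group = {
    "members": [content_id(c) for c in (adsorbate, surface, combined)],
    "keywords": ["chemisorption"],
    "chemisorption_energy": (combined["TotalEnergy"]
                             - adsorbate["TotalEnergy"]
                             - surface["TotalEnergy"]),
}
```

Because the ID depends only on content, the receiving party computes the same ID for the same calculation and can tell which members of `group` it already holds and which it still needs to request.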

Software Design and Applying CMR

People using CMR have different needs: some want to use CMR to query their data, while others would like to use the system to collect data (but upload the data to a database that arranges the data in a very specific way). Meanwhile, developers want to implement other readers for using their codes with CMR. To cope with these requirements, CMR is built in a modular way and supports various groups of plug-ins that can be easily extended:
• converters are readers of foreign file formats that convert to db-files;
• mappings define mappings of names, types, and units;
• agents are autonomous or manually started processes that analyze data;
• cmr-schemas define exactly what data will be collected from a reader and the types of variables; and
• tests test a plug-in or any other CMR functionality.

Converting a supported output file (for example, from Dacapo, Gaussian, the Grid-based Projector-Augmented Wave Method (GPAW), or the Vienna Ab initio Simulation Package (VASP)) to a db-file is simple. Arguments, keywords, and extra fields are placed in a Python dictionary and passed as an argument to the convert function:

    import cmr

    # Arguments, keywords, and extra fields for the conversion
    params = {
        "input": "example.gpw",
        "db": True,  # write the output directly to the db-file repository
        "keywords": ["gpaw", "example", "test"],
        "files": ["log.txt"],
        "description": "Vacuum convergence test",
        "vacuum": 12.5,
    }

    cmr.convert(params)

The flag "db": True means that the output should be written directly to the db-file repository. Every plug-in is expected to provide one or more tests. Tests are run when a release version is created and can be invoked by users as well to make sure the environment is set up correctly.

Screening Example

During the 2008 Center for Atomic-Scale Materials Design (CAMD) Summer School at the Technical University of Denmark (DTU), a project was carried out with the aim of identifying ternary alkali-transition metal borohydrides for hydrogen storage.2 Over the course of this program, about 100 participants completed more than 5,000 electronic structure calculations that were stored in an early version of CMR. Figure 3 shows the results, where the green box indicates the region of alloy stability and a decomposition energy that’s advantageous for hydrogen storage.2 All the data are available in our CMR at https://cmr.fysik.dtu.dk, together with data compiled in a project on solar-induced splitting of water.3

Figure 3. Energetic properties, namely alloy stability ($\Delta E_{\mathrm{alloy}}$, in eV/f.u.) and decomposition energy ($\Delta E_{\mathrm{decomp}}$, in eV/H2), of Li-, Na-, or K-transition metal borohydrides calculated during the 2008 Center for Atomic-Scale Materials Design (CAMD) Summer School.2 The most promising candidates for reversible hydrogen storage are inside the box.

The possibilities for developing new materials based on computational design (as expressed, for example, through the Materials Genome Initiative4) can be expected to drive further research collaboration within atomic-scale computations in the coming years, including at the international level. CMR provides some basic capabilities to deal with the large amounts of data that will be generated, but much further software development is to be expected in this area to ease large-scale collaborative efforts in the future.

Acknowledgments

We acknowledge financial support from the Danish Center for Scientific Computing and the Danish Council for Strategic Research’s Programme Commission on Nanoscience, Nanotechnology, Biotechnology, and IT (NABIIT). The Center for Atomic-Scale Materials Design (CAMD) is funded by the Lundbeck Foundation. The Sustainable Energy through Catalysis (Suncat) Center for Interface Science and Catalysis is funded by the US Department of Energy (DOE). Jeff Greeley acknowledges support from the Department of Energy, Office of Science, and Office of Basic Energy Sciences through the Early Career Research Program.

References
1. S.R. Bahn and K.W. Jacobsen, “An Object-Oriented Scripting Interface to a Legacy Electronic Structure Code,” Computing in Science & Eng., vol. 4, no. 3, 2002, pp. 56–66.
2. J.S. Hummelshøj, “Density Functional Theory-Based Screening of Ternary Alkali-Transition Metal Borohydrides: A Computational Material Design Project,” J. Chemical Physics, vol. 131, no. 1, 2009; http://dx.doi.org/10.1063/1.3148892.
3. I.E. Castelli et al., “Computational Screening of Perovskite Metal Oxides for Optimal Solar Light Capture,” Energy & Environmental Science, vol. 5, 2012, pp. 5814–5819; doi:10.1039/C1EE02717D.
4. “Materials Genome Initiative for Global Competitiveness,” white paper, Group on Advanced Materials, June 2011; www.whitehouse.gov/sites/default/files/microsites/ostp/materials_genome_initiative-final.pdf.

David D. Landis is a PhD student at the Center for Atomic-Scale Materials Design (CAMD) at the Department of Physics, Technical University of Denmark. His research interests include data collection and software engineering. Landis has an MS in computer science from ETH Zurich, Switzerland. Contact him at [email protected].

Jens S. Hummelshøj is an associate staff scientist at the SUNCAT Center for Interface Science and Catalysis at SLAC National Accelerator Laboratory and Stanford University. His research interests include the theoretical description of solid materials for energy storage at the atomic level and scientific data warehousing. Hummelshøj has a PhD in theoretical physics from the Technical University of Denmark. Contact him at [email protected].

Svetlozar Nestorov is a senior research associate at the Computation Institute at the University of Chicago. His research interests include data warehousing, data mining, high-performance distributed computing, and large-scale crowdsourcing applications. Nestorov has a PhD in computer science from Stanford University. Contact him at [email protected].

Jeff Greeley is a staff scientist in the theory and modeling group of the Argonne National Laboratory’s Center for Nanoscale Materials. His research interests are in the use of periodic density functional theory calculations to model and design nanoscale heterogeneous catalysts, electrocatalysts, and batteries. Greeley has a PhD in chemical engineering from the University of Wisconsin-Madison. He’s a member of the American Institute of Chemical Engineers, the American Chemical Society, and the Electrochemical Society. Contact him at [email protected].

Marcin Dułak is a computer engineer at the Department of Physics, Technical University of Denmark. Dułak has a PhD in chemistry from the University of Geneva, Switzerland. Contact him at Marcin.Dulak@fysik.dtu.dk.

Thomas Bligaard is a senior staff scientist and deputy director for theory at the SUNCAT Center for Interface Science and Catalysis at the SLAC National Accelerator Laboratory and Stanford University. His research interests include catalysis informatics, exchange-correlation functionals for improved accuracy in density functional theory, optimization methods, data mining, free energy sampling, and numerical algorithms. Bligaard has a PhD in theoretical physics from the Technical University of Denmark. Contact him at [email protected].

November/December 2012


Jens K. Nørskov is the Leland T. Edwards Professor of Engineering and a professor of chemical engineering and photon science at Stanford University, and he is the director of the SUNCAT Center for Interface Science and Catalysis at SLAC National Accelerator Laboratory and Stanford University. His research interests include the theoretical description of surfaces, heterogeneous catalysis, materials, nanostructures, and biomolecules. Nørskov has a PhD in theoretical physics from Århus University. Contact him at [email protected].

Karsten W. Jacobsen is the director of CAMD and a professor of theoretical physics at the Department of Physics, Technical University of Denmark. His research interests include electronic-structure theory, large-scale simulations, and computational materials design. Jacobsen has a PhD in physics from the University of Copenhagen, Denmark. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.
