Computer Physics Communications 171 (2005) 133–153 www.elsevier.com/locate/cpc
GaussDal: An open source database management system for quantum chemical computations ✩ Bjørn K. Alsberg ∗ , Håvard Bjerke, Gunn M. Navestad, Per-Olof Åstrand Department of Chemistry, Norwegian University of Science and Technology (NTNU), Division of Physical Chemistry, Realfagbygget, Høyskoleringen 5d, N-7491 Trondheim, Norway Received 5 January 2005; accepted 13 April 2005 Available online 13 June 2005
Abstract

An open source software system called GaussDal for management of results from quantum chemical computations is presented. Chemical data contained in output files from different quantum chemical programs are automatically extracted and incorporated into a relational database (PostgreSQL). The Structured Query Language (SQL) is used to extract combinations of chemical properties (e.g., molecules, orbitals, thermo-chemical properties, basis sets, etc.) into data tables for further data analysis, processing and visualization. This type of data management is particularly suited for projects involving a large number of molecules. The current version of GaussDal supports parsers for Gaussian and Dalton output files; future versions may also include parsers for other quantum chemical programs. For visualization and analysis of data tables generated by GaussDal we have used the locally developed open source software SciCraft.

Program summary

Title of program: GaussDal
Catalogue identifier: ADVT
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADVT
Program obtainable from: CPC Program Library, Queen’s University of Belfast, N. Ireland
Computers: Any
Operating system under which the system has been tested: Linux
Programming language used: Python
Memory required to execute with typical data: 256 MB
No. of bits in word: 32 or 64
No. of processors used: 1

✩ This paper and its associated computer program are available via the Computer Physics Communications homepage on ScienceDirect (http://www.sciencedirect.com/science/journal/00104655).
* Corresponding author. E-mail address:
[email protected] (B.K. Alsberg).
0010-4655/$ – see front matter © 2005 Published by Elsevier B.V. doi:10.1016/j.cpc.2005.04.008
B.K. Alsberg et al. / Computer Physics Communications 171 (2005) 133–153
Has the code been vectorized or parallelized?: No
No. of lines in distributed program, including test data, etc: 543 531
No. of bytes in distribution program, including test data, etc: 7 718 121
Distribution format: tar.gzip file
Nature of physical problem: Handling of large amounts of data from quantum chemistry computations.
Method of solution: Use of SQL based database and quantum chemistry software specific parsers.
Restriction on the complexity of the problem: Program is currently limited to Gaussian and Dalton output, but expandable to other formats. Generates subsets of multiple data tables from output files.

© 2005 Published by Elsevier B.V.
PACS: 07.05.Kf
Keywords: Quantum chemistry; Data management; Structured query language (SQL); Database
1. Introduction

Quantum chemistry (QC) is increasingly becoming a powerful tool for chemists and biologists to study the properties of molecules. This is possible thanks to the rapid development of better computer hardware and improved quantum chemical methods. When QC software is used to study molecules, a large number of properties are computed. Depending on the type of calculation performed, large output files are generated that contain information about the wavefunction, its energies, vibrational modes, xyz-coordinates, thermodynamic properties, etc.

A consequence of the better availability of QC methods is a rising number of chemists and biologists using these methods in their work. As more projects are started that involve a substantial number of molecules to be analyzed, the demand for better and more efficient management tools for QC data will intensify. QC computation on large sets of molecules is particularly relevant within the fields of drug design and quantitative structure-activity relationships, where many molecular structures are compared. In such fields there is an increased use of quantum chemistry methods to extract suitable molecular descriptors [1]. Descriptors may also be atomic coordinates, orbital energies, atomic charges and thermodynamic properties.

The common approach to management of QC data used today is to:
• use multiple directories for storing various data categories,
• carefully select file names to indicate type of molecules and treatments,
• create local scripts for parsing and file handling.

However, as project sizes and the number of projects increase, this is not the most efficient approach when we want to perform:
• Extraction of multiple data tables from different files.
• Exploration of different combinations of data tables.
• Working in multiple projects.

Many professional quantum chemists create their own scripts or programs for handling the information processing suitable for them. However, this is not optimal for a wider group of users of QC programs. Local scripts/programs tend to be very limited in their scope and are usually not suitable for solving a wide range of problems. In addition, local scripts often suffer from other drawbacks such as lack of documentation, support, user group, expandability and accessibility. GaussDal attempts to solve these problems by creating a general open source framework for data management of QC results where extracted data tables can easily be further processed and analyzed. As an example data analysis system we use SciCraft [2], which is developed in our group and has an open source [3] license.
2. Software requirements

The design and implementation of GaussDal are guided by a set of principles and ideas important for the type of software envisioned. Some of the keywords that identify these guiding principles are:
• Open source. An important requirement for GaussDal is that it has an open source license. Too much chemistry-related software is closed or restricted, which limits its availability to investigators and the possibility of future expansions and modifications. Open source is a growing trend in academic software which in many ways reflects the true spirit of science.
• Modularity. For future expansion and maintenance it is important that each functional part of the system is as independent as possible.
• Flexibility. The system should enable easy and effective access to stored data such that a wide range of data combinations is possible. “Data combinations” is here interpreted as the collection of various combinations of data items from a database.
• Accessibility. Increasingly, investigators collaborate on joint projects or have similar research problems. The collaborators may be located in the same research group or across countries. In these circumstances there is a need to share and access data effectively. Thus, in addition to providing a database system for local use, GaussDal should also provide access through the world-wide-web.
• Security. It should be possible for users of the system to impose security restrictions on the database if desired. This is particularly important for commercial investigators who need to protect their data.

2.1. Related software

There are existing software implementations that try to solve some of the management problems mentioned above, such as MolWorks (http://www.molworks.com) and Thor from Daylight Chemical Information Systems (http://www.daylight.com/). However, all these systems are commercial software and subject to strict licensing requirements.

There are two open source projects related to GaussDal: OpenChem Workbench (http://www.mines.edu/research/ccre/) and GaussSum (http://gausssum.sourceforge.net/). The OpenChem Workbench software appears to have some of the properties desired for GaussDal; however, inspection of the web site indicates that the project has had little activity in recent years and has not progressed to a sufficiently mature state. The GaussSum program is basically a parser for Gaussian output files and does not contain a database for handling the results, which is central to GaussDal.

Within the field of molecular mechanics simulations, there is a software package which follows several of the ideas used for GaussDal. This is the BioSimGrid database (http://www.biosimgrid.org/), which allows contributors in a GRID computation network to share results from their computations. This is a centralized approach to data management, in contrast to GaussDal, which can run on the local machine of each user to ensure security and flexibility in the use of the program. In addition, GaussDal is also able to work over the Internet in a centralized mode.

3. Design and construction

3.1. Overall structure of GaussDal

GaussDal is composed of two main types of modules:
• A parser module where a set of parsers for different QC programs is located. Currently GaussDal contains only two parsers, one for Gaussian [4] and one for Dalton [5]. In future versions of GaussDal, parsers for other popular QC programs such as GAMESS [6], Jaguar [7] and MOLCAS [8] may be included.
• A database module where the results from the parsed files are stored in a relational database.
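To make the separation between parsers and the rest of the system concrete, the following is a minimal sketch of how a set of program-specific parsers could be organized in Python. It is not the GaussDal implementation: the registry `PARSERS`, the decorator `register` and the function names are hypothetical, and the only file-format knowledge used is the "SCF Done" line that Gaussian prints for a converged SCF energy.

```python
import re

# Hypothetical registry: maps a QC program name to a parser function.
PARSERS = {}


def register(program):
    """Decorator that adds a parser function to the registry."""
    def wrap(func):
        PARSERS[program] = func
        return func
    return wrap


# Gaussian reports converged SCF energies on lines such as
#  "SCF Done:  E(RHF) =  -76.0098687150     A.U. after   10 cycles"
_SCF_DONE = re.compile(r"SCF Done:\s+E\(([^)]+)\)\s*=\s*(-?\d+\.\d+)")


@register("gaussian")
def parse_gaussian(text):
    """Return the method label and the last SCF energy found in the text."""
    method, energy = None, None
    for match in _SCF_DONE.finditer(text):
        method, energy = match.group(1), float(match.group(2))
    return {"method": method, "total_energy": energy}


def parse_output(program, text):
    """Dispatch to the parser registered for the given program."""
    try:
        parser = PARSERS[program]
    except KeyError:
        raise ValueError(f"no parser registered for {program!r}")
    return parser(text)
```

With this layout, supporting another QC program would amount to registering one more function, which is the kind of independence between functional parts that the modularity requirement asks for.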
GaussDal has two operational phases:
• A parsing phase where individual output files are analyzed and information is extracted. This operation may be performed one or many times. Thus GaussDal can be used dynamically during a project as well as at the end of it.
• A querying phase where Structured Query Language (SQL) commands are sent to the database. These commands combine different sources of data from one or several projects into a single data table. The querying of the database can be performed either in command line mode or through a world-wide-web interface.

3.2. Database structure

The database structure is independent of the type of parser used. It is assumed that the underlying physical structure is the same regardless of which software is used to calculate the physical properties. However, there may be properties that are calculated by only a small number of QC software packages. The basic design of the database reflects an attempt to mirror the relationships between conceptual objects such as atoms, molecules, wavefunctions, basis sets, experiments and various physical properties. Some of the relations used are (bold face indicates database tables in GaussDal):
• A Project contains Experiments.
• An Experiment consists of Computations.
• A Computation contains Molecules.
• A Molecule has Atoms.
• A Molecule has Symmetry, Dipole-moment and Total-energy.
• An Atom has Coordinates, Type, Charge, and Basis set.

This is not a complete list, but it illustrates the main organization of the information produced by QC calculations. The database structure is shown in Fig. 1.

3.3. Technology decisions

Python [9,10] was chosen as the computer language for GaussDal since it has the following properties:
• Open source,
• Platform independent,
• Large method library,
• Easy to use for non-programmers.

For the relational database the open source software PostgreSQL [11] was used. It was chosen over MySQL [12] because it supports schemas, subset selection and authentication methods.

3.4. Visualization and data analysis

GaussDal is constructed to be a stand-alone application whose output can be further processed by a wide range of programs for visualization and data analysis. Here we will focus on one program in particular, SciCraft [2]. It is developed locally [3] and uses an open source license.
Fig. 1. Overview of the GaussDal database structure for storing molecular properties, showing some of the major data tables and how they are connected.
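The containment relations listed in Section 3.2 can be illustrated with a small relational schema. The sketch below is an assumption-laden illustration, not the actual GaussDal schema: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, the table and column names for molecule, atom, coordinate and totalenergy follow the query shown later in Section 6.1.3, and all other names (projectid, experimentid, computationid, name) are hypothetical.

```python
import sqlite3

# Hypothetical schema mirroring Project -> Experiment -> Computation
# -> Molecule -> Atom -> Coordinate. Only the columns needed to follow
# the chain of relations are included.
SCHEMA = """
CREATE TABLE project     (projectid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE experiment  (experimentid INTEGER PRIMARY KEY,
                          projectid INTEGER REFERENCES project(projectid));
CREATE TABLE computation (computationid INTEGER PRIMARY KEY,
                          experimentid INTEGER REFERENCES experiment(experimentid));
CREATE TABLE molecule    (moleculeid INTEGER PRIMARY KEY,
                          computationid INTEGER REFERENCES computation(computationid));
CREATE TABLE totalenergy (moleculeid INTEGER REFERENCES molecule(moleculeid),
                          totalenergy REAL);
CREATE TABLE atom        (atomid INTEGER PRIMARY KEY,
                          moleculeid INTEGER REFERENCES molecule(moleculeid),
                          atomicnumber INTEGER);
CREATE TABLE coordinate  (coordinateid INTEGER PRIMARY KEY,
                          atomid INTEGER REFERENCES atom(atomid),
                          x REAL, y REAL, z REAL);
"""


def build_database():
    """Create an empty in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def atoms_in_project(conn, projectid):
    """Count the atoms reachable from a project by following the relations."""
    row = conn.execute(
        """SELECT COUNT(*)
           FROM atom
           JOIN molecule    ON atom.moleculeid = molecule.moleculeid
           JOIN computation ON molecule.computationid = computation.computationid
           JOIN experiment  ON computation.experimentid = experiment.experimentid
           WHERE experiment.projectid = ?""",
        (projectid,),
    ).fetchone()
    return row[0]
```

Because each table only stores a reference to its parent, data from several projects can live in the same database and still be separated again at query time.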
3.5. The SciCraft system

3.5.1. Integration

SciCraft is open source software [13,14] designed to enable flexibility with respect to integration of different technologies. To achieve this, SciCraft works as a front-end to several numerical “engines” that actually perform the data handling and processing. The properties of expandability and accessibility are obtained by selecting high level language “engines” such as Matlab, Octave, Mathematica or R. As SciCraft insists on an open source philosophy, only open source languages are used (R, Octave, Python, etc.). Stand-alone programs written in, e.g., C/C++ or Java can also be used; however, we have focused on high level languages since they make implementation of new data analytical methods much more efficient. All communication between SciCraft and these “engines” is performed through plug-ins.

3.5.2. Visual programming

Another important aspect of SciCraft is the use of an intuitive graphical user interface (GUI) based on a visual programming environment (VPE) [15–18]. Computer programs are here represented as diagrams which consist of nodes and links (connection lines). Each node represents a method or an operator and each link shows the flow of data. A link is displayed as an arrow to indicate the direction of the data flow, see Fig. 2. The VPE is a natural choice for data analysis purposes, as tasks can often be regarded as a flow of data through different filters or operators.

3.5.3. The GaussDal node

For SciCraft we have constructed a special node which allows the user to extract directly the information from output files that is stored in the GaussDal database. This node has several features which are useful for larger projects involving comparison between many molecules:
Fig. 2. The graphical user interface of the SciCraft system. It employs a visual programming environment (VPE) in which programs are built as diagrams: nodes are methods or operators and links represent the flow of data between the nodes.
• An interface which allows the selection of which molecular variables are to be used (e.g., xyz-coordinates, charges, energies, etc.).
• A 3D molecular editor which allows the user to determine a set of reference atoms in a reference molecule and to manually identify the comparable atoms for each molecule in the data set. If the index vector of comparable atoms between each molecule and the reference molecule is known beforehand, it can also be supplied to the node, which then skips the manual atom selection. The selection process is not necessary for calculations involving only one molecule with, e.g., different xyz-coordinates, since the atom indexing will be comparable from one configuration to the next.
• Given that comparable atoms have been identified, the node optionally applies 3D Procrustes analysis [19] to the selected atoms to perform a rigid rotation and translation that aligns the comparable atoms in the molecules. The result of this alignment is a transformation matrix which is subsequently applied to all atoms in each molecule. The Procrustes operation may also be useful for calculations on a single molecule.
• When all molecules are transformed into a comparable coordinate system, the GaussDal node collects the relevant data for each molecule and stores them as rows in a data matrix. This data matrix then forms the basis for subsequent data analysis.

Typically each row vector in this matrix describes the structure of a molecule using, e.g., features such as:

\[ [\, x_1 \;\; y_1 \;\; z_1 \;\; \cdots \;\; x_n \;\; y_n \;\; z_n \;\; \mathrm{property}_{11} \;\; \cdots \;\; \mathrm{property}_{n1} \;\; \cdots \;\; \mathrm{property}_{nm} \,] \tag{1} \]

for m different properties and n different comparable atoms in every molecule. Properties are both local and global, where local ones describe structural features (such as atomic arrangements, distances, angles, etc.) and global ones are valid for the whole molecule (such as molecular volume, thermodynamic variables, etc.). Local properties must be comparable from one molecule to the next. Usually this is achieved by finding comparable atoms/structures with respect to a reference molecule. The vector containing all the local and global properties for a molecule is here referred to as a structure vector. The user interface for the GaussDal node in SciCraft is shown in Fig. 3.

Fig. 3. The graphical user interface for the SciCraft GaussDal node which enables the user to perform manual mapping of comparable atoms in a set of molecules. It also selects which atomic properties are passed to the final description of each molecule.
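The alignment and row-building steps described above can be sketched with NumPy. This is only an illustration under stated assumptions, not the code of the GaussDal node: the rigid alignment is computed with the standard SVD-based solution of the orthogonal Procrustes problem restricted to proper rotations (often called the Kabsch algorithm), and the function names `procrustes_transform` and `structure_vector` are hypothetical.

```python
import numpy as np


def procrustes_transform(ref, mov):
    """Rigid transform (R, t) that best maps the points in mov onto ref.

    ref, mov: (k, 3) arrays holding the k comparable atoms of the reference
    molecule and of the molecule to be aligned, in matching order.
    The aligned coordinates are obtained as points @ R.T + t.
    """
    ref = np.asarray(ref, dtype=float)
    mov = np.asarray(mov, dtype=float)
    ref_c = ref.mean(axis=0)
    mov_c = mov.mean(axis=0)
    # Covariance between the centered point sets.
    H = (mov - mov_c).T @ (ref - ref_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = ref_c - R @ mov_c
    return R, t


def structure_vector(coords, properties):
    """Concatenate x1 y1 z1 ... xn yn zn with the selected properties."""
    coords = np.asarray(coords, dtype=float)
    return np.concatenate([coords.ravel(), np.asarray(properties, dtype=float)])
```

The transform is estimated from the comparable atoms only and then applied to every atom of the molecule, so each aligned molecule contributes one row to the data matrix via `structure_vector(all_atoms @ R.T + t, properties)`.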
4. The SQL language

The Structured Query Language (SQL) provides a query language for displaying, inserting and manipulating the data in a relational database. The following section gives a short introduction to SQL since it may not be familiar to most chemists and physicists. Most contemporary Database Management Systems (DBMS) utilize the concept of relational database systems. Relations are tables that present a list of one or more objects in terms of attributes. One or more of these attributes may link an object to another object in order to present its relation to it. An object instance is a table row, in database terms called a record or, more formally, a tuple. Examples of relational DBMSs are PostgreSQL, Oracle, MySQL and Microsoft SQL Server.

4.1. Relations

A relation is a table or a list of tuples containing attributes of an object. A simple example which stores the results of different computations/experiments on three simple molecules (oxygen, ethane and water) is shown in Table 1, where the name, the basis set used in the QC computation, the total energy and an experiment identifier are stored for each molecule.

Table 1
A relation, “MolExp”

Molname   Basis    Total energy      Experiment
Oxygen    STO-3G   −147.578511032    11a
Ethane    3-21G    −77.6009880571    23b
Water     6-31G*   −76.0098687150    24b

In order to avoid storing data redundantly across tables, relational database systems relate two or more tables to each other by a common attribute. These attributes do not need to have the same name, since rows of different tables are linked together by the values of their attributes. Instead of storing data common to several records of a table in each relevant row, we may create a new table in order to concentrate the data, and relate the instances to each other by a common attribute. In our example we may want to know the atom types associated with each molecule. To do this we create the relation “AtomTypes”, see Table 2, where we store the atom names contained in each molecule. The common attribute between Tables 1 and 2 is “Molname”, and we are able to make queries to the database about information shared in the two tables linked by the common attribute.

Table 2
A relation, “AtomTypes”

Atom   Molname
O      Oxygen
O      Water
H      Water
H      Ethane
C      Ethane

4.2. Structured queries

A database consisting of only two tables is not representative of a realistic situation. In a typical database, a table may consist of hundreds or thousands of rows, and the relations between the records often form large and complex networks. The SQL language was created to automate the task of searching and combining data.

4.2.1. Some basic SQL commands

The four basic SQL commands that can be used to display, insert, manipulate and delete data in relational databases are, respectively, SELECT, INSERT, UPDATE and DELETE. In addition we have the clauses FROM and WHERE, which control the sources of the different data tables and the relationships between the selected tables. The complete syntax of SQL commands is beyond the scope of this article; however, a subset is presented together with its semantics to facilitate the understanding of how GaussDal is used.
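As a small illustration of the four basic commands, the sketch below applies them to the “MolExp” relation of Table 1. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, which is an assumption made only to keep the example self-contained. The column name `energy` for the total energy follows the queries used in this section, and the experiment label "25c" is a made-up value.

```python
import sqlite3


def demo_basic_commands():
    """Run INSERT, UPDATE, DELETE and SELECT on a copy of Table 1."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE MolExp (molname TEXT, basis TEXT, energy REAL, experiment TEXT)"
    )
    # INSERT: add the three records of Table 1.
    cur.executemany(
        "INSERT INTO MolExp VALUES (?, ?, ?, ?)",
        [
            ("Oxygen", "STO-3G", -147.578511032, "11a"),
            ("Ethane", "3-21G", -77.6009880571, "23b"),
            ("Water", "6-31G*", -76.0098687150, "24b"),
        ],
    )
    # UPDATE: change an attribute of an existing record (hypothetical label).
    cur.execute("UPDATE MolExp SET experiment = ? WHERE molname = ?", ("25c", "Water"))
    # DELETE: remove a record.
    cur.execute("DELETE FROM MolExp WHERE molname = ?", ("Oxygen",))
    # SELECT: display what is left.
    rows = cur.execute(
        "SELECT molname, experiment FROM MolExp ORDER BY molname"
    ).fetchall()
    conn.close()
    return rows
```

The question marks are parameter placeholders of the Python database interface; the SQL commands themselves have the same form as when they are typed into an interactive terminal.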
Table 3
The result of a SQL query

AtomTypes.atom   MolExp.basis   Experiment
O                STO-3G         11a
O                6-31G*         24b
H                6-31G*         24b
H                3-21G          23b
C                3-21G          23b
An SQL command is case insensitive and consists of, among others, keywords, identifiers, constants and operators. Keywords are often written in upper case in order to separate them from identifiers, which are written in lower case. An SQL command is terminated with a semicolon, e.g.:

    SELECT molname FROM MolExp;

Using our relations “AtomTypes” and “MolExp” from above we can create a query which returns a table showing the different basis sets used for each atom type in each experiment:

    SELECT AtomTypes.atom, MolExp.basis, MolExp.experiment
    FROM AtomTypes, MolExp
    WHERE AtomTypes.molname = MolExp.molname;

The resulting table is shown in Table 3. We have used the = operator in the previous example; other comparison operators such as > and < can also be used to match data. In addition, operators such as +, -, * and / can be used to perform arithmetic operations on data. The next query uses a comparison operator to restrict the result to total energies above -77:

    SELECT AtomTypes.atom, MolExp.molname, MolExp.energy, MolExp.experiment
    FROM AtomTypes, MolExp
    WHERE MolExp.energy > -77 AND AtomTypes.molname = MolExp.molname;

This query returns the following when entered in PostgreSQL’s interactive terminal, psql:

     atom | molname |  Total energy  | Experiment
    ------+---------+----------------+------------
     H    | Water   | -76.0098687150 | 24b
     O    | Water   | -76.0098687150 | 24b
    ------+---------+----------------+------------
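The two join queries above can be tried out without a PostgreSQL installation. The following sketch rebuilds Tables 1 and 2 in an in-memory SQLite database through Python's sqlite3 module. SQLite is used here purely as a convenient substitute, the helper names are hypothetical, and an ORDER BY clause has been added so that the row order is well defined.

```python
import sqlite3

MOLEXP = [
    ("Oxygen", "STO-3G", -147.578511032, "11a"),
    ("Ethane", "3-21G", -77.6009880571, "23b"),
    ("Water", "6-31G*", -76.0098687150, "24b"),
]
ATOMTYPES = [
    ("O", "Oxygen"), ("O", "Water"), ("H", "Water"),
    ("H", "Ethane"), ("C", "Ethane"),
]


def build_example():
    """Create the MolExp and AtomTypes relations of Tables 1 and 2."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE MolExp (molname TEXT, basis TEXT, energy REAL, experiment TEXT)"
    )
    conn.execute("CREATE TABLE AtomTypes (atom TEXT, molname TEXT)")
    conn.executemany("INSERT INTO MolExp VALUES (?, ?, ?, ?)", MOLEXP)
    conn.executemany("INSERT INTO AtomTypes VALUES (?, ?)", ATOMTYPES)
    return conn


def basis_per_atom(conn):
    """Basis set used for each atom type in each experiment (cf. Table 3)."""
    return conn.execute(
        """SELECT AtomTypes.atom, MolExp.basis, MolExp.experiment
           FROM AtomTypes, MolExp
           WHERE AtomTypes.molname = MolExp.molname
           ORDER BY MolExp.experiment, AtomTypes.atom"""
    ).fetchall()


def atoms_above(conn, threshold):
    """Atoms of all molecules whose total energy exceeds the threshold."""
    return conn.execute(
        """SELECT AtomTypes.atom, MolExp.molname, MolExp.energy
           FROM AtomTypes, MolExp
           WHERE MolExp.energy > ? AND AtomTypes.molname = MolExp.molname
           ORDER BY AtomTypes.atom""",
        (threshold,),
    ).fetchall()
```

Calling `atoms_above(build_example(), -77)` returns only the two atoms of water, since water is the only molecule in Table 1 with a total energy above -77.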
5. GaussDal web solution

In addition to working from the command line, it is also possible to use GaussDal over the Internet.

5.1. Web interface

GaussDal has a web interface which lets the user submit computations to a GaussDal database through a web browser. Together with the PostgreSQL web client phpPgAdmin [20], GaussDal can be used through a web browser such that databases with QC results in remote locations can be accessed. This functionality is particularly useful for maintaining collaboration between research groups in different parts of the world.
Fig. 4. View of a GaussDal database when logged into the phpPgAdmin web interface.
5.1.1. Submitting computations

Computations are submitted through the GaussDal web interface by uploading the output files that are to be parsed. An interactive GaussDal database example can be inspected at the address http://www.scicraft.org/?cat=8&page=28. By pressing the “Browse” button the user may browse through the files on the local computer. When the “Upload” button is clicked the computation is uploaded to the server and the data are parsed into the SQL database.

5.2. phpPgAdmin

The phpPgAdmin database interface [20] is more intuitive and easier to use than psql since it allows point-and-click browsing of the database, and yet it is as powerful as psql in that it allows SQL queries to be entered manually. Features of the phpPgAdmin interface enable the user to access the databases without needing to remember structural details about them. Fig. 4 shows the view of a GaussDal database when logged into the phpPgAdmin web interface. Navigation through different databases is achieved using the left-side tree representation, and tables may be selected for further inspection or modification. When a table is selected the user is presented with meta-information about the selected table, see Fig. 5. If, e.g., the table “molecule” is selected the user may be presented with a view similar to Fig. 6. From here the user can browse the data in the table or proceed with operations that correspond to the standard SQL operations, such as SELECT, UPDATE and INSERT. By following the “Browse” hyperlink the user may browse through the data stored in the table, as shown in Fig. 7. The tables can also be exported to different formats, including XML and CSV. Fig. 8 shows how these records are edited directly through descriptive input fields by clicking on the corresponding “Edit” button.

Since the guided user interfaces provided by phpPgAdmin may not suffice for more elaborate queries, phpPgAdmin also allows the user to execute SQL queries as if using psql, as shown in Fig. 9. All SQL queries that are valid in psql are also valid in this interface. Fig. 10 shows the results of the query. The administration of user privileges is also handled through phpPgAdmin, where user accounts are easily created and managed. For instance, privileges such as SELECT or INSERT can be granted to or revoked from users individually. Those who upload their data may then allow certain users to inspect or even modify the database while denying others access to it.

Fig. 5. View of meta-information about the different GaussDal tables.

6. Application examples

In the following, two example applications of GaussDal are presented to illustrate how the software can be used to extract information from multiple output files generated by quantum chemistry computations.

6.1. Force field parametrization

6.1.1. Description of data set

A force field is a representation of molecular energies and intermolecular interaction energies in terms of an analytical function and atom-type or atom-pair parameters [21]. Perhaps the most common force-field expression for the energy, V, is the site–site Lennard-Jones potential,

\[ V = \sum_{I,J>I} \left[ \frac{q_I q_J}{R_{IJ}} + 4\pi\varepsilon_{IJ} \left( \left( \frac{\sigma_{IJ}}{R_{IJ}} \right)^{12} - \left( \frac{\sigma_{IJ}}{R_{IJ}} \right)^{6} \right) \right], \tag{2} \]
Fig. 6. SciCraft GaussDal node for handling mapping of comparable atoms and selecting what atomic properties to use.
Fig. 7. Web interface for browsing through stored data in a GaussDal database.
where the summation over $I$ and $J$ runs over all atom pairs, $q_I$ and $q_J$ are atomic charges representing the classical electrostatics, and $\varepsilon_{IJ}$ and $\sigma_{IJ}$ are atom-pair van der Waals parameters describing both the short-range repulsion and the attractive dispersion energy. The major advantage of force fields is that the energy is calculated in a fraction of the time required by quantum chemical calculations, and they are therefore suitable in molecular simulations where the energies and forces are calculated repeatedly [22].
Fig. 8. Demonstration of how database content can be edited.
Fig. 9. How SQL commands can be entered into the GaussDal web interface.
The construction of a force field consists of two steps. First, an analytical function for the energy is chosen that models the different energy contributions, and secondly, the atom-type parameters have to be determined. Furthermore, force fields may be parametrized either from experimental data (empirical force fields) or from quantum chemical calculations of molecular energies. A parametrization from quantum chemical calculations has several advantages, one being that it avoids the limitation that empirical force fields are only valid within the range of the experimental data adopted in the parametrization. However, an explicit polarization term has to be included in a force field based on first principles to account properly for liquid properties [23].
Fig. 10. The result of an SQL query submitted through the GaussDal web interface (see also Fig. 9).
The intramolecular energy in molecules is normally parametrized from a Taylor expansion around the equilibrium geometry in a set of internal coordinates, $q_I$, such as bond lengths, bond angles and torsional angles [24]. The energy expansion becomes

\[ V(q) = V_0 + \sum_{I} a_I q_I + \frac{1}{2} \sum_{I,J} b_{IJ} q_I q_J + \cdots, \tag{3} \]

where $V_0$ defines the zero-level of the energy, and the gradient, $a_I = \partial V / \partial q_I$, is zero if the equilibrium geometry is chosen as the expansion point. In this work, we parametrized the intramolecular potential surface for H$_3$O$^+$ from a set of 304 quantum chemical calculations where the molecular geometry was varied. The six intramolecular coordinates, $q_I$, are chosen in terms of the three O–H bond lengths, $r_i$, and the three H–O–H bond angles, $\theta_{ij}$, where $i$ and $j$ denote hydrogen atoms. To model the bond stretches, a Simons–Parr–Finlan (SPF) expansion [25] was adopted for each O–H bond length, $r_i$,

\[ q_I = \frac{r_i - r_e}{r_i}, \quad I = 1, 2, 3, \tag{4} \]

where $r_e$ is the equilibrium bond length. The advantage of an SPF expansion compared to a regular expansion in the deviation of the bond lengths is that it behaves correctly at large distances. Similarly, for the three bond angles, $\theta_{ij}$, we use the following expansion

\[ q_I = \frac{\theta_{ij} - \theta_e}{r_i + r_j}, \quad I = 4, 5, 6, \tag{5} \]
where $\theta_e$ is the equilibrium bond angle. The denominator including the O–H bond lengths ensures that the term vanishes in the limit of dissociation. A second order expansion for H$_3$O$^+$ thus becomes

\[ V(q) = \frac{1}{2} b_{s} \sum_{I=1}^{3} q_I^2 + b_{ss} \sum_{I=1}^{3} \sum_{J>I}^{3} q_I q_J + b_{sb} \sum_{I=1}^{3} \sum_{J=4}^{6} q_I q_J + b_{bb} \sum_{I=4}^{6} \sum_{J>I}^{6} q_I q_J + \frac{1}{2} b_{b} \sum_{I=4}^{6} q_I^2, \tag{6} \]

where on the right-hand side we in turn have the harmonic stretching term, the cross-term between different stretching motions, the cross-term between stretching and bending motions, the bending cross-terms, and the harmonic bending term. For a set of calculations of the energy at different geometries, the energy of calculation $m$ can be written as

\[ V_m = V_0 + \sum_{k=1}^{5} B_k Q_{km}, \tag{7} \]

where $B_k$ are the second-order parameters in Eq. (6) and $Q_{km}$ contains the geometry distortions. $B_k$ and $V_0$ may thus be determined by linear regression.

6.1.2. QC calculations

The complete active space self-consistent field (CASSCF) method has been adopted for the quantum chemical calculations [26]. An active space of 8 electrons in 8 orbitals has been chosen and the 1s electrons have been kept inactive. The aug-cc-pVTZ basis set has been adopted in all calculations [27]. The program package Dalton was used [5]. The basic strategy for varying the molecular geometry was to stretch one O–H bond towards dissociation while varying the other degrees of freedom around the equilibrium values. The increments in the variation around the equilibrium were 0.05 Å for the bond lengths and 7.5° for the bond angles. The angles between the dissociating hydrogen atom and the “water molecule plane” were chosen to be 0°, 30°, and 60°. In total, 304 calculations were generated.

6.1.3. The SQL query

In this query we want to use GaussDal to extract the xyz-coordinates of all atoms at all selected points on the potential energy surface. In addition, the total energy must be extracted for each point. The SQL query is as follows:

    SELECT DISTINCT ON (atom.atomid)
           molecule.moleculeid, atom.atomicnumber,
           coordinate.x, coordinate.y, coordinate.z,
           totalenergy.totalenergy
    FROM coordinate, atom, molecule, totalenergy
    WHERE coordinate.atomid = atom.atomid
      AND atom.moleculeid = molecule.moleculeid
      AND totalenergy.moleculeid = molecule.moleculeid
    ORDER BY atom.atomid, coordinate.coordinateid DESC;
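Once the coordinates have been extracted, they have to be converted into the internal coordinates of Eqs. (4) and (5). The sketch below shows one way this could be done with NumPy for a single H3O+ geometry. It rests on several assumptions that are not taken from the paper: the oxygen atom is the first row of the coordinate array, the angles are measured in radians, the three angle coordinates are ordered as the hydrogen pairs (1,2), (1,3), (2,3), and the function name `internal_coordinates` is hypothetical. The equilibrium values `r_e` and `theta_e` must be supplied by the user.

```python
import numpy as np


def internal_coordinates(coords, r_e, theta_e):
    """Return q_1..q_6 for one H3O+ geometry.

    coords: (4, 3) array, row 0 is O and rows 1..3 are the H atoms (assumed).
    r_e, theta_e: equilibrium O-H bond length and H-O-H angle (radians).
    """
    coords = np.asarray(coords, dtype=float)
    bonds = coords[1:] - coords[0]          # O->H bond vectors
    r = np.linalg.norm(bonds, axis=1)       # O-H bond lengths r_i
    q_stretch = (r - r_e) / r               # Eq. (4), I = 1, 2, 3
    q_bend = []
    for i, j in [(0, 1), (0, 2), (1, 2)]:   # assumed ordering of the angles
        cos_theta = np.dot(bonds[i], bonds[j]) / (r[i] * r[j])
        theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
        q_bend.append((theta - theta_e) / (r[i] + r[j]))  # Eq. (5), I = 4, 5, 6
    return np.concatenate([q_stretch, np.array(q_bend)])
```

At the equilibrium geometry all six coordinates vanish, and a stretched bond of length $2 r_e$ gives the stretch coordinate 0.5, which illustrates how the SPF form stays bounded as a bond dissociates.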
6.1.4. Data analysis with SciCraft To find the regression coefficients (in Eq. (6)) bs , bss , bsb , bbb , bb (contained in the vector b) from data calculated by QC (here the “measured data”), we need to create an X matrix in the linear regression equation y = Xb + e. y contains the total energy calculated by QC methods. An Octave program (here referred to as “MLRmod.m”) was created to calculate the qI and qJ coordinates from the data given by GaussDal. This program also performed multiple linear regression (MLR) to estimate b using the formula: bˆ = (XT X)−1 Xy. The SciCraft module diagram used is shown in Fig. 11. The leftmost node reads the output file generated by GaussDal and sends the data to the GaussDal node. In this case the node uses the default mapping since the comparability between atoms in every
Fig. 11. This shows the SciCraft module diagram for the force field parametrization example.
Fig. 12. This shows a SciCraft plot of the measured versus estimated energies in the force field parametrization example.
molecular file is known. The output from the GaussDal node is then sent to the “MLRmod.m” program, which uses the regression model to produce a vector of estimated energies ŷ. The estimated and “measured” energies are then sent to a 2D plotting node, where they are plotted against each other; see Fig. 12. In the lower energy range the model appears to be adequate, but at higher energies the data contain variation which cannot be accounted for by the chosen model.
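The regression step can be sketched as follows. This is not the authors' “MLRmod.m” Octave program but a minimal pure-Python illustration of the normal-equations estimate $\hat{b} = (X^{T}X)^{-1}X^{T}y$; the helper names and the toy data are assumptions made for this example. The linear system is solved by Gaussian elimination instead of forming the inverse explicitly.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; X is rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mlr_fit(X, y):
    """Estimate b from the normal equations (X^T X) b = X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


def mlr_predict(X, b):
    """Return the estimated responses y_hat = X b."""
    return [sum(xi * bi for xi, bi in zip(row, b)) for row in X]


# Invented toy data generated from y = 2*q1 - 3*q2 without noise.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, -3.0, -1.0, 1.0]
b_hat = mlr_fit(X, y)
y_hat = mlr_predict(X, b_hat)
```

Plotting y_hat against y gives the kind of measured-versus-estimated comparison shown in Fig. 12; for noise-free toy data such as this, all points fall on the diagonal.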
6.2. Electrophilic addition to propene

6.2.1. Description of data set

This example is based on an intrinsic reaction coordinate (IRC) [28] analysis of the electrophilic addition of hydrochloric acid to propene. In general, addition of HX (where X is a halogen) to an asymmetric alkene may give two products, depending on which carbon the hydrogen is added to. These kinetically controlled additions obey Markovnikov’s rule [29], which in its simplest form states that the hydrogen adds to the ethylenic carbon that carries more hydrogen atoms. The rate-determining step is believed to be the formation of an intermediate carbocation by protonation. The usual rationalization of Markovnikov’s rule is based on the stability of this intermediate. The anti-Markovnikov (AM) reaction is disfavored since its reaction path involves a primary carbocation, whereas the Markovnikov (M) reaction path involves a more stable secondary carbocation. To study the differences between these two reactions, they were simulated at multiple points along the reaction pathway, starting from the transition state. In total, 163 different geometries were generated and collected into GaussDal. Once in GaussDal, it was straightforward to extract the xyz-coordinates and the atomic charges for the atoms at every reaction step for further analysis.

6.2.2. QC calculations

The reaction pathway calculations were carried out by first locating the transition states and then integrating the intrinsic reaction coordinate (IRC) equation [28] in both the forward and backward directions, using the Gonzales and Schlegel second-order method [30,31]. The step size used was 0.3 (amu)^{1/2} bohr. At each sampled point along the reaction coordinate, a single-point energy calculation was performed at the density functional theory (DFT) level (B3LYP functional) with a 6-31G** basis set, using the Gaussian98 program [4]. More details about the calculations for this data set can be found in [32].

6.2.3. The SQL query

In this application we want to extract the xyz-coordinates and charges for each atom, together with a global property such as the total energy, for each of the geometries along the reaction paths of the Markovnikov and anti-Markovnikov reactions. The SQL query is:

    SELECT molecule.moleculeid, atom.atomicnumber,
           coordinate.x, coordinate.y, coordinate.z,
           atomiccharge.charge, totalenergy.totalenergy
    FROM molecule, atom, coordinate, atomiccharge, computation, totalenergy
    WHERE molecule.moleculeid = atom.moleculeid
      AND coordinate.atomid = atom.atomid
      AND atomiccharge.atomid = atom.atomid
      AND coordinate.type = 'standard orientation'
      AND computation.computationid = molecule.computationid
      AND molecule.moleculeid = totalenergy.moleculeid;
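To show how the joins in this query behave, the sketch below runs it against a tiny in-memory SQLite database using Python's standard sqlite3 module. GaussDal itself uses PostgreSQL; SQLite is substituted here only so that the example is self-contained. The table definitions are a simplified assumption containing just the columns named in the query, the 'input orientation' type value and all numbers are invented, and an ORDER BY clause has been appended to make the row order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified stand-in schema (an assumption, not the real GaussDal schema).
cur.executescript("""
CREATE TABLE computation (computationid INTEGER PRIMARY KEY);
CREATE TABLE molecule (moleculeid INTEGER PRIMARY KEY, computationid INTEGER);
CREATE TABLE atom (atomid INTEGER PRIMARY KEY, moleculeid INTEGER,
                   atomicnumber INTEGER);
CREATE TABLE coordinate (coordinateid INTEGER PRIMARY KEY, atomid INTEGER,
                         type TEXT, x REAL, y REAL, z REAL);
CREATE TABLE atomiccharge (atomid INTEGER, charge REAL);
CREATE TABLE totalenergy (moleculeid INTEGER, totalenergy REAL);
""")

# Invented toy content: one molecule with two atoms.
cur.execute("INSERT INTO computation VALUES (1)")
cur.execute("INSERT INTO molecule VALUES (1, 1)")
cur.executemany("INSERT INTO atom VALUES (?, ?, ?)", [(1, 1, 6), (2, 1, 1)])
cur.executemany(
    "INSERT INTO coordinate VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "input orientation", 9.0, 9.0, 9.0),  # filtered out by the query
        (2, 1, "standard orientation", 0.0, 0.0, 0.0),
        (3, 2, "standard orientation", 0.0, 0.0, 1.1),
    ],
)
cur.executemany("INSERT INTO atomiccharge VALUES (?, ?)", [(1, -0.2), (2, 0.2)])
cur.execute("INSERT INTO totalenergy VALUES (1, -100.5)")

query = """
SELECT molecule.moleculeid, atom.atomicnumber,
       coordinate.x, coordinate.y, coordinate.z,
       atomiccharge.charge, totalenergy.totalenergy
FROM molecule, atom, coordinate, atomiccharge, computation, totalenergy
WHERE molecule.moleculeid = atom.moleculeid
  AND coordinate.atomid = atom.atomid
  AND atomiccharge.atomid = atom.atomid
  AND coordinate.type = 'standard orientation'
  AND computation.computationid = molecule.computationid
  AND molecule.moleculeid = totalenergy.moleculeid
ORDER BY atom.atomid
"""
rows = cur.execute(query).fetchall()
conn.close()
```

Each returned row describes one atom, with the molecule-level total energy repeated on every row of the same molecule.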
6.2.4. Data analysis with SciCraft

In this example we know which atoms are comparable from one point along the reaction path to another. To avoid problems with relative shifts and rotations of the structures, the interatomic distances are used in the statistical analysis rather than the original Cartesian coordinates. Only the atoms in the propene molecule were included in the study. The additional information included in the structure vector was the charge of each atom. Once the input structure vectors are ready, a wide variety of data analytical tools are available for inspecting the data set. Here we use principal component analysis (PCA) [19,33]. In PCA we find a new set of latent variables, also referred to as principal components, that point in the directions of maximum variance in the data. PCA
Fig. 13. A SciCraft module diagram showing the data analysis pipeline for the QSAR example.
is thus a powerful method for compressing data and inspecting higher-dimensional spaces of correlated variables. Geometrically, PCA performs a rotation of the original multidimensional space such that each successive new axis points in the direction of maximum remaining variance in the data set. The SciCraft module diagram for the PCA of this data set is shown in Fig. 13. The results of the analysis are shown in Fig. 14. The left plot is a score plot of the first two principal components. Points at the left of this plot correspond to the product, whereas points at the right correspond to the reactants. The upper arc of points is from the Markovnikov reaction and the lower arc is from the anti-Markovnikov reaction. The right-hand plot shows the loadings, which can be used to determine which structural features are important for discriminating between the two reactions. The first component captures the difference between reactant and product states, whereas the second component captures the major difference between the Markovnikov and anti-Markovnikov reactions. A more detailed analysis of this system is beyond the scope of this article and can be found in [32].
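The two ideas used in this analysis, distance-based structure vectors and PCA, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SciCraft pipeline: the function names are hypothetical, the three-atom "reaction path" and its charges are invented, and the first principal component is obtained by simple power iteration instead of a full decomposition.

```python
import math
from itertools import combinations


def structure_vector(coords, charges):
    """Pairwise interatomic distances followed by the atomic charges."""
    distances = [math.dist(a, b) for a, b in combinations(coords, 2)]
    return distances + list(charges)


def rotate_z90_and_shift(coords, shift=(1.0, -2.0, 0.5)):
    """Rotate by 90 degrees about z and translate (to test invariance)."""
    return [(-y + shift[0], x + shift[1], z + shift[2]) for x, y, z in coords]


def center(rows):
    """Subtract the column means so PCA describes variance about the mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def first_principal_component(rows, iterations=200):
    """Power iteration on X^T X for mean-centred X; returns (loading, scores)."""
    X = center(rows)
    w = [1.0] * len(X[0])
    for _ in range(iterations):
        scores = [sum(x * wi for x, wi in zip(row, w)) for row in X]
        w_new = [sum(row[j] * s for row, s in zip(X, scores))
                 for j in range(len(w))]
        norm = math.sqrt(sum(v * v for v in w_new))
        if norm == 0.0:
            raise ValueError("no variance along the starting direction")
        w = [v / norm for v in w_new]
    scores = [sum(x * wi for x, wi in zip(row, w)) for row in X]
    return w, scores


# Invented toy path: atom C moves away from a fixed A-B pair while its
# charge changes; values are for illustration only.
path = []
for step, d in enumerate([1.0, 1.5, 2.0, 2.5]):
    coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, d, 0.0)]
    charges = [0.2, 0.1, -0.1 - 0.1 * step]
    path.append((coords, charges))

vectors = [structure_vector(c, q) for c, q in path]
loading, scores = first_principal_component(vectors)

# The same geometry after a rigid rotation and shift gives the same vector.
moved = structure_vector(rotate_z90_and_shift(path[0][0]), path[0][1])
```

Because only interatomic distances enter the structure vector, the rigidly moved copy of the first geometry is indistinguishable from the original, which is the reason for avoiding raw coordinates. The scores order the toy geometries along the path, analogous to the reactant-to-product ordering along the first component in Fig. 14.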
7. Discussion

GaussDal allows the user to combine different data structures rapidly using a database controlled by the SQL language. Many investigators today accomplish similar data combinations by programming specialized Perl, Python or shell scripts for their particular applications. However, such scripts are often too tailored to a particular investigator’s needs to be practical for a wider community of users. A more general approach, as found in GaussDal, is therefore more broadly useful. GaussDal is designed to be part of a system where the following operations are integrated:
• Experimental design,
• Theoretical calculations (quantum, molecular mechanics, QSAR/QSPR properties, etc.),
• Data extraction and collection,
Fig. 14. The PCA scores and loadings plot from analysis of the IRC data set.
• Data preprocessing,
• Data analysis,
• Scientific visualization.

GaussDal together with SciCraft offers the possibility of keeping all these operations controlled from a single environment. It has been recognized that too much time is spent on the management of these steps and how they interact. Of particular interest in this case are the data analysis and visualization steps which should not be viewed as end tasks in a long project, but as part of a larger investigative loop.
8. Output extract

The following contains typical tables from GaussDal output files (data from the “force field parametrization” example in text):

 molid |       x        |       y       | z |   totalenergy
-------+----------------+---------------+---+------------------
  2032 |    0.094733102 |             0 | 0 | -76.429209345311
  2032 |  -0.7013791674 | -1.7138465847 | 0 | -76.429209345311
  2032 |   1.9844289872 |  -0.010691014 | 0 | -76.429209345311
  2032 |  -1.3777829217 |  1.7245375987 | 0 | -76.429209345311
  2033 |   0.6734192537 |             0 | 0 | -76.241486172257
  2033 |   1.0643756105 | -1.8140608254 | 0 | -76.241486172257
  2033 |   2.2138841342 |  1.0347132384 | 0 | -76.241486172257
  2033 |  -3.9516789985 |   0.779347587 | 0 | -76.241486172257
  2034 |    1.299573641 |             0 | 0 | -76.159000554771
  2034 |   2.2989220806 | -1.2137872543 | 0 | -76.159000554771
  2034 |   1.6745033077 |  1.5268937565 | 0 | -76.159000554771
  2034 |  -5.2729990293 | -0.3131065021 | 0 | -76.159000554771
  2035 |   2.4799514692 |             0 | 0 |  -76.14571518411
  2035 |   3.4515362889 | -1.2361228663 | 0 |  -76.14571518411
  2035 |   2.8893874669 |  1.5180049283 | 0 |  -76.14571518411
  2035 |   -8.820875225 |  -0.281882062 | 0 |  -76.14571518411
  2036 |   4.8417329414 |             0 | 0 | -76.142110556372
  2036 |      5.7982078 | -1.2478512121 | 0 | -76.142110556372
  2036 |   5.2696059395 |  1.5129115956 | 0 | -76.142110556372
  2036 | -15.9095466809 | -0.2650603835 | 0 | -76.142110556372
  2037 |  11.9279712779 |             0 | 0 | -76.140957067808
  2037 |  12.8749498914 | -1.2550730289 | 0 | -76.140957067808
  2037 |  12.3673119476 |  1.5096213348 | 0 | -76.140957067808
  2037 | -37.1702331169 | -0.2545483059 | 0 | -76.140957067808
  2038 |  23.7386798775 |             0 | 0 | -76.140778502767
  2038 |  24.6824179846 |  -1.257511506 | 0 | -76.140778502767
  2038 |  24.1819130279 |  1.5084830701 | 0 | -76.140778502767
  2038 | -72.6030108899 | -0.2509715641 | 0 | -76.140778502767
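A “|”-separated table of this kind, with the columns molid, x, y, z and totalenergy, can be read back into per-molecule records before further analysis. The following Python sketch is one possible way to do this and is not part of GaussDal; the function names are hypothetical, and the sample text reuses three rows from the extract above.

```python
def parse_psql_table(text):
    """Parse a '|'-separated text table into a list of dicts keyed by header."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = [name.strip() for name in lines[0].split("|")]
    records = []
    for line in lines[1:]:
        if set(line.strip()) <= set("-+"):
            continue  # separator line between header and data
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) != len(header):
            continue  # skip footers such as a row-count line
        records.append(dict(zip(header, cells)))
    return records


def geometries_by_molecule(records):
    """Group atom rows by molid; the energy is taken from the first row seen."""
    molecules = {}
    for rec in records:
        entry = molecules.setdefault(
            int(rec["molid"]),
            {"energy": float(rec["totalenergy"]), "xyz": []},
        )
        entry["xyz"].append((float(rec["x"]), float(rec["y"]), float(rec["z"])))
    return molecules


sample = """
 molid |       x       |       y       | z |   totalenergy
-------+---------------+---------------+---+------------------
  2032 |   0.094733102 |             0 | 0 | -76.429209345311
  2032 | -0.7013791674 | -1.7138465847 | 0 | -76.429209345311
  2033 |  0.6734192537 |             0 | 0 | -76.241486172257
"""
records = parse_psql_table(sample)
molecules = geometries_by_molecule(records)
```

The resulting dictionary holds one entry per molecule with its total energy and a list of atomic positions, which is a convenient starting point for building the X matrix and y vector of the regression example.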
References

[1] M. Karelson, Molecular Descriptors in QSAR/QSPR, Wiley and Sons, 2000.
[2] B.K. Alsberg, L. Kirkhus, T. Tangstad, E. Anderssen, Data analysis of microarrays using SciCraft, in: Knowledge Exploration in Life Science Informatics, Proceedings, in: Lecture Notes in Artif. Intell., vol. 3303, Springer, Berlin, 2004, pp. 58–68.
[3] B.K. Alsberg, L. Kirkhus, T. Tangstad, SciCraft - a general data analysis tool, http://www.scicraft.org/.
[4] M.J. Frisch, G.W. Trucks, H.B. Schlegel, G.E. Scuseria, M.A. Robb, J.R. Cheeseman, V.G. Zakrzewski, J.A. Montgomery Jr., R.E. Stratmann, J.C. Burant, S. Dapprich, J.M. Millam, A.D. Daniels, K.N. Kudin, M.C. Strain, O. Farkas, J. Tomasi, V. Barone, M. Cossi, R. Cammi, B. Mennucci, C. Pomelli, C. Adamo, S. Clifford, J. Ochterski, G.A. Petersson, P.Y. Ayala, Q. Cui, K. Morokuma, D.K. Malick, A.D. Rabuck, K. Raghavachari, J.B. Foresman, J. Cioslowski, J.V. Ortiz, A.G. Baboul, B.B. Stefanov, G. Liu, A. Liashenko, P. Piskorz, I. Komaromi, R. Gomperts, R.L. Martin, D.J. Fox, T. Keith, M.A. Al-Laham, C.Y. Peng, A. Nanayakkara, C. Gonzalez, M. Challacombe, P.M.W. Gill, B. Johnson, W. Chen, M.W. Wong, J.L. Andres, C. Gonzalez, M. Head-Gordon, E.S. Replogle, J.A. Pople, Gaussian 98, revision a.7, Gaussian, Inc., Pittsburgh, PA, 1998.
[5] T. Helgaker, H.J. Aa. Jensen, P. Joergensen, J. Olsen, K. Ruud, H. Aagren, A.A. Auer, K.L. Bak, V. Bakken, O. Christiansen, S. Coriani, P. Dahle, E.K. Dalskov, T. Enevoldsen, B. Fernandez, C. Haettig, K. Hald, A. Halkier, H. Heiberg, H. Hettema, D. Jonsson, S. Kirpekar, R. Kobayashi, H. Koch, K.V. Mikkelsen, P. Norman, M.J. Packer, T.B. Pedersen, T.A. Ruden, A. Sanchez, T. Saue, S.P.A. Sauer, B. Schimmelpfennig, K.O. Sylvester-Hvid, P.R. Taylor, O. Vahtras, Dalton, a molecular electronic structure program, release 1.2.
[6] M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.H. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S.J. Su, T.L. Windus, M. Dupuis, J.A. Montgomery, General atomic and molecular electronic-structure system, J. Comput. Chem. 14 (11) (1993) 1347–1363.
[7] D.M. Philipp, R.A. Friesner, Mixed ab initio qm/mm modeling using frozen orbitals and tests with alanine dipeptide and tetrapeptide, J. Comput. Chem. 20 (14) (1999) 1468–1494.
[8] G. Karlstrom, R. Lindh, P.A. Malmqvist, B.O. Roos, U. Ryde, V. Veryazov, P.O. Widmark, M. Cossi, B. Schimmelpfennig, P. Neogrady, L. Seijo, MOLCAS: a program package for computational chemistry, Comput. Mater. Sci. 28 (2) (2003) 222–239.
[9] The Python Project, http://www.python.org, 2003.
[10] D.M. Beazley, G. Van Rossum, Python Essential Reference, second ed., 2001.
[11] R. Stones, N. Matthew, Beginning Databases with PostgreSQL, Wrox Press, 2001.
[12] P. DuBois, MySQL, Sams, 2003.
[13] R. Stallman, GNU general Public License, http://www.gnu.org/copyleft/gpl.html, 2003.
[14] E.S. Raymond, The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary, revised ed., O’Reilly and Associates, 2001.
[15] J.M. Maubach, W. Drenth, Data-flow oriented visual programming libraries for scientific computing, in: Lecture Notes in Comput. Sci., vol. 2329, Springer, Berlin, 2002, pp. 429–438.
[16] M. Takatsuka, M. Gahegan, Geovista studio: a codeless visual programming environment for geoscientific data analysis and visualization, Comput. Geosci. 28 (10) (2002) 1131–1144.
[17] D. Spinellis, Unix tools as visual programming components in a GUI-Builder environment, Softw.-Pract. Exp. 32 (1) (2002) 57–71.
[18] M. Acacio, O. Canovas, J.M. Garcia, P.E. Lopez-de Teruel, MPI-Delphi: an MPI implementation for visual programming environments and heterogeneous computing, Futur. Gener. Comp. Syst. 18 (3) (2002) 317–333.
[19] D.L. Massart, B.G.M. Vandeginste, L.M.C. Buydens, S.D. Jong, P.J. Lewi, J. Verbeke-Smeyers, Handbook of Chemometrics and Qualimetrics: Parts A and B, Elsevier Science, 1997.
[20] D. Wilson, C. Kings-Lynne, T. Ratschiller, R. Casson, M. Barrus, B. Budnick, http://phppgadmin.sourceforge.net, 2004.
[21] A.R. Leach, Molecular Modelling: Principles and Applications, second ed., Prentice-Hall, 2001.
[22] M.P. Allen, D.S. Tildesley, Computer Simulations of Liquids, Clarendon, Oxford, 1987.
[23] O. Engkvist, P.-O. Åstrand, G. Karlström, Accurate intermolecular potentials obtained from molecular wave functions: Bridging the gap between quantum chemistry and molecular simulations, Chem. Rev. 100 (2000) 4087–4108.
[24] A.K. Rappé, C.J. Casewit, Molecular Mechanics Across Chemistry, University Science Books, Sausalito, 1997.
[25] G. Simons, R.G. Parr, J.M. Finlan, New alternative to the Dunham potential for diatomic molecules, J. Chem. Phys. 59 (1973) 3229–3234.
[26] B.O. Roos, The complete active space self-consistent field method and its applications in electronic structure calculations, Adv. Chem. Phys. 69 (1987) 399–445.
[27] R.A. Kendall, T.H. Dunning Jr., R.J. Harrison, Electron affinities of the first-row atoms revisited. Systematic basis sets and wave functions, J. Chem. Phys. 96 (1992) 6796–6806.
[28] K. Fukui, The path of chemical reactions - the IRC approach, Acc. Chem. Res. 14 (1981) 363.
[29] J. McMurry, Organic Chemistry, fourth ed., Brooks/Cole Publishing Company, 1996.
[30] C. Gonzales, H.B. Schlegel, Reaction-path following in mass-weighted internal coordinates, J. Phys. Chem. 94 (1990) 5523–5527.
[31] C. Gonzales, H.B. Schlegel, Improved algorithms for reaction-path following - higher-order implicit algorithms, J. Chem. Phys. 95 (1991) 5853–5860.
[32] B.K. Alsberg, V.R. Jensen, K.J. Børve, Use of multivariate methods in the analysis of calculated reaction pathways, J. Comput. Chem. 17 (10) (1996) 1197–1216.
[33] H. Martens, T. Naes, Multivariate Calibration, John Wiley & Sons, New York, 1989.