
Novel Mathematical Programming Model for Computer Aided Molecular Design (in Industrial and Engineering Chemistry Research)

Nachiket Churi

Luke E. K. Achenie

Department of Chemical Engineering, U-222, University of Connecticut, 191 Auditorium Road, Storrs, CT 06269. Copyright © Luke E. K. Achenie & Nachiket Churi, September 20, 1998

Abstract

The synthesis of new compounds by computer aided molecular design approaches has received considerable attention recently due mainly to the promise of reducing the time and effort required using traditional empirical approaches. These conventional approaches have involved trial and error methods in which a large number of compounds are synthesized and tested in a laboratory. This paper describes a general mathematical programming model for designing compounds that have pre-specified performance characteristics. This formulation is novel in that it gives nearly complete information about the molecular structure, and is therefore able to exploit accurate property prediction methods that require such information. The computer aided molecular design strategy is able to generate a set of promising candidate compounds for experimental evaluation. The new formulation is illustrated with a refrigerant design case study.

1 Introduction

There is an ever-increasing need to design and utilize new and innovative compounds to replace a number of existing compounds that are deemed to be unsuitable for industrial use for various environmental and economic reasons. The search for new compounds usually involves a combination of design heuristics, prior experience and direct experimental studies. While there is no substitute for experimental study, there is a definite need for a systematic methodology which will generate new molecules that are promising enough to be considered for experimental verification. Computer Aided Molecular Design is one such methodology.

Computer Aided Molecular Design (CAMD) is a reverse engineering procedure which generates molecules that conform to various physical property requirements that are specified a priori. Various investigators have successfully developed CAMD approaches using group contribution methods, notably Macchietto et al. (1990), Gani et al. (1991), Venkatasubramanian et al. (1994), Vaidyanathan and El-Halwagi (1996) and Duvedi and Achenie (1996) amongst others. Macchietto et al. (1990) approached the CAMD problem by evaluating feasible molecules formed from a starting set of groups implicitly within the framework of a constrained optimization problem. Gani et al. (1991) have outlined a group-contribution based CAMD approach involving preselection of groups, generation of feasible molecules, and finally, rating the generated molecules in terms of a performance index. Venkatasubramanian et al. (1994) applied a combinatorial optimization method to polymer design using genetic algorithms for property prediction. Vaidyanathan and El-Halwagi (1996) have applied a CAMD procedure to polymer design using various structural and thermodynamic constraints for monomers and the corresponding addition and condensation polymers. This procedure uses the number of occurrences of UNIFAC groups in the monomer and the repeat unit to predict the polymer's properties.

Duvedi and Achenie (1996) developed a mixed-integer non-linear programming approach to CAMD in which integer variables are employed to give the number of times a particular group occurs in the molecule. This approach effectively removes the limitations of other approaches. However, it is limited in terms of the amount of information that can be obtained. The approach is unable to specify the connectivity between the groups, which is required to fully specify a molecule and to take advantage of various accurate property prediction methods.

CAMD has been hampered by the limited availability of accurate physical property prediction models. In recent years, however, a number of group contribution techniques have been developed that are accurate enough to be used in CAMD (Reid et al., 1987; Joback and Reid, 1987; Lai et al., 1987). Some of these methods are able to predict properties to within 5% of the experimental values. The method proposed by Constantinou and Gani (1994) is also able to predict property differences in isomers. This method uses second-order groups that are formed by combinations of first-order groups and whose contributions act as corrections to first-order contributions.

In this paper, a systematic Mixed Integer Non-Linear Programming (MINLP) approach to CAMD is described that is able to take advantage of various new group contribution techniques such as the one by Constantinou and Gani (1994). The new formulation uses discrete variables that give structural and connectivity information, and is thus able to give a nearly complete description of the molecule. This method is able to distinguish between isomers, can exploit accurate property prediction methods, and gives the designer complete control over the type of constraints to be introduced for the application of interest. This novel methodology is described here with reference to refrigerant design.

2 MINLP Formulation

2.1 The Basis Set

Initially, we consider a starting set of distinct structural groups. By structural groups we mean groups of connected atoms formed by a combination of functional groups, and whose net valence is at least 1. Examples of structural groups include CH3-, CH2F-, CHCl=, etc. The selection of the basis set depends on the intended application and availability of accurate group contribution techniques for predicting the properties of interest.

2.2 Variables

Let us define the following parameters:

$m$ = number of structural groups in the basis set

$v_k$ = valence of the $k$-th group in the basis set

$s_{\max}$ = maximum valence of all the groups in the basis set

$n$ = number of groups in a molecule

$n_{\max}$ = maximum number of structural groups in a molecule

We then specify a basis set of $m$ structural groups having valencies $\{v_k\}$, with maximal valency $s_{\max}$. The maximum number of groups allowed in a molecule is $n_{\max}$. The actual number of groups, $n$, is obtained from the solution of the mathematical programming model. Throughout the formulation, the index $k$ specifies a structural group's position in the basis set while indices $i$ and $j$ specify its position in the designed molecule. In addition, let us define the following discrete variables:

\[ u_{ik} = \begin{cases} 1 & \text{if the $i$-th group in the molecule is of the $k$-th kind} \\ 0 & \text{otherwise} \end{cases} \]

\[ z_{ijp} = \begin{cases} 1 & \text{if the $i$-th group's $j$-th site is attached to the $p$-th group} \\ 0 & \text{otherwise} \end{cases} \]

Finally, we define $w$ to be a binary vector of length $n_{\max}$. Its first $n$ terms are zero, while the remaining terms are one. This vector is introduced in the formulation to ensure that a connected molecule is formed from the structural groups.

Figure 1 explains how these variables and parameters are to be interpreted for 1,1,1-trichloro-2,2-difluoroethane. The bold numbers next to the groups are the group numbers in the molecule, and the numbers next to the bonds are the site numbers for the individual groups. A basis set of six groups ($m = 6$), namely CH3, CH2, CH, C, F and Cl, is chosen for illustration and only the non-zero terms of the variables are given. For example, the non-zero terms of $u$ include $u_{16}$, $u_{24}$ and $u_{35}$; a non-zero $u_{16}$ implies that the 1st group in the molecule is of type 6. From the basis set we can see that the 6th group is Cl. Similarly, $u_{24}$ implies that the second group is a C. At this stage, we do not distinguish between the various ways in which C can be bonded. The distinction between the various types of bonding (>C<, etc.) is available from the $z$ variables.

The $z$ variables are interpreted in a manner similar to $u$. A non-zero $z_{713}$ implies that the 7th group's 1st site is attached to the 3rd group. Since $u_{73}$ and $u_{35}$ are non-zero we know that the 7th group is of type 3 (CH) and the 3rd group is of type 5 (F). Thus a non-zero $z_{713}$ indicates a bond between CH and F. Since there are a total of 7 groups in the molecule, $n = 7$, and the first 7 terms of $w$ are zero and the remaining terms ($w_8, \dots, w_{n_{\max}}$) are non-zero. In this case, $n_{\max}$, which is specified a priori, is at least 7.

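In summary, the parameters for this example are:

\[ m = 6, \qquad v = [\,1 \;\; 2 \;\; 3 \;\; 4 \;\; 1 \;\; 1\,], \qquad s_{\max} = \max\{v_k\} = 4, \qquad n = 7 \]

Figure 1: Example Molecule. [Structure diagram of 1,1,1-trichloro-2,2-difluoroethane with numbered groups and sites not reproduced. The data shown in the figure are listed below.]

Basis set (position $k$: group): 1: CH3, 2: CH2, 3: CH, 4: C, 5: F, 6: Cl

Non-zero terms of $u$ (indices $ik$): 16, 24, 35, 46, 56, 65, 73

Non-zero terms of $z$ (indices $ijp$): 112; 217, 224, 231, 245; 317; 412; 512; 617; 713, 722, 736

Non-zero terms of $w$ (index $i$): 8, 9, ..., $n_{\max}$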
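As an illustration of how the Figure 1 variables can be read, the following Python sketch stores the non-zero terms of u and z as 1-based index tuples and lists the attachments they encode. The names `BASIS`, `group_types` and `bonds` are hypothetical helpers chosen for this example, not part of the formulation.

```python
# Basis set of Figure 1: position k -> (group name, valence v_k).
BASIS = {1: ("CH3", 1), 2: ("CH2", 2), 3: ("CH", 3),
         4: ("C", 4), 5: ("F", 1), 6: ("Cl", 1)}

# Non-zero binary variables listed in Figure 1, as 1-based index tuples.
U = {(1, 6), (2, 4), (3, 5), (4, 6), (5, 6), (6, 5), (7, 3)}            # u_ik
Z = {(1, 1, 2), (2, 1, 7), (2, 2, 4), (2, 3, 1), (2, 4, 5), (3, 1, 7),
     (4, 1, 2), (5, 1, 2), (6, 1, 7), (7, 1, 3), (7, 2, 2), (7, 3, 6)}  # z_ijp


def group_types(u):
    """Map each group position i in the molecule to its group name."""
    return {i: BASIS[k][0] for (i, k) in u}


def bonds(z):
    """Return every attachment once, as a sorted pair (i, p) with i < p."""
    return sorted({tuple(sorted((i, p))) for (i, _j, p) in z})


types = group_types(U)
for i, p in bonds(Z):
    print(f"group {i} ({types[i]}) -- group {p} ({types[p]})")
```

Each bond appears twice in z (once from each end), so the sketch collapses the two entries into a single unordered pair before printing.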

2.3 Structural Feasibility Constraints

The purpose of the structural feasibility constraints is to generate molecules that do not violate basic feasibility criteria such as the Octet Rule. The molecules should not have any unattached sites or multiple bonds attached to the same site. Our basic philosophy here is to develop a model that is linear in the integer variables. Using the variables defined above, these criteria are expressed as follows:

\[ \sum_{i=1}^{n_{\max}} \sum_{k=1}^{m} u_{ik} \le n_{\max} \tag{1} \]

\[ \sum_{p=1}^{n_{\max}} \sum_{j=1}^{s_{\max}} z_{ijp} = \sum_{k=1}^{m} u_{ik} v_k, \qquad i = 1, \dots, n_{\max} \tag{2} \]

\[ \sum_{p=1}^{i-1} \sum_{j=1}^{s_{\max}} z_{ijp} \ge 1 - w_i, \qquad i = 2, \dots, n_{\max} \tag{3} \]

\[ \sum_{i=1}^{n_{\max}} \sum_{k=1}^{m} u_{ik} + \sum_{i=1}^{n_{\max}} w_i = n_{\max} \tag{4} \]

\[ w_1 = 0 \tag{5} \]

\[ w_i - w_{i+1} \le 0, \qquad i = 1, \dots, (n_{\max} - 1) \tag{6} \]

\[ \sum_{j=v_k}^{s_{\max}} \sum_{p=1}^{n_{\max}} z_{ijp} - \sum_{p=1}^{n_{\max}} z_{i v_k p} + M (u_{ik} - 1) \le 0, \qquad i = 1, \dots, n_{\max}; \; k = 1, \dots, m \tag{7} \]

\[ \sum_{j=1}^{s_{\max}} z_{ijp} = \sum_{j=1}^{s_{\max}} z_{pji}, \qquad i = 1, \dots, (n_{\max} - 1); \; p = (i+1), \dots, n_{\max} \tag{8} \]

\[ \sum_{p=1}^{n_{\max}} z_{ijp} \le 1, \qquad i = 1, \dots, n_{\max}; \; j = 1, \dots, s_{\max} \tag{9} \]

\[ \sum_{k=1}^{m} u_{ik} - \sum_{k=1}^{m} u_{i-1,k} \le 0, \qquad i = 2, \dots, n_{\max} \tag{10} \]

Equation 1 puts a limit on the number of structural groups that can constitute a molecule. The double summation on the left-hand side is equal to $n$, the actual number of groups present in the molecule. Since a minimum of two groups is required to form a molecule and an upper limit of $n_{\max}$ is specified, $2 \le n \le n_{\max}$. Equation 2 is an implementation of the Octet Rule. The expression on the left gives the number of bonds attached to the $i$-th group, while that on the right states that if the $i$-th group is of the $k$-th type, its valence is $v_k$. This ensures that the number of bonds attached to a group equals the valence of the group.

Equations 3, 4, 5 and 6 ensure that only one molecule is formed. This is realized by constraining the $i$-th group to be attached to one of the groups before it, that is, groups 1 to $(i - 1)$ (Equation 3). Thus the second group has to be attached to the first group, the third group has to be attached to either of the first two groups, and so on. The first group is always present (Equation 5), and the $(i + 1)$-th group is present only if the $i$-th group is present (Equation 6).

Equation 7 has to be introduced to account for different valencies of the groups. This equation is a linear analog of the non-linear equation

\[ u_{ik} \sum_{j=v_k+1}^{s_{\max}} \sum_{p=1}^{n_{\max}} z_{ijp} = 0, \qquad i = 1, \dots, n_{\max}; \; k = 1, \dots, m \tag{11} \]

which states that if the $i$-th group is of the $k$-th kind, then that group should not have any connections for its sites $(v_k + 1)$ to $s_{\max}$, which are non-existent. $M$ is a number that is significantly larger than all other terms in the equation. Equation 7 is written in a form that is convenient from a computer programming point of view.

Equation 8 is the symmetry constraint. Hence if a group is attached to a second group, the second group is automatically attached to the first one. Equation 9 ensures that a group's site can be attached at most once to some other group. Equation 10 ensures that the $i$-th group can exist only if the $(i - 1)$-th group is present. The structural feasibility constraints (Equations 1 to 10) are linear; hence they bound a convex (polyhedral) region that separates feasible molecular structures from infeasible ones.
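To make the role of each structural constraint concrete, the following Python sketch checks a candidate assignment of u, z and w against Equations 1 to 10 and reports which ones are violated. It is a minimal illustration, not the authors' implementation: the equation forms follow the descriptions given in the text above, the function name `check_structure` is hypothetical, and `big_m = 100` is an arbitrary stand-in for M. The test molecule is CH3-CH2-F with a basis set of CH3, CH2 and F.

```python
def check_structure(u, z, w, v, n_max, s_max, big_m=100):
    """Return the sorted numbers of the violated equations (empty if none).

    u: set of (i, k) with u_ik = 1;  z: set of (i, j, p) with z_ijp = 1
    w: dict i -> 0/1;  v: dict k -> valence v_k  (all indices 1-based)
    """
    groups = range(1, n_max + 1)
    sites = range(1, s_max + 1)

    def U(i, k):
        return int((i, k) in u)

    def Z(i, j, p):
        return int((i, j, p) in z)

    def present(i):            # sum over k of u_ik
        return sum(U(i, k) for k in v)

    def bonds(i, p):           # sum over j of z_ijp
        return sum(Z(i, j, p) for j in sites)

    n = sum(present(i) for i in groups)
    bad = set()
    if n > n_max:
        bad.add(1)
    if n + sum(w[i] for i in groups) != n_max:
        bad.add(4)
    if w[1] != 0:
        bad.add(5)
    for i in groups:
        if sum(bonds(i, p) for p in groups) != sum(U(i, k) * v[k] for k in v):
            bad.add(2)
        if i >= 2 and sum(bonds(i, p) for p in range(1, i)) < 1 - w[i]:
            bad.add(3)
        if i < n_max and w[i] - w[i + 1] > 0:
            bad.add(6)
        for k in v:
            upper = sum(Z(i, j, p) for j in range(v[k], s_max + 1) for p in groups)
            at_vk = sum(Z(i, v[k], p) for p in groups)
            if upper - at_vk + big_m * (U(i, k) - 1) > 0:
                bad.add(7)
        for p in range(i + 1, n_max + 1):
            if bonds(i, p) != bonds(p, i):
                bad.add(8)
        for j in sites:
            if sum(Z(i, j, p) for p in groups) > 1:
                bad.add(9)
        if i >= 2 and present(i) - present(i - 1) > 0:
            bad.add(10)
    return sorted(bad)


# CH3-CH2-F with basis set 1: CH3, 2: CH2, 3: F and n_max = 4.
v = {1: 1, 2: 2, 3: 1}
u = {(1, 1), (2, 2), (3, 3)}
z = {(1, 1, 2), (2, 1, 1), (2, 2, 3), (3, 1, 2)}
w = {1: 0, 2: 0, 3: 0, 4: 1}
ok = check_structure(u, z, w, v, n_max=4, s_max=2)
# Dropping the F -> CH2 half of the last bond breaks valence, ordering and symmetry.
broken = check_structure(u, z - {(3, 1, 2)}, w, v, n_max=4, s_max=2)
print(ok, broken)
```

In an actual MINLP model these relations would be posted as constraints to a solver rather than checked after the fact; the checker is only meant to show what each equation tests.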

2.4 Characteristics of the Formulation

The variables used to describe the molecule give nearly complete information about the molecule's composition and connectivity. One of the advantages of this new formulation becomes apparent when one looks at the various group contribution methods. Quite a few of these methods make use of a certain amount of structural information in giving accurate predictions. For example, the group contribution method for liquid specific heat (Chueh and Swanson, 1973) requires the addition of 4.5 to the value of $C_{pl}$ for ". . . any carbon group which fulfills the following criterion: 'A carbon group which is joined by a single bond to a carbon group which is connected by a double or triple bond with a third carbon atom.'" It is obvious that this correction cannot be added without any connectivity information. A search of the open literature gave no previous CAMD formulations that were able to give the level of detailed molecular information required for the more accurate, and more complex, group contribution techniques.

Another way in which structural information proves useful is in incorporating bonding constraints. One of the desired properties of a refrigerant is its stability, i.e., the compound should not spontaneously decompose or polymerize. In many cases, polymerization can be eliminated by ensuring that the refrigerant does not have double or triple bonds. There is no systematic way of introducing this constraint mathematically in the absence of detailed structural information. With the new formulation, the constraint that has to be added to eliminate multiple bonds is:

\[ \sum_{j=1}^{s_{\max}} z_{ijp} \le 1, \qquad i = 1, \dots, (n_{\max} - 1); \; p = (i + 1), \dots, n_{\max} \tag{12} \]
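The multiple-bond condition of Equation 12 bounds the number of site connections between any pair of groups by one. A minimal Python sketch of that check, using hypothetical helper names and two small hand-built z sets (a doubly bonded pair of CH2 groups and the chain CH3-CH2-F), could look like this:

```python
from collections import Counter


def bond_orders(z):
    """Count the site connections from group i to group p (sum over j of z_ijp)."""
    return Counter((i, p) for (i, _j, p) in z)


def has_multiple_bonds(z):
    """True if some pair i < p is joined through more than one site."""
    return any(i < p and order > 1 for (i, p), order in bond_orders(z).items())


# Two CH2 groups joined through both of their sites (a double bond).
double_bonded = {(1, 1, 2), (1, 2, 2), (2, 1, 1), (2, 2, 1)}
# CH3-CH2-F: every pair of attached groups shares a single bond.
chain = {(1, 1, 2), (2, 1, 1), (2, 2, 3), (3, 1, 2)}

print(has_multiple_bonds(double_bonded), has_multiple_bonds(chain))
```

Because Equation 12 only involves sums of z over the site index j, it stays linear in the integer variables, in keeping with the rest of the structural constraints.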

If for some reason we do not want cyclic molecules, we can introduce a constraint based on the following relation, which is a variation of the well-known Euler equation in graph theory:

\[ l = b + a - n \tag{13} \]

In this equation, $l$ is the total number of independent loops, $b$ is the number of "sub-molecules" being designed, $a$ is the total number of attachments, and $n$ is the total number of groups. Obviously, $b = 1$ since we want a single connected molecule. A loop is a series of distinct groups that are attached in such a manner that one can go from one group to others, and return to the first group via a different route. A loop is independent if it cannot be described by a combination of other loops. Two groups are said to be attached if there exists at least one bond between them. These terms are explained with examples in Figure 2. The total number of attachments ($a$) and the number of groups ($n$) are given by

\[ a = \sum_{i=1}^{n_{\max}-1} \sum_{p=i+1}^{n_{\max}} c_{ip} \tag{14} \]

where

\[ c_{ip} = \begin{cases} 0 & \text{if } \sum_{j=1}^{s_{\max}} z_{ijp} = 0 \\ 1 & \text{if } \sum_{j=1}^{s_{\max}} z_{ijp} > 0 \end{cases} \qquad i = 1, \dots, (n_{\max} - 1); \; p = (i + 1), \dots, n_{\max} \]

and

\[ n = \sum_{i=1}^{n_{\max}} \sum_{k=1}^{m} u_{ik} \tag{15} \]
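The loop count of Equations 13 to 15 can be sketched in Python as follows. The function name `independent_loops` and the example molecules are assumptions made for illustration; the number of sub-molecules b is computed here with a small union-find rather than fixed at 1, so that unconnected inputs can also be examined.

```python
def independent_loops(u, z):
    """Return l = b + a - n for non-zero u (i, k) and z (i, j, p) index sets."""
    n = len(u)                                       # Eq. 15: one u_ik per group
    attached = {(min(i, p), max(i, p)) for (i, _j, p) in z}
    a = len(attached)                                # Eq. 14: pairs with c_ip = 1

    # Count sub-molecules b as connected components (union-find).
    parent = {i: i for (i, _k) in u}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, p in attached:
        parent[find(i)] = find(p)
    b = len({find(i) for i in parent})
    return b + a - n                                 # Eq. 13


# Three CH2 groups in a ring (basis set 1: CH2).
ring_u = {(1, 1), (2, 1), (3, 1)}
ring_z = {(1, 1, 2), (1, 2, 3), (2, 1, 1), (2, 2, 3), (3, 1, 1), (3, 2, 2)}
# CH3-CH2-F chain (basis set 1: CH3, 2: CH2, 3: F).
chain_u = {(1, 1), (2, 2), (3, 3)}
chain_z = {(1, 1, 2), (2, 1, 1), (2, 2, 3), (3, 1, 2)}
# Two CH2 groups joined by a double bond: one attachment, no loop.
double_u = {(1, 1), (2, 1)}
double_z = {(1, 1, 2), (1, 2, 2), (2, 1, 1), (2, 2, 1)}

print(independent_loops(ring_u, ring_z),
      independent_loops(chain_u, chain_z),
      independent_loops(double_u, double_z))
```

The double-bond case shows why attachments are counted through c_ip rather than through the raw z sums: two bonds between the same pair of groups count as one attachment and therefore do not register as a loop.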

By using these relations in Equation 13 and setting $l = 0$, the resulting molecule will not have any loops and will thus be acyclic.

In summary, the formulation gives us a lot of control over the features of the target molecules. We can distinguish between isomers, control multiple bonds, and also specify whether or not we desire a cyclic molecule.

Figure 2: Loops and Sub-Molecules. [Diagrams not reproduced. Panels: Cyclic Molecule (One Loop); Cyclic Molecule (Two Independent Loops); Unconnected Molecules (Two Sub-Molecules).]

One small limitation to the connectivity information is that one cannot distinguish between the individual bonds in case of multiple bonds. For example, the two structures in Figure 3 cannot be distinguished under the formulation. However, this is not a serious drawback since such cases are not common. In any case, no existing group contribution method is able to distinguish between such structures in a systematic manner. The formulation also does not consider any constraints based on bond angles. While it is possible to incorporate bond angles, we could not find any property prediction methods that can take advantage of this additional information.

Figure 3: Indistinguishable Molecules. [Diagrams not reproduced. Two structures carrying OH groups; both have the same non-zero terms of $z$: 112, 122, 211, 221.]

Figure 4 shows two possible ways in which the variables can describe a molecule. In general, there are several such multiplicities in specifying a molecule. This can lead to increased computational expense. The degree of multiplicity depends on the number of groups present in the molecule and also their valencies. Since these different solutions describe the same molecule, they have identical objective function values.

Figure 4: Multiplicity. [Diagrams not reproduced. Basis set: 1: CH3, 2: CH2, 3: F. The molecule CH3-CH2-F is described in two ways. First description: non-zero terms of $u$: 11, 22, 33; non-zero terms of $z$: 112, 211, 223, 312. Second description: non-zero terms of $u$: 12, 21, 33; non-zero terms of $z$: 113, 122, 211, 311.]

This fact can be used to eliminate multiplicities by applying a constraint to the objective function value. However, since other structures may have the same objective function value, there is a risk of eliminating potentially good molecules by this method. The method used in this work to eliminate multiplicities involves introducing an integer cut (constraint) every time a new solution has the same objective function value as that obtained in the previous iteration. A more rigorous approach is to introduce constraints that force the sites to be ordered; however, this approach was not used in the case study considered in this paper.
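To see why two different variable assignments can describe one molecule, the Python sketch below compares the two Figure 4 descriptions of CH3-CH2-F by trying every renumbering of the groups. This brute-force comparison is only a diagnostic for small molecules and is not the integer-cut procedure used in this work; the function names are hypothetical, and both solutions are assumed to number their groups 1 to n.

```python
from itertools import permutations


def describe(u, z):
    """Return (group position -> type k, bond order for each pair i < p)."""
    types = dict(u)
    orders = {}
    for (i, _j, p) in z:
        if i < p:
            orders[(i, p)] = orders.get((i, p), 0) + 1
    return types, orders


def same_molecule(sol_a, sol_b):
    """True if some renumbering of the groups maps solution a onto solution b."""
    types_a, bonds_a = describe(*sol_a)
    types_b, bonds_b = describe(*sol_b)
    if sorted(types_a) != sorted(types_b):
        return False
    idx = sorted(types_a)
    for perm in permutations(idx):
        relabel = dict(zip(idx, perm))
        if any(types_a[i] != types_b[relabel[i]] for i in idx):
            continue
        mapped = {tuple(sorted((relabel[i], relabel[p]))): order
                  for (i, p), order in bonds_a.items()}
        if mapped == bonds_b:
            return True
    return False


# The two descriptions of CH3-CH2-F from Figure 4 (basis: 1 CH3, 2 CH2, 3 F).
first = ({(1, 1), (2, 2), (3, 3)},
         {(1, 1, 2), (2, 1, 1), (2, 2, 3), (3, 1, 2)})
second = ({(1, 2), (2, 1), (3, 3)},
          {(1, 1, 3), (1, 2, 2), (2, 1, 1), (3, 1, 1)})
# CH3-CH2-CH3 as a different molecule for contrast.
propane = ({(1, 1), (2, 2), (3, 1)},
           {(1, 1, 2), (2, 1, 1), (2, 2, 3), (3, 1, 2)})

print(same_molecule(first, second), same_molecule(first, propane))
```

The search is factorial in the number of groups, so it is practical only for the small molecules considered here.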

2.5 Property Constraints and the Objective Function

In addition to the structural constraints given in the previous section, property constraints have to be enforced to obtain desired molecules. The structural constraints ensure that the molecules formed are physically realizable while the property constraints eliminate all those molecules that do not possess the desired properties. Thus the structural constraints act as the core of the MINLP formulation to which various property constraints can be added to get molecules of interest. Selection of property constraints can be quite difficult if the desired properties cannot be expressed mathematically. For example, "low volatility" can be expressed in terms of the boiling point, while "toxicity" cannot be easily interpreted in terms of any fundamental physical property. In such cases, correlations have to be developed using a large number of existing compounds. The selection of appropriate property constraints is very important to the success of the formulation since these constraints are, in general, non-linear. Hence they introduce non-convexities in the solution space, making it more difficult to reach global optimality.

One property constraint that has to be incorporated in all cases is thermodynamic feasibility. A molecule is thermodynamically feasible at a particular temperature $T$ if its Gibbs free energy of formation $G_f$ at that temperature is negative, i.e., $G_f(T) < 0$.
