Database Structures for a Class of Mathematical Programming Models Robert Fourer Department of Industrial Engineering and Management Sciences Northwestern University Evanston, IL 60208-3119, U.S.A. (708) 491-3151
4er@iems.nwu.edu
hierarchical database schemes. A comparison of these schemes then reveals their contrasting strengths and weaknesses. The remainder of this introduction briefly describes traditional relationships between mathematical programming and database management systems, and outlines the subsequent sections of the paper.
Abstract
The maintenance of large-scale mathematical programming models has long been recognized to involve a substantial task of data management. This paper aims to elucidate some of the principles of database construction for large-scale mathematical programming, by use of a linear programming example that incorporates aspects of production and network models. Two likely algebraic formulations are described, and are shown to correspond to relational and hierarchical database schemes that have contrasting strengths and weaknesses.
1.1 Background
Large-scale linear programming applications optimize over hundreds or thousands of variables, subject to comparably large numbers of constraints. The maintenance of these models has long been recognized to involve a substantial task of data management. Indeed, in both the design and use of a mathematical programming system (or MPS), a substantial portion of the effort has no direct relation to the variables and constraints, but is instead concerned with the description, manipulation and display of data. This holds true both for traditional matrix generation systems (Beale 1970) and for more modern systems based on algebraic modeling languages (Brooke, Kendrick and Meeraus 1982; Fourer, Gay and Kernighan 1990) or matrix block schematics (Baker 1986; Welch 198%).

The data required by mathematical programming models is not typified by the payroll and order entry data that appear as examples in textbooks on database management (Date 1990; Ullman 1988). An MPS's data sets tend to be smaller but more complex in structure, and to be more numerical (as opposed to symbolic). Designers have responded to these features by creating data representations that are specific to MPS software; most popular is a scheme of tables whose rows and columns are specially labeled (Welch 1987a).

Any differences in the nature of an MPS's data are not so great, however, as to preclude the use of standard database structures for mathematical programming. The similarities between MPS data tables and relational database tables have been remarked upon by Welch (1987a). MPS implementations using the so-called network database model were investigated by Bonczek, Holsapple and Whinston (1978) and by Stohr and Tanniru (1980); more recently, an implementation of an MPS that employs aspects of the hierarchical and relational database models has been described by Baker (1986).

Other researchers (cited in Fourer 1990) have proposed ways in which standard database concepts or software can be integrated with, or interfaced to, various mathematical programming systems. In some cases the database aspect takes the lead, while the ability to optimize is regarded as an added feature. In
1. Introduction
The work recounted here grew out of a project to design an optimization package for steel mill planning. Because the project was supported by the American Iron and Steel Institute, it was to be based on a generic model, one that any particular steel company could specialize to its own operations simply by supplying its own data. Users of the model would be concerned mainly with entering and maintaining their data, and with reporting the optimal production levels.

In light of the data's central role, it was decided to implement the optimization package in the context of a database management system. The user would thus enter the description of a steel mill as a collection of materials and facilities records of a prescribed structure. An implied linear program would then be automatically generated and solved, and the optimal values would be automatically left in appropriate fields. Finally, the results would be displayed as desired, by use of the database system's varied reporting options.

As the implementation of the steel optimization package has progressed, it has become clear that many of the underlying principles are of much broader applicability. Certainly, the generic linear programming model can describe other productive enterprises that transform flows of raw materials into diverse finished products, through a series of processing steps. There are equally important general principles at work, however, in the way that the database structure is related to the formulation of the linear program.

The goal of this paper is to suggest some of the principles of database construction for linear programming (and for large-scale optimization in general) by use of the steel optimization model as an example. Two likely formulations of the model are introduced initially, and are shown to correspond to relational and
0073-1129/91/0000/0306$01.00 © 1991 IEEE
other designs the lead is given to the mathematical programming aspect, with the database being just one feature of an integrated MPS. A third approach employs both a general-purpose database management system and a mathematical programming system, which pass information between them.
Normally, raw materials (such as coal, ore, or limestone) can only be bought, while finished products (tempered coils, pipes) are only sold. Intermediates (pig iron, slabs, unfinished coils) can often be neither bought nor sold, but there are exceptions depending on market conditions. We can set a product's maximum bought or sold to zero to indicate that no buying or selling of the product is possible. To "load" the facility for a specified level of production, the minimum sales amounts may be set equal to the maximum for every finished product.

For each material, the model may also optionally specify a list of conversions to other materials. Each conversion has a given yield and cost per unit (of the material being converted). Conversions often serve as a bookkeeping device. For example, tempered coils might be converted into five different products, each representing coils destined for a different market, and each with its own revenue per unit and sales limits. Some of the coils of secondary grade could also be downgraded to scrap by means of a conversion.

The major transformations of materials at a steel mill cannot be described as simple conversions. We must rather define a collection of facilities at which transformations occur. Each facility houses one or more productive activities, which define the materials that are used and produced. Specifically, the following information is provided for each activity at a facility:

  The amount of each input required by a unit of activity.
  The amount of each output produced by a unit of activity.
  The (variable) cost per unit of activity.
  Upper and lower limits on the number of units of activity.
  The number of units of activity that can be accommodated by one unit of the facility's overall capacity.

1.2 Outline
Section 2 presents two formulations of our linear programming example, which differ in how they describe the indexing of model components. Sections 3 and 4 then describe relational and hierarchical database structures, respectively, that correspond to the organization of data in the two formulations. Each structure has certain advantages and disadvantages, as explained in Section 5, with respect to ease of use, data storage, and data retrieval. A longer version of this paper (Fourer 1990) describes the implementation of our ideas in the context of the steel optimization project, and compares a variety of alternative strategies for integrating database and linear programming software. Section 6 briefly summarizes this material, and indicates some likely directions of future research.

2. Formulations
We adopt the following general formulation that covers all linear programs (LPs) to be considered in this paper:

$$
\begin{aligned}
\text{Maximize} \quad & \sum_{j=1}^{n} c_j z_j \\
\text{Subject to} \quad & l_i^{row} \le \sum_{j=1}^{n} a_{ij} z_j \le u_i^{row}, && i = 1,\ldots,m \\
& l_j^{col} \le z_j \le u_j^{col}, && j = 1,\ldots,n
\end{aligned}
$$
It will be of conceptual and practical use to investigate database structures that are capable of representing any LP of this form, by storing all $l_i^{row}$, $u_i^{row}$, $c_j$, $l_j^{col}$ and $u_j^{col}$, along with all $a_{ij}$ that are nonzero. This form is too general, however, for our prospective users to work with it directly. They need a much more specific formulation, as described below, that speaks of familiar materials and facilities, and of their limits, yields, costs and capacities.

We begin by introducing our specific model informally, using a few steelmaking concepts for examples. Then we present two formal descriptions, which differ mainly in how they describe and use index sets. Sections 3 and 4 will show how these two descriptions naturally give rise to two different database structures.
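As a rough illustration of what "storing" a general LP of this form involves, the following Python sketch keeps the row limits, the column data, and only the nonzero coefficients. The class name `GeneralLP`, its field names, and the tiny two-row instance are assumptions made for this example; they do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class GeneralLP:
    """Data of a general LP; every name here is illustrative only."""
    row_min: dict      # l_i^row, keyed by constraint name
    row_max: dict      # u_i^row, keyed by constraint name
    col_profit: dict   # c_j, keyed by variable name
    col_min: dict      # l_j^col
    col_max: dict      # u_j^col
    coeff: dict = field(default_factory=dict)  # (row, col) -> nonzero a_ij

    def objective(self, z):
        """Value of sum_j c_j z_j for a given assignment z."""
        return sum(self.col_profit[j] * z[j] for j in self.col_profit)

    def row_activity(self, row, z):
        """Value of sum_j a_ij z_j, visiting stored nonzeroes only."""
        return sum(a * z[j] for (i, j), a in self.coeff.items() if i == row)

    def is_feasible(self, z, tol=1e-9):
        """Check both the row limits and the column bounds."""
        rows_ok = all(
            self.row_min[i] - tol <= self.row_activity(i, z) <= self.row_max[i] + tol
            for i in self.row_min
        )
        cols_ok = all(
            self.col_min[j] - tol <= z[j] <= self.col_max[j] + tol
            for j in self.col_min
        )
        return rows_ok and cols_ok


lp = GeneralLP(
    row_min={"r1": 0.0, "r2": 0.0},
    row_max={"r1": 10.0, "r2": 8.0},
    col_profit={"x": 3.0, "y": 2.0},
    col_min={"x": 0.0, "y": 0.0},
    col_max={"x": 6.0, "y": 6.0},
    coeff={("r1", "x"): 1.0, ("r1", "y"): 1.0, ("r2", "x"): 2.0},
)
z = {"x": 4.0, "y": 6.0}
```

The zero coefficient of `y` in `r2` is simply absent from `coeff`, which is the point of storing nonzeroes only.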
Naturally, there are upper and lower limits on the overall capacity of each facility. There are similar limits on a facility's total use of each input and total production of each output. Where the lower and upper limits are not critical, the lower may be set to 0, and the upper to a large, effectively infinite, value. A common example of a facility is a rolling mill at which several finished products are made. The rolling of each product must be modeled as a separate activity, since it produces a separate output. The units of each activity are in tons (of product), while the units of the facility's capacity are in hours. Thus the model specifies each activity's capacity use in tons per hour. In effect, the activities are competing for the limited capacity of the facility; the optimization automatically trades off a product's use of rolling capacity against its yield and cost of production and its potential for revenue from sales. A different example occurs at a basic oxygen furnace, where all activities produce liquid steel. Here activities differ in the amounts of the various inputs that they use to produce a ton of steel; each represents, in effect, a different recipe for steel production. Both the units of each activity and the units of the furnace's capacity are measured in tons of steel produced, so the capacity use figure is just 1. (The use of discrete recipes can be viewed as approximating the solution of a more general blending problem at the furnace.) We can characterize this model as a kind of hybrid between production and network flow. Thus there is a general directional flow of materials through the network, from the "hot end"
2.1 An informal description
We consider the sort of production plant typified by a steel mill; raw materials enter, various transformations to intermediate
goods are performed, and finished products leave. Profit is the total revenue from sales of the finished products, less the costs of acquiring the raw materials and making the transformations. Our problem is to run the plant at the most profitable levels of activity, in one future planning period. Specification of the model begins with a list of materials, and the following data for each: The cost per unit of material bought, and the minimum and maximum quantities that can be bought. The revenue per unit of material sold, and the minimum and maximum quantities that can be sold.
through casting to finishing. Exceptions to this general flow may occur, however, as a result of by-products that can be re-used in production. The most common example is steel scrap, which occurs as an output of every rolling and finishing facility, and which (if not sold) may be recycled as an input to steelmaking furnaces. Natural gas from coke ovens and blast furnaces can also be recycled, in a more complex scheme that produces heat or electricity.
would then lose a certain consistency in the data definition. In our current formulation, every collection of data values is indexed either over a single fundamental set ($\mathcal{M}$ or $\mathcal{F}$) or over a subset of a Cartesian product of fundamental sets (such as $\mathcal{F} \times \mathcal{M}$). This convention, which extends to the variables and constraints as well, is not always followed in linear programming formulations, but we will see that it is valuable to the design of an associated relational database structure.
2.2 First formulation
Our first algebraic description of the above problem is characterized by the use of ordered "tuples" of indices to describe productive possibilities. It is presented in full in Appendix A; we comment here on some of its more important conventions and features.

The model is built upon two fundamental sets of objects: a set $\mathcal{M}$ of materials, and a set $\mathcal{F}$ of facilities. Certain data, such as the purchase limits for each material and the capacity of each facility, are indexed directly over these sets. However, much of the data is indexed over sets of ordered pairs and triples, whose components are taken from $\mathcal{M}$ and $\mathcal{F}$:
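  $\mathcal{M}^{conv} \subseteq \mathcal{M} \times \mathcal{M}$    conversions
  $\mathcal{F}^{in} \subseteq \mathcal{F} \times \mathcal{M}$    facility inputs
  $\mathcal{F}^{out} \subseteq \mathcal{F} \times \mathcal{M}$    facility outputs
  $\mathcal{F}^{act} \subseteq \mathcal{F} \times ?$    facility activities
  $\mathcal{A}^{in} \subseteq \mathcal{F} \times \mathcal{M} \times ?$    activity inputs
  $\mathcal{A}^{out} \subseteq \mathcal{F} \times \mathcal{M} \times ?$    activity outputs

2.3 Second formulation
Our second algebraic description is characterized by a different way of representing the index sets, in which the ordered pairs and triples are replaced by indexed collections of subsets. As seen in Appendix B, where the full formulation is presented, this change has immediate implications for how the model is expressed, even though the model represents the same class of linear programs as before. The six collections of subsets are written as follows in Appendix B:

  $\mathcal{M}_j^{conv} \subseteq \mathcal{M}$    conversions from material j
  $\mathcal{F}_i^{in} \subseteq \mathcal{M}$    inputs at facility i
  $\mathcal{F}_i^{out} \subseteq \mathcal{M}$    outputs from facility i
  $\mathcal{F}_i^{act} \subseteq ?$    activities at facility i
  $\mathcal{A}_{ik}^{in} \subseteq \mathcal{F}_i^{in}$    inputs to activity k at facility i
  $\mathcal{A}_{ik}^{out} \subseteq \mathcal{F}_i^{out}$    outputs from activity k at facility i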
This change does not affect the model's names for the data values; both formulations have $u^{in}_{ij}$, $a^{in}_{ijk}$, and so forth. The difference is in how the models describe the indexing of this data. For example, the first formulation specifies a value $u^{in}_{ij}$ for each $(i,j) \in \mathcal{F}^{in}$, while the second specifies a $u^{in}_{ij}$ for each $i \in \mathcal{F}$ and $j \in \mathcal{F}^{in}_i$. Sections 3 and 4 will show how these different ways of describing the data can correspond to different ways of organizing it.

The ? above refers to the same phantom "set of activities" as in the first formulation. There is no question as to the domains of the latter two collections of subsets, however. The inputs available to activity k at facility i are a subset of the inputs available to all activities at facility i; hence $\mathcal{A}^{in}_{ik} \subseteq \mathcal{F}^{in}_i \subseteq \mathcal{M}$. Each $\mathcal{A}^{in}_{ik}$ is actually a sub-subset. The values $a^{in}_{ijk}$ are correspondingly defined for all $i \in \mathcal{F}$, $k \in \mathcal{F}^{act}_i$, and $j \in \mathcal{A}^{in}_{ik}$.
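The two ways of describing the same index data can be made concrete with a short Python sketch that converts a set of ordered pairs into an indexed collection of subsets and back. The function names and the sample facility and material names are assumptions chosen for illustration.

```python
from collections import defaultdict


def pairs_to_subsets(pairs):
    """Turn a set of (i, j) pairs into a mapping i -> subset of j values."""
    subsets = defaultdict(set)
    for i, j in pairs:
        subsets[i].add(j)
    return dict(subsets)


def subsets_to_pairs(subsets):
    """Flatten a mapping i -> subset back into a set of (i, j) pairs."""
    return {(i, j) for i, members in subsets.items() for j in members}


# First-formulation style: facility inputs as a set of ordered pairs.
F_in = {("furnace", "ore"), ("furnace", "scrap"), ("mill", "slabs")}

# Second-formulation style: one subset of materials per facility.
F_in_by_facility = pairs_to_subsets(F_in)
```

One small asymmetry is worth noting: a facility with no inputs can be recorded in the subset form (as an empty set) but leaves no trace in the set of pairs.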
The construction and use of these sets are readily understood from the descriptions in Appendix A. However, the latter three have special characteristics that require further comment.

Consider the set $\mathcal{F}^{act}$. As defined in Appendix A, $(i,k) \in \mathcal{F}^{act}$ means that k is an activity available at facility i. Thus we can reasonably regard $\mathcal{F}^{act}$ as a subset of $\mathcal{F} \times \mathcal{A}$, where $\mathcal{A}$ is a previously unmentioned set of activities. Indeed, all of the three occurrences of ? above would properly be references to $\mathcal{A}$. We do not refer explicitly to the set $\mathcal{A}$ in our formulation, however, because activities do not play a role in the model apart from their association with facilities. There are no model components indexed over activities alone; and although activities at different facilities may have the same name, they are not the same activity, and the linear program would be the same if the names of any of them were changed. This distinction will eventually be seen to carry over to the database structure.

In the case of $\mathcal{A}^{in}$ (and analogously for $\mathcal{A}^{out}$), the model defines the domain of the triples more restrictively than the summary above:
$$
\mathcal{A}^{in} \subseteq \{(i,j,k) : (i,j) \in \mathcal{F}^{in} \text{ and } (i,k) \in \mathcal{F}^{act}\}.
$$

The set on the right is just $\mathcal{F}^{in} \bowtie \mathcal{F}^{act}$, the natural join of $\mathcal{F}^{in}$ and $\mathcal{F}^{act}$. We restrict $\mathcal{A}^{in}$ to this set because, for material j to be an input to activity k at facility i, the material must be an allowable input at the facility, or $(i,j) \in \mathcal{F}^{in}$, and the activity must exist at the facility, or $(i,k) \in \mathcal{F}^{act}$.

The only data values indexed over $\mathcal{A}^{in}$ are the technological coefficients, $a^{in}_{ijk}$. We could thus dispense with $\mathcal{A}^{in}$ entirely by indexing directly over $\mathcal{F}^{in} \bowtie \mathcal{F}^{act}$, and by letting $a^{in}_{ijk}$ be zero if activity k does not actually use input j. However, we

3. Relational Structures
A subset of a Cartesian product of sets is a relation in a mathematical sense. Thus it is not surprising that data indexed over pairs or triples from Cartesian products, as in our first formulation, has a natural representation in relational database structures. This section develops and comments upon likely relational schemes for our models.

To establish general principles and terminology, we first focus on the simpler structure of the general linear programming model that was introduced at the beginning of Section 2. We observe how the nonzero coefficients of the model can be viewed as being indexed over ordered pairs, and then propose a collection of relational tables appropriate to this data. We finally investigate an analogous relational scheme for the considerably more complicated structure of the multi-facility production model.
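As a small illustration of the relational view, the natural join that bounds the activity-input triples in Section 2.2 can be computed directly on Python sets of tuples. The function name and the sample facilities, materials and activities below are assumptions made for this sketch only.

```python
def natural_join(fac_inputs, fac_activities):
    """Return all (i, j, k) with (i, j) in fac_inputs and (i, k) in fac_activities."""
    return {
        (i, j, k)
        for (i, j) in fac_inputs
        for (i2, k) in fac_activities
        if i == i2  # join on the shared facility component
    }


F_in = {("bof", "pig_iron"), ("bof", "scrap"), ("mill", "slabs")}
F_act = {("bof", "recipe_a"), ("bof", "recipe_b"), ("mill", "roll_coils")}

# Every triple that the domain restriction permits.
allowed = natural_join(F_in, F_act)

# A set of activity inputs that actually carry nonzero coefficients.
A_in = {
    ("bof", "pig_iron", "recipe_a"),
    ("bof", "scrap", "recipe_b"),
    ("mill", "slabs", "roll_coils"),
}
```

Here `A_in` is a proper subset of `allowed`: the join lists every admissible combination, while the model indexes coefficients only over the combinations in use.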
3.1 An ordered-pair view of the general model
We can think of the shape of the general linear program as being determined by two sets:

  $\mathcal{I} = \{1,\ldots,m\}$ is the set of constraints
  $\mathcal{J} = \{1,\ldots,n\}$ is the set of variables

Adopting the customary LP terminology, we will identify constraints with rows of the coefficient matrix, and variables with columns of the matrix. Two "right-hand side" values are associated with each constraint:

  $l_i^{row}$ = the lower limit on constraint i, for each $i \in \mathcal{I}$
  $u_i^{row}$ = the upper limit on constraint i, for each $i \in \mathcal{I}$

Each variable has a "profit" and two bounds, as well as an activity level:

  $c_j$ = the profit (if positive) or cost (if negative) per unit of variable j, for each $j \in \mathcal{J}$
  $z_j$ = the activity level of variable j, for each $j \in \mathcal{J}$
  $l_j^{col}$ = the lower limit on variable j, for each $j \in \mathcal{J}$
  $u_j^{col}$ = the upper limit on variable j, for each $j \in \mathcal{J}$

Finally, the nonzero coefficients are indexed over a subset of the constraint-variable pairs:

  $\mathcal{C} \subseteq \mathcal{I} \times \mathcal{J}$ is the set of coefficient nonzeroes: $(i,j) \in \mathcal{C}$ means variable j is used in constraint i
  $a_{ij}$ = the coefficient of variable j in constraint i, for each $(i,j) \in \mathcal{C}$

This approach to indexing the coefficients makes sense for most large linear programs, including those that derive from our production model, because each variable figures in just a few of the many constraints. With minor modifications, the general formulation of the linear program can accommodate this view of the data:

$$
\begin{aligned}
\text{Maximize} \quad & \sum_{j \in \mathcal{J}} c_j z_j \\
\text{Subject to} \quad & l_i^{row} \le \sum_{(i,j) \in \mathcal{C}} a_{ij} z_j \le u_i^{row}, && \text{for all } i \in \mathcal{I} \\
& l_j^{col} \le z_j \le u_j^{col}, && \text{for all } j \in \mathcal{J}
\end{aligned}
$$

The summation denoted by $\sum_{(i,j) \in \mathcal{C}} a_{ij} z_j$ is taken for each fixed $i \in \mathcal{I}$; it is interpreted as the sum over all j such that $(i,j)$, for the given i, is a member of $\mathcal{C}$.

3.2 A relational database for the general model
A relational database is a collection of files. For our linear program, the simplest file is the one that identifies the constraints and their associated limits. It is a collection of m records, one for each constraint. Every record contains three fields:

  row-name, a unique identifier for the constraint;
  row-min, the constraint's lower limit;
  row-max, the constraint's upper limit.

The row-name field is the database analogue of the subscript i; it is designated the key for this file, which must be different for each constraint record. The values stored in row-min and row-max naturally correspond to $l_i^{row}$ and $u_i^{row}$ in the algebraic formulation.

The content and structure of the constraints file would be represented in conventional notation as the relation

  CONSTRAINTS (row-name, row-min, row-max).

We find it convenient to present the same information graphically, by means of the following concise diagram:

  CONSTRAINTS
    *row-name*
    row-min
    row-max

The title of the file is at top, followed by the field names, with the key in italics. The variables file is similarly depicted as:

  VARIABLES
    *col-name*
    col-profit
    col-min
    col-optimal
    col-max

Here col-name is the variable's unique identifier, and the key for the file. The following four fields store the values called $c_j$, $l_j^{col}$, $z_j$, and $u_j^{col}$ in the mathematical formulation. Strictly speaking, the field col-optimal is superfluous, since it holds the values of variables $z_j$ rather than any data required to describe the linear program. However, for practical purposes it is highly desirable to be able to store the solution values in the same files as the data. For the same reason, we could have included a field in the CONSTRAINTS file for the dual values.

It remains to define a file to hold the nonzero coefficients. Its structure can be diagrammed as follows:

  COEFFICIENTS
    *coeff-row*
    *coeff-col*
    coeff-value

Each record's coeff-value field holds the coefficient of the variable specified by the coeff-col field, in the constraint specified by the coeff-row field. The pair (coeff-row, coeff-col) is the key, since every nonzero coefficient corresponds to a different combination of constraint and variable.

Two essential pieces of information are absent from the diagram of the coefficients file: that each coeff-row entry must match a key (row-name) entry in the CONSTRAINTS file, and that each coeff-col entry must match a key (col-name) entry in the VARIABLES file. In the terminology of relational databases, coeff-row and coeff-col are foreign keys of the COEFFICIENTS relation. Following Date (1990), we use an arrow in our diagram to identify foreign key references:

  COEFFICIENTS
    *coeff-row* → CONSTRAINTS
    *coeff-col* → VARIABLES
    coeff-value

The → represents a many-to-one relationship. Since there may be many nonzero coefficients in one constraint, there may be many COEFFICIENTS records that have the same constraint identifier in their coeff-row field. But each constraint identifier may appear only once in the row-name field of the CONSTRAINTS file.
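The three files and their foreign key references can be tried out with Python's standard `sqlite3` module. This is only a sketch under stated assumptions: the paper does not name a database system, the hyphens in the field names are replaced by underscores to give legal SQL identifiers, and the sample rows and the helper `insert_is_rejected` are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE constraints (
    row_name TEXT PRIMARY KEY,
    row_min  REAL,
    row_max  REAL
);
CREATE TABLE variables (
    col_name    TEXT PRIMARY KEY,
    col_profit  REAL,
    col_min     REAL,
    col_optimal REAL,
    col_max     REAL
);
CREATE TABLE coefficients (
    coeff_row   TEXT NOT NULL REFERENCES constraints (row_name),
    coeff_col   TEXT NOT NULL REFERENCES variables (col_name),
    coeff_value REAL NOT NULL,
    PRIMARY KEY (coeff_row, coeff_col)
);
""")

conn.execute("INSERT INTO constraints VALUES ('r1', 0, 10)")
conn.execute("INSERT INTO variables VALUES ('x', 3, 0, NULL, 6)")
conn.execute("INSERT INTO coefficients VALUES ('r1', 'x', 1.0)")


def insert_is_rejected(row, col, value):
    """Return True if the database refuses the coefficient record."""
    try:
        conn.execute(
            "INSERT INTO coefficients VALUES (?, ?, ?)", (row, col, value)
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

A record naming a constraint that is absent from `constraints` violates the many-to-one reference, and a second record for the pair `('r1', 'x')` violates the two-field key; both are refused.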
3.3 A relational database for the production model
In even the simple example above, we can see certain principles for deriving a relational database structure from a linear programming formulation. Most fundamentally:

  Each indexing set (such as $\mathcal{I}$, $\mathcal{J}$ or $\mathcal{C}$) gives rise to a file in the database.
  Each collection of items indexed over a set (such as $l_i^{row}$, $z_j$ or $a_{ij}$) gives rise to a field in the corresponding file.

The definition of keys also follows certain rules:

  A set of single members (such as $\mathcal{I}$ or $\mathcal{J}$) produces a database file having a single key field, whereas a set of pairs (such as $\mathcal{C}$) produces a file whose key is defined by two fields.
  A domain restriction on a set of pairs (namely $\mathcal{C} \subseteq \mathcal{I} \times \mathcal{J}$) is reflected in two many-to-one relationships.

We now examine how these rules, and their natural extensions, can give a database for the multi-facility production model set forth in Appendix A.

The production model explicitly defines two sets of single members. First is the set $\mathcal{M}$ of materials, over which are indexed $l_j^{buy}$, $u_j^{buy}$, $c_j^{buy}$, $z_j^{buy}$ and $l_j^{sell}$, $u_j^{sell}$, $c_j^{sell}$, $z_j^{sell}$. Second is the set $\mathcal{F}$ of facilities, which indexes just $l_i^{cap}$ and $u_i^{cap}$. They give rise to the following files:

  MATERIALS
    *mat-name*
    buy-min
    buy-opt
    buy-max
    buy-cost
    sell-min
    sell-opt
    sell-max
    sell-cost

  FACILITIES
    [key field]
    cap-min
    cap-max

All of the other sets are subsets of pairs or triples. The set $\mathcal{F}^{in} \subseteq \mathcal{F} \times \mathcal{M}$ can be handled by direct analogy with $\mathcal{C}$ above. The resulting file is:

  FACILITY-INPUTS
    *in-fac* → FACILITIES
    *in-mat* → MATERIALS
    in-min
    in-opt
    in-max

The key pair (in-fac, in-mat) uniquely defines a facility input, and the remaining three fields correspond to what are called $l^{in}_{ij}$, $z^{in}_{ij}$, and $u^{in}_{ij}$ in Appendix A. An entirely parallel structure is yielded by the set $\mathcal{F}^{out} \subseteq \mathcal{F} \times \mathcal{M}$:

  [facility outputs file]
    [facility key field] → FACILITIES
    *out-mat* → MATERIALS
    out-min
    out-opt
    out-max

The model's other two sets of pairs give rise to files that are only a bit different. The set $\mathcal{M}^{conv} \subseteq \mathcal{M} \times \mathcal{M}$ yields two key fields that have many-to-one relationships to the same materials field:

  [conversions file]
    [source material key field] → MATERIALS
    *to-mat* → MATERIALS
    conv-yield
    conv-cost

The set $\mathcal{F}^{act}$ has only one many-to-one correspondence:

  ACTIVITIES
    [facility key field] → FACILITIES
    *act-name*
    act-min
    act-opt
    act-max
    act-cost
    act-cap-rate

The lack of a correspondence for act-name is a direct consequence of the absence of an explicit set $\mathcal{A}$ of single activities, which was discussed in Section 2.

It remains to consider the two sets of triples. Each member of $\mathcal{A}^{in}$ has the form $(i,j,k)$, where i is a facility, j is a material, and k is an activity; and each member indexes one data value, $a^{in}_{ijk}$. Thus, by analogy with the treatment of pairs, it would be reasonable to associate a relational table that has three key fields and a data field, as follows:

  ACTIVITY-INPUTS
    *act-in-fac* → FACILITIES
    *act-in-mat* → MATERIALS
    *act-in*
    act-in-rate

The actual restriction on the activity input data is stronger, however, as specified in Appendix A. For $(i,j,k) \in \mathcal{A}^{in}$ to make sense, it is necessary that both $(i,j) \in \mathcal{F}^{in}$ and $(i,k) \in \mathcal{F}^{act}$. In our relational database notation, this might be specified by saying that

  (act-in-fac, act-in-mat) → FACILITY-INPUTS
  (act-in-fac, act-in) → ACTIVITIES

Restrictions of these kinds can be enforced by relational database systems, but not so readily as the simpler restrictions that we have seen in previous examples. The situation for $\mathcal{A}^{out}$ is exactly analogous; one need only substitute "out" for all occurrences of "in" above.

The completed database scheme, consisting of eight relational tables, can be used to support a broad variety of queries. We defer further discussion, however, to the comparison in Section 5 with the alternative hierarchical structure.
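The stronger pair-valued restrictions on the activity input data can be sketched as composite foreign keys, again using Python's standard `sqlite3` module as a stand-in for whatever database system is at hand. The names `act_fac` for the facility key of the activities table and `try_insert` for the helper are hypothetical, hyphens are replaced by underscores, the materials and facilities tables are omitted for brevity, and the sample rows are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
db.executescript("""
CREATE TABLE facility_inputs (
    in_fac TEXT,
    in_mat TEXT,
    PRIMARY KEY (in_fac, in_mat)
);
CREATE TABLE activities (
    act_fac  TEXT,   -- hypothetical name for the facility key field
    act_name TEXT,
    PRIMARY KEY (act_fac, act_name)
);
CREATE TABLE activity_inputs (
    act_in_fac  TEXT,
    act_in_mat  TEXT,
    act_in      TEXT,
    act_in_rate REAL,
    PRIMARY KEY (act_in_fac, act_in_mat, act_in),
    -- (act-in-fac, act-in-mat) must be an allowable facility input
    FOREIGN KEY (act_in_fac, act_in_mat)
        REFERENCES facility_inputs (in_fac, in_mat),
    -- (act-in-fac, act-in) must be an activity at that facility
    FOREIGN KEY (act_in_fac, act_in)
        REFERENCES activities (act_fac, act_name)
);
INSERT INTO facility_inputs VALUES ('bof', 'scrap');
INSERT INTO activities VALUES ('bof', 'recipe_a'), ('mill', 'roll_coils');
""")


def try_insert(fac, mat, act, rate):
    """Attempt to add an activity input; report whether it was accepted."""
    try:
        db.execute(
            "INSERT INTO activity_inputs VALUES (?, ?, ?, ?)",
            (fac, mat, act, rate),
        )
    except sqlite3.IntegrityError:
        return False
    return True
```

In this sketch an input is accepted only when its facility-material pair and its facility-activity pair both exist, which mirrors the natural-join restriction on the triples.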
4. Hierarchical Structures
We now seek to, in effect, repeat the previous section's development, but working from the formulation of Appendix B rather than Appendix A. The result is a hierarchical data structure that will be seen, in the sequel, to have notable advantages as well as disadvantages.

The structure of this section parallels that of the previous one. We rewrite the general linear programming model so that the nonzero coefficients are specified by use of an indexed collection of subsets, and show how the coefficient values are naturally represented within a hierarchical database structure. Then we extend the same ideas to the case of a hierarchical database for the production model.

To facilitate later analyses, this section employs terms and diagrams that are somewhat nonstandard for hierarchical databases, but that emphasize parallels with the relational scheme, and that are readily interpreted in describing our implementation.

4.1 An indexed-subset view of the general model
Just as in the previous section, we imagine the shape of the general linear program as being determined by two sets:

  $\mathcal{I} = \{1,\ldots,m\}$ is the set of constraints
  $\mathcal{J} = \{1,\ldots,n\}$ is the set of variables

The values indexed over $\mathcal{I}$ or $\mathcal{J}$ alone are also described the same as before:

  $l_i^{row}$ = the lower limit on constraint i, for each $i \in \mathcal{I}$
  $u_i^{row}$ = the upper limit on constraint i, for each $i \in \mathcal{I}$
  $c_j$ = the profit (if positive) or cost (if negative) per unit of variable j, for each $j \in \mathcal{J}$
  $z_j$ = the activity level of variable j, for each $j \in \mathcal{J}$
  $l_j^{col}$ = the lower limit on variable j, for each $j \in \mathcal{J}$
  $u_j^{col}$ = the upper limit on variable j, for each $j \in \mathcal{J}$

The difference comes in the description of the coefficients. Suppose that we define, for each j, a subset of $\mathcal{I}$ to specify which constraints use $z_j$:

  $\mathcal{C}_j \subseteq \mathcal{I}$ is a subset of column nonzeroes: $i \in \mathcal{C}_j$ means variable j is used in constraint i

Then the nonzero coefficient values are described as follows:

  $a_{ij}$ = the coefficient of variable j in constraint i, for each $j \in \mathcal{J}$ and $i \in \mathcal{C}_j$

These are the same coefficients as in the previous section; all that has changed is the way that their indexing is presented. With insignificant modifications, our general formulation of the linear program can accommodate this view as well:

$$
\begin{aligned}
\text{Maximize} \quad & \sum_{j \in \mathcal{J}} c_j z_j \\
\text{Subject to} \quad & l_i^{row} \le \sum_{j \in \mathcal{J} : i \in \mathcal{C}_j} a_{ij} z_j \le u_i^{row}, && \text{for all } i \in \mathcal{I} \\
& l_j^{col} \le z_j \le u_j^{col}, && \text{for all } j \in \mathcal{J}
\end{aligned}
$$

The sum over $\{j \in \mathcal{J} : i \in \mathcal{C}_j\}$ does seem rather awkward, however. Since there is one of these sums for each constraint, it might be preferable to define a subset for each $i \in \mathcal{I}$ (rather than for each $j \in \mathcal{J}$):

  $\mathcal{C}_i \subseteq \mathcal{J}$ is a subset of row nonzeroes: $j \in \mathcal{C}_i$ means variable j is used in constraint i
  $a_{ij}$ = the coefficient of variable j in constraint i, for each $i \in \mathcal{I}$ and $j \in \mathcal{C}_i$

Then the formulation simplifies to the following:

$$
\begin{aligned}
\text{Maximize} \quad & \sum_{j \in \mathcal{J}} c_j z_j \\
\text{Subject to} \quad & l_i^{row} \le \sum_{j \in \mathcal{C}_i} a_{ij} z_j \le u_i^{row}, && \text{for all } i \in \mathcal{I} \\
& l_j^{col} \le z_j \le u_j^{col}, && \text{for all } j \in \mathcal{J}
\end{aligned}
$$

Of course, there may be other concerns than the convenience of the algebraic formulation. In our implementation (Fourer 1990), for example, the subsets $\mathcal{C}_j$ are preferred for generating the "column-wise" data structures required by an optimization algorithm.

4.2 A hierarchical database for the general model
With respect to fundamental sets of single elements, a hierarchically structured database looks the same as a relational one. Thus we begin with the following two files from Section 3:

  CONSTRAINTS
    *row-name*
    row-min
    row-max

  VARIABLES
    *col-name*
    col-profit
    col-min
    col-optimal
    col-max

Consider now one of the sets $\mathcal{C}_j$, for a particular j. $\mathcal{C}_j$ is itself a set of single elements, which are restricted to lie in the set $\mathcal{I}$; and for each $i \in \mathcal{C}_j$, there is a coefficient value. Thus we can imagine that $\mathcal{C}_j$ gives rise to a relational file as follows:

  COEFFICIENTS
    *coeff-row* → CONSTRAINTS
    coeff-value

There is one of these files for each variable, just as there are col-name, col-profit, col-min, col-optimal, and col-max fields for each variable. Thus, in the hierarchical database scheme, a COEFFICIENTS file is incorporated as a subfile of the VARIABLES file. We depict this situation as follows:

  VARIABLES
    *col-name*
    col-profit
    col-min
    col-optimal
    col-max
    COEFFICIENTS
      *coeff-row* → CONSTRAINTS
      coeff-value
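The subfile arrangement can be imitated with nested Python dictionaries, in which each variable's record carries its own small table of coefficients. The record layout, the two sample variables, and the helper names `column_coefficients` and `row_coefficients` are assumptions made for this sketch, not part of the paper's implementation.

```python
# Each VARIABLES record carries its COEFFICIENTS subfile as a nested mapping
# from coeff-row (a CONSTRAINTS key) to coeff-value.
variables = {
    "x": {
        "col_profit": 3.0, "col_min": 0.0, "col_optimal": None, "col_max": 6.0,
        "coefficients": {"r1": 1.0, "r2": 2.0},
    },
    "y": {
        "col_profit": 2.0, "col_min": 0.0, "col_optimal": None, "col_max": 6.0,
        "coefficients": {"r1": 1.0},
    },
}


def column_coefficients(col_name):
    """Nonzeroes of one variable: a lookup in a single record."""
    return dict(variables[col_name]["coefficients"])


def row_coefficients(row_name):
    """Nonzeroes of one constraint: requires scanning every variable record."""
    return {
        col: record["coefficients"][row_name]
        for col, record in variables.items()
        if row_name in record["coefficients"]
    }
```

Column retrieval touches one record, whereas row retrieval visits all of them, so the nesting direction fixes which access is cheap.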
The user of this structure can regard COEFFICIENTS as just another field of the VARIABLES file. Rather than containing a single value like other fields, however, this field is itself an entire relational subfile of values. This subfile has its own collection of records, each containing a coeff-row field and a coeff-value field.

When a database is structured in this way, all of the information about a variable, including a list of its nonzero coefficients, can be retrieved from a single record of the VARIABLES file. On the other hand, there is no obvious way to retrieve all nonzero coefficients in a single constraint. For such a retrieval to be equally easy, a different subfile, analogous to the set $\mathcal{C}_i$, would have to be added instead to the CONSTRAINTS file:

  CONSTRAINTS
    *row-name*
    row-min
    row-max
    [coefficients subfile]
      [key field] → VARIABLES
      coeff-value

Just as the sets $\mathcal{C}_j$ and $\mathcal{C}_i$ tend to be convenient for different purposes, the subfiles of the CONSTRAINTS and VARIABLES files are useful for viewing the database in different ways. This characteristic represents both an advantage and a disadvantage for the hierarchical structure, but we defer further discussion to the comparison with the relational structure in Section 5.

4.3 A hierarchical database for the production model
A file of materials for the hierarchical database is readily constructed along the lines of the VARIABLES file above. The members $j \in \mathcal{M}$ index not only the values $l_j^{buy}$, $u_j^{buy}$, $c_j^{buy}$, $z_j^{buy}$ and $l_j^{sell}$, $u_j^{sell}$, $c_j^{sell}$, $z_j^{sell}$, but also the subsets $\mathcal{M}_j^{conv}$. Thus the MATERIALS file should be expected to have a key field and eight data fields (as in Section 3) plus a subfile "field". Moreover, since for each $j' \in \mathcal{M}_j^{conv}$ there are further values $a^{conv}_{jj'}$, $c^{conv}_{jj'}$ and $z^{conv}_{jj'}$, the subfile should be expected to have a key field and three data fields. We are thus led to a scheme that can be diagrammed as follows:

  MATERIALS
    *mat-name*
    buy-min
    buy-opt
    buy-max
    buy-cost
    sell-min
    sell-opt
    sell-max
    sell-cost
    [conversions subfile]
      *to-mat* → MATERIALS
      conv-yield
      conv-cost
      conv-opt

The record for a particular material contains, in addition to the purchase and sales data, a subrecord for each thing that the particular material can be converted to. The notation to-mat → MATERIALS indicates that the target of each conversion must itself be a material in the database; this is the analogue of the requirement $\mathcal{M}_j^{conv} \subseteq \mathcal{M}$ in the algebraic formulation of Appendix B.

We can apply similar reasoning to start building a hierarchical scheme for facilities. The members $i \in \mathcal{F}$ index the values $l_i^{cap}$, $u_i^{cap}$ and the subsets $\mathcal{F}_i^{in}$, $\mathcal{F}_i^{out}$, $\mathcal{F}_i^{act}$; members of the subsets also index certain values, as follows:

  $j \in \mathcal{F}_i^{in}$:  $l^{in}_{ij}$, $z^{in}_{ij}$, $u^{in}_{ij}$
  $j \in \mathcal{F}_i^{out}$:  $l^{out}_{ij}$, $z^{out}_{ij}$, $u^{out}_{ij}$
  $k \in \mathcal{F}_i^{act}$:  $l^{act}_{ik}$, $z^{act}_{ik}$, $u^{act}_{ik}$, $c^{act}_{ik}$, $r^{act}_{ik}$

It follows that there should be a key field, two data fields, and three subfiles. Each subfile should have its own key field, and three to five data fields. The result looks like this:

  FACILITIES
    [key field]
    cap-min
    cap-max
    [inputs subfile]
      [key field]
      in-min
      in-opt
      in-max
    [outputs subfile]
      [key field]
      out-min
      out-opt
      out-max
    ACTIVITIES
      *act-name*
      act-min
      act-opt
      act-max
      act-cost
      act-cap-rate

Here each facilities record contains three collections of subrecords, rather than just one as in the case of the materials. Still missing from this structure, however, is the detailed activities data. For each k in the subset $\mathcal{F}_i^{act}$, there are not only the values $l^{act}_{ik}$, $z^{act}_{ik}$, and so forth, but also the sub-subsets $\mathcal{A}^{in}_{ik}$ and $\mathcal{A}^{out}_{ik}$, which are also used to index data.

The corresponding hierarchical database concept is that, within each activities subfile, there are sub-subfiles of input and output records. A diagram of the activities subfile alone should thus appear as follows:
[Diagram: ACTIVITIES subfile with fields act-name, act-min, act-opt, act-max, act-cost, act-cap-rate and two sub-subfiles: activity inputs (including act-in-rate) and ACT-OUTPUTS (act-out-mat → MATERIALS, act-out-rate)]

The full facilities file is represented by substituting this activities subfile diagram for the one that appears within the facilities file diagram above. The complete hierarchical database structure for the production problem thus consists of just two files, but with extensive internal structures of subfiles and sub-subfiles.

5. Comparisons

We divide our comparisons of the relational and hierarchical approaches into three categories. There are benefits either way in ease of use, but there is a clear tradeoff between a hierarchical scheme's compact data storage and a relational scheme's flexible data retrieval.

5.1 Ease of use

A database scheme is valuable only insofar as it can be implemented, within a database management system, to assist in the storage and extraction of data. Hence we cannot say conclusively whether a relational or hierarchical scheme is more convenient for linear programming data; too much depends on the features of the surrounding database management system. At most, we can point out some advantages that each approach will tend to confer.

The hierarchical database file diagrams in Section 4 appear, at first glance, to be more complicated than the relational diagrams in Section 3. Yet the hierarchical diagrams actually offer an especially simple and straightforward representation of the data. There are only two files, one for materials and one for facilities. In the materials file, each record contains all the purchase and sales data on one material, plus a list of any applicable conversions together with their costs and yields. In the facilities file, each record likewise specifies all information pertinent to a particular facility; and each activity subrecord gives all information pertinent to one activity at the facility. There is a certain intuitive arrangement to the hierarchical structure that facilitates implementation of the kind of system that we wish to develop.

The relational structure, in contrast, has a larger number of simpler files. It arguably requires the user to have a greater understanding of database principles in order to extract desired information. There is a well-known advantage, however, in the existence of declarative languages (such as SQL) for querying relationally structured data. The same queries on hierarchically structured data tend to require programming (and debugging) in lower-level executable languages. An experienced database administrator may well find that a relational structure allows quicker and more reliable development of routines required to support LP modeling.

5.2 Data storage
All of our proposed database schemes satisfy the fundamental requirement that no piece of information be stored in more than one place. The hierarchical schemes do allow the data to be maintained more compactly, however. The difference can be seen clearly in the data structures for the general LP model, which appear toward the beginning of Sections 3 and 4. There would be an exact correspondence between fields in the two structures, except that the coeff-col key field in the relational structure does not appear in the hierarchical one. This field is needed in the relational approach, because each coefficient is regarded as an independent entity, which can be identified only by giving the constraint (coeff-row) and variable (coeff-col) to which it belongs. In the hierarchical approach, however, a coefficient is considered part of the information about a variable; in the coefficient subrecords for a variable, it suffices to specify the constraint in which each coefficient appears.

This difference of one field makes for a considerable reduction in what must be stored and retrieved. The hierarchical structure saves one coeff-col entry for each of the many coefficients. In addition, any space that would be required by an index file for the coeff-col field (to facilitate searching or sorting on the field) is also saved. Finally, a lengthy index for the coeff-row field can also be avoided. Under the hierarchical structure, each coefficient subfile can be expected to have just a few records, because linear programs typically involve each variable in only a few constraints. Hence the subfile's coeff-row key field requires only a minimal form of indexing to permit efficient searches of subrecords. Many instances of comparable savings appear in the database for the multi-facility production problem.
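As a rough illustration of the storage difference for the general LP model, the following Python sketch holds the same small set of coefficients in both forms and counts the key entries that each form must store. The data values and the names `relational_coefficients`, `hierarchical_variables`, `to_hierarchical`, `relational_key_entries` and `hierarchical_key_entries` are assumptions made for this example only; they are not part of the schemes in Sections 3 and 4 or of any particular database management system.

```python
from collections import defaultdict

# Hypothetical LP with 3 variables and 2 constraints; all values are illustrative.

# Relational form: one COEFFICIENTS record per nonzero coefficient,
# identified by the pair (coeff-row, coeff-col).
relational_coefficients = [
    {"coeff-row": "r1", "coeff-col": "x1", "coeff-value": 2.0},
    {"coeff-row": "r1", "coeff-col": "x2", "coeff-value": 1.0},
    {"coeff-row": "r2", "coeff-col": "x2", "coeff-value": 4.0},
    {"coeff-row": "r2", "coeff-col": "x3", "coeff-value": 3.0},
]

# Hierarchical form: each variable's record carries its own coefficient subfile,
# so the column is implied by the enclosing record and coeff-col is not stored.
hierarchical_variables = {
    "x1": [{"coeff-row": "r1", "coeff-value": 2.0}],
    "x2": [{"coeff-row": "r1", "coeff-value": 1.0},
           {"coeff-row": "r2", "coeff-value": 4.0}],
    "x3": [{"coeff-row": "r2", "coeff-value": 3.0}],
}


def to_hierarchical(records):
    """Group relational coefficient records by column, dropping coeff-col."""
    subfiles = defaultdict(list)
    for rec in records:
        subfiles[rec["coeff-col"]].append(
            {"coeff-row": rec["coeff-row"], "coeff-value": rec["coeff-value"]}
        )
    return dict(subfiles)


def relational_key_entries(records):
    """Each relational record stores two key entries: coeff-row and coeff-col."""
    return 2 * len(records)


def hierarchical_key_entries(variables):
    """Each subrecord stores one key entry: coeff-row."""
    return sum(len(subfile) for subfile in variables.values())


if __name__ == "__main__":
    print(relational_key_entries(relational_coefficients))
    print(hierarchical_key_entries(hierarchical_variables))
```

Both forms hold the same four coefficients, and `to_hierarchical` converts one into the other without loss. The hierarchical form stores one key entry per coefficient where the relational form stores two, which is the one-field saving per coefficient described above.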
In the relational structure for the production problem, there are four files with two key fields and two files with three key fields; corresponding to each of them, there is just one key field needed in the hierarchical structure. Thus there is a savings of the contents of 8 key fields ($4 \times 1 + 2 \times 2 = 8$) and of as many as 14 indexes (one for each of the $4 \times 2 + 2 \times 3 = 14$ key fields in those six files). Indeed, the only sizable indexes likely to be used by the hierarchical scheme are for the mat-name key field of the MATERIALS file, and the fac-name key field of the FACILITIES file.

5.3 Data retrieval

We have already hinted, in the previous section, that the hierarchical structure pays for its compactness by a loss of flexibility. For the general LP formulation, the difference is essentially that the hierarchical scheme can only easily access coefficients "by column", whereas the relational scheme can also conveniently scan them "by row".

As a practical illustration, consider the problem of displaying all nonzero coefficients in a given constraint, together with the values of the corresponding variables. Such a request is readily filled with the help of the join operation provided by any relational database management system. We would not expect to easily perform any comparable query on the hierarchical structure, however, because it does not conveniently collect a constraint's coefficients in a single file; they are scattered throughout the subfiles of different records in the VARIABLES file. This difficulty could be remedied by putting a subfile of coefficients in the CONSTRAINTS file as well, as shown in Section 4; but then the same data would appear in two different places. So long as we only want to access the nonzero coefficients of given variables, however, the hierarchical structure should be quite efficient. Indeed, all of the coefficient information for a variable can be retrieved directly as part of the record for the variable.

In the case of the multi-facility production model, the situation is predictably similar. Some queries are as easily performed on the hierarchical structure as on the relational one, because they involve just scanning a record and its subrecords (and possibly their sub-subrecords). Examples include the following:

    What materials can be converted from a given material, and at what cost and yield?
    What are all the materials used as input by a given facility, and how much of each is used in the optimal solution?
    What are the costs of all the activities at a given facility?

Where the desired information is scattered among subrecords, however, the relational model can be expected to be superior, as in these seemingly similar examples:

    What materials can be converted to a given material, and at what cost and yield?
    What are all the facilities that use a certain material as input, and how much do they use in the optimal solution?
    What are the costs of all the activities that make use of a given material?

These kinds of queries are useful not only for retrieving information of interest, but for diagnostic purposes. For instance, if the last query above shows that no activity makes use of a given material, then there is reason to expect an error or omission in the data.

6. Further Work

Our investigation of database structures was originally motivated by the intention of implementing them in an easy-to-use system for production planning. A longer version of this paper (Fourer 1990) describes the implementation that has evolved. We have developed it in the 4th Dimension database management environment, running on Apple Macintosh computers; for various practical reasons, we have favored the hierarchical database model, although 4th Dimension also supports aspects of the relational model.

Even though the ideas of this paper arose in the development of one application, they can suggest a considerable range of possibilities for the integration of database management and linear programming. In the longer version cited above, we survey a variety of possibilities for the addition of linear programming features to database management systems (Choobineh 1990; Lenard 1990; Pasquier et al. 1985; Savage and Baker 1990), and for the incorporation of database features into modeling systems for linear programming (Baker 1986; Bonczek, Holsapple and Whinston 1978; Burger 1982; Stohr and Tanniru 1980). We also describe options for intermediate levels of integration between database and LP software (Geoffrion 1990; Kendrick, Krishnan and Carl-Mitchell 1984), some of which offer attractive topics for further research.

The subject of this paper can also be viewed in broader terms. From the analysis in Sections 3 and 4, it is evident that there exist general principles that relate the customary algebraic description of linear programming data to standard structures for databases. The development of these principles, in the context of new modeling software for mathematical programming, is another likely topic for future work.

Finally, we remark that the choice of a database model may also be a worthwhile subject for further study. The hierarchical and relational models have been emphasized in this paper partly because they are well established (and are in particular both supported by the database software that our implementation employs) and partly because they have an appealingly natural correspondence to the two formulations of our linear programming example. Yet there undoubtedly exist more intricate data indexing requirements that can arise in mathematical programming, as seen for example in the analyses by Bisschop and Kuip (1989). It is possible that alternative database models will be found to be preferable, particularly for certain difficult classes of LPs not considered in this paper.

Acknowledgements

This work has been supported in part by grants from the American Iron and Steel Institute, and by grant DDM-8908818 from the National Science Foundation. Valuable comments on an earlier version of this paper were provided by A.M. Geoffrion and by several referees.

Appendix A. Formulation Using Ordered Pairs and Triples

This appendix describes the formulation on which the database scheme of Section 3 is based. Owing to limited space, only the data and variables are shown; the objective and constraints may be found in the longer version of this paper (Fourer 1990).
Materials data

$\mathcal{M}$ is the set of materials
$l_j^{buy}$ = lower limit on purchases of material $j$, for each $j \in \mathcal{M}$
$u_j^{buy}$ = upper limit on purchases of material $j$, for each $j \in \mathcal{M}$
$c_j^{buy}$ = cost per unit of material $j$ purchased, for each $j \in \mathcal{M}$
$l_j^{sell}$ = lower limit on sales of material $j$, for each $j \in \mathcal{M}$
$u_j^{sell}$ = upper limit on sales of material $j$, for each $j \in \mathcal{M}$
$r_j^{sell}$ = revenue per unit of material $j$ sold, for each $j \in \mathcal{M}$
$\mathcal{M}^{conv} \subseteq \mathcal{M} \times \mathcal{M}$ is the set of conversions: $(j, j') \in \mathcal{M}^{conv}$ means material $j$ can be converted to material $j'$
$a_{jj'}^{conv}$ = number of units of material $j'$ that result from converting one unit of material $j$, for each $(j, j') \in \mathcal{M}^{conv}$
$c_{jj'}^{conv}$ = cost per unit of material $j$ of the conversion from $j$ to $j'$, for each $(j, j') \in \mathcal{M}^{conv}$

Facilities data

$\mathcal{F}$ is the set of facilities
$l_i^{cap}$ = minimum amount of the capacity of facility $i$ that must be used, for each $i \in \mathcal{F}$
$u_i^{cap}$ = capacity of facility $i$, for each $i \in \mathcal{F}$
$\mathcal{F}^{in} \subseteq \mathcal{F} \times \mathcal{M}$ is the set of facility inputs: $(i, j) \in \mathcal{F}^{in}$ means material $j$ is used as an input at facility $i$
$l_{ij}^{in}$ = minimum amount of material $j$ that must be used as input at facility $i$, for each $(i, j) \in \mathcal{F}^{in}$
$u_{ij}^{in}$ = maximum amount of material $j$ that may be used as input at facility $i$, for each $(i, j) \in \mathcal{F}^{in}$
$\mathcal{F}^{out} \subseteq \mathcal{F} \times \mathcal{M}$ is the set of facility outputs: $(i, j) \in \mathcal{F}^{out}$ means material $j$ is produced as an output at facility $i$
$l_{ij}^{out}$ = minimum amount of material $j$ that must be produced as output at facility $i$, for each $(i, j) \in \mathcal{F}^{out}$
$u_{ij}^{out}$ = maximum amount of material $j$ that may be produced as output at facility $i$, for each $(i, j) \in \mathcal{F}^{out}$

Activities data

$\mathcal{F}^{act} \subseteq \{(i, k) : i \in \mathcal{F}\}$ is the set of activities: $(i, k) \in \mathcal{F}^{act}$ means $k$ is an activity available at facility $i$
$l_{ik}^{act}$ = minimum number of units of activity $k$ that may be run at facility $i$, for each $(i, k) \in \mathcal{F}^{act}$
$u_{ik}^{act}$ = maximum number of units of activity $k$ that may be run at facility $i$, for each $(i, k) \in \mathcal{F}^{act}$
$c_{ik}^{act}$ = cost per unit of running activity $k$ at facility $i$, for each $(i, k) \in \mathcal{F}^{act}$
$r_{ik}^{act}$ = number of units of activity $k$ that can be accommodated in one unit of capacity of facility $i$, for each $(i, k) \in \mathcal{F}^{act}$
$\mathcal{A}^{in} \subseteq \{(i, j, k) : (i, j) \in \mathcal{F}^{in} \text{ and } (i, k) \in \mathcal{F}^{act}\}$ is the set of activity inputs: $(i, j, k) \in \mathcal{A}^{in}$ means input material $j$ is used by activity $k$ at facility $i$
$a_{ijk}^{in}$ = units of input material $j$ required by one unit of activity $k$ at facility $i$, for each $(i, j, k) \in \mathcal{A}^{in}$
$\mathcal{A}^{out} \subseteq \{(i, j, k) : (i, j) \in \mathcal{F}^{out} \text{ and } (i, k) \in \mathcal{F}^{act}\}$ is the set of activity outputs: $(i, j, k) \in \mathcal{A}^{out}$ means output material $j$ is produced by activity $k$ at facility $i$
$a_{ijk}^{out}$ = units of output material $j$ produced by one unit of activity $k$ at facility $i$, for each $(i, j, k) \in \mathcal{A}^{out}$

Variables

$x_j^{buy}$ = units of material $j$ bought, for each $j \in \mathcal{M}$
$x_j^{sell}$ = units of material $j$ sold, for each $j \in \mathcal{M}$
$x_{jj'}^{conv}$ = units of material $j$ converted to material $j'$, for each $(j, j') \in \mathcal{M}^{conv}$
$x_{ij}^{in}$ = units of material $j$ used as input by facility $i$, for each $(i, j) \in \mathcal{F}^{in}$
$x_{ij}^{out}$ = units of material $j$ produced as output by facility $i$, for each $(i, j) \in \mathcal{F}^{out}$
$x_{ik}^{act}$ = units of activity $k$ operated at facility $i$, for each $(i, k) \in \mathcal{F}^{act}$

Appendix B. Formulation Using Indexed Subsets

This appendix describes the formulation on which the database scheme of Section 4 is based.

Materials data

$\mathcal{M}$ is the set of materials
$l_j^{buy}$ = lower limit on purchases of material $j$, for each $j \in \mathcal{M}$
$u_j^{buy}$ = upper limit on purchases of material $j$, for each $j \in \mathcal{M}$
$c_j^{buy}$ = cost per unit of material $j$ purchased, for each $j \in \mathcal{M}$
$l_j^{sell}$ = lower limit on sales of material $j$, for each $j \in \mathcal{M}$
$u_j^{sell}$ = upper limit on sales of material $j$, for each $j \in \mathcal{M}$
$r_j^{sell}$ = revenue per unit of material $j$ sold, for each $j \in \mathcal{M}$
$\mathcal{M}_j^{conv} \subseteq \mathcal{M}$ is a subset of conversions, for each $j \in \mathcal{M}$: $j' \in \mathcal{M}_j^{conv}$ means material $j$ can be converted to material $j'$
$a_{jj'}^{conv}$ = number of units of material $j'$ that result from converting one unit of material $j$, for each $j \in \mathcal{M}$ and $j' \in \mathcal{M}_j^{conv}$
$c_{jj'}^{conv}$ = cost per unit of material $j$ of the conversion from $j$ to $j'$, for each $j \in \mathcal{M}$ and $j' \in \mathcal{M}_j^{conv}$

Facilities data

$\mathcal{F}$ is the set of facilities
$l_i^{cap}$ = minimum amount of the capacity of facility $i$ that must be used, for each $i \in \mathcal{F}$
$u_i^{cap}$ = capacity of facility $i$, for each $i \in \mathcal{F}$
$\mathcal{F}_i^{in} \subseteq \mathcal{M}$ is a subset of facility inputs, for each $i \in \mathcal{F}$: $j \in \mathcal{F}_i^{in}$ means material $j$ is used as an input at facility $i$
$l_{ij}^{in}$ = minimum amount of material $j$ that must be used as input at facility $i$, for each $i \in \mathcal{F}$ and $j \in \mathcal{F}_i^{in}$
$u_{ij}^{in}$ = maximum amount of material $j$ that may be used as input at facility $i$, for each $i \in \mathcal{F}$ and $j \in \mathcal{F}_i^{in}$
$\mathcal{F}_i^{out} \subseteq \mathcal{M}$ is a subset of facility outputs, for each $i \in \mathcal{F}$: $j \in \mathcal{F}_i^{out}$ means material $j$ is produced as an output at facility $i$
$l_{ij}^{out}$ = minimum amount of material $j$ that must be produced as output at facility $i$, for each $i \in \mathcal{F}$ and $j \in \mathcal{F}_i^{out}$
$u_{ij}^{out}$ = maximum amount of material $j$ that may be produced as output at facility $i$, for each $i \in \mathcal{F}$ and $j \in \mathcal{F}_i^{out}$
Activities data

$\mathcal{F}_i^{act}$ is a subset of activities, for each $i \in \mathcal{F}$: $k \in \mathcal{F}_i^{act}$ means $k$ is an activity available at facility $i$
$l_{ik}^{act}$ = minimum number of units of activity $k$ that may be run at facility $i$, for each $i \in \mathcal{F}$ and $k \in \mathcal{F}_i^{act}$
$u_{ik}^{act}$ = maximum number of units of activity $k$ that may be run at facility $i$, for each $i \in \mathcal{F}$ and $k \in \mathcal{F}_i^{act}$
$c_{ik}^{act}$ = cost per unit of running activity $k$ at facility $i$, for each $i \in \mathcal{F}$ and $k \in \mathcal{F}_i^{act}$
$r_{ik}^{act}$ = number of units of activity $k$ that can be accommodated in one unit of capacity of facility $i$, for each $i \in \mathcal{F}$ and $k \in \mathcal{F}_i^{act}$
$\mathcal{A}_{ik}^{in} \subseteq \mathcal{F}_i^{in}$ is a set of activity inputs, for each $i \in \mathcal{F}$ and $k \in \mathcal{F}_i^{act}$: $j \in \mathcal{A}_{ik}^{in}$ means input material $j$ is used by activity $k$ at facility $i$
$a_{ijk}^{in}$ = units of input material $j$ required by one unit of activity $k$ at facility $i$, for each $i \in \mathcal{F}$, $k \in \mathcal{F}_i^{act}$, and $j \in \mathcal{A}_{ik}^{in}$
$\mathcal{A}_{ik}^{out} \subseteq \mathcal{F}_i^{out}$ is a set of activity outputs, for each $i \in \mathcal{F}$ and $k \in \mathcal{F}_i^{act}$: $j \in \mathcal{A}_{ik}^{out}$ means output material $j$ is produced by activity $k$ at facility $i$
$a_{ijk}^{out}$ = units of output material $j$ produced by one unit of activity $k$ at facility $i$, for each $i \in \mathcal{F}$, $k \in \mathcal{F}_i^{act}$, and $j \in \mathcal{A}_{ik}^{out}$

Variables

$x_j^{buy}$ = units of material $j$ bought, for each $j \in \mathcal{M}$
$x_j^{sell}$ = units of material $j$ sold, for each $j \in \mathcal{M}$
$x_{jj'}^{conv}$ = units of material $j$ converted to material $j'$, for each $j \in \mathcal{M}$ and $j' \in \mathcal{M}_j^{conv}$
$x_{ij}^{in}$ = units of material $j$ used as input by facility $i$, for each $i \in \mathcal{F}$ and $j \in \mathcal{F}_i^{in}$
$x_{ij}^{out}$ = units of material $j$ produced as output by facility $i$, for each $i \in \mathcal{F}$ and $j \in \mathcal{F}_i^{out}$
$x_{ik}^{act}$ = units of activity $k$ operated at facility $i$, for each $i \in \mathcal{F}$ and $k \in \mathcal{F}_i^{act}$

References

T.E. BAKER, 1986. A Hierarchical/Relational Approach to Modeling, Chesapeake Decision Sciences, New Providence, NJ.

E.M.L. BEALE, 1970. Matrix Generators and Output Analyzers, in H.W. Kuhn (ed.), Proceedings of the Princeton Symposium on Mathematical Programming, Princeton University Press, Princeton, NJ, pp. 25-36.

J.J. BISSCHOP AND C.A.C. KUIP, 1989. Compound Sets in Mathematical Programming Modeling Languages, Technical Report, University of Twente, Enschede, The Netherlands.

R. BONCZEK, C. HOLSAPPLE AND A. WHINSTON, 1978. Mathematical Programming Within the Context of a Generalized Data Base Management System, R.A.I.R.O. Recherche Opérationnelle/Operations Research 12, 117-139.

A. BROOKE, D. KENDRICK AND A. MEERAUS, 1988. GAMS: A User's Guide, Scientific Press, Redwood City, CA.

W.F. BURGER, 1982. MLD: A Language and Data Base for Modeling, Research Report RC 9639, IBM Research Division, Yorktown Heights, NY.

J. CHOOBINEH, 1990. A Data Sublanguage for Representation and Formulation of Linear Mathematical Models, MD21.3, program of the 29th TIMS/ORSA Joint National Meeting, Las Vegas.

C.J. DATE, 1990. An Introduction to Database Systems, Addison-Wesley Publishing Company, Reading, MA.

R. FOURER, 1990. Database Structures for a Class of Mathematical Programming Models, Technical Report 90-06, Department of Industrial Engineering and Management Sciences, Northwestern University.

R. FOURER, D.M. GAY AND B.W. KERNIGHAN, 1990. A Modeling Language for Mathematical Programming, Management Science 36, 519-554.

A.M. GEOFFRION, 1990. FW/SM: A Prototype Structured Modeling Environment, Working Paper No. 377, Western Management Science Institute, University of California, Los Angeles.

D. KENDRICK, R. KRISHNAN AND S. CARL-MITCHELL, 1984. Interfaces Between Database and Modeling Systems, Paper no. 8412, Center for Economic Research, Department of Economics, University of Texas, Austin, TX.

M.L. LENARD, 1990. A Data-Model-Solver Interface Written in SQL, MD21.2, program of the 29th TIMS/ORSA Joint National Meeting, Las Vegas.

J. PASQUIER, P. HÄTTENSCHWILER, T. HÜRLIMANN AND B. SUDAN, 1985. A Convenient Technique for Constructing Your Own MPSX Generator Using dBASE II, Institute for Automation and Operations Research, University of Fribourg, Switzerland.

S.L. SAVAGE AND M.D. BAKER, 1990. Mathematical Modeling Databases, MD21.1, program of the 29th TIMS/ORSA Joint National Meeting, Las Vegas.

E.A. STOHR AND M.R. TANNIRU, 1980. A Database for Operations Research Models, International Journal of Policy Analysis and Information Systems 4, 105-121.

J.D. ULLMAN, 1988. Principles of Database and Knowledge-Base Systems, Computer Science Press, Rockville, MD.

J.S. WELCH, JR., 1987a. The Data Management Needs of Mathematical Programming Applications, IMA Journal of Mathematics in Management 1, 237-250.

J.S. WELCH, JR., 1987b. PAM - A Practitioner's Approach to Modeling, Management Science 33, 610-625.