The Development of Parallel Optimisation Routines for the NAG Parallel Library R. W. Ford, G. D. Riley and T. L. Freeman, Centre for Novel Computing, Department of Computer Science, The University of Manchester, Manchester M13 9PL, U.K.
1 Introduction

In this paper we consider the design, development and evaluation of parallel optimisation routines for the NAG Parallel Library. We focus on the parallel implementation of two optimisation routines from the NAG sequential library [11], E04JAF and E04UCF. E04JAF implements a quasi-Newton algorithm for the minimisation of a smooth function subject to fixed upper and lower bounds on the variables (see [4], [6] for information about quasi-Newton algorithms, and [7] for details of E04JAF). E04UCF implements a sequential quadratic programming (SQP) algorithm to minimise a smooth function subject to constraints on the variables, which may include simple bounds, linear constraints and smooth non-linear constraints (see [4], [6] for information about SQP algorithms, and [8] for details of E04UCF). The documentation of E04UCF suggests that the user supplies any known partial derivatives of the objective and constraint functions, to improve the time to solution and the robustness of the algorithm. All partial derivatives in E04JAF, and any unsupplied partial derivatives in E04UCF, are approximated by finite differences. The parallel versions of both routines calculate the function values required for the finite difference approximation of the unspecified partial derivatives concurrently. Parallelisation and interface issues are discussed in the context of the parallel library, and performance results are presented for four test cases running on a Fujitsu AP3000 (a closely coupled message passing architecture) and a network of desktop Sun workstations connected by Ethernet. These initial results show the potential of this approach; however, they also suggest that a more sophisticated load balancing scheme would result in improved parallel performance.
2 Parallelisation

This section describes the parallelisation strategy employed in the two routines. For this purpose the essential algorithmic steps are:

1. the calculation of appropriate finite difference intervals;
2. the calculation of partial derivatives at the current point, so that a local (quadratic) approximation to the nonlinear function can be formed and thereby a search direction can be defined;
3. a line-search in the direction defined by the local approximation.

Step 1 is executed once. Steps 2 and 3 are repeated until either a maximum iteration count is exceeded or some given accuracy is achieved. All the above steps involve function evaluations and, even for modest functions, the objective (and constraint) function evaluations dominate the computation.

For E04JAFP, parallelisation is based on the concurrent evaluation of objective functions. For each variable, the calculation of finite difference intervals (step 1) and of partial derivatives (step 2) involves a number of calls to the objective function. The actual number depends on the nature of the function: for example, central differences may be used instead of forward differences, and the former require more function evaluations than the latter. The objective function calls corresponding to different variables are independent and are computed in parallel. Some of the objective function calls for each variable may also be independent; if so, they too are computed in parallel. The line search (step 3) also involves a number of function evaluations but each evaluation depends on the result of the previous evaluation and must therefore be performed sequentially.

E04UCFP bases its parallelism on the concurrent evaluation of objective functions and, optionally, also of the constraint functions. In typical use it is expected that the objective function will dominate the computational cost; the default parallel strategy is therefore very similar to that of E04JAFP. E04UCFP differs in one sense: each objective function call in the line search has an associated call to the constraint function, so the objective function call is computed in parallel with the constraint function call.
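The independence argument of steps 1 and 2 can be illustrated with a short sketch. This is not the library's Fortran 77 implementation; it is a hypothetical Python illustration of the same idea, using forward differences and a thread pool (a real master-slave implementation would distribute the calls over separate processes or machines).

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_forward_gradient(f, x, h=1e-6, max_workers=None):
    """Approximate grad f(x) by forward differences, evaluating f concurrently.

    Each perturbed point x + h*e_i is independent of the others, so the
    n + 1 function calls (one per variable, plus the base point) can all
    proceed in parallel -- the same independence E04JAFP exploits.
    """
    n = len(x)
    points = [list(x)]                      # base point, for f(x)
    for i in range(n):
        xi = list(x)
        xi[i] += h
        points.append(xi)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        values = list(pool.map(f, points))  # concurrent evaluations
    f0 = values[0]
    return [(values[i + 1] - f0) / h for i in range(n)]
```

For expensive native objective functions (or functions that spend their time waiting, as in the artificially delayed test cases later in the paper) the pool genuinely overlaps the evaluations.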
If the PARCONF option is set for E04UCFP (via a support routine) then parallelism is also exploited through the concurrent evaluation of constraint functions. The parallelism exploited is similar to that used for the objective function. For each constraint, the calculation of finite difference intervals (step 1) and partial derivatives (step 2) requires a number of constraint function calls. The calls corresponding to different constraints are independent and are computed in parallel. Some of the constraint function calls for each variable may also be independent; if so, they too are computed in parallel. In general, constraint and objective function calls are independent of each other and are computed in parallel. One necessary consequence of parallelisation is that the guarantee that the constraint function will always be called before the objective function (given in the sequential routine E04UCF) is no longer valid.

The documentation for E04UCF also suggests that partial derivatives for the objective and constraint functions should be supplied where possible. This, of course, improves the robustness of the algorithm, but it also decreases the amount of parallelism available. Where the partial derivatives of the constraint and objective functions are fully supplied, the only parallelism available in this implementation is the concurrent evaluation of pairs of objective and constraint functions, as described earlier.

Parallelism is implemented in a master-slave style where the master assigns work (function evaluations) to the slaves and waits for the results. (In accordance with NAG's naming convention, the parallel routines inherit the sequential routine names with an appended "P".) Note that for parallelism
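The mask-based constraint work request described above can be sketched as follows. This is an illustrative Python fragment with hypothetical names, not E04UCFP's actual interface: the master sends a point plus a mask selecting which constraints a slave should evaluate, and only the selected entries are computed (here, concurrently on a shared pool).

```python
from concurrent.futures import ThreadPoolExecutor

def masked_constraints(cons, x, mask, pool):
    """Evaluate only the constraints selected by `mask`, concurrently.

    `cons` is a list of scalar constraint functions c_i(x); `mask[i]`
    is truthy when c_i(x) is wanted. Unrequested entries stay None,
    mirroring a work request that names only some constraints.
    """
    wanted = [i for i, m in enumerate(mask) if m]
    futures = {i: pool.submit(cons[i], x) for i in wanted}
    out = [None] * len(cons)
    for i, fut in futures.items():
        out[i] = fut.result()
    return out
```

For example, with three constraints and mask `[1, 0, 1]`, only the first and third are evaluated and the middle slot is left empty.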
to be worthwhile, the function evaluation(s) must be of sufficient granularity to dominate the communication costs. The communication costs for the objective function are: (1) sending the required position vector (from the master to the slave) and (2) sending the resultant function value for that position (from the slave to the master). The communication costs for the constraint function (in E04UCFP) are: (1) sending the required position vector and a constraint mask which identifies the constraints to be computed (from the master to the slave) and (2) sending the resultant constraint vector (from the slave to the master). In both cases an extra integer flag is sent (from the master to the slave) to indicate whether the slave should terminate, either due to convergence or to an error. In E04UCFP this flag is also used to indicate to the slave whether it is to compute an objective or a constraint function. E04UCFP allows the user to request termination of the routine by setting a flag in the objective or constraint function; to support this feature, this flag is also returned (from the slave to the master).

In E04UCFP the number of constraint function calls required to calculate the finite difference intervals grows as O(NM), where N is the number of unspecified derivatives in the objective function and M is the number of constraints with unspecified derivatives. When N and M are large this can lead to the master sending a large number of work requests to each slave in a short space of time. In our implementation a slave deals with each work request separately and we rely on the message passing system to buffer any outstanding requests. It is therefore possible for the message passing system to run out of buffer space, causing the program to crash. To avoid this problem the code keeps the number of outstanding work requests below a maximum threshold; at this threshold the master will not issue a new work request until a previous request has completed.

Note that in both parallel routines the underlying sequential numerical algorithms remain unchanged. As mentioned earlier, there remain a number of sequential calls to the objective (and constraint) functions; in particular, the line search is sequential. Each of the steps must also be computed in order; for example, the next iteration cannot be performed before the current iteration has completed. The number of elements to be calculated in the gradient vector may also vary during the computation, as some independent variables may become constrained. The performance improvement will therefore not be linear in the number of processors, and will depend on the nature of the particular function.
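The threshold on outstanding work requests can be sketched in a few lines. This is a hedged Python illustration (the routines themselves are Fortran 77 over the BLACS, not Python): `max_outstanding` plays the role of the internal buffer limit, and the master blocks until at least one request completes before issuing more work.

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def bounded_master(tasks, worker, n_slaves=4, max_outstanding=100):
    """Issue work requests, never leaving more than `max_outstanding`
    unfinished -- the strategy the routines use to avoid exhausting
    message-passing buffer space."""
    results = []
    pending = set()
    with ThreadPoolExecutor(n_slaves) as pool:
        for task in tasks:
            if len(pending) >= max_outstanding:
                # Block until at least one outstanding request finishes.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                results.extend(f.result() for f in done)
            pending.add(pool.submit(worker, task))
        done, _ = wait(pending)                 # drain remaining requests
        results.extend(f.result() for f in done)
    return results
```

As the paper later observes, the exact value of the limit matters little for performance; its purpose is purely to bound buffer usage.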
3 NAG Parallel Library

The two routines described in this paper have been developed for inclusion in the NAG Parallel Library [10]. This library is a collection of parallel Fortran 77 routines for the solution of numerical and statistical problems. The library is primarily intended for distributed memory parallel machines, including networks and clusters, although it can readily be used on shared memory parallel systems that implement PVM [5] or MPI [12]. The library supports parallelism and memory scalability, and has been designed to be portable across a wide range of parallel machines. The library assumes a Single Program Multiple Data (SPMD) model of parallelism, in which a single instance of the user's program executes on each of the logical processors.

The NAG Parallel Library uses the Basic Linear Algebra Communication Subprograms (BLACS) [3] for the majority of the communication within the library. Implementations of the BLACS, available in both PVM and MPI, provide a higher level communication interface. However, there are a number of facilities that are not available in the BLACS, such as sending multiple data types in one message (multiple messages must be sent instead). There is, therefore, a clear trade-off between code portability (plus ease of maintenance) and performance; our implementation errs towards portability. The library is designed to minimise the user's concern with the use of the BLACS, PVM or MPI, and to present a higher level interface using library calls. Task spawning and the definition of a logical processor grid and its context are handled by the parallel library routine Z01AAFP. On completion, the library routine Z01ABFP is called to undefine the grid and context; the routines Z01AAFP and Z01ABFP can be considered as left and right braces, respectively, around the parallel code. The library is also designed to minimise the differences between the familiar sequential library interface and the parallel library interface. In the case of the optimisation routines this is very successful; the only differences visible to the user between the sequential and parallel versions of the two optimisation routines are:
- an extra argument (the BLACS context);
- an extra argument classification: arguments can be Global or Local. Global implies the same value(s) on each process, whereas Local implies different values on each process;
- an increased number of error conditions, associated with the routine checking whether the environment is correctly set up and that global data is indeed Global;
- that the constraint function in E04UCFP is no longer guaranteed to be called before the objective function.
4 Test Cases

Four test cases have been chosen to evaluate the parallel performance of the two routines. Extended Rosenbrock and Watson, from the NETLIB unconstrained function repository [2] (see also [9]), are used with E04JAFP. FORTRAN 77 versions of BT11 and DTOC6, taken from the Constrained and Unconstrained Testing Environment (CUTE) [1], are used with E04UCFP. The test cases are characterised by the number of variables (N) and the number of non-linear constraints (M), as appropriate. In all cases gradients are calculated by finite difference approximations. As these functions are not particularly computationally costly (they are designed primarily to test the robustness of the algorithms), the computational cost is artificially varied by adding delays which wait for a set amount of (wall clock) time T. Note that, for our examples, this delay is several orders of magnitude greater than the function evaluation cost, allowing the latter to be neglected. This allows us to use well known functions whilst at the same time exploring the parallel performance of the algorithm. In E04UCFP, all constraints are specified in one function; E04UCFP supplies the function with a mask of the constraints it requires evaluating. This mask varies during execution and depends on the nature of the problem. The varying cost of this function is set to be proportional to the number of constraints calculated.
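The cost-inflation scheme of this section can be mimicked as below; an illustrative Python sketch. The Extended Rosenbrock shown is one common chained variant, which may differ in detail from the exact NETLIB form used in the paper.

```python
import time

def rosenbrock(x):
    # One common "extended Rosenbrock" variant (chained form); the
    # NETLIB/CUTE test sets use closely related definitions.
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))

def with_delay(f, delay):
    """Wrap a cheap test function so each evaluation costs ~`delay`
    seconds of wall-clock time, making the delay dominate the true
    evaluation cost as in the paper's experiments."""
    def wrapped(x):
        start = time.perf_counter()
        value = f(x)
        remaining = delay - (time.perf_counter() - start)
        if remaining > 0:
            time.sleep(remaining)
        return value
    return wrapped
```

Because `time.sleep` only guarantees a minimum wait, the realised delay is slightly greater than requested, which is exactly the one-processor discrepancy the paper notes between Ideal and measured performance.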
5 Results
5.1 E04JAFP: Extended Rosenbrock and Watson
The achieved performance of the Rosenbrock and Watson test cases, for execution on both the Fujitsu AP3000 and a network of Sun workstations, is shown in Figures 1 and 2. Results are presented for (artificial) function evaluation times T = 0.01s and T = 0.1s. The Ideal performance is given by (Fs + Fp/P)T, where Fs is the number of function evaluations which must be executed sequentially, Fp is the number which may be executed in parallel, T is the delay and P is the number of processors. The values of Fs and Fp for the two test cases are given in Table 1. As the function evaluation cost dominates the execution time, and the function evaluation time is independent of processor performance, the execution times in Figures 1 and 2 for the Suns and the AP3000 are similar despite the difference in the power of the processors. The difference between the Ideal performance and the actual performance on one processor is due to inaccuracies in the realisation of the requested delays T: if T = 0.1s then the actual delay is slightly greater than 0.1s, and the inaccuracy differs between the two architectures. From these performance results the following observations can be made:

1. The parallelism is not dominated by communication overhead: once the function cost is around 0.1s or greater, the results presented in Figures 1 and 2 closely follow the Ideal performance, which assumes perfect load balance and zero communication costs. Most real world applications have much greater function evaluation costs than this. For example, a CFD code which is used as the objective function for E04JAFP in an aerodynamics application takes at least 10 minutes per evaluation on a workstation.

2. The dimensionality of the problem limits the available parallelism and also determines the ratio of sequential to parallel function evaluations. Rosenbrock, for which N = 10, runs out of parallelism with 11 processors, and Watson (N = 20) scales better than Rosenbrock; see Figures 1 and 2.

3. With an evaluation cost of 0.01s, performance on the Sun network is limited by communication cost, whereas the AP3000 still shows performance improvement. However, the total execution time is less than 20s, a time for which sequential execution would normally be adequate.

4. Unavoidable load imbalance, which arises when the load cannot be evenly distributed, can be seen as `steps' which occur for both test cases. For example, in Figure 2 there is no performance improvement from 8 to 10 processors for the Watson function on the AP3000 with f = 0.1s.

Function    | Total Evaluations | Sequential Evaluations Fs | Parallel Evaluations Fp
Rosenbrock  | 506               | 106                       | 400
Watson      | 1381              | 161                       | 1220

Table 1: Function evaluation count for Rosenbrock (N = 10) and Watson (N = 20).
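The Ideal model above, together with the counts in Table 1, can be checked numerically. A small sketch follows; the asymptotic speedup bound (Fs + Fp)/Fs is our addition, derived from the same model rather than stated in the paper.

```python
def ideal_time(f_seq, f_par, t_eval, n_proc):
    # Ideal execution time (Fs + Fp/P) * T: sequential evaluations in
    # series, parallel ones spread perfectly over P processors, with
    # zero communication cost.
    return (f_seq + f_par / n_proc) * t_eval

def max_speedup(f_seq, f_par):
    # Limiting speedup as P grows without bound: (Fs + Fp) / Fs.
    return (f_seq + f_par) / f_seq

# Table 1 counts: Rosenbrock Fs=106, Fp=400; Watson Fs=161, Fp=1220.
rosenbrock_ideal = ideal_time(106, 400, 0.1, 10)  # (106 + 40) * 0.1 = 14.6s
```

The bound confirms observation 2: Watson's limiting speedup, 1381/161 (about 8.6x), exceeds Rosenbrock's 506/106 (about 4.8x), so Watson scales better.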
[Figure 1: Performance results for Rosenbrock. Execution time (s) against number of processors (1-12), for the Suns and the AP3000 with f = 0.01s and f = 0.1s, together with the Ideal curves.]

[Figure 2: Performance results for Watson. Execution time (s) against number of processors (1-12), for the Suns and the AP3000 with f = 0.01s and f = 0.1s, together with the Ideal curves.]
5.2 E04UCFP: BT11 and DTOC6
Performance results for the BT11 test case are given in Figures 3 and 4, whilst performance results for the DTOC6 test case are given in Figures 5 and 6. All results are for the Fujitsu AP3000. In both cases the former figure shows the performance when the cost of the constraint function can be neglected, and the latter the performance when the costs of the objective and constraint functions are equal (assuming all partial derivatives for a constraint function are evaluated). For this routine the Ideal performance cannot be easily calculated. The following observations can be made from these performance results:

1. For these test cases, the communication overhead is negligible. This can be seen from the close agreement, for even numbers of processors, of the two curves in Figure 3. In the PARCONF case, zero-cost constraint functions are being calculated in parallel, resulting in increased communication compared with the default case; despite this, there is no visible loss in performance. (See point 4 for an explanation of the `saw tooth' effect.) In Figure 5 only a very small effect can be observed, despite a very large number of calls to the constraint function. Again, this is encouraging, as the test cases are computationally small relative to those expected to be run in practice.

2. For the default parallel strategy, the maximum parallelism is equal to the number of variables, N. This can be seen in Figure 3 where, for BT11, N = 5 and the performance does not increase beyond 6 processors. As N increases, the ratio of parallel to sequential evaluations for the objective function also increases, giving improved parallel performance; compare Figure 3 (N = 5) and Figure 5 (N = 41).

3. As the number of constraints (M) is increased, the ratio of parallel to sequential evaluations for the constraint function increases, resulting in improved parallel performance; compare Figure 4 (M = 3) with Figure 6 (M = 20). As both N and M increase, the fraction of time spent computing the constraint function also increases relative to the objective function; again compare Figure 4 (M = 3) with Figure 6 (M = 20).

4. Allowing the constraint and objective functions to be calculated in parallel through the PARCONF option causes the saw tooth effect seen in Figures 3 and 5. This is due to the differing costs of the objective and constraint functions, and the use of a naive round robin scheme for the allocation of work.

5. As shown in Figures 4 and 6, PARCONF provides a significant increase in the amount of parallelism, and a corresponding improvement in parallel performance, when the cost of the constraint function cannot be neglected.

6. Varying the size of the work buffer limit makes little or no difference to the overall parallel performance. For DTOC6 (see Figures 5 and 6), only a very slight performance improvement is observed as the limit is increased. As N and M are small in the BT11 example, any reasonably sized limit on buffer use makes no difference to the performance results; the results shown in Figures 3 and 4 for a buffer size limit of 100 are identical to those for 25 and 200.
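The saw tooth effect attributed to naive round robin allocation (point 4) can be reproduced in miniature. The following is an illustrative sketch, not the library's scheduler: with costly objective calls interleaved with free constraint calls, the round robin makespan is not monotone in the processor count.

```python
def round_robin_makespan(costs, n_proc):
    # Assign task i to processor i mod n_proc and report the busiest
    # processor's total load (the makespan).
    load = [0.0] * n_proc
    for i, cost in enumerate(costs):
        load[i % n_proc] += cost
    return max(load)

# Ten costly objective calls (0.2s each) interleaved with ten free
# constraint calls, as under PARCONF with a negligible constraint cost:
costs = [0.2, 0.0] * 10
# With 4 processors every costly call lands on processors 0 and 2
# (makespan 1.0s), while 3 processors spread them evenly (0.8s) --
# adding a processor makes things worse, producing a saw tooth.
```

A cost-aware (e.g. longest-task-first or work-stealing) allocation would remove this artefact, which is presumably the kind of load balancing improvement the paper anticipates.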
[Figure 3: Performance results for BT11 (t_objf = 0.1s, t_conf = 0.0s); default and PARCONF curves, buf = 100. Execution time (s) against number of processors (1-12).]

[Figure 4: Performance results for BT11 (t_objf = 0.1s, t_conf = 0.1s); default and PARCONF curves, buf = 100. Execution time (s) against number of processors (1-12).]

[Figure 5: Performance results for DTOC6 (t_objf = 0.03s, t_conf = 0.0s); default and PARCONF curves, buf = 25, 100 and 200. Execution time (s) against number of processors (1-12).]

[Figure 6: Performance results for DTOC6 (t_objf = 0.01s, t_conf = 0.01s); default and PARCONF curves, buf = 25, 100 and 200. Execution time (s) against number of processors (1-12).]
6 Conclusions

Two optimisation routines have been parallelised for the NAG Parallel Library. The parallel library interface provides a simple migration path from the sequential to the parallel library for such routines. The use of the BLACS in these routines helped with implementation and portability, without unduly affecting performance. For both routines, parallel performance increases with the number of variables (N) and non-linear constraints (M) with unspecified derivatives. The PARCONF option for E04UCFP increases parallelism and gives a significant performance improvement when the constraint function is computationally costly. However, it does highlight the problem of load imbalance, which will be addressed in a future release of the library. The work buffer limit is required to avoid buffer overflow, but it makes little difference to the overall performance; it is therefore set to 100 internally in the routine. Finally, our results show that workstation networks are a viable (and cheap) parallel computing resource for this type of optimisation problem, even for relatively modest function evaluation costs.
7 Acknowledgements

This work was funded by the ESPRIT Framework IV project P20018 PINEAPL. The authors would also like to thank the Fujitsu European Centre for Information Technology (FECIT) for granting access to their AP3000.
References

[1] I. Bongartz, A. R. Conn, N. I. M. Gould and Ph. L. Toint (1995), CUTE: Constrained and Unconstrained Testing Environment, ACM Transactions on Mathematical Software, 21, pp. 123-160.
[2] J. J. Dongarra and E. Grosse (1987), Distribution of mathematical software by electronic mail, Communications of the ACM, 30, pp. 403-407.
[3] J. Dongarra and R. C. Whaley (1997), A User's Guide to the BLACS v1.1, Technical Report CS-95-281, University of Tennessee, Knoxville, Tennessee.
[4] R. Fletcher (1987), Practical Methods of Optimization, Second Edition, John Wiley and Sons, Chichester.
[5] A. Geist, A. Beguelin, J. Dongarra, R. Manchek, W. Jiang and V. Sunderam (1994), PVM: A Users' Guide and Tutorial for Networked Parallel Computing, The MIT Press, Cambridge, Massachusetts.
[6] P. E. Gill, W. M. Murray and M. H. Wright (1981), Practical Optimisation, Academic Press, London.
[7] P. E. Gill and W. M. Murray (1976), Minimization Subject to Bounds on the Variables, National Physical Laboratory Report NAC 72.
[8] P. E. Gill, S. J. Hammarling, W. M. Murray, M. A. Saunders and M. H. Wright (1986), User's Guide for LSSOL (Version 1.0), Department of Computer Science, Stanford University, Report SOL 86-1.
[9] J. J. More, B. S. Garbow and K. E. Hillstrom (1981), Testing Unconstrained Optimization Software, ACM Transactions on Mathematical Software, 7, pp. 17-41.
[10] NAG (1997), NAG Parallel Library Manual, Release 2, NAG Ltd., Oxford.
[11] NAG (1997), NAG Fortran Library Manual, Mark 17, NAG Ltd., Oxford.
[12] M. Snir, S. Otto, S. Huss-Lederman, D. Walker and J. Dongarra (1996), MPI: The Complete Reference, The MIT Press, Cambridge, Massachusetts.