Testing Parallel Variable Transformation on VPP500

Eiki Yamakawa
College of Business Administration, Takamatsu University
Takamatsu, Kagawa 761-0194, Japan
E-mail: [email protected]

and

Masao Fukushima 1
Department of Applied Mathematics and Physics, Graduate School of Engineering, Kyoto University
Kyoto 606-01, Japan
E-mail: [email protected]

December 26, 1997

Abstract. This paper studies performance of the parallel variable transformation (PVT) algorithm for unconstrained nonlinear optimization through numerical experiments on a Fujitsu VPP500, one of the most up-to-date vector parallel computers. Special attention is paid to a particular form of the PVT algorithm that is regarded as a generalization of the block Jacobi algorithm that allows overlapping of variables among processors. Implementation strategies on the VPP500 are described in detail and results of numerical experiments are reported.

Key words. Parallel variable transformation, generalized block Jacobi algorithm, unconstrained optimization, nonlinear programming, VPP500

1 The work of this author was supported in part by the Scientific Research Grant-in-Aid from the Ministry of Education, Science, Sports and Culture, Japan.

1. Introduction

In the last several years, a number of novel parallel algorithms for solving nonlinear optimization problems have been developed. Among others, Mangasarian [8] proposed the parallel gradient distribution (PGD) algorithm, and Ferris and Mangasarian [2] presented the parallel variable distribution (PVD) algorithm, which was further studied by Solodov [13]. More recently, one of the authors [4] presented a general framework for unconstrained nonlinear optimization called the parallel variable transformation (PVT) algorithm that encompasses the above-mentioned parallel algorithms.

The purpose of this paper is two-fold. The first is to study performance of the PVT algorithm for large-scale unconstrained optimization problems. Special attention will be paid to a particular form of the PVT algorithm that is regarded as a generalization of the block Jacobi method that allows overlapping of variables among processors. We will show that this specific PVT algorithm is effective for large-scale nonlinear least squares problems. The second is to describe how the algorithm is implemented on a Fujitsu VPP500, which is one of the most up-to-date vector parallel computers. Although numerical experiments with parallel algorithms have been frequently reported in the literature, most of the recent experiments seem to have been carried out on a Connection Machine CM-2 or CM-5. Since the Fujitsu VPP500's architecture is considerably different from those machines, and since its performance seems very high, we believe that a detailed description of the implementation strategies used in the numerical experiments is of interest to people in the optimization community.

This paper is organized as follows. In §2, we state the PVT algorithm and its global convergence and rate of convergence results. We also mention the generalized block Jacobi algorithm. In §3, we describe in detail implementation strategies of the block Jacobi algorithm on a VPP500. In §4, we report results of numerical experiments. In §5, we conclude the paper with some remarks.


2. Parallel Variable Transformation

In this section, we state the parallel variable transformation (PVT) algorithm for solving the unconstrained minimization problem
$$\min_{x \in \Re^n} f(x).$$

Theorem 2.2 ([4]) There exist $\delta > 0$, $c_1 > 0$ and $c_2 > 0$ such that $\mathrm{dist}[x, S^*] \le \delta$ implies
$$\|\nabla f(x)\| \ge c_1 \,\mathrm{dist}[x, S^*] \tag{6}$$
and
$$f(x) - f^* \le c_2 \,\mathrm{dist}[x, S^*]^2, \tag{7}$$
where $\mathrm{dist}[x, S^*] \equiv \inf\{\|x - x'\| \mid x' \in S^*\}$ and $f^* \equiv \min_{x \in \Re^n} f(x)$.

Note that, since $\nabla \varphi^{(k)}_\ell(y_\ell) = (A^{(k)}_\ell)^T \nabla f(A^{(k)}_\ell y_\ell + x^{(k)})$, evaluating the gradient of the $\ell$th subproblem requires only the components of $\nabla f$ corresponding to the non-zero rows of $A^{(k)}_\ell$. In general, we cannot apply this strategy to the PVD algorithm of [2], even if the objective function of the original problem has some particular structure, because the PVD algorithm corresponds to the particular PVT algorithm in which the transformation matrices $A^{(k)}_\ell$ may have $n$ non-zero rows.
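As a small illustration of how this structure can be exploited, the sketch below evaluates the $\ell$th subproblem function and gradient under the assumption that $A^{(k)}_\ell$ is a 0/1 selection matrix determined by an index set $J_\ell$ (the generalized block Jacobi case); the helper names are ours, not from the paper.

    import numpy as np

    def make_subproblem(f_and_grad, x, J):
        """Return phi(y) = f(A y + x) together with its gradient A^T grad f(A y + x),
        assuming A is the 0/1 selection matrix that scatters y into the components
        of x listed in the index array J; f_and_grad(v) must return (f(v), grad f(v))."""
        def phi(y):
            v = x.copy()
            v[J] += y                      # v = A y + x
            fv, gv = f_and_grad(v)
            return fv, gv[J]               # A^T grad f keeps only the rows in J
        return phi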

4. Numerical Experiments

We have conducted computational experiments on a VPP500 with the PVT algorithm, in particular, with the generalized Jacobi method as described in §2.3. In this section, we first state the characteristics of the test problems that we solved, and then report the numerical results.

4.1. Test Problems

We tested the algorithm on several large-scale nonlinear least-squares problems taken from the literature [9, 10, 11] that constitute a part of the source for the CUTE collection [1]. The details of the test problems are given below. Note that we can vary the problem size arbitrarily by specifying the number n of variables.


Problem 4.1 ([9]) Extended Powell singular function.

Objective function:
$$f(x) = \sum_{i=1}^{n/4} \left\{ (x_{4i-3} + 10x_{4i-2})^2 + 5(x_{4i-1} - x_{4i})^2 + (x_{4i-2} - 2x_{4i-1})^4 + 10(x_{4i-3} - x_{4i})^4 \right\}$$
Starting point: $x^{(0)} = (3, -1, 0, 1, \ldots, 3, -1, 0, 1)$
Minimum value: $f^* = 0$ at $x^* = (0, \ldots, 0)$

Problem 4.2 ([9]) Extended Rosenbrock function.

Objective function:
$$f(x) = \sum_{i=1}^{n/2} \left\{ 100(x_{2i} - x_{2i-1}^2)^2 + (1 - x_{2i-1})^2 \right\}$$
Starting point: $x^{(0)} = (-1.2, 1.0, \ldots, -1.2, 1.0)$
Minimum value: $f^* = 0$ at $x^* = (1, \ldots, 1)$

Problem 4.3 ([10]) Generalized Rosenbrock function.

Objective function:
$$f(x) = 1 + \sum_{i=2}^{n} \left\{ 100(x_i - x_{i-1}^2)^2 + (1 - x_{i-1})^2 \right\}$$
Starting point: $x^{(0)} = (1/(n+1), \ldots, 1/(n+1))$
Minimum value: $f^* = 1$ at $x^* = (1, \ldots, 1)$

Problem 4.4 ([9]) Broyden tridiagonal function.

Objective function:
$$f(x) = \sum_{i=1}^{n} \left\{ (3 - 2x_i)x_i - x_{i-1} - 2x_{i+1} + 1 \right\}^2$$
Starting point: $x^{(0)} = (-1, \ldots, -1)$
Minimum value: $f^* = 0$

Problem 4.5 ([9]) Broyden banded function.

Objective function:
$$f(x) = \sum_{i=1}^{n} \Bigl\{ x_i(2 + 5x_i^2) + 1 - \sum_{j \in J_i} x_j(1 + x_j) \Bigr\}^2,$$
where $J_i = \{ j \mid j \ne i,\ \max(1, i-5) \le j \le \min(n, i+1) \}$
Starting point: $x^{(0)} = (-1, \ldots, -1)$
Minimum value: $f^* = 0$

Problem 4.6 ([9]) Trigonometric function.

Objective function:
$$f(x) = \sum_{i=1}^{n} \Bigl\{ n - \sum_{j=1}^{n} \cos x_j + i(1 - \cos x_i) - \sin x_i \Bigr\}^2$$
Starting point: $x^{(0)} = (1, \ldots, 1)$
Minimum value: $f^* = 0$ at $x^* = (2m\pi, \ldots, 2m\pi)$, $m = 0, \pm 1, \pm 2, \ldots$

Problem 4.7 ([11]) Quartic.

Objective function:
$$f(x) = \Bigl( \sum_{i=1}^{n} i\, x_i^2 \Bigr)^2$$
Starting point: $x^{(0)} = (1, \ldots, 1)$
Minimum value: $f^* = 0$ at $x^* = (0, \ldots, 0)$
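For concreteness, a direct NumPy implementation of, e.g., Problem 4.2 might look as follows (a sketch; the function name and the vectorized indexing are our own choices, not part of the paper).

    import numpy as np

    def extended_rosenbrock(x):
        """Problem 4.2: sum over i of 100*(x_{2i} - x_{2i-1}^2)^2 + (1 - x_{2i-1})^2,
        with the 1-based pairing used in the text."""
        odd = x[0::2]     # x_1, x_3, ...  (the x_{2i-1} components)
        even = x[1::2]    # x_2, x_4, ...  (the x_{2i} components)
        return np.sum(100.0 * (even - odd**2)**2 + (1.0 - odd)**2)

    x0 = np.tile([-1.2, 1.0], 500)       # standard starting point with n = 1000
    print(extended_rosenbrock(x0))       # f* = 0 is attained at x* = (1, ..., 1)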

The starting points shown above are standard except the one for Problem 4.6, for which the standard starting point is $(1/n, \ldots, 1/n)$. We used the starting point $(1, \ldots, 1)$ because the standard one becomes close to the optimal solution $(0, \ldots, 0)$ when $n$ is large. The Hessian matrices of the objective functions in Problems 4.1 and 4.2 have block diagonal structure, with block diagonal parts being $4 \times 4$ matrices and $2 \times 2$ matrices, respectively. The Hessian matrices of the objective functions in Problems 4.3, 4.4 and 4.5 have band structure; in particular, those matrices in the first two problems are tri-diagonal. On the other hand, the Hessian matrices of the objective functions in Problems 4.6 and 4.7 are dense.

To apply the generalized block Jacobi method described in §2.3, we need to specify the index sets $J_\ell$, $\ell = 1, \ldots, p$, that satisfy (8). The objective functions of the above test problems have the form $f(x) = \sum_i f_i^2(x)$ except Problem 4.7. For Problems 4.1, 4.2, 4.3, 4.4 and 4.5, in particular, each $f_i$ is a simple function involving a fixed small number of variables no matter how large the total number $n$ of variables involved in $f$ becomes. Then, as described in §3.2, the evaluation of the function $f$ in the solution process of the $\ell$th subproblem in the parallelization phase can effectively be decomposed into two parts; one is evaluated by using $x_{J_\ell}$ and a few additional variables, and the other is evaluated without using $x_{J_\ell}$. For Problems 4.6 and 4.7, we do not have such a natural decomposition, because each $f_i$ involves a term of the form $\sum_{j=1}^{n} h(x_j)$. In this case, however, we may still reduce the cost of function evaluations simply by computing the term $\sum_{j \notin J_\ell} h(x_j)$ prior to the parallelization phase.
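As a small illustration of this precomputation, the quartic objective of Problem 4.7 can be evaluated on a PE that owns only the block $x_{J_\ell}$ once the partial sum over the remaining indices has been formed in advance. The sketch below is ours and only assumes the selection-matrix form of the subproblems.

    import numpy as np

    def make_quartic_subobjective(x, J):
        """Problem 4.7: f(x) = (sum_j j * x_j^2)^2.  During the parallelization
        phase only the components in J change, so the partial sum over j not in J
        is computed once beforehand (illustrative helper, not from the paper)."""
        w = np.arange(1, x.size + 1, dtype=float)             # weights j = 1, ..., n
        s_fixed = np.sum(w * x**2) - np.sum(w[J] * x[J]**2)   # contribution of j not in J
        def f_sub(xJ):                                        # xJ: current values of x_J
            return (s_fixed + np.sum(w[J] * xJ**2)) ** 2
        return f_sub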

4.2. Computational Results

The numerical experiments were carried out in double precision arithmetic on the VPP500 at the Kyoto University Data Processing Center. In order to solve subproblems in the parallelization phase and the synchronization phase, we used a quasi-Newton method with BFGS updates.
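The paper does not spell out the quasi-Newton solver itself; for reference, the BFGS update of the inverse Hessian approximation that it relies on has the standard form sketched below (a generic textbook sketch, not the authors' implementation).

    import numpy as np

    def bfgs_update(H, s, y):
        """Standard BFGS update of the inverse Hessian approximation H,
        with s = x_new - x_old and y = grad_new - grad_old."""
        sy = float(s @ y)
        if sy <= 1e-12:                  # skip the update if the curvature condition fails
            return H
        rho = 1.0 / sy
        I = np.eye(H.shape[0])
        V = I - rho * np.outer(s, y)
        return V @ H @ V.T + rho * np.outer(s, s)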

In the experiments, we stop the iteration of Algorithm PVT if the conditions
$$\|\nabla f(x^{(k)})\| \le \varepsilon_1 \tag{10}$$
and
$$f(x^{(k-1)}) - f(x^{(k)}) \le \varepsilon_2 \left(1 + |f(x^{(k-1)})|\right)$$
hold at the beginning of the parallelization phase. Note that condition (10) is equivalent to the condition
$$\|\nabla \varphi^{(k)}_\ell(0)\| \le \varepsilon_1, \qquad \ell = 1, \ldots, p.$$
In each PE, we always choose $y_\ell = 0$ as an initial point in the quasi-Newton method for solving subproblem (2). Note that $y_\ell = 0$ corresponds to the current major iterate $x^{(k)}$, which is quite natural as a starting point for finding a candidate for the next iterate. Then we terminate the procedure when $y_\ell$ satisfies the condition
$$\|\nabla \varphi^{(k)}_\ell(y_\ell)\| \le \max\left\{\varepsilon_1,\ \frac{\|\nabla \varphi^{(k)}_\ell(0)\|}{10}\right\}. \tag{11}$$
The term $\|\nabla \varphi^{(k)}_\ell(0)\|/10$ is introduced to truncate inner iterations prematurely, especially when the current major iterate $x^{(k)}$ is far from the optimal solution. In practice, $y^{(k)}_\ell$ satisfying the termination criterion (11) is regarded as an approximate solution of subproblem (2). On the other hand, we complete the synchronization phase if the conditions
$$\psi^{(k)}(z) \le \min_{1 \le \ell \le p} \varphi^{(k)}_\ell(y^{(k)}_\ell) \tag{12}$$
and
$$\|\nabla \psi^{(k)}(z)\| \le \max\left\{\varepsilon_1,\ \frac{\|\nabla \psi^{(k)}(\tilde{z})\|}{10}\right\}$$
hold, where $\tilde{z}$ is an initial point in the quasi-Newton method for solving subproblem (3). Note that condition (12) is essential in ensuring the global convergence of the algorithm (see (5) and Theorem 2.1). Since the choice of a starting point $\tilde{z}$ may affect the performance of the whole algorithm as well as that of the synchronization phase, we need to choose $\tilde{z}$ carefully by taking into account the structure of the transformation matrices $A^{(k)}_\ell$ (see, for example, (14) below).
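The stopping tests above translate directly into code. The sketch below uses the tolerance values employed in the experiments ($\varepsilon_1 = 10^{-3}$, $\varepsilon_2 = 10^{-6}$); the helper names are ours.

    import numpy as np

    def stop_outer(grad_f_xk, f_prev, f_curr, eps1=1e-3, eps2=1e-6):
        """Outer test: condition (10) together with the relative decrease test,
        checked at the beginning of the parallelization phase."""
        return (np.linalg.norm(grad_f_xk) <= eps1 and
                f_prev - f_curr <= eps2 * (1.0 + abs(f_prev)))

    def stop_inner(grad_phi_y, grad_phi_0, eps1=1e-3):
        """Inner test (11): truncate the quasi-Newton iterations for subproblem (2)."""
        return np.linalg.norm(grad_phi_y) <= max(eps1, np.linalg.norm(grad_phi_0) / 10.0)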

We first tested the generalized block Jacobi method on the test problems described in the previous subsection. The number $n$ of variables was fixed at 1000. As mentioned in §2.3, the generalized block Jacobi method is a special case of the PVT algorithm in which the $n \times m_\ell$ matrices $A^{(k)}_\ell$ consist of all the rows of the $m_\ell \times m_\ell$ identity matrix and additional $n - m_\ell$ zero rows. In particular, we choose $A^{(k)}_\ell$ in the following manner: Let $n_\ell$, $\ell = 0, 1, \ldots, p$, be indices such that
$$0 = n_0 < n_1 < \cdots < n_p = n.$$
Moreover, we choose an integer $m$ such that $0 \le m \le n_\ell - n_{\ell-1}$ for all $\ell = 1, \ldots, p$, and set
$$m_\ell = n_\ell - n_{\ell-1} + m.$$

Then, for all $k$, we let the matrices $A^{(k)}_\ell$ be
$$A^{(k)}_\ell = \begin{pmatrix} 0_{n_{\ell-1}} \\ I_{m_\ell} \\ 0_{n - n_\ell - m} \end{pmatrix}, \quad \ell = 1, \ldots, p-1, \qquad\text{and}\qquad A^{(k)}_p = \begin{pmatrix} I_{m_p^-} \\ 0_{n - m_p} \\ I_{m_p^+} \end{pmatrix}, \tag{13}$$
where $I_{m_\ell}$ is the $m_\ell \times m_\ell$ identity matrix, $I_{m_p^+}$ and $I_{m_p^-}$ are the submatrices of $I_{m_p}$ which consist of the first $n_p - n_{p-1}$ rows and the remaining $m$ rows of $I_{m_p}$, respectively, and $0_q$ is the $q \times m_\ell$ zero matrix for $q = n_{\ell-1}$, $n - n_\ell - m$ or $n - m_p$. In other words, we solve subproblem (9) described in §2.3 with the index sets
$$J_\ell = \{n_{\ell-1}+1, \ldots, n_\ell, n_\ell+1, \ldots, n_\ell+m\}, \quad \ell = 1, \ldots, p-1,$$
and
$$J_p = \{1, \ldots, m, n_{p-1}+1, \ldots, n\}$$
in the parallelization phase.

Then the $\ell$th PE is responsible for updating its own $n_\ell - n_{\ell-1}$ variables and $m$ additional variables which are primarily managed by the adjacent $(\ell+1)$th PE. This strategy is expected to be effective for solving large-scale problems in which an element in a block of the variables depends upon only a small number of elements belonging to the blocks contiguous to it. In the numerical experiments, we set
$$n_\ell - n_{\ell-1} = n/p$$
for all $\ell = 1, \ldots, p$, which means that each PE is primarily responsible for the same number of variables. On the other hand, $m$ out of $m_\ell$ ($= n/p + m$) variables are assigned to at most two PE's. Then we varied the ratio of $m$ to $\bar{n}$ ($= n/p$) from 0% to 40%. In each case, we examined the performance of the algorithm under various choices of the number $p$ of the subproblems to be solved in the parallelization phase. These $p$ subproblems were solved on $p$ PE's in parallel when $p > 1$. On the other hand, the column $p = 1$ in each table below gives the results obtained by applying the quasi-Newton method to the whole problem on a single PE. Specifically, for $p = 1$, the only transformation matrix $A^{(k)}_1$ given by (13) reduces to the $n \times n$ identity matrix $I_n$. Note that the parameter $p$ not only specifies the number of PE's used but also affects the algorithmic scheme itself. In view of the condition (12) in the termination criterion of the synchronization phase, a simple choice of the starting point $\tilde{z}$ would be the unit vector such that $\tilde{z}_\ell = 1$ for $\ell = \tilde{\ell}$, and $\tilde{z}_\ell = 0$ for all $\ell \ne \tilde{\ell}$. In the experiments where $p$ is even, we chose the starting point $\tilde{z}$ in the quasi-Newton method for solving subproblems in the synchronization phase as

$$\tilde{z}_\ell = \begin{cases} 1, & \ell = \tilde{\ell},\ \tilde{\ell} \pm 2,\ \tilde{\ell} \pm 4, \ldots, \\ 0, & \ell = \tilde{\ell} \pm 1,\ \tilde{\ell} \pm 3, \ldots, \end{cases} \qquad 1 \le \ell \le p, \tag{14}$$
where $\tilde{\ell} = \arg\min_{1 \le \ell \le p} \varphi^{(k)}_\ell(y^{(k)}_\ell)$. Then the $j$th element of the vector $B^{(k)}\tilde{z}$ is given by $x^{(k)}_j + y^{(k)}_j$ if $j \in J_\ell$ for some $\ell = \tilde{\ell}, \tilde{\ell} \pm 2, \tilde{\ell} \pm 4, \ldots$, with $1 \le \ell \le p$, and by $x^{(k)}_j$ otherwise. We use the starting point given by (14), because we wish to exploit the information obtained in each PE as much as possible.

Tables 1, 2 and 3 show the performance of the generalized block Jacobi method under the tolerance parameters $\varepsilon_1 = 10^{-3}$ and $\varepsilon_2 = 10^{-6}$, in which the ratio $m/\bar{n}$ was set equal to 0%, 20% and 40%, respectively. In the tables, the number of function evaluations means the cumulative number of those in the parallelization phase on each PE and those in the synchronization phase. Each table reveals that the PVT algorithm on $p$ ($> 1$) PE's is generally more efficient than the quasi-Newton method on a single PE. Moreover, the CPU time decreases as the number $p$ increases except for some problems like Problem 4.1 which can be solved on a single PE relatively fast. From these tables, we may recognize that, for each $p$, the CPU time gradually increases as the ratio of $m$ to $\bar{n}$ becomes large. This is because subproblems in the parallelization phase become large owing to the overlap of the variables.
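Before turning to the tables, the combination implied by the choice (14) can be sketched as follows (our own helper; y_list[l] is assumed to hold the subproblem solution aligned with the index set J_list[l]).

    import numpy as np

    def combine_blocks(x, y_list, J_list, ell_tilde):
        """Form B^{(k)} z_tilde for the choice (14): on the blocks J_l with
        l = ell_tilde, ell_tilde +/- 2, ..., take x_j + y_j, and keep x_j
        elsewhere."""
        z = x.copy()
        for l in range(len(J_list)):
            if (l - ell_tilde) % 2 == 0:      # same parity as ell_tilde
                z[J_list[l]] = x[J_list[l]] + y_list[l]
        return z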


Table 1: Performance of the generalized block Jacobi algorithm when m = 0.

            CPU time [sec] (Number of function evaluations)
Problem     p = 1            p = 2            p = 4            p = 8
4.1          1.336   (167)    0.091   (113)    0.095   (326)    0.096    (579)
4.2         19.57   (3116)    3.082  (4082)    0.797  (6102)   61.19  (166265)
4.3         43.33   (5747)    6.815  (4821)    3.216  (8222)    3.153  (26486)
4.4          5.737   (962)    0.898  (1019)    0.399  (1778)    0.278   (2320)
4.5         27.22   (3641)   13.75   (7210)    7.625 (11487)    1.260   (2771)
4.6         15.07   (1945)    0.356   (423)    0.337   (956)    0.447   (1780)
4.7         51.67   (9365)    0.996  (2555)    0.533  (3368)    0.353   (3387)

However, the difference in the CPU time with respect to the ratio $m/\bar{n}$ decreases as the number $p$ grows. On the other hand, the algorithm is sometimes unstable when the ratio $m/\bar{n}$ is small, as in the case of $p = 8$ and $m = 0$ for Problem 4.2. Thus the generalized block Jacobi method is effective with moderate overlapping of the variables.

We then made additional experiments in which the quasi-Newton iterations for solving subproblems in both the parallelization phase and the synchronization phase were limited to at most one iteration. In the experiments, we fixed $m$ at 0. Moreover, we chose the initial point $\tilde{z}$ in the synchronization phase as the unit vector with components $\tilde{z}_\ell = 1$ for $\ell = \tilde{\ell}$, and $\tilde{z}_\ell = 0$ for all $\ell \ne \tilde{\ell}$, where $\tilde{\ell} = \arg\min_{1 \le \ell \le p} \varphi^{(k)}_\ell(y^{(k)}_\ell)$, so that the resulting $z^{(k)}$ readily satisfies condition (5). Note that, with these choices, the PVT algorithm proceeds in the same manner as the PGD algorithm proposed in [8]. The results are shown in Table 4. The symbol ($-$) in the table indicates that the algorithm failed to solve the problem within 10000 iterations. Since each iteration of the PGD algorithm consists of the simple procedure that compounds the subvectors of the negative gradient at the current iterate, we may anticipate that the PGD algorithm consumes a large number of iterations. In fact, we see from the tables that the generalized block Jacobi method outperforms the PGD algorithm for the majority of the test problems.

Table 2: Performance of the generalized block Jacobi algorithm when $m/\bar{n} = 0.2$.

            CPU time [sec] (Number of function evaluations)
Problem     p = 1            p = 2            p = 4            p = 8
4.1          1.336   (167)    0.107   (123)    0.092   (295)    0.137    (896)
4.2         19.57   (3116)    4.715  (5123)    0.912  (7245)    0.460   (7328)
4.3         43.33   (5747)   15.58  (13539)    3.656 (12138)    1.583  (10800)
4.4          5.737   (962)    0.665   (624)    0.276   (976)    0.226   (2225)
4.5         27.22   (3641)   11.93   (6861)    7.818 (14185)    4.474  (16179)
4.6         15.07   (1945)    0.837  (1048)    0.397   (903)    0.577   (1601)
4.7         51.67   (9365)    3.793  (6479)    0.796  (5564)    0.430   (5335)

Finally we compared the generalized block Jacobi method with the unconstrained version of the PVD algorithm originally proposed in [2]. We executed the PVD algorithm as a special case of the PVT algorithm by letting, for each $\ell$,

