Solving Physical Design Problems on a Vector Machine

C.P. Ravikumar
Electrical Engineering Department
Indian Institute of Technology, Delhi
Hauz Khas, New Delhi, 110016, India

Abstract

In this paper, we describe parallel algorithms for two physical design problems. The target machine is the Alliant FX/80, a shared-memory multiprocessor with up to 8 advanced computational elements, each individually capable of vector operation. First, we describe the parallel implementation of an r-dimensional circuit placement algorithm. The objective of the placement problem is to arrange n logic blocks in an r-dimensional space so as to minimize the length of the interconnect wiring among the blocks. The problem is known to be NP-complete, and the numerical optimization technique presented in this paper finds a local optimum solution. Next, we discuss a parallel version of Kernighan and Lin's algorithm for the two-way circuit partition problem. Given n logic blocks and their connectivity matrix C, the problem is to generate a balanced two-way partition of the blocks such that the wiring across the partition is minimum. The partition problem is NP-complete, but the Kernighan-Lin heuristic algorithm generates a near-optimal solution.

1

Introduction

Even the simplest optimization problems in the physical design of integrated circuits are known to be NP-complete and are solved using heuristic techniques. There is a tradeoff between the optimality of the final solution and the runtime requirements of the heuristic algorithm; it is generally true that a more expensive heuristic will generate a better quality solution to the optimization problem. Parallel computing changes this scenario in two ways: (1) For the same runtime, better quality solutions can be generated on a parallel machine, since we can afford to use more expensive heuristics. (2) For a fixed heuristic, a solution can be generated much faster, and hence problems of larger size can be solved quickly [9]. In this paper, we describe parallel algorithms for two important problems in physical design, circuit partition and circuit placement. These algorithms have been implemented on an Alliant FX/80, a vector


multiprocessor. The architecture of the Alliant FX/80 is discussed in Section 2. In Section 3, we describe a parallel algorithm for r-dimensional placement. A parallel partitioning algorithm is described in Section 4. Conclusions are presented in Section 5.

1.1 r-dimensional Placement

Consider the problem of positioning n logic blocks in an r-dimensional space. Special cases of the problem result when r = 1 (the linear placement problem), r = 2 (two-dimensional placement of gate arrays or standard cells) and r = 3 (three-dimensional placement of chips on printed circuit boards using surface mount technology). The simplest version of the r-dimensional placement problem is the linear placement problem, where a set of logic cells must be placed in a single row such that the amount of wiring is minimized. Let x_1, x_2, ..., x_n be the coordinates of the n cells after placement. Let C_ij represent the connectivity between cells i and j. A cost function which represents the total amount of wiring is

z = (1/2) * sum_{i=1}^{n} sum_{j=1}^{n} C_ij * (x_i - x_j)^2    (1)

Minimizing z is known to be computationally hard, and the problem is solved using heuristic methods [9]. One such heuristic technique is due to Hall [4], who pointed out that z in Equation 1 is a quadratic form and could be expressed as z = X'BX, where B is a positive semi-definite n x n matrix and X is the column vector of the coordinates x_i. (A matrix A is said to be positive semi-definite if, for every vector Y, the value Y'AY >= 0.) Hall also showed that B could be computed from the connectivity matrix C as follows:

B = D - C    (2)

Here D = [d_ij] is a diagonal matrix, i.e. all the non-diagonal elements of D are zero. Furthermore, the diagonal element d_ii is given by d_ii = sum_{j=1}^{n} C_ij. The linear placement problem can be expressed as the minimization of z = X'BX, subject to the constraints that the determinant |B| > 0, and X'X = 1.
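Equation 2 can be checked numerically. The sketch below (plain Python with hypothetical helper names, not the FX/Fortran implementation described later in the paper) builds B = D - C for a 3-cell chain and confirms that the quadratic form X'BX equals the wiring cost of Equation 1 for an arbitrary placement vector X.

```python
def build_b(c):
    """B = D - C, where D is diagonal with d_ii = sum_j C_ij (Equation 2)."""
    n = len(c)
    b = [[-c[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        b[i][i] = sum(c[i])   # C is assumed to have a zero diagonal
    return b

def quadratic_form(b, x):
    """z = X'BX."""
    n = len(x)
    return sum(b[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def wiring_cost(c, x):
    """z of Equation 1: (1/2) * sum_ij C_ij * (x_i - x_j)^2."""
    n = len(c)
    return 0.5 * sum(c[i][j] * (x[i] - x[j]) ** 2
                     for i in range(n) for j in range(n))

# 3-cell chain 1-2-3 and an arbitrary placement vector
C = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
X = [0.3, -0.1, 0.7]
print(quadratic_form(build_b(C), X))   # equals wiring_cost(C, X)
```

Because the identity holds for every X, minimizing the quadratic form X'BX is exactly the linear placement objective.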

The constraint X'X = 1 avoids the trivial solution x_i = 0. Using the method of Lagrange multipliers, Hall obtained

(B - λI)X = 0    (3)

where λ is a Lagrange multiplier and I is the identity matrix. Equation 3 yields a non-trivial solution for X if and only if λ is an eigenvalue of B and X is the corresponding eigenvector. Since X'X = 1, Equation 3 can be rewritten as λ = X'BX. Therefore, the minimization of the cost z boils down to finding the eigenvector X of matrix B which minimizes z; the corresponding eigenvalue equals the cost. To be exact, the smallest eigenvalue does not give the minimum cost, since it corresponds to the trivial solution x_1 = x_2 = ... = x_n = 1/sqrt(n). The second smallest eigenvalue corresponds to the optimum solution. Hall also gave a generalization of the above method to r dimensions, r = 1, 2, ..., n-1; the linear placement problem corresponds to the case r = 1. Hall's method has been largely of theoretical interest, since it is a computationally expensive task to find the eigenvalues and eigenvectors of a large matrix. With the advent of vector supercomputers, the above formulation of the placement problem becomes a practical approach. In this section, we describe an Alliant FX/80 implementation of a placement algorithm based on Hall's technique. Hall's procedure returns the value of r and the r-dimensional coordinates; in other words, the user cannot specify the value of r as an input to the procedure. As a result, the technique can only be used as a starting point in a placement algorithm. If the resulting value of r is larger than 2, a post-processor becomes necessary to perform 'multidimensional scaling', i.e. map the r-dimensional coordinates to 2-dimensional coordinates by some suitable transformation. Hall manually solved small placement problems (n < 30) using his technique, using the output of his algorithm to guide him in the process [4].
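As a concrete check of Equation 3, consider a 3-cell chain 1-2-3. The sketch below (pure Python with hypothetical helper names, not the paper's implementation) verifies that the second-smallest eigenvalue λ = 1, with eigenvector (1, 0, -1)/sqrt(2), satisfies BX = λX and spreads the cells along the line in their natural chain order, while the smallest eigenvalue 0 corresponds to the trivial placement in which all coordinates coincide.

```python
import math

def build_b(c):
    """B = D - C (Equation 2)."""
    n = len(c)
    b = [[-c[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        b[i][i] = sum(c[i])
    return b

def matvec(a, x):
    """Matrix-vector product A x."""
    return [sum(a[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]

# 3-cell chain 1-2-3; here B has eigenvalues 0, 1 and 3
C = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
B = build_b(C)

# eigenvalue 0: the trivial solution x_1 = x_2 = x_3 = 1/sqrt(3)
trivial = [1 / math.sqrt(3)] * 3
# second-smallest eigenvalue 1: X = (1, 0, -1)/sqrt(2) satisfies (B - I)X = 0
x = [1 / math.sqrt(2), 0.0, -1 / math.sqrt(2)]
lam = 1.0

print(matvec(B, x))        # equals [lam * v for v in x]
print(matvec(B, trivial))  # the zero vector
```

The coordinates in x already read as a sensible linear placement: cell 2 sits between cells 1 and 3, which is the optimum ordering for the chain.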
Some researchers have improved Hall's technique to solve the quadratic placement problem [3, 12]. The first step in Hall's placement procedure is to use the connectivity matrix C to compute the B matrix (see Equation 2); B is a positive semi-definite, symmetric matrix. The second step is to compute the eigenvalues and eigenvectors of B and sort the eigenvalues in ascending order of magnitude. Let e(j) denote the eigenvector corresponding to the jth smallest eigenvalue, and let e(j, k) denote the kth element of e(j). In an r-dimensional placement, the coordinates of cell k are given by (e(2, k), e(3, k), ..., e(r+1, k)). Consider 7 circuit cells which are connected to form

a binary tree with 4 leaves (Figure 2(a)). The connectivity matrix for the binary tree is shown in Figure 2(b), and the B matrix in Figure 2(c). The eigenvalues for this example are, in ascending order of magnitude, {0, 0.268, 1, 1, 1.586, 3.732, 4.414}. If we are interested in a two-dimensional placement of the 7 cells, two appropriate eigenvectors must be selected to define the coordinate system for the cells. The minimum eigenvalue, namely 0, is ignored. The next two eigenvalues are 0.268 and 1; the corresponding eigenvectors are e(2) = (0.0000, 0.3251, -0.3251, 0.4440, 0.4440, -0.4440, -0.4440) and e(3) = (0.0000, 0.0000, 0.0000, 0.7011, -0.7011, 0.0918, -0.0918). It is evident that cell 1 must be placed at (0.0, 0.0) in the two-dimensional plane. Cells 2 and 3 are placed at (0.3251, 0) and (-0.3251, 0) respectively. The complete placement of the cells is shown in Figure 3. In fact, the circuit shown in Figure 1(a) corresponds to a binary tree with 4 leaves, and the optimum placement for this circuit shown in Figure 1(c) bears a striking resemblance to the symbolic placement of Figure 3.

1.2 Partition

Since it may not be possible to place an entire system on a single VLSI chip due to area and pin constraints, the system must be partitioned into subsystems. Any wiring that extends across subsystems (called "external" wiring) can occupy a lot of area on a printed circuit board. Signals traveling across external wiring can also be slowed down due to the 'transmission line' effect. Therefore, it is desirable to minimize external wiring in order to enhance the area-time product of the whole system. The general system partition problem is to generate a k-way partition of n logic blocks. A special case of the partition problem is when k = 2. If k is a power of two, then a k-way partition can be obtained by applying the two-way partition algorithm successively in a tree-like fashion. A popularly used two-way partition algorithm is due to Kernighan and Lin [7].
In this paper, we will use the terminology of [7] when referring to the Kernighan-Lin algorithm. Due to space limitations, the description of the algorithm is omitted; it may be found in [7].
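Since the algorithm's description is omitted, the following compact Python sketch of a single Kernighan-Lin pass may serve as a reminder (a simplified reconstruction following [7], with hypothetical names; the paper's FX/Fortran implementation differs). D(a) = E(a) - I(a) is external minus internal connectivity, pairs are tentatively swapped in order of gain g = D(a) + D(b) - 2C(a, b), and the prefix of tentative swaps with maximum cumulative gain is applied.

```python
def cut_cost(c, side):
    """Total connectivity crossing the partition."""
    n = len(c)
    return sum(c[i][j] for i in range(n) for j in range(i + 1, n)
               if side[i] != side[j])

def kl_pass(c, side):
    """One Kernighan-Lin pass over a balanced partition; returns the gain applied."""
    n = len(c)
    # D(a) = E(a) - I(a): external minus internal connectivity of module a
    d = [sum(c[a][b] if side[a] != side[b] else -c[a][b]
             for b in range(n) if b != a) for a in range(n)]
    locked = [False] * n
    swaps, gains = [], []
    for _ in range(n // 2):
        # pick the unlocked pair (a in X, b in Y) with maximum gain
        best = None
        for a in range(n):
            if locked[a] or side[a] != 0:
                continue
            for b in range(n):
                if locked[b] or side[b] != 1:
                    continue
                g = d[a] + d[b] - 2 * c[a][b]
                if best is None or g > best[0]:
                    best = (g, a, b)
        g, a, b = best
        locked[a] = locked[b] = True
        swaps.append((a, b))
        gains.append(g)
        # update D values of unlocked modules as if a and b were swapped
        for v in range(n):
            if not locked[v]:
                if side[v] == side[a]:
                    d[v] += 2 * c[v][a] - 2 * c[v][b]
                else:
                    d[v] += 2 * c[v][b] - 2 * c[v][a]
    # apply the prefix of tentative swaps whose cumulative gain is maximum
    best_k, best_total, total = 0, 0, 0
    for k, g in enumerate(gains, 1):
        total += g
        if total > best_total:
            best_k, best_total = k, total
    for a, b in swaps[:best_k]:
        side[a], side[b] = side[b], side[a]
    return best_total

# Example: two 2-cell cliques {0,1} and {2,3}, initially split the wrong way
C4 = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
side = [0, 1, 0, 1]          # side[i] = 0 puts module i in set X, 1 in set Y
print(cut_cost(C4, side))    # 2
kl_pass(C4, side)
print(cut_cost(C4, side))    # 0
```

A full run repeats kl_pass until it returns zero gain. On the 4-cell example a single pass groups each 2-cell clique on one side, reducing the cut from 2 to 0.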

2

Target Architecture

Figure 5 shows the architecture of the Alliant FX/80 multiprocessor [1]. The configuration shown in Figure 5 corresponds to the machine Hydra at the University of Southern California. The main building block of the Alliant is the advanced computational element (ACE), 8 of which are available for parallel operation. Each ACE is a microprogrammed, pipelined processor and supports the instruction set of the Motorola 68020, floating point instructions, and vector instructions which can handle double precision


(64-bit) arithmetic. The peak MFLOPS rating of an ACE is 23.6. The ACEs are connected by means of a 40-bit wide bus known as the concurrency control bus. The memory unit of the Alliant consists of 8 modules of 8 MB each. Each memory module is 4-way interleaved to increase the memory bandwidth. The memory bus is a high-speed, synchronous bus with two 72-bit wide bidirectional data paths; its bandwidth is 188 MBytes/sec. A cache unit, consisting of two sections of 256 KB each, is interfaced between the ACEs and the memory unit. Communication between the ACEs and the cache takes place over a crossbar interconnection network. Apart from the ACEs, the Alliant also supports up to 12 interactive processors (IPs). While the ACEs carry out CPU-intensive tasks, the IPs handle interrupts, terminals, disks, tapes and network traffic. The IPs interface to the memory bus through an IP cache; in a fully configured system, the IP cache consists of 4 segments of 32 KB each. The Alliant FX/80 runs the Concentrix operating system, which is based on the 4.2 BSD version of UNIX. Under Concentrix, three separate classes of computational resources may be identified: the computational complex, detached ACEs, and interactive processors. The user has the flexibility of specifying which of the three platforms should be used to run his/her job. A computational complex (cluster) consists of two or more ACEs working in parallel; scheduling and load balancing are done automatically by the operating system. The Concentrix operating system can dynamically configure the available computational resources. For example, ACEs 1, 2, 3 and 4 can form a computational complex to run a single job in a concurrent mode, while at the same time ACEs 5, 6, 7 and 8 execute in the detached mode, running independent jobs. The user can create processes to run on a computational platform of his/her choice. These processes access the shared memory in the system.
If two processes must share a variable, a lock-and-key mechanism must be used for correct operation. Thus a process P can only access a shared variable x if P can successfully lock x. After x has been suitably updated by P, the process unlocks the variable so that another process is able to obtain access to x. The commands to create processes and to access shared variables are available as library procedures that can be called from high-level languages such as C. The Concentrix operating system supports compilers for several languages: FX/C, FX/Fortran, FX/Ada and FX/Pascal. FX/Fortran on the Alliant is equipped with a parallelizing compiler. Different types of optimization can be specified as command-level options to the Fortran compiler. These optimizations are scalar optimization, vectorization and concurrentization; one or more of these options can be specified at a time. The compiler analyzes the Fortran source code for data dependences and generates instructions that use the concurrency and vectorization features of the hardware. As a first step, the compiler looks for concurrency, i.e. it checks whether loops and array operations can be scheduled on more than one processor for concurrent execution. Next, the compiler optimizes the program for vectorization, i.e. the compiler aggregates several scalar operations into a vector operation. Finally, scalar optimization is carried out; this consists of eliminating redundant expressions and dead storage, moving invariant code out of DO loops, computing and propagating constants at compile time, etc. The FX/Fortran compiler also provides an option to optimize programs for associative transformations. The FX/Fortran compiler can generate five different types of code, namely scalar, vector, scalar-concurrent, vector-concurrent and concurrent-outer-vector-inner (COVI).

3

Placement on Alliant

We have implemented Hall's algorithm on an Alliant FX/80 in FX/Fortran. The program was coded in order to exploit both the vector and concurrent modes of execution provided by the Alliant. The input to the program consists of a description of the circuit connectivity in the form of an n x n connectivity matrix C; the matrix is stored in the shared memory of the Alliant machine. The computation of the B matrix involves three steps: initialization of the B matrix, calculation of elements of the form B(i, i) (1 <= i <= n), and calculation of elements of the form B(i, j), i ≠ j (1 <= i, j <= n). All these calculations are done as part of DO loops which can be efficiently parallelized on the Alliant. The procedure rsm was used to compute the eigenvectors of the B matrix. This Fortran subroutine was obtained from the eigensystem subroutine package EISPACK [11] and then vectorized; rsm is customized to operate on real symmetric matrices. The main inputs to this subroutine are the matrix B, its dimensionality n and the number m of eigenvectors desired. As output, the procedure generates the m eigenvectors that correspond to the m smallest eigenvalues. We used m = 3, and the eigenvector e(1) was ignored. The program was tested against several benchmark circuits; these circuits were used by Saab and Rao to test their circuit partition algorithm [10]. The results of our experiments are shown in Figure 5; the execution time


includes both CPU time as well as I/O time. We observe that vectorization is more efficient than concurrent processing, in the following sense: increasing the number of computing elements (CEs) from one to two or more does not bring a proportional improvement in the performance of the algorithm. To investigate the reason for this behavior, the following three operations were individually timed: (1) input/output (reading the connectivity matrix and printing the placement vector X), (2) computation of the B matrix, and (3) computation of the eigenvectors of B. The resulting plots are shown in Figures 5-5. The speedup increases almost linearly with P for the computation of the B matrix (Figure 5). The placement algorithm using the eigenvector method is also sped up substantially through the use of vector-concurrent execution (Figure 5). However, the I/O requirements of the algorithm become the bottleneck for large values of n. When a single ACE works in the scalar mode, the I/O time is only a small fraction of the total execution time (less than 5% of the total execution time when n = 350). But with 4 ACEs working in vector-concurrent mode, the same I/O time occupies close to 50% of the total execution time (Figure 5). This observation confirms the well-known fact that I/O time is a serious bottleneck in parallel computation.
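One way to quantify this observation is the familiar serial-fraction argument (Amdahl's law; the numbers below are illustrative, not measurements from the experiments above): if a fraction s of the runtime, here dominated by I/O, does not parallelize, the speedup on P processors is bounded by 1/(s + (1-s)/P).

```python
def amdahl_speedup(serial_fraction, p):
    """Upper bound on speedup when a fixed fraction of the work stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

# If I/O keeps 5% of the work serial, 4 processors give at most ~3.5x,
# and no number of processors can push the speedup past the 1/0.05 = 20x ceiling.
print(round(amdahl_speedup(0.05, 4), 2))   # 3.48
print(amdahl_speedup(0.05, 10 ** 6))       # approaches, but never reaches, 20
```

This is why a small serial I/O fraction, negligible on one scalar ACE, can grow to half the runtime once the parallel part is accelerated.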

4

Partition on Alliant

Processors access the memory elements in an orthogonal fashion, i.e. at any time the processors access rows of memory or columns of memory. Due to this orthogonal access property, there cannot be a memory conflict. Therefore, the reduced array architecture can efficiently simulate the theoretical model of parallel computers known as the EREW PRAM (Exclusive Read, Exclusive Write Parallel Random Access Machine). The essential idea in [8] is to parallelize each step of the algorithm of [7], taking advantage of the interprocessor communication primitives of the reduced array. In the Alliant FX/80, the model of parallel computation is significantly different from the reduced array. While the reduced array machine has a large number of simple processors, the Alliant has a small number of powerful processors; the former is an SIMD machine, while the latter is a multiprocessor. The common feature of both machines is that they are shared-memory computers. These factors become important in mapping the partition algorithm to the Alliant FX/80. We have mentioned earlier that the connectivity matrix is stored in the shared memory of the Alliant FX/80; in spite of this fact, notice that processors can concurrently access the matrix without having to enter a critical region, because the matrix is a "read only" object. We take advantage of the vector capabilities of each processor to achieve parallelism. Specifically, the following operations benefit most from vector execution: computing the D values, updating the D values, selecting the cells which must be swapped in each iteration, computing the partial sums of the gains g_i, and calculating the index k such that the cumulative gain sum_{i=1}^{k} g_i is maximum. The first two of these computations are parallelizable with no overheads. The remaining operations involve the associative operations max and partial sum and are parallelized using the "recursive doubling" technique [5].
Recursive doubling is a method by which an associative operation on n elements can be performed in O(log2 n) time steps using n processors in a tree-like fashion. When p < n processors are available, the n elements are distributed equally among the p processors. Each processor applies the associative operation sequentially on n/p elements, and the partial results of the p processors are then combined in log2 p steps. Therefore, the associative operation takes O(n/p + log2 p) time, representing an asymptotic speedup of p. When p is not a power of two, the doubling technique incurs an overhead; this will be mentioned later in this section.
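The scheme above can be simulated sequentially. The sketch below (plain Python with illustrative names, not the vectorized FX/Fortran code) reduces n values with p simulated processors and counts the pairwise combining steps; any associative operation (sum, max, ...) could replace the addition.

```python
import math

def recursive_doubling_sum(values, p):
    """Simulate a p-processor associative reduction by recursive doubling.

    Each 'processor' first reduces its slice of about n/p elements
    sequentially; the p partial results are then combined pairwise,
    halving their number at every step (ceil(log2 p) steps in total).
    """
    n = len(values)
    chunk = math.ceil(n / p)
    partial = [sum(values[i:i + chunk]) for i in range(0, n, chunk)]
    steps = 0
    while len(partial) > 1:
        # one doubling step: pair up neighbouring partial results
        partial = [partial[i] + (partial[i + 1] if i + 1 < len(partial) else 0)
                   for i in range(0, len(partial), 2)]
        steps += 1
    return partial[0], steps

total, steps = recursive_doubling_sum(list(range(16)), 4)
print(total, steps)   # 120 2  -- ceil(log2 4) = 2 combining steps
```

With p = 3 the slices are uneven (6, 6 and 4 elements) and one combining slot goes idle, which is the power-of-two overhead noted above.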

The partition program is based on Kernighan and Lin's algorithm described in Section 1.2 and is implemented in FX/Fortran [1]. The implementation closely follows the algorithm described in [7]. The program can be used for graph bisection in general and two-way circuit partition in particular. The connectivity matrix is stored in a two-dimensional array C. The user has the option to specify an initial partition; otherwise, the program generates a random initial partition using the following technique. A random permutation of 1 ... n is generated in the array perm. The circuit modules numbered perm(1), ..., perm(n/2) are placed in the set X, and the remaining modules are assigned to the set Y. The output of the program is the pair of sets X and Y, both of which are implemented as integer arrays of dimension n/2. The parallelization of the various steps in the algorithm follows the ideas presented in a previous publication [8], except that a fixed number of processors is available in the target machine. In [8], an SIMD architecture called the reduced array was used to execute the partition algorithm. The reduced array has n processing elements and n^2 memory elements; the memory is organized as an n x n array, and processor i is connected to row i and column i of the memory elements. The program was tested against circuits provided by Saab and Rao [10]. In these circuits, problem sizes range from 20 cells up to 350 cells. The results are


Ckt Name     n    Soln.   Seq. (s)  T(1) (s)  T(2) (s)  T(3) (s)  T(4) (s)
in20.1.1     20      75       .05       .04       .03       .04       .04
in50.1.1     50     403       .55       .33       .30       .31       .31
in100.1.1   100    1686      4.43      1.97      1.63      1.59      1.06
in150.1.1   150    4355      33.3      6.06      6.00      4.63      4.51
in200.1.1   200    7636      44.6      16.3      14.3     13.61      9.86
in300.1.1   300   17331     135.7      56.3      34.6      43.4     31.39
in350.1.1   350   33469     513.7     133.5      76.8      67.1     63.98

Table 1: Results of Kernighan-Lin on Alliant FX/80

shown in Table 1. We observed that for the same problem instance, the execution time varied considerably from one run to another, because the random initial solution is different for every execution. The initial solution affects the execution time in two principal ways. First, the number of iterations of the repeat loop is determined by the initial solution. Second, the memory access patterns depend on the initial solution. For example, consider the calculation of E(a) for a module a. This involves the access of all connectivity terms of the form C(a, b), where b is a module in the opposite block. Since the initial partition determines the opposite block, it is clear that different C(a, b) are requested for computing E(a) during different executions of the program. The connectivity matrix is stored in the shared memory, and a cache is used to access elements from the shared memory. The access sequence plays a significant role in determining the performance of the cache as well as of the interleaved memory. In order to overcome this problem while reporting results, each experiment was repeated several times and the average execution time was used. Figure 5 shows the variation of speedup and execution time of the parallel partition algorithm as a function of P, the number of computational elements. The case P = 0 corresponds to a sequential scalar implementation. The speedup is seen to deteriorate for P = 3. This is chiefly due to the fact that associative computations by recursive doubling are inefficient when the number of processors is not a power of two. Figure 5 plots the variation of execution time as a function of the problem size n, using P as a parameter.

5

Conclusion

By formulating physical design problems as numerical optimization problems, they can be solved efficiently on vector multiprocessors. There have been several other attempts to solve VLSI design problems using numerical techniques. Cheng and Kuh [2] formulated the placement problem as the minimization of power dissipation P in a resistive network. Kuh, Tsay and Hsu employed this technique in their PROUD placement algorithm for sea-of-gates integrated circuits [12]; the Successive Over-Relaxation (SOR) method was used for optimization. Quinn and Breuer formulated PCB placement as a numerical problem of solving a set of non-linear algebraic equations [6]. Classical techniques, such as the Newton-Raphson method, can be used to solve these equations. Since these algorithms are expressible in terms of vector operations, vector machines are good computing platforms for them. Apart from numerical optimization techniques, several other approaches are known to solve the placement problem approximately, e.g. the min-cut heuristic, the divide-and-conquer heuristic, iterative improvement, simulated annealing, and branch-and-bound; parallel algorithms for many of these approaches are explored in [9]. The partition algorithm described in this paper is based on a heuristic due to Kernighan and Lin. Other partition algorithms used by the design community are based on simulated annealing, genetic algorithms, and Hopfield's neural networks; parallel versions of these are also discussed in [9].

Acknowledgement

This work was done when the author was a student at the University of Southern California.

References

[1] Alliant Computer Systems Corp. FX/Fortran Programmer's Manual, 1987.

[2] C.-K. Cheng and E.S. Kuh. Module Placement based on Resistive Network Optimization. IEEE Trans. on CAD, pp. 218-225, July 1984.

[3] J. Frankle and R.M. Karp. Circuit Placements and Cost Bounds by Eigenvector Decomposition. In Proc. ICCAD, pp. 414-417, 1986.

[4] K.M. Hall. An r-dimensional Quadratic Placement Algorithm. Management Science, 17:219-229, 1970.

[5] K. Hwang and F.A. Briggs. Computer Architecture and Parallel Processing. McGraw-Hill, 1985.

[6] N. Quinn Jr. and M.A. Breuer. A Force Directed Component Placement Procedure for Printed Circuit Boards. IEEE Trans. on Circuits and Systems, CAS-26(6):377-387, June 1979.

[7] B.W. Kernighan and S. Lin. An Efficient Heuristic Procedure for Partitioning Graphs. Bell System Technical Journal, 49(2):291-307, Feb. 1970.

[8] C.P. Ravikumar et al. Parallel Circuit Partitioning on a Reduced Array Architecture. Computer-Aided Design, 21(7):447-455, Sep. 1989.

Figure 2: Hall's placement algorithm for a tree-structured circuit. (a) A binary tree with 4 leaves: cell 1 is connected to cells 2 and 3, cell 2 to cells 4 and 5, and cell 3 to cells 6 and 7. (b) The connectivity matrix C of the tree. (c) The matrix B = D - C:

     [  2 -1 -1  0  0  0  0 ]
     [ -1  3  0 -1 -1  0  0 ]
     [ -1  0  3  0  0 -1 -1 ]
B =  [  0 -1  0  1  0  0  0 ]
     [  0 -1  0  0  1  0  0 ]
     [  0  0 -1  0  0  1  0 ]
     [  0  0 -1  0  0  0  1 ]