Parallel Implementations of the Power System Transient Stability Problem on Clusters of Workstations

Monika ten Bruggencate and Suresh Chalasani
Dept. of Electrical and Computer Engineering
University of Wisconsin-Madison
Madison, WI 53706-1691
Email: {tenbrugg,suresh}[email protected]

Abstract. Power system transient stability analysis computes the response of the rapidly changing electrical components of a power system to a sequence of large disturbances followed by operations to protect the system against the disturbances. Transient stability analysis involves repeatedly solving large, very sparse, time-varying non-linear systems over thousands of time steps. In this paper, we present parallel implementations of the transient stability problem in which we use direct methods to solve the linearized systems. One method uses factorization and forward and backward substitution to solve the linear systems. Another method, known as the W-Matrix method, uses factorization and partitioning to increase the amount of parallelism during the solution phase. The third method, the Repeated Substitution method, uses factorization and computations which can be done ahead of time to further increase the amount of parallelism during the solution phase. We discuss the performance of the different methods implemented on a loosely coupled, heterogeneous network of workstations (NOW) and the SP2 cluster of workstations.

This research has been supported in part by a grant from the Graduate School of UW-Madison and by NSF grants ECS-9216308 and CCR-9308966.

1 Introduction

A power system can be modeled as a set of generators (power plants) and a set of loads (electric distribution centers) interconnected via a transmission network [2]. The transmission network itself consists of nodes (buses) and branches. A model of a small power system is given in Figure 1. The circles correspond to generators, the triangles to loads, and the lines connecting them are the branches of the transmission network. The example shown contains seven nodes, where power is fed into the transmission network and distributed to the loads. The power-system stability analysis computes the response of a power system to a sequence of large disturbances followed by operations to protect the system. Network short circuits, for instance, can be considered large disturbances [9]. The design engineer must analyze the ability of the system to withstand specified severe disturbances without failing to supply sufficient power to all loads. Furthermore, the design engineer is often interested in the system response characteristics, such as damping of oscillations and recovery to almost normal operating conditions after disturbances occurred. The stability analysis of a power system is a simulation in the time domain, lasting several seconds or minutes. First the power system is simulated in its operating state, then large disturbances and protective actions are simulated, and finally the simulation is continued for a few more seconds or minutes. Different components of the power system have their greatest influence on the stability of the system at different points in time of the response simulation. The transient stability analysis of an electrical power system emphasizes the rapidly responding electrical components of the system [21], for instance voltages and currents at the loads. Besides stability analysis, two other power-system computations are routinely performed: load-flow analysis and fault analysis.
The stability problem attracted our attention since it is by far the most complex of the three in terms of modeling and solution methods. In engineering applications, it is often desirable to obtain information about the effects of different fault locations, types of faults, or initial operating states of the power system. In design studies, it is desirable to obtain information about the effects of different networks, machine characteristics and control-system characteristics. Such studies require many response simulations, each of which involves the repeated solution of large sets of sparse linear equations. It is obvious that this problem requires fast computers and sophisticated algorithms. Significant improvements in the application of numerical and computational methods to the problem have been made in recent years. However, this alone is not sufficient, since the computational demands of stability studies are increasing rapidly. Larger systems need to be solved, more complicated models are being used, and systems are being simulated over longer periods of time and more frequently. Advances in parallel computer architectures have provided powerful tools for solving large power system problems better. Industry has a strong need for fast on-line stability analysis to determine the dynamic security of power systems. It is desirable to simulate power systems in real time to avoid failure and to allow a less conservative operation of the system. Any breakthrough in transient stability analysis will be of tremendous benefit to the efficient operation of power systems.

Figure 1: Power system consisting of generators, loads and transmission network.

Prior Work. Research in the area of parallel algorithms for power system stability analysis has been done for many years, but has intensified recently due to rapid advances in parallel computer architectures. Many researchers participate in the search for better ways to exploit parallelism in the solution of sparse matrix problems. Alvarado, Yu and Betancourt present the partitioned sparse A^{-1} methods for solving large sparse systems in parallel environments [1]. Koester, Ranka and Fox implement sparse linear solvers for matrices having block-diagonal-bordered form on the CM-5 [18, 6, 17]. Falcao, Kaszkurewicz and Almeida implement a parallel electro-magnetic transient program on an 8-processor hypercube [11]. Lin and Van Ness present methods for solving sparse algebraic equations on the iPSC/860 hypercube and the Sequent Symmetry S81 [19]. La Scala, Bose, Tylavsky and Chai present a method well suited for large-scale parallel processors for solving the transient stability problem [20]. Decker, Falcao and Kaszkurewicz use the conjugate gradient method to solve the transient stability problem on an 8-node hypercube based on the Inmos Transputer T800 [7]. Chai, Zhu, Bose and Tylavsky test parallel Newton-type algorithms for transient stability analysis on the local-memory machine iPSC/2 and the shared-memory machine Alliant [5]. Chai and Bose also compare Gauss, Newton and relaxed Newton type algorithms for power system stability analysis on a 32-node hypercube iPSC 32 and on a Sequent Symmetry using 26 CPUs [4].
Ilic-Spong, Crow and Pai use waveform relaxation methods for parallel power system dynamic response calculations [16]. Granelli, Montagna, Pasini and Marannino apply the W-Matrix method to a load-flow procedure on a 4-cpu CRAY Y-MP8/432 vector computer and a 4-cpu Alliant FX/80 [13]. Huang and Ongsakul analyze the bottlenecks

of parallel Gauss-Seidel algorithms for power flow analysis [15]. Hou and Bose present parallel implementations of the waveform relaxation algorithm on a Sequent Symmetry S81 shared-memory machine [14]. Falcao also presents some ideas for the use of parallel and distributed processing in power system simulation and control [10]. To the best of our knowledge, no other researchers have published work on transient stability analysis on networks of workstations. In Section 2, we present a mathematical formulation of the transient stability problem. Section 3 describes three approaches to solve the problem and Section 4 outlines the parallel implementations of those approaches. In Section 5 we briefly introduce the computer systems we use to run our experiments and in Section 6 we present and analyze the results. Section 7 concludes the paper.

2 Mathematical Formulation of the Transient Stability Problem

The dynamical model for an interconnected power system can be completely described by a set of highly nonlinear differential equations

    \dot{X} = f(X, V),    (1)

subject to the initial conditions X(0) = X_0, and a set of nonlinear algebraic equations

    0 = g(X, V),    (2)

where X are the state variables of the differential equations, also known as generator variables, which describe the dynamics of the system [16]. The generator variables typically include rotor angle, velocity deviation, mechanical power, etc. Further, V are the network variables, for instance currents in the loads and generators. The set of differential equations can be solved using explicit or implicit integration methods. Implicit integration methods have the advantage of providing better numerical stability and being able to avoid the difficulty of stiff problems [21, 20]. For these reasons, they are preferred over their explicit counterparts in power system applications. Hence, Equation 1 can be discretized using the implicit trapezoidal rule:

    X_t = X_{t-1} + (h/2) (f(X_t, V_t) + f(X_{t-1}, V_{t-1})),    (3)

where t = 1, 2, ..., T denotes time steps, while h represents the step length. Next, we bring Equation 3 into residual form:

    R_G = X_t - X_{t-1} - (h/2) (f(X_t, V_t) + f(X_{t-1}, V_{t-1})) = 0,    (4)

where R_G is the residual with respect to the generator variables.
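As a toy illustration (our own example, not the paper's generator model), the trapezoidal discretization and the generator residual of Equation 4 can be sketched in Python; the dynamics f and all numerical values here are hypothetical.

```python
import numpy as np

# Hypothetical dynamics standing in for the generator equations x' = f(x, v).
def f(x, v):
    return -0.5 * x + 0.1 * v

def generator_residual(x_t, x_prev, v_t, v_prev, h):
    # R_G = x_t - x_{t-1} - (h/2) * (f(x_t, v_t) + f(x_{t-1}, v_{t-1}))
    return x_t - x_prev - 0.5 * h * (f(x_t, v_t) + f(x_prev, v_prev))

x_prev = np.array([1.0])
v_prev = np.array([0.0])
v_t = np.array([0.0])
h = 0.01

# At the exact trapezoidal solution the residual vanishes; here we simply
# evaluate it at a trial point x_t to show the quantity a Newton step drives to zero.
r = generator_residual(np.array([0.995]), x_prev, v_t, v_prev, h)
```

A Newton-type iteration, as used in the paper, would repeatedly update x_t until this residual (together with the network residual) falls below a tolerance.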

Equation 2 can likewise be written in residual form:

    R_N = g(X_t, V_t) = 0,    (5)

where R_N denotes the residual with respect to the network variables. Equations 4 and 5 are linearized using Newton's method and then combined to form the linear system

    [ R_G ]        [ ΔX ]                  [ A_G    B        ]
    [ R_N ]  = -J  [ ΔV ] ,   where    J = [ C      Y + Y_ld ] ,    (6)

and

    A_G = ∂R_G/∂X,   B = ∂R_G/∂V,   C = ∂R_N/∂X,   J_N = ∂R_N/∂V = Y + Y_ld.    (7)

Y is the network admittance matrix and Y_ld is a diagonal matrix obtained by differentiating the nonlinear load currents with respect to the voltages at the nodes. Next, we show the structure of the time-variant Jacobian J in more detail [4]:

        [ A_G1                            B_1 ]
        [       .                          :  ]
        [         A_Gi                    B_i ]
    J = [              .                   :  ] ,    (8)
        [                   A_Gm          B_m ]
        [ C_1  ...  C_i  ...  C_m         J_N ]

where the off-diagonal blocks of the generator part are zero. In the above Jacobian, m is the number of generators in the system. After taking the Schur complement of the linear system in Equation 6, we obtain the set of equations:

    R̂_N = R_N - C A_G^{-1} R_G,    (9)
    Ĵ_N = J_N - C A_G^{-1} B,    (10)
    -Ĵ_N ΔV = R̂_N,    (11)

where Equation 11 can be solved for ΔV. Finally, we perform backward block substitution to find ΔX:

    ΔX = -A_G^{-1} (R_G + B ΔV).    (12)

Figure 2 shows the transient stability analysis as a simulation in the time domain. First, for a predetermined short amount of time, the original power system is simulated starting from the equilibrium state. This is done by repeatedly solving the linear system in Equation 6, where the Jacobian J is usually very sparse, with less than 1% nonzero elements.
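The block reduction of Equations 9 through 12 can be sketched numerically with small dense matrices (the paper's matrices are sparse and block structured; the random test data here is our own) and checked against a direct solve of the full system.

```python
import numpy as np

# Solve -J [dX; dV] = [R_G; R_N] with J = [[A_G, B], [C, J_N]] by first
# eliminating the generator variables (Schur complement on the network block).
rng = np.random.default_rng(0)
m, n = 3, 4
A_G = np.diag(rng.uniform(1.0, 2.0, m))      # stand-in for the block-diagonal generator part
B = rng.standard_normal((m, n))
C = rng.standard_normal((n, m))
J_N = 5.0 * np.eye(n) + 0.1 * rng.standard_normal((n, n))
R_G = rng.standard_normal(m)
R_N = rng.standard_normal(n)

A_G_inv = np.linalg.inv(A_G)
R_N_hat = R_N - C @ A_G_inv @ R_G            # Equation 9
J_N_hat = J_N - C @ A_G_inv @ B              # Equation 10
dV = np.linalg.solve(-J_N_hat, R_N_hat)      # Equation 11
dX = -A_G_inv @ (R_G + B @ dV)               # Equation 12 (backward block substitution)

# Reference: solve the unreduced system directly.
J = np.block([[A_G, B], [C, J_N]])
full = np.linalg.solve(-J, np.concatenate([R_G, R_N]))
```

Because A_G is block diagonal, its inverse applications decompose per generator, which is what the parallel implementations in Section 4 exploit.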

Figure 2: Transient stability simulation in the time domain: the disturbance occurs at t1, protective action is taken at t2.

Then, a large disturbance in the operation of the power system is simulated by changing one or several of the algebraic equations in Equation 5. This causes the Jacobian J in Equation 6 to change structurally. For a certain amount of time the power system containing the disturbance is simulated; that is, the linear system based on the changed Jacobian is repeatedly solved. Then, protective action must be taken; that is, the power system needs to be changed locally to compensate for the disturbances. This is simulated by restoring the original algebraic equations and repeatedly solving the original linear system for the remainder of the simulation period. If, after first changing and then restoring the original linear system, the solution converges to the equilibrium state again, the power system was able to withstand the severe disturbance and restore normal operating conditions.

3 Solution Approaches

When designing algorithms for parallel machines, it may not be efficient to develop an algorithm independently of the underlying computer. Significant gains in performance of a parallel algorithm over its serial counterpart can only be achieved if the parallel algorithm takes the characteristics of the underlying parallel computer into consideration and if the implementation techniques take advantage of the special hardware and software features of the underlying machine. Factors limiting the performance of parallel algorithms are often high communication cost, unavoidable sequentiality, poor implementation techniques, and possibly slow convergence rates of parallel algorithms. In this paper, we explore the advantages and disadvantages of various implementations of the transient stability problem on different parallel machines. We present three direct methods to solve a sparse linear system Ax = b, all of which require that the matrix A is factored into A = LDU, where L is a unit lower triangular matrix, D is a diagonal matrix, and U is a unit upper triangular matrix. Note that A here corresponds to Ĵ_N in the mathematical formulation of the transient stability problem.


3.1 Very Dishonest Newton Method

As described in Section 2, the nonlinear systems in Equations 4 and 5 are linearized using the Newton method, which results in the linear system in Equation 6. If the Newton method were applied exactly to solve the transient stability problem, the Jacobian J in Equation 6 would have to be updated in every time step. Consequently, the matrix Ĵ_N (or A) would have to be re-factored in every time step when a direct method is used to solve Equation 11. However, factorization is expensive in terms of time, and gains in performance can be achieved by implementing a variant of the Newton method, called the very dishonest Newton (VDHN) method [5]. Here, the matrix is not factored in every time step. Instead, the same factors L, D and U are reused unless the system undergoes significant changes. This method performs very well in practice because the fast convergence characteristics of the Newton method can tolerate the error introduced by using approximate factors. The VDHN method is the fastest sequential algorithm known for transient stability simulation. Under all practical circumstances we encountered, the VDHN method converges at every time step in at most four iterations. We re-factor Ĵ_N only when an error is introduced into or removed from the algebraic equations. We would also re-factor the matrix if the convergence of the VDHN method slowed down; however, in practice we have not encountered this case yet. Thus, over the whole duration of the simulation, Ĵ_N is factored only three times: initially, when an error is introduced into the algebraic equations, and when the error is removed again.
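The factor-reuse idea behind VDHN can be sketched on a toy nonlinear system (our own example, not the paper's power system model): factor the Jacobian once and reuse the factors in every subsequent Newton iteration, accepting a few extra iterations in exchange for skipping refactorization. The system F and all constants below are hypothetical.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Hypothetical mildly nonlinear system F(x) = A x + 0.05 x^3 - b = 0.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def F(x):
    return A @ x + 0.05 * x**3 - b

def jacobian(x):
    return A + np.diag(0.15 * x**2)

x = np.zeros(2)
factors = lu_factor(jacobian(x))     # factor once: the "dishonest" Jacobian
iterations = 0
while np.linalg.norm(F(x)) > 1e-12 and iterations < 50:
    x = x + lu_solve(factors, -F(x))  # reuse the same factors every iteration
    iterations += 1
```

Because the nonlinearity is mild, the stale factors still give a contraction, mirroring how VDHN tolerates approximate factors between refactorizations.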

3.2 Forward and Backward Substitution

To be able to solve Equation 6, we first need to solve Equation 11 for ΔV. One way to solve an n × n linear system Ax = b (Equation 11) is using a direct method, which first factors A into A = LDU, and then performs forward and backward substitution to find x. This means the system

           [ d_{1,1}  u_{1,2}  u_{1,3}   ...      u_{1,n}  ] [ x_1 ]   [ b_1 ]
           [ l_{2,1}  d_{2,2}  u_{2,3}                     ] [ x_2 ]   [ b_2 ]
    LDUx = [ l_{3,1}  l_{3,2}  d_{3,3}      .              ] [ x_3 ] = [ b_3 ]    (13)
           [    :                  .        .    u_{n-1,n} ] [  :  ]   [  :  ]
           [ l_{n,1}    ...        l_{n,n-1}     d_{n,n}   ] [ x_n ]   [ b_n ]

is solved in three phases: first, solve Ly = b for y (forward substitution), then solve Dz = y for z (diagonal scaling), and finally, solve Ux = z for x (backward substitution).


Forward substitution proceeds as follows. The system

         [ 1                           ] [ y_1 ]   [ b_1 ]
         [ l_{2,1}  1                  ] [ y_2 ]   [ b_2 ]
    Ly = [ l_{3,1}  l_{3,2}  1         ] [ y_3 ] = [ b_3 ]    (14)
         [    :                 .      ] [  :  ]   [  :  ]
         [ l_{n,1}   ...  l_{n,n-1}  1 ] [ y_n ]   [ b_n ]

can be written as

    y_1 = b_1,
    y_2 = b_2 - l_{2,1} y_1,
    y_3 = b_3 - l_{3,1} y_1 - l_{3,2} y_2,  etc.

Hence y_1 through y_n can be computed in n serial steps. Diagonal scaling can be performed in only one step, z_i = y_i / d_{i,i} for all i, since all z_i are independent of each other. During backward substitution the system

         [ 1  u_{1,2}  u_{1,3}   ...   u_{1,n}  ] [ x_1   ]   [ z_1   ]
         [    1        u_{2,3}         u_{2,n}  ] [ x_2   ]   [ z_2   ]
    Ux = [             1          .    u_{3,n}  ] [ x_3   ] = [ z_3   ]    (15)
         [                        .  u_{n-1,n}  ] [  :    ]   [  :    ]
         [                           1          ] [ x_n   ]   [ z_n   ]

can be solved in n serial steps as follows:

    x_n = z_n,
    x_{n-1} = z_{n-1} - u_{n-1,n} x_n,
    x_{n-2} = z_{n-2} - u_{n-2,n-1} x_{n-1} - u_{n-2,n} x_n,  etc.

The advantage of forward and backward substitution is its ease of implementation. The disadvantage is that it is an inherently sequential process. This becomes obvious when we look at the n steps required to perform forward or backward substitution.
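The three phases can be sketched directly from Equations 13 through 15; the 2 × 2 factors below are a made-up example with unit triangular L and U, as in the paper's factorization.

```python
import numpy as np

def solve_ldu(L, D, U, b):
    n = len(b)
    y = np.zeros(n)
    for i in range(n):                      # forward substitution: Ly = b, n serial steps
        y[i] = b[i] - L[i, :i] @ y[:i]
    z = y / np.diag(D)                      # diagonal scaling: Dz = y, one parallel step
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):          # backward substitution: Ux = z, n serial steps
        x[i] = z[i] - U[i, i + 1:] @ x[i + 1:]
    return x

# Made-up LDU factorization of A = [[4, 2], [2, 3]].
L = np.array([[1.0, 0.0], [0.5, 1.0]])
D = np.diag([4.0, 2.0])
U = np.array([[1.0, 0.5], [0.0, 1.0]])
b = np.array([6.0, 5.0])
x = solve_ldu(L, D, U, b)                   # satisfies (L @ D @ U) @ x = b
```

The two Python loops make the serial dependency chains explicit: each y_i waits on all earlier y's, and each x_i waits on all later x's.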

3.3 W-Matrix Method

The problem Ax = b, where A is factored into A = LDU, can be solved by constructing factored components of the inverses of L and U [8, 1]. The number of additional fill-ins in the partitioned inverses of L and U can be made zero. However, allowing some fill-ins results in fewer partitions. Again, the solution of the problem LDUx = b proceeds

in three steps: forward substitution, diagonal scaling and backward substitution. Define W^l = L^{-1} and W^u = U^{-1}. Then W^l is lower triangular and sparse, and W^u is upper triangular and sparse. The solution to the original problem can be written as

    x = W^u · D^{-1} · W^l · b.    (16)

This equation can be solved in only three serial steps,

    z = W^l · b,   y = D^{-1} · z   and   x = W^u · y,    (17)

where all multiplications within each step can be performed in parallel. However, the number of fill-ins in the W matrices can be significant. The W matrices W^l and W^u can be partitioned to enhance sparsity. Assume L and U are n × n matrices. Then L can be expressed as

    L = L_1 · L_2 · ... · L_n,    (18)

where each L_i is an identity matrix except for the i-th column, which contains column i of L. Then,

    W^l = L^{-1} = L_n^{-1} · ... · L_2^{-1} · L_1^{-1} = W_n^l · ... · W_2^l · W_1^l.    (19)

Note that W_i^l is L_i with the signs of its off-diagonal elements reversed. The solution of the original problem can now be written as

    x = W_1^u · W_2^u · ... · W_n^u · D^{-1} · W_n^l · ... · W_2^l · W_1^l · b.    (20)

The solution of this equation requires 2n + 1 serial steps, like the forward and backward substitution discussed in the previous section. The very few multiplications within each step can again be performed in parallel. Adjacent W_i matrices can be combined to reduce the number of partitions. For instance, each W matrix can be grouped into two partitions W_a and W_b. Then the solution for x can be expressed as

    x = W_a^u · W_b^u · D^{-1} · W_b^l · W_a^l · b,    (21)

and x can be found in five serial steps. The challenge with this approach is to find an optimal balance between the two conflicting goals of minimizing the number of partitions and minimizing the number of fill-ins.
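The elementary-factor identities of Equations 18 and 19 can be checked numerically; the 3 × 3 unit lower triangular L below is a made-up example.

```python
import numpy as np

L = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
n = L.shape[0]

# Build W^l = W_n^l ... W_2^l W_1^l, where W_i^l is the elementary factor
# L_i with the signs of its off-diagonal (below-diagonal) entries reversed.
W = np.eye(n)
for i in range(n - 1, -1, -1):
    Wi = np.eye(n)
    Wi[i + 1:, i] = -L[i + 1:, i]    # W_i^l: flip signs in column i of L_i
    W = W @ Wi                       # accumulate right-to-left: W_n ... W_1
```

The product W equals L^{-1} exactly, which is the basis for replacing the serial forward-substitution sweep with parallel matrix-vector products, one per partition.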

3.4 Repeated Substitution

To avoid the many serial steps involved in using forward and backward substitution, we consider an alternative method to solve the linear system LDUx = b. The idea behind this method, which we call repeated substitution (repsub), is as follows. Again, the solution of the linear system LDUx = b is split into solving the three systems Ly = b, then Dz = y, and finally Ux = z. The first equation of the system Ly = b yields y_1 = b_1 (see Equation 14), the second equation yields y_2 = b_2 - l_{2,1} · y_1, the i-th equation yields y_i = b_i - l_{i,1} · y_1 - l_{i,2} · y_2 - ... - l_{i,i-1} · y_{i-1}, etc. Repeatedly substituting the equations for y_j into the equation for y_i, where j < i, for all i gives us expressions for y_i which depend only on the elements of the right-hand side b and the entries of the matrix L. We obtain expressions of the form

    y_i = b_i - factor(i,1) · b_{i-1} - factor(i,2) · b_{i-2} - ... - factor(i,i-1) · b_1,  for i = 1, ..., n,

or

    y_i = b_i - Σ_{k=1}^{i-1} factor(i,k) · b_{i-k},  for i = 1, ..., n,    (22)

where

    factor(i,k) = l_{i,i-k} - Σ_{j=1}^{k-1} l_{i,i-k+j} · factor(i-k+j, j),  for i = 1, ..., n and k < i,    (23)

describes how y_i depends on b_{i-k}. As described in Section 3.2, the system Dz = y is solved in one step, z_i = y_i / d_{i,i} for all i. Finally, Ux = z represents equations of the form x_i = z_i - u_{i,i+1} · x_{i+1} - u_{i,i+2} · x_{i+2} - ... - u_{i,n} · x_n. Repeatedly substituting the equations for x_j into the equation for x_i, where j > i, gives us the following expression for x_i, which depends only on the entries of U and the elements of z:

    x_i = z_i - Σ_{k=1}^{n-i} z_{i+k} · factor(i,k),  for i = 1, ..., n,    (24)

where

    factor(i,k) = u_{i,i+k} - Σ_{j=1}^{k-1} u_{i,i+k-j} · factor(i+k-j, j),  for i = 1, ..., n and 1 ≤ k ≤ n - i,    (25)

describes how x_i depends on z_{i+k}, where 1 ≤ k ≤ n - i.

The advantage of this method is that it involves only three serial steps to solve for x, and all multiplications within each step can be performed in parallel once the vector b is known. The values factor(i,k) can be precomputed for all i and k since they depend only on the entries of L and U, respectively. Consequently, during the solution phase, all x_i can be computed in parallel in only three steps, since all dependencies between elements of x, y and z are removed. The three steps are: first, compute y in parallel, then compute z in parallel, and finally, compute x in parallel. For varying right-hand sides b the values factor(·,·) do not change. They only need to be computed once if systems LDUx = b_i, for i = 1, 2, ..., are solved, which is frequently the case in power system applications. This fact justifies the cost of precomputing the values factor(·,·).
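The precomputation of Equations 22 and 23 for the lower triangular phase can be sketched as follows (0-based indices instead of the paper's 1-based ones; the small L and b are made-up). The dense factor table is for clarity only; as the text notes, only nonzero values would be stored.

```python
import numpy as np

def precompute_factors(L):
    # factor[i, k] records how y_i depends on b_{i-k} (Equation 23, 0-based).
    n = L.shape[0]
    factor = np.zeros((n, n + 1))
    for i in range(n):
        for k in range(1, i + 1):
            s = L[i, i - k]
            for j in range(1, k):
                s -= L[i, i - k + j] * factor[i - k + j, j]
            factor[i, k] = s
    return factor

def solve_lower(factor, b):
    # Equation 22: every y_i is an independent dot product, so with the
    # factors tabulated, all y_i could be computed in parallel.
    n = len(b)
    y = np.empty(n)
    for i in range(n):
        y[i] = b[i] - sum(factor[i, k] * b[i - k] for k in range(1, i + 1))
    return y

L = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])
b = np.array([1.0, 2.0, 3.0])
y = solve_lower(precompute_factors(L), b)    # same y as forward substitution
```

The result matches ordinary forward substitution for Ly = b, but without the serial dependency chain between the y_i.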

Transient Stability Analysis()

1. create children and read input data;  /* one broadcast */
2. save current values of V and X;
3. gather X values;  /* one reduction */
4. from time-step = start to time-step = finish do {
   4.1 if (time-step equals t1) then insert fault by modifying algebraic equations;
   4.2 else if (time-step equals t2) then remove fault;
   4.3 evaluate generator residual R_G;
   4.4 evaluate network residual R_N;
   4.5 check termination;  /* one reduction, one broadcast */
   4.6 while (termination criterion not satisfied) do {
       4.6.1 form R̂_N;  /* one reduction */
       4.6.2 form Ĵ_N;  /* one reduction */
       4.6.3 if (time-step equals start or t1 or t2) then factorize Ĵ_N;
       4.6.4 solve for ΔV;  /* one broadcast */
       4.6.5 solve for ΔX;
       4.6.6 update X;
       4.6.7 update V;  /* one broadcast */
       4.6.8 gather X values;  /* one reduction */
       4.6.9 update Jacobian J;
       4.6.10 evaluate generator residual R_G;
       4.6.11 evaluate network residual R_N;
       4.6.12 check termination;  /* one reduction, one broadcast */
   }
   4.7 save current values of X and V;
   4.8 increase time-step by step size;
}

Figure 3: Pseudo code for ForwardBackward.

Transient Stability Analysis()

1. create children and read input data;  /* one broadcast */
2. save current values of V and X;
3. gather X values;  /* one reduction */
4. from time-step = start to time-step = finish do {
   4.1 if (time-step equals t1) then insert fault by modifying algebraic equations;
   4.2 else if (time-step equals t2) then remove fault;
   4.3 evaluate generator residual R_G;
   4.4 evaluate network residual R_N;
   4.5 check termination;  /* one reduction, one broadcast */
   4.6 while (termination criterion not satisfied) do {
       4.6.1 form R̂_N;  /* one reduction */
       4.6.2 form Ĵ_N;  /* one reduction */
       4.6.3 if (time-step equals start or t1 or t2) then factorize Ĵ_N and read precomputed values factor(·,·);  /* three broadcasts */
       4.6.4 solve for ΔV;  /* two reductions, two broadcasts */
       4.6.5 solve for ΔX;
       4.6.6 update X;
       4.6.7 update V;  /* one broadcast */
       4.6.8 gather X values;  /* one reduction */
       4.6.9 update Jacobian J;
       4.6.10 evaluate generator residual R_G;
       4.6.11 evaluate network residual R_N;
       4.6.12 check termination;  /* one reduction, one broadcast */
   }
   4.7 save current values of X and V;
   4.8 increase time-step by step size;
}

Figure 4: Pseudo code for RepSub.

4 Parallel Implementations

In this section we describe different parallel implementations of the transient stability problem. These implementations exploit varying degrees of parallelism and have different communication requirements. The first one is a parallel implementation of repeated forward and backward substitution, and will be referred to as ForwardBackward. A similar strategy has been used in [4]. The second one is a parallel implementation of the W-Matrix method and will be referred to as W-Matrix. The third one is a parallel implementation of repeated substitution and will be denoted RepSub. All implementations take the time-domain integration method and decompose its parallelizable portions into sub-tasks among processors. This approach is called parallel-in-space. For clarity, the pseudo code for ForwardBackward is shown in Figure 3 and the pseudo code for RepSub is given in Figure 4. The pseudo code for W-Matrix is identical to that of RepSub except for steps 4.6.3 and 4.6.4. We explain the similarities and differences of the implementations in the following paragraphs. The main difference between the three implementations is the amount of parallelism exploited in the solution of Equation 11. ForwardBackward performs all computations but the solution of this equation in parallel, which leads to load imbalance but minimizes communication between processes. W-Matrix and RepSub, on the other hand, perform all computations in a distributed manner, requiring different amounts of communication.

Data Partitioning: After the parent process has spawned as many child processes as required, all processes read the input data files (step 1 in Figures 3 and 4). The data files contain the matrices A_G, B, C and J_N, as well as the initial stable-state solutions for X and V (see Equations 6 through 8). All matrices are stored in i-j-value format, which is a widely used format for sparse matrices.
ForwardBackward, W-Matrix and RepSub all use the following data partitioning strategy for the generator data. Data and tasks are distributed according to generator numbers among processes; that is, each process is in charge of one or more generators. Assume the power system contains n generators and we have p processes. If n is evenly divisible by p, then each process is in charge of n/p consecutive generators. Otherwise, assume the remainder of the division is r. Then, process 1 is in charge of ⌊n/p⌋ generators, processes 2 through 2 + r - 1 are in charge of ⌊n/p⌋ + 1 consecutive generators each, and processes 2 + r through p are in charge of ⌊n/p⌋ consecutive generators each. Process 1, the parent process, is assigned overhead work in our implementations, and for this reason we distribute the remaining r generators only over processes 2, 3, .... Each generator i has associated with it the submatrices A_Gi, B_i and C_i. Every process reads only those submatrices A_Gi, B_i and C_i of A_G, B and C corresponding to the generators it is in charge of. ForwardBackward, W-Matrix and RepSub differ in the way J_N and V are distributed; we describe this in the following paragraphs. After reading the input data, the parent process collects the partial X vectors from all child processes (step 3). For every time step from the start of the simulation to the finish (step 4), the linear system in Equation 6 is solved using the VDHN method (Section 3.1). The VDHN method is implemented as follows.

VDHN Method: First, the overall error is computed in a distributed manner (steps 4.3 through 4.5). The overall error is the sum over the partial errors R_G, calculated by each processor with its locally available data, and R_N (see Equations 4 and 5). The summation is done via a reduction operation. The parent process, which at the end of the reduction operation knows the overall error, broadcasts it to all other processes so that each process knows whether the termination criterion is satisfied for the current time step. In every iteration of the VDHN method (step 4.6), each process participates in the calculation of R̂_N according to Equation 9 by forming its local contribution to C A_G^{-1} R_G. The sum of all such terms, from which R̂_N follows since C A_G^{-1} R_G = Σ_i C_i · A_Gi^{-1} · R_Gi, is collected by the parent process via a reduction operation. The matrix Ĵ_N is formed similarly according to Equation 10, since C A_G^{-1} B = Σ_i C_i · A_Gi^{-1} · B_i. In ForwardBackward, W-Matrix and RepSub, the parent process computes the factors L, D and U of Ĵ_N three times: initially in the first iteration, second when the fault is inserted into the algebraic equations, and third when the fault is removed from the algebraic equations. In ForwardBackward the parent process reuses these factors in all subsequent iterations until the next factorization takes place. Thus, in every iteration only R̂_N in Equation 11 varies, and forward and backward substitution is performed sequentially by the parent process to solve for ΔV (step 4.6.4). Next, the parent process broadcasts ΔV to all other processes. Each process independently calculates the part of ΔX corresponding to the generators it is in charge of and updates the corresponding part of X (steps 4.6.5 and 4.6.6).
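The generator distribution described under Data Partitioning can be sketched as follows (the function name is ours; process numbers are 1-based, as in the paper, and process 1 is deliberately skipped when handing out the r leftover generators).

```python
def generators_per_process(n, p):
    # n generators over p processes: process 1 always gets floor(n/p);
    # the r remainder generators go one each to processes 2 .. r+1.
    base, r = divmod(n, p)
    counts = {1: base}
    for proc in range(2, p + 1):
        counts[proc] = base + 1 if proc <= 1 + r else base
    return counts

counts = generators_per_process(10, 4)   # e.g. n = 10 generators, p = 4 processes
```

Keeping the parent's share at the floor compensates for the overhead work (factorization, reductions, broadcasts) that process 1 performs in all three implementations.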
Again, all processes share in the computation of the overall error. After the VDHN method has converged for the current time step, the time step is increased by the step size and the solutions for X and V are used as initial solutions for the next time step.

The main difference between ForwardBackward and the other two implementations lies in the solution of Equation 11 (steps 4.6.3 and 4.6.4). In W-Matrix, all processes reuse the factors L, D and U from the latest of the three factorizations to compute ΔV in a distributed manner as follows. The number of partitioned inverses of L and U can be determined ahead of time, depending on how many fill-ins the user allows (see Section 3.3). The number of partitions and their sizes are stored in a vector and made available to all processes. This information does not change over the duration of a simulation. Assume Ĵ_N is of size n × n and we have p processes available. Process i is in charge of solving for all elements ΔV_{kp+i} for k = 0, ..., ⌈n/p⌉ with k · p + i ≤ n. Now, for one partition at a time, the processes perform all multiplications in parallel (see Equations 20 and 21). After the multiplications within

one partition are done, the processes need to exchange their intermediate results. This is implemented using a reduction operation, where the parent process builds the complete ΔV vector from the individual elements computed by the child processes. The parent then broadcasts ΔV to all child processes, and the next partition can be started. Hence W-Matrix requires 2q + 1 reduction operations and 2q + 1 broadcast operations to compute ΔV, where q is the number of partitions in a factor.

In RepSub, the values factor(·,·) are precomputed ahead of time and stored in i-j-value format in a file. Three times during the simulation, namely initially, when a fault is inserted into the algebraic equations, and when the fault is removed again, all processes read the file storing the precomputed values so that ΔV can be computed in a distributed manner. In all other iterations the values read, which are stored by each process in a sparse-matrix data structure in local memory, are reused and no further reading of data files is required. We explore different data partitioning strategies when solving for ΔV in a distributed manner. In one strategy, the processes read the file in an interleaved manner. Assume Ĵ_N is of size n × n, that is, factor(1,1) through factor(n,n) have been computed. Note that only values which are nonzero are computed and stored. Further assume we have p processes. Then, process 1 reads and stores all values factor(1,·), factor(p+1,·), factor(2p+1,·), ..., and process 2 reads and stores all values factor(2,·), factor(p+2,·), .... As a result of this data distribution, every processor is in charge of an approximately equal number of elements of the solution vector ΔV. Another strategy partitions the data as follows. Processes read blocks of consecutive rows from the file. Rows are distributed in blocks containing approximately n/2p rows each. Process 1 receives the first and the last block, which is block 2p.
Process 2 receives the second block and block 2p ? 1, : : : . Process p receives blocks p and p + 1. If n is not equally divisible by 2p the remaining rows are distributed individually over the processes to minimize load imbalance. Independent of the data partitioning strategy used nding V in RepSub requires two reduction operations and two broadcast operations. After the second serial step (Dz = y, see Section 3.4) all processes reduce their elements of V to the parent process which in turn broadcasts the complete V to all child processes. The same communication pattern is repeated after the third serial step (Ux = z). Both, in W-Matrix and RepSub the processes next solve in parallel for the respective elements of V as described in Sections 3.3 and 3.4. The parent process collects the partial solution vectors and broadcasts V to all processes so that each process can compute the partial X as before. Note that in all implementations communication patterns, where the parent process receives data from all children, are implemented as reduction operations to avoid communication bottlenecks and achieve higher performance. 15
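To make the partitioned solution phase concrete, the following is a minimal, illustrative sketch (plain NumPy on a tiny dense example; the variable names and the particular split are ours, not the paper's). The factor L is written as a product of partition factors, the precomputed partitioned inverses are applied one after another, and within each partition every row product could in principle be computed by a different process before the results are merged by a reduction:

```python
import numpy as np

def partitioned_forward_solve(W_parts, b):
    """Solve L y = b as y = W_q ... W_1 b, where the W_i are precomputed
    partitioned inverses of the factors of L.  Each matrix-vector product
    is one "serial step"; all row products inside it can run in parallel."""
    y = b.copy()
    for W in W_parts:          # one serial step per partition
        y = W @ y              # rows of this product are independent
    return y

# Tiny dense example: a unit lower-triangular L split by hand into two
# partition factors, L = L1 @ L2 (an illustrative split, not the
# fill-in-minimizing one a real elimination ordering would produce).
L = np.array([[1.0,  0.0,  0.0, 0.0],
              [0.5,  1.0,  0.0, 0.0],
              [0.0,  0.25, 1.0, 0.0],
              [0.1,  0.0,  0.2, 1.0]])
L1 = np.eye(4); L1[1, 0] = 0.5; L1[2, 1] = 0.25
L2 = np.eye(4); L2[3, 0] = 0.1; L2[3, 2] = 0.2
W_parts = [np.linalg.inv(L1), np.linalg.inv(L2)]  # apply L1^-1 first

b = np.array([1.0, 2.0, 3.0, 4.0])
y = partitioned_forward_solve(W_parts, b)
assert np.allclose(L @ y, b)
```

With q = 2 partitions in each of L and U plus the diagonal step, this is where the 2q + 1 serial steps quoted above come from; in the distributed setting a reduction and a broadcast separate consecutive steps.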

5 Test Beds

In the past few years, distributed computing has been widely used for scientific computing. Reasons for this development are the affordability of loosely coupled networks of workstations, the rapid increase in the performance of microprocessors and the availability of programming tools such as the Parallel Virtual Machine (PVM) [3]. For these reasons, power system engineers are particularly interested in networks of workstations. A power plant cannot afford to buy a supercomputer, but might be able to invest in a network of workstations. Thus, we chose to implement our algorithms on networks of workstations instead of machines like the CM-5. In all our implementations we use PVM. PVM is a public-domain software library that allows users to pass messages between processes running on different machines. One advantage of PVM is that it permits the user to get the effect of a parallel machine from a network of heterogeneous workstations. The real power of PVM comes from the fact that it runs on virtually every Unix host. PVM is made up of two parts: a daemon that runs on all computers of the network, and a set of user interface primitives that can be integrated into C or Fortran code. One of the challenges in distributed computing is the performance of the communication network and designing algorithms tailored to the underlying communication network. We used two different computing platforms to run our experiments. One is a loosely coupled network of workstations (NOW) and the other is an SP2, a tightly coupled cluster of workstations.

5.1 Network of Workstations (NOW)

The loosely coupled, heterogeneous network of workstations we used is installed at the Model Advanced Facility (MAF) at UW-Madison and comprises 7 HP 735 machines and 4 DEC Alpha machines interconnected via an FDDI network. The HP 735 machines have a clock rate of 99 MHz, whereas the DEC Alpha machines have a clock rate of 150 MHz. FDDI (Fiber Distributed Data Interface) is a 12.5 MByte/s fiber-optic token ring network. It has two fiber rings, where either one can be used as backup if the other one breaks. Nodes use tokens to transmit data. The maximal frame size for FDDI is 4500 bytes. Since FDDI is a shared token ring network, only one message can travel over the network at any given time. If two or more nodes want to transmit messages simultaneously, only one node can do so and the other nodes have to wait. This can have a negative effect on the performance of an algorithm which often requires simultaneous communication among nodes [12]. We used PVM version 3.3.6 in our work.
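Because the FDDI medium is shared, a broadcast from the parent degenerates into serialized point-to-point sends. A rough, hedged cost model makes the scaling visible; the 12.5 MByte/s bandwidth and 4500-byte frame size come from the text above, while the per-message latency is a guessed placeholder (real PVM overheads would be larger):

```python
# Back-of-envelope broadcast cost on a shared FDDI ring (illustrative only).
BANDWIDTH = 12.5e6   # bytes/s, from the FDDI description above
FRAME = 4500         # max FDDI frame size in bytes, from the text
LATENCY = 1e-3       # s per message -- an assumed placeholder value

def bcast_time(nbytes, nreceivers):
    """On a shared medium the nreceivers point-to-point sends serialize.
    Returns (estimated time in seconds, number of FDDI frames per send)."""
    frames = -(-nbytes // FRAME)              # ceiling division
    per_msg = LATENCY + nbytes / BANDWIDTH    # simple latency + size model
    return nreceivers * per_msg, frames

# Broadcasting a 3323-element vector of doubles to 7 child processes:
t, frames = bcast_time(3323 * 8, 7)
assert frames == 6   # 26584 bytes fit in 6 FDDI frames
```

The linear growth of the estimate in the number of receivers is the qualitative point: every additional process adds a full serialized message on the NOW, which matches the communication blow-up reported in Section 6.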


5.2 SP2

The Scalable POWERparallel 2 (SP2) is the newest member of IBM's family of Scalable POWERparallel systems. An SP2 consists of nodes (processors with associated memory and disk) connected by an Ethernet and by a high-performance switch. The nodes are standard POWER2-architecture RS/6000 processors, with superscalar pipelined chips capable of executing 4 floating-point operations per cycle. The processors run at 66.7 MHz, giving a peak performance per processor of 266 MFLOPS. The high-performance switch is a multi-stage, packet-switched Omega switch that provides a minimum of four paths between any pair of nodes. We used the SP2 at the Cornell Theory Center. It consists of 512 nodes and runs PVMe Version 1, release 3.1, an enhanced version of PVM that takes advantage of the switch.
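The per-node peak figure follows directly from the clock rate and the per-cycle floating-point throughput; a quick arithmetic check (the 512-node aggregate is our own extrapolation, not a figure from the paper):

```python
# Sanity-check the quoted SP2 peak: 66.7 MHz * 4 flops/cycle.
clock_hz = 66.7e6
flops_per_cycle = 4
peak = clock_hz * flops_per_cycle          # ~266.8 MFLOPS per node
assert abs(peak / 1e6 - 266) < 1           # matches the quoted ~266 MFLOPS

# Hypothetical aggregate for the full 512-node Cornell machine:
aggregate_gflops = 512 * peak / 1e9        # ~136.6 GFLOPS, illustrative only
assert 136 < aggregate_gflops < 137
```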

6 Numerical Results and Discussion

We ran experiments using data from real power systems of various sizes and found very similar behavior for all examples. Thus we restrict ourselves here to discussing one representative example. We select a power system consisting of 1390 buses and 181 generators. We model each generator by a set of first-order, two-dimensional differential equations

    dδ/dt = ω                          (26)
    dω/dt = −ω − 2 sin δ + 1           (27)

where δ denotes the rotor angle and ω the rotor velocity of a generator. Thus, each A_Gi block is a 2 × 2 matrix. It would be beyond the scope of this paper to include and explain the algebraic equations we used to model the power system. The algebraic equations consist of three sets of equations. The first set describes the relation between the currents in the generators and loads, the admittance matrix, and the voltages in the generators and loads. The second set describes how the electrical power in the generators depends on the currents in the generators, and the last set describes the relation between voltages in the generators, currents in the generators and other specified system parameters, for instance impedances. In our example, which we call 1390bus, JN is a 3323 × 3323 matrix. Prior to running experiments with the various implementations of the transient stability problem, we tested the performance of the communication functions provided by PVM. We found that in most cases the PVM function is more efficient than our own implementation of the same function. For instance pvm_mcast(), which performs a broadcast operation, outperforms our own implementation of a broadcast operation. However, pvm_reduce(), which performs a reduction operation, performs much worse than our own implementation of a reduction operation. Thus,

[Figure 5 plot: overall execution time (seconds × 10³) versus number of processors for the 1390 bus system; curves for ForwardBackward, W-Matrix and RepSub on NOW and SP2.]

Figure 5: Overall execution time for ForwardBackward, W-Matrix and RepSub on NOW and SP2.


[Figure 6 plot: overall communication time (seconds × 10³) versus number of processors for the 1390 bus system; curves for ForwardBackward, W-Matrix and RepSub on NOW and SP2.]

Figure 6: Maximum communication time of all processes for ForwardBackward, W-Matrix and RepSub on NOW and SP2.


in all our implementations we used our own reduction operations. On the NOW we ran all experiments for 2, 4 and 8 processors. On the SP2 we varied the number of processors from two to 32. We believe this is a sufficient number of processors for the application at hand, since it is unlikely that power plants will have a large number of processors available for running applications in parallel.

Overall Execution Time: Figure 5 shows the overall execution time of all three algorithms ForwardBackward, W-Matrix and RepSub versus the number of processors. Note that we include time spent on I/O in the overall execution time. Let us first consider the heterogeneous NOW. ForwardBackward consistently outperforms RepSub, which in turn consistently outperforms W-Matrix. The good performance of ForwardBackward is partially due to the fact that W-Matrix and RepSub require more I/O than ForwardBackward: W-Matrix requires reading a file containing the partitioned inverses, and RepSub requires reading three files containing the precomputed values factor(·,·). The sizes of these files are considerable. If the transient stability problem is solved repeatedly for varying right-hand sides, as is often done in the real world, the impact of the time spent on I/O becomes less significant and the gain due to parallelization more obvious. Another reason for the good performance of ForwardBackward in this example is that it requires less communication than W-Matrix and RepSub (see Section 4 and Figure 6, which we discuss in more detail later in this section). However, this implementation suffers from load imbalance, since the parent process sequentially solves Equation 11. As we move to bigger power system examples, we expect the distributed implementations to outperform ForwardBackward. In W-Matrix we varied the number of partitions from 388 to 1.
In the case of 388 partitions no fill-ins occur, and the case of 1 partition corresponds to computing L⁻¹ and U⁻¹, which is not an efficient way of solving a linear system. We found the optimal number of partitions in this example to be 2, which means the linear system can be solved in five serial steps at the expense of 1.22% fill-ins, bringing the total density of the matrix to 1.67%. When increasing the number of processors from two to four, we achieve a gain in performance for all implementations. The reasons are better load balance and only a limited increase in communication due to the limited number of processors. However, when the number of processors increases further, communication becomes the performance-limiting factor for a network of workstations connected via an FDDI network. When we use 8 processors, the overall execution time increases due to a rapid increase in communication time, which is shown in Figure 6. A second performance-limiting factor comes into play when we use 8 processors. Our heterogeneous NOW consists of four DEC Alpha machines and seven HP 735 machines. Whenever we use only two or four processors, we use the faster DEC Alphas. Whenever we use 8 processors, we need to include four of the slower HP 735 machines. The overall speed of the cluster is limited by the speed of the slowest machine, which is lower when we include the HP 735s.
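The W-Matrix figures quoted above are mutually consistent, which is worth checking when reproducing the experiment; a trivial illustrative check (our own arithmetic, not from the paper):

```python
# q = 2 partitions per factor -> 2q + 1 = 5 serial solution steps,
# and 1.22% fill-ins on top of the factors' original nonzeros give the
# quoted 1.67% total density (implying roughly 0.45% density before fill-in).
q = 2
assert 2 * q + 1 == 5

fill_in_pct, total_pct = 1.22, 1.67
base_pct = total_pct - fill_in_pct
assert abs(base_pct - 0.45) < 1e-9
```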

Now let us consider the overall execution time on the SP2. When comparing the results from the NOW with the SP2, we need to keep in mind that all our experiments on the NOW were carried out in multiuser mode; that is, PVM applications and sequential applications from other users were running on the machines and the network at the same time as our applications. On the SP2, on the other hand, jobs are scheduled as single threads, but are not guaranteed exclusive access to the switch. The SP2 uses PVMe, the enhanced version of PVM, which uses the high-performance switch for communication. For the SP2, we observe a behavior of the overall execution time as a function of the number of processors which is quite the opposite of that for the NOW. For a small number of processors the SP2 performs significantly worse than the NOW. The reason is the slow processor speed of the RS/6000 compared to the DEC Alpha machines. As the number of processors increases, load balance improves, which leads to reduced overall execution time. The speed of the individual machines becomes less important since more machines participate in the computation. As the number of processors is increased, the advantage of the high-performance switch and the better performance of PVMe over PVM become obvious. Already for 8 processors the SP2 gives better overall execution times than the NOW. ForwardBackward consistently performs best, but now W-Matrix slightly outperforms RepSub. The two partitions used in W-Matrix provide a better load balance than RepSub, and this leads to a better overall performance of W-Matrix. The slightly higher communication overhead of W-Matrix compared to RepSub (see Section 4) does not show in the overall execution time on the SP2, thanks to the performance of the high-performance switch. When we use more than 16 processors, the large communication overhead of W-Matrix and RepSub outweighs the gain due to a larger number of processors.
For ForwardBackward, where Equation 11 is solved sequentially in every iteration and all other computations are performed in parallel, increasing the number of processors leads to better overall performance. The communication overhead remains small even when 32 processors are used. We now look at the communication overhead of the different methods in more detail.

Communication Overhead: Figure 6 shows the maximum overall communication time over all nodes versus the number of processors for all three implementations on the NOW and the SP2. Recall that Figures 3 and 4 list all communication required by the different implementations. All three implementations have the same communication patterns except for step 4.6.4 (solving for V). Note that we neglect the possible communication requirements of step 4.6.3. This step is executed only three times within thousands of iterations, and any communication occurring in this step has no noticeable effect on the overall communication time. In step 4.6.4, ForwardBackward requires one broadcast operation, RepSub requires two reduction operations and two broadcast operations, and W-Matrix requires five reduction operations and five broadcast operations (see Section 4). Since W-Matrix has the highest communication overhead, it is very sensitive to the speed of

the underlying communication network. For none of our experiments did we have exclusive access to the network. Let us first consider the NOW. Figure 6 shows that, as expected, W-Matrix spends the largest amount of time in communication. We observe a slight increase in communication overhead when we increase the number of processors from two to four. The reason that the communication overhead for ForwardBackward is higher here than for RepSub is other message traffic on the network. When we further increase the number of processors, the amount of time spent in communication increases by an order of magnitude. As expected, this increase is most significant for W-Matrix, whereas ForwardBackward and RepSub show rather similar behavior. We conclude that when using PVM, two reduction operations and two broadcast operations are not significantly more expensive in terms of time than one broadcast operation. For the NOW, the rapid increase in time spent in communication clearly outweighs the gains in performance due to increasing the number of processors from four to eight. As a result, the overall execution time increases as well. On the SP2, communication is routed in parallel. For ForwardBackward a slight decrease in communication overhead can be observed when we increase the number of processors. The reason is that the amount of communication does not change much, but for a larger number of processors communication can take advantage of messages being routed in parallel. For up to 16 processors the communication overheads for ForwardBackward and RepSub are very close. The communication overhead for ForwardBackward is slightly higher than that for RepSub due to other message traffic on the network. When we use 32 processors, the communication overhead of RepSub increases. The increase in communication due to a large number of processors can no longer be compensated for by parallel routing of messages.
This increase in communication overhead leads to an increase in overall execution time, as observed in Figure 5. As expected, W-Matrix consistently has the highest communication overhead. When we increase the number of processors from two to four, we observe a decrease in communication time: for four processors the communication is no longer restricted to one pair of processors but can be routed in parallel, taking advantage of the switch. When we further increase the number of processors, the communication overhead also increases, for the same reason as in the case of RepSub.
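The per-iteration collective counts for step 4.6.4 that drive these curves can be tabulated directly from the description above (a small illustrative helper; the function name is ours):

```python
# Collectives per solve of Equation 11, as summarized in the text.
# q is the number of partitions per factor in W-Matrix (2q + 1 of each).
def collectives(method, q=2):
    """Return (reductions, broadcasts) per iteration for step 4.6.4."""
    return {
        "ForwardBackward": (0, 1),
        "RepSub":          (2, 2),
        "W-Matrix":        (2 * q + 1, 2 * q + 1),
    }[method]

assert collectives("W-Matrix", q=2) == (5, 5)  # the five-serial-step case
assert collectives("ForwardBackward") == (0, 1)
assert collectives("RepSub") == (2, 2)
```

On a shared network the cost of each collective grows with the processor count, so W-Matrix, with roughly five times the collectives of RepSub, is the first to lose its parallel gains.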

7 Conclusions

In this paper we presented three parallel implementations of the transient stability problem, exploiting different amounts of parallelism and thus requiring different amounts of communication. All data partitioning and parallelization strategies which we explored can be classified as parallel-in-space approaches. We found communication to be the crucial, performance-limiting factor when an algorithm is implemented on

a cluster of workstations, no matter whether it is loosely coupled (NOW) or tightly coupled (SP2). We found that load imbalance has much less impact on the overall execution time of an algorithm than communication overhead. The algorithm with the highest amount of load imbalance but the smallest amount of communication consistently outperforms all other algorithms, both on the NOW and on the SP2. Our next goal is to explore methods which reduce the amount of communication required between processors when solving the transient stability problem. We will implement and analyze parallel-in-time approaches, which split the problem into coarse-grain blocks of computation such that no communication is required within a block. We will also explore iterative methods for solving Equation 11.

Acknowledgements We thank Prof. F. L. Alvarado, H. Dag and J. Gronquist for many helpful discussions.

References

[1] Fernando L. Alvarado, David C. Yu, and Ramon Betancourt. Partitioned Sparse A⁻¹ Methods. IEEE Transactions on Power Systems, 5(2), May 1990.
[2] P. M. Anderson and Benjamin Dembart. Computational Aspects of Transient Stability Analysis. In A. M. Erisman, K. W. Neves, and M. H. Dwarakanath, editors, Electric Power Problems, the Mathematical Challenge: Proceedings of a Conference, Seattle, Washington, March 18-20, 1980. SIAM, Philadelphia, 1980.
[3] A. L. Beguelin, J. J. Dongarra, G. A. Geist, W. C. Jiang, R. J. Manchek, B. K. Moore, and V. S. Sunderam. PVM version 3.3: Parallel Virtual Machine System. University of Tennessee, Knoxville, TN; Oak Ridge National Laboratory, Oak Ridge, TN; and Emory University, Atlanta, GA, 1992.
[4] J. S. Chai and A. Bose. Bottlenecks in Parallel Algorithms for Power System Stability Analysis. IEEE Transactions on Power Systems, 8(1), February 1993.
[5] J. S. Chai, N. Zhu, A. Bose, and D. J. Tylavsky. Parallel Newton Type Methods for Power System Stability Analysis Using Local and Shared Memory Multiprocessors. IEEE Transactions on Power Systems, November 1991.
[6] D. P. Koester, S. Ranka, and G. C. Fox. A Parallel Gauss-Seidel Algorithm for Sparse Power Systems Matrices. In Proceedings of Supercomputing '94, November 1994.

[7] I. C. Decker, D. M. Falcao, and E. Kaszkurewicz. Parallel implementations of a power system dynamic simulation methodology using the conjugate gradient method. In Proceedings of the Power Industry Computer Applications Conference, May 1991.
[8] Mark Enns, William Tinney, and Fernando Alvarado. Sparse Matrix Inverse Factors. IEEE Transactions on Power Systems, 5(2), May 1990.
[9] A. M. Erisman. Sparse Matrix Problems in Electric Power System Analysis. In I. S. Duff, editor, Sparse Matrices and their Uses. Academic Press, New York, 1981.
[10] Djalma M. Falcao. Parallel and distributed processing in power system simulation and control. In IV Symposium of Specialists in Electric Operational and Expansion Planning, May 1994.
[11] Djalma M. Falcao, Eugenius Kaszkurewicz, and Heraldo L. S. Almeida. Application of Parallel Processing Techniques to the Simulation of Power System Electromagnetic Transients. IEEE Transactions on Power Systems, 8(1), February 1993.
[12] Rod Fatoohi and Sisira Weeratunga. Performance evaluation of three distributed computing environments for scientific applications. In Proceedings of Supercomputing '94, November 1994.
[13] G. P. Granelli, M. Montagna, G. L. Pasini, and P. Marannino. A W-Matrix Based Fast Decoupled Load Flow for Contingency Studies on Vector Computers. IEEE Transactions on Power Systems, 8(3), August 1993.
[14] Lanjuan Hou and Anjan Bose. Parallel Across Space Implementation of Waveform Algorithms on Shared Memory Machine.
[15] G. Huang and W. Ongsakul. Managing the Bottlenecks of a Parallel Gauss-Seidel Algorithm for Power Flow Analysis. In Proceedings of the 7th International Parallel Processing Symposium, 1993.
[16] M. Ilic-Spong, M. L. Crow, and M. A. Pai. Transient Stability Simulation by Waveform Relaxation Methods. IEEE Transactions on Power Systems, 2(4), November 1987.
[17] D. P. Koester, S. Ranka, and G. C. Fox. Parallel Block-Diagonal-Bordered Sparse Linear Solvers for Electrical Power System Applications. In Proceedings of the Scalable Parallel Libraries Conference. IEEE Press, 1994.
[18] D. P. Koester, S. Ranka, and G. C. Fox. Parallel Choleski Factorization of Block-Diagonal-Bordered Sparse Matrices. Technical report, School of Computer and Information Science and The Northeast Parallel Architectures Center (NPAC), Syracuse University, January 1994.

[19] Shyan-Lung Lin and J. E. Van Ness. Parallel solution of sparse algebraic equations. In Proceedings of the 18th Power Industry Computer Applications Conference, May 1993.
[20] Massimo La Scala, Anjan Bose, Daniel J. Tylavsky, and Jian Chai. A highly parallel method for transient stability analysis. In Proceedings of the Power Industry Computer Applications Conference, May 1989.
[21] B. Stott. Power System Dynamic Response Calculations. Proceedings of the IEEE, 67(2), February 1979.


Monika ten Bruggencate received the Diploma in computer sciences and electrical engineering from the Technical University Munich, Munich, Germany, in 1990. She is currently working on her Ph.D. in computer engineering at the University of Wisconsin-Madison. Her research interests include parallel and distributed computing, asynchronous algorithms, and fault-tolerant systems.

Suresh Chalasani received the B.Tech. degree in electronics and communications engineering from J.N.T. University, Hyderabad, India, in 1984, the M.E. degree in automation from the Indian Institute of Science, Bangalore, in 1986, and the Ph.D. degree in computer engineering from the University of Southern California in 1991. He is currently an Assistant Professor of Electrical and Computer Engineering at the University of Wisconsin-Madison. His research interests include parallel architectures, parallel algorithms, communication networks, and fault-tolerant systems.

