EmPower: An Efficient Load Balancing Approach For Massive Dynamic Contingency Analysis in Power Systems Siddhartha Kumar Khaitan and James D. McCalley, Fellow, IEEE Department of Electrical and Computer Engineering Iowa State University, Ames, Iowa, USA. {skhaitan, jdm}@iastate.edu
Abstract—Power system simulations involving the solution of thousands of stiff differential and algebraic equations (DAE) are computationally intensive and yet crucial for grid security and reliability. Online simulations of minutes to hours for a large number of contingencies require very high computational efficiency. Furthermore, since the simulation times across the contingencies vary considerably, dynamic load balancing of parallel contingency analysis (CA) is required to ensure maximum resource utilization and minimum idle time. However, the existing state-of-the-art contingency analysis techniques fail to fulfill this requirement. In this paper, we present EmPower, an Efficient load balancing approach for massive dynamic contingency analysis in Power systems. For single contingency analysis, EmPower uses time domain simulations and incorporates efficient numerical algorithms for solving the DAE. Further, the contingency analysis approach is scaled to large-scale contingency analysis using MPI-based parallelization. To enable an efficient, non-blocking implementation of work stealing, multithreading is employed within each processor. Simulations of thousands of contingencies on a supercomputer have been performed, and the results show the effectiveness of EmPower in providing good scalability and huge computational savings. Index Terms—Dynamic Load Balancing, Master-Slave, Work-stealing, Static Scheduling, Parallel Contingency Analysis, Time Domain Simulation
I. INTRODUCTION

Real-life energy management systems (EMS) regularly perform dynamic contingency analysis (DCA) to assess the ability of the power grid to sustain possible component failures [1, 2, 3]. DCA provides useful information for timely preventive and corrective actions and is thus a crucial requirement for the proper functioning and maintenance of modern EMS. However, due to the complex operation of EMS, the number of contingencies to be tested has been increasing, and analyzing those contingencies in real time is becoming infeasible. Recently, parallelism and other high-performance computing techniques have been deployed in several computation-intensive applications (e.g. [4, 5, 6, 7, 8, 9]). However, using these methods for DCA presents challenges of its own. Analyzing different contingencies requires different lengths of time, and hence a naïve approach to parallelizing DCA is likely to fail to provide load-balanced scheduling of different contingencies on the available processors. Moreover,
inefficient techniques prevent operators from fully exploring the solution space, and hence conclusions derived from such incomplete simulations may be inaccurate. Thus, novel techniques are required for conducting DCA in real time while maximizing resource utilization through load balancing. In this paper, we present EmPower, an Efficient load balancing approach for massive dynamic contingency analysis in Power systems. For an efficient implementation, we take a three-pronged approach. First, to accelerate single contingency analysis, we use efficient numerical algorithms for solving stiff differential and algebraic equations (Section III). Second, we use MPI-based parallelization to scale contingency analysis over a large cluster of processors. Third, we use a dynamic work-stealing algorithm to achieve load balancing across the available processors (Section IV). To fully tap the potential of the underlying architecture, we make several platform-specific optimizations. Most studies on the parallelization of contingency analysis have focused on steady-state contingency analysis [10, 11]. In this paper, we focus on short-term dynamic contingency analysis, which presents a significant challenge to parallelization because of the large variation in the simulation time of different contingencies. We conduct simulations to analyze thousands of contingencies using both a conventional approach and the EmPower approach (Section V). The results show that EmPower offers large speedups and outperforms a conventional scheduling scheme, namely master-slave scheduling. The rest of the paper is organized as follows. Section II discusses related work on load-balancing techniques and high-performance computing for power system simulations. Section III describes the time domain simulation approach and Section IV presents different load-balancing techniques.
Section V presents the results of massive contingency analysis using different load-balancing techniques. Finally, Section VI presents the conclusion and future work.

II. RELATED WORK AND BACKGROUND
For secure power grid operation, contingency analysis (CA) is an integral part of power systems operation and markets [12]. Traditionally, simulation studies are carried out for
Fig. 1: Flow-diagram of EmPower
contingency analysis. However, with the increasingly complex operation of modern EMS and increasing security requirements [10], analysis of a large number of contingencies is becoming infeasible on existing platforms. To address this challenge, we use high performance computing resources. Advances in high performance computing over the last few decades have made it extremely popular in both research and industry. In the literature, several high performance computing (HPC) solutions have been proposed for addressing computation-intensive problems in different domains, such as power systems [13, 14, 15], bio-informatics [16, 17, 18], nuclear physics [19, 20, 21] and processor architecture design [22, 23, 24]. In terms of HPC programming models, several implementation tools such as OpenMP [25], the message passing interface (MPI) [26] and graphics processing units (GPUs) [27] have been widely deployed. However, because of the unique characteristics of power system DCA, several issues need to be addressed before DCA can benefit from HPC resources. To harness the potential of multiple processing units for solving large problems, efficient scaling and parallelization techniques are required. Further, contingencies vary in their nature and in the time required to test them, and thus a naïve parallelization approach is likely to lead to an unbalanced distribution of tasks on the available processors. This problem grows dramatically with the number of processors. Thus, novel approaches are required to accelerate power system dynamic contingency analysis using HPC resources. Some researchers have proposed master-slave based scheduling algorithms for achieving load balancing [10, 11]. To reduce contention at a single master, a variant of master-slave scheduling uses multiple masters with multiple counters [28]. We discuss the master-slave scheduling algorithm in more detail in Section IV-B.
Work stealing [29, 30] is a dynamic scheduling method which works on the intuition that scheduling with load balancing can be easily achieved if a process with no pending job is allowed to steal a pending job from other running processes. Work stealing is also known as task stealing and random stealing [31]. Several researchers have presented implementations of the work stealing algorithm. Pezzi et al. propose a hierarchical work stealing (HWS) algorithm for MPI platforms [32], which is an extension of the single-master
work stealing algorithm. HWS uses a hierarchical structure of managers (i.e. masters) and workers (i.e. slaves), arranged in a binary tree [32]. In this tree, the inner nodes are all managers and the leaf nodes are the workers. In MPI platforms, processes can only communicate if they share an intercommunicator, and hence using N processes would require collective synchronization between them. HWS aims to reduce this synchronization cost by using managers which mediate and facilitate communication between processes [32]. This technique, however, has the limitation of requiring several extra manager nodes. Further, each work stealing request is slowed down, since it must pass through the manager nodes. A variant of this is cluster-aware hierarchical work stealing [33, 34], which requires that the processors be arranged in a specific topology (e.g. a tree) to minimize communication cost. Dinan et al. implement work stealing for MPI platforms [35]. To reduce the communication overhead due to stealing (polling) requests, they provision a fixed time interval (called the polling interval) which must elapse before a process examines and answers incoming stealing requests. They observed that the length of the polling interval has a significant impact on the algorithm's performance. For distributed termination detection, a modified version of Dijkstra's termination detection algorithm is used. Work stealing has been studied in various application contexts such as wide area networks [31] and distributed memory machines [36], and several variations have been proposed in the literature (e.g. [37, 38, 39, 40]). Further, work stealing has also been used for load balancing on GPUs [41], and several programming languages, such as Cilk [42], have been designed for multithreaded parallel computing. To achieve load balancing in massive DCA, EmPower uses work stealing based scheduling.
The implementation of work stealing uses a hybrid approach, with MPI for parallelization across nodes and multithreading within each node for handling stealing requests while contingency analysis continues.

III. EMPOWER: METHODOLOGY AND IMPLEMENTATION

Figure 1 shows an overview of the EmPower approach. In what follows, we describe each component of EmPower in detail.
Fig. 2: Simulation time variation across contingencies (sorted in ascending order of time)
A. Time Domain Simulation
For dynamic contingency analysis, EmPower performs time domain simulation using a high-speed simulator [43, 44, 45, 46]. This simulator has been verified against commercial software [47]. It provides an interface to a number of numerical algorithms for integration and for the solution of nonlinear and linear systems of equations. From the point of view of computational efficiency, the proper choice of numerical algorithms is extremely important; hence, we now discuss the numerical algorithms in more detail.

B. Numerical Algorithms
A power system is represented by thousands of differential and algebraic equations (DAEs). These are highly stiff due to multi-scale dynamics, with components having widely varying time constants. The DAEs are discretized using an integration algorithm. The resulting set of nonlinear equations, together with the algebraic equations from the network and the different power system components, is solved through Newton iteration. The core of the solution of the nonlinear system of equations is the solution of a linear system of equations.
1) Integrator: The simulator is interfaced with a number of implicit and explicit integrators, including the trapezoidal method, the Euler method, the BDF method and so on. The simulator is also interfaced with the public-domain variable-order, variable-step BDF integrator IDAS [48]. In the EmPower approach we use IDAS, since our experiments have shown that it outperforms the other integrators and efficiently and accurately handles the stiff power system dynamics.
2) Nonlinear Solver: Newton methods are the ubiquitous choice for the solution of nonlinear equations. We have developed several Newton-based nonlinear solvers, including a conventional Newton method, a Newton algorithm with line search, and an algorithm that identifies an initial solution based on a least-squares minimization (which gives a very good starting point for fast convergence of the Newton method).
We have also implemented the relaxed Newton method to avoid repeated Jacobian calculations. This method is known as the very dishonest Newton (VDHN) method.
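As an illustration of this discretize-then-Newton pipeline, the following sketch (ours, in Python; not the simulator's C++ code) applies implicit Euler to a hypothetical stiff scalar ODE and solves each implicit step with a Newton iteration whose derivative is frozen, in the spirit of VDHN:

```python
import math

# Sketch (assumed, not the simulator's code): implicit Euler on the stiff
# test ODE y' = -lam*(y - cos(t)); each implicit step is solved by a Newton
# iteration that reuses a frozen derivative instead of refreshing it.
def implicit_euler_step(y, t, h, lam=1000.0, tol=1e-12, max_iter=50):
    f = lambda tt, yy: -lam * (yy - math.cos(tt))
    g = lambda yn: yn - y - h * f(t + h, yn)   # residual of the implicit step
    dg = 1.0 + h * lam                         # frozen "Jacobian" of g
    yn = y
    for _ in range(max_iter):
        if abs(g(yn)) < tol:
            break
        yn -= g(yn) / dg                       # dishonest Newton update
    return yn

y, t, h = 1.0, 0.0, 0.01                       # lam*h = 10: stiff for explicit Euler
for _ in range(100):                           # integrate to t = 1
    y = implicit_euler_step(y, t, h)
    t += h
# y now tracks the quasi-steady solution y ≈ cos(t)
```

Because this toy ODE is linear, the frozen derivative happens to be exact; for the power system DAEs the Jacobian changes slowly between iterations, which is what makes the relaxed (VDHN) update profitable.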
3) Linear Solver: The simulator also interfaces with a number of linear solvers. The IDAS integrator has no built-in direct sparse linear equation solver; its built-in solvers (BLAS and LAPACK routines) handle only dense matrices and are therefore unsuitable for power system applications, which involve high-dimension sparse matrices. IDAS also supports iterative solvers such as GMRES with user-supplied preconditioner routines; however, the Jacobian of the power system dynamics is highly ill-conditioned, and hence convergence with iterative solvers is very poor. Of all the linear equation solvers we interfaced with IDAS, namely KLU, UMFPACK, SuperLU, PARDISO, etc. [49, 50, 51], KLU was found to be the most computationally efficient. Hence, EmPower makes use of KLU, a serial (single-threaded) sparse linear equation solver.

IV. SCALING FOR MASSIVE CONTINGENCY ANALYSIS AND USE OF LOAD BALANCING TECHNIQUES

Of the possible HPC programming models, viz. OpenMP, GPU and MPI, EmPower utilizes MPI. This choice is motivated by three reasons. First, MPI is a standardized, vendor-independent and portable library designed for flexibility and efficiency. Second, MPI is a de-facto industry standard adopted by several vendors of commercial systems, and thus suits EmPower's design goal of easy deployment in real-life systems. Finally, MPI integrates easily with C++ and is hence suitable for code reuse and short development time. To fully utilize the computing resources of the available parallel processors, effective scheduling techniques are required. As shown in Figure 2, there is significant time variation across different contingencies, which leads to load imbalance across the processors. To address this, EmPower uses an efficient work stealing based scheduling technique to ensure minimum idle time and maximal resource utilization for each processor.
In what follows, we first present and compare commonly used scheduling techniques and then discuss the platform-specific optimizations made for using work stealing in EmPower. Scheduling techniques can be broadly divided into two categories, namely static and dynamic.
A. Static Scheduling
A static load balancing approach pre-computes the schedule for each process in advance and thus incurs little runtime overhead of scheduling and monitoring. Its pseudo code is shown below.

Input: A task list T and a processor list P
Output: A static allocation of tasks to the processors
while True do
    foreach processor p in P do
        if T is empty then
            break;
        end
        Remove a task t from T;
        Allocate t on p;
    end
end

(a) Initial contingency allocation
(b) Load on individual processor as time progresses
The advantage of static scheduling is that it avoids any overhead of online scheduling and requires no extra processor for scheduling. However, this technique requires that all jobs (contingencies) be available and their characteristics (e.g. precise runtime) be known at the outset to achieve a fairly balanced schedule. If the simulation times of the individual contingencies vary widely, significant load imbalance results: all but a few processors sit idle, waiting for the others to finish. Thus the finish time of the schedule, which is the completion time of the last job/contingency, becomes very large. This idea of static allocation is presented in Figure 3. Fig. 3a shows the loading on the different processors at the beginning: all processors are (almost) equally loaded in terms of the number of tasks/contingencies to be analyzed. Since different contingencies can have different simulation times, the work load on the processors diverges over time, as shown in Fig. 3b. As time evolves further, processors finish their tasks and sit idle waiting for the last job to finish, as shown in Fig. 3c and Fig. 3d. Thus, static scheduling can lead to high load imbalance. To address this limitation, dynamic scheduling techniques generate schedules in an online manner by allocating the contingency cases to individual processors on demand. Dynamic scheduling produces an evenly distributed load and thus significantly reduces processor idle time. In the rest of this section, we discuss two well-known dynamic scheduling techniques, viz. master-slave and work stealing.
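The static allocation described above can be condensed into a short sketch (a hypothetical Python illustration; the task runtimes are invented):

```python
def static_schedule(runtimes, n_procs):
    """Round-robin pre-assignment: processor i gets tasks i, i+P, i+2P, ..."""
    alloc = [[] for _ in range(n_procs)]
    for i, t in enumerate(runtimes):
        alloc[i % n_procs].append(t)
    return alloc

# The makespan is set by the most loaded processor, so variance in task
# runtimes translates directly into idle time on all the others.
runtimes = [10, 10, 10, 50]             # hypothetical per-contingency seconds
alloc = static_schedule(runtimes, 2)    # [[10, 10], [10, 50]]
finish_times = [sum(a) for a in alloc]  # [20, 60]: processor 0 idles 40 s
```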
B. Master Slave Dynamic Scheduling
Master-slave scheduling employs two kinds of nodes, namely master nodes and slave nodes (e.g. [10, 11]). The master schedules tasks on the slave nodes and, on finishing a task, a slave requests the master for allocation of a new task. Algorithm 2 shows the pseudo code for the master-slave scheduling algorithm. We now explain the working of the master-slave scheduling algorithm with an example (refer to Figure 4). Initially, the master node has all the contingencies, as shown in Fig. 4a. It then allocates tasks to each of the slave nodes as in Fig. 4b. When a slave node (processor) finishes analyzing its assigned contingencies, it requests a task from the master node (Fig. 4c). The master node assigns a task to the processor if it has pending contingencies to be simulated and reduces its queue by one (Fig. 4d). However, a situation may arise where multiple processors finish their assigned tasks and simultaneously send requests to the master node for additional tasks. This leads to contention and the need for synchronization, resulting in extra overhead, as shown in Fig. 4e. This problem worsens dramatically as the number of processors grows, since the chances of contention increase with it.
Master-slave scheduling overcomes the limitation of static scheduling by providing better load balancing. Moreover, unlike work stealing scheduling (discussed next), it does not require mutual communication between slaves and is thus simple to implement. However, it has the disadvantage that the master processor is occupied with scheduling and hence cannot be used for useful work. Furthermore, if multiple slaves send requests to the master simultaneously, contention results.
(c) One Processor finishes all its tasks and sits idle
(d) Multiple processors sitting idle waiting for last task to finish
Fig. 3: Static Scheduling Technique
Input: A task list T, a slave processor list S and a master node m
Output: A load balanced (best effort) allocation of tasks to processors
// Initialization
foreach slave processor s in S do
    if T is empty then
        break;
    end
    Remove a task t from T;
    Assign t to s;
end
// Algorithm for master node m
while T is not empty do
    Wait for a task request from a slave;
    if a task request arrives from slave s then
        Remove a task t from T;
        Allocate t to s;
    end
end
// Algorithm for any slave node s
while there is a task to be run do
    Finish the task;
    Request a task from m;
    if no task is available then
        break;
    end
end
Algorithm 2: The Master-Slave Scheduling Algorithm
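A minimal executable sketch of this master-slave loop (an illustration in Python rather than the paper's MPI/C++ setting; a thread-safe queue stands in for the master):

```python
import queue
import threading

def master_slave(task_times, n_slaves):
    """The shared queue plays the master: each slave repeatedly requests the
    next task until the master has no pending contingencies left."""
    tasks = queue.Queue()
    for t in task_times:
        tasks.put(t)
    busy = [0.0] * n_slaves               # simulated busy time per slave

    def slave(i):
        while True:
            try:
                t = tasks.get_nowait()    # "request a task from the master"
            except queue.Empty:
                return                    # no task available: slave stops
            busy[i] += t                  # "finish the task" (simulated)

    threads = [threading.Thread(target=slave, args=(i,)) for i in range(n_slaves)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return busy
```

Because tasks are handed out on demand, e.g. `master_slave([5.0] * 20, 4)`, a slave that drew long tasks simply requests less often; the contention the text describes corresponds to many slaves hitting the shared queue at once.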
(a) Initial contingency allocation
(b) Contingency allocation by the master to the slave nodes
C. Work Stealing Dynamic Scheduling
Work stealing [29, 30] is a well known dynamic scheduling method, which works on the intuition that scheduling with load balancing can be easily achieved if a process with no pending job (called the thief) is allowed to steal a pending job from the job queue of another running process (called the victim). It has been shown [30] that using a work stealing based scheduler, the expected time of executing a fully strict computation with P processors is given by:

T = T1/P + O(T∞)    (1)

(c) Slave sends a request to the master upon job completion
Here T1 is the minimum serial execution time of the computation and T∞ is the minimum execution time with an infinite number of processors. Further, the space required by the algorithm is at most S1·P, where S1 is the minimum serial space required. Also, the expected total communication cost of the algorithm is P·T∞·(1 + nd)·Smax, where Smax is the size of the largest activation record of any thread and nd is the maximum number of times a thread synchronizes with its parent [30]. Thus, the work stealing algorithm is efficient in terms of time, space and communication [30]. Algorithm 1 shows the pseudo code for dynamic load balancing based on work stealing. Figure 5 illustrates the algorithm. Fig. 5a depicts the initial contingency allocation under work stealing based dynamic load balancing. Fig. 5b shows the progression of the load on the different processors as time passes. Up to this point, work stealing behaves analogously to static scheduling. After this stage, however, work stealing diverges completely from static scheduling: after a processor finishes its assigned contingencies, it does not remain idle unless all other processors have no pending contingencies to be analyzed. Fig. 5c, 5d and 5e show
(d) Master responds to the request of the slave
(e) Multiple slave nodes request the master node leading to contention
Fig. 4: Master Slave Scheduling Technique
the progression of the work stealing based load balancing. When a processor finishes all its allocated tasks, it sends requests (via MPI) to the other processors to see if they have additional tasks to steal, achieving fine-grained load balancing. Work stealing leads to fine-grained load balancing and avoids node idling. It has been proven efficient [30], and unlike master-slave scheduling, it avoids
Input: A task list T and a processor list P
Output: A load balanced (best effort) allocation of tasks to processors
// Initialization
while True do
    foreach processor p in P do
        if T is empty then
            break;
        end
        Remove a task t from T;
        Allocate the task t to p;
    end
end
// Each processor has two threads: a worker thread and a polling thread
// Algorithm for worker thread on processor p
while p has unfinished tasks do
    foreach unfinished task t assigned to p do
        Finish the task t;
    end
    foreach p′ in P − {p} do
        Try stealing a task from p′;
        if stealing was successful then
            Assign the stolen task to p;
            break;
        end
    end
end
// Algorithm for polling thread on processor p
while True do
    if a stealing request arrives from p′ then
        if p has an unstarted task t then
            Remove t from the tasks of p;
            Return t to p′;
        else
            Return None to p′;
        end
    end
end
Wait on a barrier for all processors;
Terminate all threads and processors;
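The stealing loop above can be condensed into a runnable sketch (an illustration only: Python threads and locks stand in for the paper's MPI processes and polling thread; the task contents are hypothetical):

```python
import collections
import threading

def run_work_stealing(initial_tasks):
    """Each worker drains its own deque from the front; when it runs out it
    turns thief and steals from the back of another worker's deque."""
    deques = [collections.deque(ts) for ts in initial_tasks]
    locks = [threading.Lock() for _ in initial_tasks]
    done = [[] for _ in initial_tasks]   # which worker finished which task

    def worker(i):
        while True:
            task = None
            with locks[i]:
                if deques[i]:
                    task = deques[i].popleft()       # local work
            if task is None:                         # become a thief
                for v in range(len(deques)):
                    if v == i:
                        continue
                    with locks[v]:
                        if deques[v]:
                            task = deques[v].pop()   # steal from the victim
                            break
                if task is None:
                    return                           # nothing left to steal
            done[i].append(task)                     # "finish" the task

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(initial_tasks))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

Each task is removed under a lock exactly once, so every task is completed exactly once even when idle workers exit early; a full implementation would add the barrier and termination detection of the pseudo code above.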
(a) Initial contingency allocation
(b) Load on individual processor as time progresses
(c) The thief P3 tries to steal a task from the victim P2
(d) Successful stealing by P3 from P2
Algorithm 1: The Work Stealing Scheduling Algorithm
the possibility of contention at a single master. However, the disadvantage of work stealing is that it requires an implementation platform in which each processor can communicate with any other processor, and such a topology may be unavailable on some platforms.

D. Implementation and Optimizations
For computer architects and software developers, HPC presents both challenges and opportunities, since the software must be tuned to the architecture to achieve high performance. Developing an HPC implementation involves addressing several issues and challenges, such as data partitioning, functional decomposition, data dependencies, communication overheads, synchronization and load balancing. In what follows, we discuss the implementation-specific issues we addressed to fully leverage the potential of the work stealing approach. We implement the non-blocking version of the work stealing algorithm [29], which has been proven efficient in terms of space, time and communication overhead [30]. In our implementation, each processor executes two threads: an executor/worker thread, which executes the tasks, and a polling thread, which polls for steal requests from other
(e) P1 trying to steal from Pn
Fig. 5: Work Stealing Scheduling Technique
processors at regular intervals. (A listener implementation, which performs a blocking receive on requests, would be more natural than polling and would allow immediate service of requests; while conceptually possible, current MPI implementations on most architectures block the entire process in that case, and thus the executor thread would also be blocked.) The advantage of the two-threaded implementation is that incoming requests can be serviced even while the victim (the processor from which the task is stolen) is executing a task.

Fig. 6: Hybrid MPI and Multithreading Framework

Fig. 7: An example of use of mutex lock

Figure 6 shows the hybrid MPI and multithreading framework used in EmPower. MPI is used for communication between nodes, while multithreading divides the work within each node, which facilitates the implementation of the non-blocking version of the work stealing algorithm. For multithreading we use POSIX (Portable Operating System Interface) threads, also referred to as Pthreads [52]. With Pthreads, threads have their own private data while also sharing the same global memory, and the Pthread API defines several routines for management of Pthreads [53]. Use of POSIX threads makes the implementation portable across operating systems such as Linux, Solaris, etc. When there is no task in a process' queue, the executor thread sends requests to other processes. Based on the requests and the responses to them, processes mark each other as idle; further requests are not sent to processes known to be idle, which reduces request traffic. Using two threads does increase coding complexity, as access to the task list must be synchronized between them. Synchronization ensures that critical sections of the code are executed atomically; a critical section is a piece of code that accesses a shared resource (e.g. memory) that must not be accessed concurrently by more than one execution thread, so while one thread is inside a critical section, the other must wait. Specifically, when the executor thread is removing a task from the task list to perform the contingency analysis, the polling thread must not be able to give that task to another process, and vice versa. We accomplish this synchronization using mutex (mutual exclusion) locks [54], which protect shared data structures against concurrent accesses; the concept is illustrated in Figure 7. Another optimization we make is that at the beginning of the execution, the polling thread polls at long intervals (5 s
in our implementation) and later, after receiving the first steal request (an indication of the start of the work stealing phase), it polls at shorter intervals (2 s). This gives the executor thread more time for executing tasks in the beginning, while remaining responsive to steal requests in the later phases. During work stealing, a process with no remaining work (called a free process) polls other processes to find pending work. When a large number of processes is used, the order of polling becomes important. A naïve method might always poll in a fixed order, for example polling processes 0, 1, 2, ... in sequence. This, however, is likely to create contention on processes 0, 1, etc., since multiple free processes will poll them first. To address this, we use a random polling scheme: each free process chooses a victim at random. Statistically, random polling distributes the stealing requests equally across all processes.

V. RESULTS AND DISCUSSION

A. Simulation Platform
The simulations were performed on the Cystorm supercomputer, which consists of 400 dual quad-core nodes with AMD processors distributed across 12 racks. Thus there are a total of 3,200 processors with a peak performance of 15.7 TF and a data storage capacity of 44 TB. We simulate a large system with 13029 buses, 431 generators, 12488 branches and 5950 loads. To simulate different possible disturbances appearing in
real-life EMS operation, we generate multiple contingencies using events such as bus faults, branch faults, branch tripping, generator faults, generator tripping, etc., and combinations of them. Depending on the number of events (bus fault, bus fault followed by line fault, duration of fault) and the time to settle to steady state, the simulation time of different contingencies varies from 10 seconds to 25 seconds.

TABLE I: Master-slave scheduling: simulation time (in seconds)

No. of Contingencies   P=8     P=12    P=16    P=24    P=32
10000                  18550   11757    8595    5599    4194
20000                  35293   22370   16469   11245    7970
30000                  52889   33366   24508   16825   12082
TABLE II: Work stealing scheduling: simulation time (in seconds)

No. of Contingencies   P=8     P=12    P=16    P=24    P=32
10000                  15983   10254    7989    5232    3929
20000                  30594   20108   15221   10707    7657
30000                  45846   31246   22832   16011   11578
We use wall clock time (in seconds) to present the simulation results, since this reflects the actual time needed for task completion and is of direct significance to end-users.

B. Results of Parallel Contingency Analysis
We perform simulations on 10000 to 30000 contingencies. Given the large number of contingencies we simulate, using only a few processors would lead to huge computation times. For example, by extrapolating the simulation time with 8 processors, simulating 30000 contingencies with the master-slave algorithm on a single processor would take nearly 5 days. Hence, we only show results using 8, 12, 16, 24 and 32 processors. This also highlights the importance of using EmPower for the analysis of a large number of contingencies. Tables I and II present the simulation times taken by master-slave based dynamic load balancing and by work stealing based dynamic load balancing (the EmPower approach). To study the scalability of both methods, we use the speedup metric S(C, P), defined as

S(C, P) = T_MasterSlave(C, 8) / T(C, P)    (2)

Here C denotes the number of contingencies simulated and P the number of processors. T_MasterSlave(C, 8) is the simulation time with the master-slave scheme on 8 processors for C contingencies; note that we take this as the baseline. T(C, P) is the simulation time with either master-slave or work stealing for C contingencies and P processors. Using this, we compute the speedup values shown in Table III. We first note that, for all processor counts and all contingency counts, EmPower outperforms the master-slave method. For P=8,
EmPower provides nearly 1.15× speedup. For P=12, EmPower offers up to 1.81× speedup. For P=16, EmPower offers nearly 2.32× speedup, and for P=24, EmPower provides up to 3.55× speedup. For P=32, EmPower provides up to 4.72× speedup. It is noteworthy that in some cases we observe superlinear speedup. The reason is that with master-slave scheduling, for 8 processors the number of slave (worker) processors is 7, and for 32 processors it is 31. Hence, going from 8 to 32 processors, the number of slave processors scales by 31/7 ≈ 4.43, which is more than 4. Figure 8 shows the amount of time saved by the EmPower approach over the master-slave scheduling algorithm. Clearly, the EmPower approach saves a large amount of simulation time. For 30000 contingencies with 8 processors, EmPower saves nearly 1.9 hours. Similarly, for 20000 contingencies with 8 processors, EmPower saves more than 1 hour. Thus, the computational advantage offered by EmPower can allow power system operational personnel to analyze a larger number of contingencies.

C. Further Analysis and Insights
To further analyze the performance of EmPower and gain insights, in this subsection we focus exclusively on results obtained using EmPower. To see how EmPower scales with increasing number of processors, we redefine the speedup, now termed E(C, P), as follows:

E(C, P) = T_EmPower(C, 8) / T(C, P)    (3)
Here T_EmPower(C, 8) is the simulation time with the EmPower (work-stealing) approach on 8 processors for C contingencies; note that the baseline here is the time taken by EmPower itself for P=8. T(C, P) is the simulation time with the EmPower approach for C contingencies and P processors. Table IV shows these speedup values.

TABLE IV: Speedup Values E(C, P) for EmPower

    C       P=8    P=12   P=16   P=24   P=32
    10000   1.00   1.56   2.00   3.05   4.07
    20000   1.00   1.52   2.01   2.86   4.00
    30000   1.00   1.39   2.01   2.86   3.96
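As a quick consistency check, the speedups in Table IV can be compared with the ideal scaling ratio P/8. The short sketch below (the helper name is ours, introduced for illustration) computes this scaling efficiency from the table's values:

```python
# Speedup values E(C, P) from Table IV (baseline: EmPower at P = 8).
table_iv = {
    10000: {8: 1.00, 12: 1.56, 16: 2.00, 24: 3.05, 32: 4.07},
    20000: {8: 1.00, 12: 1.52, 16: 2.01, 24: 2.86, 32: 4.00},
    30000: {8: 1.00, 12: 1.39, 16: 2.01, 24: 2.86, 32: 3.96},
}

def scaling_efficiency(speedup, procs, baseline_procs=8):
    """Measured speedup divided by the ideal ratio P / baseline_procs."""
    return speedup / (procs / baseline_procs)

for c, row in table_iv.items():
    effs = {p: round(scaling_efficiency(s, p), 2) for p, s in row.items()}
    print(c, effs)
```

An efficiency close to (or, for 10000 contingencies at P=32, slightly above) 1.0 indicates near-ideal scaling relative to the 8-processor baseline.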
Clearly, EmPower offers large speedups as the number of processors increases. This confirms that EmPower is a useful technique for improving the computational efficiency of contingency analysis in power systems. The results also show that the EmPower approach is better than the master-slave approach. Any additional time saving achieved by the EmPower approach will enable the operator to cover a larger probability space of credible contingencies within the same time budget. It is also expected that with longer time-frame simulations, as in mid-term and extended-term simulations, the advantage and time savings provided by
TABLE III: Speedup Values S(C, P) for both Master Slave Method and EmPower

            Master Slave Technique            Work Stealing (EmPower)
    C       P=8   P=12  P=16  P=24  P=32     P=8   P=12  P=16  P=24  P=32
    10000   1.00  1.58  2.16  3.31  4.42     1.16  1.81  2.32  3.55  4.72
    20000   1.00  1.58  2.14  3.14  4.43     1.15  1.76  2.32  3.30  4.61
    30000   1.00  1.59  2.16  3.14  4.38     1.15  1.69  2.32  3.30  4.57

[Figure: bar chart of time saved (seconds) versus number of processors (8, 12, 16, 24, 32) for 10000, 20000 and 30000 contingencies.]
Fig. 8: Amount of time saved by using EmPower approach over master slave algorithm (seconds)
EmPower over master-slave scheduling would increase. In some cases the speedup with EmPower (the work-stealing algorithm) is slightly higher than the ratio of the number of processors. This can be attributed to random differences in the load conditions on the host machine during the experiments.

D. Opportunities for Further Improvement

EmPower uses work stealing, whose overhead arises from the cost of communication between processors, contention, etc. The communication cost depends on the underlying hardware, and to keep it small, a suitable implementation platform should be chosen. To reduce contention, the ratio of computation to communication should be high. This is the case when the computation time of each task is sufficiently high that stealing a task is more beneficial than staying idle. Also, when the number of tasks is proportional to the number of CPUs, the probability of a processor finding its run-queue empty becomes small. This reduces the need for stealing, which in turn reduces contention.

VI. CONCLUSION AND FUTURE WORK

In this paper, we presented EmPower, an efficient technique for real-life energy management systems that performs massive contingency analysis of independent tasks of varying length. A novel work-stealing based dynamic load-balancing implementation is proposed for massively parallel dynamic contingency analysis. The proposed algorithm is compared with a master-slave based dynamic load balancing algorithm. A large number of contingencies (up to 30,000) of a large realistic system are simulated. The results show excellent scalability of the proposed algorithm and a large amount of time saved compared with the master-slave approach. This saving also translates into energy savings on the simulation platforms. Our future work will focus on further evaluating EmPower using a larger number of processors and optimizing it for better scalability.
The ability to efficiently solve large-sized problems will result in increased throughput and potential cost savings.
ACKNOWLEDGMENT

This work was supported in part by DOE OE subcontract B601014.

REFERENCES

[1] F. Li, W. Qiao, H. Sun, H. Wan, J. Wang, Y. Xia, Z. Xu, and P. Zhang, “Smart transmission grid: Vision and framework,” Smart Grid, IEEE Transactions on, vol. 1, no. 2, pp. 168–177, 2010.
[2] J. Lopes, N. Hatziargyriou, J. Mutale, P. Djapic, and N. Jenkins, “Integrating distributed generation into electric power systems: A review of drivers, challenges and opportunities,” Electric Power Systems Research, vol. 77, no. 9, pp. 1189–1203, 2007.
[3] A. Giani, S. Sastry, K. Johansson, and H. Sandberg, “The VIKING project: An initiative on resilient control of power networks,” in Resilient Control Systems, 2009. ISRCS’09. 2nd International Symposium on. IEEE, 2009, pp. 31–35.
[4] Y. Zhang et al., “Efficient pairwise statistical significance estimation for local sequence alignment using GPU,” in Computational Advances in Bio and Medical Sciences (ICCABS), 2011 IEEE 1st International Conference on. IEEE, 2011, pp. 226–231.
[5] S. Mittal, A. Pande, L. Wang, and P. Kumar, “Design exploration and implementation of simplex algorithm over reconfigurable computing platforms,” in IEEE International Conference on Digital Convergence, 2011, pp. 204–209.
[6] D. Honbo, A. Agrawal, and A. Choudhary, “Efficient pairwise statistical significance estimation using FPGAs,” Proceedings of BIOCOMP, vol. 2010, pp. 571–577, 2010.
[7] S. Mittal, S. Gupta, and S. Dasgupta, “FPGA: An efficient and promising platform for real-time image processing applications,” in National Conference on Research and Development in Hardware Systems (CSI-RDHS), Kolkata, India, 2008.
[8] Y. Zhang et al., “Accelerating pairwise statistical significance estimation using NUMA machine,” Journal of Computational Information Systems, vol. 8, no. 9, pp. 3887–3894, 2012.
[9] N. Nakka et al., “Predicting node failure in high performance computing systems from failure and usage logs,” in Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on. IEEE, 2011, pp. 1557–1566.
[10] Z. Huang, Y. Chen, and J. Nieplocha, “Massive contingency analysis with high performance computing,” in IEEE Power and Energy Society General Meeting 2009. IEEE, July 2009.
[11] I. Gorton, Z. Huang, Y. Chen, B. Kalahar, S. Jin, D. Chavarria-Miranda, D. Baxter, and J. Feo, “A high-performance hybrid computing approach to massive contingency analysis in the power grid,” in e-Science, 2009. e-Science ’09. Fifth IEEE International Conference on, Dec. 2009, pp. 277–283.
[12] Y. Makarov, S. Lu, X. Guo, J. Gronquist, and P. Du, Wide Area Security Region: Final Report. Pacific Northwest National Laboratory, 2010.
[13] R. Green, L. Wang, M. Alam, and C. Singh, “Intelligent and parallel state space pruning for power system reliability analysis using MPI on a multicore platform,” in Innovative Smart Grid Technologies (ISGT), 2011 IEEE PES. IEEE, 2011, pp. 1–8.
[14] R. Green, L. Wang, and M. Alam, “High performance computing for electric power systems: Applications and trends,” in Power and Energy Society General Meeting, 2011 IEEE. IEEE, 2011, pp. 1–8.
[15] V. Jalili-Marandi, Z. Zhou, and V. Dinavahi, “Large-scale transient stability simulation of electrical power systems on parallel GPUs,” Parallel and Distributed Systems, IEEE Transactions on, no. 99, pp. 1–1, 2011.
[16] Y. Zhang et al., “Par-PSSE: Software for pairwise statistical significance estimation in parallel for local sequence alignment,” International Journal of Digital Content Technology and its Applications (JDCTA), vol. 6, no. 5, pp. 200–208, 2012.
[17] Y. Zhang, M. M. A. Patwary, S. Misra, A. Agrawal, W.-K. Liao, and A. Choudhary, “Enhancing parallelism of pairwise statistical significance estimation for local sequence alignment,” 2nd HiPC Workshop on Hybrid Multi-Core Computing, WHMC 2011, pp. 1–8, 2011.
[18] A. Agrawal et al., “Parallel pairwise statistical significance estimation of local sequence alignment using message passing interface library,” Concurrency and Computation: Practice and Experience, 2011.
[19] A. Srinivasa, M. Sosonkina, P. Maris, and J. Vary, “Dynamic adaptations in ab-initio nuclear physics calculations on multicore computer architectures,” in Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on. IEEE, 2011, pp. 1332–1339.
[20] N. Berger, “GPUs in experimental particle physics,” Bulletin of the American Physical Society, vol. 57, 2012.
[21] G. Collazuol, G. Lamanna, J. Pinzino, and M. Sozzi, “Fast online triggering in high-energy physics experiments using GPUs,” Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 2011.
[22] S. Mittal and Z. Zhang, “Integrating sampling approach with full system simulation: Bringing together the best of both,” in IEEE International Conference on Electro/Information Technology (EIT). Indianapolis, USA: IEEE, 2012.
[23] S. Mittal, S. Gupta, and S. Dasgupta, “System generator: The state-of-the-art FPGA design tool for DSP applications,” in Third International Innovative Conference on Embedded Systems, Mobile Communication and Computing (ICEMC2 2008). Global Education Center, India, 2008.
[24] S. Mittal et al., “EnCache: Improving cache energy efficiency using a software-controlled profiling cache,” in IEEE EIT, May 2012.
[25] K. Ishimura and S. Ten-no, “MPI/OpenMP hybrid parallel implementation of second-order Møller–Plesset perturbation theory using numerical quadratures,” Theoretical Chemistry Accounts: Theory, Computation, and Modeling (Theoretica Chimica Acta), pp. 1–5, 2011.
[26] A. Agrawal, S. Misra, D. Honbo, and A. Choudhary, “MPIPairwiseStatSig: Parallel pairwise statistical significance estimation of local sequence alignment,” in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 2010, pp. 470–476.
[27] Y. Zhang, S. Misra, A. Agrawal, M. Patwary, W. Liao, Z. Qin, and A. Choudhary, “Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU’s power,” BMC Bioinformatics, vol. 13, no. Suppl 5, p. S3, 2012.
[28] Y. Chen, Z. Huang, and D. Chavarría-Miranda, “Performance evaluation of counter-based dynamic load balancing schemes for massive contingency analysis with different computing environments,” in Power and Energy Society General Meeting, 2010 IEEE. IEEE, 2010, pp. 1–6.
[29] N. Arora, R. Blumofe, and C.
Plaxton, “Thread scheduling for multiprogrammed multiprocessors,” in Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures. ACM, 1998, pp. 119–129.
[30] R. Blumofe and C. Leiserson, “Scheduling multithreaded computations by work stealing,” in Foundations of Computer Science, 1994. Proceedings, 35th Annual Symposium on. IEEE, 1994, pp. 356–368.
[31] R. Van Nieuwpoort, T. Kielmann, and H. Bal, “Efficient load balancing for wide-area divide-and-conquer applications,” in ACM SIGPLAN Notices, vol. 36, no. 7. ACM, 2001, pp. 34–43.
[32] G. Pezzi, M. Cera, E. Mathias, and N. Maillard, “On-line scheduling of MPI-2 programs with hierarchical work stealing,” in Computer Architecture and High Performance Computing, 2007. SBAC-PAD 2007. 19th International Symposium on. IEEE, 2007, pp. 247–254.
[33] J. Baldeschwieler, R. Blumofe, and E. Brewer, “Atlas: An infrastructure for global computing,” in Proceedings of the 7th Workshop on ACM SIGOPS European Workshop: Systems Support for Worldwide Applications. ACM, 1996, pp. 165–172.
[34] M. Backschat, A. Pfaffinger, and C. Zenger, “Economic-based dynamic load distribution in large workstation networks,” in Euro-Par’96 Parallel Processing. Springer, 1996, pp. 631–634.
[35] J. Dinan, S. Olivier, G. Sabin, J. Prins, P. Sadayappan, and C. Tseng, “Dynamic load balancing of unbalanced computations using message passing,” in Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International. IEEE, 2007, pp. 1–8.
[36] J. Dinan, D. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha, “Scalable work stealing,” in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM, 2009, p. 53.
[37] Y. Guo, R. Barik, R. Raman, and V. Sarkar, “Work-first and help-first scheduling policies for async-finish task parallelism,” in Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on. IEEE, 2009, pp. 1–12.
[38] T. Hiraishi, M. Yasugi, S. Umatani, and T. Yuasa, “Backtracking-based load balancing,” in ACM SIGPLAN Notices, vol. 44, no. 4. ACM, 2009, pp. 55–64.
[39] A. Tzannes, G. Caragea, R. Barua, and U. Vishkin, “Lazy binary-splitting: A run-time adaptive work-stealing scheduler,” in ACM SIGPLAN Notices, vol. 45, no. 5. ACM, 2010, pp. 179–190.
[40] M. Michael, M. Vechev, and V. Saraswat, “Idempotent work stealing,” in ACM SIGPLAN Notices, vol. 44, no. 4. ACM, 2009, pp. 45–54.
[41] K. Zhou, Q. Hou, Z. Ren, M. Gong, X. Sun, and B. Guo, “RenderAnts: Interactive Reyes rendering on GPUs,” in ACM Transactions on Graphics (TOG), vol. 28, no. 5. ACM, 2009, p. 155.
[42] R. Blumofe, C. Joerg, B. Kuszmaul, C. Leiserson, K. Randall, and Y. Zhou, Cilk: An Efficient Multithreaded Runtime System. ACM, 1995, vol. 30, no. 8.
[43] S. Khaitan, J. McCalley, and M. Raju, “Numerical methods for on-line power system load flow analysis,” Energy Systems, vol. 1, no. 3, pp. 273–289, 2010.
[44] S. Khaitan and J. McCalley, “A class of new preconditioners for linear solvers used in power system time-domain simulation,” Power Systems, IEEE Transactions on, vol. 25, no. 4, pp. 1835–1844, 2010.
[45] S. Khaitan, J. McCalley, and Q. Chen, “Multifrontal solver for online power system time-domain simulation,” Power Systems, IEEE Transactions on, vol. 23, no. 4, pp. 1727–1737, 2008.
[46] S. Khaitan, C. Fu, and J. McCalley, “Fast parallelized algorithms for on-line extended-term dynamic cascading analysis,” in Power Systems Conference and Exposition, 2009. PSCE’09. IEEE/PES. IEEE, 2009, pp. 1–7.
[47] S. K. Khaitan and J. D. McCalley, High Performance Computing in Power and Energy Systems, POWSYS. Springer, 2012, pp. 43–69.
[48] R. Serban, C. Petra, and A. C. Hindmarsh, “User documentation for IDAS v1.0.0,” https://computation.llnl.gov/casc/sundials/description/description.html, 2009.
[49] T. Davis, “Algorithm 832: UMFPACK v4.3, an unsymmetric-pattern multifrontal method,” ACM Transactions on Mathematical Software (TOMS), vol. 30, no. 2, pp. 196–199, 2004.
[50] T. Davis and K. Stanley, “KLU: A ‘Clark Kent’ sparse LU factorization algorithm for circuit matrices,” in 2004 SIAM Conference on Parallel Processing for Scientific Computing (PP04), 2004.
[51] O. Schenk and K. Gärtner, “Solving unsymmetric sparse systems of linear equations with PARDISO,” Future Generation Computer Systems, vol. 20, no. 3, pp. 475–487, 2004.
[52] D. Butenhof, Programming with POSIX Threads. Addison-Wesley Professional, 1997.
[53] POSIX threads programming tutorial, https://computing.llnl.gov/tutorials/pthreads/.
[54] A. Silberschatz, P. Galvin, and G. Gagne, Operating System Concepts. Addison-Wesley, 1998, vol. 4.