
Replacing Processes with Threads in Parallel Discrete Event Simulation Mark H. Butler TRW Systems, Missile Defense Division

Bo I. Sandén, Colorado Technical University

The use of simulation as a tool for examining the behavior of current and planned real-world systems continues to increase. Most simulations comprise processes that are driven by simulated discrete events. To speed up large simulations, multiple computer processors are used to execute simulation processes in parallel. Parallel discrete event simulation (PDES) refers to such implementations, which are typically executed on Unix-based platforms using separate processes for event processing. Unix processes incur more overhead than threads, which are lighter-weight units of execution. In this paper, we describe an investigation of the use of threads, instead of processes, for event processing in such simulations. To facilitate this study, we developed a thread-based PDES prototype that provided a controlled environment for conducting experiments. The benchmark metric was performance, as represented by decreases in elapsed time and associated increases in speedup. Our experiments showed that, to a point, performance improved as processors were added. Beyond that point, performance remained constant or degraded because the future-event list, a centralized list of events to be processed, is a shared resource. Simulations with the longest event processing times and least event-handling overhead showed the greatest improvement. Our experiments demonstrated that threads are an excellent alternative to processes for parallel simulation.

Introduction

We recently completed a study that provides insight into the use of multithreading in a multiprocessor parallel discrete event simulation (PDES) environment [1]. In this paper, we contribute our experience to the PDES community in several areas:
• Successful application of multithreading to a PDES environment
• Performance benchmarking of this approach
• Quantification of a performance profile using a multithreading library
• Development of a PDES prototype that can serve as a basis for further study
Many approaches have been explored for speeding up a PDES. One strategy involves various simulation time-management algorithms [2]. Other strategies involve the efficient use of multiprocessors on supercomputers from Cray, IBM, Sun, and SGI. Our approach improves PDES performance by using threads instead of processes.

Technology Review Journal • Fall/Winter 2001


Threads can be thought of as lightweight processes that run in the same address space as their parent process. Consequently, less system overhead is required to switch between them. A review of the literature [1], as well as discussions with leaders in the PDES field, indicated that while this was definitely an area of interest, no one had yet investigated it in depth. Some investigations had been initiated but were subsequently discontinued for various reasons. The two major reasons, other than funding, were as follows:
• Implementation difficulty. Multitasking is more difficult to implement with threads than with processes. In the Unix environment, a new process is typically created by duplicating ("forking") an existing process. Each process runs in its own address space and communicates with other processes via interprocess communication services, such as shared files, signals, pipes, and message queues. Simulations using processes are relatively easy to implement, as is synchronization. In contrast, thread creation, communication, and synchronization are more difficult to implement.
• Conversion difficulty. A PDES implemented using processes must have all its process interaction logic replaced with thread interaction logic. This effort is substantial because many assumptions and strategies that are valid in a process-based environment become invalid when using threads.
Rather than attempt to convert an existing PDES that used processes, we created a thread-based PDES prototype whose run-time characteristics were parameter controlled. Our experiments showed that threads could be a viable alternative to processes in PDES.

Prototype Development

The creation of a new PDES prototype required us to ask some basic questions and make fundamental implementation choices, including the choice of hardware and software platforms, as well as the supporting libraries.

Choice of Hardware and Software Platforms. The hardware used for this study was the SGI Origin2000 at the Joint National Test Facility (JNTF). This platform hosts many Department of Defense war game simulations. The CPUs used for the measurement data run at 250 MHz. The POSIX "pthread" library provides multithreading support. While possibly sacrificing some performance, this library provides an implementation base that is portable across multiple hardware platforms. It is a user-space multithreading library. Threads may be either bound or unbound. Bound threads are system-space threads and incur significantly more overhead to operate. Unbound threads operate mainly in user space and therefore require minimal context-switching overhead. Since PDES software has typically been developed in general-purpose programming languages rather than simulation-oriented languages, we chose to implement this software in a general-purpose language. We further decided to use an object-oriented language. Since Java's interpreted nature does not lend itself to the high-performance requirements of PDES, we programmed in C++. A possible alternative would have been ANSI C wrapped with C++, but we discarded that option because a C++-based library was available.


We addressed the question of buying or building a PDES library. Developing such a library from scratch seemed a formidable task. Additional investigation revealed that, while a library that met all our requirements did not exist, several existing simulation libraries could offer a jump start. Two main contenders were identified.

The first contender was the CSIM library [3], written in ANSI C. Implemented on a number of different platforms, CSIM does provide complete simulation support, but it has several shortcomings. First, it is not object oriented. More significantly, it has no built-in multithreading support. A great deal of effort would be required to restructure the library and add that capability. A change of this magnitude would also have exposed the library to new errors introduced by the new structure, as well as by the new code to support multithreading.

The second contender was the HASE++ library, developed by Howell [4] at the University of Edinburgh, Scotland. It is a multithreaded library written in C++ and designed to use an external multithreading library such as POSIX pthreads, Solaris™ threads, the REX threading library, or the Cray threading library. HASE++ had three advantages for us:
• Like CSIM, it is already developed and tested.
• It is object oriented.
• Multithreading support is built in.
One shortcoming of HASE++ is that its basic algorithm makes it inherently sequential. HASE++ uses a standard discrete event simulation algorithm with four steps:
• Remove the next event from the future-event list (FEL).
• Enable the appropriate entity to process the event.
• Wait for the entity to finish processing.
• Repeat until there are no more events on the FEL.
Parallelism is possible only if several events have exactly the same simulation timestamp. (All events with the same simulation timestamp are removed from the FEL during the first step and can be run in parallel, if several CPUs are available.)
We overcame this shortfall by modifying the FEL management logic to allow all events present on the FEL at any one point in time to be processed. The simulation clock is advanced through the times indicated by the timestamps of the events.

Simulation Prototype feltest. The PDES prototype we created is called feltest. Its run-time parameters include the number of threads, the number of CPUs, the number of events to process, event processing time, and FEL overhead. It allocates 64 threads across 1 to 32 CPUs. The prototype was designed to run with the HASE++ library and thus achieve parallelism through it. More specifically, feltest creates a number of event generators and event processors. An event processor is assigned to each event generator. Consequently, the number of event generators and processors is always equal, and each is assigned to its own thread. This arrangement simplified control of the environment. Figure 1 depicts the operation and interaction between an event generator thread and its event processor thread counterpart.

[Figure 1. Event generator and event processor operation. The generator thread generates an event, waits for the processed event, and simulates processing it; the processor thread waits for an event, simulates processing it, and generates a processed event.]

With this architecture, the simulation runs as follows:
• The simulation prototype feltest creates all event generator and processor threads. The number of event generators is specified in the input parameter named num_srcs. After all event generators and processors are successfully created and waiting on the start semaphore, the main simulation control starts the simulation by clearing the start semaphore.
• Each event generator schedules an event for its associated event processor. It then waits on a semaphore for a processed event. When the processed event is received, it simulates event processing of its own and then schedules another event for its event processor. This process continues until it has generated the number of events specified by the input parameter num_events.
• While the event generators start the simulation by generating events, the event processors start it by waiting on events. When an event processor is awakened to process an event, it simulates processing the event and then schedules a processed event for some time in the future. It then waits for the next event.
• The FEL manager thread coordinates all these events. In response to a scheduled event from an event generator, the FEL manager releases the event to its respective event processor. When the event processor completes and schedules a processed event for its event generator, the FEL manager releases the event to the associated event generator.

With this approach, only half of the threads (the value of num_srcs) are running at any one time, because either the event generator or the event processor is waiting for an event from the other. This means that using a CPU complement greater than num_srcs provides no additional speedup: of the two threads assigned to each CPU, only one is running at any given time. This relationship was used to simplify simulation control. In a more typical simulation, this 1:1 relationship probably would not exist between the event generators and the event processors. The HASE++ library supports other, less restrictive approaches that require more complex logic to determine how far the simulation clock should be advanced. Rollback logic is also needed for situations where entities post events with timestamps less than the current simulation clock. For these experiments, we forced all future events to be scheduled at fixed increments from the current simulation time.


The event-generating portion of feltest has two functions:
• Reading parameters from the parameter file specifying the number of events to generate, the average simulated event processing time to be used for each event, and the variance to apply to this average, if any
• Generating the requested number of events
The event generator loop operates as follows:
• It schedules an event in the future for the associated event processor. Depending on the variance specified, this is either a fixed or a variable time increment. If the variance is 0, it is the actual increment specified. If the variance is greater than 0, it is a normally distributed variate with a mean equal to the increment specified and the specified variance. The first event for each thread is always an immediate event.
• It waits for a processed event to be returned from the event processor.
• It simulates event processing for the specified time (i.e., it consumes the number of CPU cycles necessary to pass the specified time on a dedicated CPU).
The event processor portion of feltest has two functions:
• Getting the average event processing time specified in the parameter file
• Processing events generated by the associated event generator
The event processor loop operates as follows:
• It waits for an event from the associated event generator.
• It simulates processing the event.
• It schedules a processed event for the associated event generator.

System Environment. The initial prototype development was done remotely on a 128-CPU, 300-MHz Origin2000 system running IRIX 6.5 at the Army Research Laboratory (ARL) at Aberdeen Proving Ground, Maryland. The final timing runs were performed on a 64-CPU, 250-MHz Origin2000 system running IRIX 6.5 at the JNTF at Schriever Air Force Base in Colorado Springs, Colorado. Access to this system was through a LAN-based SGI O2 workstation using a command-line and/or X-Windows interface.
The JNTF environment allowed the use of several debugging and testing tools that were not practical to use in the ARL environment, especially over a modem. One of the most useful tools was a graphical representation of CPU use, updated four times a second, for all 64 CPUs on the system. It distinguishes between system CPU time and user CPU time with color-coded, stacked bar charts, which allow the user to observe CPU loads visually. Another useful tool, named top, textually shows what processing threads are running, and where, at any time for a specific user. Its output is similar to that of the Unix process status (ps) command. However, unlike ps, which generates a single snapshot display and then ends, top runs continuously once started, updating the screen every second.

Experimental Runs. The basic measurement collection problem was that feltest was not the only process running in the environment. The system's CPUs, caches, memory, operating system (IRIX) facilities, disks, etc., were shared with other programs. Elapsed time was the main measurement of improvement for the feltest program when CPUs were added. However, elapsed time could be adversely affected (i.e., increased) by other activity in the system. Other measurements were also affected by system activity (e.g., percent CPU busy).


Several runs were usually needed to observe best performance or to build statistics describing the typical profile of a particular scenario's operation, because even when the system appeared idle, many processes and daemons were still active in the environment. These interfered with the operation of the test program, causing variations in many of the captured data items, such as percent CPU busy, elapsed time, system CPU time, and context switches. Many runs with many different scenarios were necessary to understand where performance gains were highest and where bottlenecks occurred. Consequently, we implemented a parametric mode that allowed many scenarios to be constructed dynamically. A Perl script varied some parameters over a specified range and automatically ran each resulting unique scenario.

The most important time and resource-use data items we measure, collect, and write to the output file are defined as follows:
• elapsed time. Total (wall-clock) time a program takes to execute, in seconds and thousandths of seconds.
• user CPU time. Time during a program's execution when it is executing its own code or library routines in user mode.
• system CPU time. Time during a program's execution when the operating system is executing code on behalf of the program.
• percent CPU busy. CPU use during the indicated elapsed time. It is calculated as ((user CPU time + system CPU time) / elapsed time) * 100 and is displayed to an accuracy of one decimal place. Since elapsed time should decrease as CPUs are added, percent CPU busy for multiple-CPU runs will exceed 100%.
We ran 64,000 events through the prototype, with processing time varying from 100 to 1,000 µs per event; 1,950 runs were made, with 42,900 data points collected. Because of intermittent, unpredictable interference on the system, runs with each unique parametric configuration were repeated five times to approximate best-case performance as closely as possible.
The most important data point collected was elapsed time.

PDES Prototype Performance Results

The elapsed time from runs using single CPUs typically experienced minimal variation from the shortest time obtained (average ~0.7%, maximum ~3.5%). Multiple-CPU runs tended to vary more widely (average ~12%, maximum ~54%). Besides elapsed time, system CPU time and percent CPU busy varied the most between runs.

Figure 2 depicts the speedup achieved with various combinations of FEL overhead (felovhd) and event processing times (e_time) for CPU complements ranging from 1 to 32. Overall, the performance of the PDES prototype improved as CPUs were added, up to a maximum of 12 to 14. Beyond that, performance remained constant or degraded. That outcome is similar to results of benchmarks with a process-based PDES framework, the Synchronous Parallel Environment for Emulation and Discrete-Event Simulation (SPEEDES) [5], on the same platform. The maximum speedup attained within the run profiles tested with the prototype was about five.

[Figure 2. Speedup versus number of CPUs. Speedup (0 to 32) plotted for 1 to 32 CPUs. Curves: perfect (ideal); threadtest (100% overlap); and feltest with felovhd of 27 or 155 µs combined with e_time of 100, 500, or 1000 µs. Note: We ran timing tests to compare the effects of varying numbers of CPUs on performance. From 2 to 16 CPUs, we increased the number in increments of 2; from 16 to 32, in increments of 4. We achieved the best performance with 12 to 16 CPUs. The tests for numbers above 16 were run to check for anomalies, but we found none.]

Configurations with the longest event processing times and least FEL overhead showed the greatest improvement. Conversely, configurations with the shortest event processing times and most FEL overhead showed the least performance improvement. Figure 2 also compares the results from the feltest runs with the perfect speedup results obtained when we ran a variant program called threadtest, which uses totally independent threads without shared resources. This experiment showed that even 100% overlap among threads did not yield perfect (ideal) speedup. Performance improved well through the entire range of CPUs, but more overhead was incurred as each CPU was added. At the 32-CPU level, speedup proved to be only about 92% of potential, a result that could significantly affect massively parallel systems with hundreds or thousands of CPUs. A significant measurement from these tests indicates that the system's CPUs were not running at full utilization. Figure 3 shows overall CPU utilization observed for CPU complements ranging from 1 to 32, with a representative fixed felovhd of 50 µs and e_time ranging from 100 to 1000 µs in 100-µs increments. The maximum CPU utilization achieved around the apparent optimum complement of 12 to 14 CPUs was ~470% out of a potential of 1200% to 1400%.


[Figure 3. CPU utilization by number of CPUs. Percent CPU busy (0 to 500%) plotted for 1 to 32 CPUs, with one curve per e_time value from 100 to 1000 µs in 100-µs increments.]

Summary

Our experiments showed that threads can be an excellent alternative to processes in PDES. Performance improves as processors are added, up to a point. Beyond that point, performance remains constant or degrades because the FEL becomes a bottleneck.

Areas for further investigation became apparent as this work progressed. One is to provide more precise control over the thread-scheduling mechanism by running in a privileged (superuser) mode. Another, although it sacrifices some portability, is to use the native, platform-specific multithreading support rather than the more generic POSIX pthreads library. A third, and especially interesting, area is to use multiple FELs, perhaps with multiple threads per FEL.

References
1. M.H. Butler, Scalability of Parallel Discrete Event Simulation Using Threads, Doctoral Dissertation, Colorado Technical University, Colorado Springs, Colo., 2001.
2. P.J. Talbot, "Time Management Algorithms for Parallel Discrete Event Simulations," TRW's Technology Review Journal, Vol. 6, No. 1, Spring/Summer 1998, pp. 17–27.
3. K. Watkins, Discrete Event Simulation in C, McGraw-Hill Book Company, London, UK, 1993.


4. F.W. Howell, HASE++ (Version 1.0)—A Discrete Event Simulation Library for C++, http://www.dcs.ed.ac.uk/home/hase/userguide/hase++.html, 1996.
5. Metron, Inc., Synchronous Parallel Environment for Emulation and Discrete-Event Simulation (SPEEDES), http://www.speedes.com, 2001.

Mark H. Butler has more than 38 years of experience in the computer industry, the past four with TRW Systems in Colorado Springs, Colorado. He has worked with several major computer hardware and software companies and was with TRW in Los Angeles, California, for several years in the early 1980s. He is currently a senior software engineer assigned to the Joint National Test Facility in Colorado Springs, where his work includes the optimization of parallel discrete event simulation and alternative platforms for high-performance computing, including Beowulf clusters. He holds a BS in management and finance from Golden Gate University in San Francisco, California. He also holds an MS and a Doctor of Science in computer science from Colorado Technical University in Colorado Springs, where he serves as an adjunct professor of computer science. [email protected]

Bo I. Sandén is a professor of computer science at Colorado Technical University in Colorado Springs, Colorado, where he teaches software engineering, simulation, and modeling. His main research interest is software design using multithreading, and he has published multiple papers on that topic. He is also the author of a book on software construction. Before entering academia, he spent 15 years as a software designer and project lead with UNIVAC and Philips. He holds an MS in engineering physics from the Lund Institute of Technology, Lund, Sweden, and a PhD in computer science from the Royal Institute of Technology, Stockholm, Sweden. [email protected]
