Chapter 21

PARALLELIZATION OF AN AIRLINE FLIGHT-SCHEDULING MODULE ON A SCI-COUPLED NUMA SHARED MEMORY CLUSTER

Marcus Dormanns, Stefan Lankes, Thomas Bemmerl
Lehrstuhl für Betriebssysteme, RWTH Aachen
Kopernikusstr. 16, D-52056 Aachen, Germany
[email protected]
Georg Bolz, Erik Pfeiffle
Lufthansa Systems, Decision Support Systems
Frankfurt Airport Center, D-60549 Frankfurt, Germany
{Georg.Bolz,Erik.Pfeiffle}@LHSystems.com

Abstract
This paper reports on the parallelization of a module of a commercial airline flight-scheduling decision support system. The primary target platform for this effort was a cluster of SCI-interconnected PCs. SCI's support for a shared data programming model was one of the major arguments for this platform, besides the fact that Windows NT operated PCs are becoming more and more important even for professional enterprise computing. Two additional platforms have been considered for evaluation purposes: an HP/Convex SPP1200, the current parallel computing platform at Lufthansa Systems, as well as a 6-processor ALR PC server. The results are very promising: it could be demonstrated that PCs are highly competitive in terms of absolute sequential performance, and that the parallelized flight-scheduling module scales considerably better on the SCI-clustered PCs than on the ALR and even than on the high-end HP/Convex multiprocessor.
Keywords:
NUMA, shared memory, application parallelization, cluster computing.
21.1
INTRODUCTION
Airline flight-scheduling is a typical example of a complex decision support problem. The task is to determine a flight plan (i.e. departure/destination times and stations) for an airline that serves the customers' requests best while maximizing profit. A lot of restrictions and goals have to be considered, e.g. the types of aircraft, crew scheduling (working time regulations and skills), passengers' comfort (number of intermediate stops and total travel time), etc. Only a few years ago, flight plans were determined and tuned entirely by hand. With increasing complexity, this became more and more infeasible. Decision support systems like the one considered here represent an area of commercial computing that requires enormous amounts of computational power. Therefore, parallel computing is a highly relevant approach in this area, although its degree of acceptance is not yet as high as, e.g., in scientific computing.

This paper considers the parallelization of a time consuming module of Lufthansa Systems' flight-scheduling system SPEED: the Connection Builder. The task of the Connection Builder is to assemble the data basis for the succeeding flight plan assessment and optimization modules. It takes the raw data of a flight plan, i.e. information about each individual flight, and assembles all resulting meaningful travel routes that can be offered to customers. For each given departure/destination city pair, there might exist several different ways to travel, e.g. with a different number of intermediate stops in different locations. This is a very time consuming process and therefore subject to parallelization. The SPEED code is fully implemented in C++.

The main target platform for this parallelization effort was a cluster system (Paas et al., 1997). It consists of commodity Intel-based PC compute nodes that are interconnected with an SCI network (Scalable Coherent Interface) (IEEE, 1992). SCI falls into the class of the so-called SANs (System Area Networks) (Mukherjee et al., 1998): high performance networks for cluster systems. In contrast to other SANs (e.g. Myrinet or ServerNet) that enable just message passing, SCI offers transparent read and write access to remote memory. Memory segments of each compute node can be mapped into the virtual address space of all cluster compute nodes and used to assemble globally shared data structures therein. Accesses to segments of memory that are physically located on remote compute nodes are transparently mapped to the address range of the SCI adapter and served via a respective network transaction. Due to the different memory access latencies, depending on whether an access is local or remote, such a parallel system is called a NUMA system (Non-Uniform Memory Access). This distinctive feature of SCI enables a shared data programming model, which is one of the reasons for Lufthansa Systems' interest in it.
In addition to the NUMA cluster, an HP/Convex SPP1200 (Abandah et al., 1998) and an ALR Revolution 6x6 (ALR, 1996) have been considered for evaluation purposes. The ALR Revolution is a UMA (Uniform Memory Access) shared memory multiprocessor system with a central memory subsystem. This also applies to the HP/Convex SPP1200 for the largest configuration at our disposal with 8 processors; in larger configurations, it is a NUMA machine as well. The major architectural difference between the ALR and the HP/Convex is the type of interconnection between the processors and the memory. While the HP/Convex possesses a high performance and very expensive cross-bar network between 4 dual-processor modules and 4 memory modules, the processors within the ALR machine share a single common bus for their accesses to a single memory module.

Today's trend in parallel computer architecture, which is also mirrored by the NUMA cluster, is to avoid the central bottleneck of a single memory subsystem and instead to provide a global address space, implemented by physically distributed shared memory (DSM). It is commonly assumed that the expensive remote memory accesses in such a NUMA system are paid off by the complete decoupling of the majority of all memory accesses. Another advantage of the employed PC NUMA cluster system is its low price. It is assembled from commodity components with an interconnect whose adapters are simply plugged into the system's I/O bus and do not require any (proprietary and expensive) architectural modifications of the compute nodes. While this is very attractive, such a cluster definitely lacks some valuable properties of dedicated parallel systems, e.g. a single system image (Pfister, 1995).

Currently, the platform employed for parallel processing at Lufthansa Systems is the HP/Convex SPP1200. One goal of the described parallelization effort is to discover the differences between shared memory parallel processing on a SCI-coupled NUMA PC cluster and on a dedicated parallel system like the HP/Convex SPP1200. Clusters of PCs are of interest to Lufthansa Systems for three reasons:
- Intel-based PCs are much less expensive than RISC workstations.
- PCs can run Windows NT, which is a very interesting operating system platform for Lufthansa Systems. It might be a serious argument for a customer to purchase the software package if it is available for Windows NT. Especially smaller companies might have a compute infrastructure that is dominated by Windows NT and might not want to change it just to be able to run a certain software package.
- Decision support problems of the given type require enormous compute capabilities. To solve typical problems within reasonable time, parallel processing is necessary.
This paper is structured as follows. Section 2 introduces the SPEED flight-scheduling program package with a focus on the Connection Builder module. Its functionality, its integration into the overall system, and its major code and data structures are described as far as necessary to explain the parallelization. Before the actual parallelization is described in section 4, the parallelization interface is briefly sketched in section 3, together with a discussion of the specific requirements. Section 5 presents and discusses the performance results for the three parallel systems. Finally, some general assessments of the lessons learned are provided and some conclusions are drawn in section 6.
21.2
THE SPEED PROGRAM PACKAGE AND THE CONNECTION BUILDER MODULE
21.2.1
GENERAL ISSUES
SPEED is a program package that has been developed by Lufthansa Systems for flight plan determination and optimization. For a long time, flight plans have been tuned manually. But with the increasing complexity of this task (e.g. more flights with more complex degrees of freedom, more knowledge about customers' needs, etc.), the introduction of a decision support system became necessary. A flight plan is based on the averaged customer booking data for a single week: there will, for example, be a large demand for flights on Monday morning and on Friday afternoon, whereas fewer customers wish to fly on Wednesday at lunchtime. Variations on larger scales (e.g. during a month or a whole year) are not considered, as they would increase the complexity (and therefore the computational demand) considerably. SPEED tries to determine a flight plan which serves the customers' needs best under economical considerations, e.g.:
- availability of planes (taking into account their fuel demands, maintenance intervals, etc.)
- availability of a suitable crew (considering their skills, working time regulations, etc.)
SPEED also allows a flight plan to be modified manually.
The Connection Builder is a module of the SPEED program package that is executed each time during the start-up phase of the program. It assembles the data basis which is used to determine and optimize the flight plan, i.e. all meaningful travel routes between cities. Afterwards, the assumed customer bookings are mapped onto these routes (i.e. it is determined which route with which intermediate stops a passenger with a specific departure and destination airport at a specific point in time will take). This data is then used to assess the flight plan, i.e. the customers' satisfaction and the company's profit.
The result provides the personnel with an impression of the quality of a flight plan, shows how a modification behaves, and forms the basis for further tuning efforts.

[Figure 21.1: The data structures of the Connection Builder (Market, Itinerary, Segment).]
21.2.2
CODE AND DATA STRUCTURE OF THE CONNECTION BUILDER
For each given departure/destination city pair, called a market, the Connection Builder creates all existing itineraries. An itinerary is a concatenation of individual flights, also called segments, that is offered to customers as a meaningful route to travel between the two cities. Each itinerary is put together from one or more segments. For example, a passenger can fly directly from Düsseldorf to New York. Alternatively, he first takes a flight from Düsseldorf to Frankfurt; in Frankfurt, he changes planes and takes another flight to his destination city New York. This itinerary is thus put together from two flights. The C++ class SWMM_Segment manages all information (flight number, flight distance, etc.) about the segments. In the Connection Builder, each market is represented by the class SWMM_Market, which contains information about the departure and destination city. For each market, the Connection Builder assembles all meaningful itineraries. The pointers to these itineraries are stored in a dynamic array within the data structure SWMM_Market. Analogously, each itinerary contains an array of references to all those segments that form the itinerary. Furthermore, because other modules of Lufthansa Systems' flight-scheduling system SPEED require this, each segment contains a list of pointers to all itineraries to which it contributes. It is also the Connection Builder that creates these lists. The relationships among these data structures are depicted in figure 21.1.
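The following minimal C++ sketch illustrates these relationships. Only the class names SWMM_Segment and SWMM_Market are taken from the paper (rendered here with an underscore); the itinerary class name, the member names and the use of std::vector are illustrative assumptions, not the actual SPEED declarations.

#include <string>
#include <vector>

class SWMM_Itinerary;  // a travel route built from one or more segments (name assumed)

// A segment is an individual flight; it keeps back references to all
// itineraries it is part of (needed by later SPEED modules).
class SWMM_Segment {
public:
    std::string flightNumber;
    double      distance;                      // flight distance
    std::vector<SWMM_Itinerary*> itineraries;  // filled by the Connection Builder
};

// An itinerary is an ordered concatenation of segments, e.g.
// Duesseldorf -> Frankfurt -> New York consists of two segments.
class SWMM_Itinerary {
public:
    std::vector<SWMM_Segment*> segments;
};

// A market is a departure/destination city pair; the Connection Builder
// attaches all meaningful itineraries of this market to it.
class SWMM_Market {
public:
    std::string departure;
    std::string destination;
    std::vector<SWMM_Itinerary*> itineraries;
};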
The algorithm of the Connection Builder is relatively straightforward. The main loop of the Connection Builder iterates over all potential markets. For each market, the algorithm assembles all itineraries. However, some itineraries are not very meaningful. Consider e.g. the market Düsseldorf - New York, for which the itinerary Düsseldorf - Frankfurt - Düsseldorf - New York is nonsensical due to the contained cycle. The Connection Builder eliminates such nonsensical itineraries.
The parallelization strategy is to parallelize this outermost loop. This seems very simple, but is complicated by three aspects:
- Different markets contain a highly varying number of itineraries. This results in enormous differences in computational complexity for the individual markets and therefore raises a load balancing problem.
- The list of pointers to itineraries in the class SWMM_Segment might be accessed from different processes/threads: a single segment can be contained in a huge number of itineraries of different markets that might have been assigned to different processes/threads for processing. This raises a mutual exclusion problem when the list is assembled. Lock operations themselves are more time consuming than the list operations they protect, so the lock operations would become a bottleneck of a parallel Connection Builder.
- The Connection Builder is quite a large code with strong demands regarding maintainability. Its parallelization must neither require too many code changes nor complicate maintainability.
Section 4 describes dynamic loop scheduling methods as well as a parallel list that does not require locks for appending an element, which are used to deal with these problems.
21.3
THE PARALLELIZATION INTERFACE
The parallelization of the Connection Builder is based on the SMI (Shared Memory Interface) programming interface (Dormanns et al., 1997a). SMI has been developed to serve as the basis for comfortable and efficient parallelization on shared memory platforms, especially on SCI-interconnected cluster systems. The current implementation supports Windows NT and Unix (Solaris and Linux) operated PC clusters as well as Solaris operated Sun workstation clusters. It comes with a C as well as a Fortran binding and has already been used within several parallelization efforts (Dormanns et al., 1997; Dormanns, 1998; Tholen, 1998). SMI pays attention to the special demands of the SCI cluster platform. First of all, this concerns the process model. While other programming interfaces, e.g. POSIX threads (Kleiman et al., 1996) or OpenMP (Dagum et al., 1998), assume a single common address space, SMI has to deal with the situation that every compute node runs an individual instance of its operating system.
This fact prohibits a single common address space with threads that span the whole cluster. Instead, an SMI application consists of one or more processes per compute node, each containing one or more threads. The processes share just individual segments of memory that are accessible within the whole application, and execute in a SPMD-like fashion (Single Program Multiple Data). SMI offers several services to enable comfortable and efficient parallelization, e.g.:
- Allocation of globally shared memory regions with different physical distributions to account for the NUMA performance characteristic.
- Dynamic memory allocation within globally shared memory regions.
- Synchronization services (mutexes and barriers).
- Dynamic and adaptive loop scheduling.
To abstract even further from platform issues, PTL, the Parallel Template Library, has been developed on top of SMI with the aim of embedding the parallelization as transparently as possible into the framework of the class and template concepts of C++ (Stroustrup, 1997).
For the parallelization effort discussed here, we were faced with the requirement that the resulting parallel program should make sense for a SCI-coupled NUMA cluster environment as well as for a single-address-space symmetric multiprocessor operated by a single operating system instance. To account for this, the chosen execution model initially provides a single process with a single thread for each compute node. So if the application is executed on a symmetric multiprocessor, which is simply a single compute node with a couple of processors, just a single process is created. Program phases that really require parallelism (the really time consuming phases, in contrast to pre- and postprocessing phases that are executed redundantly by each process) are typically based on loops whose index ranges are partitioned for parallel processing. Using PTL's loop scheduling services, as many threads as there are processors are spawned inside each process when the loop is entered. All threads then make use of the PTL-encapsulated basic SMI loop scheduling facilities (Sinnen, 1997; Yan et al., 1997) to partition the work among each other. After the loop has been entirely processed, the threads are joined. The entire functionality is already implemented in SMI. To allow transparent usage for the parallelization of the Connection Builder, it has been further encapsulated in the PTL with the aid of C++'s template concept. The result is a simple-to-use iterator for this kind of loop that hides all parallelization, partitioning and load balancing issues from the user.
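The PTL iterator itself is not listed in this paper, so the following is only a stand-in sketch, in plain C++ threads, of how such a parallel loop over the markets could look. It uses a single shared counter with fixed-size chunks instead of SMI's per-process work queues with work stealing and decreasing granularity (described in section 4); all names (parallel_for_markets, processMarket, chunk) are hypothetical.

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Simplified self-scheduling loop: threads repeatedly fetch chunks of loop
// indices from a shared counter until the index range is exhausted.
void parallel_for_markets(std::size_t numMarkets,
                          const std::function<void(std::size_t)>& processMarket,
                          unsigned numThreads = std::thread::hardware_concurrency())
{
    if (numThreads == 0) numThreads = 1;
    std::atomic<std::size_t> next{0};          // shared loop counter
    const std::size_t chunk = 16;              // markets fetched per request (tuning knob)

    auto worker = [&]() {
        for (;;) {
            std::size_t begin = next.fetch_add(chunk);
            if (begin >= numMarkets) return;   // no work left
            std::size_t end = std::min(begin + chunk, numMarkets);
            for (std::size_t i = begin; i < end; ++i)
                processMarket(i);              // build itineraries for market i
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < numThreads; ++t) pool.emplace_back(worker);
    for (auto& t : pool) t.join();             // threads are joined after the loop
}

In the Connection Builder, processMarket(i) would assemble and filter all itineraries of the i-th market; the real PTL iterator additionally hides the process/thread model and the NUMA-aware work queues described in the next section.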
21.4
THE PARALLELIZATION
The parallelization strategy is to parallelize the outermost loop of the Connection Builder. This loop iterates over all markets. One of the challenges was to distribute the work evenly over all processors, because an even distribution of load yields the highest speed-up. For this purpose, PTL's/SMI's dynamic and adaptive loop scheduling facilities described above have been employed. These are based on individual work queues for all processes. Initially, the entire loop index range is distributed evenly among them. Each process, or thread within a process respectively, can then obtain load from its local work queue without any interference with others. Figure 21.2 gives an example of the operational principle of the loop scheduler. It is assumed that two processes (P1 and P2) operate in parallel to process the markets 1...8. Initially, the work queue of the first process contains the markets 1...4, while that of the second one contains the markets 5...8 (figure 21.2a). Iteratively, each process/thread fetches a fraction of work from its local work queue (figure 21.2b-d). If a process finds its work queue empty, it tries to get a piece of work from the work queue of the other process in order to balance the load (figure 21.2e-f). At the beginning, all processes fetch larger portions of their work queue at once to decrease loop scheduling overhead. The granularity decreases monotonically to allow a fine-grain load balance to be established at the end.

The most important data structures of the parallelized Connection Builder are the lists, especially those within each segment that contain references to itineraries. The Connection Builder creates these lists, which are used afterwards by other modules of Lufthansa Systems' flight-scheduling system SPEED. These lists have to be allocated in globally accessible shared memory. If normal list types were used (as is done in the sequential code), one would have to guarantee atomicity of the append operations with the aid of locks. Unfortunately, requesting a lock is substantially more expensive than appending a new item to the list. For this reason a parallel list data type has been developed. This type of list allows all threads (M per process) of all (N) processes to assemble a list in a shared memory region concurrently without synchronization, as long as nobody needs to inspect it during that time. Afterwards, separated by a single synchronization operation, all threads within all processes can concurrently execute all normal access operations on the list, e.g. reading or searching. To allow code reuse and to hide parallelization issues from the programmer, this data type, too, has been implemented fully transparently within the PTL using the template mechanism of C++, very much like C++'s Standard Template Library (STL) (Stroustrup, 1997). The operational principle of the new list type is relatively simple and is explained in figure 21.3.
[Figure 21.2: An example of the loop scheduler functionality. Six snapshots (a-f) of the work queues of processes P1 and P2 for markets 1...8 are depicted.]

[Figure 21.3: Structure of the parallel list (processes 1...N, each with M thread sublists).]
Hidden behind the user interface, the list is structured hierarchically into sublists. Each process maintains a copy of an array of references to the heads of the threads' partial lists. If a thread adds an item, it does so in its own sublist. To guarantee that this sublist is located in a shared memory segment that is physically placed on the thread's own compute node, the memory allocator has been overloaded with an appropriate PTL allocator that dynamically manages shared memory segments with the aid of SMI functionalities. The entire shared memory region used is concatenated from individual segments, one of which is physically located on each compute node. This allows the memory allocator to return a piece of memory to the requesting thread/process that is located locally and can therefore be accessed efficiently. There is no need for any synchronization to guarantee atomicity. If threads later search or inspect the list, its hierarchical structure is not visible. All common list functions, e.g. get the first element, get the next element, etc., are transparently offered by the PTL and mapped to the hierarchical list structure. If a thread wants to look up an item in the list, it first searches all sublists that are located locally. This saves expensive remote memory accesses. Only if the requested item is not found in that local part is the search continued in the other sublists. Since these list parts are located on other nodes, this may decrease performance. The list class also closely follows the STL list class, so that an experienced programmer will have no problems using the new class.
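As the PTL list class itself is not reproduced in the paper, the following stripped-down sketch only illustrates the principle of per-thread sublists: during the build phase each thread appends exclusively to its own sublist, so no locks are needed; after a synchronization point, readers traverse all sublists through a single interface. The class name, the explicit thread-id parameter and the use of std::list/std::vector are assumptions; the real PTL class additionally places each sublist in an SMI shared memory segment local to the owning node and mimics the STL iterator interface.

#include <cstddef>
#include <list>
#include <vector>

template <typename T>
class ParallelList {                      // hypothetical name, not the actual PTL class
public:
    explicit ParallelList(std::size_t numThreads) : sublists_(numThreads) {}

    // Build phase: thread 'tid' appends only to its private sublist,
    // so concurrent appends need no mutual exclusion.
    void append(std::size_t tid, const T& value) { sublists_[tid].push_back(value); }

    // Read phase (entered after all builders have synchronized, e.g. via a barrier):
    // visit every element; on a NUMA node the locally placed sublists would be visited first.
    template <typename Visitor>
    void for_each(Visitor visit) const {
        for (const auto& sub : sublists_)
            for (const auto& item : sub)
                visit(item);
    }

private:
    std::vector<std::list<T>> sublists_;  // one sublist head per thread; in the PTL these
                                          // live in shared segments allocated on the owning node
};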
All of the new classes developed for the parallelization (the loop iterator, the list class, the memory allocator class, etc.) are contained in the PTL. The interrelations are summarized in figure 21.4.

[Figure 21.4: Interrelations between the different programming libraries: the parallelized application uses the Parallel Template Library (PTL) and the Standard Template Library (STL); the PTL builds on the Shared Memory Interface (SMI).]
21.5
PERFORMANCE EVALUATION

21.5.1
PLATFORMS
ALR Revolution 6x6. The ALR Revolution 6x6 (ALR, 1996) is a more or less traditional low-cost bus-based UMA multiprocessor, based on Intel's Pentium Pro processor running at 200 MHz. Originally, the bus and processor specifications allow only four-way systems to be built; the limiting factor is the two-bit round-robin bus arbitration scheme. ALR extended this by using two individual 3-way processor modules that together appear as a single super-processor. To meet the requirements of the increased number of processors, the central memory system has been enhanced to be four-way interleaved. The ALR machine was operated under Microsoft's Windows NT 4.0 (server version).

HP/Convex SPP1200. The configuration of the HP/Convex SPP1200 (Abandah et al., 1998) that was at our disposal at Lufthansa Systems for this work constitutes a high-end UMA multiprocessor. The eight processors (PA-RISC 7200, 120 MHz) are assembled in modules of two. Each module possesses a port to a four-by-four cross-bar interconnect that connects the processors to four memory modules. This allows memory accesses from all processor groups to be served at the same time without any collisions, as long as they refer to different memory modules. Such an eight-processor cabinet is the basic module for larger configurations, which are clustered with a SCI network to build larger NUMA multiprocessors. The SPP1200 is no longer the top-of-the-line model, regarding both memory system and processor performance. Its successor (Abandah et al., 1998) is equipped with PA-RISC 8200 processors with a SPECint95 performance of 15.5 (at 220 MHz), compared to only 6.4 for the PA-RISC 7200. The floating-point performance is not of interest for the specific type of application under consideration. The operating system of the HP/Convex SPP1200 at our disposal was SPP-UX, an appropriately adapted HP-UX.
SCI-interconnected NUMA PC Cluster. The SCI-interconnected NUMA PC cluster has been assembled out of dual-processor Pentium Pro (200 MHz) compute nodes. Among the alternatives, ranging from single-processor to quad-processor machines as compute nodes, dual-processors show the best price/performance ratio (Paas et al., 1997). The cluster at our disposal consists of six of these nodes, clustered with PCI-SCI adapters from the Norwegian company Dolphin (Dolphin, 1996). An access to local memory, as is performed on a cache miss, takes about 250 ns; the sustained memory bandwidth within a compute node is about 80 MByte/s (according to the STREAM benchmark (McCalpin, 1995)). These are the comparative figures against which to assess SCI remote memory performance. A single remote read operation takes about 4.5 µs, a remote write operation about 2.5 µs, which is about one order of magnitude more than the local memory access time. To increase the performance in the case that larger fractions of remote memory are accessed consecutively, Dolphin implemented so-called stream buffers, 8 for read and 8 for write transactions, each 64 byte wide. On a read, not just the single requested datum is fetched, but the entire 64-byte block that contains it. Subsequent reads to data within that block can then be served more efficiently from the contents of the stream buffer without another remote network transaction. When data is written consecutively, the stream buffers are exploited to combine several individual writes into a single larger transaction. Furthermore, supported by the processors' capability to deal with several outstanding writes, the set of stream buffers can be exploited to serve the writing of even larger memory sections in a pipelined way. The resulting maximum read bandwidth is 8 MByte/s and the write bandwidth 28 MByte/s in the configuration used for the performance evaluation of the Connection Builder. It should be noted that with newer device driver versions and more suitable bridges between the system bus and the PCI bus, write bandwidths of up to 80 MByte/s are obtainable. The latency for small amounts of data as well as the bandwidth for larger amounts of data are quite impressive compared, e.g., to Fast Ethernet or ATM. The SPECint95 performance of the Pentium Pro processors is 8.2, which is about 1/3 higher than that of the SPP1200 processors. The Pentium Pro is also no longer the top-of-the-line Intel processor. Currently, a 400 MHz Pentium II processor is capable of 15.8 SPECint95, which compares very well to the 15.5 of the latest 220 MHz PA-RISC 8200.
[Figure 21.5: Speed-up on the SCI-interconnected NUMA PC cluster (speed-up vs. number of processes/threads; curves: linear, Amdahl's law, one thread/node, two threads/node).]

21.5.2
RESULTS
Figures 21.5, 21.6 and 21.7 show the speed-up of the Connection Builder on the three evaluation platforms. Besides the actual speed-up, the maximum achievable speed-up according to Amdahl's law is depicted, taking into account that a fraction of the program is inherently sequential, e.g. the initial data loading phase.

For the SCI-interconnected NUMA PC cluster, two different speed-up graphs are presented in figure 21.5. The higher one, which ranges across a degree of parallelism from 1 to 4, reflects the case that only one process with one thread is executed on each machine. The other one, which ranges from 1 to 8 threads, depicts the case that 2 threads are executed on each machine. In both cases, the scalability is quite good. The performance in the case of two threads per compute node is slightly worse due to memory access contention.

The scalability of the HP/Convex is not as good as that of the PC cluster. Although this machine possesses a high performance memory system with a sophisticated (and expensive) cross-bar to interconnect processors and memory, contention at the memory system limits the scalability. Even worse is the scalability of the Connection Builder on the ALR Revolution, as depicted in figure 21.7: using 6 processors, the speed-up is already lower than with 5 processors. This is not surprising, as the memory system and its interconnect are less powerful than those of the HP/Convex, while the processors are more powerful at the same time.

Figure 21.8 summarizes the results. Neither the HP/Convex SPP1200 UMA multiprocessor nor the ALR Revolution 6x6 is able to outperform the PC cluster in terms of scalability. The results show that completely decoupling memory accesses, as long as shared data is not involved, pays off and lets the NUMA cluster outperform both UMA multiprocessors.
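For reference, the Amdahl bound plotted in the figures follows the standard form below, where s denotes the inherently sequential fraction of the run time (dominated here by the initial data loading phase; its value is not quantified in the paper) and p the number of processes/threads:

\[
  S(p) \;=\; \frac{1}{\,s + \dfrac{1-s}{p}\,} \;\le\; \frac{1}{s}
\]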
[Figure 21.6: Speed-up on the HP/Convex SPP1200 (speed-up vs. number of threads; curves: linear, Amdahl's law, measured speed-up).]

[Figure 21.7: Speed-up on the ALR Revolution 6x6 (speed-up vs. number of threads; curves: linear, Amdahl's law, measured speed-up).]

[Figure 21.8: Summary of the speed-up curves for the different platforms (ALR Revolution 6x6, HP/Convex SPP1200, SCI NUMA cluster, linear).]
Table 21.1 Sequential run time on the different platforms and their processing power index.

System            seq. run time   normalized seq. run time   SPECint95   normalized reciprocal SPECint95 index
SCI cluster            577.7                1.00                 8.2                    1.00
HP/Convex SPP          860.2                1.49                 6.4                    1.28
ALR Revolution         502.1                0.87                 8.7                    0.94
However, scalability is just one aspect; the other is the absolute execution time. Table 21.1 states the sequential run times. The slight difference between the PC cluster and the ALR results from different second-level cache sizes (256 vs. 512 KByte). Altogether, not only in terms of scalability but also in terms of absolute run time, the SCI-interconnected PC cluster performs best. Although the PA-RISC 7200 processor is just 28% slower than the Intel Pentium Pro when comparing the relevant SPECint95 indices, its application performance is even 49% worse.

During the parallelization, some peculiarities have been observed that are worth mentioning:

The original sequential code contained a huge amount of fine-granular file output. This turned out to be a serious bottleneck for thread parallelization. Although all threads use separate files for output, and it should therefore be possible to perform output without any logical contention, SPP-UX as well as Windows NT ensure that output under these circumstances is done under mutual exclusion by locking. This is not such a huge problem on the SCI cluster, as just two threads per process are used, but it resulted in an enormous performance degradation on the ALR and HP/Convex systems, which exploit up to 6 and 7 threads, respectively. The solution was to implement an intermediate output stream layer that combines several individual outputs of each stream into fewer larger ones and therefore requires far fewer lock operations (a sketch of this idea is given below).
Programs compiled with Microsoft's Visual C++ compiler (version 5 has been used) and linked with the multi-threaded standard libraries perform about 5% worse in the purely sequential case than when linked with the single-threaded libraries. In the sequential case, all performance figures cited above are therefore based on the single-threaded version.
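The intermediate output stream layer itself is not shown in the paper; the sketch below merely illustrates the idea of accumulating many fine-granular outputs and forwarding them to the file in large blocks, so that whatever per-write locking the operating system or runtime applies is paid far less often. The class name, the buffer threshold and the std::ofstream backend are illustrative assumptions.

#include <cstddef>
#include <fstream>
#include <string>

// Buffers many small outputs and forwards them to the underlying file in large
// blocks, amortizing the mutual exclusion applied by the OS/runtime per write.
class BufferedOutput {
public:
    BufferedOutput(const std::string& path, std::size_t flushThreshold = 64 * 1024)
        : out_(path), threshold_(flushThreshold) { buffer_.reserve(threshold_); }

    void write(const std::string& text) {
        buffer_ += text;
        if (buffer_.size() >= threshold_) flush();   // one big write instead of many small ones
    }

    void flush() {
        if (!buffer_.empty()) { out_ << buffer_; buffer_.clear(); }
    }

    ~BufferedOutput() { flush(); }

private:
    std::ofstream out_;
    std::string   buffer_;
    std::size_t   threshold_;
};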
21.6
CONCLUSIONS
The presented results are interesting in several ways. First of all, it could be validated that expensive dedicated parallel systems do not automatically show performance advantages compared to ordinary PC systems. This also holds if the presented performance figures are projected onto the latest hardware available on both sides. Workstations definitely still show performance advantages for floating-point intensive applications, but the type of workload considered here rarely exploits this.

Not just for the specific workload considered, it became clear that the performance of the memory system is becoming more and more important. While memory system performance is quite easy to assess and quantify for single-processor systems (McCalpin, 1995), it is much more difficult for parallel systems. In line with the common trend in parallel computer architecture, it was shown that a NUMA system architecture fulfills the demands best. It was not really a great surprise that the Connection Builder did not scale well on the ALR. However, we did not expect that the SCI-based PC cluster could outperform the HP/Convex SPP1200, with its high performance memory system, in terms of scalability. The PCs employed as compute nodes are known not to possess a really high performance memory system. The latest PC systems show not only improvements in processor performance but also considerable improvements regarding their memory systems.

Memory coupling of PCs with a SCI interconnect has proved to be a meaningful approach within several parallelization efforts (Dormanns et al., 1997; Dormanns, 1998; Tholen, 1998). Our experience furthermore shows that a shared memory programming model on such a platform is also a meaningful approach, although SCI is often used just as a fast medium for message passing (Huse et al., 1997; Eberl et al., 1998). Employing the Shared Memory Interface (SMI) and the Parallel Template Library (PTL) on top of it, an efficient parallel version of the Connection Builder has been developed in which parallelism as well as platform peculiarities are hidden from the programmer, and which is therefore as simple to maintain as the original version.
References

Abandah, G. A.; Davidson, E. S. (1998). Characterizing Shared Memory and Communication Performance: A Case Study of the Convex SPP-1000. IEEE Trans. on Par. and Distrib. Systems, Vol. 9, No. 2, pp. 206-216.
Abandah, G. A.; Davidson, E. S. (1998). Effects of Architectural and Technological Advances on the HP/Convex Exemplar's Memory and Communication Performance. Proc. 25th Int. Symp. on Comp. Architecture (ISCA), ACM Comp. Arch. News, Vol. 26, No. 3, pp. 318-329.
ALR (1996). ALR Revolution 6x6 Six-way SMP Architecture. White Paper.
Dagum, L.; Menon, R. (1998). OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Comp. Sci. & Engin., Vol. 5, No. 1, pp. 46-55.
Dolphin Interconnect Solutions (1996). PCI-SCI Cluster Adapter Specification. White Paper, http://www.dolphinics.no
Dormanns, M.; Sprangers, W.; Ertl, H.; Bemmerl, T. (1997). Performance Potential of an SCI Workstation Cluster for Grid-Based Scientific Computing. Proc. High Perf. Computing (HPC), pp. 226-231.
Dormanns, M.; Sprangers, W.; Ertl, H.; Bemmerl, T. (1997a). A Programming Interface for NUMA Shared-Memory Clusters. Proc. High Performance Computing and Networking (HPCN), pp. 608-707, LNCS 1225, Springer.
Dormanns, M. (1998). Shared-Memory Parallelization of the GROMOS96 Molecular Dynamics Code on a SCI-Coupled NUMA Cluster. Proc. SCI-Europe Conference (held as a stream of EMMSEC'98), pp. 111-122, France.
Eberl, M.; Hellwagner, H.; Karl, W.; Leberecht, M.; Weidendorfer, J. (1998). Fast Communication Libraries on a SCI Cluster. Proc. SCI-Europe Conference (held as a stream of EMMSEC'98), pp. 159-164, France.
Huse, L. P.; Omang, K. (1997). Large Scientific Calculations on Dedicated Clusters of Workstations. Int. Conf. on Par. and Distr. Systems, Euro-PDS.
IEEE (1992). ANSI/IEEE Std. 1596-1992, Scalable Coherent Interface (SCI).
Kleiman, S.; Shah, D.; Smaalders, B. (1996). Programming with Threads. Prentice Hall.
McCalpin, J. D. (1995). A Survey of Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE TCCA Newsletter, Dec.
Mukherjee, S. S.; Hill, M. D. (1998). Making Network Interfaces Less Peripheral. IEEE Computer, Vol. 31, No. 10, pp. 70-76.
Paas, S. M.; Dormanns, M.; Bemmerl, T.; Scholtyssik, K.; Lankes, S. (1997). Computing on a Cluster of PCs: Project Overview and Early Experiences. 1st Workshop Cluster-Computing, published as Technical Report CSR-9705, Dept. of Comp. Science, TU Chemnitz, Germany, pp. 217-229.
Pfister, G. F. (1995). In Search of Clusters. Prentice Hall.
Sinnen, O. (1997). Loop-Scheduling and -Splitting Methodologies on NUMA Multiprocessors. Diploma thesis (in German), Lehrstuhl für Betriebssysteme, RWTH Aachen.
Stroustrup, B. (1997). The C++ Programming Language, Third Edition. Addison-Wesley.
Tholen, S. (1998). Studies of the Parallelization of Simulation Algorithms for Room Acoustics. Diploma thesis (in German), Lehrstuhl für Betriebssysteme, RWTH Aachen.
Yan, Y.; Jin, C.; Zhang, X. (1997). Adaptive Scheduling Parallel Loops in Shared Memory Systems. IEEE Trans. on Par. and Distrib. Systems, Vol. 8, No. 1, pp. 70-81.