Combining Workstations and Supercomputers to Support Metacomputing Applications: The Parallel Tomography Experience

Walfredo Cirne, Jaime Frey, Shava Smallen, Francine Berman, Rich Wolski, Steve Young, Mark Ellisman, Mei-Hui Su, Carl Kesselman

Affiliations: Walfredo Cirne, Shava Smallen, and Francine Berman are with the Department of Computer Science and Engineering, University of California San Diego (http://www-cse.ucsd.edu/groups/hpcl/apples/index.html); Jaime Frey, Steve Young, and Mark Ellisman are with the National Center for Microscopy and Imaging Research (http://www-ncmir.ucsd.edu); Rich Wolski is with the Department of Computer Science, University of Tennessee (http://www.cs.utk.edu/ rich); Mei-Hui Su and Carl Kesselman are with the Information Sciences Institute, University of Southern California (http://www.isi.edu).

This research was supported in part by NSF Grant ASC-9701333, DoD Modernization contract 9720733-00, NPACI/NSF award ASC-9619020, NIH/NCR grant RR04050, CAPES grant DBE2428/95-4, and Advanced Research Projects Agency/ITO under contract #N6600197-C-8531.

June 7, 1999

Abstract
Resource-intensive parallel applications are typically developed to target either workstation clusters or parallel supercomputers. In this paper, we describe a strategy that leverages the advantages of both types of platform to improve the performance of a parallel tomography application. We demonstrate that this strategy delivers superior application performance compared to strategies that target the application to parallel supercomputers alone or to workstation clusters alone.
1 Introduction

Both parallel supercomputers and clusters of workstations are being used successfully to execute parallel applications. Parallel supercomputers offer fast execution performance but are typically space-shared: jobs wishing to execute must wait in a batch queue until a sufficient number of the machine's processors become available. Once processors are allocated, the execution time of the application may be short, but the queueing delay before execution can begin is often lengthy. Interactive (time-shared) workstation clusters offer immediate access
to resources, but execution times are often slower than on parallel supercomputers since the machines and networks are shared by other competing processes. Few applications currently exploit both the fast execution (but potentially long queue delay) of parallel supercomputers and the immediate access of interactive clusters.

In this paper, we explore a strategy for exploiting both supercomputer and interactive platforms simultaneously to achieve good execution performance for distributed parallel applications. We demonstrate this strategy for a parallel tomography application (PTomo) developed by Mark Ellisman, Steve Lamont, Steve Young, Jaime Frey and others at the National Center for Microscopy and Imaging Research (NCMIR). We show that a simple and effective scheduling strategy (based on the identification and use of immediately available resources) improves the turn-around time (the wall-clock time measured from when the application is initiated by its user until all of the results are available) of PTomo on a set of distributed resources that includes both an interactive cluster and a parallel supercomputer. We believe that this strategy will be effective for other coarse-grained parallel applications as well.

PTomo is a real-life, non-trivial, embarrassingly parallel application used at NCMIR to perform 3-D reconstruction from a series of images produced by their 400 KeV Intermediate High Voltage Electron Microscope (IVEM). As is the case with many laboratories, NCMIR owns about a dozen workstations (which are used both as desktop machines and as a platform for parallel processing) and has access to supercomputer time (at the San Diego Supercomputer Center, SDSC). NCMIR's research biologists using parallel tomography were satisfied with the execution-time performance they achieved on the parallel supercomputers at SDSC, but their programs often experienced long delays waiting for adequate numbers of nodes to become available. On the other hand, their workstations delivered immediate but slower execution.

We developed an AppLeS application scheduler [BWF+96] for PTomo which targets both parallel supercomputers and interactive workstation clusters to achieve good application execution performance. The version of PTomo that we used was developed by NCMIR researchers, adapted to use Globus by Mei-Hui Su, Carl Kesselman, and other Globus project members [vLSI+99], and modified by Walfredo Cirne, Jaime Frey, Shava Smallen, Fran Berman and Rich Wolski to run in both clustered workstation and SP2 environments by using AppLeS scheduling directives.

We performed a set of experiments in a multi-user production environment, and found that the best overall scheduling strategy for the PTomo AppLeS was a hybrid strategy which targeted application tasks simultaneously to the immediately available SP2 nodes and the interactive workstation cluster. Our experiments show that this scheduling strategy improves the execution performance of the application over strategies which target the application to the interactive cluster alone or to the parallel supercomputer alone.

The next section describes the distributed PTomo application in detail. Section 3 describes our strategy for scheduling PTomo over workstations and supercomputers. Section 4 presents the results of comparing our strategy against
other alternatives. Section 5 concludes the paper and discusses future work.
2 Parallel Tomography

Tomography allows for the reconstruction of the 3-D structure of an object based on 2-D projections through it taken at different angles. It is used in many fields, including radio astronomy, geology, oceanography, materials science, and medical imaging (CAT and PET scans). At NCMIR, it is used for electron microscopy. Biological specimens at the cellular and sub-cellular level are viewed with the center's 400 KeV Intermediate High Voltage Electron Microscope and their images recorded at a number of different angles. These images are then aligned and reconstructed into 3-D volumes using analytic and iterative tomographic techniques [PRY+97]. Reconstructing a typically sized volume using a simple algorithm (filtered back-projection) takes several hours on a workstation.

Tomography researchers are interested in increasing the computation speed for two reasons. First, they want to make use of more elaborate tomographic algorithms, which produce more refined 3-D volumes. These algorithms are more computationally intensive than the algorithms currently used. Second, NCMIR is interested in on-line tomography, i.e., producing a rendered volume while the biologist is still collecting data on the microscope. This provides immediate feedback about the specimen being viewed and thus may prompt the researcher to change the experiment as a whole, or just some of its parameters (e.g., orientation and/or number of projections). For this to be useful, a rough reconstruction would have to finish in 5 to 10 minutes. No single processor can achieve this presently, leading NCMIR researchers to exploit parallelism.

In the PTomo application, specimens are only rotated about a single axis as images are acquired. Thus, for any slice orthogonal to the axis of rotation, all information for that slice will fall onto a single line in all of the projections (see Figure 1). Most importantly, any such slice can be reconstructed independently of projection information for the rest of the volume. This makes the reconstruction embarrassingly parallel.

Figure 1: Projection geometry relating to a single-axis tilting experiment. Each image can be thought of as being composed of strips perpendicular to the tilt axis. Each strip is the projection of a slice. Thus, the 3-D reconstruction problem can be reduced to a series of 2-D reconstructions. (From [FR86].)

Our version of PTomo uses Globus services to run over a heterogeneous, distributed environment. There are four types of application processes: driver, reader, writer, and ptomo. The driver controls a work queue: it assigns one slice to a free ptomo until no more slices remain. The driver is invoked by the user and starts up the other processes. The reader and writer are I/O processes and hence usually have direct access to the user file system. The reader reads input files off the disk and sends them to the ptomos for processing. The writer receives output files from ptomos and writes them to disk. The ptomo receives input files from a reader, does all the computational work, and sends output to a writer. In this study, we use one reader, one writer, and any number of ptomos, although other process configurations are supported by our implementation. Due to the multi-threaded nature of Globus' Nexus communications library, one reader can service I/O requests for many ptomos simultaneously (the same applies for the writer). Figure 2 depicts the structure of PTomo.
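To make the work-queue decomposition concrete, the following is a minimal sketch in Python, with threads and in-memory queues standing in for the distributed Globus processes of the real implementation. The function names, slice count, and simulated data are hypothetical placeholders, not NCMIR's code; the point is only the driver logic of handing one slice at a time to a free ptomo and returning a slice to the queue if its ptomo fails (the fault-recovery behavior described in Section 3).

    # Sketch of PTomo's driver/reader/ptomo/writer decomposition.
    # Threads and in-memory queues stand in for the distributed Globus
    # processes of the real application; slice I/O and reconstruction
    # are simulated with placeholder strings.
    import queue
    import threading

    NUM_SLICES = 16   # slices orthogonal to the tilt axis (hypothetical count)
    NUM_PTOMOS = 4    # compute processes (workstation and/or SP2 nodes)

    work_queue = queue.Queue()    # driver's queue of slices still to be done
    output_queue = queue.Queue()  # ptomo -> writer

    def load_slice(i):
        # reader: would fetch the projection strips for slice i from disk
        return "projections-for-slice-%d" % i

    def reconstruct_slice(data):
        # ptomo: would run filtered back-projection or an iterative method
        return "volume-slice(%s)" % data

    def ptomo_worker():
        while True:
            i = work_queue.get()
            if i is None:          # sentinel: no more slices
                return
            try:
                output_queue.put((i, reconstruct_slice(load_slice(i))))
            except Exception:
                work_queue.put(i)  # a failed slice goes back to the work queue

    def writer(expected):
        # writer: would store each reconstructed slice to disk
        for _ in range(expected):
            i, result = output_queue.get()
            print("stored slice %d: %s" % (i, result))

    if __name__ == "__main__":
        for i in range(NUM_SLICES):
            work_queue.put(i)
        workers = [threading.Thread(target=ptomo_worker) for _ in range(NUM_PTOMOS)]
        for w in workers:
            w.start()
        writer(NUM_SLICES)        # blocks until every slice has been stored
        for _ in workers:
            work_queue.put(None)  # tell the ptomos to exit
        for w in workers:
            w.join()

Because the slices are independent, this on-demand assignment naturally balances load across ptomos of very different speeds, which is what allows workstations and supercomputer nodes to be mixed in the same run.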
3 Scheduling PTomo

Generally speaking, the set of potential resources available consists of workstations w_1, w_2, ... and supercomputers s_1, s_2, .... A request to run a process p on a workstation w causes p to start immediately, but p time-shares the workstation with other processes. To use a supercomputer s, one has to specify how many processors n are to run p and for how long t. The n copies of p do not necessarily start immediately; they might wait in the queue until n nodes become available for t seconds. In practice, queue times may range from seconds to days. However, supercomputer executions run over dedicated resources once they are acquired (through space-sharing).

Figure 2: Application components of PTomo (driver, reader, writer, and one or more ptomos). Solid lines represent transfer of input and output. Dotted lines denote control connections.

An issue for the application scheduler is how to partition the program between supercomputers and available workstations. Since PTomo is a coarse-grained application, we can use a simple work-queue strategy that assigns work on demand. Also, the application scheduler has to determine performance-efficient values of n and t for each supercomputer that is going to be used. Large values of n mean faster execution time, but may also increase the queue time. Small values of n reduce the queue time, but imply slower execution. Our goal is to select n and t in a way that minimizes PTomo's turn-around time (i.e., the time elapsed between submission and completion: the sum of queue wait-time and execution time in the batch case). However, the difficulty of predicting supercomputer queue time makes it difficult to find the optimal n and t. (For some efforts in predicting supercomputer execution and queue-wait times, see [Dow97, STF99, Gib97].)

We avoid this problem by simply using the nodes that are currently idle. We assume that the supercomputer scheduler provides us with the maximum values of n and t for which execution will begin immediately. Therefore, the AppLeS application scheduler can make a well-founded choice for n and t. Our implementation uses the showbf command supplied by the Maui Scheduler [Mau] that controls SDSC's SP2. In order to access showbf remotely, we created a Globus wrapper for it, called queueprobe. Queueprobe can be used from any Globus machine to query the Maui Scheduler for the maximum request that will run immediately.

Note that the nodes immediately available on the SP2 may not be available for the full duration of the application. Therefore, the AppLeS application scheduler has to cope with ptomo processes that detach themselves from the application before execution has completed. We have added a fault-recovery mechanism to PTomo, which enables us to treat this problem as a ptomo failure. Whenever a ptomo fails, the slice it was processing is returned to the work queue. We can use such a simple scheme because processing a slice has no side effects. The advantage of reducing this problem to fault recovery is, of course, that it also covers real faults.
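The resource-selection step can be summarized with a short sketch. It is not the AppLeS implementation: query_immediate_capacity stands in for the queueprobe wrapper around Maui's showbf, and its return format, the host names, and the Allocation record are assumptions made for illustration. The idea it captures is that the scheduler requests exactly the SP2 nodes that would start with no queue wait, adds the interactive workstations, and relies on the fault-recovery mechanism above when the SP2 allocation expires mid-run.

    # Sketch of the SP2Immed/WS resource-selection step (assumed interfaces).
    from dataclasses import dataclass

    @dataclass
    class Allocation:
        host: str
        processes: int
        wall_seconds: int   # 0 means interactive (no expiration)

    def query_immediate_capacity(sp2_host):
        """Return (n, t): the largest request that would start immediately.
        In the real system this comes from queueprobe, the Globus wrapper
        around the Maui Scheduler's showbf command; the value here is fake."""
        return (12, 1800)   # e.g., 12 nodes for 30 minutes

    def plan_ptomos(sp2_host, workstations):
        plan = []
        n, t = query_immediate_capacity(sp2_host)
        if n > 0:
            # Use exactly the nodes that are free right now, so the batch
            # request incurs no queue wait; ptomos on these nodes may be
            # lost when t expires, which fault recovery treats as failures.
            plan.append(Allocation(sp2_host, n, t))
        # Interactive workstations are always used: time-shared and slower,
        # but available immediately and for the whole run.
        plan.extend(Allocation(ws, 1, 0) for ws in workstations)
        return plan

    if __name__ == "__main__":
        for a in plan_ptomos("sp2.example.edu", ["ws1", "ws2", "ws3"]):
            print(a)

The design choice worth noting is that no queue-wait prediction appears anywhere: the only information required from the resource scheduler is the size of the request that backfilling can start right now.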
4 Experimental Results

In order to evaluate how well the PTomo AppLeS strategy (which we denote as SP2Immed/WS) performs, we compared it against other possible scheduling strategies: using workstations only (WS), using only the nodes that are immediately available on the SP2 (SP2Immed), and requesting a predetermined number of nodes on the SP2 and possibly waiting for them in the queue (SP2Queue). WS and SP2Queue are, respectively, the standard ways to use a cluster of workstations and a parallel supercomputer. SP2Immed represents a newer way to use a parallel supercomputer that is supported by the Maui Scheduler. SP2Immed/WS, the strategy described in the previous section, combines workstations and a supercomputer to improve application performance.

Note that it is problematic to design experiments that compare all scheduling strategies under the same load and queue conditions in real multi-user production environments. In such environments, the load and availability of resources change over time, so reproducing the same ambient load conditions is not an option. For our experiments, we performed runs of SP2Immed, SP2Immed/WS, WS, and SP2Queue back-to-back, hoping that experiments within the same set would enjoy roughly similar load conditions. Moreover, we monitored the number of free nodes on the SP2 and used this information to discard any execution set in which the nodes available to SP2Immed/WS and SP2Immed differed by more than two. That was the case in 17 (out of 51) experiment sets, so we ended up with 34 valid experiment sets. The final version of this paper will report on additional experiments.

There are two other details to note in the design of our experiments. First, when we use only the resources immediately available on the SP2, it might be that there are no nodes available to execute the application. In this case, we do not run the set of experiments until the necessary resources become available. This happened 6 times out of 34 attempts. Figure 3 shows the distribution of
how many submissions were necessary. We did not account for the time spent on resubmissions for SP2Immed; notice that, by excluding the retry time, we present optimistic turn-around times for the SP2Immed method. A user relying on this method would have experienced longer delays.

Figure 3: Number of SP2Immed submissions (distribution of the number of resubmissions over the SP2Immed runs).

Second, we needed to decide on n and t when we used the SP2 in the traditional way (i.e., SP2Queue). We conducted benchmarks on the SP2 to determine the processing time of one slice. This enabled us to conservatively determine t given n (a conservative estimate is needed because a job is killed when its execution exceeds t). Note that determining the best n requires accurate queue prediction. Since such predictions are often not available, we rotated among values of n likely to be used by PTomo users: 8, 16, and 32 nodes.

We performed sets of experiments for each number of nodes that might be requested for the SP2 in SP2Queue (i.e., n = 8, 16, or 32). Figures 4-6(a) show how the different strategies (WS, SP2Immed/WS, SP2Immed, and SP2Queue) performed in each set of experiments. The number of nodes SP2Immed and SP2Immed/WS acquired in each set of experiments is shown in Figures 4-6(b). The SP2Immed/WS strategy yielded the best performance in all cases except one (Figure 5, run 8), and in that one case the difference between SP2Immed/WS and the strategies that beat it was small. In addition, the SP2Immed/WS strategy exhibited much less variance than SP2Queue and SP2Immed. As expected, SP2Queue had the greatest variance. While its turn-around time was sometimes close to that of SP2Immed/WS (544s for SP2Queue vs. 601s for SP2Immed/WS for the one time it beat SP2Immed/WS), its worst time was almost two orders of
magnitude greater than that of SP2Immed/WS (27480s for SP2Queue vs. 369s for SP2Immed/WS).
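As an illustration of the conservative choice of t given n described above for SP2Queue, the following short calculation shows one plausible way to derive a wall-clock request from a per-slice benchmark. The slice count, per-slice time, and safety factor are hypothetical; the paper states only that t was determined conservatively from SP2 benchmarks of a single slice.

    # Hypothetical illustration of picking a conservative wall-clock limit t
    # for an n-node SP2Queue request from a per-slice benchmark.
    import math

    def conservative_wallclock(num_slices, n, slice_seconds, safety=1.5):
        """Request enough time that the job is not killed even if slices run
        somewhat slower than the benchmark suggests (safety > 1)."""
        passes = math.ceil(num_slices / n)   # slices per node, rounded up
        return int(math.ceil(passes * slice_seconds * safety))

    if __name__ == "__main__":
        for n in (8, 16, 32):                # the node counts rotated in SP2Queue
            t = conservative_wallclock(num_slices=256, n=n, slice_seconds=30.0)
            print("n = %2d nodes -> request t = %d seconds" % (n, t))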
5 Discussion and Conclusions

In this work, we show how to combine workstations and supercomputers to run PTomo, a real-life, coarse-grained parallel application. Our solution automatically selects all resources immediately available across the system. We leverage the Maui Scheduler on SDSC's SP2 (a commonly used production parallel machine) to obtain information on immediately available nodes. This strategy has the advantage of not requiring predictions of how long requests wait in the supercomputer queue. Our results show that the PTomo AppLeS scheduling strategy consistently outperforms three strategies researchers might use for scheduling in an archetypal laboratory setting where the local resources include a cluster of workstations and researchers also have access to supercomputer time.

The problem of scheduling large distributed applications continues to be investigated both for applications which execute in batch-scheduled metacomputing environments (e.g., [BDG+98, MFS94, WK94]) and for interactive environments (e.g., [AMR98, BWF+96, SBWS99, AYIE98, Wei98]). The work reported herein extends the target domain for PTomo by targeting both parallel supercomputers and interactive resources simultaneously.

We have learned three interesting lessons about metacomputing in general as a result of this effort. First, the interface exported by the resource scheduler has great impact on application schedulers. In fact, we were able to implement our strategy in a very straightforward manner thanks to the Maui Scheduler's showbf command. On the other hand, the Maui Scheduler (as with other supercomputer schedulers, for that matter) precluded us from trying something more sophisticated, due to the difficulty of predicting queue times for supercomputer requests. Second, evaluating solutions for real applications running over production environments has proven to be difficult due to the impossibility of reproducing system load and queue conditions for comparison runs. Others have encountered the same problem; indeed, a simulation environment specifically targeted toward metacomputing, such as the evolving Bricks project [TMN+99], would be very useful. Third, fault tolerance is likely to be even more important in metacomputing than it is in parallel computation. For our solution in particular, fault recovery was a natural way to deal with the time expiration of SP2 requests. In general, using autonomous and distributed resources increases the chance that some component of the application will fail.

Note that our scheduler could be improved in several ways. First, we plan to extend our strategy to add processors that become available (both interactive and on the SP2) during execution. In addition, we plan to consider scaled versions of the program in which a single reader or writer is likely to become a bottleneck. We are developing a strategy that determines how many readers
and writers are needed to avoid contention. Finally, we intend to integrate the AppLeS PTomo scheduler with larger end-to-end telemicroscopy applications of which the tomography reconstruction presented here is a fundamental part.
Dedication and Acknowledgements

This paper is dedicated to Steve Young, who was a cornerstone of this work. Steve was a respected scientist and a treasured friend, and we will miss him greatly. We are grateful to the NPACI partnership and SDSC researchers for their assistance with this work. We benefited greatly from discussions with the AppLeS team, the Globus team, and our colleagues at NCMIR.
References

[AMR98] Alessandro Amoroso, Keith Marzullo, and Aleta Ricciardi. Wide-area Nile: A case study of a wide-area data-parallel application. In ICDCS'98 International Conference on Distributed Computing Systems, May 1998.

[AYIE98] D. Andersen, Tao Yang, O. Ibarra, and O. Egecioglu. Adaptive partitioning and scheduling for enhancing WWW application performance. Journal of Parallel and Distributed Computing, 49:57-85, Feb 1998.

[BDG+98] S. Brunett, D. Davis, T. Gottschalk, P. Messina, and C. Kesselman. Implementing distributed synthetic forces simulations in metacomputing environments. In Proceedings of the Heterogeneous Computing Workshop, March 1998.

[BWF+96] Francine Berman, Richard Wolski, Silvia Figueira, Jennifer Schopf, and Gary Shao. Application level scheduling on distributed heterogeneous networks. In Proceedings of Supercomputing 1996, 1996.

[Dow97] Allen Downey. Using queue time predictions for processor allocation. In 3rd Workshop on Job Scheduling Strategies for Parallel Processing, in conjunction with IPPS'97, 1997.

[FR86] J. Frank and M. Radermacher. Three-dimensional reconstruction of nonperiodic macromolecular assemblies from electron micrographs. In J. K. Koehler, editor, Advanced Techniques in Biological Electron Microscopy III. Springer-Verlag, 1986.

[Gib97] Richard Gibbons. A historical application profiler for use by parallel schedulers. In 3rd Workshop on Job Scheduling Strategies for Parallel Processing, in conjunction with IPPS'97, 1997.

[Mau] Maui webpage at http://wailea.mhpcc.edu/maui/.

[MFS94] C. R. Mechoso, J. D. Farrara, and J. A. Spahr. Running a climate model in a heterogeneous, distributed computer environment. In Proceedings of the 3rd IEEE International Symposium on High Performance and Distributed Computing, pages 79-84, August 1994.

[PRY+97] G.A. Perkins, C.W. Renken, S.J. Young, S.P. Lamont, M.E. Martone, S. Lindsey, T.G. Frey, and M.H. Ellisman. Electron tomography of large multicomponent biological structures. J. Struct. Biol., 120:219-227, 1997.

[SBWS99] Alan Su, Francine Berman, Richard Wolski, and Michelle Mills Strout. Using AppLeS to schedule a distributed visualization tool on the computational grid. International Journal of Supercomputer and High-Performance Applications, 1999.

[STF99] Warren Smith, Valerie Taylor, and Ian Foster. Using run-time predictors to estimate queue wait times and improve scheduler performance. In 5th Workshop on Job Scheduling Strategies for Parallel Processing, in conjunction with IPPS'99, 1999.

[TMN+99] Atsuko Takefusa, Satoshi Matsuoka, Hidemoto Nakada, Kento Aida, and Umpei Nagashima. Overview of a performance evaluation system for global computing scheduling algorithm. To appear in the 8th HPDC conference, 1999.

[vLSI+99] Gregor von Laszewski, Mei-Hui Su, Joseph Insley, Ian Foster, John Bresnahan, Carl Kesselman, Marcus Thiebaux, Mark Rivers, Steve Wang, Brian Tieman, and Ian McNulty. Real-time analysis, visualization, and steering of tomography experiments at photon sources. Ninth SIAM Conference on Parallel Processing for Scientific Computing, Apr 1999.

[Wei98] Jon Weissman. Gallop: The benefits of wide-area computing for parallel processing. Journal of Parallel and Distributed Computing, 54, Nov 1998.

[WK94] M. Wu and A. Kuppermann. Casa quantum chemical reaction dynamics. In Casa Gigabit Network Testbed Annual Report, 1994.
"8 nodes on the SP2" 2500
turnaround time (secs)
2000
WS SP2Immed/WS SP2Immed SP2Queue 1500
1000
500
0
1
2
3
4
5
6
7 8 runs
9
10
11
12
13
(a)
"8 nodes on the SP2" 120
number of SP2 processors used
100
SP2Immed/WS SP2Immed
80
60
40
20
0
1
2
3
4
5
6
7 8 runs
9
10
11
12
13
(b)
Figure 4: Experiment results when 8 nodes were requested for SP2Queue. (a) Turnaround time. (Values too large to t on graph: run 4/SP2Queue - 3078s, run 9/SP2Queue - 6164s, run 10/SP2Immed - 2615s) (b) Number of SP2 nodes used by SP2Immed/WS and SP2Immed. 11
"16 nodes on the SP2" 2500
turnaround time (secs)
2000
WS SP2Immed/WS SP2Immed SP2Queue 1500
1000
500
0
1
2
3
4
5 6 runs
7
8
9
10
(a)
"16 nodes on the SP2" 120
number of SP2 processors used
100
SP2Immed/WS SP2Immed
80
60
40
20
0
1
2
3
4
5 6 runs
7
8
9
10
(b)
Figure 5: Experiment results when 16 nodes were requested for SP2Queue. (a) Turnaround time. (Values too large to t on graph: run 3/SP2Queue - 20484s, run 5/SP2Queue - 17268s) (b) Number of SP2 nodes used by SP2Immed/WS and SP2Immed. 12
"32 nodes on the SP2" 2500
turnaround time (secs)
2000
WS SP2Immed/WS SP2Immed SP2Queue 1500
1000
500
0
1
2
3
4
5
6 runs
7
8
9
10
11
(a)
"32 nodes on the SP2" 120
number of SP2 processors used
100
SP2Immed/WS SP2Immed
80
60
40
20
0
1
2
3
4
5
6 runs
7
8
9
10
11
(b)
Figure 6: Experiment results when 32 nodes were requested for SP2Queue. (a) Turnaround time. (Values too large to t on graph: run 8/SP2Queue - 27480s, run 9/SP2Queue - 9869s) (b) Number of SP2 nodes used by SP2Immed/WS and SP2Immed. 13