Tuning High-Performance Scientific Codes: The Use of Performance Models to Control Resource Usage During Data Migration and I/O

Jonghyun Lee
Marianne Winslett
Xiaosong Ma
Shengke Yu
Department of Computer Science University of Illinois at Urbana-Champaign Urbana, IL 61801 USA
{jlee17, winslett, xma1, [email protected]
ABSTRACT
Large-scale parallel simulations are a popular tool for investigating phenomena ranging from nuclear explosions to protein folding. These codes produce copious output that must be moved to the workstation where it will be visualized. Scientists have a variety of tools to help them with this data movement, and often have several different platforms available to them for their runs. Thus questions arise such as: which data migration approach is best for a particular code and platform? Which will provide the best end-to-end response time, or the lowest cost? Scientists also control how much data is output, and how often. From a scientific perspective, the more output the better; but from a cost and response time perspective, how much output is too much? To answer these questions, we built performance models for data migration approaches and verified them on parallel and sequential platforms. We use a 3D hydrodynamics code to show how scientists can use the models to predict performance and tune the I/O aspects of their codes.

1. INTRODUCTION
Large-scale parallel simulations regularly output data structure snapshots or checkpoints to disk. A snapshot operation stores the current "image" of simulation data that will be needed for future visualization or analysis. A checkpoint operation saves enough simulation data for the computation to restart from the most recent checkpoint if the system crashes. Each output operation is often fairly large, making these applications I/O intensive. Figure 1 shows the typical I/O behavior of this kind of application. Typical I/O approaches coded by scientists are to have one of the participating processors gather all the data from the others and write the data out, to have each processor write out its own data, or to divide the processors into several groups with one processor from each group taking care of writing the group's data.
Read a small amount of data for initialization;
For simulated_time = 0 to End_of_simulation {
    compute_current_state_of_simulated_system(simulated_time);
    if (simulated_time mod CHECKPOINT_INTERVAL == 0)
        output a copy of the most important data structures
            for fault tolerance;
    if (simulated_time mod SNAPSHOT_INTERVAL == 0 OR
        changed_a_lot(current_state_of_the_simulated_system,
                      state_of_the_system_during_the_last_snapshot))
        output a copy of the data structures you will want
            to visualize afterwards;
}
Figure 1: The typical I/O behavior of a scientific simulation.

For better I/O performance and/or flexibility, scientists can also use parallel I/O systems and libraries.

Data migration is often slow, due to the size of the data and the low network bandwidth from the parallel platform to the remote workstation. Traditionally, scientists have performed data migration by moving all output files from one machine to another after the application has completed. This task is usually done manually and is tedious, time-consuming, and inconvenient. Further, end-to-end application response time is high, because the scientists have to wait until the end of the run to see the results, even though each snapshot has been ready for migration ever since its output operation. We define application turnaround time as the time measured from the beginning of the application run to the end of the migration of the data it generated. Today, smarter approaches to data migration are known [4, 8, 11, 12, 18], and scientists could use tools to help answer their questions about the most effective use of I/O and migration resources. To this end, we have built performance models for a variety of data migration approaches and empirically verified them on the IBM SP, the Origin 2000, and
workstations. Section 2 discusses the data migration strategies, Section 3 presents the performance models, and Section 4 contains the validation. In Section 5, we use a 3D hydrodynamics code called ZEUS-MP to show how a scientist can use the models to predict performance and tune the I/O aspects of their codes. Section 6 discusses related work and Section 7 concludes this paper.

2. DATA MIGRATION STRATEGIES FOR PARALLEL SCIENTIFIC CODES
To avoid the inefficiency of the traditional approach to data migration, we can migrate output while the application runs. This overlap will hide part or all of the migration cost and result in shorter application turnaround time. We can stage the intermediate output to local disk, which typically has higher bandwidth than the network used for migration. To migrate the output, it is read from its file on disk (which may be cached in the local memory file cache) and then sent to the remote machine.
The processors participating in an application run fall into two possibly overlapping groups: compute processors and I/O processors. Compute processors are processors where only the application's computation takes place, and I/O processors are the processors in charge of the application's file I/O. (For brevity, in this paper we use the term "I/O" to refer to file I/O.) In this paper, we assume that the I/O processors are also in charge of data migration, as I/O is closely related to data migration. This approach also makes it easy for the data migration facilities to take advantage of collective I/O facilities for reorganizing the data. For example, many scientists want their output as a single Hierarchical Data Format (HDF) [2] file, rather than small nonstandard-format output files from each compute processor. The latest version of HDF supports parallel I/O, but many scientists are wedded to the earlier versions of HDF because HDF5 is not backward compatible.

Dedicated I/O processors are used for I/O and migration, but nothing else. Non-dedicated I/O processors act as compute processors when computation is taking place, but serve as I/O processors at I/O time. On a per-processor basis, one can usually expect better I/O performance with dedicated I/O processors, since a non-dedicated I/O processor has two roles to play at I/O time (sender and recipient of data). However, the dedicated I/O processor approach may require more processors than the non-dedicated approach in order to attain the same aggregate level of I/O performance and computation time. Non-dedicated I/O processors show better processor utilization and can be a good choice when the available processors are limited.
Dedicated I/O processors. Dedicated I/O processors are convenient for migrating data while an application runs, because they are often idle while computation is taking place on the compute processors. During these idle periods, the output generated and staged during the previous computation phases can be migrated. Figure 2 depicts this scenario. C, O and M in the figure indicate each computation, output and migration phase respectively. The numbers in parentheses represent the time spent in each phase.
This approach works well when the computation period after an output operation is longer than the time needed to migrate the output data (Figure 2a), because the data migration time is completely hidden (except for the final migration). If a migration actually takes longer than the next computation period, one option is to block the compute processors from performing the next output operation until the current migration finishes (Figure 2b). Another approach is to temporarily stop the current migration when the application issues a new output request, and resume the migration after the new request has been handled (Figure 2c), so that migration never blocks computation. The approach in Figure 2c is appealing when some migrations are faster than some computation phases. For example, assume each computation period takes 15 sec., each output 5 sec., and the first two migrations take 17 and 12 sec. respectively (this is reasonable due to network bandwidth
fluctuation). With the approach in Figure 2b, the third output phase starts at 57 seconds, because the compute processors are idle for 2 seconds waiting for the first migration to finish. With the approach in Figure 2c, the third output phase can start at 55 seconds, because the first migration can be stopped for the next output and resumed afterwards. For this reason, in this paper we follow the approach in Figure 2c when the same thread is used for migration and I/O, and call it shared-thread migration (implicitly, with migration interrupted for output).

Another approach we can consider here is to have an execution thread dedicated to migrating data on the I/O processors (Figure 3a). The migration thread runs concurrently with the I/O thread on the I/O processors, performing migration whenever data is ready to be sent. During computation phases, the I/O thread is idle and the migration thread is the only active thread, as in the previous approach. At I/O time both threads are working, and the concurrency may slow down both threads. We consider both this dedicated-thread approach and the shared-thread approach for data migration using dedicated I/O processors in this paper. The dedicated-thread approach can be implemented either with two separate processes, or with a single process with multiple threads of execution.
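The timing claims in the example above can be checked with a small simulation. The sketch below (our illustration in Python, not part of the paper's software) replays the example: under the blocking policy of Figure 2b an output waits for the previous migration, while under the interrupt-and-resume policy of Figure 2c migration runs only in the gaps between outputs.

# Replay the worked example for the two shared-thread policies.
# All durations are in seconds and come from the example in the text.
comp = [15, 15, 15]   # C1..C3
out  = [5, 5, 5]      # O1..O3
mig  = [17, 12]       # M1, M2 (M3 is not needed to place O3)

def blocking(comp, out, mig):
    # Figure 2b: an output is blocked until the previous migration finishes.
    t, mig_end, starts = 0, 0, []
    for i in range(len(comp)):
        t += comp[i]               # compute phase
        t = max(t, mig_end)        # wait for the prior migration, if any
        starts.append(t)           # output i starts here
        t += out[i]
        if i < len(mig):
            mig_end = max(t, mig_end) + mig[i]
    return starts

def interrupting(comp, out, mig):
    # Figure 2c: outputs suspend the ongoing migration, which resumes after.
    t, avail, starts = 0, [], []   # avail: remaining migration work, in order
    for i in range(len(comp)):
        work = comp[i]             # migration progresses during computation
        while work > 0 and avail:
            used = min(work, avail[0])
            avail[0] -= used
            work -= used
            if avail[0] == 0:
                avail.pop(0)
        t += comp[i]
        starts.append(t)           # output never waits for migration
        t += out[i]
        if i < len(mig):
            avail.append(mig[i])   # output i's data is now ready to migrate
    return starts

print(blocking(comp, out, mig))      # [15, 37, 57]: O3 starts at 57 s
print(interrupting(comp, out, mig))  # [15, 35, 55]: O3 starts at 55 s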
Non-dedicated I/O processors. Because non-dedicated I/O processors are not idle when computation is taking place, the only way to migrate data with this type of I/O processor is to have a dedicated migration thread on each I/O processor. Migration can then be overlapped with both the computation and the output phases (Figure 3b). Slowdown of concurrent computation, migration and output is also possible here.
Immediate migration. With remote interactive application steering, we cannot overlap data migration with subsequent computation periods, because the user must see and respond to the output before computation resumes. The needed output should be migrated immediately, without disk staging. During this "immediate migration", we still move the data to the I/O processors and then migrate it. We do not send the data directly from the compute processors, because the user at the remote side might want the data distributed differently from how they are in memory
Figure 2: Data migration using dedicated I/O processors with one thread of execution. (a) Migration shorter than computation. (b) Migration delays next output staging. (c) Migration interrupted for next output staging. Ci is the ith computation phase, Oi is the ith output phase, and Mi is the ith migration phase. Durations are given in parentheses.
Figure 3: Data migration using a dedicated migration thread. The dotted line shows the slowed down part of each operation. (a) Migration using dedicated I/O processors. (b) Migration using non-dedicated I/O processors.
on the supercomputer. This approach would also be advantageous when the time to migrate the output is shorter than the time to locally stage the data, but that situation does not arise at any supercomputing center we have ever used. Immediate migration can be used with dedicated or non-dedicated I/O processors.

3. PERFORMANCE MODELS FOR MIGRATION
In this section, we present a performance model for each migration approach discussed in the previous section. Recall that, as shown in Figure 1, our target codes have a fixed number of output operations that bracket each computation period. Table 1 shows the major parameters used in the models. Section 4 discusses the important issue of how to measure these parameters.
  Name       Meaning
  n          total number of computation periods or output operations
  Ttotal     total application turnaround time, measured on the supercomputer
  Tcomp(i)   time spent in the ith computation period
  Tout(i)    time spent in the ith output operation
  Tmig(i)    time spent for the migration of the data generated and output by
             the ith computation period and output operation respectively
  scomp(i)   slowdown factor for the ith computation period when it runs
             concurrently with data migration
  sout(i)    slowdown factor for the ith output period when it runs
             concurrently with data migration
  smig(i)    slowdown factor for the ith migration period when it runs
             concurrently with computation and/or output

Table 1: Parameter list. All parameters are per processor, not aggregate.

Data migration using dedicated I/O processors. Under the shared-thread approach, we temporarily stop an ongoing migration whenever there is an output request from the compute processors, and resume the suspended migration once the output request has finished executing. We will not begin the ith migration until the (i-1)st migration is completed, even if the ith output has been written to a file and is ready for migration. We make this restriction because it allows users to receive the data in the right order for visualization. We start the ith migration when the ith output or the (i-1)st migration is finished, whichever comes later. The execution ends after the nth migration is finished:

    Ttotal = tmig_begin(n) + Tmig(n),                                   (1)

where tmig_begin(n) is the starting time of the nth migration period. (In the equations in this paper, t represents a specific point in real time since the beginning of execution, and T represents the duration of a given operation.) To predict the value of tmig_begin(n), we define a catchup point as a time point where output from a previous computation period can be migrated immediately after the data is written. For example, in Figures 2 and 3, the starting points of the first and the third migrations are catchup points. We define MRC(i) as the k <= i such that the ending point of the kth output is the last catchup point before the ith migration starts, and we define the most recent output, MRO(i), as the index of the last output that completed before tmig_begin(i). Then tmig_begin can be written as follows:

    tmig_begin(i) = sum_{1<=j<=i} (Tcomp(j) + Tout(j)),        if MRC(i) = i;

    tmig_begin(i) = tmig_begin(MRC(i)) + sum_{MRC(i)<=j<=i-1} Tmig(j)
                    + sum_{MRC(i)+1<=j<=MRO(i)} Tout(j),       otherwise.

The equations presented above can only be used when we have estimates for each Tcomp(i), Tout(i) and Tmig(i). However, even without these estimates, we can still use average values to estimate performance. Suppose that Tcomp, Tout, and Tmig are the average values of Tcomp(i), Tout(i), and Tmig(i). If Tmig <= Tcomp, all migrations but the final one will be completely overlapped with computation and MRC(n) = n. If Tmig > Tcomp, then all computation periods except the first will be completely overlapped with migration, and the only catchup point is the beginning of the first migration. Thus we have:

    Ttotal = n(Tcomp + Tout) + Tmig,    if Tcomp >= Tmig;               (2)
    Ttotal = Tcomp + n(Tout + Tmig),    otherwise.

This equation assumes that Tcomp is independent of migration. Most applications running on distributed memory platforms use message communication between processors during their computation. Although the data traffic going from the I/O processors to a remote machine may share the interconnect with local message traffic among compute processors, our equation does not consider the possibility that the combination of those two kinds of traffic will saturate the interconnect and affect the performance of the computation phases of communication-intensive applications: current supercomputer center Internet bandwidth is very low compared to the local message passing bandwidth, so the migration traffic does not require much processing and is not a burden. We also assume that receiving data on the workstation at the other side is not a bottleneck. This is reasonable because aggregate Internet throughput is still less than aggregate disk bandwidth for most workstations.

With a dedicated thread for data migration, concurrent migration and output operations may slow each other down, which we represent by slowdown factors in the equations below. These factors are discussed further in Section 4. Since the final migration is not overlapped with anything, Ttotal is still given by equation (1). tmig_begin(i) is also unchanged if the most recent catchup point is right after the ith output; if not, it is simply when the (i-1)st migration finishes, since output does not stop a migration in progress:

    tmig_begin(i) = sum_{1<=j<=i} (Tcomp(j) + sout(j) Tout(j)),   if MRC(i) = i;
    tmig_begin(i) = tmig_begin(i-1) + smig(i-1) Tmig(i-1),        otherwise.

Again, we can assume constant parameters to compute Ttotal. If Tcomp > Tmig, then formula (2) still applies. If smig Tmig > Tcomp + sout Tout, all output operations but the first one are completely overlapped with migration. We must determine what fraction of the entire migration is slowed down by overlap with output operations. For this analysis, we divide the total migration time (n Tmig) into two parts, Tmig_slow and Tmig_normal, i.e., n Tmig = Tmig_slow + Tmig_normal, where data migration during the Tmig_slow time is slowed down by the overlap with output. Since (n-1) outputs slow down the migration, we have:

    smig Tmig_slow = (n-1) sout Tout
    Tmig_normal = n Tmig - Tmig_slow = n Tmig - (n-1) (sout/smig) Tout

Then Ttotal is the sum of the first computation period, the output periods (all but the first of which are slowed down), and the migration periods that are not slowed down by output:

    Ttotal = Tcomp + Tout + (n-1) sout Tout + Tmig_normal
           = Tcomp + n Tmig + (1 + (n-1) ((smig - 1)/smig) sout) Tout

A similar formula applies when Tmig >= Tcomp but smig Tmig <= Tcomp + sout Tout.
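For concreteness, the following sketch (ours, in Python; not the implementation used in our experiments) computes Ttotal for the shared-thread approach with per-phase durations by replaying the same timeline rules that the MRC/MRO recurrence encodes: outputs are never delayed, migration i starts once output i is written and migration i-1 is done, and migration is suspended during outputs. Slowdown factors are taken as 1 here.

# Ttotal = tmig_begin(n) + Tmig(n), eq. (1), evaluated by timeline replay.
def shared_thread_ttotal(Tcomp, Tout, Tmig):
    n = len(Tcomp)
    # outputs are never delayed under the shared-thread approach,
    # so their time intervals are fixed in advance
    iv, t = [], 0.0
    for i in range(n):
        t += Tcomp[i]
        iv.append((t, t + Tout[i]))
        t += Tout[i]
    t = 0.0                          # migration clock
    for i in range(n):
        t = max(t, iv[i][1])         # wait for output i and migration i-1
        left = Tmig[i]
        for (b, e) in iv[i + 1:]:    # later outputs suspend this migration
            if t >= e:
                continue
            run = min(left, max(0.0, b - t))
            t += run                 # progress in the gap before the output
            left -= run
            if left == 0.0:
                break
            t = e                    # suspended for the output's duration
        t += left                    # remainder runs after the last output
    return t

# Durations from the Figure 2 example (third migration: 14 sec):
print(shared_thread_ttotal([15] * 3, [5] * 3, [17, 12, 14]))   # 74.0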
Non-dedicated I/O processors. In this approach, computation and output both overlap with migration, and they slow each other down. Equation (1) for Ttotal still holds, and we have:

    tmig_begin(i) = sum_{1<=j<=i} (scomp(j) Tcomp(j) + sout(j) Tout(j)),   if MRC(i) = i;
    tmig_begin(i) = tmig_begin(i-1) + smig(i-1) Tmig(i-1),                 otherwise.
However, the slowdown of migration caused by computation differs from the slowdown caused by output. Therefore, for the analysis with constant parameters, we introduce two different slowdown factors for data migration, smig_comp and smig_out, which represent the slowdown caused by computation and by output respectively. (The factor smig in the dedicated I/O processor model is actually smig_out, because only output is overlapped with migration in that approach.) If scomp Tcomp > smig_comp Tmig, then all but the last migration are overlapped with computation. Ttotal can be derived using a method similar to the one used in the previous subsection and written as

    Ttotal = n Tcomp + n Tout + (1 + (n-1) ((scomp - 1)/scomp) smig_comp) Tmig

If smig_comp Tmig > scomp Tcomp + sout Tout, Ttotal can be written as

    Ttotal = (1 + (n-1) ((smig_comp - 1)/smig_comp) scomp) Tcomp
             + (1 + (n-1) ((smig_out - 1)/smig_out) sout) Tout + n Tmig
If scomp Tcomp <= smig_comp Tmig <= scomp Tcomp + sout Tout, Ttotal can be derived similarly.
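The two closed forms above can be wrapped in a small helper for constant-parameter estimates. This is a sketch of ours under the same regime conditions as the text; the intermediate regime is omitted, as the text only notes that it can be derived similarly.

# Constant-parameter Ttotal for non-dedicated I/O processors.
def ttotal_non_dedicated(n, Tcomp, Tout, Tmig, scomp=1.0, sout=1.0,
                         smig_comp=1.0, smig_out=1.0):
    if scomp * Tcomp > smig_comp * Tmig:
        # all but the last migration hide under (slowed) computation
        return (n * (Tcomp + Tout)
                + (1 + (n - 1) * (scomp - 1) / scomp * smig_comp) * Tmig)
    if smig_comp * Tmig > scomp * Tcomp + sout * Tout:
        # migration dominates: computation and output hide under it
        return ((1 + (n - 1) * (smig_comp - 1) / smig_comp * scomp) * Tcomp
                + (1 + (n - 1) * (smig_out - 1) / smig_out * sout) * Tout
                + n * Tmig)
    raise NotImplementedError("intermediate regime: derive as in the text")

# e.g., with the SP parameters of Table 3 (long computation, 1 I/O processor):
print(ttotal_non_dedicated(5, 35.6, 2.69, 28.57, scomp=1.05))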
Immediate migration. This is a very straightforward approach:

    Ttotal = sum_{1<=i<=n} (Tcomp(i) + Tremote_output(i)),

where Tremote_output(i) is the time spent to transfer the data produced by the ith computation period to a remote machine. Unlike Tmig, Tremote_output may include local communication cost if the output operation is collective, and it does not include file read costs. With constant parameters, Ttotal = n(Tcomp + Tremote_output). This model does not include user interaction time, which is usually required for interactive application steering.

4. USING THE PERFORMANCE MODEL
The performance model formulas presented in the previous section are meaningless without a discussion of how to measure or estimate the values of the parameters in the model: Tcomp, Tout, Tmig, and the slowdown factors. The appropriate measurement approach depends on what the model will be used for, ranging from a back-of-the-envelope calculation to a precise comparison of two neck-and-neck approaches. However, no matter what the intended use of the model is, in most cases the user will want to make use of two microbenchmarks we have developed for measuring the underlying performance of the Internet and of the file system used for staging. These microbenchmarks are described in Section 4.1.

Microbenchmarks would be of little help in estimating Tcomp. However, a value for Tcomp may already be at hand, from measurements of debugging runs on the target platform. If no previous runs are available, then presumably the code has been debugged elsewhere and is being ported to a new platform. In this case, the estimate of Tcomp will have to rely on extrapolation from the application's performance on other platforms. The remaining parameters in the performance model are the slowdown factors. Section 4.3 describes methods for estimating these factors.

4.1 Microbenchmarks and Rules of Thumb for Tout and Tmig
The major determinant of Tout is the aggregate throughput of the file system used for staging the data that are to be migrated. The file system throughput is not the only determinant, because the higher-level facilities used for parallel I/O impose their own additional overhead. In fact, an appropriate value for Tout may already be available from a previous run on the target platform, or the user might be willing to do a special run just to determine Tout. When this is not the case, the aggregate file system throughput and an additional rule of thumb can be used to estimate Tout.

We have built a microbenchmark that measures the file system throughput on a platform by having n writers concurrently send streams of write requests to the file system (http://drl.cs.uiuc.edu/panda/microbenchmark/filebenchmark.html). In our microbenchmark, each writer writes a separate file using a series of large sequential write requests, to mimic the expected behavior of a collective I/O facility. The same approach is used for reads. The microbenchmark measures performance for a range of values of n, because aggregate file system performance often drops when the number of concurrent writers exceeds a platform-dependent threshold. The microbenchmark results will show the point at which the file system becomes saturated, if saturation does occur. This result may influence the user's choice of n in application runs.

Once the aggregate file system performance has been determined, the user can estimate Tout for a particular number of concurrent readers/writers by a rule of thumb for the parallel I/O system in use. Parallel I/O system developers and users usually know what fraction of aggregate file system performance their system attains on average: 30%, 50%, 80%, whatever. This estimate can be used together with the microbenchmark results to predict Tout. For example, for a run with one writer, 40 MB of data, a file system with 20 MB/sec of throughput for a single writer, and a library that delivers 50% of file system throughput, Tout will be 40/(20 * 0.5) = 4 seconds.

The same possibilities and considerations apply to the problem of measuring Tmig. Today's Internet throughputs are quite low; we found that FTP throughput between the platforms around the world that we have access to is generally less than 1 MB/sec. With throughputs so low, the sender is unlikely to overrun the recipient, which will normally be able to write the received data to disk as quickly as it arrives. This means that the underlying determinant of migration performance is the Internet throughput between sender and recipient. To measure this throughput, we built a microbenchmark that uses TCP/IP to send a series of messages across the Internet (http://drl.cs.uiuc.edu/panda/microbenchmark/networkbenchmark.html). This Internet data transmission benchmark measures performance with n concurrent data senders, all sending to a single recipient. As shown later, performance generally increases with an increasing number of senders, until a saturation point is reached. The results of the benchmark can be helpful in tuning the number of processors used to migrate data.
For Tmig, we used the Internet data transmission benchmark result for the intended number of data senders to calculate the data transmission portion of Tmig. For example, if the data transmission benchmark shows 1 MB/sec of throughput with two concurrent senders, then Tmig for 40 MB of data and two senders will be 40 sec, plus the time required to read the 40 MB from its file.
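As a concrete illustration of these rules of thumb, the sketch below (ours, with parameter names of our choosing) turns the microbenchmark numbers into Tout and Tmig estimates; it reproduces the two worked examples above.

# Back-of-the-envelope Tout and Tmig from the two microbenchmarks.
def estimate_tout(data_mb, fs_mbps, lib_fraction):
    # fs_mbps: aggregate file system throughput for the chosen writer count;
    # lib_fraction: fraction of that throughput the I/O library delivers.
    return data_mb / (fs_mbps * lib_fraction)

def estimate_tmig(data_mb, net_mbps, read_mbps=None):
    # net_mbps: Internet throughput for the chosen number of senders.
    t = data_mb / net_mbps          # data transmission time
    if read_mbps is not None:       # add the file read time, unless the
        t += data_mb / read_mbps    # data is expected to sit in the cache
    return t

print(estimate_tout(40, 20, 0.5))   # 4.0 sec: the Tout example in the text
print(estimate_tmig(40, 1.0))       # 40.0 sec: the Tmig example in the text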
4.2 Handling the Variances

Network bandwidth fluctuates, but we used a constant value for the average bandwidth to predict Tmig before the application executes. To be more precise, we could use a probability distribution for the Internet transfer bandwidth. For example, monitoring Internet performance would allow us to construct a histogram showing the chance that performance will reach each possible level. We could use this information to choose the migration strategy that minimizes turnaround time over all the different possible values of data transfer time over the Internet. The question is, will this extra precision allow us to reduce the expected turnaround time?

The first obstacle to leveraging this extra precision is that the Internet bandwidth values at which we would want to change the current migration method are rather extreme. The first case is where one might wish to change from a staging migration method to immediate migration, when the network bandwidth exceeds the aggregate file system bandwidth; this level of performance is unrealistic in current supercomputer configurations. The only other case is where bandwidth becomes so low that migration to the originally chosen destination becomes too costly to be practical, as discussed below. Thus extra modeling precision would allow us to predict the application turnaround time more accurately, but not, in general, to reduce the expected turnaround time as the network bandwidth fluctuates.

Other researchers have characterized the bandwidth fluctuation of the Internet (e.g., [18]). Their long-term measurements of Internet bandwidth show that the bandwidth during the daytime and the nighttime can be significantly different. In contrast, bandwidth within the space of several nighttime minutes spans perhaps a 30% range. This means that short-running applications are unlikely to see extreme changes in Internet bandwidth. On the other hand, codes running for many hours or days will see such extreme changes in available bandwidth that migration performance may be satisfactory in the nighttime, but so slow in the daytime that a local destination is needed for migrated data (e.g., a storage device at the supercomputing center). The automated choice of destinations for migrated data is beyond the scope of this paper.

Another possible source of variance is fluctuation in the duration of each computation phase. In many cases, this variance will be small, because our target applications usually take a snapshot after a fixed number of timesteps of computation, and each timestep typically takes roughly the same computation time. However, some applications perform output operations only when "interesting" changes occur in the data. In this case, if migration of one snapshot usually takes longer than any computation phase, the equations presented in the previous section can still be applied with minor changes, since it is still data migration that dominates the turnaround time. On the contrary, if a
migration is usually faster than any computation even though the length of each computation phase varies, we can use the total length of the computation phases instead of constant durations for each computation phase. If a migration could be longer or shorter than a computation, we recommend determining the rough fraction of each case and using a mix of the above two approaches, weighted by those fractions, to get a more accurate performance prediction.

4.3 Estimating the Slowdown Factors
Precise estimation of the slowdown factors is a tedious task, one that we describe in detail below. However, perhaps a more important question is whether the slowdown factors can be ignored completely, i.e., all set to 1.00. After all, with Internet throughput generally so low, won't Tcomp and Tmig be the deciding factors for almost any question we might want to put to the performance model? The answer is: probably, but not necessarily. Some target platforms will have very high slowdown factors for certain operations. For example, if you run a communication-intensive application on SMP PCs that have only a single I/O bus, the combination of disk activity, interconnect traffic, and TCP/IP messages may swamp the bus, causing significant increases in turnaround time. On the other hand, most of the slowdown factors were very close to 1 in our experiments on the IBM SP and the SGI Origin 2000.

The microbenchmark to measure scomp for a computation-intensive application (all the microbenchmarks described in this section can be found at http://drl.cs.uiuc.edu/panda/microbenchmark/slowdown.html) calculates how much a sequence of floating point operations is slowed down by the components of migration: concurrent file read operations and data transmission operations. For communication-intensive applications, it measures how much a sequence of message exchanges between two processors is slowed down. The size of the data used for the concurrent reads or transmission is calibrated so that the concurrent reads or transmission take longer than the computation (or communication). To ensure that the concurrent reads read data directly from the disk, and not just from the file cache in main memory, the microbenchmark purges the cache of useful data before the first read. This captures the slowdown caused by reading a large file, which is not likely to fit in the file cache and eventually will be read directly from the disk; reading data from the file cache is simply a memory access and will not cause significant computation slowdown.

A computation phase in real applications will be a mix of pure computation and communication, corresponding to two different slowdown factors. As discussed in Section 5, the percentage mix of pure computation and communication can be used to calculate a percentage mix of the slowdown factors produced by the two microbenchmarks, giving a single overall slowdown factor. For the remainder of this section, we only consider a computation phase with pure computation.
sout and smig are measured in the same manner as scomp. An output period can be overlapped with migration, and a migration can be overlapped with both computation and communication. The microbenchmark outputs or migrates a certain amount of data concurrently with the operations that may be overlapped with it, and calculates how much the output or migration is slowed down. Again, we control the durations so that the overlapped operations take longer than the operation whose slowdown factor is being measured. For an output operation, we issue an fsync at the end of the output to ensure that the data goes to the disk and not just to the file cache, for the same reason that we flushed the cache for reads.
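The authors' slowdown microbenchmarks are available at the URL above. Purely to illustrate the idea, a minimal stand-in (ours, in Python) times a floating point loop alone and again while a child process streams reads from a large file; the quotient estimates scomp. The path "bigfile" is a placeholder, and unlike the real microbenchmark this sketch neither purges the file cache nor calibrates the read to outlast the computation.

import multiprocessing, time

def flops(n=5_000_000):
    # the "computation": a fixed number of floating point operations
    x = 0.0
    for i in range(n):
        x += i * 1e-9
    return x

def reader(path):
    # the competing migration component: stream the file in 1 MB reads
    with open(path, "rb") as f:
        while f.read(1 << 20):
            pass

def measure_scomp(path="bigfile"):       # placeholder path to a large file
    t0 = time.perf_counter(); flops(); alone = time.perf_counter() - t0
    p = multiprocessing.Process(target=reader, args=(path,))
    p.start()
    t0 = time.perf_counter(); flops(); shared = time.perf_counter() - t0
    p.join()
    return shared / alone                # scomp: 1.00 means no slowdown

if __name__ == "__main__":
    print(measure_scomp())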
4.4 Parameter Measurements on the SP and Origin

In this section, we present the results of the microbenchmarks and measurements described in the previous sections, for an IBM SP at Argonne National Laboratory (ANL) and the Origin 2000 at the National Center for Supercomputing Applications (NCSA).

The ANL SP is a distributed memory machine. It has 80 SP3 thin nodes, each equipped with a 120 MHz P2SC CPU, 256 MB of memory and 2 GB of scratch space on local disk, running AIX 4.2.1. The Origin 2000 is a distributed-shared memory machine with 256 250 MHz MIPS R10000 processors running IRIX 6.5. It has 128 GB of main memory and 456 GB of scratch space on shared RAIDs. To measure the network throughput and the slowdown factors caused by concurrent data transmission, we used a local workstation named Bunny, located at the University of Illinois at Urbana-Champaign. Bunny is a Sun SPARCstation 10 running Solaris 2.7.

The file system benchmark numbers shown in Table 2 were obtained by writing out 512 MB of data using 1 MB write requests. The numbers were averaged over five or more runs. On the SP, the aggregate throughput scales up well as the number of concurrent writers increases, due to the distributed disks on the SP. However, with the shared file system on the Origin, the throughput does not scale well, although it does increase as the number of writers increases (up to 4 I/O processors). According to [5], too many concurrent writers/readers will eventually hurt performance on this platform. Though the file system throughput does not scale up well, the RAIDs on the Origin show much higher performance than the disks on the SP.

Aggregate network bandwidth increases as the number of concurrent streams increases, but does not scale up linearly. On both platforms, the bandwidth tends to increase up to 4 streams, without reaching a saturation point. The bandwidth between the Origin and Bunny is a little higher than between the SP and Bunny. Since network bandwidth can vary with the time of day and the amount of network traffic, we conducted the experiments late at night, when we usually expect a less crowded network, for consistency of results. Again, the numbers were averaged over 5 or more runs.

  ANL SP                                    1 I/O          2 I/O          4 I/O
  aggregate file system write throughput    6.7 (0.10)     13.4 (0.15)    26.8 (0.18)   MB/s
  aggregate network bandwidth to Bunny      0.56 (0.022)   0.75 (0.003)   0.85 (0.004)  MB/s
  scomp caused by concurrent reads          1.27
  scomp caused by concurrent transmission   1.05
  sout                                      1.00
  smig                                      1.00

  NCSA Origin 2000                          1 I/O          2 I/O          4 I/O
  aggregate file system write throughput    64.63 (1.19)   72.55 (2.81)   79.66 (3.02)  MB/s
  aggregate network bandwidth to Bunny      0.78 (0.028)   0.90 (0.017)   1.00 (0.009)  MB/s
  scomp                                     1.00
  sout                                      1.00
  smig                                      1.00

Table 2: The results of the microbenchmarks and measurements on the ANL SP and NCSA Origin 2000. The amounts in parentheses show the 95% confidence interval for the mean.

On the SP, the only significant slowdown factor was the computation slowdown, scomp, showing 27% and 5% slowdown caused by concurrent reads and by concurrent data transmission respectively. sout and smig were both 1, so users can safely ignore these factors on the SP. (When measuring sout, we measured only the output slowdown caused by concurrent transmission, not by concurrent reads. In our implementations of data migration, when dedicated-thread migration is used, we do not perform file reads while migration is overlapped with an output phase; this avoids performance-degrading disk seeks caused by interleaving reads and writes on the same disk. We also found that on the SP, concurrent reads are stalled until the concurrent writes finish, so concurrent writes could stall data transmission while the migration facility waits for more data to be read, even if a thread is dedicated to migration. Data migration implementations can use double buffering or other techniques to avoid this problem.)

We hardly experienced any slowdown for concurrent operations on the Origin. The main reason is that on the Origin, newly created migration processes run on different processors from the ones where their parents are running if hardware is available, and since we performed our tests in a 128-processor dedicated queue for consistency, there were plenty of extra processors available. We should note that the slowdown caused by concurrent data transmission depends on the network bandwidth. For example, if the network bandwidth between two machines is 10 MB/sec, concurrent data transmission will require more processing per unit time than when the bandwidth is only 1 MB/sec; higher network bandwidth will therefore increase the value of scomp.

4.5 Model Validation with Synthetic Applications

To validate the performance models, we measured the data migration performance of two kinds of synthetic application programs with the Panda parallel I/O library [16]. Panda supports multidimensional array I/O for single-program multiple-data (SPMD) style applications running on distributed and distributed-shared memory machines. Arrays are distributed across multiple compute processors on which Panda clients are running, and can have different data distributions in memory and on disk. Panda's collective I/O strategy is called server-directed I/O [15], where I/O servers and clients actively cooperate to carry out an I/O request to store or read part or all of an array. Panda supports both dedicated and non-dedicated I/O processors. For use in these migration performance measurements, all the migration strategies modeled in the previous section were implemented in the current version of Panda, 3.0 alpha. The dedicated-thread approaches were implemented using a separate process for migration. This avoids worries over thread safety, but increases context-switching costs.

The synthetic applications follow the pattern in Figure 1. They repeat a given number of computation and output phases, and each output phase stores 16 MB of data to disk. A computation phase in the first version of the application consists of a certain number of floating point operations, to simulate computation-intensive applications. The second version of the application pingpongs messages between a pair of processors to simulate communication-intensive applications. We used eight compute processors and one, two and four I/O processors (both dedicated and non-dedicated) to run the application program. All the compute and I/O processors were synchronized before starting each computation and I/O phase. Although the synthetic applications do not need this synchronization, real applications usually do, because of the need to exchange data between neighboring operations and the need to cooperate to carry out collective I/O operations.
  ANL SP               n = 5
    Tcomp              35.6 sec (long), 11.9 sec (short)
    Tout               2.69 sec (1 I/O), 1.34 sec (2 I/O), 0.67 sec (4 I/O)
    Tmig               28.57 sec (1 I/O), 21.33 sec (2 I/O), 18.82 sec (4 I/O)
    scomp              1.05
    sout, smig         1.00

  NCSA Origin 2000     n = 5
    Tcomp              26.7 sec (long), 8.9 sec (short)
    Tout               0.29 sec (1 I/O), 0.26 sec (2 I/O), 0.24 sec (4 I/O)
    Tmig               20.51 sec (1 I/O), 17.78 sec (2 I/O), 16.00 sec (4 I/O)
    scomp, sout, smig  1.00

Table 3: The parameters plugged into the performance models for validation of the computation-intensive application, for 16 MB of output and 16 MB of migrated data.
The parameters to be plugged into the equations were measured prior to the run of the synthetic application (Table 3). All the values presented here are averaged over five or more runs. The length of the computation cycle, Tcomp, was measured by running the computation part of the synthetic application once. Each computation period was measured at the last compute processor to finish the computation. In this experiment, we controlled the number of floating point operations in a computation phase so as to perform data migration both with a long computation cycle (the computation period is longer than the migration period) and with a short computation cycle (a migration takes longer than a computation period). With immediate migration, the length of a computation phase does not affect performance, so we only used the short computation cycle.
Tout was calculated assuming that Panda can achieve on average 85% of the peak file system throughput, as advertised in [15]. We used the same value of Tout for both dedicated and non-dedicated I/O, because both I/O methods perform similarly for small data. Also, due to the small output size, we assumed that each output would sit in the file cache and be read from there at the time of migration. Thus Tmig was calculated directly from the network benchmark results in Table 2, without considering file read cost. The computation slowdown scomp on the SP was calculated considering only the concurrent transmission, for the same reason.

Total application turnaround time was measured on the supercomputer where the application ran, from the beginning of the run until it finished sending the last output to the remote machine. Tables 4 and 5 show the normalized difference between model-predicted performance and measured performance. The normalized difference was calculated as the difference between the predicted and the measured performance, divided by the measured performance. The normalized differences in Tables 4 and 5 are all under 10%, showing that our performance models can predict migration performance accurately.

5. TUNING AN APPLICATION: ZEUS-MP
ZEUS-MP is a 3D non-relativistic hydrodynamics code developed at the Laboratory for Computational Astrophysics at the University of Illinois at Urbana-Champaign [7]. It is a parallelized version of the legacy code ZEUS-3D, uses MPI, and can run on massively parallel distributed and distributed-shared memory machines such as the IBM SP2 and SGI Origin 2000. It is used to model a wide variety of astrophysical phenomena in three dimensions and for a variety of boundary conditions. It periodically outputs intermediate simulation results, such as velocity, density and internal energy, to HDF files.

We ran ZEUS-MP on the same platforms used in the previous section: the IBM SP at ANL and the SGI Origin 2000 at NCSA. Our runs required 8 compute processors, and we used 1, 2 and 4 dedicated and non-dedicated I/O processors to output the data, as in the previous section. Again, we used the smallest dedicated queue, with 128 processors, on the Origin for result consistency. Each output phase writes out five 8 MB arrays, and each run consists of 10 computation and output phases plus an initialization phase and an initial dump, so 440 MB of data in total is written and has to be migrated. Panda can write out arrays in binary files or in HDF files; we used binary output here.
The data were migrated to a workstation located at Argonne named Pitcairn, a Sun Enterprise 4000 with 8 248 MHz processors running Solaris 2.7. We chose Pitcairn instead of Bunny for its faster network bandwidth, because the low bandwidth between each platform and Bunny would make the migration of 440 MB of data very long. The network bandwidth between the ANL SP and Pitcairn with a single sender averages 2.67 MB/sec, almost three times the bandwidth between the SP and Bunny when using 4 streams. ANL and NCSA are directly connected via an advanced backbone network called Abilene [1], which makes the connection between the two sites very fast. The measured bandwidth between the Origin and Pitcairn averages 11 MB/sec.

5.1 Predicting the Performance of ZEUS-MP
To predict the best method for migrating the data generated by ZEUS-MP on each platform, we used the migration approaches modeled in Section 3, plus the traditional approach where we manually migrate the data after the execution. We did not consider immediate migration here, because it is the slowest of the migration methods proposed in this paper: its data transmission is not overlapped with any application phase. Until the available Internet bandwidth improves, immediate migration is really only suitable for interactive steering of an application.

The parameters used in the equations were measured by a sample run of ZEUS-MP without any migration, to determine Tcomp, and by the microbenchmarks used in the previous section for the other parameter values (Table 6). All the microbenchmark results were averaged over five or more runs. As mentioned earlier, each run has an initialization phase, which is shorter than a computation phase, so Tcomp(1) differs from the other Tcomp(i)s. Again, Tout is computed as 85% of the file system microbenchmark write throughput on each platform, to approximate Panda's output performance. The aggregate network bandwidth between the SP and Pitcairn is independent of the number of streams used, while the bandwidth between the Origin and Pitcairn increases as the number of streams increases. However, the network between the Origin and Pitcairn is almost saturated with two streams.

To compute Tmig, we needed to decide whether the output would be read from the disk or from the file cache on the SP. The current file system configuration on the SP sets the size of the file cache between 50 MB and 200 MB on each I/O processor, so the output data are likely to be in the file cache at migration time. Therefore, we calculated Tmig here from the data transmission microbenchmark only. On the Origin, we do not consider the problem of computation slowdown due to disk reads for migration, because of the large available memory.
scomp on the SP was calculated considering concurrent data transmission but not concurrent file reads, again because of the large available memory. If reads from the disk were involved in a migration, scomp would become higher, since reads slow down computation significantly, according to the benchmark results in Section 4. However, we do not expect that this would change scomp much, because the file system bandwidth is higher than the Internet bandwidth, so most of the migration time would still be spent in data transmission rather than in reads.
  ANL SP                                           Long Computation           Short Computation
                                                   1 I/O    2 I/O    4 I/O    1 I/O    2 I/O    4 I/O
  shared-thread migration                           0.409   -0.277   -3.280    5.742    3.215    0.595
  dedicated-thread migration (dedicated I/O)       -0.231   -1.243   -1.519    4.467   -3.406   -2.075
  dedicated-thread migration (non-dedicated I/O)   -0.233   -0.088   -0.299    5.089    3.060   -4.362
  immediate migration (dedicated I/O)                                         -2.888   -2.126   -1.098
  immediate migration (non-dedicated I/O)                                      0.343   -5.502   -2.157

  NCSA Origin 2000                                 Long Computation           Short Computation
                                                   1 I/O    2 I/O    4 I/O    1 I/O    2 I/O    4 I/O
  shared-thread migration                          -4.179   -4.194   -3.860   -6.155   -5.509   -7.323
  dedicated-thread migration (dedicated I/O)       -6.225   -5.552   -5.458   -6.239   -5.805   -6.617
  dedicated-thread migration (non-dedicated I/O)    0.703    0.944    1.039    2.357    3.801   -0.020
  immediate migration (dedicated I/O)                                          2.657    4.815    3.681
  immediate migration (non-dedicated I/O)                                     -0.043    1.734    0.066

Table 4: The normalized difference (in percent) between predicted and actual average turnaround times, for the computation-intensive synthetic application.

  ANL SP                                           Long Computation           Short Computation
                                                   1 I/O    2 I/O    4 I/O    1 I/O    2 I/O    4 I/O
  shared-thread migration                          -0.988   -0.536    1.019    6.488    6.122    3.881
  dedicated-thread migration (dedicated I/O)       -1.151   -0.099    0.025    5.441    8.923   -2.556
  dedicated-thread migration (non-dedicated I/O)   -0.652   -0.085   -0.299    7.007    3.116    9.043
  immediate migration (dedicated I/O)                                         -4.141   -1.098   -2.555
  immediate migration (non-dedicated I/O)                                      6.418    0.855    4.133

  NCSA Origin 2000                                 Long Computation           Short Computation
                                                   1 I/O    2 I/O    4 I/O    1 I/O    2 I/O    4 I/O
  shared-thread migration                           1.892    0.773   -2.257   -2.115    4.126    2.449
  dedicated-thread migration (dedicated I/O)       -3.892    1.002    4.124    3.686    2.663    5.992
  dedicated-thread migration (non-dedicated I/O)    4.842   -1.237    3.390    1.469    2.094   -2.059
  immediate migration (dedicated I/O)                                          2.007    4.662    4.357
  immediate migration (non-dedicated I/O)                                      2.003    2.320   -1.352

Table 5: The normalized difference (in percent) between predicted and actual average turnaround times, for the communication-intensive synthetic application.

Since ZEUS-MP's computation phases can include both pure computation and local communication among compute processors, the computation slowdown factors were calculated from the fraction of computation and communication in the run, as reported by ZEUS-MP's built-in profiling facility (91.7% computation and 8.3% communication), combined with the slowdown factors measured by the benchmark programs. As in the previous section, we do not consider any slowdown of output or migration on the SP. All the slowdown factors for the Origin are again 1.

Tables 7 and 8 show our predicted performance, the actual performance measured, the normalized differences, and how expensive each migration method is. To calculate the cost of a run, we multiplied the number of processors used for the run by the overall execution time, which is the most common way that supercomputing centers charge their users. Although we used the dedicated queue to ensure result consistency on the Origin, we charge as for the shared queue, because that is where users typically run. In the case where we create dedicated migration processes on the Origin, we use the total number of processes created, since the newly created processes will run on new processors. For example, if 8 compute processors, 4 dedicated I/O processors, and the dedicated-thread approach are used, the total number of processes is 16. If 4 non-dedicated I/O processors are used, the total is 12.
For the traditional migration approach, the cost is calculated by multiplying the application execution time by the number of processors used; the time spent for migration is not included, because the migration does not consume supercomputer service units. The traditional approach is therefore cheap, although it shows the worst turnaround time, since migration is not overlapped with anything. The expected cost of manual migration of 440 MB of data using ordinary sequential FTP is 165.41 sec between the ANL SP and Pitcairn, and 67.7 sec between the Origin and Pitcairn. (Since FTP between the Origin and Pitcairn is prohibited by the sites' security policies, we estimated the FTP performance between the two machines using the network bandwidth when one data stream is used.) This is only the file transfer time; user interaction time is not included. The performance of the traditional approach in the tables was measured using dedicated I/O processors. However, we do not consider the traditional migration approach in the following discussion: it may be competitive with the other proposed approaches costwise, but it still takes too much time to finish.

The normalized differences presented in Tables 7 and 8 are under 10%. On the SP, when dedicated-thread migration was used, Tcomp + Tout is greater than Tmig in most cases. This means that all migrations but the last one are completely hidden, so the cost of migration does not influence turnaround time very much. The normalized differences for these cases on the SP are small compared to the shared-thread migration cases. More analysis of these results is presented in the next subsection.
  ANL SP               n = 11
    Tcomp              Tcomp(1) = 4.2 sec, otherwise 11.7 sec
    Tout               7.03 sec (1 I/O), 3.53 sec (2 I/O), 1.75 sec (4 I/O)
    aggregate network bandwidth to Pitcairn   2.66 MB/sec (1, 2 and 4 I/O)
    Tmig               14.98 sec (1, 2 and 4 I/O)
    scomp              1.15
    sout               1.00
    smig               1.00

  NCSA Origin 2000     n = 11
    Tcomp              Tcomp(1) = 1.6 sec, otherwise 4.4 sec
    Tout               0.73 sec (1 I/O), 0.65 sec (2 I/O), 0.60 sec (4 I/O)
    aggregate network bandwidth to Pitcairn   6.7 MB/sec (1 I/O), 10.7 MB/sec (2 I/O), 11.0 MB/sec (4 I/O)
    Tmig               5.97 sec (1 I/O), 3.74 sec (2 I/O), 3.64 sec (4 I/O)
    scomp, sout, smig  1.00

Table 6: The parameter values used for ZEUS-MP.

  1 I/O processor                                  predicted (cost)    measured (cost)     norm. diff. (%)
  traditional                                      363.94 (1786.77)    349.55 (1657.26)     4.118
  shared-thread migration                          246.32 (2216.92)    229.70 (2067.30)     7.237
  dedicated-thread migration (dedicated I/O)       213.51 (1921.60)    198.23 (1784.07)     7.709
  dedicated-thread migration (non-dedicated I/O)   231.06 (1848.49)    249.33 (1994.64)    -7.327

  2 I/O processors                                 predicted (cost)    measured (cost)     norm. diff. (%)
  traditional                                      325.44 (1600.30)    327.41 (1620.00)    -0.602
  shared-thread migration                          207.82 (2078.24)    202.24 (2022.40)     2.761
  dedicated-thread migration (dedicated I/O)       172.52 (1725.24)    172.20 (1722.00)     0.188
  dedicated-thread migration (non-dedicated I/O)   192.56 (1540.49)    194.67 (1557.36)    -1.083

  4 I/O processors                                 predicted (cost)    measured (cost)     norm. diff. (%)
  traditional                                      305.86 (1685.40)    307.74 (1707.92)    -0.610
  shared-thread migration                          188.24 (2258.93)    175.73 (2108.78)     7.120
  dedicated-thread migration (dedicated I/O)       170.74 (2048.93)    170.17 (2042.04)     0.337
  dedicated-thread migration (non-dedicated I/O)   172.98 (1383.85)    175.03 (1400.24)    -1.171
Table 7: The turnaround time (in seconds) and cost in service units of ZEUS-MP on the ANL SP with migration to Pitcairn.

5.2 Using Performance Predictions to Tune ZEUS-MP
Choosing the best migration strategy is not a simple task. Users may place different priorities on our two metrics, application turnaround time and cost in service units, when comparing migration performance. To be general, we allow users to weight the two metrics according to their needs. Then the overall rating P for a particular application, platform, and migration approach can be written as:

    P = (wt * turnaround time) + (wc * service units)
      = Ttotal (wt + wc * Nproc)

where wt and wc are the weights for the two metrics, Ttotal is the turnaround time, and Nproc is the number of processors used.
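As an example, a small helper (ours; the predictions and processor counts below are taken from Table 7, for the SP run with 8 compute processors and 2 I/O processors) evaluates P for each candidate approach and picks the minimum.

# Rate each approach by P = Ttotal * (wt + wc * Nproc) and pick the best.
def best_approach(candidates, wt, wc):
    return min(candidates.items(),
               key=lambda kv: kv[1][0] * (wt + wc * kv[1][1]))

# predicted Ttotal (sec) and processor count, from Table 7 (2 I/O processors)
candidates = {
    "shared-thread":                        (207.82, 10),
    "dedicated-thread (dedicated I/O)":     (172.52, 10),
    "dedicated-thread (non-dedicated I/O)": (192.56, 8),
}
print(best_approach(candidates, wt=1.0, wc=0.0))  # fastest: dedicated I/O
print(best_approach(candidates, wt=0.0, wc=1.0))  # cheapest: non-dedicated I/O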
Given the combined metric, we propose the following method for choosing the best migration approach. For a given number of I/O processors, first predict the turnaround time of each migration method and count the number of processors used by that method. Then, based on the user-specific wt and wc, calculate P using the above equation to decide which migration approach best fits the user's needs. This can be repeated for different numbers of I/O processors, with some caveats described below.

Our experiments reveal some general conclusions about the relative merits of the different migration approaches on the platforms we used. First, if Tcomp >= Tmig, the shared-thread and dedicated-thread approaches using dedicated I/O will show the same performance, since only the last migration is visible. The former has no cost advantage on the SP, but will be cheaper on the Origin. Further, the non-dedicated approach on the Origin will use the same number of processors as the shared-thread approach, which makes it no better than the shared-thread approach. If Tcomp < Tmig but scomp Tcomp + sout Tout >= smig Tmig, then all the migrations except the last one can still be hidden using the
  1 I/O processor                                  predicted (cost)    measured (cost)    norm. diff. (%)
  traditional                                      119.25 (482.18)     126.53 (547.74)    -5.756
  shared-thread migration                           75.25 (677.21)      79.96 (719.64)    -5.897
  dedicated-thread migration (dedicated I/O)        68.00 (611.96)      73.39 (733.90)    -7.351
  dedicated-thread migration (non-dedicated I/O)    68.00 (611.96)      75.50 (679.53)    -9.944

  2 I/O processors                                 predicted (cost)    measured (cost)    norm. diff. (%)
  traditional                                      118.42 (527.5)      121.58 (559.1)     -2.598
  shared-thread migration                           56.49 (564.9)       58.64 (586.40)    -3.666
  dedicated-thread migration (dedicated I/O)        56.49 (677.88)      58.96 (707.52)    -4.189
  dedicated-thread migration (non-dedicated I/O)    56.49 (564.90)      59.71 (597.10)    -5.393

  4 I/O processors                                 predicted (cost)    measured (cost)    norm. diff. (%)
  traditional                                      117.87 (626.40)     123.34 (691.98)    -4.434
  shared-thread migration                           55.84 (670.08)      59.18 (710.16)    -5.644
  dedicated-thread migration (dedicated I/O)        55.84 (893.44)      58.15 (930.40)    -3.972
  dedicated-thread migration (non-dedicated I/O)    55.84 (670.08)      58.76 (705.12)    -4.969
Table 8: The performance (turnaround time in seconds and cost in service units) of ZEUS-MP on the NCSA Origin 2000 with migration to Pitcairn.

dedicated-thread approach. Since slowdown factors are involved in the dedicated-thread approach, there is still a possibility that the shared-thread approach will work better if the slowdowns are large, but the slowdowns measured in our experiments make this very unlikely. When the migration cost cannot be hidden by any method, we again have to compare the performance of all three methods. However, shared-thread migration is not likely to beat dedicated-thread migration in this case either, due to the small slowdowns we measured on both platforms.

On the Origin, migration using non-dedicated I/O is an attractive solution for several reasons. First, as long as there is hardware available, we will not experience slowdown from concurrent operations. Also, [5] suggested that non-dedicated I/O performance is comparable to dedicated I/O performance on this platform. Costwise, it is also superior to the dedicated I/O approach.

As discussed earlier, in many cases the user should decide between dedicated I/O for better turnaround time and non-dedicated I/O for lower cost, according to her needs. This applies when there are extra processors that can be dedicated to I/O. But which approach gives better performance if we do not have extra processors for dedicated I/O? Should we reduce the number of compute processors to dedicate some to I/O? Or should we use non-dedicated I/O processors while keeping the number of compute processors the same? To answer these questions, we make two assumptions which do not hold in general, but are useful for the kind of back-of-the-envelope performance prediction we are doing here. First, we assume that the time to compute s MB of output data increases linearly with s. Second, we assume that a problem can be divided equally among any number of compute processors. Then the computation and output times for each I/O method, when the number of available processors is fixed, are given below. Here, n is the number of available processors, s is the size of data in MB, T(s) is the length of a computation phase
                 dedicated I/O        non-dedicated I/O
  $T_{comp}$     $T(s)/(n-w)$         $T(s)/n$
  $T_{out}$      $(s/w)/D_D(w)$       $(s/w)/D_{ND}(w)$
These formulas omit the cost of local communication between compute and I/O processors at I/O time. We can plug them into the performance models in Section 3 to calculate the estimated turnaround time. For example, suppose only 8 processors are available, we need 2 I/O processors, and we want to determine which I/O method (6 compute processors plus 2 dedicated I/O processors vs. 8 compute processors with 2 non-dedicated I/O processors) will perform better. For simplicity, assume the I/O time is the same for both approaches. Then we have

$$\frac{T(s)}{n - w} = s_{comp} \, \frac{T(s)}{n}.$$

Substituting $n = 8$ and $w = 2$ gives $s_{comp} = 1.33$. Therefore, if the computation slowdown on non-dedicated I/O processors exceeds 33%, having 6 compute processors and 2 dedicated I/O processors will work better; otherwise, using 2 non-dedicated I/O processors will be better.
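This break-even check is easy to mechanize. The sketch below rests on the same two assumptions (linear computation time, perfect divisibility); the function name is ours:

# Break-even computation slowdown for dedicating w of n processors to I/O
# versus using w non-dedicated I/O processors, assuming equal I/O time:
#     T(s)/(n - w) = s_comp * T(s)/n   =>   s_comp = n / (n - w)

def breakeven_slowdown(n, w):
    """Slowdown on non-dedicated I/O processors at which both layouts tie."""
    return n / (n - w)

print(breakeven_slowdown(n=8, w=2))  # ~1.33: above a 33% slowdown,
                                     # dedicating 2 of 8 processors wins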
With or without extra processors for dedicated I/O, the number of I/O processors should be chosen very carefully, because the aggregate performance of shared resources such as the file system, the interconnect, or the Internet connection may drop if too many processes use them concurrently. When $T_{mig}$ is longer than $T_{comp} + T_{out}$ (the short-computation-period case), the key idea in minimizing turnaround time is that achieving faster transmission matters more than achieving faster output. For a given number of I/O processors, the aggregate network bandwidth will usually be much lower than the aggregate file system throughput. Thus, with a short computation period, the best turnaround time will usually be achieved when the number of I/O processors equals the number of streams that gives the peak aggregate data transmission rate, even if the aggregate file system performance is then below its peak. Figure 4 presents an algorithm for choosing the number of I/O processors that yields the best turnaround time for a given I/O strategy.
If we have to choose between competing platforms for an application run, the general rule of thumb is to choose the pair of platforms with the highest Internet bandwidth between them. For example, if our earlier experiments had considered the machine pairs ANL SP/Pitcairn and NCSA Origin/Bunny, then the SP would have offered faster turnaround in every configuration than the Origin, even though that SP is very slow compared to the Origin.
We can further tune application performance by controlling the frequency of output or the amount of data per output operation (i.e., the number of variables saved). If a migration takes longer than a computation phase but we want to reduce the application turnaround time as much as possible, we can modify the application to have longer computation periods (by reducing the frequency of output operations) or to output less, so that a migration can complete within a computation period (or within a computation plus an output period, if we overlap migration with output). Similarly, if a computation period is longer than a migration, we can output more frequently or increase the number of variables in each output. We can determine the output frequency or the number of variables to output using a simple equation. For example, when we run ZEUS-MP with one dedicated I/O processor and a dedicated migration thread, a migration always finishes before the next computation period begins, leaving idle time that could be used to migrate data. If we assume that a run migrating $x$ MB of data per output operation has the same computation time as before, with correspondingly longer output and migration times, we have

$$\frac{x}{D} \, s_{out} = \left( \frac{x}{B} - T_{comp} \right) s_{mig},$$

where $D$ is the aggregate I/O system throughput and $B$ is the aggregate network bandwidth between the supercomputer and the remote machine. Solving this equation with the parameters we used for the ZEUS-MP performance prediction gives $x = 51.1$ MB, which matches the actual measured performance in Figure 5. The solid line in the graph represents the total elapsed time, including computation, output, and migration; the dotted line represents the elapsed time until the computation finished. As the graph shows, the gap between the two lines widens after around 50 MB, indicating that more output data is being migrated after all the computation has finished. The results tell us that if we have more data to output, we can output 10 MB more per output operation than we do now without hurting overall performance. (It is inevitable that the total elapsed time increases with more output, because each output operation and the final migration take longer.) Similar equations can be applied to other data migration approaches, or to the output-frequency problem.
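The same equation can be solved in closed form for other parameter settings. In the sketch below, the function name is ours and the example throughput and bandwidth values are hypothetical stand-ins, not the measured ZEUS-MP parameters:

# Solve (x/D)*s_out = (x/B - T_comp)*s_mig for x, the largest amount of
# output per operation (MB) whose migration is still hidden:
#     x = s_mig * T_comp / (s_mig/B - s_out/D)

def max_hidden_output_mb(t_comp, d_io, b_net, s_out=1.0, s_mig=1.0):
    """t_comp: computation phase length (s); d_io: aggregate I/O system
    throughput (MB/s); b_net: aggregate network bandwidth (MB/s);
    s_out, s_mig: output and migration slowdown factors."""
    denom = s_mig / b_net - s_out / d_io
    if denom <= 0:
        raise ValueError("output is the bottleneck; migration is always hidden")
    return s_mig * t_comp / denom

# Hypothetical example: 60 s computation phases, an 80 MB/s file system,
# and a 1.2 MB/s wide-area link:
print(max_hidden_output_mb(t_comp=60, d_io=80, b_net=1.2))  # ~73.1 MB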
6. RELATED WORK
Wide-area distributed file systems [3, 13, 17] provide convenient interfaces for accessing remotely located files, and can also be used to migrate data. Because they are general-purpose distributed file systems for sharing geographically distributed files, they usually do not achieve good performance on high-performance computing workloads. Among the few works that focus specifically on data migration, [11] experimented with data migration for a black hole simulation and found that migration can be made nearly invisible by overlapping it with subsequent computation. [12] discussed design and performance issues in migrating data using a web server and CGI scripts. The Globus project [9] aims to develop middleware that provides services for high-performance grid computations, and its remote data access paradigms therefore match well with this work. The remote I/O (RIO) library [8] is a high-level tool for accessing files located on remote file systems from parallel programs. It is very closely related to the work presented in this paper, in the sense that it targets parallel applications and aims for high performance by overlapping communication with computation. Global Access to Secondary Storage (GASS) [4] is a more general data movement and access service, optimized for several I/O patterns that are common in high-performance grid computation. [18] is another approach to accessing remotely located files efficiently, using an object-oriented, application-specific approach. Interactive steering of scientific applications is attracting increasing attention, but research in this area is still in its infancy; so far, a few research projects [10, 14] have addressed this problem. All of these approaches to data migration can take advantage of our approach to modeling and performance prediction.
[6] discusses overlapping collective I/O with computation or communication to improve collective I/O performance. This is relevant to our migration approach, since the file read operations in migration can be overlapped with computation or communication. In fact, their results can be used to compute slowdown factors for I/O overlapped with other activities. However, the resulting slowdown factors are not directly comparable to ours; for example, the authors used a separate parallel file system on the SP, instead of the local disks we used in our experiments. Their results show that computation on the SP was not slowed down much by concurrent parallel file system writes, which implies that data migration utilizing a parallel file system on the SP is likely to work well too.
7. SUMMARY AND CONCLUSIONS
In this paper, we proposed data migration strategies using I/O processors for parallel scientific simulation applications and provided a performance model for each approach. These models were empirically verified using simulated applications. We also showed how to use these models to answer the questions that often arise when running this kind of application and migrating the data it produces. In particular, we predicted, measured, and tuned the data migration performance of a real scientific application, ZEUS-MP, on two different platforms. Our study shows that the performance models are useful in helping scientists choose the migration approach best suited to their needs.
nstream <- the minimum number of concurrent streams that saturates the network bandwidth;
if (dedicated I/O) then
    navailable <- the number of available processors that are not needed for computation;
else
    navailable <- the number of compute processors;
endif
ntemp <- MIN(nstream, navailable);
if (current file system == non-shared) then
    nbest <- ntemp;  // non-shared file systems usually offer perfect scalability as the number of writers increases
else begin
    nIO <- the number of concurrent writers that maximizes file system write throughput;
    if (ntemp <= nIO) then
        nbest <- ntemp;
    else
        nbest <- x such that T(x) = MIN over nIO <= i <= ntemp of T(i), where T(i) is the turnaround time with i I/O processors;
    endif
end
endif
use nbest I/O processors for migration;
Figure 4: A heuristic algorithm for choosing the number of I/O processors that yields the best turnaround time for a given I/O strategy.
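The Figure 4 heuristic translates directly into executable form. In this sketch, the function and parameter names are our own; the caller must supply the measured quantities, and any turnaround predictor (such as the models of Section 3) can be passed in as the callable:

# Runnable rendering of the Figure 4 heuristic for picking the number of
# I/O processors to use for migration.

def choose_io_processors(n_stream, n_available, shared_fs,
                         n_io_peak=None, turnaround=None):
    """n_stream: fewest concurrent streams that saturate network bandwidth;
    n_available: processors usable as writers (extra processors under
    dedicated I/O, otherwise the compute processors);
    shared_fs: True if the file system is shared among writers;
    n_io_peak: writer count maximizing file system write throughput
    (needed only when shared_fs is True);
    turnaround: callable mapping i to predicted turnaround with i writers."""
    n_temp = min(n_stream, n_available)
    if not shared_fs:
        # Non-shared file systems usually scale with the number of writers.
        return n_temp
    if n_temp <= n_io_peak:
        return n_temp
    # Otherwise search [n_io_peak, n_temp] for the best predicted turnaround.
    return min(range(n_io_peak, n_temp + 1), key=turnaround)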
[Figure 5 (plot): total elapsed time in seconds (y-axis, roughly 140 to 390) versus the amount of output per output operation in MB (x-axis, 16 to 72), with one curve for elapsed time with migration and one for elapsed time without migration.]
Figure 5: The actual change in total elapsed time for ZEUS-MP when the amount of data for each output phase varies.
8. ACKNOWLEDGMENTS
This research is funded by the U.S. Department of Energy through the Center for Simulation of Advanced Rockets at the University of Illinois at Urbana-Champaign under subcontract number B341494, and by a Computational Science and Engineering Fellowship from the University of Illinois at Urbana-Champaign. We gratefully acknowledge the use of the advanced computing resources at Argonne National Laboratory and the National Center for Supercomputing Applications. We also thank Bob Fiedler for his help with ZEUS-MP.
9. REFERENCES
[1] Abilene home page. http://www.ucaid.org/abilene.
[2] NCSA HDF home page. http://hdf.ncsa.uiuc.edu.
[3] A. Alexandrov, M. Ibel, K. Schauser, and C. Scheiman. Extending the operating system at the user level: The UFO global file system. In Proceedings of the 1997 Annual Technical Conference on UNIX and Advanced Computing Systems (USENIX '97), January 1997.
[4] J. Bester, I. Foster, C. Kesselman, J. Tedesco, and S. Tuecke. GASS: A data movement and access service for wide area computing systems. In Proceedings of the Sixth Workshop on I/O in Parallel and Distributed Systems, May 1999.
[5] Y. Cho, M. Winslett, J. Lee, Y. Chen, S. Kuo, and K. Motukuri. Collective I/O on a SGI Cray Origin 2000: Strategy and performance. In Proceedings of the 1998 International Conference on Parallel and Distributed Processing Techniques and Applications, July 1998.
[6] P. M. Dickens and R. Thakur. Improving collective I/O performance using threads. In Proceedings of the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing, April 1999.
[7] R. A. Fiedler. Optimization and scaling of shared-memory and message-passing implementations of the Zeus hydrodynamics algorithm. In Proceedings of SC97, November 1997.
[8] I. Foster, J. D. Kohr, R. Krishnaiyer, and J. Mogill. Remote I/O: Fast access to distant storage. In Proceedings of the Fifth Workshop on I/O in Parallel and Distributed Systems, November 1997.
[9] I. Foster and C. Kesselman. Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications, 11(2):115-128, 1997.
[10] G. Geist, J. Kohl, and P. Papadopoulos. CUMULVS: Providing fault-tolerance, visualization and steering of parallel applications. In Proceedings of the Environments and Tools for Parallel Scientific Computing Workshop, August 1996.
[11] S. Kuo, M. Winslett, Y. Chen, Y. Cho, M. Subramaniam, and K. Seamons. Application experience with parallel input/output: Panda and the H3expresso black hole simulation on the SP2. In Proceedings of the 8th SIAM Conference on Parallel Processing for Scientific Computing, March 1997.
[12] J. Lee. Web-based data migration for high-performance scientific codes. Master's thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1999.
[13] J. Morris, M. Satyanarayanan, M. Conner, J. Howard, D. Rosenthal, and F. Smith. Andrew: A distributed personal computing environment. Communications of the ACM, 29(3):184-201, 1986.
[14] R. Ribler, J. Vetter, H. Simitci, and D. Reed. Autopilot: Adaptive control of distributed applications. In Proceedings of the 7th IEEE Symposium on High-Performance Distributed Computing, July 1998.
[15] K. Seamons, Y. Chen, P. Jones, J. Jozwiak, and M. Winslett. Server-directed collective I/O in Panda. In Proceedings of Supercomputing '95, December 1995.
[16] M. Subramaniam. High performance implementation of server directed I/O. Master's thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, 1996.
[17] A. Vahdat, P. Eastham, and T. Anderson. WebFS: A global cache coherent file system. Technical report, Department of Computer Science, University of California, Berkeley, 1996.
[18] J. Weissman. Smart file objects: A remote file access paradigm. In Proceedings of the Sixth Workshop on I/O in Parallel and Distributed Systems, May 1999.