On Future Global Grid Communication Performance

Craig Lee and James Stepanek
Computer Systems Research Department, M1-102
The Aerospace Corporation, P.O. Box 92957
El Segundo, CA 90009-2957
{lee|stepanek}@aero.org

Appears in: The Tenth Heterogeneous Computing Workshop, April 23, 2001, San Francisco, California.
Abstract

For a grid network performance data set, we estimate propagation distances to obtain a lower bound on propagation delays. With a model of the primary performance factors and assumptions about expected performance trends, we extrapolate to estimate the communication performance of a global grid in ten years' time. Communication pipes are getting fatter but, due to simple propagation delays, are not getting commensurately shorter. Hence, bandwidth-delay products will rise, correlated with distance. Based on conservative estimates, bandwidth-delay products will rise by a median factor of 5.2x, with the distribution increasing toward a strong mode at 7.1x. This clearly indicates that latency tolerance must be integral to applications at increasingly smaller scales, using established techniques such as caching, compression, and pre-fetching, coupled with coarse-grain, data-driven execution models that hide latency with throughput. With the extrapolated performance data, we use a simple pipeline model to estimate an "operating region" for work granularity and the number of computational threads needed to hide latency with throughput. We then discuss implications for programming and execution models.
1. Introduction

Grid computing endeavors to make all manner of compute resources, large and small, easily available and manageable [6]. This is largely enabled by the network communications that form the basis of the Internet and the World Wide Web. The wide applicability of network communications to science, engineering, education, commerce, and to the fabric of modern society has accelerated its development and deployment beyond most people's expectations. With such a large number of widely distributed and
easily accessible hosts, however, the issue of performance over this vast networked infrastructure becomes a central question. Our interest here is to take a larger system view and understand the performance implications for grid computations both under current conditions and under conditions extrapolated for the next ten years.

What governs grid communication performance? End-to-end performance can be modeled as end-to-end latency partitioned into (1) a speed-of-light propagation delay, (2) a transmission delay based on the data volume sent and the bandwidth, and (3) overhead, including end-host delays and router queuing delays [13, 18]:

    Latency = Prop + Trans + Overhead
            = Dist/SoL + Vol/BW + Overhead

Included in this, of course, are fixed, one-time costs, per-byte costs, and per-hop costs. We can, however, make a simple distinction between first-byte and last-byte latencies. Assuming that the first-byte latency is simply the propagation delay plus any queuing overhead, the last-byte latency is then the first-byte latency plus the transmission delay of the entire data volume:

    Latency_first-byte = Prop + Overhead
    Latency_last-byte  = Latency_first-byte + Vol/BW

This is certainly an over-simplification since there are per-byte overheads involved in data transfers. For the purposes of this paper, however, we will use this idealized transmission delay since per-byte and per-hop overheads could be folded into a slightly lower effective bandwidth. Whether first- or last-byte latencies dominate depends on the typical bandwidths, data volumes, and distances involved. Real-world network performance will also be affected by factors such as protocols, competing traffic, congestion, drop rates, and also memory and backplane speeds. We note that while packets, rather than bytes, are the typical unit of transfer, this distinction is not critical for the purposes of this paper. For most applications, just the available end-to-end bandwidth and latency will be the primary factors driving communication performance.

Given that grid communication performance is primarily dependent on bandwidth and latency and the aspect ratio between them (the bandwidth-delay product), what is the current state of grid performance and what will it be in the not-so-distant future? Given that networks and processors will be getting faster and that many grid testbeds will be globally distributed, how important will simple first-byte latencies be? Given that many applications will need to run in a distributed environment but will have insufficient data volumes to "keep the pipes full", how important will first-byte latencies be? What kind of techniques can be brought to bear to address this problem?

To investigate this issue, a snapshot was taken of the bandwidth and latency performance of a global grid environment. After making assumptions about the expected trends in processor and network performance, we extrapolate to estimate the communication performance of a global grid in ten years' time. With this extrapolated performance, we use a simple pipeline model to estimate an "operating region" for the work granularity and number of threads of computation needed to hide latency with throughput. We then discuss the implications for programming and execution models. First, we discuss some related work.
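To make the idealized latency model above concrete, the following minimal sketch computes first-byte and last-byte latencies. The function names are hypothetical, the signal propagation speed is the 2.3 x 10^8 m/s value used later in the paper, and the example volume, distance, and overhead values are illustrative rather than measured.

```python
# Minimal sketch of the idealized latency model described above.
# Names and example numbers are illustrative, not from the paper's data set.

SIGNAL_SPEED_M_PER_S = 2.3e8  # approximate signal propagation speed used in the paper

def first_byte_latency(distance_m: float, overhead_s: float) -> float:
    """Propagation delay plus fixed end-host/router overhead (including queuing)."""
    return distance_m / SIGNAL_SPEED_M_PER_S + overhead_s

def last_byte_latency(distance_m: float, overhead_s: float,
                      volume_bits: float, bandwidth_bps: float) -> float:
    """First-byte latency plus the transmission delay of the entire data volume."""
    return first_byte_latency(distance_m, overhead_s) + volume_bits / bandwidth_bps

if __name__ == "__main__":
    # Example: an 8000 km path, 10 ms of overhead, and a 1 MB transfer
    # over a 2.2 Mbit/s path (the measured median bandwidth reported later).
    d, ovh, vol, bw = 8.0e6, 0.010, 8.0e6, 2.2e6
    print(f"first-byte: {first_byte_latency(d, ovh) * 1e3:.1f} ms")
    print(f"last-byte : {last_byte_latency(d, ovh, vol, bw) * 1e3:.1f} ms")
```

In this illustrative case the transmission delay dominates; with small data volumes or much higher bandwidths, the first-byte (propagation-plus-overhead) term dominates instead.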
Figure 1. Globus testbed BW distribution. (Histogram: bandwidth (Mbits/s), 100 log-sized bins, vs. count.)

Figure 2. Globus testbed RTT distribution. (Histogram: latency (ms), 100 log-sized bins, vs. count.)
2. Related Work
Network performance has generated an immense body of knowledge. The vast majority of it, however, focuses on the performance of individual hosts, flows, routes, or routers. Given the impact of the Internet and networking in general, there is a growing focus on the performance of the Internet or networks as a whole. Organizations such as the National Laboratory for Applied Network Research [14] and the Cooperative Association for Internet Data Analysis [4] are producing a more comprehensive understanding (and pictures) of aggregate network characteristics. NLANR's Active Measurement Program [16] is similar to the collection methodology used here in that a number of widely distributed sites make active measurements of network performance. Wide-spread performance instrumentation that is integrated into the network infrastructure is an important topic for several projects, including the Network Weather Service [23], the Next Generation Internet Initiative [15], and the Grid Forum [8]. These efforts entail not only the design of sensors that make the basic measurements but also a management scheme that enables sensor control and the discovery, collation, and analysis of performance data by any number of tools and users. The unique focus of this paper, however, is on the fundamental limits of network performance in a global grid environment and their implications. Other performance measurement projects have not focused on these limits from the perspective of a distributed application in the growing field of grid computing. Hence, the source of the data set used here and the mechanics of its collection are actually less important than the use to which it is put.
3. Snapshot of a Global Grid's Performance

Globus [19] is a grid computing toolkit with support for many grid computing functions, e.g., security, remote data access, scheduling, health and status monitoring, performance monitoring, and information services for resource discovery and allocation. Gloperf [11] was a preliminary tool built to make basic network performance measurements on GUSTO, the Globus testbed. Gloperf was based on a library version of netperf [9]. Hence, it did untuned TCP testing from the user level, so it essentially observed the same end-to-end performance as an application. At the time of its use, Gloperf daemons would make periodic bandwidth and latency measurements between hosts identified in the Globus Metacomputing Directory Service (MDS) [5]. (Since the host IP addresses were available in the MDS, no DNS lookups were involved.) Measurement results were also stored in the MDS. To produce the data set used here, a simple program would periodically snapshot the MDS Gloperf data into log files and scripts would then extract just the bandwidth and latency data and eliminate duplicates.

Figure 3. 3D scatterplot of distance, bandwidth and RTT.

Figure 4. RTT vs. Bandwidth scatterplot.

Figure 5. Bandwidth vs. Distance scatterplot.

Figure 6. RTT vs. Distance scatterplot with signal propagation delay line.
Figures 1 and 2 show histograms of Gloperf bandwidth and Round Trip Time (RTT) measurements on the Globus testbed beginning in August and continuing through October, 1999. This represents 17629 unique measurements between 3158 unique host pairs over 138 unique hosts on four continents; mostly in North America but including some in Europe, Asia, and Australia. Note that these histograms employ log-sized bins that make the mode of the distributions much more evident. While it was not uncommon to observe bandwidths as high as 96 Mbits/sec., the median bandwidth for this distribution is 2.2 Mbits/sec. and the 90th percentile is 15.1 Mbits/sec. While the RTT distribution has a much narrower (rhinokurtotic) mode, common latencies span three orders of magnitude. Here, the median is 57.35 msec. and 90% of the latencies are above 5.5 msec.

These histograms provide insight into the overall performance potential of a global grid. What, however, are the constraints on this distribution of bandwidths and latencies? Bandwidths can be expected to increase as networking technology progresses. Latencies can also be expected to decrease with the availability of faster processors, streamlined protocols, etc. Latencies, however, are bounded by simple speed-of-light limitations on the propagation delay. The significance of these bounds depends on the context and distances involved. For grids, we can certainly expect this to be on a global scale.

To investigate this issue, we approximated the propagation delays between all hosts represented in Figures 1 and 2 by identifying the latitude and longitude (lat/lon) of each host IP address and computing the great circle or arc distance between all host pairs. Clearly the arc distance is just a first-order approximation since network routes can be very circuitous. Wider geographic distances may also involve more hops, where each router introduces more delay. Nonetheless, this provides us with a lower bound on the propagation delay and allows us to discuss possible trends. Lat/lons were discovered by using several web-based services that claimed to provide accuracy to one second of arc. In some cases, the lat/lon had to be estimated from an atlas, but even in this case, three digits of accuracy were possible. All hosts at the same institution were assigned the same lat/lon. Hence, 138 unique hosts yielded 42 unique lat/lons. When a measurement's host pair were at the same lat/lon, a distance of 100 meters was arbitrarily assigned. While this figure of 100 meters is indeed arbitrary, it nonetheless allows us to discuss trends in the data.

With this distance estimation, each measurement in this data set can be considered as a triplet of <distance, bandwidth, RTT>, as shown in the 3D scatterplot of Figure 3. To help understand this data set, Figures 4, 5, and 6 present these data as scatterplots in each possible pair of dimensions.

Figure 4 shows a general inverse relationship between bandwidth and latency; the higher the bandwidth, the lower the latency. At relatively low latencies, however, we see groupings of measurements just below 10 Mbit/sec. and 100 Mbit/sec. Presumably these groupings are caused by slow and fast ethernet connections between relatively close host pairs. At higher latencies and lower bandwidths, we see a quantization of measurements caused by the fixed time-length, netperf-style measurement technique. The highest density in this scatterplot is naturally in the center, as implied by the histograms of Figures 1 and 2.

Figure 5 plots bandwidth versus distance. This plot also exhibits a general inverse relationship. While long-haul traffic typically encounters significant competing flows and delays, thereby realizing lower end-to-end bandwidth, some long-haul routes, such as over APAN, can still realize higher than average bandwidth. In this plot, we see the intra-site measurements, all at 100 meters, which exhibit the highest bandwidths. The closest inter-site measurements, at approximately 20 km, indicate the lat/lon resolution typically possible for IP addresses.

Figure 6 plots RTT versus distance. Also plotted on this graph is a line representing the round-trip (RT) signal propagation delay across the same range of distances, based on an approximate electrical signal propagation speed of 2.3 x 10^8 m/sec. (This is somewhat less than the speed of light in a vacuum but slightly more than the speed of light in a fiber.) Naturally the closest hosts exhibit the lowest latencies. The most interesting feature of this graph is the difference between the signal propagation delay and the "floor" exhibited by the measured latencies across the entire range of distances. At 100 meters, this difference is about 2.5 orders of magnitude, i.e., "in the noise". From 20 to 20,000 km, this difference varies from about one order of magnitude to less than one half. Given the under-estimation by the arc distances, true distances would distribute the scatterplot data even closer to the propagation delay line. Clearly there are fixed, one-time overheads represented in all these end-to-end latencies, and over longer distances these overheads become a smaller fraction of the total. The remarkable fact, however, is that so many host pairs are already so close to the signal propagation delay.
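The distance estimation described above can be illustrated with a short sketch. The paper does not specify the exact great-circle formula used; the haversine form below is one common choice, the Earth radius is a standard mean value, and the example coordinates (roughly Los Angeles and Chicago) are illustrative rather than taken from the data set.

```python
import math

EARTH_RADIUS_KM = 6371.0          # assumed mean Earth radius
SIGNAL_SPEED_KM_PER_S = 2.3e5     # ~2.3e8 m/s, as used for the propagation-delay line

def arc_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (arc) distance between two lat/lon points, in km (haversine form)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def min_rtt_ms(distance_km):
    """Lower bound on the round-trip propagation delay for the given arc distance."""
    return 2.0 * distance_km / SIGNAL_SPEED_KM_PER_S * 1e3

# Example with illustrative coordinates (roughly Los Angeles and Chicago).
d = arc_distance_km(34.05, -118.25, 41.88, -87.63)
print(f"arc distance ~{d:.0f} km, minimum RTT ~{min_rtt_ms(d):.1f} ms")
```

Any measured RTT for such a host pair can then be compared against this lower bound, exactly as done for the propagation-delay line in Figure 6.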
4. Bandwidth and Latency Trends

Given the proximity of many host pairs to their signal propagation delay, what performance will be observed as grid infrastructures progress? How can we expect observed bandwidths and latencies to change over the next ten years? This will depend on many factors ranging from physical devices to competing traffic. As a first-order approximation, we can use the bandwidth of the network interface as a trend indicator, since this will be a limiting factor in the end-to-end performance that applications see.

Figure 7 plots the bandwidth of available network interface cards versus the calendar year of their introduction and extrapolates from the year 2000 to 2010. (The actual calendar year data are from [21, 10].) A linear extrapolation from recent improvements puts the available bandwidth at approximately 7.0 Gbit/sec. by 2010. An exponential extrapolation over the span of this data puts the available bandwidth at approximately 40 Gbit/sec. by 2010. In at least a historical context, this means that network bandwidth has been doubling roughly every 2.5 years. Hence, assuming that the deployed bandwidth follows the same trend as the highest available NIC bandwidth (and that competing traffic does not accelerate even faster), it is possible that the available global grid bandwidth will increase up to sixteen-fold in the next ten years.

Figure 7. Past and Extrapolated NIC Bandwidth vs. Calendar Year.
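The sixteen-fold figure follows directly from the doubling period. A minimal sketch of this kind of extrapolation is shown below; the 2.5-year doubling time is the paper's rough historical estimate, and applying the resulting factor to the measured median bandwidth is an illustrative use, not a result from the paper.

```python
# Extrapolating bandwidth growth from a doubling period, as in the discussion above.

def growth_factor(years: float, doubling_period_years: float = 2.5) -> float:
    """Multiplicative growth over `years` given a fixed doubling period."""
    return 2.0 ** (years / doubling_period_years)

if __name__ == "__main__":
    print(growth_factor(10))            # 2**(10/2.5) = 16.0
    # Illustration: scaling the measured median bandwidth of 2.2 Mbit/s sixteen-fold.
    print(2.2 * growth_factor(10))      # ~35 Mbit/s
```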
We note that this rate of increase for network bandwidth is roughly that of processor speeds. Moore's Law has transistor density doubling every 1.5 to 3 years, with a concomitant increase in processing speed and power. Current estimates are that this will hold for another 10 to 15 years [12, 17]. At that time, the efficacy of silicon devices and current photolithography techniques would be questionable, since transistors would have to be composed of only a few hundred atoms. For the next ten years, it is conservative to assume at least a sixteen-fold increase in processor speeds. While memory bandwidth is not increasing as fast as processor cycles (or network bandwidth), presumably architectural methods will be used to meet throughput demands and create balanced systems. Faster processors will allow end-hosts to run the local protocols faster, but the upper limit of the network bandwidth leaving the host will be determined by the network interface. If the last three decades are an indicator, a sixteen-fold increase in bandwidth is not excessively optimistic. Faster processing speeds will also be available at the routers, but queuing delays will still be determined by the available bandwidth and competing traffic.
While historical data provides us with a bandwidth trend, what will happen to first-byte latencies? Figure 8 shows the RTT scatterplot of Figure 6 normalized to the propagation delay for the given distance. For reference, a line is drawn at double the propagation delay. Figure 9 shows the same data with the latency not attributed to propagation delay (the "overhead") reduced sixteen-fold. This indicates that the communication between many host pairs will asymptotically approach the speed-of-light limitation. (As a backhanded benefit, we note that as this limit is approached, the variance in the latency is decreased.) For all host pairs below the "double" line, simple propagation delays will dominate over all other communication overheads. The reader is reminded that these extrapolations are conservative since the arc distances significantly under-estimate the true propagation distances.

Whether the first-byte latency dominates the last-byte latency is still ultimately dependent on the bandwidth and transfer volume. The break-even point, however, will definitely rise. This also means that network pipes will be getting fatter but not commensurately shorter; i.e., the number of bits that the pipe can "hold" (the bandwidth-delay product) will be getting higher. This is shown in the histograms of Figure 10 for the measured and extrapolated data. The measured data has a median of 86.3 Kbits while the extrapolated data has a median of 399.3 Kbits, an increase by a factor of about 4.6. The difference between these two distributions is, of course, caused by the difference between the measured and extrapolated performance between each host pair. Figure 11 gives a histogram of the increases (ratios) in the bandwidth-delay products. The co-located hosts at 100 meters with a low propagation delay have very similar products and, hence, produce a spike close to 1.0. The main part of the distribution shows a general increase to a strong mode at 7.1. This is correlated with distance since larger increases in the bandwidth-delay product will occur at higher propagation delays. The overall distribution has a median of 5.2.

Figure 8. RTT normalized to propagation delay.

Figure 9. Extrapolated RTT normalized to propagation delay.

Figure 10. Bandwidth-Delay Product histograms.

Figure 11. Histogram of Bandwidth-Delay Product increases.
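To illustrate the per-host-pair extrapolation behind Figures 9 through 11, the sketch below applies the assumptions described in the text: bandwidth increases sixteen-fold, the non-propagation portion of the RTT shrinks sixteen-fold, and the propagation delay itself is unchanged. The example distance, bandwidth, and RTT values are hypothetical, not entries from the measured data set.

```python
SIGNAL_SPEED_KM_PER_S = 2.3e5  # propagation speed assumed in the paper

def extrapolate_pair(distance_km, bw_mbps, rtt_ms, speedup=16.0):
    """Return (measured BDP, extrapolated BDP) in Kbits for one host pair,
    under the extrapolation assumptions described in the text."""
    prop_rtt_ms = 2.0 * distance_km / SIGNAL_SPEED_KM_PER_S * 1e3  # lower-bound RT propagation delay
    overhead_ms = max(rtt_ms - prop_rtt_ms, 0.0)                   # latency not attributed to propagation
    new_rtt_ms = prop_rtt_ms + overhead_ms / speedup               # overhead reduced sixteen-fold
    new_bw_mbps = bw_mbps * speedup                                # bandwidth increased sixteen-fold
    bdp_kbits = bw_mbps * rtt_ms                                   # Mbit/s x ms == Kbits
    new_bdp_kbits = new_bw_mbps * new_rtt_ms
    return bdp_kbits, new_bdp_kbits

# Example: a hypothetical transcontinental pair at ~4000 km, 2 Mbit/s, 100 ms RTT.
meas, extrap = extrapolate_pair(4000, 2.0, 100.0)
print(f"{meas:.0f} Kbits -> {extrap:.0f} Kbits (x{extrap / meas:.1f})")
```

For distant pairs the ratio approaches the larger values seen in Figure 11, while for co-located pairs (where almost all of the RTT is overhead) it stays near 1.0, matching the spike described above.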
5. Impact on Future Grid Applications

What will be the impact of such an increase in the distribution of bandwidth-delay products in a global grid? Clearly communication "pipes" will be faster but not proportionately shorter. Processors will be faster but there will be a relative increase in the latency to get data. The previous section derives a conservative lower bound, which means that the true distribution in ten years' time will probably be higher. Regardless of the exact quantitative changes in grid communication performance, the clear trend is that grids will increasingly favor loosely coupled applications over more tightly coupled ones. Many say this is the case now, but it is only going to get worse.

Given these considerations, three broad classes of computations can be defined:

(1) Latency-Bound Computations. Tightly coupled computations will simply be at a severe disadvantage until science discovers a method of communication that is not limited by the speed of light. Examples of these computations include interactive applications and human-in-the-loop simulations. Distributed simulations that require a lower bound timestamp computation (e.g., [7]) will also be disadvantaged since this relies on a reduction over the timestamps of all hosts and all transient (in-flight) messages. Latency will limit the advancement of simulated time and, hence, of the entire simulation.

(2) Latency-Insensitive Computations. In these computations, latency costs are simply an extremely small percentage of the total performance picture. They may simply be loosely coupled with infrequent synchronization requirements, such as Monte Carlo simulations, or may have large data transfers such that latency costs are "in the noise".

(3) Latency-Tolerant Computations. In these computations, many traditional techniques could be applied to cope with the relative latency increases. These techniques are also used in such areas as hardware processor design, albeit on a very different scale and granularity. They include pipelining, caching, parameter estimation, speculative prefetching, speculative execution, and hiding latency with throughput.

Even at the less latency-tolerant end of this spectrum, some people will want to run applications that are more tightly coupled than the infrastructure can "efficiently" support, simply because it is natural for their application and either they can't or don't want to restructure the application. Other applications may be forced to run on an increasingly loosely coupled grid because it is the only way to amass the resources necessary (processors, aggregate memory, etc.) to even attempt the problem. Even for loosely coupled applications, there will be motivation to improve performance by being able to tolerate the latencies involved in wide-scale grids.

Given that there will be clear motivation to make applications more latency-tolerant, what is the structure and scale of the overheads in these techniques? When will their use produce a net benefit? For each of the techniques mentioned above, a model could be constructed and its performance analyzed. For the purposes of this paper, we will focus on hiding latency with throughput.

It is easy to hide latency with throughput for "pleasantly parallel" applications such as Monte Carlo simulations and other high-throughput paradigms that have minimal data dependencies and synchronization requirements. For other applications, however, parallelism will have to be extracted in sufficient quantities to support latency hiding. This means that not only do sufficient data dependencies and synchronization have to be identified, it must be done with the correct granularity to match the scale of the latencies and bandwidth-delay products.

To more clearly understand the constraints on effective latency hiding in a grid environment, we will use the pipeline model illustrated in Figure 12. This model assumes a communication pipeline of a given latency that must be filled with Work Granules (WG). For the rough, order-of-magnitude analysis presented here, we will assume that all WGs have a constant size WGsize. Work granules propagate through the pipeline with bandwidth BW and are consumed by a processor where each work granule requires some amount of processing. In parallel and distributed computing, the communication-to-computation ratio is an important concept that captures the notion of how much processing must be done compared to communication. In this model, the WG communication time should equal the processing time; that is to say, the ratio should be unity. This implies that the processing rate PR should equal the bandwidth BW. If PR > BW, then data is processed faster than it arrives and the processor may become idle. If PR < BW, then data is arriving faster than it can be processed and the network may become idle. In either case, more network capacity or processing capacity (or multiple processors) could be used to achieve more balanced rates. Again, for this rough, order-of-magnitude analysis, we will assume that PR = BW.

Figure 12. Pipeline model.

We now look at the extrapolated RTT latency and BW data shown in Figure 13, which was used to produce the bandwidth-delay products shown in Figure 10. Three diagonal lines indicate the bandwidth-delay products or "pipe capacities" of 1 Kbit, 100 Kbit, and 10 Mbit. On this scatterplot, we can define an "operating region" that is bounded by bandwidths of 0.25 to 500 Mbits/s, RTT latencies of 0.1 to 200 milliseconds, and bandwidth-delay products of 50 Kbit to 5 Mbit. Within this operating region for the bandwidth-delay product, a given WGsize will determine the number of threads of computation that would be required to fill the pipe. This is illustrated in Figure 14 for WGsize from 1 Kbit to 10 Mbit. The operating region here is bounded by 1 to 100 threads, which is not an unreasonable assumption for the number of concurrent threads that could be extracted from a computation.

Within this operating region for bandwidth, WGsize, the size of a given work granule, will determine WGtime, the communication time. This is illustrated in Figure 15. We note that WGtime should be bounded below by the smallest reasonable WGtime (and work granule processing time) but is not really bounded from above. Note that anything above half of the RTT latency is actually filling the pipe.

Figure 13. Extrapolated RTT vs. Bandwidth with operating region.

Figure 14. Number of threads vs. Bandwidth-Delay Product with operating region.

Figure 15. WG time vs. Bandwidth with operating region.

This analysis reinforces several important understandings. With increasing bandwidth-delay products, larger WGs will be needed to avoid requiring more threads to keep the pipe full. With increasing bandwidths, larger WGs will be needed to prevent the associated work granule overheads from becoming a significant performance issue. The larger issue, however, is whether a significant number of computations can operate in this region successfully. Can programming models be devised that easily map onto such loosely coupled, data-driven execution models? Clearly the answers to these questions depend on several issues that were elided here:

• Communication-to-Computation Ratio. More network or processing capacity typically cannot be added, so how balanced can the typical computation be made?

• WGsize. Typical computations will have a set of different WG sizes, or a statistical distribution of sizes, rather than a constant size. What would the mean or median WGsize be?

• Synchronization/Context Switching/Scheduling Overheads. Clearly WGsize and WGtime must not be so small that these overheads become excessive. What is too excessive?

These questions will not have simple answers. Nonetheless, the need to run large computations on a grid, or simply to run a more tightly coupled application on a more loosely coupled grid, will motivate further investigation into such execution models and also appropriate programming models that have a straightforward mapping onto them.
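As a rough illustration of the operating-region arithmetic in the pipeline model above, the following sketch computes the pipe capacity, the number of in-flight work granules (threads) needed to keep the pipe full, and the work-granule communication time, assuming PR = BW as in the text. The helper names and the example bandwidth, RTT, and WGsize are illustrative choices within the stated operating region.

```python
import math

def pipe_capacity_kbits(bw_mbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product, i.e., the number of Kbits the pipe can hold (Mbit/s x ms == Kbits)."""
    return bw_mbps * rtt_ms

def threads_to_fill_pipe(bw_mbps: float, rtt_ms: float, wg_size_kbits: float) -> int:
    """Number of in-flight work granules needed to hide the latency, assuming PR = BW."""
    return max(1, math.ceil(pipe_capacity_kbits(bw_mbps, rtt_ms) / wg_size_kbits))

def wg_time_ms(bw_mbps: float, wg_size_kbits: float) -> float:
    """Communication time of one work granule."""
    return wg_size_kbits / bw_mbps

# Example within the operating region described above: 100 Mbit/s, 50 ms RTT, 100 Kbit granules.
bw, rtt, wg = 100.0, 50.0, 100.0
print(pipe_capacity_kbits(bw, rtt))      # 5000 Kbits of pipe capacity
print(threads_to_fill_pipe(bw, rtt, wg)) # 50 threads to keep the pipe full
print(wg_time_ms(bw, wg))                # 1.0 ms communication time per granule
```

Doubling WGsize halves the required thread count, which is the trade-off between granule size and concurrency captured by Figure 14.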
6. Discussion

We have presented a data set of global grid network performance collected by the Gloperf tool. To establish a lower bound on the propagation delays involved, we estimated the propagation distances by the arc distances between hosts based on latitude and longitude. Using a model of the primary factors driving this observed performance and assumptions about expected trends in processor and network performance, we extrapolated to estimate the communication performance of a global grid in ten years' time. Due to simple propagation delays, communication pipes are clearly getting fatter but not commensurately shorter. Hence, by conservative estimates, bandwidth-delay products in a global environment will rise by a median factor of 5.2. Using these projections, we used a simple pipeline model to investigate the constraints on hiding latency with throughput. Under reasonable conditions, there is clearly a large operating region in which computations could hide latency with throughput.

The estimates, assumptions, and results presented here for the projected network performance are certainly subject to much argument and interpretation. There is a large difference between the performance of the best possible network and the performance of the typically deployed network infrastructure. Gloperf measured currently deployed hosts and networks and the available capacity at the times the measurements were made. The shape of these measurements will change not only due to the fastest available technology that is deployed but also due to computing practices, e.g., the competing network load and the use of ubiquitous, low-performance embedded and personal devices. A more accurate latency model and associated interpretation than the one used here is certainly possible, e.g., [22]. Work reported by Martin et al. [13] indicated that applications may be more sensitive to overhead than to bandwidth or latency, but these applications were run on a co-located cluster. In widely distributed environments, the use of arc distances as a lower bound for true propagation distances is probably the most significant source of error. More accurate distance information and hop counts would require traceroutes between each host pair coupled with GPS-enabled routers and NICs that could accurately report their location down to meters. Tools such as GTrace [2] and NetGeo [3] certainly go a long way in this direction. Gloperf had the advantage, however, of being deployed on all hosts involved and collecting data between many host pairs. Future grid systems might very well deploy a combined type of capability.

Methods for tolerating the heterogeneous latencies and bandwidths in a grid are clearly called for. Techniques such as pipelining, caching, parameter estimation, speculative prefetching, speculative execution, and hiding latency with throughput are all candidates. The outstanding issue is how these techniques can be effectively applied. At least for latency hiding, an operating region exists that is bounded by network bandwidth, latency, work granularity, synchronization requirements, and context switching/scheduling overhead. We note that hardware support such as in the Tera MT-1 could be used to reduce context switching and scheduling overhead. The full/empty bit on a single word of memory could be used to signal the arrival of a work granule.
With regard to network utilization, communication schedules could also be used to manage network demand, as in HPF environments [1]. More importantly, however, how can enough parallelism with the right granularity be identified in a computation? This identification could be done by hand, but that can be laborious and is not practical for large, existing codes. Automatic parallelization has had only moderate success, and only for small-scale parallelism. This would indicate that automatic parallelization would not produce the granularity needed to match the grid infrastructure. This argues that identification by design could ultimately be the best avenue. That is to say, an asynchronous programming model or coordination paradigm could facilitate the extraction of enough work with the right granularity.

A relevant example in this area is the POOMA C++ class library with the SMARTS runtime system [20]. The POOMA Array and Interval data types denote possible data-parallel operations. By overloading the assignment operator and passing in expression templates, the SMARTS runtime can manage Iterates in a clustered SMP environment that are scheduled when all necessary data objects are available, thereby increasing latency tolerance. The important issue here is whether such approaches can be generalized to a heterogeneous grid environment with the operating regions identified in the previous section. This must be done judiciously since any increasing "distance" from current computing practices would tend to increase the barrier to acceptance. Nonetheless, the motivation for large grid computations will increase the motivation for improved programming tools. In ten years' time, computing will look very different, as will the landscape before us.
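As a generic illustration (not the POOMA/SMARTS API, and not an implementation from the paper) of the coarse-grain, data-driven style of latency hiding discussed above, the sketch below keeps several work granules in flight so that the communication delay for one granule overlaps computation on others. All function names, delays, and sizes are hypothetical.

```python
import concurrent.futures
import time

def fetch_granule(i):
    """Stand-in for fetching one work granule over a high-latency link."""
    time.sleep(0.05)       # simulated 50 ms round trip
    return bytes(1024)     # simulated 1 Kbyte granule

def process_granule(data):
    """Stand-in for the per-granule computation, run as each granule arrives."""
    return sum(data)

def run(num_granules, in_flight):
    """Keep `in_flight` fetches outstanding so communication overlaps computation."""
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=in_flight) as pool:
        futures = [pool.submit(fetch_granule, i) for i in range(num_granules)]
        for fut in concurrent.futures.as_completed(futures):
            results.append(process_granule(fut.result()))
    return results

if __name__ == "__main__":
    start = time.time()
    run(100, in_flight=20)   # ~20 granules in flight hides most of the 50 ms latency
    print(f"elapsed: {time.time() - start:.2f} s")
```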
References

[1] S. Benkner, P. Mehrotra, J. V. Rosendale, and H. Zima. High-level management of communication schedules in HPF-like languages. International Conference on Supercomputing, 1998.
[2] CAIDA. GTrace - a graphical traceroute. http://www.caida.org/tools/visualization/gtrace, 2000.
[3] CAIDA. NetGeo - the Internet geographic database. http://www.caida.org/tools/utilities/netgeo/, 2000.
[4] Cooperative Association for Internet Data Analysis. Web site. http://www.caida.org, 2000.
[5] S. Fitzgerald, I. Foster, C. Kesselman, G. von Laszewski, W. Smith, and S. Tuecke. A directory service for configuring high-performance distributed computations. In Proceedings 6th IEEE Symp. on High Performance Distributed Computing, pages 365-375, 1997.
[6] I. Foster and C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1998.
[7] R. Fujimoto. Parallel and Distributed Simulation Systems. Wiley, 2000.
[8] Grid Forum. Web site. http://www.gridforum.org, 2000.
[9] R. Jones. Netperf. http://www.netperf.org/netperf/NetperfPage.html, 1999.
[10] J. Kadambi, I. Crayford, and M. Kalkunte. Gigabit Ethernet: Migrating to High-Bandwidth LANs. Prentice Hall, 1998.
[11] C. Lee, R. Wolski, J. Stepanek, C. Kesselman, and I. Foster. A network performance tool for grid environments. Supercomputing '99, 1999.
[12] J. Markoff. Chip progress may soon be hitting barrier. The New York Times, October 9, 1999.
[13] R. Martin, A. Vahdat, D. Culler, and T. Anderson. Effects of communication latency, overhead, and bandwidth in a cluster architecture. 24th ISCA, June 1997.
[14] National Laboratory for Applied Network Research. Web site. http://www.nlanr.net, 2000.
[15] Next Generation Internet Initiative. Web site. http://www.ngi.gov, 2000.
[16] NLANR. Active Measurement Program. http://watt.nlanr.net/AMP, 2000.
[17] P. A. Packan. Pushing the limits. Science, Vol. 285, September 24, 1999.
[18] L. Peterson and B. Davie. Computer Networks: A Systems Approach, 2nd Edition. Morgan Kaufmann, 2000.
[19] The Globus Team. The Globus Metacomputing Project. http://www.globus.org, 1998.
[20] S. Vajracharya, S. Karmesin, P. Beckman, et al. SMARTS: Exploiting temporal locality and parallelism through vertical execution. International Conference on Supercomputing, 1999.
[21] J. Walrand and P. Varaiya. High-Performance Communication Networks. Morgan Kaufmann, 1996.
[22] R. Wang, A. Krishnamurthy, R. Martin, T. Anderson, and D. Culler. Modeling communication pipeline latency. SIGMETRICS, 1998.
[23] R. Wolski, N. Spring, and H. Hayes. The Network Weather Service: A distributed resource performance forecasting service for metacomputing. Future Generation Computing Systems, 1998. Available from http://www.cs.ucsd.edu/users/rich/papers/nws-arch.ps.
Craig Lee is currently the manager of the High Performance Computing Section at The Aerospace Corporation. He received his Ph.D. in Computer Science from the University of California at Irvine in 1988 and has worked in the area of parallel and distributed computing for sixteen years, building applications with a strong focus on experimental languages, tools, and compute environments. Dr. Lee is co-chair of the Advanced Programming Models Working Group of the Global Grid Forum.

James Stepanek is currently a Member of the Technical Staff in the Computer Systems Research Department of The Aerospace Corporation. He received a B.S. in Computer Science from Harvey Mudd College in 1994 and is now working towards a Ph.D. in Computer Science at the University of California at Los Angeles. James has a background in network management and high-speed networks and, in addition to network measurement, currently pursues research interests in wireless and satellite networks.