Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995

Modeling Load Imbalance and Fuzzy Barriers for Scalable Shared-Memory Multiprocessors

Alexandre E. Eichenberger, Advanced Computer Architecture Laboratory, EECS Department, University of Michigan, Ann Arbor, MI 48109-2122

Santosh G. Abraham, Hewlett Packard Laboratories, 1501 Page Mill Road, Palo Alto, CA 94304

*This research was supported in part by ONR grant N00014-93-1-0163.

Abstract

We propose an analytical model that quantifies the overall execution time of a parallel region in the presence of non-deterministic load imbalance introduced by network contention and by the random replacement policy in processor caches. We present a novel model that evaluates the expected hit ratio and variance introduced by a cache accessed with a cyclic access stream. We also model the performance improvement of fuzzy barriers, where the synchronization between processors at the end of a parallel region is relaxed. Experiments on a 64-processor KSR system, which has random first-level caches, confirm the general nature of the analytic results.

1 Introduction

Load imbalance and synchronization costs can have a significant influence on the execution of parallel programs. In most parallel scientific and engineering applications, processors repetitively update large data structures in parallel and synchronize themselves with synchronization barriers. Therefore, the overall execution time of these parallel applications is significantly affected by the load imbalance present between synchronization points. A common source of load imbalance is systemic load imbalance, where the workload is unevenly partitioned among the processors, as a result of which certain processors consistently arrive late at a synchronization point. Systemic imbalance arises when some processors are assigned a larger share of the overall computation and/or communication requirements. Systemic imbalance can be handled effectively by static partitioning of the workload so that each processor is assigned an equitable workload.

Another important source of load imbalance is non-deterministic imbalance, where processors fail to reach a synchronization point simultaneously but typically the processor arriving last changes on each iteration. Non-deterministic load imbalance is generated by several factors: the workload associated with a processor may change from cycle to cycle; the interprocessor communication may incur random delays due to contention; and there may be contention for hardware or software resources. On uniform-memory-access machines such as Cray and Convex multiprocessors, non-deterministic load imbalance can be reduced by dynamic scheduling policies which redistribute the workload to idle processors. This approach is less effective on scalable shared-memory parallel machines, which have non-uniform memory access times. In such scalable systems, redistribution of the workload is associated with high data movement costs. As a result, these systems are more sensitive to non-deterministic load imbalance. While the systemic load imbalance is mostly determined by the load imbalance of an application, the non-deterministic load imbalance also depends on architectural characteristics of a parallel machine, i.e., the frequency and variation of random delays due to memory or communication contention. It is therefore important to develop an understanding of the impact of non-deterministic load imbalance while designing a scalable shared-memory machine.


In this paper, we develop an analytical model of the execution time of a parallel region in the presence of non-deterministic load imbalance. The execution time is modeled as three parts: computation time, memory reference time, and communication time. The processes do not execute in isolation, however: they are assumed to synchronize at the boundaries of the parallel region. Therefore, the overall execution time of a parallel region also takes into account the idle time at the synchronization barriers, where each process idles until all processes reach the barrier.

The first contribution of this paper is the performance analysis of caches with random replacement policy. Whereas random replacement has been previously investigated from the uniprocessor perspective [1], we analyze these caches in a multiprocessor system. We demonstrate that the load imbalance is exacerbated by caches with random replacement policy because the overall execution time may be governed by the processor making the worst replacement decisions. We analyze cache performance on cyclic access streams because these access streams are common in data-parallel programs that access large data structures. We characterize these access streams as a function of the set size and access stream length. We present the first algorithm that constructs the Markov chains describing the behavior of a cache with random replacement for any cyclic access stream and for any degree of set associativity. Although the chains grow exponentially in the access stream length and degree of set associativity, we determine analytic expressions for both the mean and the variance of the cache hit ratio for two-way set-associative caches. This work extends the results in [1] by introducing a new closed-form solution for the two-way set-associative cache with random replacement and by investigating the variance of its performance.

The second contribution is the performance analysis of fuzzy barriers, a technique that efficiently tolerates non-deterministic load imbalance without requiring expensive load redistribution. Fuzzy barriers, proposed by Gupta [2], achieve this goal by executing independent operations, called slack, while waiting for the synchronization. Since there are programming as well as other costs associated with increasing the number of independent computations, it is important to be able to quantify the reduction in idle time due to the introduction of slack in fuzzy barriers. To our knowledge, no analytic model for the performance improvement due to fuzzy barriers has been presented. In a two-processor system, we derive analytically the reduction in idle time due to the buffering provided by independent operations, but the general case is not tractable. We carried out a large number of simulations and validated an empirical function that is similar in form to the analytic function. According to this function, the idle time is approximately proportional to the variance of the thread execution time and inversely proportional to the amount of slack.

This paper is organized as follows. First, we present the performance model of caches with random replacement policy in Section 3 and the communication performance model in Section 4. We develop the overall execution time of a parallel region in the presence of non-deterministic load imbalance in Section 5. We determine the execution time improvements due to fuzzy barriers in Section 6. We present measurements that confirm the general nature of the analytic results using experimental runs on the KSR multiprocessor system, which has random first-level caches, in Section 7. We conclude in Section 8.

2 Related Work

Smith and Goodman [1] have compared the behavior of several cache replacement policies. They provide a closed-form solution for the expected hit ratio of caches with random replacement accessed by cyclic access streams exceeding the set associativity by one cache line. Recently, Schlansker et al. [3] have investigated the design of placement-insensitive caches and studied the random replacement policy. Both studies recommend this replacement policy, but focus on the average random replacement performance and not on its variance.

The sources of execution time variance have been studied in several contexts. Adve and Vernon [4] quantified the fluctuation of parallel execution times due to random delays and non-deterministic processing requirements. Dubois and Briggs [5] investigated the performance fluctuation generated by memory contention and obtained an analytical formula describing the expected number of cycles and its variance for memory references in tightly coupled systems. Sarkar [6] provided a framework to estimate the execution time and its variance based on the program's internal structure and control dependence graph.

The effects of load imbalance are investigated in several articles. Kruskal and Weiss [7] investigated the total execution time required to complete k tasks for various distributions. Madala and Sinclair [8] quantified the performance of parallel algorithms with regular structures and varying task execution times. Durand et al. studied the impact of memory contention on NUMA parallel machines and provided experimental measurements [9].


Simulation results show that the idle time generated by partial synchronization barriers [10] [11] is significantly lower than that generated by normal synchronization barriers. Greenbaum [10] showed that near-optimal idle time is achievable through Boundary Synchronization, a relaxed synchronization scheme where boundary computations are executed first within each thread. Gupta [2] investigated fuzzy barriers, where a region of statements is executed while waiting for the synchronization. Techniques that detect and increase the number of independent operations are presented in [2] [12]. These papers bounded the performance of their weak-synchronization schemes by the performance of strong and no synchronization respectively for the upper and the lower bound, but do not propose tighter bounds.

3 Performance of Caches with Random Replacement

The random replacement policy [16] has been implemented in several caches. The advantages of this technique are its simplicity and its low cost. Another advantage of random replacement is that it does not exhibit the same performance discontinuities as LRU for workloads differing slightly in size, as observed by Smith and Goodman [1]. Schlansker et al. [3] investigated large secondary caches with random replacement and concluded that they perform better than LRU caches, especially with high fill ratios.

Cyclic access streams have been used in previous studies of caches [1] [3] and are common in large data-parallel applications [3], where operations are repetitively executed on large data structures. Cyclic access streams can be viewed as a worst case of temporal sequences and illustrate, as such, an important aspect of cache performance breakdown. Thus cyclic access streams are not intended to be representative of average cache performance, but represent an important component of cache behavior.

3.1 Cache Model

We propose a model that quantifies the expected hit ratio and its variance for a cyclic access stream in an n-way set-associative cache with random replacement, described by the following three parameters: the set associativity n of the cache, the length l of the cyclic access stream, and the number of repetitions r of the access stream.

Assuming an initially empty cache, we can model the behavior of one cache set and determine its content and its transitions (hits or misses) for each of the lines accessed in a cyclic access stream. Figure 1 A illustrates the behavior of a two-way set-associative cache accessed twice by the lines 'ABC'. After the first three compulsory misses, the cache contains either the line tuples 'AC', 'BC', or 'CX'. When the line 'A' is accessed for the second time, the line tuples 'BC' and 'CX' result in replacement misses. However, the line tuple 'AC' results in a hit. The two remaining accesses proceed similarly.

Figure 1: State transitions for a cyclic access stream of length three (n = 2, l = 3, and r = 2): A) state transitions, B) Markov chain, C) cyclic representation

The state transitions present a regular pattern, illustrated by the Markov chain of Figure 1 B. State 1 represents the transient effect, where one of the initial lines ('X') remains in the set. State 2 corresponds to the state always containing the two most recently accessed lines. Finally, State 3 corresponds to the state containing the current line and the least recently accessed line. For this access stream, every deterministic replacement policy corresponds to a cycle in the Markov chain of Figure 1 B. For example, the LRU and OPT replacement policies correspond to the cycles {2} and {2,3} respectively.


We now introduce an algorithm that builds the Markov chain associated with the steady-state behavior (r → ∞) of a cache with random replacement policy accessed by a cyclic access stream. The idea behind this algorithm is to focus on when the lines were last accessed: in a cyclic access stream of length l, the line ages span from 0 to l − 1. We represent the cache state by l adjacent boxes with increasing age from left to right. Furthermore, a cross in the i-th box means that the cache currently holds the line of age i. Figure 1 C illustrates this representation for the steady states State 2 and State 3. With this representation, we can formulate the cache invariants as follows:

I1 there are exactly n crosses in an n-way set-associative cache,

I2 there is a cross in the leftmost box, as the most recently accessed line must be in the cache.

All states in the steady-state Markov chain must respect these two invariants. Therefore, there are exactly $\binom{l-1}{n-1}$ different states. Furthermore, the transition rules between states, corresponding to a new line access, can be formulated as follows:

R1 the rightmost box is moved to the leftmost position (rotates to the right), as the oldest line in the stream becomes the most recent one,

R2 after rotating the boxes, the invariant I2 may be violated. If so, the invariant is restored by moving one of the crosses to the leftmost box. This corresponds to a cache miss, and the choice of the cross is guided by the cache replacement policy. With the random replacement policy, we must investigate all possible replacements.

Figure 2: Steady-state transitions for a cyclic access stream of length four (n = 2, l = 4, and r → ∞): A) state transitions, B) Markov chain

We construct the steady-state Markov chain with the following algorithm: we compute all states that respect the invariants I1 and I2, and we compute the transitions from each state by following the transition rules R1 and R2. Furthermore, all transitions originating from a single state are equally likely. Figure 2 A illustrates this algorithm for an access stream of length 4 in a two-way set-associative cache.

The transient effects of the cache correspond to states where some initial lines are not part of the cyclic access stream. We introduce a new transition rule, corresponding to the replacement of one of these unrelated lines:

R3 after rotating the boxes, if the leftmost box is empty and the number of lines that are part of the cyclic access stream is smaller than n, add a cross in the leftmost box.

We construct the transient-state Markov chain with the following algorithm: we build the steady-state Markov chains for set associativities one through n; then we connect these Markov chains using the transition rule R3. Furthermore, all transitions originating from a single state are equally likely.

3.2 Hit Ratio and Variance

The first step in determining the cache performance is to obtain the probability distribution of hits. Once this distribution has been determined, the expected hit ratio and its variance are easily computed.

3.2.1 Steady-State

The Markov chains of the steady state are ergodic¹, thus allowing us to compute the state distribution π by solving the following equations:

$\pi = \pi P$ and $\sum_j \pi_j = 1$  (1)

where P is the Markov chain transition probability matrix. The hit ratio and variance associated with a Markov chain are obtained as follows:

$E[\text{hit ratio}] = \sum_{i \in \text{hit states}} \pi_i$, $\quad Var[\text{hit ratio}] = E[(\text{hit ratio})^2] - E^2[\text{hit ratio}]$  (2)

¹A Markov chain is ergodic if it is irreducible, aperiodic, and positive recurrent for all states.
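To make the construction concrete, the sketch below is an illustrative reimplementation (not the authors' code) of the steady-state chain: it enumerates the states satisfying invariants I1 and I2, applies rules R1 and R2 to build the transition matrix, and then solves Equation (1) numerically to obtain the hit statistics of Equation (2). The age-set state encoding, the function name, and the use of power iteration are our own choices.

```python
from itertools import combinations

def steady_state_hit_stats(n, l, iters=2000):
    """Expected hit ratio and per-access variance for one n-way set with random
    replacement under a cyclic access stream of length l (l >= n)."""
    # A state is the set of ages (0..l-1) of the lines currently cached.
    # Invariant I1: exactly n ages; invariant I2: age 0 is always present.
    states = [frozenset((0,) + rest) for rest in combinations(range(1, l), n - 1)]
    index = {s: i for i, s in enumerate(states)}

    # Transition matrix built from rule R1 (rotate ages) and rule R2 (random eviction).
    P = [[0.0] * len(states) for _ in states]
    is_hit = [False] * len(states)
    for s in states:
        i = index[s]
        rotated = [(a + 1) % l for a in s]
        if (l - 1) in s:                     # next access is the line of age l-1: hit
            is_hit[i] = True
            P[i][index[frozenset(rotated)]] += 1.0
        else:                                # miss: evict one cached line at random
            for victim in rotated:
                succ = frozenset(a for a in rotated if a != victim) | {0}
                P[i][index[succ]] += 1.0 / n

    # Solve pi = pi * P of Equation (1) by power iteration.
    pi = [1.0 / len(states)] * len(states)
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(len(states)))
              for j in range(len(states))]

    p_hit = sum(p for p, h in zip(pi, is_hit) if h)
    return p_hit, p_hit * (1.0 - p_hit)      # Equation (2) for a 0/1 hit indicator

# For n = 2 this reproduces the closed form derived below: l = 3 gives 1/3, l = 4 gives 1/7.
print(steady_state_hit_stats(2, 3))
print(steady_state_hit_stats(2, 4))
```

The per-access variance returned here treats each access as an independent Bernoulli trial, which is how Equation (2) collapses for a 0/1 hit indicator and matches the closed form derived next.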

We solve Equations (1) and (2) for a two-way set-associative cache accessed by a cyclic access stream of length l ≥ 3. The corresponding Markov chain is illustrated in Figure 3. Since a hit occurs only in State l−1, the expected hit ratio is equal to the probability of State l−1, $\pi_{l-1}$.

Figure 3: Markov chain of the steady-state transitions for a cyclic access stream of length l (n = 2, r → ∞)

The probability distribution is defined as follows:

$\pi_j = \frac{1}{2}\,\pi_{j-1}$ for $j = 2, \ldots, l-1$  (3)

Solving (1) with (3), we obtain the following state distribution:

$\pi_j = \frac{2^{\,l-1-j}}{2^{\,l-1} - 1}$ for $j = 1, \ldots, l-1$  (4)

It results in the following steady-state hit ratio and variance:

$E[\text{hit ratio}] = \frac{1}{2^{\,l-1} - 1}$, $\quad Var[\text{hit ratio}] = \frac{2^{\,l-1} - 2}{(2^{\,l-1} - 1)^2}$  (5)

Equation (5) is valid for any cyclic access stream of length l ≥ 3 in a two-way set-associative cache with random replacement. For shorter streams, the expected hit ratio is one and its variance zero.
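As a quick cross-check of the reconstructed closed form, the following sketch (ours, not part of the original paper) evaluates Equation (5) and compares it with a direct Monte Carlo simulation of a two-entry set under random replacement; the repetition count and seed are arbitrary.

```python
import random

def hit_ratio_closed_form(l):
    """Equation (5): expected hit ratio and variance, two-way set, cyclic stream of length l >= 3."""
    e = 1.0 / (2 ** (l - 1) - 1)
    v = (2 ** (l - 1) - 2) / float((2 ** (l - 1) - 1) ** 2)
    return e, v

def hit_ratio_simulated(l, repetitions=100000, seed=0):
    """Replay the cyclic stream 0, 1, ..., l-1 against a 2-entry set with random replacement."""
    rng = random.Random(seed)
    cache, hits, accesses = [], 0, 0
    for _ in range(repetitions):
        for line in range(l):
            accesses += 1
            if line in cache:
                hits += 1
            elif len(cache) < 2:
                cache.append(line)               # compulsory miss, set not full yet
            else:
                cache[rng.randrange(2)] = line   # replacement miss, random victim
    return hits / accesses

for l in (3, 4, 5):
    print(l, hit_ratio_closed_form(l)[0], hit_ratio_simulated(l))
```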

3.3 Performance Distribution

We can now estimate the performance of a cache with random replacement for a single processor within a given parallel region, assuming that cache performance is dominated by cyclic memory reference streams. First, we determine the number of cyclic memory references $M_r$ generated by one processor within this parallel region. We model this number as a function of the number of processors $p$ and the data size $d$: $M_r(p, d)$. Then, we determine the number of competing lines for each set, the access stream length $l$, which is the overall stream length divided by the product of the words per line and the number of sets in the cache. Thus, even when the overall stream length is large, the access stream length for each set can be quite small. Given $l$, we compute the expected hit ratio and its variance using Equation (5) for two-way set-associative caches with random replacement. Using the properties of the Central Limit Theorem², we assume that for a sufficient number of memory references the cache performance is normally distributed. A similar assumption was used in [4] [10] [5], and our measurements on the KSR1 corroborate this assumption. Assuming that the behavior of each set is independent and identically distributed, we obtain the following expected time and variance for the overall execution time spent in the memory system:

$E[\text{cache}] = M_r(p, d)\,(t_m - t_\Delta\, E[\text{hit ratio}])$, $\quad Var[\text{cache}] = M_r(p, d)\, t_\Delta^2\, Var[\text{hit ratio}]$  (6)

where $M_r(p, d)$ is the number of memory references within a parallel region, $d$ is the size of the data within a parallel region, $p$ is the number of processors within a parallel region, and $t_h$, $t_m$, and $t_\Delta$ are the cache hit penalty, the miss penalty, and their difference ($t_\Delta = t_m - t_h$). The variance of the cache performance is thus proportional to the square of the difference between the hit and miss penalties and proportional to the number of memory references.

²The Central Limit Theorem states that the sum of a sufficiently large number of random variables approaches a normal distribution, no matter what the form of their density function is.
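A minimal sketch of Equation (6), assuming the closed-form hit statistics of Equation (5) for a two-way set; the hit and miss penalties below are placeholders chosen only so that their difference matches the 47-cycle figure reported for the KSR1 later in the paper.

```python
def memory_time_stats(M_r, l, t_hit, t_miss):
    """Equation (6): mean and variance of the time spent in the memory system for
    M_r cyclic references per processor with access stream length l per set."""
    t_delta = t_miss - t_hit
    e_hit = 1.0 / (2 ** (l - 1) - 1)                              # Equation (5)
    var_hit = (2 ** (l - 1) - 2) / float((2 ** (l - 1) - 1) ** 2)
    e_cache = M_r * (t_miss - t_delta * e_hit)                    # E[cache]
    var_cache = M_r * t_delta ** 2 * var_hit                      # Var[cache]
    return e_cache, var_cache

# Hypothetical penalties (cycles): 2-cycle hit, 49-cycle miss (difference of 47 cycles).
print(memory_time_stats(1920, 3, 2.0, 49.0))
```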

4 Communication Performance

In this section, we attempt to quantify the influence of non-determinism introduced by communication delays. Our approach is based on analyzing a parallel region to extract its communication behavior and predicting its performance distribution on a given parallel machine. This method allows us to get a clearer understanding of whether the random delays generated by a given data size $d$, number of processors $p$, and parallel machine affect the overall performance of that parallel region significantly.

The communication behavior of a parallel region is characterized by the number of communication events $E_{events}$ generated by each processor within its parallel region. The primary factors determining $E_{events}$ are the communication pattern and the data partitioning of the parallel region. For example, an algorithm using near-neighbor communication generates communication proportional to the size of the partition boundary, which depends on whether the data of dimension size $d$ is partitioned along one or two dimensions. Other architectural and hardware factors, such as network topology and automatic update [15], also have a secondary effect on $E_{events}$.

Understanding how $E_{events}$ evolves for an increasing number of processors and larger data sets is also crucial in estimating the performance of a parallel region.


4.1 Performance Distribution

Assuming that the delays $X_1, \ldots, X_{E_{events}}$ are independent, identically, normally distributed random variables with parameters $\mu_{event}$ and $\sigma^2_{event}$, where $E_{events}$ is the number of communication events for one processor within a parallel region, $\mu_{event}$ is the expected duration of a communication event, and $\sigma^2_{event}$ is its variance, the sum of these normal random variables, $Y$, is a normal random variable with parameters $E[\text{comm}]$ and $Var[\text{comm}]$ defined as follows:

$E[\text{comm}] = E_{events}\,\mu_{event}$, $\quad Var[\text{comm}] = E_{events}\,\sigma^2_{event}$  (7)

5 Performance of a Parallel Region

Given the cache normal random process with parameters defined in Equation (6) and the communication normal random process with parameters defined in Equation (7), the cumulative execution time of processor $i$ is itself a normal random variable $X_i$ with parameters

$\mu_p = E[\text{cache}] + E[\text{comm}] + E[\text{comp}]$, $\quad \sigma_p^2 = Var[\text{cache}] + Var[\text{comm}]$  (8)

where $E[\text{comp}]$ is the expected computation time. The results of this equation correspond to the expected execution time of a single processor and its variance. The overall performance of this parallel region for $p$ processors corresponds to the expected maximum $\max(X_1, \ldots, X_p)$. Based on the asymptotic extremal distribution [17], the extreme value can be asymptotically approximated for independent identically distributed normal distributions with parameters $\mu_p$ and $\sigma_p$ as follows [17]:

$E[\text{idle time}] \approx \sigma_p \left( \sqrt{2 \log p} - \frac{\log\log p + \log 4\pi}{2\sqrt{2\log p}} \right)$  (9)

The first factor of Equation (9), $\sigma_p$, is highly machine and parallel region dependent. The second factor in Equation (9), however, is machine independent, grows as $\sqrt{\log p}$, and quantifies the intuition that the more processors in the system, the more likely one processor is to be significantly slower than the average. This factor is consistent with the estimated total time derived by Kruskal and Weiss [7]. Equation (9) specifically corresponds to the time spent idling in the synchronization barrier due to random replacement and network contention. Finally, the overall performance of a parallel region corresponds to the expected execution time of a single processor plus the expected idle time, defined as follows:

$E[\text{par region}] = \mu_p + E[\text{idle time}]$  (10)

Equations (9) and (10) approximate the influence of random delays without systemic workload imbalance. In the presence of systemic workload imbalance, these equations allow us to quantify which percentage of the measured idle time is generated by random delays and which percentage is due to the systemic workload imbalance.
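The sketch below strings Equations (7)-(10) together; it is our illustration rather than the authors' code, Equation (9) is taken as the standard asymptotic expansion for the maximum of p i.i.d. normal variables with natural logarithms, and all input values are hypothetical.

```python
import math

def expected_idle_time(sigma_p, p):
    """Equation (9): asymptotic expected idle time for p i.i.d. normal processors."""
    z = math.sqrt(2.0 * math.log(p))
    return sigma_p * (z - (math.log(math.log(p)) + math.log(4.0 * math.pi)) / (2.0 * z))

def parallel_region_time(e_cache, var_cache, e_comm, var_comm, e_comp, p):
    """Equations (8) and (10): per-processor mean, standard deviation, and overall region time."""
    mu_p = e_cache + e_comm + e_comp                       # Equation (8), mean
    sigma_p = math.sqrt(var_cache + var_comm)              # Equation (8), standard deviation
    idle = expected_idle_time(sigma_p, p)
    return mu_p, sigma_p, idle, mu_p + idle                # Equation (10)

# Hypothetical inputs; Equation (7) with E_events = 512, mean 20 us and std 17 us per event.
E_events, mu_event, sigma_event = 512, 20e-6, 17e-6
e_comm, var_comm = E_events * mu_event, E_events * sigma_event ** 2
print(parallel_region_time(4e-3, 1e-7, e_comm, var_comm, 1e-2, 64))
```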

6 Performance of a Parallel Region with Fuzzy Barriers

In a fuzzy barrier [2], processors execute independent operations between arriving at a barrier and leaving a barrier. We refer to the execution time of these independent operations as the slack of a synchronization barrier. While this technique does not address the problem of systemic workload imbalance, it can significantly reduce the effect of random delays.

We can gain an intuitive understanding of the benefit of fuzzy barriers by studying a two-processor system with discrete Bernoulli delays. A similar two-processor model was mentioned by Kung [18] in the context of semi-synchronized iterative algorithms but left unsolved. Consider the execution time of a parallel region, delayed by one discrete delay with probability 1/2 and bounded by a slack equivalent to τ discrete delays. By iterating over this parallel region, we see that one of the processors can be up to τ delays faster than the other one without waiting at the barrier. Because the relative difference between the two processors is bounded, and because the delays are discrete, the number of states is finite and equal to 2τ + 1.

The Markov chain associated with this model is shown in Figure 4, where the state label is the amount by which one processor is ahead of the other. The steady-state probability distribution of the Markov chain is uniform. Its expected idle time corresponds to the probability of traversing the starred edges in Figure 4 and is equal to:

$E[\text{idle time}] = \frac{1}{4(\tau + 1/2)}$  (11)

Figure 4: Markov chain for the two-processor Bernoulli-delay model of fuzzy barriers
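The two-processor chain is easy to check by simulation; the sketch below (our illustration, with our own variable names) tracks how far one processor runs ahead of the other, clamps the lead at ±τ, counts an idle delay unit whenever the bound is hit, and should converge to the 1/(4(τ + 1/2)) value of Equation (11).

```python
import random

def bernoulli_fuzzy_barrier_idle(tau, iterations=200000, seed=1):
    """Two-processor Bernoulli-delay model: each iteration, each processor is
    delayed by one unit with probability 1/2; a processor may run ahead of the
    other by at most tau iterations before idling at the fuzzy barrier."""
    rng = random.Random(seed)
    lead, idle = 0, 0                        # lead stays in [-tau, tau]
    for _ in range(iterations):
        delay_a = 1 if rng.random() < 0.5 else 0
        delay_b = 1 if rng.random() < 0.5 else 0
        lead += delay_b - delay_a            # A pulls ahead when B is delayed
        if lead > tau:                       # A exhausted its slack: it idles one unit
            idle += 1
            lead = tau
        elif lead < -tau:                    # symmetric case for B
            idle += 1
            lead = -tau
    return idle / iterations

for tau in (0, 1, 2, 4, 8):
    print(tau, bernoulli_fuzzy_barrier_idle(tau), 1.0 / (4.0 * (tau + 0.5)))  # vs Equation (11)
```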

We see that the expected idle time for the two-processor Bernoulli-delay model is inversely proportional to the slack τ, the execution time of the independent operations.

We now consider extending this result to fuzzy barriers with $p$ processors and normally distributed delays with parameters $\mu_p$ and $\sigma_p$. Extending Equation (11) to the normal case is impractical, because the idle time is a function of all previous delays in all processors. Instead, we present an empirical mathematical approximation. By running a large number of simulations for a normal distribution, we find that the expected idle time is linear in logarithm space. Curve fitting this slope with the result of the Bernoulli-delay model of Equation (11) in logarithm space, we obtained the following approximation:

$E[\text{idle time}] \approx 0.6238\, \frac{\sigma_p^2}{\tau + 1/2}\, \log p, \qquad \tau \ge 2\sigma_p$  (12)

where $\sigma_p$ and $\tau$ are respectively the standard deviation of a single-processor execution time and the slack of the fuzzy barrier.
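A direct transcription of Equation (12) as reconstructed above; note that we assume a natural logarithm and treat the 0.6238 constant as tied to that choice, so this is a sketch of the empirical fit rather than a definitive formula.

```python
import math

def fuzzy_barrier_idle_estimate(sigma_p, tau, p):
    """Empirical approximation of Equation (12); only meaningful for tau >= 2 * sigma_p."""
    if tau < 2.0 * sigma_p:
        raise ValueError("Equation (12) is only accurate for tau >= 2 * sigma_p")
    return 0.6238 * sigma_p ** 2 * math.log(p) / (tau + 0.5)

# Example matching the setting of Figure 5: sigma_p = 0.5, p = 10 processors.
for tau in (1, 2, 4, 8, 16):
    print(tau, fuzzy_barrier_idle_estimate(0.5, tau, 10))
```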

The average error between the simulation and the approximation for $\sigma_p = 0.5, 1, 2, 4$ and $\tau = 1, 2, 4, 8, 16, 32, 64$ is less than 12 percent. Figure 5 compares the simulation results and the approximation as computed with Equation (12) for $p = 10$ and $\sigma_p = 0.5$. The average error in expected idle time between the simulation and the approximation is less than seven percent.

Figure 5: Idle time for a fuzzy barrier with τ = 0, 1, 2, 4, 8, 16 for a parallel region with standard deviation σ_p = 0.5

Equation (12) relates the amount of idle time due to random delays with standard deviation $\sigma_p$ when a fuzzy barrier with slack $\tau$ is used. This equation is accurate when $\tau \ge 2\sigma_p$, namely when the probability of having delays shorter than $\tau$ is at least 68 percent. Otherwise, the idle time behaves more like Equation (9). To the best of our knowledge, this result is the first that quantifies the performance of a weak-synchronization scheme. Previous articles [10] [11] bounded the performance of weak synchronization by the performance of strong and no synchronization for the upper and the lower bounds respectively.

7 Measurements

We performed several experiments to validate our models. We ran two synthetic programs to measure the idle time generated by the random replacement policy and by network contention on a 64-processor KSR1. Furthermore, we investigated the benefit of fuzzy barriers on a real application with complex data streams: a sparse solver.

7.1 Cache Generated Idle Time

We measured the steady-state behavior of the two-way cache with random replacement of the KSR1 parallel machine. Each KSR1 cell contains a data and an instruction primary cache, each 256 KB (random replacement, 2-way associative, 64 sets), as well as a 32 MB secondary cache (LRU replacement, 16-way associative, 128 sets). One run consists of accessing the primary cache r times with a cyclic access stream of length l. Our results are averaged over a thousand runs. We measured the expected hit ratio and its variance, and compared these measurements with the analytical results of Equation (5).

Figure 6 illustrates the hit ratio for one to ten competing lines per set. For three competing lines (cyclic access), the hit ratio is around 34 percent, larger than LRU (0 percent) and smaller than OPT (50 percent). Figure 7 illustrates the measured and analytical variance of the hit ratio. The measured variance is larger than the one computed by Equation (5). We believe that this difference is partially due to the fact that we could not turn off the hardware interrupts. These inaccuracies are more significant in Figure 7 than in Figure 6 because the variance is a secondary effect.

Figure 8 illustrates the idle time generated by cyclic access streams of length three and four. These measurements are specific to the KSR1³. They were obtained by measuring a large number of runs on a single node and by using these runs to simulate a system with a large number of processors. The measurements are compared with the results of Equation (9), with the total number of memory references $M_r$ set to 1920 and 2560 for three and four competing lines respectively. The standard deviation used is the one shown in Figure 7. The hit ratio of an individual processor for three competing lines mapped in a single set is 34%. However, the results presented in the graph demonstrate that the effective hit ratio is reduced to 30% for 64 processors, because all processors wait for the processor that made the worst replacement decisions.
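The reported drop from a 34% per-processor hit ratio to an effective 30% at 64 processors can be approximated by folding the idle time of Equation (9) back into an equivalent hit ratio. The sketch below is our back-of-the-envelope reconstruction, using the closed-form statistics for l = 3, M_r = 1920 references, and the 47-cycle hit/miss difference from the footnote; it is not the authors' exact procedure.

```python
import math

def effective_hit_ratio(l, M_r, t_delta, p):
    """Per-processor hit ratio minus the hit-ratio equivalent of the barrier idle
    time predicted by Equation (9) for p processors."""
    e_hit = 1.0 / (2 ** (l - 1) - 1)                              # Equation (5)
    var_hit = (2 ** (l - 1) - 2) / float((2 ** (l - 1) - 1) ** 2)
    sigma_cache = math.sqrt(M_r * t_delta ** 2 * var_hit)         # from Equation (6)
    z = math.sqrt(2.0 * math.log(p))
    idle = sigma_cache * (z - (math.log(math.log(p)) + math.log(4.0 * math.pi)) / (2.0 * z))
    return e_hit - idle / (M_r * t_delta)                         # idle expressed as lost hits

print(effective_hit_ratio(3, 1920, 47.0, 64))   # roughly 0.31, close to the measured 30%
```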

Figure 6: Hit ratio in a two-way set-associative cache (r = 1920)

Figure 7: Variance in a two-way set-associative cache (r = 1920)

Figure 8: Idle time generated by a two-way set-associative cache with three/four competing lines (1920/2560 accesses)

Figure 9: Idle time generated by communication delays for SOR for four different data sizes

³The measurement of the cache-generated idle time depends on the time difference between the hit and miss penalties of the cache. On the KSR1, this difference is 47 cycles (t_Δ = 2.35 µs).


7.2 Communication Generated Idle Time

We implemented a relaxation algorithm (SOR), where each element is averaged with its four neighbors, to measure the idle time generated by communication. The relaxation is performed in parallel with two alternating arrays, thus avoiding additional communication due to race conditions among processes. The two-dimensional data of size (d_x, d_y) is partitioned along the x-dimension, and the number of data elements per processor is kept constant, to avoid interference with the memory subsystem. The resulting communication per processor is 4d_y elements, because each processor acquires one element in write mode (the element to update) and one element in read mode (east or west neighbor element) for each of the 2d_y elements of the overlap region. On the KSR1, the unit of communication is a cache line of 16 words, and therefore the total number of communication events per processor ($E_{events}$) is equal to 4⌈d_y/16⌉. On the KSR1, we estimated the standard deviation of a single communication event ($\sigma_{event}$) by running our program on 56 processors and obtained 17 µs [15].

Figure 9 illustrates the idle time generated by communication delays in the SOR program on the KSR1. We notice that the idle time indeed varies as described in Equation (9). For example, multiplying the data size ($d_y$) by four doubles the idle time spent at a synchronization barrier.
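To illustrate the scaling observed in Figure 9, the sketch below (our illustration; parameter values hypothetical except for the 16-word line and the 17 µs standard deviation quoted above) computes E_events for the 1-D partitioned SOR kernel and feeds the resulting communication variance into Equation (9).

```python
import math

def sor_comm_events(d_y, line_words=16):
    """Communication events per processor: 2*d_y overlap elements, one read and one
    write acquisition each, grouped into cache lines of `line_words` words."""
    return 4 * math.ceil(d_y / line_words)

def sor_comm_idle(d_y, p, sigma_event=17e-6, line_words=16):
    """Barrier idle time from Equation (9), using only the communication variance of Equation (7)."""
    sigma_p = math.sqrt(sor_comm_events(d_y, line_words)) * sigma_event
    z = math.sqrt(2.0 * math.log(p))
    return sigma_p * (z - (math.log(math.log(p)) + math.log(4.0 * math.pi)) / (2.0 * z))

# Quadrupling the partitioned dimension roughly doubles the idle time, as noted above.
for d_y in (256, 1024):
    print(d_y, sor_comm_idle(d_y, 56))
```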

7.3 An Example: a Sparse Solver

The two previous synthetic programs presented a highly regular memory access and communication behavior respectively. These programs were useful to demonstrate the presence of random delays and to validate our models. In this section, we investigate the behavior of a complete parallel application, to demonstrate the usefulness of our model for random delays and fuzzy barriers.

We analyze the behavior of FEM-ATS [13], an application which iteratively solves a system of complex linear equations using a diagonal-preconditioned symmetric gradient method. This application presents a highly irregular memory and communication pattern due to a sparse matrix multiplication and is therefore a good example of a complex application. In this paper, we used an input data size of 20033 complex elements for the FEM-ATS application. An iteration consists of eight vector operations and three synchronization barriers. The data are partitioned along one dimension among processors, which can result in systemic workload imbalance, as some of the elements may require more computation and/or communication depending on the non-zero pattern of the sparse matrix.

Figure 10 illustrates the percentage of time spent in synchronization overhead, which includes the synchronization cost, the idle time due to random delays, and the idle time due to systemic workload imbalance. The upper curve corresponds to the behavior of the original FEM code, the second curve corresponds to the behavior of the improved FEM code, and the three remaining curves plot the additional performance improvements that could be achieved with a slack of 250, 1000, and 5000 µs per relaxed synchronization barrier. In the improved FEM code, the slack accounts for approximately 5 percent of the total execution time and results in an overall performance improvement of 4.8 percent, yielding an overall speedup of 37 for 56 processors.

Figure 10: Overall percentage of idle time in FEM-ATS

8 Conclusion

Load imbalance and synchronization costs can have a significant influence on the execution of parallel programs. While systemic load imbalance can be handled effectively by static partitioning, reducing the effect of non-deterministic load imbalance is more problematic on scalable shared-memory parallel machines, because the dynamic redistribution of the workload results in high data movement costs on machines with non-uniform memory access times.

In this paper, we develop an analytical model of the execution time of a parallel region in the presence of non-deterministic load imbalance.

The overall execution time of a parallel region is modeled using four components: computation time, memory reference time, communication time, and synchronization time. We identify caches with random replacement policy as a source of non-deterministic load imbalance in multiprocessor systems. We present a novel way of representing the steady-state and transient behavior of an n-way set-associative cache with random replacement under a cyclic access stream of length l. For a two-way set-associative cache, we derive expressions for the mean and variance of the hit ratio distribution. We demonstrate, using measurements on the KSR, that the idle time generated by caches with random replacement can be significant.

Fuzzy barriers insert a slack, τ, consisting of independent operations executed while waiting for the synchronization. Though it is an important concept that has been suggested and used by several researchers, to our knowledge the effect of τ on overall idle times has not been previously evaluated. We present an empirical mathematical approximation for the idle time on a p-processor system, directly proportional to the variance of the execution time of a single processor and inversely proportional to τ. This model enables a qualitative understanding of the benefit of introducing slack.

References

[1] J. E. Smith and R. J. Goodman, "Instruction cache replacement policies and organization," IEEE Trans. Computers, vol. C-34, pp. 234-241, March 1985.

[2] R. Gupta, "The fuzzy barrier: A mechanism for high speed synchronization of processors," Proc. Int. Conf. Arch. Support Prog. Lang. & OS, pp. 54-63, 1989.

[3] M. S. Schlansker, R. Shaw, and S. Sivaramakrishnan, "Randomization and associativity in the design of placement-insensitive caches," Technical Report HPL-93-41, HP Laboratories, June 1993.

[4] V. S. Adve and M. K. Vernon, "The influence of random delays on parallel execution times," ACM SIGMETRICS Conf. Meas. Model. Comp. Sys., pp. 61-73, 1993.

[5] M. Dubois and F. A. Briggs, "Performance of synchronized iterative processes in multiprocessor systems," IEEE Trans. Soft. Eng., vol. SE-8, pp. 419-431, July 1982.

[6] V. Sarkar, "Determining average program execution times and their variance," Proc. ACM SIGPLAN Conf. Prog. Lang. Des. & Impl., vol. 24, no. 7, pp. 298-312, 1989.

[7] C. P. Kruskal and A. Weiss, "Allocating independent subtasks on parallel processors," IEEE Trans. Soft. Eng., vol. SE-11, pp. 1001-1016, October 1985.

[8] S. Madala and J. B. Sinclair, "Performance of synchronous parallel algorithms with regular structure," IEEE Trans. Par. & Dist. Sys., vol. 2, pp. 105-116, January 1991.

[9] M. D. Durand, T. Montaut, L. Kervella, and W. Jalby, "Impact of memory contention on dynamic scheduling on NUMA multiprocessors," Proc. Int. Conf. Par. Proc., vol. 1, pp. 258-267, 1993.

[10] A. Greenbaum, "Synchronization costs on multiprocessors," Parallel Computing, vol. 10, pp. 3-14, 1989.

[11] C. S. Chang and R. Nelson, "Bounds on the speedup and efficiency of partial synchronization in parallel processing systems," Research Report RC 16474, IBM, 1991.

[12] R. Gupta, "Loop displacement: An approach for transforming and scheduling loops for parallel execution," Proc. Supercomp. '90, pp. 388-397, 1990.

[13] D. Windheiser, E. Boyd, E. Hao, S. G. Abraham, and E. S. Davidson, "KSR1 multiprocessor: Analysis of latency hiding techniques in a sparse solver," Proc. Int. Par. Proc. Symp., pp. 454-461, April 1993.

[14] E. Rosti, E. Smirni, T. D. Wagner, A. W. Apon, and L. W. Dowdy, "The KSR1: Experimentation and modeling of poststore," ACM SIGMETRICS Conf. Meas. Model. Comp. Sys., 1993.

[15] E. L. Boyd, J.-D. Wellman, S. G. Abraham, and E. S. Davidson, "Evaluating the communication performance of MPPs using synthetic sparse matrix multiplication workloads," Proc. Int. Conf. Supercomp., pp. 240-250, July 1993.

[16] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger, "Evaluation techniques for storage hierarchies," IBM Systems Journal, vol. 9, pp. 78-117, 1970.

[17] H. David, Order Statistics. New York: Wiley, 1981.

[18] H. T. Kung, "Synchronized and asynchronous parallel algorithms for multiprocessors," in Algorithms and Complexity: New Directions and Recent Results (J. Traub, ed.), pp. 153-200, New York: Academic, 1976.
