Appeared in Proceedings of the 1993 International Conference on Supercomputing, Tokyo, Japan, pp. 240-250.

Evaluating the Communication Performance of MPPs Using Synthetic Sparse Matrix Multiplication Workloads

Eric L. Boyd, John-David Wellman, Santosh G. Abraham, and Edward S. Davidson
Advanced Computer Architecture Laboratory
Department of Electrical Engineering and Computer Science
University of Michigan

Abstract

Communication has a dominant impact on the performance of massively parallel processors (MPPs). We propose a methodology to evaluate the internode communication performance of MPPs using a controlled set of synthetic workloads. By generating a range of sparse matrices and measuring the performance of a simple parallel algorithm that repeatedly multiplies a sparse matrix by a dense vector, we can determine the relative performance of different communication workloads. Specifiable communication parameters include the number of nodes, the average amount of communication per node, the degree of sharing among the nodes, and the computation-communication ratio. We describe a general procedure for constructing sparse matrices that have these desired communication and computation parameters, and apply a range of these synthetic workloads to evaluate the hierarchical ring interconnection and cache-only memory architecture (COMA) of the Kendall Square Research KSR1 MPP. This analysis discusses the impact of the KSR1 architecture on communication performance, highlighting the utility and impact of the automatic update feature. It also investigates the impact of system contention on performance, particularly how it causes potential updates to be ignored.

The University of Michigan Center for Parallel Computing is partially funded by NSF grant CDA-92-14296, and this research was supported in part by ONR N00014-93-1-0163.

1. Introduction

Modern high performance computers are increasingly characterized by larger and larger amounts of parallelism, as evidenced in the current generation of massively parallel processors (MPPs). While processor speed is tracking performance improvements in microprocessor technology, and thus is increasing rapidly with each new generation, the improvement in communication latency between the processors in an MPP is lagging behind. For example, the cache miss latency of the current Kendall Square Research KSR1 [1] is over 140 processor cycles, compared to 40 cycles for the earlier-generation BBN TC2000 [2]. Although, to some extent, slow communication can be compensated for through the exploitation of node locality, and latency can be masked by timely communication transactions, the latency and bandwidth of the interconnection network pose the fundamental limit to the scalability of an MPP [13]. The hierarchical ring interconnect of the KSR1, the fat tree network of the Thinking Machines CM-5 [4], and the mesh network of the Intel Paragon [3] are structurally very different and have different characteristics. Since application performance on MPP systems is greatly affected by the communication capabilities of the architecture, it is essential to evaluate the specific communication requirements of target applications on alternative MPP architectures and the ability of each architecture to satisfy the requirements that pertain to it.

Given the wide variety of architectures and programming approaches available on MPPs today, it is difficult to evaluate communication performance efficiently and in a timely fashion. Although many benchmarks for the evaluation of high performance computers exist, few are easily portable across MPP platforms. Furthermore, because their performance results occur at ambiguous points in the communication parameter space, benchmarks do not necessarily pose a known and controllable workload. What is needed is an efficient and readily portable approach to evaluating communication performance on a wide variety of high performance computers that, unlike a classic benchmark, allows direct control of several communication parameters through manipulation of a synthetically generated workload.

We present a method to realize a desired set of communication parameters by synthesizing a range of sparse matrices for a simple parallel matrix multiplication algorithm. These parameters include the number of nodes, the average amount of point-to-point data communication per node, the degree of sharing of particular vector elements among the nodes, and the computation-communication ratio. The overall performance of the Sparse Matrix / Dense Vector Multiplication (SMDVM) algorithm is not the focus of this approach; the goal is to expose and evaluate communication characteristics and bottlenecks by observing the change in performance as the communication parameters are varied in a regular fashion. Unlike a benchmark, where only a particular application's data communication pattern can be evaluated and the performance on several machines compared, the data communication pattern in our approach is easily varied in a controlled manner by changing the sparse matrix. Furthermore, because of its relative simplicity, this algorithm can be easily ported to various MPPs having widely differing programming models. Thus the communication capabilities of several different parallel systems can be quickly and comprehensively evaluated.

As an illustration of the method, this approach is used to evaluate the hierarchical ring interconnection and cache-only memory architecture (COMA) of the Kendall Square Research KSR1 MPP in use at the University of Michigan Center for Parallel Computing [5]. Analysis of communication performance on the KSR1 using a parameterized synthetic workload generated by our method leads to several interesting conclusions. We show that the automatic update feature is an effective implementation of implicit multicast communication, although ignored potential updates can significantly degrade performance when the degree of sharing is high. Furthermore, the communication interconnect is a shared resource which can significantly affect performance as the total communication is increased.

2. Communication and Computation in Sparse Matrix / Dense Vector Multiplication

The distribution of nonzero elements in a sparse matrix determines the communication and computation patterns of the SMDVM. A simple row-blocked parallel matrix vector multiplication algorithm is shown in Figure 1. The algorithm alternates between updating the vector y using y = Ax and updating the vector x using x = Ay. Because the two computations are essentially identical, we discuss only the y = Ax computation (referred to as a single iteration). Each element of the vector y is obtained by multiplying the corresponding row of A with the vector x. Because of the conditional statement in the loop, an element of x is accessed only if the corresponding element in the row of A is nonzero.

PARALLEL REGION
  pid = get_proc_id()
  for k = 1, count
    for row = ((N*pid)/P)+1, (N*(pid+1))/P do
      for col = 1, N do
        if (A[row,col] ≠ 0) y[row] = y[row] + A[row,col]*x[col]
      endfor
    endfor
    BARRIER SYNCHRONIZATION POINT
    for row = ((N*pid)/P)+1, (N*(pid+1))/P do
      for col = 1, N do
        if (A[row,col] ≠ 0) x[row] = x[row] + A[row,col]*y[col]
      endfor
    endfor
    BARRIER SYNCHRONIZATION POINT
  endfor
END PARALLEL REGION

Figure 1: Parallel Matrix Vector Multiplication Algorithm

Each of the P processors runs its own thread, owns a block of the x and y vectors, and is responsible for all updates on its row block of the y vector. All P threads are activated on entry into the parallel region. The pid is a distinct identification number for each processor, varying from 0 through P-1. The barrier synchronization point is required to ensure that the y vector has been completely updated before it is used to update the x vector in the next iteration. Sparse matrices are typically represented by a compact data structure in which only the nonzero elements are explicitly stored; when such data structures are used, the conditional is not required in the loop.
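To make the structure of Figure 1 concrete on a generic shared-memory system, the following sketch expresses one thread's work using POSIX threads; the names (run_iterations, args_t) and the dense 0/1 row-major layout of A are our own choices for illustration, not the KSR1 constructs used in the actual experiments.

#include <pthread.h>

typedef struct {
    int pid, P, N, count;
    const double *A;          /* N x N, row-major, entries 0.0 or 1.0 */
    double *x, *y;
    pthread_barrier_t *bar;   /* shared by all P threads */
} args_t;

static void *run_iterations(void *arg)
{
    args_t *a = (args_t *)arg;
    int lo = (a->N * a->pid) / a->P;        /* first row owned by this thread */
    int hi = (a->N * (a->pid + 1)) / a->P;  /* one past the last owned row */

    for (int k = 0; k < a->count; k++) {
        for (int row = lo; row < hi; row++)         /* y = Ax on the owned row block */
            for (int col = 0; col < a->N; col++)
                if (a->A[row * a->N + col] != 0.0)
                    a->y[row] += a->A[row * a->N + col] * a->x[col];
        pthread_barrier_wait(a->bar);               /* y complete before x = Ay */

        for (int row = lo; row < hi; row++)         /* x = Ay on the owned row block */
            for (int col = 0; col < a->N; col++)
                if (a->A[row * a->N + col] != 0.0)
                    a->x[row] += a->A[row * a->N + col] * a->y[col];
        pthread_barrier_wait(a->bar);               /* x complete before next iteration */
    }
    return NULL;
}

Each thread writes only its own row block of y (and of x in the second half of the iteration), so all writes are local; the reads of x[col] or y[col] that fall outside the thread's own block are what generate the interprocessor communication analyzed below.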

The interprocessor communication incurred by a processor while calculating its portion of the y = Ax computation is determined by the A matrix. Specifically, interprocessor communication occurs when there are nonzero elements within the band of rows of the A matrix whose column locations do not fall within the corresponding block of x owned by that processor; these x elements must be acquired from another processor. For example, Figure 2 shows exactly which portions of the A matrix, x vector, and y vector are accessed by processor P1 for a 6x6 matrix distributed across processors P0, P1, and P2. Each remote read, R1, represents an element of x accessed by P1 but owned by some other processor, thus requiring interprocessor data communication to obtain a read-only copy of the data. Each write, W1, represents a data item which is owned and must be written by P1. In a distributed shared memory machine such as the KSR1, this may generate coherence traffic in the form of write acquisition or invalidation traffic. Each local read, L, represents an element of x which is accessed and owned by P1, therefore requiring no interprocessor communication. The "blanks" in the x vector correspond to columns which have no 1's in Row Band 1 and are consequently not accessed by processor P1.


Figure 2: Elements Read and Written by P1 During a MVM (the figure marks, for a 6x6 A matrix distributed across P0, P1, and P2, the entries of Row Band 1 of A, the remote reads R1, local reads L, and blanks of the x vector, and the writes W1 of the y vector)

We assume that the A matrix is a square, NxN matrix, and that N is evenly divisible by the number of processors, P, so that the A matrix naturally subdivides into P^2 submatrices of dimension S = N/P. We define a submatrix as follows:

Definition 1: Submatrix Sm,n is the submatrix A[(m*N)/P+1 : ((m+1)*N)/P, (n*N)/P+1 : ((n+1)*N)/P] of A, where 0 ≤ m, n ≤ (P-1).

Each submatrix is an (N/P) x (N/P) matrix and has a unique row band and column band pair, as shown in Figure 2. A submatrix Sm,n where m = n is referred to as a diagonal submatrix, Sp,p. A submatrix Sm,n where m ≠ n is referred to as a nondiagonal submatrix. Nonzero entries in a diagonal submatrix generate only computation, while nonzero entries in a nondiagonal submatrix generate computation and may generate communication.

Since the actual values of nonzero elements do not influence the amount of communication or computation, we restrict our discussion to A matrices with elements of value 0 and 1. Summing all the elements of a binary A matrix yields the overall computation load per iteration of the SMDVM. Note, however, that one "unit" of computation includes one floating point addition and one floating point multiplication.

The placement of nonzero entries in A determines which elements of x are communicated. If a column j of a nondiagonal submatrix, Sm,n, of A contains only 0 entries, then the jth x element of the band owned by processor n is not read by processor m. If there are one or more 1 entries in column j, then that x element is read and requires a single communication from processor n to processor m. Thus the point-to-point communication load per iteration of the SMDVM, in units of elements of x, is the sum of the number of nonzero columns over all nondiagonal submatrices. We also define the multicast communication load per iteration of the SMDVM as the number of distinct elements of x in the point-to-point communication load. Note that the communication loads as defined above do not include synchronization, cache invalidates, or other communications with no data content. Such nondata communications are highly machine dependent and would need to be added in practice for the machine being evaluated.
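As a concrete illustration of these two definitions, the sketch below counts the point-to-point and multicast communication loads of a given binary A matrix distributed over P row and column bands; the function and variable names are ours, and P is assumed to divide N evenly, as above.

/* Count, for a binary N x N matrix A distributed over P row and column bands,
 * the point-to-point load (sum of nonzero columns over all nondiagonal
 * submatrices) and the multicast load (number of distinct x elements that
 * must be communicated). */
static void comm_loads(int N, int P, const int *A, long *p2p, long *mcast)
{
    int S = N / P;                       /* band width; P is assumed to divide N */
    *p2p = 0;
    *mcast = 0;
    for (int col = 0; col < N; col++) {
        int owner = col / S;             /* column band n that owns x[col] */
        int readers = 0;                 /* nonowner row bands that touch this column */
        for (int m = 0; m < P; m++) {
            if (m == owner)
                continue;                /* diagonal submatrix: no communication */
            int nonzero = 0;
            for (int r = m * S; r < (m + 1) * S && !nonzero; r++)
                nonzero = (A[r * N + col] != 0);
            readers += nonzero;
        }
        *p2p += readers;                 /* one point-to-point transfer per nonowner reader */
        if (readers > 0)
            (*mcast)++;                  /* x[col] is one distinct multicast item */
    }
}

For the synthetic matrices constructed in Section 3, the point-to-point count divided by P equals U by construction.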

3. Synthetic Workload Generation

To simplify the generation of the synthetic workloads we will use for evaluating MPP communication performance, we restrict the generated sparse matrices used by the SMDVM to a simple subset of all possible matrices. These matrices will be square matrices of size NxN, where N is a multiple of P. Furthermore, though all the computations are floating point, the matrix elements will all be either 0 or 1. Finally, all processors will have an identical communication and computation load, i.e., this is a homogeneous parallel algorithm. Elements of A that generate computation but not communication are restricted to the diagonal submatrices of A, i.e., no column of a nondiagonal submatrix has more than a single 1. To ensure that every element of y is written by its owner processor, no row of a diagonal submatrix will be all 0. This write forces the owning processors to reacquire exclusive possession of their data elements, and thus all the communication will be repeated in every iteration. Though none of these requirements is necessary, the resulting allowable sparse matrices are easily generated to match the desired parameters and are sufficiently powerful to evaluate the communication performance of an MPP. Straightforward extensions of these allowable sparse matrices permit the construction of synthetic workloads that exhibit uneven communication patterns or load imbalances.

The generation of a sparse matrix to create a synthetic workload is governed by four basic parameters, defined as follows:

Definition 2: The number of processors, P, is the total number of independent nodes involved in computation and connected via the interconnection network.

Definition 3: The average number of point-to-point data communications per processor, U, is the number of elements of x that are read but not owned by a processor, i.e., the number of x elements which a processor must acquire.

Definition 4: The degree of sharing, Di, is the number of nonowner processors that read an element i of x. The average degree of sharing, D, is defined to be the average of the nonzero Di. It follows that 1 ≤ D ≤ (P-1).

Definition 5: The computation-to-communication ratio, CCR, is the ratio of floating point multiply-add operations per processor to point-to-point data communications per processor. It can be shown that CCR ≥ (D+1)/D, because the maximum communication occurs when all elements are shared, which results in D point-to-point data communications per x element and D nonowner computations per x element plus one owner computation per y element.

We now develop a procedure that generates a sparse matrix realizing any given P, U, D, and CCR, provided that 1 ≤ D ≤ (P-1) and CCR ≥ (D+1)/D. As stated previously, elements of A that generate computation but not communication are restricted to the diagonal submatrices of A. Assuming there are X elements per processor that generate computation but not communication, the average computation per processor will be (U+X). Hence X can be calculated in terms of the basic parameters U and CCR as follows:

CCR = (U + X) / U    (1)

X = (U * CCR) - U = U * (CCR - 1)    (2)
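The parameter constraints and the derivation of X map directly onto a small helper; this sketch (including the struct and function names) is ours, not code from the paper.

#include <assert.h>

struct workload_params {
    int    P;     /* number of processors (Definition 2) */
    int    U;     /* point-to-point communications per processor (Definition 3) */
    int    D;     /* average degree of sharing (Definition 4) */
    double CCR;   /* computation-to-communication ratio (Definition 5) */
};

/* X, the number of communication-free (diagonal-submatrix) elements per
 * processor, follows from equations (1)-(2): X = U * (CCR - 1). */
static int derive_X(const struct workload_params *w)
{
    assert(w->D >= 1 && w->D <= w->P - 1);           /* 1 <= D <= P-1 */
    assert(w->CCR >= (double)(w->D + 1) / w->D);     /* CCR >= (D+1)/D */
    return (int)(w->U * (w->CCR - 1.0) + 0.5);       /* rounded to the nearest integer */
}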

If nonowner processors need to read a shared data element of x, it is theoretically possible to multicast the element in a one-to-many communication transaction, although this option is not available on all MPPs. Alternatives to a multicast are either multiple point-to-point data communications or a one-to-all broadcast. Multiple point-to-point data communications may have unnecessary overhead compared to a multicast, while a broadcast may waste resources. Given D and U, the average number of multicasts required per processor, M, can be calculated as follows:

M = U / D    (3)

Since D = U/M, it follows that 1 ≤ U/M ≤ (P-1). Figure 3 outlines the algorithm for the generation of the synthetic workload matrices; this algorithm is explained in further detail below. The matrices and vectors created as intermediate and final steps during the execution of this algorithm are listed in Table 1 and are also described in more detail below.

1> Using the parameters P, U, D, and CCR, determine the elements in the H and V template vectors.
2> Use the H and V vectors to guide the construction of the template matrix, T, identical for all processors.
3> Place the rows of T into their appropriate positions within the A matrix.
4> Add X nonzero entries to all submatrices Sp,p of the A matrix, with at least one nonzero entry per row.
5> Output a sparse matrix representation of the A matrix.

Figure 3: Generation of Sparse Matrices

Matrix or Vector   Vertical Dimension      Horizontal Dimension
H                  1                       M
V                  P-1                     1
T                  P-1                     M
A                  S*P                     S*P
Row Offset         S*P                     1
Column Index       M*(P-1)*P + (X*P)       1

Table 1: Sparse Matrix Generation Algorithm Matrices

To construct the sparse matrix used in a synthetic workload, we first create two basic template vectors. The horizontal template vector, H, is a 1 x M matrix; element i of H is equal to the degree of sharing of element ((p*N)/P + i) of x for each processor p. The vertical template vector, V, is a (P-1) x 1 matrix; element j of V is equal to the number of point-to-point data communications of elements of x from each processor p to processor ((j + p) mod P). To construct the H and V vectors, we calculate the following values:

hc = ceil(U / M),  hf = floor(U / M),  hcf = U mod M
vc = ceil(U / (P-1)),  vf = floor(U / (P-1)),  vcf = U mod (P-1)    (4)
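A small helper that realizes these values is sketched below; it follows equation (4) and the placement rule described below with Figure 4 (the first hcf entries of H get the ceiling value, the rest the floor value, and likewise for V), so that each vector sums to U. The function name and the integer ceiling idiom are ours.

/* Fill H (length M) and V (length P-1) so that the first hcf entries of H hold
 * hc and the rest hold hf, and similarly for V; each vector then sums to U. */
static void build_h_and_v(int U, int M, int P, int *H, int *V)
{
    int hc  = (U + M - 1) / M;          /* ceil(U / M) */
    int hf  = U / M;                    /* floor(U / M) */
    int hcf = U % M;                    /* U mod M */
    int vc  = (U + P - 2) / (P - 1);    /* ceil(U / (P-1)) */
    int vf  = U / (P - 1);              /* floor(U / (P-1)) */
    int vcf = U % (P - 1);              /* U mod (P-1) */

    for (int i = 0; i < M; i++)
        H[i] = (i < hcf) ? hc : hf;
    for (int j = 0; j < P - 1; j++)
        V[j] = (j < vcf) ? vc : vf;
}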

These values are placed as shown in Figure 4.

Figure 4: Formation of the H and V Vectors (the first hcf entries of H hold hc and the remaining entries hold hf; the first vcf entries of V hold vc and the remaining entries hold vf)

The first hcf entries in the H vector are set to hc, while the last (M - hcf) entries in the H vector are set to hf. Likewise, the first vcf entries in the V vector are set to vc, while the last ((P-1) - vcf) entries in the V vector are set to vf. The sum of the entries in both the H and V vectors is U. Note that the values of the entries in the H and V vectors are set to be close to their average, subject to the constraint that each entry is an integer and that the sum of all entries is U. This procedure ensures that each processor communicates uniformly with all other processors, and that each shared value has approximately the same degree of sharing.

From the horizontal and vertical template vectors H and V, a template matrix, T, is formed according to the algorithm shown in Figure 5. Each nonzero element T(i,j) of the template matrix corresponds to a point-to-point data communication from each processor p to processor ((p+i) mod P). Each column j of T defines a multicast from each processor p to all processors ((p+i) mod P) for which T(i,j) = 1. The algorithm proceeds across each row, filling in 1's until the row total reaches the corresponding entry of V.

template[0:P-2][0:M-1] = 0
row_total[0:P-2] = 0
row = 0
col = 0
while (row < (P-1)) do
  while (row_total[row] ...

Figure 5: Template Matrix Generation Algorithm
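The following greedy sketch is one way, not necessarily the authors' Figure 5 algorithm, to build a (P-1) x M binary template matrix whose row sums match V and whose column sums match H: each row receives its V[i] ones in the columns with the largest remaining demand. All names are ours.

#include <stdlib.h>
#include <string.h>

/* Give row i of the (P-1) x M matrix T exactly V[i] ones, always choosing the
 * columns whose remaining demand (H[j] minus ones already placed) is largest,
 * so that column j ends up with H[j] ones. */
static void build_template(int P, int M, const int *V, const int *H, int *T)
{
    int rows = P - 1;
    int *need = malloc(M * sizeof(int));       /* remaining ones needed per column */
    memset(T, 0, rows * M * sizeof(int));
    for (int j = 0; j < M; j++)
        need[j] = H[j];

    for (int i = 0; i < rows; i++) {
        for (int k = 0; k < V[i]; k++) {
            int best = -1;                     /* column with largest remaining demand */
            for (int j = 0; j < M; j++)
                if (need[j] > 0 && T[i * M + j] == 0 &&
                    (best < 0 || need[j] > need[best]))
                    best = j;
            if (best < 0)
                break;                         /* no feasible column left in this row */
            T[i * M + best] = 1;
            need[best]--;
        }
    }
    free(need);
}

For the nearly uniform H and V produced above, a valid assignment exists, and this greedy fill finds one.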


The generation of the A matrix is completed by the algorithm shown in Figure 8. Each diagonal submatrix is filled with X 1's placed in column-major order. We require that X ≥ S = k * M, so that there is enough computation to permit all elements of y to be written in each iteration. Otherwise, for some MPPs, particularly shared memory machines, the reads might not generate the expected interprocessor communication. Each nondiagonal submatrix of A is filled with the appropriate row of the T matrix. The generation of the A matrix induces skewing and shifting to spread communication demand more uniformly in time and to control which processor requests an element first, which may be relevant on a machine with multicasting. As an example, Figure 9 shows the expansion of a template matrix into an A matrix.

The sparse matrix is represented in a compressed row format by a one-dimensional array, the column index vector, of length (U+X)*P, with each entry corresponding to a column index in the A matrix, and a one-dimensional array, the row offset vector, of length N = S*P, with each entry corresponding to a row number in the A matrix. The column index vector contains, in row-major order, the column index of every entry with a value of 1 in the A matrix. Each row index in the row offset vector points to the last column index in C (the column index vector) that corresponds to an entry in that particular row. An example is shown in Figure 7. In practice it is possible to proceed directly from T to the sparse representation of A, but we describe the regular representation of the A matrix for clarity.
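The representation just described translates into the following sketch, which compresses a dense binary A into the column index and row offset vectors; the function and array names are ours, and the generated matrices' guarantee of at least one 1 per row keeps every row offset well defined.

/* Compress a dense binary N x N matrix A into the column index vector
 * (col_index) and the row offset vector (row_offset): col_index lists, in
 * row-major order, the column of every 1 in A, and row_offset[r] records the
 * position of the last col_index entry belonging to row r. */
static int compress(int N, const int *A, int *col_index, int *row_offset)
{
    int nnz = 0;                         /* running count of 1 entries */
    for (int r = 0; r < N; r++) {
        for (int c = 0; c < N; c++)
            if (A[r * N + c])
                col_index[nnz++] = c;    /* record the column of this nonzero */
        row_offset[r] = nnz - 1;         /* index of the last entry for row r */
    }
    return nnz;                          /* (U + X) * P for the matrices generated here */
}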

For D > 1 (i.e., U > M), the slopes are greater than for D = 1, indicating a performance penalty in addition to the system contention seen when D = 1. This penalty results from missed updates, which increase the total number of multicasts and the system contention even further. This effect is also shown in Figure 13 by the vertical displacements for different values of U. Note that for D > 6, the slope begins to decrease as D increases, due to competing effects from contributing factors for which there is no satisfactory quantitative model.

Figure 13: Execution Time versus the Number of Processors, P, with M = 50, S = 100, U+X = 1000 (plotted curves: Tproc(U = 500) - Tproc(U = 0), Tproc(U = 400) - Tproc(U = 0), Tproc(U = 300) - Tproc(U = 0), Tproc(U = 200) - Tproc(U = 0), Tproc(U = 130) - Tproc(U = 0), and Tproc(U = 50) - Tproc(U = 0); execution time in µsec)

5.4. Varying Degree of Sharing

To measure the effect of varying the degree of sharing, D, we ran several experiments which varied the number of possible updates, U, for synthetic workloads of fixed matrix size with M = 50, U+X = 1000 or 3000, and 16 or 28 processors. The measured execution times for the various test cases are plotted in Figure 14. Note that there are two groups of lines in Figure 14, corresponding to two different testing situations. We explicitly controlled the binding of the threads to specific processors of the KSR1, in order that the number of potential updates might be either maximized (the update experiments, Tu) or minimized (the no-update experiments, Tn).


In Figure 14, for fixed P in both the update and no-update experiments, varying U+X has a relatively modest performance impact. For fixed U+X in both the update and no-update experiments, varying P also has a relatively modest performance impact, attributable to synchronization overhead. The largest performance gap is caused by minimizing and maximizing the width of multicasts in the no-update and update experiments, respectively. The slope in the update case shows the effect of missed updates and system contention, while the slope in the no-update case is smaller than expected, indicating that some multicasting is occurring. The performance gap increases as the degree of sharing increases, indicating the benefit of automatic updates. Furthermore, the tight clustering of the update and no-update experiments shows that the communication impact far outweighs the differences in execution time due to the contention effects and the larger total computation workload. Clearly, the automatic updating of data items is a very useful and important feature for good performance on the KSR1.

Figure 14: Execution Time versus the Number of Possible Updates, U, with M = 50, S = 100 (plotted curves: Tn and Tu for P = 28 with (U+X) = 3000, P = 28 with (U+X) = 1000, P = 16 with (U+X) = 3000, and P = 16 with (U+X) = 1000; execution time in µsec versus the number of point-to-point communications, U)

6. Related Work

Sparse matrix computations have previously been used in the context of evaluating and improving uniprocessor cache designs [6]. Matrix addition and matrix multiplication (but not sparse matrices) have been used to determine remote memory access delay by Zhang et al. [10]. The performance of the memory system of the BBN multiprocessor, a shared address distributed memory system like the KSR1, has been studied by Gallivan et al. [7] and Bodin et al. [8]. A number of authors have investigated modeling the performance of MPPs, including Flatt and Kennedy [20], Karp and Flatt [21], and Zhang et al. [23] [24]. Others have concentrated on memory performance, including Mowry and Gupta [18], Dubois et al. [19], and LaRowe and Ellis [22]. The cache invalidation patterns of five realistic applications have been examined by Gupta and Weber [9]. The data generated by our synthetic SMDVM workload on new multiprocessor systems can be combined with the information on invalidation patterns in [9] to estimate the performance of those applications on new multiprocessor systems.

Synthesis of artificial workloads to characterize multiprocessor performance has been used by Nanda and Ni [11] [12]. They define three major factors that influence performance: the distribution of shared data in the memory hierarchy and the access pattern, the nature of synchronization operations, and the presence of global synchronization barriers. They quantify the performance behavior for each of these dimensions by synthesizing workloads.

The performance of the KSR1 system has been analyzed by Dunigan [14], Windheiser et al. [13], Kahhaleh [17], and Rosti et al. [15]. The Dunigan paper runs a series of experiments evaluating the communication and synchronization performance of the KSR1 and compares it to other shared memory multiprocessors such as the Sequent and the BBN TC2000. The Windheiser et al. paper examines the impact of prefetches and poststores in the context of the FEM-ATS application. The Kahhaleh paper analyzes the various memory latency factors which stall the processor during program execution. The Rosti et al. paper details a performance model of poststores on the KSR1 using Generalized Stochastic Petri Nets and evaluates their effectiveness. Another COMA-style architecture similar to the KSR1 is developed by Hagersten et al. [5].

7. Conclusion

Parallel processors are important vehicles for attaining the highest performance on many application workloads. However, though the core processors of these machines continue to improve in performance with each generation, communication performance is increasingly limiting overall machine performance. Many authors have explored methods to reduce the amount of communication required for specific applications, and even for parallel programs in general. Unfortunately, there is inherent communication in many applications which simply cannot be removed through any known methodology. This communication poses the fundamental limit to the scalability of a machine and its ultimate performance on a workload.

Architects and machine designers clearly need a means to analyze the communication performance of current generation MPPs, and from these analyses determine the most important factors affecting the communication performance of these machines. We have presented a simple procedure to synthesize workloads which realize desired communication parameters, and a simple sparse matrix dense vector multiplication engine which can be used to generate relative performance information for these different synthetic workloads. By utilizing this synthetic workload methodology, the performance of an MPP can be evaluated under different, controlled communication system stresses, highlighting the important factors of the communication system.

We have presented an evaluation of the KSR1 system using this methodology, and from this analysis have identified several communication system features which both improve and detract from the system performance. On the KSR1 system, the implicit multicast operation, embodied in the automatic update of invalid cache data, can significantly improve the performance of an application. However, as the number of processors employed in the parallel program is increased, the greater degree of system congestion tends to cause updates to be ignored, and thus negatively impacts performance. If the local caches of the processors were freed to better exploit the automatic update feature, then the performance of the KSR1 would improve. The general system congestion (causing communication requests to occasionally stall rather than being transmitted immediately) also tends to impact the performance fairly significantly. As we varied the number of processors involved in the parallel matrix multiplication (under a constant computational workload), we noted that the curves for a given computation-to-communication ratio all exhibited an increasing slope. This slope is caused by the increased cost of the communication and the synchronization overhead.

Finally, though the methodology presented here generates only homogeneous matrix multiplication problems (where each processor does an identical amount of work and communication), it can be extended to the examination of heterogeneous synthetic workloads. The methodology generates sparse matrices where the communication and computation are evenly distributed, and where the implicit multicast of KSR-style architectures is maximally exploited. To generate matrices with asymmetric communication or computation between processors, the placement and number of nonzero elements in the A matrix could be varied by using different template matrices for each column band. This general approach would allow the examination of effects such as computation-load imbalance, communication-load imbalance, and communication hot spots.

8. References

[1] "KSR1 Principles of Operation," Kendall Square Research Corporation, Waltham, MA, 1991.
[2] "Inside the TC2000 Computer," BBN Advanced Computers Inc., Cambridge, MA, 1990.
[3] "Paragon XP/S Product Overview," Intel Corporation, Hillsboro, OR, 1991.
[4] "The Connection Machine CM5 Technical Summary," Thinking Machines Corporation, Cambridge, MA, January 1992.
[5] E. Hagersten, A. Landin, and S. Haridi, "DDM --- A Cache-Only Memory Architecture," IEEE Computer, September 1992.
[6] V. E. Taylor, "Sparse Matrix Computations: Implications for Cache Design," Proceedings of Supercomputing '92, pp. 598-607.
[7] K. Gallivan, D. Gannon, W. Jalby, A. Malony, and H. Wijshoff, "Behavioral Characterization of Multiprocessor Memory Systems: A Case Study," University of Illinois at Urbana-Champaign, Technical Report No. 808, 1988.
[8] F. Bodin, D. Windheiser, W. Jalby, D. Atapattu, M. Lee, and D. Gannon, "Performance Evaluation for Parallel Algorithms on the BBN GP1000," Proceedings of the ACM International Conference on Supercomputing 1990, pp. 401-413.
[9] A. Gupta and W.-D. Weber, "Cache Invalidation Patterns in Shared Memory Multiprocessors," IEEE Transactions on Computers, Vol. 41, No. 7, July 1992, pp. 794-810.
[10] X. Zhang and X. Qin, "Performance Prediction and Evaluation of Parallel Processing on a NUMA Multiprocessor," IEEE Transactions on Software Engineering, Vol. 17, No. 10, October 1991, pp. 1059-1068.
[11] A. Nanda and L. M. Ni, "MAD Kernels: An Experimental Testbed to Study Multiprocessor Memory System Behavior," International Conference on Parallel Processing, Vol. 1, 1992, pp. 28-35.
[12] A. Nanda and L. M. Ni, "Benchmark Workload Generation and Performance Characterization of Multiprocessors," Supercomputing '92, pp. 20-29.
[13] D. Windheiser, E. L. Boyd, E. Hao, S. G. Abraham, and E. S. Davidson, "KSR1 Multiprocessor: Analysis of Latency Hiding Techniques in a Sparse Solver," International Parallel Processing Symposium, April 1993.
[14] T. H. Dunigan, "Kendall Square Multiprocessor: Early Experiences and Performance," Oak Ridge National Laboratory Technical Report ORNL/TM-12065, April 1992.
[15] E. Rosti, E. Smirni, T. D. Wagner, A. W. Apon, and L. W. Dowdy, "The KSR1: Experimentation and Modeling of Poststore," SIGMETRICS Conference on Performance Measurement and Modeling, May 1993.
[16] J. R. Goodman and P. J. Woest, "The Wisconsin Multicube: A New Large-Scale Cache Coherent Multiprocessor," Proceedings of the 15th Annual International Symposium on Computer Architecture, pp. 422-431, 1988.
[17] B. Kahhaleh, "Analysis of Memory Latency Factors and their Impact on KSR1 MPP Performance," University of Michigan, Technical Report CSE-TR-157-93, 1993.
[18] T. Mowry and A. Gupta, "Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors," Journal of Parallel and Distributed Computing, Vol. 12, No. 2, pp. 87-106, June 1991.
[19] M. Dubois, C. Scheurich, and F. Briggs, "Memory Access Buffering in Multiprocessors," Proceedings of the 13th Annual International Symposium on Computer Architecture, pp. 434-442, 1986.
[20] H. P. Flatt and K. Kennedy, "Performance of Parallel Processors," Parallel Computing, Vol. 12, No. 1, pp. 1-12, 1989.
[21] A. H. Karp and H. P. Flatt, "Measuring Parallel Processor Performance," Communications of the ACM, Vol. 33, No. 5, pp. 539-543, 1990.
[22] R. P. LaRowe, Jr. and C. S. Ellis, "Experimental Comparison of Memory Management Policies for NUMA Multiprocessors," Department of Computer Science, Duke University, Technical Report CS-1990-10, 1990.
[23] X. Zhang, "Performance Measurement and Modeling to Evaluate Various Effects on a Shared Memory Multiprocessor," IEEE Transactions on Software Engineering, Vol. 17, pp. 87-93, January 1991.
[24] X. Zhang and P. Srinivasan, "Distributed Task Processing and Performance on a NUMA Shared Memory Multiprocessor," Proceedings of the 2nd IEEE Symposium on Parallel and Distributed Processing, Los Alamitos, CA, pp. 786-789, 1990.

