IEEE TRANSACTIONS ON COMPUTERS, VOL. C-32, NO. 1, JANUARY 1983

Shared Cache for Multiple-Stream Computer Systems

PHIL C. C. YEH, MEMBER, IEEE, JANAK H. PATEL, MEMBER, IEEE, AND EDWARD S. DAVIDSON, SENIOR MEMBER, IEEE
Abstract-Cache memory organization for parallel-pipelined multiprocessor systems is evaluated. Private caches have a cache coherence problem. A shared cache avoids this problem and can attain a higher hit ratio due to sharing of single copies of common blocks and dynamic allocation of cache space among the processes. However, a shared cache suffers performance degradation due to access conflicts. In this paper, effective shared cache organizations are presented which retain these inherent advantages and have very low access conflict degradation, even with very high request rates. A Markov model is developed for performance analysis of shared cache organizations. Analytic expressions for performance are presented for several important organizations. Bounds are derived for other organizations. Simulation results show that the assumptions of the analytic model are reasonable.

Index Terms-Cache memories, memory interference, multiprocessors, parallel memories, performance evaluation.
I. INTRODUCTION

THE high performance and cost effectiveness of cache memory for uniprocessor computer systems is well known [1]-[7]. In a cache-based computer system, the cache memory and the main memory are usually divided into equal sized blocks. The block is the minimum amount of data which may be transmitted between the cache and main memory. A memory reference is a hit or a miss if the referenced datum is present or absent in the cache, respectively. After a miss, the block containing the desired datum is copied from the main memory to cache memory. The hit ratio is the fraction of hits among all references; the miss ratio is the fraction of misses.

This paper is concerned with tightly coupled multiprocessor systems in which main memory is shared by all the processors. Cache memory, if present, may be private or shared. A single shared cache may lose performance due to access conflicts. It may, however, have a lower miss ratio due to space trading in the cache among tasks on distinct processors and due to sharing single copies of common blocks.

Manuscript received February 3, 1982; revised July 22, 1982. This work was supported by the Joint Services Electronics Program under Contract N00014-79-C-0424 and by the Naval Electronics Systems Command under VHSIC Contract N00039-80-C-0556. P. C. C. Yeh was with the Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801. He is now with the IBM Corporation, Poughkeepsie, NY 12602. J. H. Patel and E. S. Davidson are with the Coordinated Science Laboratory and the Department of Electrical Engineering, University of Illinois, Urbana, IL 61801.
Multiprocessor systems with a private cache for each processor have been analyzed and their performance has been characterized [8]. However, private cache causes the well-known cache coherence problem [9], [10], i.e., multiple inconsistent copies of data may exist in the system. Note that reentrant (pure or read-only) code avoids the coherence problem for code blocks. This coherence problem arises when shared data are updated by one processor in one cache, rendering copies in other caches and in main memory obsolete. Even without shared data, if jobs may be switched among processors, obsolete data may be read from main memory to a new processor cache after a job switch. Another disadvantage of the private cache is that certain shared system resources, such as operating system routines, may be copied several times in the cache memories when they are referenced by more than one processor. Private caches also require a fixed cache allocation per processor; shared cache allows dynamic allocation of total cache space among the processors. The effective total system cache size is thus larger for shared cache than for a private cache system, and its hit ratio should be higher.

Two policies are commonly used for updating main memory. In the write-through approach, all writes are sent to main memory; the data are also updated in cache if they are there. Write requests then need never cause a miss. Although write-through has the advantage that obsolete information is never present in main memory, it is not sufficient in itself to ensure coherence for a private-cache multiprocessor system. In addition, the effectiveness of write-through is known to be less than that of write-back [3], [11], [12]. The write-back approach is to write the data in the cache only. A block is written back to main memory whenever that block must be replaced in the cache. Due to the high access rate to main memory, a pure write-through policy is not suited to high performance multiprocessor systems.

Several mechanisms have been proposed to solve the coherence problem in private-cache multiprocessor systems. In C.mmp [13], only read-only blocks may appear in the cache. Performance is degraded when programs with high write rate are executed. In the classical solution, every cache is connected to a communication path over which the addresses of modified blocks are broadcast throughout the cache memories for invalidation. Each cache constantly monitors this path and executes the proper operations for invalidation. The drawbacks of this solution are high invalidation traffic, low cache efficiency, and the need for buffers to accommodate the peak invalidation traffic. Tang [10] proposed an algorithm using a central directory to keep track of every block in each cache memory. A write-back policy is used. Each block is identified as shared (read-only) or private (only one copy allowed in the caches at any time). This solution requires extensive central directory searching per miss and block status checking per cache memory write. Frequent block status changes may also be required, e.g., for critical sections or semaphores. An access conflict problem occurs at the central directory. Censier and Feautrier [9] proposed a solution very similar to Tang's algorithm. None of these solutions can resolve the coherence problem for multiprocessor systems without significant overhead.

This paper focuses on shared cache, which has been neglected thus far due to the supposed access conflict problem. Shared cache has several inherent advantages. Proper cache management strategies can resolve the coherence problem entirely, without overhead penalty and hardware cost. It provides high cache utilization due to its dynamic space sharing and single-copy requirement for shared information. Interprocessor communication can easily be achieved through the shared cache. The potential performance degradation for shared cache is simply cache access interference, which can be overcome to any desired degree by using a sufficiently large number of cache modules. An effective shared-cache memory organization with preferred cache management policies for parallel-pipelined multiple instruction stream processor systems is proposed in Section II. System performance is analyzed in Section III. From these new results on shared-cache access conflict and the well-known coherence and miss rate problems of private-cache multiprocessors, an effective comparison of shared versus private cache can be made.

II. A SHARED-CACHE MODEL

A. Multiprocessor Organization

Parallel and pipelined computing [14] can be employed to enhance the throughput of a computer system. Parallelism is usually achieved by a multiplicity of independent processing units. A pipelined processor consists of several specialized subprocessors called segments. Each segment performs a specific part of a particular computation and operates concurrently with other segments.

A pipelined processor of order s is modeled here as a set of s segments. The s instructions which execute concurrently in the pipelined processor are assumed to come from distinct instruction streams as in [15]-[17]. Thus, the degree of multiprogramming is also s. For a straight-through pipelined processor, all instructions have identical flow patterns and flow through all s segments in sequence. Hence, the instruction cycle (or pipelined processor cycle) is fixed for all instructions. The instruction here is the unit instruction defined by Strecker [18] such that each instruction issues one memory request per instruction cycle.

Each processor segment takes one segment time unit (STU) to complete its execution step. A pipelined processor of order s can thus issue s time-multiplexed memory requests per cycle, one per STU. If an instruction from a stream is initiated at time t, the next unit instruction from the same stream is initiated at time t + s. Therefore, instruction execution overlap occurs only between distinct instruction streams and no execution overlap occurs between instructions from the same stream.

A parallel-pipelined processor of order (s, p) [19], [20] is modeled as a set of p identical and independent, but synchronized, processors, each of which is a pipelined processor of order s. A parallel-pipelined processor of order (s, p) executes sp distinct instruction streams concurrently and issues p simultaneous memory requests per STU. Each of the sp streams makes a memory request every s STU's. All time units are expressed as an integer number of STU's unless otherwise stated.

B. Shared-Cache Memory Management Policies

A detailed investigation of shared-cache memory management policies is presented in [21]. The choice of management policy forms a background for the assumptions, but does not directly affect the analysis discussed in Section III. A brief summary of those policies chosen for our model is described below to ensure coherence and to reveal feasibility.

A number of cache memory mapping mechanisms have been proposed in [10]. In [21], a set associative mapping mechanism is shown to be most suitable for shared cache if a sufficiently large set size is used. Blocks in cache memory and main memory are grouped into sets. The set size z is the number of cache blocks contained in each set. Main memory blocks are interleaved among the sets. Each cache block in a set has an identifying tag. A word address consists of three fields: the set identifier, the tag, and an address within the block. For each cache access, the set is selected, and the z tags and the addressed word in the z blocks for that set can be accessed simultaneously in the cache. These tags are associatively searched. If a match is found with the address tag, the reference is a hit, and the corresponding word is selected and output at the end of the cache cycle for a read or modified for a write. If no match is found, a miss is declared at the end of the cache cycle.
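To make the set-associative lookup just described concrete, the following Python sketch decodes a word address into its three fields and searches the z tags of the selected set. It is an illustration we have added, not part of the original design: the field order (word-in-block bits in the low-order positions, then the set identifier, then the tag) and the CacheSet structure are assumptions of this sketch and may differ in detail from the layout of Fig. 2.

```python
from dataclasses import dataclass, field

@dataclass
class CacheSet:
    z: int                                      # set size: number of blocks (tags) per set
    tags: list = field(default_factory=list)    # tags of the blocks currently resident

    def lookup(self, tag: int) -> bool:
        """Associatively search the z tags; True means hit, False means miss."""
        return tag in self.tags

def split_address(addr: int, b: int, d: int):
    """Split a word address into (tag, set_id, word_in_block).

    b = log2(words per block), d = log2(number of sets). Assumed layout:
    low-order b bits select the word, the next d bits select the set,
    and the remaining high-order bits form the tag.
    """
    word = addr & ((1 << b) - 1)
    set_id = (addr >> b) & ((1 << d) - 1)
    tag = addr >> (b + d)
    return tag, set_id, word

# Example: a cache with 2**d sets of z = 8 blocks each.
b, d, z = 4, 5, 8
sets = [CacheSet(z) for _ in range(1 << d)]
tag, set_id, word = split_address(0x1A3F, b, d)
hit = sets[set_id].lookup(tag)   # miss here, since no block has been loaded yet
```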
For a miss, the request is rejected and one cache block from the referenced set is selected to be replaced by the referenced block according to a modified LRU replacement policy. In this nonload-through policy, a miss request is resubmitted until a hit results after replacement is complete. The model is extended to load-through in Section III. The LRU (least recently used) replacement policy replaces that block in a set which has not been referenced for the longest period of time. Modified LRU replacement requires that a block brought into the cache does not become eligible for replacement until it has been referenced at least once. This modification avoids the possibility of a deadlock in which, say, two processes keep replacing each other's blocks without a reference to either block being successfully completed. For simplicity, for a given set, at most one block replacement may be active at a time. During replacement of a block in a set, all miss requests to that set are simply rejected, and no further block transfer operations are initiated for that set.

A write-through with buffering updating scheme is adopted in our model. Smith [22] has reported that a sufficiently large buffer for write-through can greatly reduce the performance disadvantage of write-through without buffering with respect to write-back. Write-through with buffering preserves the main memory update advantages and reduces the main memory write bottleneck of a simple write-through scheme. We assume that the buffer size is sufficiently large to prevent blocking due to main memory updating operations. A no-write allocation strategy [2], [3], which brings a block into cache only on a read miss, is also assumed. Write requests thus never cause a cache miss; they are simply written through to main memory, and if the word is present in the cache it is written in the cache.

Simulation experiments have been performed [21] to evaluate these policies and assumptions. Their results corroborate the analytic predictions developed in Section III.
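A small sketch may help pin down the modified LRU eligibility rule described above. The Python class below is a hypothetical illustration of the policy, not the paper's hardware mechanism: it keeps an LRU order within one set and refuses to evict a block that has not been referenced since it was loaded, rejecting the miss instead.

```python
from collections import OrderedDict

class ModifiedLRUSet:
    """One cache set with modified LRU replacement (illustrative model only).

    A block loaded by a miss is not eligible for replacement until it has
    been referenced (hit) at least once, which prevents two streams from
    endlessly replacing each other's blocks.
    """
    def __init__(self, z: int):
        self.z = z                      # set size
        self.blocks = OrderedDict()     # tag -> referenced_since_load (bool), LRU order

    def reference(self, tag: int) -> bool:
        """Return True on a hit; on a miss, try to load the block."""
        if tag in self.blocks:
            self.blocks[tag] = True     # now eligible for replacement
            self.blocks.move_to_end(tag)
            return True
        # Miss: evict the least recently used *eligible* block, if room is needed.
        if len(self.blocks) >= self.z:
            victim = next((t for t, eligible in self.blocks.items() if eligible), None)
            if victim is None:
                return False            # no eligible victim: reject the miss request
            del self.blocks[victim]
        self.blocks[tag] = False        # newly loaded, not yet re-referenced
        return False                    # the missing request itself is not satisfied
```

In the paper's model such a rejected miss request is simply resubmitted by the processor on a later pass; the class above only decides among hit, load, and reject.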
C. Shared-Cache Memory Organization

Briggs and Davidson [20] have developed the L-M memory organization used here for cache. It achieves high performance with economical busing and interconnection. A line is an address bus within the memory. A data bus is associated with each line. However, data buses do not cause any access conflict once the line access conflicts are resolved. The L-M memory organization, shown in Fig. 1, consists of l (= 2^k) lines and m (= 2^(n-k)) memory modules per line, for a total of N (= 2^n = lm) identical memory modules. The set of modules on a line share the same address bus and the same data bus. The address hold time and the data hold time on the address and data buses are assumed to be equal to the bus cycle time of 1 STU. The memory cycle time c is allowed to exceed 1 by latching bus values within each memory module. A memory module is busy during its cycle and cannot accept new requests. A line is busy as long as there is some cache module on that bus which is involved in a block transfer operation. A nonbusy line can accept one request per STU.

Memory interleaving is a common and inexpensive way to yield high effective memory bandwidth [18], [23]-[30]. However, interleaving among the cache memory modules by words is not suitable. If each block were to span all lines in the cache, the entire cache would be busy during block transfers, and prohibitive replication of tag lookup directories would be required in the cache. A set interleaving scheme is proposed here. In the address format of Fig. 2, successive blocks are in successive sets (modulo 2^d). There are 2^d sets in the cache and 2^b words per block. The shared-cache memory can be interleaved by sets by choosing either of the following two cache implementations.

In Fig. 3(a), one or more entire sets are wholly contained in each module. Successive sets are allocated to successive lines (modulo 2^k). Sets assigned to the same line are ordered numerically, and the next set on the line is allocated to the next module (modulo 2^(n-k)) on that line. Tag bits may be uniquely associated with modules, since each block is wholly contained in one module. One line and one module are busy during block transfer. Thus, this implementation requires new blocks to be loaded into the cache one word per cache cycle during the block transfer operation. For a slow cache or a large block, the block transfer time is large, and all the modules on a line being used for block transfer will be blocked for a long period.

In Fig. 3(b), one or more entire sets are wholly contained in each line. Successive sets are allocated to successive lines (modulo 2^k). Successive words of a block are allocated to successive modules (modulo 2^(n-k)) on the same line. During the block transfer operation, new blocks are loaded into the cache one word per bus cycle, instead of one per cache cycle. However, since each block is spread over the m modules on a line, tag directories may have to be replicated m times (or a fast tag memory could be associated with each line).

The choice between the two implementations involves tradeoffs between block transfer time and tag directory cost. This choice does not affect the analysis in Section III. The two become identical if m = 1. The implementation of Fig. 3(a) is assumed for further discussion. For example, a shared-cache memory interleaved by sets with k = 3 and n = 5 has l = 2^k = 8 lines and m = 2^(n-k) = 4 modules per line. It is referred to as an (l, m) or (8, 4) configuration. If a 4096 word cache memory has b = 4 and d = 5, then the block size is 2^b = 16, the number of sets is 2^d = 32, the number of blocks is 4096/16 = 256, and the set size is 256/32 = 8.
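The numerical example above can be reproduced mechanically. The short Python sketch below, with hypothetical function and variable names of our own choosing, derives the configuration and set parameters from k, n, b, d and the cache capacity.

```python
def lm_cache_parameters(k: int, n: int, b: int, d: int, capacity_words: int):
    """Derive L-M shared-cache parameters from the address-field widths.

    k = log2(lines), n = log2(total modules),
    b = log2(words per block), d = log2(sets).
    """
    lines = 2 ** k                     # l
    modules_per_line = 2 ** (n - k)    # m
    total_modules = 2 ** n             # N = l * m
    block_size = 2 ** b                # words per block
    num_sets = 2 ** d
    num_blocks = capacity_words // block_size
    set_size = num_blocks // num_sets  # z, blocks per set
    return dict(l=lines, m=modules_per_line, N=total_modules,
                block_size=block_size, sets=num_sets,
                blocks=num_blocks, set_size=set_size)

# The (8, 4) example from the text: k = 3, n = 5, b = 4, d = 5, 4096 words.
print(lm_cache_parameters(3, 5, 4, 5, 4096))
# -> {'l': 8, 'm': 4, 'N': 32, 'block_size': 16, 'sets': 32, 'blocks': 256, 'set_size': 8}
```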
D. System Configuration and Request Scheduling

For simplicity, a p-by-l crossbar is assumed to interconnect the processors and the cache lines. However, no interconnection network is required between the shared cache and the main memory. All the addresses which are mapped into the cache modules on one line can easily be allocated to a set of main memory modules associated only with that particular line. Note that more than one main memory module, interleaved by low-order bits, can be attached to each line to provide high block transfer bandwidth. Fig. 4 shows the proposed system. The write-through buffers may be associated with each main memory module. Secondary memory is not shown.

One might be concerned that the crossbar increases shared cache access time above that for private cache memory. However, with suitable pipelining at the crossbar, line bandwidth can be maintained. The parallel-pipelined processor requires high bandwidth to maintain throughput; however, large access time can be accommodated with no loss of throughput by using a suitably large degree s of processor pipelining (multistreaming).

Cache access conflict occurs when a request attempts to access a busy line or module or when two or more simultaneous cache memory requests attempt to access the same line. A set of p simultaneous requests arrives at the crossbar each STU. Requests are assumed to be independent and uniformly distributed in the cache. If multiple requests attempt to access the same line, all but one are immediately rejected. Rejected requests are discarded in the model; however, it may be assumed that they are resubmitted after an s STU noncomputing pass through the processor pipeline. The model is unaware of resubmitted requests and treats them as new random requests. When a line is not performing a block transfer, it can accept a new request each STU; a cache module can accept one request per c STU's, where c is the module cycle time.
Fig. 1. L-M memory organization.
Fig. 2. Address format for a set associative cache memory.
Fig. 3. Address format for two implementations of interleaving by sets.
Fig. 4. Shared-cache system organization.
A module thus appears busy to any request arriving within c - 1 STU's of a prior request accepted by the module. A request which addresses a busy module is rejected. Note that there may be some busy modules on a line when a miss occurs on that line. It is assumed that all requests in process within busy modules on a line will be aborted when an earlier request causes a cache miss on that line at the end of its cache cycle. At the end of a module cycle, a served request has probability h of being a hit. If the served request is a miss, all requests being served by other modules on that line are aborted, and a block transfer is initiated on that line. The line is then busy for T STU's, where T is the block transfer time, and all requests arriving at that line during these T STU's are rejected. The effect of a cache miss then is to tie up a line for c + T STU's. Note that the model can evaluate either of the two implementations of Fig. 3 by simply adjusting T.

The interleaved allocation in the cache and the period of s cycles between requests from the same stream tend to make the independent, uniform distribution of requests assumption in the model more realistic. The value of s is assumed to be greater than c, so that each stream is sequentially processed. The hit ratio h is assumed to be independent of cache access conflict. For analytical purposes, the hit ratio is left unevaluated and is treated as an independent model parameter.
Although cache memory access conflicts indeed affect the reference patterns, the hit ratios should not be disturbed significantly by the access conflicts if the hit ratios are sufficiently high. As with the working set [31] concept for a paging system, a block should reside in the cache for a while before it is removed.

The analytical model is oriented toward developing the probability of acceptance PA for a typical shared-cache memory request. The performance measurement, CPU utilization, is derived from PA and h. Miss requests which result in block transfers are considered to be accepted, but do not contribute to CPU utilization. Since c and T implicitly characterize the speeds of cache and main memory, (c, T) is defined as the cycle characteristic of the system.
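Before turning to the analytic treatment, the single-line model just described can also be exercised directly. The Python sketch below is a minimal Monte Carlo rendering of that model (uniform random requests, immediate rejection of conflicts, a line tied up for c + T STU's on a miss); it is our illustration, not the trace-driven simulator of [21], and its estimate of PA is only as trustworthy as those simplifying assumptions.

```python
import random

def simulate_line(p, l, m, c, T, h, stus=200_000, seed=1):
    """Estimate the probability of acceptance PA for one cache line.

    Each STU, p independent requests each pick a random line; at most one
    reaches this line. That request is accepted if the line is free and it
    hits an idle module. A served request is a hit with probability h;
    otherwise the line is tied up for c + T STU's (block transfer) and any
    in-flight requests on its modules are aborted.
    """
    rng = random.Random(seed)
    module_timer = [0] * m        # STU's until each module is free again
    line_busy = 0                 # STU's of block transfer remaining
    offered = accepted = 0

    for _ in range(stus):
        arrivals = [r for r in range(p) if rng.randrange(l) == 0]
        if arrivals:
            offered += len(arrivals)              # all requests to this line count as offered
            if line_busy == 0:
                target = rng.randrange(m)         # uniform module within the line
                if module_timer[target] == 0:
                    accepted += 1
                    if rng.random() < h:
                        module_timer[target] = c  # hit: module busy for its cycle
                    else:                         # miss: abort others, start block transfer
                        module_timer = [0] * m
                        line_busy = c + T
        # advance time by one STU
        module_timer = [max(0, t - 1) for t in module_timer]
        line_busy = max(0, line_busy - 1)

    return accepted / offered if offered else 0.0

# Example: rough PA for an (l, m) = (8, 4) cache, (c, T) = (3, 16), h = 0.98, p = 4.
print(round(simulate_line(p=4, l=8, m=4, c=3, T=16, h=0.98), 3))
```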
III. PERFORMANCE ANALYSIS

Discrete Markov models were developed to predict the performance of a parallel-pipelined processor of arbitrary order (s, p) with any (l, m) shared-cache memory configuration and cycle characteristic (c, T). The model has been solved for c = 1, 2, and 3 as well as for m = 1. General lower and upper bounds for any c were also derived. A brief summary of the analytical results and some of their derivations are presented in this section. Since all lines in an L-M shared-cache memory system are identical and independent, a single line model, instead of a total system model, is sufficient to analyze the system performance.

In the following, first we describe the generation of the Markov state diagram. Fig. 5 is an example of such a diagram and is helpful in understanding the following description. It represents a line state diagram for cycle characteristic (c, T) = (3, 10). The module state is 0 for an idle module. It becomes 1 when a request is accepted. On successive STU's, it advances from 1 to c - 1. If the request served was a hit, it then returns to state 0 and a new request can be accepted by that module in the next STU. A continuously busy module with no misses thus repeatedly cycles through states 0, 1, ..., c - 1. If the request served is a miss, the module state advances from c - 1 to c. When any module on a line advances to state c, the state of every other module on that line is forced to 0 to signify aborted requests. Once a module is in state c, it advances from c to c + T - 1 on successive STU's. It then returns to state 0. A module is busy with respect to new requests when its state is not 0.

The line state is the set union of the states of all modules on that line. Note that two modules on the same line can never have the same nonidle state. A line is busy if it is in one of the states c, ..., c + T - 1. A line state is a potential acceptance state if it includes the element 1 (states marked with * in Fig. 5); otherwise, it is a nonacceptance state. Each potential acceptance state corresponds to an accepted request if the corresponding request is not subsequently aborted due to a miss in another module on that line. Accepted requests have probability h of contributing to system performance. A line state is a checking state if it contains the element c - 1. A checking state has probability 1 - h of going to state c, signifying a miss.

The next line state is determined as follows. A busy line state is incremented on successive STU's to state c + T - 1. It then returns to the idle state. The elements of a nonchecking idle line state are incremented each STU; a new element, 1, is joined to the line state if a new request is accepted.
Fig. 5. Line state diagram for shared cache with cycle characteristics (c, T) = (3, 10).

The probability of transition from an idle line state X to its successor potential acceptance state is

    Pa(X) = (m - |X|)qh/m    if c - 1 ∈ X
    Pa(X) = (m - |X|)q/m     otherwise

where |X| is the cardinality of line state X. Obviously, the probability of transition from a checking state X to its successor busy line state is Pb(X) = 1 - h. Note that m - |X| is the number of idle modules on an idle line in state X. Since the probability of a line being referenced is q, the probability that some idle module on a particular idle line is referenced is (m - |X|)q/m. Therefore, Pa = (m - |X|)q/m if the referenced idle line state X is not a checking state; otherwise, Pa = (m - |X|)qh/m because a request is potentially accepted at the checking state if and only if the module with state c - 1 results in a hit.

Corollary 1: The probability of transition from an idle line state X to its successor nonacceptance state is

    Pn(X) = h - Pa(X)    if c - 1 ∈ X
    Pn(X) = 1 - Pa(X)    otherwise.

Note that Pa(X) + Pn(X) = 1 for all nonchecking idle line states and Pa(X) + Pb(X) + Pn(X) = 1 for the checking states. Corollary 1 follows immediately. Note also that the probability of transition from a busy line state X to its successor busy line state (or to the idle state if X = {T + c - 1}) is 1.

Theorem 3: The total number of distinct line states N(c, T) for cycle characteristic (c, T) is

    N(c, T) = 2^(c-1) + T                                  for m > c - 1
    N(c, T) = C(c-1, 0) + C(c-1, 1) + ... + C(c-1, m) + T  for m ≤ c - 1

where C(i, j) denotes the binomial coefficient.

For c > 3, the solution of the Markov model is extremely complex. We do not have a general solution for PA(c, T, p). However, reasonable upper and lower bounds for PA(c, T, p) can be obtained to provide a rough prediction of the performance.

Theorem 5: A lower bound on the probability of acceptance for a given (l, m) memory is

    PA(c, T, p) ≥ lNq / [Np + Npq(1 - h)(T + c - 1) + lpqh(c - 1)].

A probabilistic model [21] was developed to derive this lower bound. The lower bound is derived by assuming that all aborted requests contribute to the busy module conflicts.

Theorem 6: An upper bound on the probability of acceptance for a given (l, m) memory is

    PA(c, T, p) ≤ Nq / {p + pq[T(1 - h) + (c - 1)]}.

Since increasing l cannot decrease performance, the performance of any (l, m) memory configuration is less than or equal to the performance of an (N, 1) configuration with N = lm. The derivation of the bound follows from the analysis of the Markov model for a memory with N lines and one module per line.
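To see how the reconstructed bounds behave numerically, the sketch below evaluates Theorems 3, 5, and 6 in Python. The per-line request rate q is taken here as 1 - (1 - 1/l)^p for the configuration in question (l lines for the lower bound, N lines for the (N, 1) upper bound); treating q this way is our assumption, consistent with the load-through expression for q given later with a = 1.

```python
from math import comb

def line_request_rate(lines: int, p: int) -> float:
    """Probability that a given line is referenced in an STU by at least one
    of p uniform, independent requests."""
    return 1.0 - (1.0 - 1.0 / lines) ** p

def num_line_states(c: int, T: int, m: int) -> int:
    """Theorem 3: number of distinct line states for cycle characteristic (c, T)."""
    idle = sum(comb(c - 1, i) for i in range(min(m, c - 1) + 1))
    return idle + T                 # equals 2**(c-1) + T when m >= c - 1

def pa_lower(l: int, m: int, c: int, T: int, h: float, p: int) -> float:
    """Theorem 5 lower bound on the probability of acceptance PA."""
    N, q = l * m, line_request_rate(l, p)
    return l * N * q / (N * p
                        + N * p * q * (1 - h) * (T + c - 1)
                        + l * p * q * h * (c - 1))

def pa_upper(l: int, m: int, c: int, T: int, h: float, p: int) -> float:
    """Theorem 6 upper bound: the (N, 1) configuration with N = l*m lines."""
    N = l * m
    q = line_request_rate(N, p)
    return N * q / (p + p * q * (T * (1 - h) + (c - 1)))

# Example: an (8, 4) cache, (c, T) = (3, 16), h = 0.98, p = 4 processors.
print(num_line_states(3, 16, 4),
      round(pa_lower(8, 4, 3, 16, 0.98, 4), 3),
      round(pa_upper(8, 4, 3, 16, 0.98, 4), 3))
```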
Theorem 7: The CPU utilization U for a shared-cache memory is

    U = 1 / [1/PA + (1 - h)T″]

where T″ = ⌈T/s⌉.

Let T″ be the block transfer time relative to the pipelined processor cycle time s, i.e., T″ = ⌈T/s⌉. Recall that a rejected request will cause the corresponding processor to make a null pass through one cycle. Hence, the total number of null passes for each satisfied request is

    PA(1 - h)T″ + (1 - PA)PA[1 + (1 - h)T″] + (1 - PA)^2 PA[2 + (1 - h)T″] + ... = (1 - h)T″ + 1/PA - 1

where (1/PA) - 1 is the penalty for the access conflicts and (1 - h)T″ is the penalty for a cache miss. Therefore, the total number of passes a request must take is

    (1 - h)T″ + (1/PA - 1) + 1 = 1/PA + (1 - h)T″.

Theorem 7 follows immediately. Note that this formula does not consider the situation in which processors have to make an extra request to obtain the data from the cache after a block transfer operation has been completed. However, for a high performance system, the performance difference caused by this one extra request is negligible.

So far, a nonload-through policy has been assumed. Also, the processor request rate has been assumed to be one, and blocked requests have been handled by resubmitting them as new requests one instruction cycle later. If load-through capability is provided, miss data are automatically forwarded to the requesting processor during T, and processors do not have to resubmit a cache miss request. Let W denote the processor waiting time, measured in STU's, for obtaining the miss data after a cache miss occurs. Usually, this waiting time is approximately equal to the main memory cycle. Since each cache miss causes no request for ⌈W/s⌉ instruction cycles, the processor request rate a, as seen by the cache, is given by

    a = (1/PA) / [1/PA + (1 - h)⌈W/s⌉].

By an argument similar to that used in the proof of Theorem 7, it is obvious that each request will extend to 1/PA requests due to cache access conflicts, and each miss causes no request for ⌈W/s⌉ instruction cycles. The expression for a follows.

Assume that the block transfer time T is not affected by load-through. The probability of acceptance PA for load-through is evaluated as before, except that the cache request rate q seen by a particular line in the shared cache is now 1 - (1 - a/l)^p. Corollary 2 is changed to PA = lPA1/(ap). Using this q in the equation for PA, we have an equation in which PA is the only unknown. This equation can be numerically solved using standard iterative techniques. A suitable initial value for PA is obtained by setting a = 1. The CPU utilization is then obtained as

    U = 1 / [1/PA + (1 - h)⌈W/s⌉].

Note that load-through reduces the waiting time required to obtain the data which causes misses. In order to access the next data, the processor may still have to wait until the block transfer operation is completed. This situation is due to program localities which may cause the next data accessed to be in the same block as the currently referenced data. Since the addresses of the requests are assumed to be independent and random, this neglected effect has not been modeled. However, this effect is reduced when the difference between W and T is small. A further extension of this model for tag table lookup to determine hit or miss explicitly prior to the cache cycle is found in [21].
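The iterative solution sketched above can be organized as a simple fixed-point loop. The Python fragment below is our own illustration of that structure: pa_given_q is only a stand-in for the paper's exact per-line Markov solution (we reuse the Theorem 6 style expression), so its numbers are not the paper's results; what it shows is the alternation between the request rate a, the per-line rate q, and PA, followed by the load-through utilization formula.

```python
from math import ceil

def pa_given_q(q, l, m, c, T, h, p, a):
    """Stand-in for the exact per-line Markov solution. Here we reuse the
    Theorem 6 style expression with offered rate a*p; the paper instead
    solves the line Markov chain. This placeholder only illustrates the
    shape of the fixed point."""
    N = l * m
    return min(1.0, N * q / (a * p + a * p * q * (T * (1 - h) + (c - 1))))

def load_through_utilization(l, m, c, T, W, h, p, s, tol=1e-9, max_iter=200):
    """Iterate PA and the request rate a until mutually consistent,
    then return (PA, U) for the load-through policy."""
    PA = 1.0                       # initial value corresponds to a = 1
    for _ in range(max_iter):
        a = (1 / PA) / (1 / PA + (1 - h) * ceil(W / s))   # request rate seen by the cache
        q = 1.0 - (1.0 - a / l) ** p                      # per-line request rate
        PA_new = pa_given_q(q, l, m, c, T, h, p, a)
        if abs(PA_new - PA) < tol:
            PA = PA_new
            break
        PA = PA_new
    U = 1.0 / (1.0 / PA + (1 - h) * ceil(W / s))          # CPU utilization
    return PA, U

# Example: (l, m) = (4, 4), (c, T) = (3, 20), W = 8, h = 0.95, p = 8, s = 4.
print(load_through_utilization(4, 4, 3, 20, 8, 0.95, 8, 4))
```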
Fig. 6. Effect of N on U for l = 4.
Fig. 7. Effect of l on U for N = 1024 and c = 1.
IV. ANALYSIS OF RESULTS

In this section, we present the effect of several system parameters on performance. This study is based on data from numerical solutions of the analytic models developed in the last section. The formulas used are shown in Table I. Where we need to highlight the effects of cache access conflicts on performance, we have chosen a high hit ratio of 0.98, which reduces the cache miss penalty. Where we need to highlight the miss penalty, the tradeoffs shown are better seen with a low hit ratio of 0.8. For all of the following discussion, the number of segments of a pipeline was set at s = 4.

Fig. 6 illustrates the effect of the number of cache modules N on CPU utilization for l = 4. In general, an increase in N increases the performance for given l, p, h, T, and c (>1). However, c does not have a significant effect on U for large N and h. For c = 1, N has no effect on performance because there is no busy module collision. The graph shows that the block transfer time T has a significant effect on U. This effect becomes larger for c > 1 as N increases. As an illustration, suppose that p = 1 and U is required to be 0.75. Using (c, T) = (3, 32), N is required to be at least 256, whereas if (c, T) = (3, 16), N may be as low as 16. In either case, N is significantly larger than pc.

If T is primarily dominated by the main memory cycle, i.e., the only way to reduce T is faster main memory, another tradeoff between c and T for each p can be found in the graph. Suppose that p = 4, and that cache memory can at most be divided into 16 modules due to practical restrictions on the module size for a given cache capacity. Then the performance at (c, T) = (1, 32) is higher than that at (c, T) = (3, 16). In addition, the cost of a system using (c, T) = (1, 32) may be lower than with (c, T) = (3, 16). Since the size of the main memory is usually much larger than the size of the cache memory, speeding up the main memory may be much more expensive than speeding up the cache memory. However, the reverse tradeoff holds for this example if N = 32 is allowed. Note that there is significant improvement in performance from increasing N when l is close to p and c is greater than 1.

The effect of the number of lines l on performance is shown in Fig. 7 for c = 1 and N = 1024. The graph shows that poor and undesirable performance occurs in the region l < p. For l > p, the probability of acceptance PA is close to 1 and the performance is limited only by cache miss penalties. Some tradeoff between l and T for certain p can be obtained from Fig. 7. For example, for p = 8, with l = 256 and (c, T) = (1, 32), performance is similar to that with l = 64 and (c, T) = (1, 16).

Fig. 8 illustrates the effect of the number of processors p on the total system throughput. For sufficiently large p, both PA and U decrease in proportion to an increase in p, thus maintaining a constant total throughput.

Fig. 9 shows the effect of miss penalties on performance. To highlight the contributions of T and (1 - h), low access conflict is required; accordingly, l >> p is used in the figure. As discussed in the analysis of Fig. 7, the access conflicts are low for l >> p.
Fig. 8. Effect of p on pU for N = 256 and c = 1.
The graph shows, for example, that U = 0.7 for (1 - h) = 0.1 and T = 16, as well as for (1 - h) = 0.05 and T = 32. Therefore, doubling (1 - h), in effect, requires T to be halved in order to keep constant performance for l >> p. However, if T is primarily determined by the main memory cycle, then halving T may generally cost more than halving (1 - h) because the size of main memory is usually much larger than that of cache memory. It is obvious that an optimal design should attempt to minimize the miss penalty (1 - h)T″ instead of the miss ratio (1 - h) only.

In general, the block transfer time T may be expressed as a linear function of the block size. Let Bs represent the block size. Assume that the processor waiting time W in load-through is equal to the main memory cycle. Thus, the block transfer time can be expressed approximately as T = W + uBs, where u is the transfer rate between the cache and the main memory. Fig. 10 illustrates the performance difference between load-through and nonload-through for W = 4. In this case, varying T can be explained as the result of the variation in memory transfer rate or block size, whereas fixed W implies a fixed main memory cycle. It can be seen that load-through performs significantly better than nonload-through, especially for small h and a large difference between W and T.

Fig. 11 shows the performance variation for load-through due to various waiting times (W curves) for h = 0.8. To highlight the contributions of W and Bs to the miss penalty, a low value of h is chosen. Since u is assumed to be 1 in Fig. 11, the difference between T and W is the block size. In this case, load-through performs significantly better than nonload-through for large block sizes. The Bs curves in Fig. 11 illustrate this effect. The Bs curves are constructed by selecting T = W points on each W curve and connecting points with constant Bs over all W curves.
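The equal-penalty observation for Fig. 9 follows directly from Theorem 7 when PA is close to 1 (l >> p). The few lines of Python below, added as an illustration, check the two quoted design points with s = 4.

```python
from math import ceil

def utilization(pa: float, miss_ratio: float, T: int, s: int) -> float:
    """CPU utilization from Theorem 7: U = 1 / (1/PA + (1 - h) * ceil(T/s))."""
    return 1.0 / (1.0 / pa + miss_ratio * ceil(T / s))

s = 4
# With PA ~ 1 (l >> p), both design points give the same miss penalty
# (1 - h) * ceil(T/s) = 0.4 and hence U ~ 0.71, roughly the 0.7 read from Fig. 9.
print(utilization(1.0, 0.10, 16, s))   # (1 - h) = 0.10, T = 16 -> 0.714...
print(utilization(1.0, 0.05, 32, s))   # (1 - h) = 0.05, T = 32 -> 0.714...
```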
Fig. 9. Effect of (1 - h) on U for l = N = 256.

It should be pointed out that the assumptions made for analytical purposes do not cause a significant deviation of the analytic model from reality for reasonably high performance systems. Trace-driven simulators have been developed based on the model described in Section II. Several real program traces were used in the simulation experiments to generate address sequences for cache memory requests. In the simulation experiments, the blocked requests are resubmitted, instead of discarded, until they are satisfied. The simulation experiments were performed for the extreme cases of high and low cache access conflict. Our experiments [21] show that the percentage variation of the hit ratios over a range of access conflicts is less than 1 percent. Therefore, the effect of cache memory interference on hit ratio is insignificant. The assumption of independence and randomness of the reference patterns was tested by comparing the simulation measurements to the analytical predictions. The simulation experiments produced results within 5.2 percent of the analytical results when the measured CPU utilization is higher than 0.77.
V. CONCLUSION

In this paper, a simple and flexible shared-cache memory system organization for parallel-pipelined processors is proposed. Cache memory management policies suitable for shared-cache systems are also presented. The performance of such a system is analyzed for a variety of parameters. The cache coherence problem in conventional multiprocessors with private caches can be totally eliminated by sharing the caches. Since shared space could be divided among the processes according to their needs, and since only single copies of shared blocks are stored in the cache, shared cache under an effective management policy can yield a hit ratio higher than that for private cache.
Fig. 10. Performance comparison between load-through and nonload-through for a fixed W = 4.
Fig. 11. Performance comparison between load-through and nonload-through for various W and Bs.
Shared cache can thus result in higher system performance for those configurations that keep the access conflict at low levels. Control of a shared cache is expected to be less costly than for multiple private caches. For shared-cache systems, we have shown that for a larger number of cache memory modules, the effect of the cache cycle time is insignificant. It is seen that in order to obtain reasonable performance, the number of lines l should be greater than the number of processors p. For sufficiently large l, the system performance primarily depends on the miss penalty (1 - h)T″. We illustrated that load-through is significantly better than nonload-through for a small hit ratio and a large block transfer time.

REFERENCES

[1] J. S. Liptay, "Structural aspects of the System/360 Model 85, Part II: The cache," IBM Syst. J., vol. 7, pp. 15-21, 1968.
[2] W. D. Strecker, "Cache memories for PDP-11 family computers," in Proc. 3rd Annu. Symp. Comput. Architecture, Jan. 1976, pp. 155-158.
[3] K. R. Kaplan and R. O. Winder, "Cache-based computer systems," Computer, pp. 30-36, Mar. 1973.
[4] R. M. Meade, "On memory system design," in AFIPS Proc., Fall Joint Comput. Conf., vol. 37, 1970, pp. 33-43.
[5] S. S. Sisson and M. J. Flynn, "Addressing patterns and memory handling algorithms," in AFIPS Proc., Fall Joint Comput. Conf., vol. 33, part 2, 1968, pp. 957-967.
[6] A. J. Smith, "Comparative study of set associative memory mapping algorithms and their use for cache and main memory," IEEE Trans. Software Eng., vol. SE-4, pp. 121-130, Mar. 1978.
[7] G. S. Rao, "Performance analysis of cache memories," J. Ass. Comput. Mach., vol. 25, pp. 378-395, July 1978.
[8] J. H. Patel, "Analysis of multiprocessors with private cache memories," IEEE Trans. Comput., vol. C-31, pp. 296-304, Apr. 1982.
[9] L. M. Censier and P. Feautrier, "A new solution to coherence problems in multicache systems," IEEE Trans. Comput., vol. C-27, pp. 1112-1118, Dec. 1978.
[10] C. K. Tang, "Cache system design in the tightly coupled multiprocessor system," in AFIPS Proc., Natl. Comput. Conf., vol. 45, 1976, pp. 749-753.
[11] C. J. Conti, "Concepts for buffer storage," IEEE Comput. Group News, vol. 2, pp. 9-13, Mar. 1969.
[12] L. A. Belady, "Study of replacement algorithms for virtual storage computers," IBM Syst. J., vol. 5, pp. 78-101, 1966.
[13] C. G. Bell and W. A. Wulf, "C.mmp-A multiminiprocessor," in AFIPS Proc., Fall Joint Comput. Conf., vol. 41, part II, 1972, pp. 765-777.
[14] T. C. Chen, "Parallelism, pipelining, and computer efficiency," Comput. Design, pp. 365-372, Jan. 1971.
[15] L. E. Shar and E. S. Davidson, "A multiminiprocessor system implemented through pipelining," Computer, vol. 7, pp. 42-51, Feb. 1974.
[16] W. J. Kaminsky and E. S. Davidson, "Developing a multiple-instruction-stream single-chip processor," Computer, vol. 12, pp. 66-76, Dec. 1979.
[17] J. S. Emer and E. S. Davidson, "Control store organization for multiple stream pipelined processor," in Proc. Int. Conf. Parallel Processing, 1978, pp. 43-48.
[18] W. D. Strecker, "An analysis of the instruction execution rate in certain computer structures," Ph.D. dissertation, Carnegie-Mellon Univ., Pittsburgh, PA, 1970.
[19] D. L. Weller and E. S. Davidson, "Optimal searching algorithms for parallel-pipelined computers," Springer-Verlag Lecture Notes on Comput. Sci., no. 24, Aug. 1975, pp. 90-98.
[20] F. A. Briggs and E. S. Davidson, "Organization of semiconductor memories for parallel-pipelined processors," IEEE Trans. Comput., vol. C-26, pp. 162-169, Feb. 1977.
[21] C. C. Yeh, "Shared cache organization for multiple-stream computer systems," Univ. Illinois, Urbana, Coordinated Science Lab., Rep. R-904, Jan. 1981.
[22] A. J. Smith, "Characterizing the storage process and its effect on the update of main memory by write through," Commun. Ass. Comput. Mach., vol. 26, pp. 6-27, Jan. 1979.
[23] D. E. Knuth and G. S. Rao, "Activity in interleaved memory," IEEE Trans. Comput., vol. C-24, pp. 943-944, Sept. 1975.
[24] G. J. Burnett and E. G. Coffman, "A study of interleaved memory," in AFIPS Proc., Spring Joint Comput. Conf., vol. 36, 1970, pp. 467-474.
[25] G. J. Burnett and E. G. Coffman, "Analysis of interleaved memory systems using blockage buffers," Commun. Ass. Comput. Mach., vol. 18, pp. 91-95, Feb. 1975.
[26] C. Skinner and J. Asher, "Effect of storage contention on performance," IBM Syst. J., vol. 8, no. 4, pp. 319-333, 1969.
[27] C. V. Ravi, "On the bandwidth and interference in multiprocessors," IEEE Trans. Comput., vol. C-21, pp. 899-901, Aug. 1972.
[28] D. P. Bhandarkar, "Analysis of memory interference in multiprocessors," IEEE Trans. Comput., vol. C-24, pp. 897-908, Sept. 1975.
[29] K. V. Sastry and R. Y. Kain, "On the performance of certain multiprocessor computer organizations," IEEE Trans. Comput., vol. C-24, pp. 1066-1074, Nov. 1975.
[30] F. Baskett and A. Smith, "Interference in multiprocessor computer systems with interleaved memory," Commun. Ass. Comput. Mach., vol. 19, pp. 327-334, June 1976.
[31] P. J. Denning, "The working set model for program behavior," Commun. Ass. Comput. Mach., vol. 11, pp. 323-333, May 1968.
Phil C. C. Yeh (S'79-M'81) was born in Taiwan, Republic of China, on June 12, 1950. He received the B.E. degree in electronic engineering from Chung Yuan Christian College for Science and Engineering, Taiwan, Republic of China, in 1972, the M.S. degree in electrical engineering from Northwestern University, Evanston, IL, in 1975, and the M.S. degree in computer science and the Ph.D. degree in electrical engineering from the University of Illinois, Urbana-Champaign, in 1977 and 1982, respectively. Currently, he is a CPU Architect at the IBM Corporation, Poughkeepsie, NY. His research interests include computer architecture, parallel processing, and performance evaluation. Dr. Yeh is a member of the Association for Computing Machinery and the IEEE Computer Society.
Janak H. Patel (S'73-M'76) was born in Bhavnagar, India. He received the B.Sc. degree in physics from Gujarat University, India, the B.Tech. degree from the Indian Institute of Technology, Madras, and the M.S. and Ph.D. degrees from Stanford University, Stanford, CA, all in electrical engineering. From 1976 to 1979 he was an Assistant Professor of Electrical Engineering at Purdue University, West Lafayette, IN. Since 1980 he has been with the University of Illinois, Urbana-Champaign, where he is currently an Assistant Professor of Electrical Engineering and a Research Assistant Professor with the Coordinated Science Laboratory. He is presently engaged in research and teaching in the areas of computer architecture, VLSI, and fault-tolerant systems. Dr. Patel is a member of the Association for Computing Machinery.
Edward S. Davidson (S'67-M'68-SM'78) was born in Boston, MA, on December 27, 1939. He received the B.A. degree in mathematics from Harvard University, Cambridge, MA, in 1961, the M.S. degree in communication science from the University of Michigan, Ann Arbor, in 1962, and the Ph.D. degree in electrical engineering from the University of Illinois, Urbana-Champaign, in 1968. He was with Honeywell from 1962 to 1965. From 1968 to 1973 he was an Assistant Professor of Electrical Engineering at Stanford University, Stanford, CA. He returned to the University of Illinois as an Assistant Professor in 1973, advanced to Associate Professor in 1975, and Professor of Electrical Engineering and the Coordinated Science Laboratory in 1980. He has performed research in computer architecture, parallel and pipeline processing, VLSI systems, and fault tolerance. He has served as a consultant to Hewlett-Packard, Honeywell, Fort Monmouth, Sperry, the Defense Nuclear Agency, and others. Dr. Davidson has been the IEEE Computer Society Western Area Chairman and is presently chairman of ACM SIGARCH.