Cache Write Generate For High-Performance Processing

Craig M. Wittenbrink, Arun K. Somani, and Chung-Ho Chen

Department of Electrical Engineering and Department of Computer Science and Engineering, University of Washington, FT-10, Seattle, Washington 98195

Telephone: Arun K. Somani: (206) 685-1602
e-mail: Craig M. Wittenbrink: [email protected]; Arun K. Somani: [email protected]; Chung-Ho Chen: [email protected]

Current Address: The Baskin Center for Computer Engineering & Information Sciences, University of California, 225 Applied Sciences Building, Santa Cruz, CA 95064

July 19, 1994


Abstract: Much attention has been paid to read caching, and several schemes have been developed to make read caching very efficient. As a result, the performance of write caching has become a concern. This paper investigates write caching policies and how they affect the performance of memory systems. We show that write caching can greatly alter the hit/miss ratios, but only more subtly affects the performance. Many factors such as write buffers, contention, split transactions, and multilevel caches dilute the correlation of cache miss ratios to performance. A more accurate performance formula is presented taking these factors into account. This allows an incremental improvement in multiprocessor cache architectures. Cache write generate is a scheme where a cache line is validated on a write miss without fetching from memory. It avoids unnecessary reads from main memory, reduces the CPU stalling time, lowers the cache miss latency, reduces bus contention, and thus increases the available bandwidth of the memory. We compare the performance of cache write generate with the performance of write around and write allocate in single processor and shared bus multiprocessors. We use register level simulations validated by our functioning hardware prototype, the Proteus system. Various memory speeds and differing numbers of processors are evaluated using detailed simulation models for high performance measurement accuracy.

Keywords: Cache memory design, write caching, cache generate, shared memory parallel architecture, memory system performance.

1 Introduction

Caches are small fast memories that hold frequently used data. Caches play an important role in achieving higher performance in modern uni- and multi-processors. When a high percentage of reads and writes are made to the cache, the effective bandwidth of the memory is that of the cache. Many prior studies have focused on read caching. As a result, read caching has become highly efficient with multilevel cache systems, prefetching, and multi-way set associative memories. Surveys of the variety of cache control mechanisms and their effect on hit ratios, the percentage of accesses served by the cache, are presented by Smith [15], Hennessy and Patterson [7], and Stone [17]. Caches differ in their size, control, and organization. Effects of organization, such as the associativity, can be easily investigated by trace analysis [7] [15] [17]. When memory read performance is decoupled from write performance, traces are adequate [9]. In multiprocessor design write performance also becomes significant, because when caches are large enough reads are very efficient, and writes constitute a larger percentage of the bus traffic. Writes have more variability and are differentiated in their control on hits and misses. On cache write hits data can be: (1) updated only in the cache, write back, or (2) updated in both the cache and the main memory, write through. When a write misses in the cache it can (1) be sent past the cache to the memory, write around, (2) force a read of the data and an update of the cache, write allocate, or (3) directly update the cache, cache generate: the cache tag is updated without reading from memory, and the line is marked dirty and modified with the written data. With write around the memory operation is write through, but with allocate and write generate, memory operations can be write through or write back. In shared memory processing, write back for hits is the preferred alternative because of the limited bandwidth of a shared bus [11]. Because the amount of available bandwidth and the amount of traffic generated determine the performance of shared memory systems, hit ratios are not adequate to study the performance differences of the write miss controls. Additionally, write buffers do not adequately protect the processor from the effects of writes when shared buses saturate. To understand the effects of writes on performance, consider the following model for program run time.

The number of each type of instruction in a program is l, s, b, and o for loads, stores, branches, and others. Thus, we have N = l + s + b + o instructions. The average number of processor cycles to execute each instruction type is cl, cs, cb, and co, respectively. The time for a program to complete is

  T = l·cl + s·cs + b·cb + o·co.   (1)

By changing the system design, cl, cs, cb, and co will vary. The cache design, including cache size, associativity, line size, and control, changes the load hit ratio lh and store hit ratio sh. Instruction execution may be affected by other instructions because of the interaction between on-chip instruction and data caches, the on-chip execution unit, and on-chip write buffers. The cycle times are a function of all other parameters. We simplify by making several assumptions. If sufficient write buffers are used on chip, then write hits and write misses carry the same penalty. Furthermore, assuming that the write buffer is never overfilled prevents the stall of following writes. But write buffers do affect reads and instruction cache misses, because writes on the bus interfere with cache refills. If write allocate is used, then write misses causing refills affect the cache refill time for load and store misses. Thus, the values of cl and cs are functions of both the number of load and store misses and the size of the write buffer. If we assume that in the ideal case cl = cs = cb = co = 1, whereas in the actual case cl = 1 + xl and cs = 1 + xs, then the ideal and the actual execution times are given by the following.

  Tideal = l + s + b + o   (2)
  Tactual = Tideal + l·xl + s·xs

The second and the third terms in the expression for Tactual correspond to the performance degradation due to the load and store misses. Suppose the processor uses write allocate and has no write buffers. If the number of stores is about 10% and the average number of extra cycles is xs = 1, then the loss in performance due to writes is about 10%. Instead of write allocate, if one uses write buffers, then xs may be close to zero. But the posted writes in write buffers will cause bus contention, increasing xl and causing an indirect performance loss. Unfortunately, xl and xs are nonlinear and not easy to quantify. Bus traffic, on the other hand, has its own effect on cache replacement. Most severe is when the bus saturates, because after that the performance is solely dependent on the bus speed [2]. In this paper we investigate the reduced traffic using generate, the self coherency problem and the various solutions required to use generate, and the expected performance gains over both write around and write allocate. Hardware description language (HDL) simulations and actual system performance measurements are used to show the speedup of generate over write around and allocate. Shared bus systems exhibit a greater performance improvement, as do processors that use generate in on-chip caches instead of in second level caches. Generate reduces the number of bus cycles required by many programs, providing more available bandwidth that we feel allows greater sustained performance with a larger number of processors. Existing multiprocessor cache coherence protocols are easily extended to support this new mode. In the next section, we provide details of how to use CWG. Section 3 describes the simulation model, the Proteus hardware using cache write generate, and the workloads used to evaluate our scheme. In Section 4 we present the simulation and performance results. Section 5 covers the self coherency problem and the context switching behavior of CWG. The paper concludes in Section 6.
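To make the model concrete, here is a minimal Python sketch (our addition; the instruction mix is illustrative, not a measured workload) that evaluates equations (1) and (2) for the 10% store scenario above.

```python
# Run-time model of equations (1) and (2).
# Assumes an ideal CPI of 1 for every instruction class.
def actual_time(l, s, b, o, xl=0.0, xs=0.0):
    t_ideal = l + s + b + o                 # equation (2), Tideal
    return t_ideal + l * xl + s * xs        # Tactual

N = 1_000_000
l, s, b, o = 160_000, 100_000, 80_000, 660_000   # stores are 10% of N

t_ideal = actual_time(l, s, b, o)
# Write allocate with no write buffer: each store stalls one extra cycle.
t_allocate = actual_time(l, s, b, o, xs=1.0)
print(f"loss due to writes: {(t_allocate - t_ideal) / t_ideal:.0%}")  # 10%
```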


2 Cache Write Generate

Cache write generate (CWG) is defined as cache write validation on a write miss. The cache line is updated with the write, and the cache line tag is modified to the address of the write. If an existing dirty line in the cache must be replaced, the old line is written back. CWG does not fetch data, but coherency operations (broadcast, or directory update) still take place at the memories above the cache's level. CWG can affect the entire state of a line, or change just the state of a single word written into the line by using validation bits for each word within the line. CWG achieves higher performance because it reduces the usage of write buffers, it increases the number of hits on writes, and its write misses do not cause unnecessary reads. Therefore, CWG reduces the number of main memory bus cycles. Write buffers have a very fast cycle time, but posted writes may conflict with subsequent reads and writes, slowing them down. Posted writes increase the main memory bus traffic, and the number of collisions also increases. Since writes are cached, subsequent writes may hit in the cache, increasing the write hit ratio. Write allocate benefits from these two advantages as well. However, the third advantage, that write misses do not cause unnecessary reads and so reduce the number of main memory bus cycles, is exclusive to CWG. By reducing the number of cycles on the main memory bus, and the time to service a write hit, the available memory bandwidth can be increased without altering the program. For programs with a significant number of loads and stores, CWG can improve performance significantly. To illustrate how write caching affects memory behavior, consider the following example.

Example 1. User's View of Generate

Consider the following program of three steps, two matrix multiplications followed by a difference. Here P, Q, R, S, T, U, and V are the matrices. The system is a two-level cache memory where the secondary cache is large enough to hold all necessary data (once cached).

1. R = P × Q
2. U = S × T
3. V = R − U

Consider three cache write behaviors: write around, allocate, and generate. The first policy, write around, has poor performance, because in steps (1) and (2) elements of R and U are written to the main memory as they are computed (since they are cache misses). Afterwards, in step (3), R and U are cache read misses. All writes go to main memory, and reuse of written data causes reads from main memory. The second policy, write allocate, has better performance, but matrices R, U, and V are read from memory before they are written. This is a waste of bus operations and perhaps processor time. A key advantage of allocate over write around is that many following writes are hits in steps (2) and (3). In step (3), the processor uses the R and U matrices in the cache, an additional improvement over write around. The third policy, CWG, forces the writes of R, U, and V to be hits without memory reads. All writes are hits in steps (1) and (2); overwritten data are not fetched; and no write around cycles are sent to main memory. In step (3), as in allocate, reads of R and U are cache hits. In this example cache write generate is an improvement over write around and write allocate.
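The difference can be counted with a few lines of simulation. This sketch (ours, not from the paper) tallies main memory transfers for the three policies over the access pattern of Example 1, treating each matrix as a single cache line and ignoring the final write backs of dirty lines, which are the same for allocate and generate.

```python
# Count main memory transfers under the three write-miss policies
# (write back on hits in all cases; one "line" per matrix for brevity).
def simulate(trace, policy):
    cache, reads, writes = set(), 0, 0
    for op, line in trace:
        hit = line in cache
        if op == "read":
            if not hit:
                reads += 1                  # line fill from main memory
                cache.add(line)
        elif hit:
            pass                            # write hit: cache only
        elif policy == "around":
            writes += 1                     # write miss bypasses the cache
        else:
            if policy == "allocate":
                reads += 1                  # fetch before update (wasted)
            cache.add(line)                 # line now validated in cache
    return reads, writes

trace = [("read", "P"), ("read", "Q"), ("write", "R"),   # step (1)
         ("read", "S"), ("read", "T"), ("write", "U"),   # step (2)
         ("read", "R"), ("read", "U"), ("write", "V")]   # step (3)

for policy in ("around", "allocate", "generate"):
    print(policy, simulate(trace, policy))
# around   (6, 3): R and U re-read in step (3); all results sent to memory
# allocate (7, 0): R, U, V fetched needlessly before being overwritten
# generate (4, 0): only P, Q, S, T are ever read -- the fewest transfers
```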

2.1 Related Work

Improving cache write efficiency has been explored earlier in different forms in several systems. Many systems provide software control of cache write updating, such as the IBM 801 [14], the LSI Logic Sparc chip cache controller [10], and the Wisconsin Multicube [4]. In these processors and protocols, there are special instructions to selectively update cache lines, avoiding reads. This control is only available in supervisor mode, and application programs cannot directly use such features. Additional instructions are executed to force hits. The Motorola 68030 uses write-through with word validate on aligned long word misses, but uses write around on all other misses. This requires one bit per long word to store valid/invalid information in the cache, and is not suitable for shared bus processing. IBM also investigated cache write efficiency in their research parallel processor (RP3) [1]. They used write-through, reasoning that the interconnection network could handle the steady additional traffic but not the periodic congestion of write back [1]. Write updates without fetches are also provided in the RP3 processor memory element by word valid bits, a scheme similar to the Motorola 68030 aligned long word validation. Trace analysis of allocate and valid-bit-per-word schemes was done by Jouppi [9] for a variety of cache sizes in a uniprocessor. Valid bits per word, as in the MC68030, the IBM 801, and Jouppi [9], are expensive in hardware. Moreover, following reads in the same line and cache flush operations are complicated. Smith [15] mentions alternatives for handling writes but relies primarily on write buffers, a solution we have found inadequate by itself. Write buffers [15], write allocate [12], and write through [1], [13] do not address the removal of unnecessary traffic. We believe that for shared memory parallel processing, write through, write around, write buffers, and software cache control are not adequate. One of the authors (Wittenbrink, in [18]) investigated, using trace analysis, the effect of directly updating the line when it is known in advance that the line is to be written. In this paper we further investigate the cache write technique cache write generate.

2.2 Identification of Generate Variables

CWG, without word valid bits, is done when the contents of an entire line are going to be replaced. In fact, all memory is invalid when it is first used. Cold start of virtual memory, processes, I/O space, stack space, etc., uses memory that is replaceable. Further, a running process may recycle memory from the operating system, which also allows memory to be reinitialized. In a single processor system, a processor can CWG a line whenever it has this information. However, in a multiprocessor system, for cache coherency purposes it is important to inform the main memory system and other processors that a line is being used in write exclusive mode by the processor writing that line. A CWG coherency protocol uses the same operations as write allocate protocols except that CWG avoids data reads. To implement generate, we need to identify the possible generate variables. Writes that benefit from CWG are writes of data computed by the processor or explicitly initialized by the processor. Examples include dynamically allocated memories, stack segments, static memory segments, and temporary buffers. These memory areas are easy to identify through explicit declaration or by the compiler. In Example 1, a user specifies R, U, and V as generate areas by explicit declaration. Alternatively, the compiler identifies R, U, and V as outputs of the program, and therefore these are candidates for generate variables. Once identified, the variables are marked as generate. When the process has finished using them, they become generated. A generate variable, once written, can be spawned by explicit flushing to the main memory if it is to be used, or can be killed by explicit invalidation if flushing is unnecessary. In Example 1 above, if V is the only result of interest, then the memory areas belonging to R and U can be killed and the memory areas belonging to V should be spawned. CWG provides the highest performance solution when writes cause performance penalties. Once a generate variable has been written, it should not be re-generated. We define the tracking of such cache line state as the self coherency problem, discussed in Section 5. In Section 2.3 we discuss additional memory control and multiprocessor coherency issues.

2.3 Generate Memory Control and Multiprocessor Coherency

The marking control for generate variables can be exercised at several levels. A cache line is the minimum granularity of control we suggest, to keep the control simple with low overhead. Finer control at the byte or word level adds to both the cache state storage and the flushing complexity. Page level control is more amenable than line level control, because most microprocessors include a memory management unit (MMU) on chip. Page table entries (PTE) and the translation look aside buffer (TLB) carry the relevant page control information. An additional bit in the PTE and the TLB can specify whether the data in a page are generate areas. To exercise the control at the page level, all generate variables are allocated consecutively in memory. The operating system or the compiler identifies all pages which belong to generate areas and sets the generate bit in the corresponding PTEs to "1", generate. This bit is read by the TLB when the page is referenced for the first time. The TLB generate bit is made available to the cache controller every time a location in that page is referenced. Thus if the cache misses on data and knows that the location can be generated, it uses CWG and updates the cache tag without reading the line from the main memory. Multiprocessor coherency is not affected by the addition of a generate mode. This is because the processors follow exactly the same algorithm as write allocate except that the actual transfer from the main memory does not take place. Thus, bus snooping and directory based schemes are both adaptable to generate, by ensuring that CWG write misses affecting coherency do the proper invalidates or updates, and that generate areas are exclusive while being modified. Adapting a directory based coherency system to cache generate is simpler than page control because memory is controlled at the cache line level. The directory is initialized for each line so that it is generate or not generate.
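A rough sketch of this page-level control follows (our illustration; the structures and names such as tlb_generate are hypothetical, not the Proteus or i860 implementation). On a write miss the controller consults the generate bit the TLB forwarded for the page, and either validates the line in place or falls back to write allocate.

```python
# Page-level CWG control: a sketch with a direct-mapped cache and a
# TLB that carries one generate bit per page (loaded from the PTE).
NUM_LINES = 1024                  # direct-mapped cache lines
LINES_PER_PAGE = 128              # cache lines per page

class Bus:                        # stub: count main memory traffic
    def __init__(self): self.reads = self.writes = 0
    def read(self, line_addr): self.reads += 1
    def write(self, line_addr): self.writes += 1

tlb_generate = {}                 # page number -> generate bit
cache = {}                        # index -> {"tag": int, "dirty": bool}

def on_write_miss(line_addr, bus):
    index, tag = line_addr % NUM_LINES, line_addr // NUM_LINES
    victim = cache.get(index)
    if victim and victim["dirty"]:                    # write back victim
        bus.write(victim["tag"] * NUM_LINES + index)
    if not tlb_generate.get(line_addr // LINES_PER_PAGE, 0):
        bus.read(line_addr)                           # allocate: fetch line
    # else CWG: validate the line without any fetch
    cache[index] = {"tag": tag, "dirty": True}        # tag updated, dirty

bus = Bus()
tlb_generate[0] = 1               # page 0 holds generate variables
on_write_miss(5, bus)
print(bus.reads)                  # 0: the line was generated, not fetched
```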

3 Simulation Models and Hardware System

We compare the relative efficiency of the cache write generate policy to existing write caching controls, using single and multiprocessor shared memory systems. We developed a detailed model of the Intel i860 RISC processor [8], a custom external cache, and a main memory in the register transfer level simulation language ISP-prime, used with the N.2 simulator [19]. The simulation models were developed for architectural investigation while designing the Proteus system [16] [5]. The Proteus cluster hardware (currently 32 i860's) is also available for performance measurement. The Proteus system was designed for automated classification of objects and is suitable for coarse grain parallelism where the shared data objects are large in size. The system is composed of clusters, each of which has a 64 bit shared memory bus connecting four i860's [8], an i960 cluster control processor, 1 Mbyte of dual port fast RAM, 32 Mbyte of DRAM, and one duplex 250 Mbit/sec serial interface for communication with an associated direct memory access (DMA) controller. Clusters are connected in groups, nine clusters per group. Each group has a controller and a crossbar switch to interconnect clusters and groups. Each group is a node in an enhanced hypercube. More details on the overall system architecture are available in [16] and [5].

The simulation models are shared memory multiprocessors, where we varied memory timings, cache sizes, cache control, the number of processors, and external or no external caches. The Intel i860 was implemented in HDL, and the on-chip caches were modified for the studies without external caches. Benchmark programs were cross compiled to assembly code on an i860 software development system, and the assembly code programs were compiled into object files for the N.2 simulator.

3.1 Cache Modes, Sizes, and Memory Timings

Writes have the smallest impact in write back, and therefore we consider only write back, with the emphasis that multiprocessors use write back to reduce contention. For studies showing the write through protocol's adverse effects on multiprocessor performance, see [11] and [18]. For performance comparisons, we consider the following three scenarios for caching.

Case N: Normal. Mode N is where the application is run with normal write back caching, with write around on miss. Read misses are cached.

Case A: Allocate. Write allocate is write caching on write misses in addition to read caching. A line is first fetched from main memory and then updated in the cache.

Case G: Generate. Write generate is write caching without fetches on write misses for generate areas. For non-generate areas, we use the normal mode (write allocate can also be used). The cache line and tag are validated on write misses without going to main memory, by address decoding, a self coherency method discussed in Section 5.

To investigate the added complexities of performance with thrashing, we simulated two cache sizes. We have two choices for the external cache: a cache large enough that no replacements occur (256K bytes is effectively infinite for our scaled down applications) and a cache where frequent replacements occur (64K bytes, which is able to hold one 100 × 100 32-bit pixel image). We also simulated systems with no external caches, where the caching behavior of the i860 on-chip cache was modified to include cache generate. For the first level cache study, we use the same size cache as the i860, an 8K byte data cache. To investigate the effect of write posting and replacement policies, we have also simulated two fundamentally different external caches, cache x and cache y. Cache x uses no write posting, no wrap around fills, no posted replacements, and no other enhancements. The simplest control is used to see how these devices may have influenced the relative performance of CWG. Cache y (used in the constructed Proteus system) uses posted writes, wrap around fills, and posted replacements. In all systems, the on-chip cache, secondary caches, and main memory operate at progressively slower speeds. Let tp be the processor clock time. The secondary cache cycle time is ks·tp and the main memory cycle time is km·tp. ks ranges from 2 to 4 and km ranges from 4 to 10. Moreover, for the first cycle ks can be 4 to 6 and km can be 6 to 12. We selected representative timings for a range of memory hierarchy designs. In our simulations the secondary cache has ks = 2. For the main memory we have three models: fast km = 2 (4), medium km = 4 (5), and slow km = 20 (21), where the numbers in parentheses are those for the first bus cycle in a burst mode. For the i860, tp = 25 nsec. The one-level cache system also uses these main memory timings.
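As a worked example of these parameters, the sketch below (ours) computes the main memory line fill time in processor cycles; the four-transfer burst per line is our assumption for illustration.

```python
# Line fill cost in processor cycles: the first bus cycle costs
# k_first cycles, each remaining transfer k cycles (4-word burst assumed).
def fill_cycles(k_first, k, transfers=4):
    return k_first + (transfers - 1) * k

for name, (k, k_first) in {"fast": (2, 4), "medium": (4, 5),
                           "slow": (20, 21)}.items():
    cycles = fill_cycles(k_first, k)
    print(f"{name:6s}: {cycles} cycles ({cycles * 25} ns at tp = 25 ns)")
# fast: 10 cycles (250 ns); medium: 17 (425 ns); slow: 81 (2025 ns)
```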

3.2 Workload

We benchmarked our cache variations with image processing applications using mathematical morphology [6]. The morphology algorithm is a bright feature detection algorithm [3] shown in Figure 1.

[Figure 1: Bright Feature Detection Morphological Algorithm. The eight-step program: (1) a = I ⊖ SE, erosion; (2) flush; (3) b = a ⊕ SE, dilation; (4) flush; (5) a = I − b, subtraction; (6) flush; (7) R = a > t, threshold; (8) flush.]

I is the input image operated on by the structuring element SE. Memory is used most efficiently by using temporary images a and b, and processing in the eight step program shown along the left side of Figure 1. Flushes are only necessary for the parallel version, SPSD, below. This algorithm was used because it characterizes the applications of interest to the Proteus project. Having a single benchmark also highlights the shortcomings of a detailed HDL simulation: while precise for timing measurements, simulation times are so great that using a wide number of benchmark routines is intractable. To execute the task graph of Figure 1, it is partitioned by three types of parallelism.

SPMD. Single program multiple data: the data for each task is strictly partitioned. Each processor works on separate data. No flushing (steps (2), (6), and (8)) is required.

SP. Specialist pipelining: processors may specialize in functions. Data is shared using pipelined processing.

SPSD. Single program single data: each function is computed by all of the processors. This uses finer grained sharing. Data are split for processing, each part given to a separate processor. For four processors, each processor works on 1/4 of the job.

In our coding, we maximize exclusive output areas to minimize coherency traffic. We use two variants of the program: an optimized single processor version and a multiprocessor version. The find program uses reads and writes in such a manner as to make parallelization difficult, whereas the os program uses partitioning and padding of images, useful in avoiding crossing page boundaries. The os program represents a programming method that uses exclusive write areas. The simulator program processes random images. In the Proteus system, we use a 1M cache and 256 × 256 images. So for 128 × 128 images, a 256K cache was the chosen scaling. The results presented here are those using SPMD and SPSD with both program variants. Both programs consist of multiple modules, as shown in Figure 1.


4 Simulation and Results

We have grouped the results into three different cases: (1) simulation results when the processor's on-chip cache model remains the same but the secondary cache uses different modes; (2) simulation results when only a one level (on-chip) cache is used; and (3) the Proteus system measurements (not simulation), where the secondary cache is programmed to use generate, allocate, or normal write caching.

4.1 Secondary Cache Results

We show how hit ratios do not quantify the effect of writes on the performance (run time). With the mix of instructions given in Table 1, the on-chip cache behavior of the i860 is the same regardless of secondary cache modes. For the external cache, allocate and generate give exactly the same hit ratios for reads and writes. Using hit ratios, as shown in Tables 2 and 3, we would conclude that there is no difference, but allocate and generate are differentiated by read and write miss penalties. This affects the program performance, which can be seen through the number of bus cycles they use, the number of load stalls, and the run time.

Table 1: Instruction Counts, os Program

  Type               Number     Percentage
  Loads/Stores       173,545     26%
  Delayed Branches    56,797      8%
  Other              444,308     66%
  Total              674,650    100%

Table 2: Hit Ratios For On-Chip Caches, os Program

  Mode      Read     Write   Instruction   Combined
  N, A, G   0.5706   0.0     0.9999        0.9121

Table 3: Hit Ratios For External Caches, os Program

  Size   Mode   Read     Write    Combined
  64k    N      0.7558   0.0364   0.2918
         A, G   0.7552   0.8750   0.8325
  256k   N      0.8285   0.4243   0.5678
         A, G   0.9420   0.9375   0.9391

4.1.1 Bus Cycles

The number of bus cycles demonstrates the utility of generate. The external cache uses generate to reduce the number of bus cycles to main memory. To illustrate, we present all of the cycles in the system for this program. The on-chip cache loads, stores, and instruction cache misses create read, write, line fill, and line flush requests on the bus outside of the processor. These requests are serviced by the external cache. Since the on-chip (i860) behavior for all modes in the secondary cache is the same, the number of external requests is the same for all three modes; these are summarized in Table 4 for the os program. For this program, there are no replacements, and the number of memory references by the on-chip execution unit is 848,195. About 88% of the memory requests are satisfied on-chip, and therefore the number of external cycles is only 101,621 for data cache line fills, instruction cache line fills, and writes. All writes are misses in the on-chip cache, so they are posted directly to the on-chip write buffer.

Table 4: External Cache Requests, os Program

  Source         Cache Line Fills   External Requests
  Data Cache     8,987              35,948
  Inst. Cache    35                 137
  Write Buffer                      65,536
  Total                             101,621

Table 5 is the most concise demonstration of the difference between the normal, allocate, and generate cache modes. In this table, we show the number of bus cycles due to single writes, burst writes, and burst reads (to fill cache lines). The final column in Table 5 is the total number of external reads and writes to main memory by one processor. In a multiprocessor system with n processors, there will be n times as many bus cycles. This will result in varying amounts of congestion.

Table 5: Shared Memory Bus Cycles, os Program

  Cache   Mode   Writes   Burst Writes   Burst Reads   Total
  64k     N      63,152   298            8,812         99,592
          A      0        7,936          17,024        99,840
          G      0        7,936          8,832         67,072
  256k    N      37,728   0              6,188         62,480
          A      0        0              6,188         24,752
          G      0        0              2,092         8,368

If shared resources are being used, or if the processing is memory bound, generate reduces the number of cycles, and therefore congestion. The 64k cache demonstrates processing with some replacements, and the 256k cache demonstrates processing without replacement. Allocate creates more bus cycles than the normal mode in 64k caches because write caching causes thrashing. Generate does not fetch main memory data to cache writes. Therefore, even though thrashing occurs, generate has fewer bus cycles than allocate or normal.
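The totals in Table 5 are internally consistent if each burst transfer costs four bus cycles (a four-transfer line fill over the 64-bit bus; our inference, stated here as an assumption). A quick check in Python:

```python
# Verify Table 5 totals: single writes cost 1 bus cycle; burst writes
# and burst reads cost 4 cycles each (assumed 4 transfers per line).
rows = {                 # (writes, burst writes, burst reads, total)
    "64k N":  (63152, 298, 8812, 99592),
    "64k A":  (0, 7936, 17024, 99840),
    "64k G":  (0, 7936, 8832, 67072),
    "256k N": (37728, 0, 6188, 62480),
    "256k A": (0, 0, 6188, 24752),
    "256k G": (0, 0, 2092, 8368),
}
for name, (w, bw, br, total) in rows.items():
    assert w + 4 * (bw + br) == total, name
print("all Table 5 totals consistent with 4-cycle bursts")
```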

4.1.2 Load Stalls

In varying our external cache model, the number of stalled loads varies. Different numbers of loads are stalled depending on memory system performance. The speed of processor cache fills changes because of stalls. Instruction fetches may also stall processing; however, that is not a significant effect in most applications, including ours. The 64k caches in all modes have 19,852 stalled loads, or 18.38% of all loads. The 256k caches are large enough for no replacements and are more efficient than the 64k caches. The number of stalled loads for 256k caches is 18,424 for N, and 17,777 for A and G. Because the writes are more efficient, fewer loads are stalled. Increasing the cache size reduces stalled loads by 7.19%. Increasing the cache size and using A or G reduces stalled loads by 11.67%, an improvement of 3.64% over the 64k cache N mode.

4.1.3 Performance

We believe performance to be the most rigorous comparison (Hennessy and Patterson [7]). Does this dramatic reduction in bus cycles or improvement in hit ratio translate directly to a corresponding performance improvement? For our two application programs, the speedups of mode G over mode A (A/G in the figures) and of mode G over mode N (N/G in the figures) in external caches are shown in Figures 4, 5, 6, and 7. We use speedup as defined by

  speedup = (execution time original) / (execution time enhanced).   (3)

The speedups for the 256K cache os program, the 64K cache os program, and the 64K cache find program are also given in Tables 6, 7, and 8.

Table 6: Speedup For the os Program, 256k secondary cache

  No. Proc.   1p                            4p                            8p
  Memory      f         m         s         f         m         s         f
  Cache       x    y    x    y    x    y    x    y    x    y    x    y    x    y
  N/G         1.08 1.06 1.12 1.11 1.67 1.64 1.40 1.40 1.66 1.67 4.13 4.13 2.39 2.39
  A/G         1.00 1.00 1.01 1.01 1.18 1.18 1.01 1.01 1.04 1.04 1.79 1.79 1.08 1.08

Table 7: Speedup For the os Program, 64k caches

  No. Proc.   1p                            4p
  Memory      f         m         s         f         m         s
  Cache       x    y    x    y    x    y    x    y    x    y    x    y
  N/G         0.97 0.99 0.95 0.99 1.08 1.19 1.41 1.48 1.55 1.46 1.56 1.52
  A/G         1.00 1.00 1.02 1.01 1.21 1.30 1.05 1.14 1.30 1.27 1.48 1.47

The results in Figure 4 are speedups for the 256K cache with fast memory with respect to the number of processors. The speedups show that allocate and generate perform the same using up to four processors. Beyond that, generate outperforms allocate due to fewer bus cycles and hence less contention on the bus. This shows that generate can be an effective technique for increasing the number of processors on a shared bus. Generate performs better in comparison to the normal caching mode even when only four processors are sharing the bus.

Table 8: Speedup For the find Program, 64k caches

  No. Proc.   1p                            4p
  Memory      f         m         s         f         m         s
  Cache       x    y    x    y    x    y    x    y    x    y    x    y
  N/G         1.00 1.02 0.98 1.02 1.15 1.30 1.45 1.48 1.53 1.52 1.60 1.56
  A/G         1.00 1.00 1.01 0.99 1.20 1.21 1.08 1.14 1.24 1.31 1.52 1.52

Figures 5, 6, and 7 (and Tables 6, 7, and 8) show the speedups with respect to the memory speed (f, m, or s) for different cache sizes (256K or 64K), different cache implementations (x or y), and different programs (os or find). These figures show that generate improves performance by a greater amount when the memory is slower. This is as expected, because we are incrementally improving a small percentage of the program: the writes. It is also interesting to note that for one processor with fast memory, generate (and allocate) both perform worse if the cache size is small and a lot of data thrashing occurs. Fortunately, cache sizes are increasing beyond such application footprint sizes. Moreover, even at the medium speed, generate starts improving the performance. The timings show that generate brings the execution time closer to the fastest (fast memory single processor system) execution time. If a system slows down a program by 3 times in mode N or mode A, then mode G only slows it down 2 times. Thrashing makes the relative improvement over mode N greater, but the improvement over mode A is the same. Generate yields a significant speedup over mode A for all systems with slow memories in the single processor case. For the more sophisticated y cache model we achieve a greater speedup, which shows that sophisticated fills and flush postings are improved with generate.

4.2 Single-Level Cache Memory

We also performed simulations using only a one level on-chip cache, by modifying the i860 cache to support generate. We did not change the size of the on-chip cache (it remained 8K for the data cache and 4K for the instruction cache), simulated two memory speeds, fast (f) and medium (m), and varied the number of processors sharing a single bus from 1 to 8. The speedups for the find program are shown in Figure 8 and Table 9, and the run times are shown in Figure 9. Generate has a lower number of load stalls and external bus cycles than allocate. It is interesting to note that the normal mode performs the best with one processor, but is overtaken by generate with four processors and then by generate and allocate with eight processors. Generate achieves a 17% speedup for the fast memory and a 32% speedup for the medium speed memory over normal caching when four processors are sharing the bus. The corresponding speedups over allocate are 22% and 27%. With eight processors sharing the bus, the speedups achieved by generate are even better. These results show how generate allows increasing the number of processors on a shared bus or improving the overall performance of a given shared memory system.

4.3 Proteus Performance Results

Lastly, we ran the same find program for 256 × 256 32-bit integer images and an optimized matrix multiplication assembly program to multiply two 256 × 256 floating point matrices on Proteus. We used the normal, generate, and allocate caching modes for the secondary cache with one and four processors. The measured speedup of generate over normal was 18% for the find program. The speedup was only 0.8% for the matrix multiplication. A 256 × 256 matrix multiplication writes only one result after a dot product of two 256-element vectors. The write frequency is less than 0.4% for the matrix multiplication, so little speedup is expected for one processor with the fast memory system used to implement Proteus. The main point from the Proteus performance timings is that we implemented CWG in hardware and validated our simulation models. The os program was parallelized for two and four processors, and contained synchronization code for barriers between the respective operations. Recall that this program is optimized to minimize the number of writes, so we do not expect a big gain. The speedup of generate over allocate is 3% for timings taken from a four processor program running on Proteus. As predicted, the performance impact is small; the simulation showed speedups of 1% and 4% on a four processor system.

Table 9: Speedup For the find Program, single level cache

  No. Proc.   1p          4p          8p
  Memory      f     m     f     m     f     m
  N/G         0.94  0.90  1.17  1.32  1.47  1.41
  A/G         1.06  1.10  1.23  1.28  1.31  1.32

5 CWG Self Coherency Problem

As may be obvious to those astute in cache design, generate should be enabled for a cache line only if the data have not been previously written; otherwise, data are lost. An implementation of CWG must monitor the writes to avoid creating multiple validated lines. Multiple validated lines for the same address create inconsistent and/or incorrect data. Although the problem is not specific to matrix multiplication, we will use our running example of matrix multiplication and discuss the details of processing of step (1) to demonstrate the self coherency problem.

Example 2: Cache's View of Generate

The left side of Figure 2 shows the rows of matrices P, Q, and R in main memory. In the middle of Figure 2, the memory management unit (MMU) shows that matrix R is set to generate and P and Q are no-generate. We follow the operations on cache line 0. The right side of Figure 2 shows that cache line 0 starts out in the invalid state.

[Figure 2: Example of Multiple Validation, Start State. Main memory holds the rows of P, Q, and R; the MMU marks R's page generate and P's and Q's pages no-generate; cache line 0 begins invalid.]

Now, Figure 3 shows operations proceeding from top to bottom, with the operation, address, data value, cache tag, memory contents, and an action summary. The first operation (write address 32, top of Figure 3) is the write of result R00. The figure shows that it was a CWG and the tag was updated. The second operation (read address 0) is the read of row 0 of matrix P again, so the processor can calculate the next output. The miss causes the dirty cache line, tag 8, to be flushed to main memory at address 32. The third operation (write address 33) is the write of the next result, R01, using CWG into the cache. You can see R01 within the line of matrix P values. At this point there are two validated copies of the same line, one in main memory and one in the cache. The fourth operation (read address 0, bottom of Figure 3) is the read of row 0 of matrix P again, causing a flush which overwrites the main memory line at address 32, and the calculated value R00 is lost.

[Figure 3: Example of Multiple Validation, R00 Overwritten in Main Memory. The final memory line at address 32 holds P00, R01, P02, P03.]

For this example, matrices P, Q, and R are each 4 × 4. The matrix multiplications were calculated as R00 = P00·Q00 + P01·Q10 + P02·Q20 + P03·Q30 and R01 = P00·Q01 + P01·Q11 + P02·Q21 + P03·Q31, which use row 0 of matrix P and columns 0 and 1 of matrix Q. The cache has 32 direct mapped lines of 4 entries, and a page is generate if the MMU has a "1".
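The hazard can be reproduced in a few lines of simulation. The following sketch (ours; it collapses the example to a single direct-mapped cache line and uses naive CWG with no self coherency control) loses R00 exactly as in Figure 3.

```python
# Reproduce the self coherency hazard of Example 2: naive CWG
# regenerates an already-written line, and the flush order loses data.
LINE = 4                                     # words per cache line
memory = {0: ["P00", "P01", "P02", "P03"],   # row 0 of P (line address 0)
          32: ["*", "*", "*", "*"]}          # row 0 of R, uninitialized
cache = {"tag": None, "data": ["*"] * LINE, "dirty": False}

def access(op, addr, value=None):
    base, offset = addr - addr % LINE, addr % LINE
    if cache["tag"] != base:                        # miss
        if cache["dirty"]:
            memory[cache["tag"]] = cache["data"][:] # flush the victim line
        if op == "write":                           # naive CWG: retag only;
            cache["tag"], cache["dirty"] = base, False  # stale words remain
        else:                                       # read miss: fetch line
            cache.update(tag=base, data=memory[base][:], dirty=False)
    if op == "write":
        cache["data"][offset], cache["dirty"] = value, True

access("write", 32, "R00")   # CWG: R00 written in cache only
access("read", 0)            # flush pushes R00 to memory, loads P's row
access("write", 33, "R01")   # CWG again: second validated copy of line 32
access("read", 0)            # flush overwrites memory line 32: R00 lost
print(memory[32])            # ['P00', 'R01', 'P02', 'P03'] -- R00 is gone
```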

5.1 CWG Self Coherency Solutions

Additional control is required to ensure that the cache does not generate multiple copies of the same cache line. This control is provided by associating additional information with each line in memory, in the cache, or in the TLB. We briefly discuss five schemes for maintaining self consistency: bit-per-cache-line, two-bits-per-cache-line, TLB control, address decoding, and directory-based control. Our Proteus implementation uses address decoding.

A simple solution to avoid self coherency problems is to provide an extra control bit, the CWG bit, for each cache line, indicating whether that line has been CWG'ed or not. The disadvantage of this scheme is that each line can CWG only once. To improve generate capability, a cache line may have two control bits instead of one. This scheme doubles the generate capability of each line and therefore may be useful for smaller caches. Notice that this cannot be extended to a third bit, because a third request cannot be differentiated from the first.

To fully utilize generate, a better scheme is to provide the cache line control in the translation look aside buffer (TLB). The TLB provides a generate bit for each line in a page. When the TLB fetches a page table entry (PTE) and the generate bit in the PTE is set, it sets these bits to generate. Whenever a location in a line is written, the generate bit is forwarded to the cache and reset to un-generate. Thus, the second time around, the generate information sent from the TLB is no-generate. Using this scheme, multiple lines from different pages mapping to the same cache lines can all be CWG'ed. This scheme has the further advantage that a page can be set to generate and spawned or killed several times by simply modifying the PTE and TLB entries. Thus any pages used as scratch areas can be repeatedly CWG'ed without the intermediate flushes or invalidates required in the bit(s)-per-line schemes.

Self coherency may be maintained through software control by not using multiple memory areas which map to the same generate areas in the cache. The compiler and operating system place program and data segments such that the generate variables map to unique lines while programs execute, so replacement of generate data does not occur unpredictably. The ability to generate can be embedded in the virtual address generated by the program. A hardware address decoder identifies generate addresses. The overhead of decoding is the software control necessary to prevent simultaneous use of cache lines for generate, plus the address decoding circuitry to enable CWG. It is possible to use address decoding on existing microprocessors with custom external caches. The obvious danger is the possibility of incorrect programming or operating system behavior which causes inconsistency.

In a directory-based self coherency scheme, a single bit is provided for each main memory line to indicate whether the line is in a generate area or not. This bit is sufficient to maintain self coherency as well. The generate bit is reset upon the first write. The address decode scheme uses the least hardware. Proteus [5] [16] uses address decoding control because of the minimal hardware overhead and the ability to use off-the-shelf microprocessors.
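As an illustration of the simplest of these schemes, the bit-per-cache-line control (a sketch of ours; Proteus itself uses address decoding), the write miss path checks and consumes the line's single CWG bit:

```python
# Bit-per-line self coherency guard: each cache line may CWG only once;
# afterwards a write miss in a generate page falls back to allocate.
class Line:
    def __init__(self):
        self.tag, self.dirty, self.cwg_bit = None, False, True

def write_miss(line, tag, page_is_generate, fetch, write_back):
    if line.dirty:
        write_back(line.tag)              # flush the victim line
    if page_is_generate and line.cwg_bit:
        line.cwg_bit = False              # consume the one CWG allowed
    else:
        fetch(tag)                        # write allocate: read the line
    line.tag, line.dirty = tag, True      # validate and mark modified

# Usage: the second generate miss to the same line must fetch.
line = Line()
write_miss(line, 8, True, print, print)   # CWG: nothing printed
write_miss(line, 9, True, print, print)   # prints 8 (write back), 9 (fetch)
```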

5.2 Context Switching

To avoid losing the generate capability on TLB replacement and context switches, one can specify that a generate page should not be removed from the TLB. Another solution is to preserve the generate bits for all lines in a page when the TLB entry is replaced. Obviously, this requires additional work and memory to save the state. Also, the generate bits must somehow be reinstated when the page table entry (PTE) is reloaded. Depending on the implementation, the overhead may be substantial. Because of this, the preferred solution is to reset the generate bit in the page table when the TLB fetches that PTE. If the PTE is removed from the TLB for any reason, no further CWG's in that page are possible. The cost of this low overhead scheme is a loss in generate capability, but it should not be significant because TLB entries are not replaced often.

6 Conclusions

Generate improves the performance of multiprocessor applications significantly over the allocate and write back caching modes by altering the second level cache. When generate is incorporated on chip, even greater improvements are achieved. Allocate and generate have the same hit ratio, but generate significantly reduces the number of bus cycles by making write misses more efficient. For our application, generate reduced the number of bus cycles in a multilevel cache system by 33% to 66%. We also showed that the performance improvement is not nearly as dramatic as the improvement in hit ratios or the reduction in bus cycles. Performance is weakly coupled with trace results in the case of write behaviors, and the improvement of generate over allocate averaged about 20%. The reduction in bus cycles is achievable without rewriting programs and allows shared bus systems to use greater numbers of processors. Program performance can be improved in single processor systems as well, where the contention between loads, stores, branches, and instructions is reduced by decreasing cache write miss service times. The multiprocessor coherency overhead is the same as allocate. Required modifications include the addition of self coherency controls, identification of possible generate variables, and generate memory management, which we have discussed. Generate allows designers to increase the number of processors on a shared bus or to use slower memories to achieve the same performance as systems that do not use generate. It can be used in the arsenal of techniques to improve performance [2]. There will be improvement only for applications that are I/O bound, obviously, and further examination of benchmarks will likely show that the improvement from generate is modest for a single processor, but significant for a multiprocessor. Our detailed simulation was limited by the inability to run many benchmarks, but has strength in that it gives precise run times, not possible with trace analysis. We have shown that for machine vision applications, the hit ratio greatly improves and the number of bus cycles is reduced significantly, but the effect on performance is more subtle. Generate reduces the amount of traffic, and therefore moves the knee of the curve at which a shared bus will saturate, or become congested. Our future research will focus on further evaluating parallel benchmarks with generate. This may be done by developing precise simulations that are quicker to run, and also by designing and building new parallel machines. We are also working on inventing additional cache optimization techniques and creating analytical performance models matching real world systems.

7 Acknowledgments

We gratefully acknowledge the Proteus circuit designers at the Applied Physics Laboratory of the University of Washington: Mike Harrington, Ken Cooper, Bill Corrin, Bob Johnson, and Gary Harkins. They put up with our experimentation and allowed us to implement generate for evaluation. Special thanks to Yung-Hsi Yao and Tuan Phan for performance timings and Proteus system support. The work of the Proteus software group, overseen by Professors Robert Haralick, Linda Shapiro, and Jenq-Neng Hwang, also added to the quality of our work, and we sincerely thank them for their efforts. This research has been supported in part by the Navy Coastal Systems Center, the NASA Graduate Student Researcher's Program, and the Boeing Company.

8 Bibliography

[1] W. C. Brantley, K. P. McAuliffe, and J. Weiss, "RP3 Processor Memory Element," in Proceedings of the 1985 International Conference on Parallel Processing, Aug. 20-23, 1985, pp. 782-789.

[2] C.-H. Chen and A. K. Somani, "Effects of Cache Traffic on Shared Bus Multiprocessor Systems," in International Conference on Parallel Processing, Chicago, IL, Aug. 1992, pp. I-285 - I-288.

[3] M. S. Costa, "A Practical Guide to Task-Oriented Sequences of Morphological Operations for Use in Image Analysis," Technical Report EE-ISL-90-01, University of Washington, Department of Electrical Engineering, Intelligent Systems Laboratory, pp. 39-42, Apr. 1990.

[4] J. R. Goodman and P. J. Woest, "The Wisconsin Multicube: A New Large-Scale Cache Coherent Multiprocessor," in 15th Annual International Symposium on Computer Architecture, Honolulu, HI, June 1988, pp. 422-431.

[5] R. M. Haralick, A. K. Somani, C. M. Wittenbrink, L. G. Shapiro, J. N. Hwang, C.-H. Chen, R. Johnson, and K. Cooper, "Proteus: A Reconfigurable Computational Network for Computer Vision," to appear in Journal of Machine Vision and Applications.

[6] R. M. Haralick, S. R. Sternberg, and Y. Zhuang, "Image Analysis Using Mathematical Morphology," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 4, July 1987.

[7] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufmann, 1990.

[8] Intel, i860 64-Bit Microprocessor Hardware Reference Manual. Mt. Prospect, IL: Intel Corp., 1990. Intel, i860 64-Bit Microprocessor Programmer's Reference Manual. Mt. Prospect, IL: Intel Corp., 1990.

[9] N. P. Jouppi, "Cache Write Policies and Performance," in 20th Annual International Symposium on Computer Architecture, San Diego, CA, May 1993, pp. 191-201.

[10] LSI Logic, L64815 Memory Management, Cache Control, and Cache Tags Unit Technical Manual. LSI Logic Corp., 1989.

[11] T. Lovett and S. Thakkar, "The Symmetry Multiprocessor System," in International Conference on Parallel Processing, 1988, pp. 303-310.

[12] MIPS Computer Systems, Inc., 930 Arques Avenue, Sunnyvale, CA 94086, MIPS R-Series Architecture, Sept. 1990.

[13] Motorola, MC68030 Enhanced 32-Bit Microprocessor User's Manual. Motorola Inc., 1987.

[14] G. Radin, "The 801 Minicomputer," in Proceedings, Symposium on Architectural Support for Programming Languages and Operating Systems, Palo Alto, CA, March 1-3, 1982, pp. 39-47.

[15] A. J. Smith, "Cache Memories," Computing Surveys, Vol. 14, No. 3, Sept. 1982.

[16] A. K. Somani, C. M. Wittenbrink, R. M. Haralick, L. G. Shapiro, J. N. Hwang, C.-H. Chen, R. Johnson, and K. Cooper, "Proteus System Architecture and Organization," in Fifth International Parallel Processing Symposium, Anaheim, CA, April 30 - May 2, 1991, pp. 276-284.

[17] H. S. Stone, High Performance Computer Architecture. Reading, MA: Addison-Wesley, 1987.

[18] C. M. Wittenbrink, "Directed Data Cache for High Performance Morphological Image Processing," Master's Thesis, University of Washington, Dept. of Electrical Engineering, Oct. 1990.

[19] Zycad, N.2 User's Manual. Zycad Corporation, 1989.


Figure 4: os program speedup of generate vs. normal and allocate, with 1, 4, and 8 processors in a 2-level cache system

Figure 5: os program speedup of generate vs. normal and allocate, with fast, medium, and slow memories; 2-level cache system, 256k secondary cache

Figure 6: os program speedup of generate vs. normal and allocate, with fast, medium, and slow memories; 2-level cache system, 64k secondary cache

Figure 7: find program speedup of generate vs. normal and allocate, with fast, medium, and slow memories; 2-level cache system, 64k secondary cache

Figure 8: find program speedup of generate vs. normal and allocate, with 1, 4, and 8 processors in a 1-level cache system

Figure 9: find program timings of generate, normal, and allocate, with 1, 4, and 8 processors in a 1-level cache system