A Software-Controlled Cache Coherence Optimization for Snoopy-based SMP System

Youhui Zhang, Ziqiang Qian, Weimin Zheng
Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China
[email protected] Abstract—Some research results show that on average 67% of broadcasts for the maintenance of cache coherence in SMP systems are unnecessary. To reduce the unnecessary overhead of snoopy-based SMP systems, this paper proposes a new software/hardware hybrid cache coherence optimization—the programmer can insert special instructions into programs to direct related hardware to enable/disable broadcast operations, so some potential broadcasts for unshared variables are avoided without violating data coherence. We design the mechanism along with a proposed coherence protocol. Moreover, it is simulated on a SMP simulation platform and the results show that the improvement is apparent. Although the insertion is manual, it accords with the existing SMP programming model. Keywords- SMP, cache coherence, snoopy-based protocol
I. INTRODUCTION
Improving the performance of cache-coherence maintenance for SMP systems [1][2] is an important research issue, although the problem is at least thirty years old. Many hardware and software solutions for cache coherence have been proposed. Hardware approaches play a dominant role, and most commercial SMP systems employ either a snoopy-based or a directory-based hardware cache coherence scheme. The snoopy-based scheme is suitable for small-scale, bus-based machines. Accordingly, the focus of our research is how to optimize the snoopy-based approach with software methods for embedded SMP systems.

To maintain coherence, snoopy-based SMPs commonly broadcast memory requests to the other processors in the system [3][4][5]. While broadcasting is a simple way to find data copies, it consumes considerable bus bandwidth and increases the latency of non-shared data accesses. Cantin et al. [6] report that, on average, 67% (and up to 94%) of such broadcasts are unnecessary. [6] and [7] propose hardware-based methods that exploit coarse-grain sharing patterns in snoop-based shared-memory multiprocessors to reduce bandwidth, latency and energy: their Coarse-Grain Coherence mechanism monitors the coherence status of large regions of memory and uses that information to avoid unnecessary broadcasts.

This paper presents a new software/hardware hybrid mechanism that decreases the number of unnecessary broadcasts in snoopy-based embedded SMP systems without violating coherence. That is, the programmer can insert new instructions, or extend existing instructions, to direct the related hardware to enable or disable the broadcast operations of the local processor. In contrast with existing approaches, our solution has the following features:
• It does not modify the main body of the snoopy-based hardware, except that some small judgment logic is added.
• The programmer can indicate whether a program section accesses shared variables, which is much more accurate than other software methods based on the compiler or the OS.

Of course, this approach generally increases the programmer's workload. But our focus is small-scale embedded SMP systems, which run relatively small and few programs, so it is comparatively easy to complete such indications manually. Moreover, the approach accords with the natural SMP programming model, which we present in Section II.

II. DESIGN AND IMPLEMENTATION
A. Programming Model

In this paper, an abstract SMP programming model based on the viewpoint of data sharing is presented. Any storage element or variable of a program that can be read or written by more than one thread/process is called shared data. The instruction sequence of a program can thus be divided into several contiguous blocks, each of which either does or does not contain instructions that access shared data. This model is formalized as follows.

A program P can be divided into a series of basic blocks [8]. A basic block B is a contiguous instruction sequence of maximal length that satisfies two conditions:
• it has no incoming branch or jump instructions except to its entry, and
• it has no outgoing branch or jump instructions except at its exit.

There are two kinds of blocks. The first, which contains one or more instructions that access shared data, is called a data-sharing block (DSB); the other is a non-data-sharing block (NDSB).

Our idea is to insert a pair of specific instructions into each DSB: the first enables broadcasts and the second disables them. The insertion principle is that the enabling instruction should be placed before any instruction that accesses shared data, and the disabling one after the last such instruction. Of course, the exact positions depend on the programmer's judgment.

Fortunately, compared with the existing SMP programming model, this model is natural. Programs that share data must access that data serially to avoid coherence problems: before a program accesses a shared data item, it must ensure that no other program will change the item, and the primary mechanism used to keep programs from interfering with one another is the lock. An SMP program can therefore be regarded as a series of sections divided by lock/unlock pairs, with all shared-data updates lying inside those pairs. This is similar to our model, except that our division requires every shared-data read or write to be located in a DSB.

B. Cache Coherence Protocol

A variety of cache coherence protocols have been designed and placed into a general framework by Sweazey and Smith [9]. The framework identifies the following states as relevant to the family of cache-coherence protocols they consider: M (exclusive modified), O (owned), S (shared unmodified), E (exclusive unmodified) and I (invalid). More detailed and formalized descriptions of the MOESI protocol are presented in [10].

A little judgment logic is added to implement our solution: the MOESI protocol is followed only if broadcasting is enabled; otherwise, all data accesses are directed to the local cache and/or the global memory.

C. The Hardware Design

The first hardware enhancement is to add special instruction(s) to enable/disable the hardware cache coherence mechanism. In most cases, however, it is not necessary to add new instructions, since some existing control instructions can be extended to provide the new switch function. Most popular processor architectures have control registers or coprocessor registers, for example the coprocessors of MIPS and ARM and the ASR (ancillary state register) of UltraSPARC, and define related instructions to access them. These registers generally have some bits reserved for future extension, and such a bit can be used to switch the cache coherence mechanism. Regardless of the concrete micro-architecture, the simplest hardware design can be presented as follows.
One control or coprocessor register with reserved bits is employed to record the current status of the cache coherence mechanism: if the designated bit of this register is 1, coherence is enabled; otherwise it is disabled. The related register-access instruction can modify the bit as needed, and the bit is then consulted before any cache operation to decide whether the snoopy-based protocol should be applied.

However, this design cannot support a multi-task environment. For example, if one task a enables the mechanism (sets the bit to 1) before it is suspended, and the next running task b sets the bit to 0, cache coherence will be violated when a is re-executed. There are two solutions for this: one is to swap the register out/in as part of the task's context; the other is to add an internal register that records the accumulated total of enable operations. Each committed disable instruction decrements this internal register, and the reserved bit is set to 0 only when the internal register reaches 0; otherwise it remains 1. In this paper, we adopt the second solution for simplicity; moreover, our target is the embedded system, which often runs only one task per processor at a time.
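As a sketch of this second solution, the counter-based switch and its use around a data-sharing block can be modeled in C as follows. The names coherence_enable/coherence_disable, the variables standing in for the ASR bit and the internal counter, and the shared-counter example are all illustrative assumptions; a real implementation would set the bit through the extended register-access instruction described above.

```c
/* Illustrative model of the counter-based coherence switch.
 * asr_coherence_bit stands in for the reserved control-register bit,
 * enable_count for the internal register; neither name is part of any
 * real ISA. */

static int asr_coherence_bit = 0;  /* 1: snoopy broadcasts enabled */
static int enable_count = 0;       /* accumulated enable operations */

/* Inserted before the first shared-data access of a DSB. */
void coherence_enable(void)
{
    enable_count++;
    asr_coherence_bit = 1;
}

/* Inserted after the last shared-data access of a DSB. The bit is only
 * cleared once every outstanding enable has been matched, so a task that
 * is preempted inside its DSB cannot have broadcasts switched off under
 * it by another task's disable. */
void coherence_disable(void)
{
    if (enable_count > 0)
        enable_count--;
    asr_coherence_bit = (enable_count > 0) ? 1 : 0;
}

/* Example DSB: an update of a shared variable, wrapped by the pair. */
static int shared_counter = 0;     /* shared data (placeholder) */

void update_shared_counter(void)
{
    coherence_enable();    /* broadcasts on: entering the DSB */
    shared_counter++;      /* shared-data access */
    coherence_disable();   /* broadcasts off: back in an NDSB */
}
```

Note how nesting falls out of the counter: if two enables are outstanding, a single disable leaves the bit at 1, which is exactly the multi-task case the second solution is meant to handle.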
III. EVALUATION METHODOLOGY
A. Evaluation System

Detailed timing evaluation is performed with a multiprocessor simulator, Sparc-Sulima [11], a machine simulator for the UltraSPARC implementation of the SPARC V9 processor architecture. Sparc-Sulima can give an accurate analysis of memory behavior and thread interactions in an SMP context where all CPUs are linked by one system bus. As mentioned in [11], Sparc-Sulima implements the level-1 ICache/DCache, the combined level-2 cache, and the cache-coherency functions for intra-CPU and SMP coherency. In version 0.3, which we used, a snoopy-based mechanism implements its MOESI protocol. Moreover, some configurations of the simulated system can be adjusted through a script file, including the number of CPUs, the related memory transaction latencies and some cache configuration parameters.

We modify Sparc-Sulima so that the instructions that read/write the ASR, RDASR and WRASR, are enhanced as described in Section II: whether cache-coherence operations are disabled depends on the control bit in the ASR.

In our evaluation, the example programs bundled with the simulator package are tested in the simplest fashion, which is to 'boot' the simulated UltraSPARC SMP directly from the main functions of these programs using a specially compiled executable. The tests are run under the different cache configurations presented in Table I. The related operation latencies (in cycles) are: read/write latency of the L1 cache, 1; read/write latency of the L2 cache, 3; read/write latency of main memory, 32; latency for the completion of a broadcast invalidation/read request, 3*n (where n is the number of CPUs).

TABLE I. CACHE CONFIGURATIONS

Cache      | Read/Write Scheme              | Size/Line/Block                        | Replacement Scheme
-----------|--------------------------------|----------------------------------------|------------------------
L1 DCache  | Write-through, read-allocation | 16 KB, two 16-byte sub-blocks per line | direct mapped
L1 ICache  | read-allocation                | 16 KB, 32-byte block per line          | two-way set-associative
L2 Cache   | Write-back                     | 1 MB, 64-byte line                     | direct mapped
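These latency parameters can be combined into a back-of-the-envelope cost model that shows where the saving comes from: with broadcasts disabled, an access to non-shared data no longer pays the 3*n broadcast-completion term. The function below is our own illustration, not simulator code, and charging the broadcast term only to accesses that reach main memory is a simplifying assumption.

```c
/* Latency parameters from Section III.A, for one access by one CPU in an
 * n-CPU system. This is an illustrative model, not the simulator's code. */

enum level { HIT_L1, HIT_L2, MISS_TO_MEMORY };

int access_cycles(enum level where, int n_cpus, int broadcast_enabled)
{
    if (where == HIT_L1)
        return 1;                    /* L1 read/write: 1 cycle */
    if (where == HIT_L2)
        return 1 + 3;                /* L1 probe + L2 access */
    /* Miss all the way to memory: L1 probe + L2 probe + memory access,
     * plus broadcast completion (3*n) when coherence is enabled. Where
     * exactly the 3*n term applies is our simplifying assumption. */
    return 1 + 3 + 32 + (broadcast_enabled ? 3 * n_cpus : 0);
}
```

For a 4-CPU system, a memory miss costs 48 cycles with broadcasts enabled but only 36 with them disabled, which is the per-access gap the inserted instructions exploit for NDSBs.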
B. Results

The following command is used to launch a test program, myprog, on the simulator:

./run myprog [-p p] [t [tm]]

Here t and tm are the memory-system trace levels, and p is the number of CPUs linked to the single system bus. For p > 1, each CPU boots from main() in SPMD style using independent stacks, with any shared data declared static. For SMP programs, the libsmp library in the package provides all necessary calls. The test cases under the configurations of Table I are run, and all running times are recorded in Table II.
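The SPMD convention just described (every CPU runs main() on its own stack; anything shared is declared static) can be sketched as follows. spmd_body() and collect_sum() are hypothetical helpers written only for illustration; they are not part of libsmp, and a real test program would also need its barrier and CPU-id calls.

```c
/* Hypothetical sketch of an SPMD-style test program. NCPUS, spmd_body()
 * and collect_sum() are illustrative names, not libsmp API. */

#define NCPUS 2

/* Shared data is declared static, as in the simulator's test programs. */
static volatile int partial[NCPUS];

/* The work each CPU would perform after booting from main(); 'me' is the
 * CPU id. The shared write belongs in a DSB, so the enable/disable pair
 * of Section II would wrap it. */
void spmd_body(int me)
{
    partial[me] = me + 1;        /* shared-data write */
}

/* After a barrier, CPU 0 gathers the partial results (shared reads,
 * again inside a DSB). */
int collect_sum(void)
{
    int sum = 0;
    for (int i = 0; i < NCPUS; i++)
        sum += partial[i];
    return sum;
}
```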
TABLE II. THE TEST RESULTS

Each entry gives the optimized running time, with the unoptimized time and the ratio between the two in parentheses (unit: 32 cycles).

Program    | CPU Number = 2           | CPU Number = 3           | CPU Number = 4
-----------|--------------------------|--------------------------|--------------------------
Hello      | 5742 (5952, 96.6%)       | 5773 (6016, 96.0%)       | 5792 (6048, 95.8%)
Meminstrns | 4710 (5120, 92.0%)       | 4967 (5408, 91.8%)       | 5056 (5600, 90.3%)
Factorial  | 62064 (64640, 96.0%)     | 62900 (66720, 94.3%)     | 64064 (68416, 93.6%)
Intops     | 3405783 (3698560, 92.1%) | 3455894 (3768768, 91.7%) | 3495900 (3853056, 90.7%)

We can see that the performance improvement ranges from 3.4% to 9.7% for these programs, where the improvement is one minus the ratio of the optimized to the unoptimized time (e.g., 1 - 5056/5600 ≈ 9.7% for Meminstrns on 4 CPUs). The improvement is larger for programs with more memory operations and for programs running on larger SMP systems.

IV. CONCLUSIONS

This software/hardware hybrid cache coherence optimization requires programmers to insert special instructions into the program to enable/disable the cache coherence mechanism. For an instruction block that contains accesses to shared variables, one pair of enable/disable instructions is inserted to maintain data coherence. Although the insertion is manual, it accords with the natural SMP programming model. Test results show that the optimization can fairly reduce the average memory access latency in a snoopy-based multiprocessor system, and hence improves overall performance.

ACKNOWLEDGEMENT

This work is supported by the High Technology Research and Development Program of China under Grant No. 2006AA01Z111 and the National Natural Science Foundation of China (No. 60773147).
REFERENCES

[1] J. Huh, D. Burger, and S. W. Keckler, "Exploring the design space of future CMPs," in Proc. 10th International Conference on Parallel Architectures and Compilation Techniques, Sept. 2001.
[2] D. A. Wood and M. D. Hill, "Cost-effective parallel computing," IEEE Computer, 28(2), Feb. 1995.
[3] A. Charlesworth, "The Sun Fireplane system interconnect," in Proc. SC2001.
[4] J. Tendler, S. Dodson, and S. Fields, "IBM eServer Power4 system microarchitecture," Technical White Paper, IBM Server Group, 2001.
[5] R. Kalla, B. Sinharoy, and J. Tendler, "IBM Power5 chip: a dual-core multithreaded processor," IEEE Micro, 2004.
[6] J. F. Cantin, M. H. Lipasti, and J. E. Smith, "Improving multiprocessor performance with coarse-grain coherence tracking," in Proc. 32nd Annual International Symposium on Computer Architecture, pp. 246-257, 2005.
[7] A. Moshovos, "RegionScout: exploiting coarse grain sharing in snoop-based coherence," in Proc. 32nd Annual International Symposium on Computer Architecture, pp. 234-245, 2005.
[8] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 3rd ed., ch. 3, June 2002.
[9] P. Sweazey and A. J. Smith, "A class of compatible cache consistency protocols and their support by the IEEE Futurebus," in Proc. 13th Annual International Symposium on Computer Architecture, pp. 414-423, 1986.
[10] K. Baukus and R. van der Meyden, "A knowledge based analysis of cache coherence," ICFEM 2004, pp. 99-114.
[11] B. Clarke, A. Over, and P. Strazdins, The Sparc-Sulima Manual, The Australian National University, 2004. http://cap.anu.edu.au/cap/projects/sulima