Implementing High Availability Memory with a Duplication Cache

Implementing High Availability Memory with a Duplication Cache Nidhi Aggarwal, James E. Smith, Kewal K. Saluja

Norman P. Jouppi, Parthasarathy Ranganathan

University of Wisconsin-Madison Madison, WI

Hewlett Packard Labs Palo Alto, California

Abstract—High availability systems typically rely on redundant components and functionality to achieve fault detection, isolation and failover. In the future, increases in error rates will make high availability important even in the commodity and volume market. Systems will be built out of chip multiprocessors (CMPs) with multiple identical components that can be configured to provide redundancy for high availability. However, the 100% overhead of making all components redundant is going to be unacceptable for the commodity market, especially when all applications might not require high availability. In particular, duplicating the entire memory as current high availability systems (e.g., NonStop and Stratus) do is particularly problematic given that system costs are going to be dominated by the cost of memory. In this paper, we propose a novel technique called a duplication cache to reduce the overhead of memory duplication in CMP-based high availability systems. A duplication cache is a reserved area of main memory that holds copies of pages belonging to the current write working set (the set of actively modified pages) of running processes. All other pages are marked as read-only and are kept only as a single, shared copy. The size of the duplication cache can be configured dynamically at runtime and allows system designers to trade off the cost of memory duplication against a minor performance overhead. We extensively analyze the effectiveness of our duplication cache technique and show that for a range of benchmarks memory duplication can be reduced by 60-90% with performance degradation ranging from 1-12%. On average, a duplication cache can reduce memory duplication by 60% for a performance overhead of 4% and by 90% for a performance overhead of 5%.

Keywords: high availability, memory duplication, duplication cache, low cost availability, selective replication

1 INTRODUCTION

High availability systems employ identical components in redundant configurations, e.g., processor duplication, to provide non-stop operation even in the presence of hardware faults. Redundancy supports fault detection, isolation, and recovery – all necessary operations in a complete high availability system implementation. Historically, high availability systems have been the province of mainframe computers [9] or specially designed fault-tolerant systems [7, 8]. However, high availability design techniques will become important in a wider range of future systems given the trends towards unreliable components [1]. For example, the high availability server market is expected to grow faster than the overall server market, particularly in the lower end commodity market segment [2].

The prevalent microprocessor industry trend is toward multi-core integrated circuits (a.k.a. chip multiprocessors (CMP)) [3, 4, 5] because of their ability to provide high performance and efficiency. CMPs also appear to be natural building blocks for future highly available systems because they contain multiple identical components (cores, caches, and memory controllers) that can be used in redundant configurations [22]. However, there is an important distinction between current high availability systems and future, larger high availability systems – cost. Today’s high availability systems like HP NonStop [7], Stratus [8] and IBM zSeries [9] are million dollar systems that spare little expense to guarantee high availability; future widely-deployed high availability systems will have to be much more cost-sensitive. Although dual modular redundancy (DMR) and triple modular redundancy (TMR) as used in current high availability systems provide excellent fault coverage, they impose 100% and 200% overheads, respectively. The high overhead of making all components redundant will be unacceptable in commodity CMP-based high availability systems, especially when all general purpose applications might not require high availability. For example, some general purpose applications might only care about high reliability with fail stop behavior, that is, either an operation is done reliably or the application is notified of the error – there must be no silent data corruption. For example, it might be acceptable for a home banking application to occasionally get an error notification and it might need to be restarted as long as the data is not lost or corrupted. There are two major system cost components – 1) on-chip resources including cores and caches, and 2) off-chip memory. As future chips scale to larger numbers of cores, duplicating on-chip resources, especially processor cores for running redundant computation, will lower the potential throughput of the system. Also, running all applications redundantly will increase the power consumption of the system and decrease its energy efficiency in terms of performance/watt. But transistors are going to be available in abundance in the future, and some of them can easily be used to increase the reliability/availability of the system. Duplicating all of main memory as in current high availability systems such as NonStop and Stratus, which use commodity processors with a custom chip set, is problematic given that system costs are going to be dominated by the cost of memory [10]. The IBM zSeries systems use highly customized processors with duplicate pipelines that have tightly lockstepped instruction level comparison prior to the memory interface. Instruction level comparison avoids duplicating all of
memory, and error correcting codes are used for memory reliability. However, using custom designed processors for fault tolerance is not suitable for future commodity high availability systems because it does not permit higher performance nonfault tolerant configurations. Further, tight lock-stepping is increasingly impractical and we will elaborate on some of the reasons in the next section. Future workloads are likely to have an increased memory footprint, for example, large enterprise-wide applications that consume vast amounts of memory space. Examples of these include databases, enterprise resource planning applications, decision support systems, and even memory intensive desktop applications processing multimedia files. Also, workload consolidation using virtualization is increasingly being used in data centers to reduce the total cost of ownership [15]. Workload consolidation can increase the memory usage of the system substantially. Fully duplicating memory for such workloads is going to be extremely expensive and might prohibit adoption of high availability systems in the cost-sensitive commodity market segment. Our research goal is high availability and high reliability systems that employ redundant commodity hardware, but do not have to pay a constant (high) overhead of replication. In this paper, we focus on the memory system by proposing a duplication cache as a way of significantly reducing memory duplication overheads. A duplication cache is a reserved area of main memory that holds copies of pages belonging to the write working set (set of pages the application is actively modifying) of active processes. All other pages are marked as readonly and are kept only as a single, shared copy. Our solution leverages the intuition that ECC and other memory protection techniques provide adequate error coverage once the data gets to memory, and memory duplication is essentially needed only for errors that originate in the processor core or cache and propagate to memory. The size of the duplication cache can be configured dynamically at runtime and allows system designers to trade off the cost of memory duplication with minor performance overhead. We extensively analyze the duplication cache technique for a set of total memory and cache sizes and show that our techniques can reduce memory duplication overhead by 60% with a performance impact of only 4% on average or by 90% with a performance impact of 5% on average. The range of overheads across SPEC CPU 2000 benchmarks is 1% for low memory benchmarks to 12% for high memory benchmarks. The rest of the paper is organized as follows. Section 2 provides background on fault tolerant systems including their memory architectures and identifies potential sources of improvement. Section 3 discusses our proposed duplication cache technique for reducing the overhead of memory duplication. In Sections 4 we present an evaluation of the duplication cache technique. We discuss the implications of our approach and present some observations on processor overhead reduction in section 5. We discuss related work in Section 6 and conclude the paper in Section 7.

2 BACKGROUND

2.1 Fault Tolerant Systems

There are many ways in which redundant components can be configured to achieve high reliability and/or high availability. The granularity of redundancy can be fine grained or coarse grained. For example, the IBM zSeries systems rely on fine grained duplication where there is a redundant copy of each sub-component of the chip – pipeline, cache controllers, memory controllers etc. These types of systems detect, isolate and recover from faults at the component level. Other systems like NonStop and Stratus use a coarse grained redundancy where the unit of replication, detection, isolation and recovery is an entire processor board and outputs are compared at the I/O level. However, we focus on low cost fault tolerant systems that take advantage of the inherent redundancy present in future CMPs, where there are multiple identical components on chip.

Figure 1. CMP-based high availability systems using configurable isolation. The system has private level 1 caches and a shared level 2 cache shown as banks B0-B7.

In order to implement a high availability system using a commodity CMP where resources like caches and memory controllers are shared, it must be possible to have complete isolation between the components running the redundant processes. One technique that provides a configurable degree of isolation was proposed in [22]. Using the configurable isolation techniques it is possible to partition the system into fully isolated color domains (red and green in Figure 1, black and grey in black and white print) which can then be used to map redundant computation. The main capability required is the ability to dynamically partition the on-chip interconnection network. A ring-based system (as shown in Figure 1) can be partitioned using self-checked [24] Ring Configuration Units (RCUs) [23] that are added in the interconnect. A self-checked unit can be shared without compromising the fault isolation of the system because it does not inject faults into the connected units.

A system with configurable isolation can be configured as a high availability system with I/O level voting and loose-lockstepped processors, similar to the NonStop DMR system. The NonStop software stack includes the NonStop kernel and critical software implemented as process pairs. Using the NonStop terminology, a single logical processor is implemented with two hardware processing elements (PEs). So, an eight core CMP (eight PEs) can implement four logical processors. A logical processor runs a redundant process pair. In a DMR configuration of the proposed architecture the OS schedules each process of a process pair to a different color. The cache and TLB state across the two colors can be different, but each PE writes to its own memory so that on any output operation (say to disk) the data pulled from either of the colors should be the same. Consequently, result comparison is done in the I/O system. The output operations of each PE in a logical processor are compared by a fully self-checked voter in the I/O hub. Therefore, in an eight core CMP, four voters compare the output operations from the four logical processors.

In the system presented above, all of main memory is duplicated. This overhead is too high for CMP-based systems that are targeted towards the lower end of the high availability server market. Memory elements have long been susceptible to faults due to the density of transistors in the memory elements. Therefore, memory fault tolerance has received considerable attention, and chip manufacturers and system designers incorporate mechanisms like extensive ECC, chipkill, DIMM sparing etc. that have been proven to make the memory system robust. Further, an important class of hard memory errors is fail stop in the sense that the memory module cannot be accessed when there is an error. For example, if an entire memory module fails then it doesn't respond. However, the memory system is susceptible to faults that originate in the processor core or caches and propagate to it. These faults are undetectable through memory system fault tolerance techniques like ECC. In this paper, we focus on devising techniques for checking all errors that either originate at or are propagated to memory and isolating them at a page level - but without using the standard technique of fully duplicating memory. Once an error is detected, we envision that the system can roll back to a fault-free state using previously stored checkpoints by using techniques similar to current high availability systems. We leverage the unique properties of a CMP-based system that can enable new optimizations over traditional fault tolerant systems - for example, dynamic sharing of components. Note that our techniques for error detection are applicable not only to high availability systems (as in Figure 1) that incorporate mechanisms for recovery, but also to other high reliability fault tolerant systems where only the ability to detect errors to enable fail stop behavior might be sufficient.
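To make the I/O-level voting concrete, the following C sketch shows the kind of comparison a self-checked voter performs on the output operations of a process pair. The structure layout, field names, and the recovery hook are hypothetical illustrations and not part of the NonStop or configurable-isolation designs.

#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for one output (I/O) operation issued by a PE. */
struct io_op {
    unsigned long device;     /* target device id        */
    unsigned long offset;     /* device offset / block   */
    size_t        len;        /* payload length in bytes */
    const void   *payload;
};

/* Assumed recovery hook: roll the process pair back to its last checkpoint. */
extern void rollback_to_checkpoint(int logical_cpu);

/* Compare the same output operation as produced by the two PEs of a logical
 * processor; only if the two copies agree is the I/O released to the device. */
int vote_and_release(int logical_cpu, const struct io_op *a, const struct io_op *b)
{
    if (a->device != b->device || a->offset != b->offset || a->len != b->len ||
        memcmp(a->payload, b->payload, a->len) != 0) {
        rollback_to_checkpoint(logical_cpu);   /* divergence: contain the error before I/O */
        return -1;
    }
    return 0;                                  /* outputs match: forward one copy (not shown) */
}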

2.2 Memory Architectures in High Availability Systems

There are a number of alternatives for constructing memory systems in high availability/reliability systems. There are two major alternatives, depending on the way in which checking is done. In one approach, redundant processors operate in tight lockstep, and in the other they operate in loose lockstep.

Traditional fault tolerant systems [12, 9] are designed with two pipelines or processors working in tight lockstep, executing the same instructions and using the same clock, with the outputs compared after every operation to detect any faults (Figure 2 (a)). In pipeline and tight lock-stepped systems, data errors are not allowed to propagate to caches or main memory respectively, and hence storage structures are not replicated. Rather, ECC is used for memory error detection and correction. Control structures such as cache controllers and memory bus adapters are replicated, and their signals are compared. In one alternative, redundant processors instead of pipelines may be used in tight lockstep, for example the Tandem system [12].

Figure 2. Block diagrams of current high availability systems. Hatched DIMMs represent duplicated memory. Non-hatched DIMMs are non-duplicated memory. V stands for voter. (a) Lock-stepped pipelines as in IBM zSeries processors. The outputs of the pipelines are compared before committing to cache, and individual components like the memory controllers are replicated and their signals compared to detect faults. Memory is not duplicated. (b) Loose lock-stepped processors similar to NonStop and Stratus with duplicated memory. The outputs are compared at the I/O level.

Unfortunately, microprocessor design trends make it increasingly impractical to construct tightly lock-stepped microprocessors [7], especially commodity processors, for the following reasons.

1. Low level recovery mechanisms for handling soft errors local to a processor complicate lock-step operation. For example, ECC mechanisms in processor-local caches and register retries can lead to two processors falling out of lock-step. Smaller die geometries result in higher soft error rates and more low-level recovery mechanisms.

2. All the system level components need to be designed with lockstepped operation in mind and must act in synchronization. Current commodity processors are not designed for such precise synchronization. For example, the memory controllers and the interrupt delivery systems act asynchronously and do not synchronize with each other.

3. Tightly lock-stepped processors cannot use power management techniques that incorporate variable frequencies. This is true even if the lockstepped processors are on the same chip and use the same clock. On the same chip, synchronized frequency shifting that is accurate to one cycle is extremely difficult. If the processors are on different chips, then a synchronized frequency shift coordinated to within one cycle is virtually impossible.

4. As transistor switching frequency increases and wire delay lags behind, pipeline-level lockstep comparisons would require multiple cycles to propagate the values from one pipeline to the other, with the associated addition of latches. The accumulated cost of the latches can be appreciable [13]. Also, as the number of cycles required for communication between replicated pipelines increases, the checkpointing of instructions needs to be delayed till the instructions can be compared. These delays can severely affect the performance of the system.

For these reasons (and others), the NonStop and Stratus systems have both moved towards loose-lockstepped processors that run identical instruction streams on standard processors (with custom chipsets). The replicated processes run on different boards, and their outputs are compared at the I/O level. Therefore, the processor, caches, and memory on a processor board are duplicated or triplicated (refer to Figure 2 (b)).

A benefit of loose lockstepping with comparison at the I/O level is that less comparison bandwidth is needed because I/O operations are less frequent than memory operations, and as a result do not require frequent synchronization of the replicas. Also, I/O level comparison provides 100% soft error coverage for all the circuitry on chip. The replicas can operate independently and tolerate variations in execution caused by processor-local error handling; e.g., cache retries. However, I/O level comparison in board-level redundant systems implicitly requires memory to be duplicated, as is done in the current NonStop and Stratus systems.

Hence, to the first order, the alternatives are tight lockstepping with no memory redundancy (other than ECC) or loose lockstepping with memory duplication. Because of the difficulty in supporting tight lockstepping in future systems, as well as the advantages of loose lockstepping, we have chosen the loose lockstepping approach. Before proceeding, we note that prior research has proposed cache coherence-based checking (fingerprinting) as an alternative to I/O level voting so that errors are not propagated to memory, thereby avoiding memory duplication [17]. However, the authors of the fingerprinting approach acknowledge that their approach requires frequent checkpoints and substantial changes to the cache coherence controller [14] - a difficult component to design and verify. We believe that the cache coherence scheme offers an interesting approach but needs custom design in future processors. In contrast, memory duplication has been proven to work in current systems. Hence, we direct our research toward an innovative method based on memory duplication, but which avoids duplicating all of memory at all times. The next section analyzes the fault tolerance properties of duplicate memory fault tolerant systems and identifies areas of potential overhead reductions in CMP-based high availability systems.

2.3 Potential for Reducing Memory Duplication

Having settled on loose lockstepping and I/O level memory checking, we identify two principal ways of reducing the need for full duplication of memory. Because we want to minimize hardware impact, we consider a method that is primarily software managed and consequently we focus on page-level management granularities. The basic issue is that if checking is not done at the memory interface, then a processor-produced error can propagate to memory (via a write) without being detected. ECC checking will not help because ECC would be computed on the erroneous data. Note that any data corruption that occurs while the data is resident in memory can be detected via ECC, however.

No duplication of read only pages – If the contents of a memory page are not modified by the application (i.e., are read only), any errors that originate in the processor core or caches cannot propagate to the page. If the page is protected by ECC, then ECC alone can be used to detect/correct soft errors that affect the page. Hence, such read-only pages do not require duplication.

Figure 3. Percentage of page access types (read-only pages vs. pages written) for SPEC CPU2000 benchmarks.

To evaluate this potential for overhead saving, we monitored dynamic loads and stores performed by an application during its execution. Figure 3 shows the number of pages that are only read and never modified by benchmark programs as a percentage of the total pages touched by the benchmark when simulated for 10 seconds of real time. On average, 18% of the pages in SPECint and 8% in SPECfp are read-only and don't need to be duplicated. While not insignificant, a reduction of 18% overhead (versus full duplication) is not a big win. Hence, we need another, more innovative way of reducing overhead.

Intermittent duplication of seldom-written pages – Many pages, although not being read-only, are written seldom (perhaps only once). If a page is written, but will not be written again for a long time interval (or never), then at a relatively low cost, the page can be "converted" to a read-only page. This conversion can be done by having system software compare the duplicated copies, and if they are the same (which will be the case unless there has been an error), one of the copies can be discarded while the other is marked as read-only. To study the potential for this approach to overhead reduction, Figure 4 (a) shows cumulative numbers of different pages that are written (during an interval of 10 seconds). On average, for SPECint, 91% of pages written have fewer than 1000 total writes to them and 18% have fewer than 300 total writes. For SPECfp, 69% of pages written have fewer than 1000 writes and 7% have fewer than 300 writes. This suggests that a fairly large number of pages are seldom written.

Figure 4. (a) Histogram of number of writes to pages (cutoff at 1000). (b) Frequency distribution of pages written (fraction of unique pages vs. average time interval between writes).

Also, we collect statistics on the average time intervals between writes to a page, shown in Figure 4(b). We notice that for SPEC CPU2000 the average time interval between writes is less than 10 cycles for 40% of pages written. Since the average interval between writes is small, this suggests that the writes to a page are typically bursty. So, if a page has been written around 1000 times and there has been no write for a long time (e.g., 2000 cycles) then the likelihood that it will be written again is very small. Such a page can then be converted to read only and reclaimed.

Tying all of it together: It is clear from the graphs presented above that duplicating the entire memory is wasteful because 1) read only pages do not need to be duplicated and 2) a large number of pages are written fewer than 1000 times and these writes typically happen in quick succession - such pages only need to be duplicated during the time they are being actively written (i.e., are a part of the write working set of the application). There is a wide variation between the sizes of the write sets of benchmarks; therefore, the amount of memory duplication required depends on the application behavior, which can be monitored at runtime. Based on these trends, we propose a novel technique to dynamically adjust memory duplication at runtime by using a cache of frequently written pages – the duplication cache.
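A minimal sketch of the kind of trace analysis behind Figures 3 and 4, assuming a simple per-page record of write counts and last-write times gathered from the simulator's load/store trace. The data structures and function names are hypothetical; the thresholds (the 1000-write cutoff and a 2000-cycle quiet period) are the illustrative values quoted above.

#include <stdbool.h>
#include <stdint.h>

/* Assumed per-page statistics collected while monitoring dynamic loads and stores. */
struct page_stats {
    uint64_t writes;      /* total writes observed to the page */
    uint64_t last_write;  /* cycle of the most recent write    */
};

/* A page that was never written needs no duplicate: ECC alone covers it. */
static bool is_read_only(const struct page_stats *p)
{
    return p->writes == 0;
}

/* Heuristic from the discussion above: writes are bursty, so a page that has
 * been quiet for a while (e.g., 2000 cycles) is unlikely to be written again
 * soon and can be converted back to a single read-only copy.                  */
static bool can_reclaim_duplicate(const struct page_stats *p, uint64_t now)
{
    const uint64_t QUIET_CYCLES = 2000;   /* illustrative threshold */
    return p->writes > 0 && (now - p->last_write) > QUIET_CYCLES;
}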

3 DUPLICATION CACHE

As is evident from the discussion above, the write working set of an application (the set of pages that an application is actively modifying) is a small percentage of the total pages that it reads and writes. For the purpose of fault tolerance it is only these pages that need to be duplicated at any given point in the application's execution. When a page is determined to be no longer in the write working set then the duplicated memory can be reclaimed. We propose the use of a duplication cache that captures and duplicates only the write working set of an application.

Figure 5. System with a duplication cache. Hatched DIMMs represent duplicated memory.

Conceptually, the duplication cache is a region of real main memory managed by system software. The duplication cache stores duplicates of the recently written pages. When a page is written for the first time a duplicate page is created in the cache. When adding another page to the cache would exceed its capacity, the least recently written page in the duplication cache is reclaimed by comparing it with its replica in main memory. If there is no mismatch (as will be the case unless there was an error) then the duplicate page is evicted from the cache, the page permissions for the replica in main memory are set to read only, and space is freed for the new duplicate. In case there is a mismatch between the compared pages, then the system has detected an error before it reached the I/O voter system. In this case, the system software initiates the same recovery mechanism that is initiated when an error is detected at the I/O system. Typically, the recovery mechanism is to roll back execution to a previously stored checkpoint and restart execution from the checkpoint. The memory checkpoint will contain the contents of the main memory and duplication cache at the time of the checkpoint. The duplication cache does not interfere with the recovery mechanism. By choosing an appropriate size of the duplication cache, the amount of memory duplicated can be restricted to roughly the current write set of the application. The replica manager keeps track of the percentage of physical memory allocated to the duplication cache and to main memory using software counters. Figure 5 shows a block diagram of a CMP-based loose-lockstepped high availability system with a duplication cache. In Figure 5 the processors share the memory hierarchy (the write paths are still isolated as described in Section 3.2) and only the hatched DIMMs contain replicated data.

3.1 Duplication Cache Operation

The operation of a duplication cache is managed by the software entity that manages the redundant threads in the high availability system. It could be the operating system, similar to the NonStop kernel [21] in NonStop systems, or an application-transparent virtual machine as in the Stratus systems [20]. In a commodity CMP system, we envision the use of a hypervisor layer to act as the replica manager [26]. The replica manager uses the page protection and memory access mechanisms to manage the duplication cache. All pages in the physical memory are initially marked as read only. When the two process replicas start executing, they generate virtual addresses for their reads and writes. Figure 6 shows the flowchart of a memory access. We describe the operation for the two types of accesses:

Read – On a read request to a 'read only' page (any page that has never been written before or one that has been written but evicted from the duplication cache and marked read only) the page table entries of both the replicas point to the same page. The two memory accesses are handled similarly to reads in a conventional memory system.

Write – On a write access the sequence of steps depends on the type of page. There are two cases:

Write to a read only page – A write to a page that is marked read only results in a protection fault handled by the replica manager. The replica manager checks if the page should have been writeable in the baseline system. If not, then a normal protection fault is signaled. If so, the access is legal, and the replica manager creates a duplicate page in the duplication cache. If the duplication cache is full (the common case), then the replica manager first evicts a page as described earlier: it identifies the least recently written page, compares it with its replica in main memory, and marks the main copy as read only while releasing the copy in the cache. The replica manager then copies the contents of the page in main memory to the duplication cache and marks both copies of the page as writeable. It then updates the page table and the TLB of one of the duplicated processes to point to the page in the duplication cache.

Write to a write-enabled page – A write to a write-enabled page means that there is already a replica in the duplication cache. The write proceeds to the correct copy using the real address translation stored in the memory access hierarchy (TLB and/or page table).

The replica manager can partition the real memory in the system between main memory and duplication cache. Also, it can change this partition dynamically by changing the start and end address of the duplication cache. For example, the replica manager can start with a small duplication cache initially, and dynamically increase its size at runtime if there are a lot of page evictions from the duplication cache.

Figure 6. Flowchart of memory accesses.

The size of the duplication cache allows trading off cost for performance. The main causes of performance overheads that a duplication cache introduces are:

The first write to a page: one page protection fault + one page copy – The first write to a page generates a page protection fault. This page protection fault is a minor page fault [28] because the page is already present in memory and need not be brought in from disk. Therefore, the cost of this page fault is several orders of magnitude lower than a conventional page fault [28]. The replica manager then needs to copy the page to the duplication cache.

Page eviction: one page comparison – When a page is evicted then it is compared to its replica in main memory.

Re-entry of a page into the duplication cache: one page protection fault + one page copy – When a page is evicted from the duplication cache it is made read only. If it is written again it then generates a page protection fault similar to the first write to a page and then requires a page copy to duplicate it in the cache.
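The following C-style sketch pulls the pieces of the flowchart together: the write-fault path, the least-recently-written eviction with its page comparison, and the re-entry path. For brevity it is written as if the replica manager ran at user level using mprotect/memcmp/memcpy; a hypervisor implementation would manipulate nested page tables instead, and all names here (the helper functions, structure fields, and capacity handling) are our own illustrative assumptions rather than a defined interface.

#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096u

struct dup_entry {                 /* one slot of the duplication cache        */
    void *master;                  /* page frame in main memory                */
    void *dup;                     /* its duplicate inside the reserved region */
    int   in_use;
};

/* Assumed replica-manager helpers (hypothetical, not defined here):           */
extern struct dup_entry *lrw_oldest(void);   /* least recently written slot     */
extern struct dup_entry *free_slot(void);    /* an unused slot, NULL if full    */
extern void remap_replica_b(void *master, void *target); /* retarget replica B's PTE/TLB */
extern void initiate_recovery(void);         /* same rollback as an I/O-voter mismatch; does not return */

/* Evict the least-recently-written page: compare the two copies and, if they
 * agree, fold them back into a single read-only page in main memory.          */
static void evict_one(void)
{
    struct dup_entry *e = lrw_oldest();
    if (memcmp(e->master, e->dup, PAGE_SIZE) != 0)
        initiate_recovery();                 /* error caught before it reaches the I/O voter */
    remap_replica_b(e->master, e->master);   /* both replicas share one copy again */
    mprotect(e->master, PAGE_SIZE, PROT_READ); /* the next write will fault again  */
    e->in_use = 0;                           /* slot (and its duplicate page) freed */
}

/* Handler for the minor protection fault raised by the first write to a page,
 * or by a write to a page that was earlier evicted and marked read only.      */
void on_write_fault(void *master, int writable_in_baseline)
{
    if (!writable_in_baseline)
        return;                    /* a genuine violation: deliver the normal fault */

    struct dup_entry *e = free_slot();
    if (e == NULL) {               /* duplication cache full (the common case)      */
        evict_one();
        e = free_slot();
    }
    memcpy(e->dup, master, PAGE_SIZE);       /* create the duplicate (one page copy) */
    e->master = master;
    e->in_use = 1;
    mprotect(master, PAGE_SIZE, PROT_READ | PROT_WRITE);
    mprotect(e->dup, PAGE_SIZE, PROT_READ | PROT_WRITE);
    remap_replica_b(master, e->dup);         /* replica B now writes the duplicate   */
}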

3.2 Duplication Cache Implementation

One of the attractive features of the proposed duplication cache is that it is simple to implement in CMP-based high availability systems. All the mechanisms used are well understood by current OSes for optimizations such as Copy on Write (CoW) that are used to increase memory system utilization and overall system performance. We next outline the changes required in a CMP-based high availability system to implement a duplication cache. Figure 7 shows a detailed physical design of a CMP-based high availability system, similar to the one proposed in [22], with a duplication cache implementation. In [22] the RCU either partitions the system into multiple domains or allows unconstrained sharing. We modify the RCU to enable it to either a) allow sharing for read only data or b) partition the interconnect for writes. The detailed design of the RCU is shown in Figure 7(b). Although we show a ring-based system in this paper, the techniques are also applicable to other types of interconnects.

Reads to non-duplicated pages carry a special 'read only' bit identifying them to the RCU. The RCU allows these reads to non-duplicated pages to travel on the ring unconstrained to any memory controller in the system – logically shown as the dotted (blue) ring line in the figure. However, when not running in shared mode, writes are routed only within their own color (partition) for fault isolation purposes. As shown in Figure 7, the 'read only' (blue or dark grey) and write pages (red or green, or black and grey in black and white print) can be mixed on a single DIMM because a replica of all the information in a single DIMM exists somewhere in the system (either on disk or the duplicated page). The only requirement is that the duplication cache and the main memory replica should not be on the same DIMM.
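As a rough illustration of the modified RCU's routing rule (the hardware itself is a set of self-checked multiplexers, not software), the decision it implements can be summarized as below; the request fields and names are hypothetical.

#include <stdbool.h>

struct ring_request {
    bool read_only;   /* 'read only' bit carried by reads to non-duplicated pages */
    int  color;       /* partition (color) of the issuing core                    */
    int  dest_color;  /* partition that owns the target memory controller         */
};

/* True if the request may leave its own partition and travel anywhere on the
 * ring; writes stay inside their own color to preserve fault isolation.       */
static bool rcu_may_cross(const struct ring_request *r, bool shared_mode)
{
    if (shared_mode)      /* system configured without isolation                 */
        return true;
    if (r->read_only)     /* reads to single-copy pages may use any controller   */
        return true;
    return r->color == r->dest_color;   /* writes are confined to their color    */
}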

Figure 7. CMP-based high availability system. (a) Physical design of the configurable memory duplication - green and red are partitioned components and blue represents shared data/components. (b) Ring Configuration Unit (RCU). RO means 'Read Only' bit and S means that the system is running in a shared configuration.

Note that a duplication cache is not possible in traditional board-level redundant high availability systems like NonStop and Stratus because there are no means to share memory. In this sense, CMP-based systems are ideally suited for such overhead reducing optimizations.

3.3 Duplication Cache Advantages

A duplication cache offers several advantages:

1. The biggest advantage of a duplication cache is that it allows the overhead of memory duplication to be dynamically customized to the application needs. The applications that do not have a large write working set do not need to pay the constant 100% overhead of memory duplication.

2. A duplication cache enables a tradeoff between cost and performance, which allows the system designer to satisfy different service level agreements.

3. Because memory pages that are evicted from the duplication cache are compared with their replicas, some errors can be detected earlier than they would be if the entire memory is replicated and comparisons were only at the I/O level. This can improve the availability of the system.

4 EVALUATION

We evaluated the memory-sharing optimizations using the HP Labs' COTSon simulator, a full system x86/x86-64 simulator [16] that can boot an unmodified Windows or Linux OS and execute complex applications. The SPEC CPU2000 benchmarks were compiled with gcc/g77 version 4.0 at the -O3 optimization level. In order to evaluate just the execution of the benchmarks, we restore a snapshot of the system taken just after booting when the machine is idle (except for standard OS housekeeping tasks) and directly invoke the execution of the benchmark from a Linux shell. The timing simulation begins just after the execution command is typed in the OS console and runs till completion. We list the simulator configuration in Figure 8 (a).

We first simulated all the benchmarks from the SPEC CPU2000 suite individually and then combined multiple benchmarks to simulate workloads. The most viable and currently used workloads for loosely-lockstepped systems like ours are single threaded [7]. Therefore, for an evaluation consistent with current high availability systems, we wanted to use single threaded workloads with heterogeneous memory requirements, which SPEC provides. In order to better model large production workloads, as is commonly done, we scaled down the caches significantly. We divided the individual benchmarks into four categories (Figure 8 (b)) based on their resident set size in memory [19] and chose a base memory configuration for each category. The four base memory sizes – 96 MB, 128 MB, 192 MB, and 256 MB – were chosen so that they can comfortably fit the combined application and operating system memory footprint. For the combined workload experiments, we constructed four workloads - very large memory, large memory, medium memory and low memory - using 8 benchmarks for each workload. We simulated duplication cache sizes of 10%, 20%, 30% and 40% of each base memory size. For example, if the base memory size is 128 MB for vpr and the duplication cache is 10% (12.8 MB), then the total memory for the vpr simulation is 140.8 MB. The percentage of memory duplication supported affects the number of pages that are evicted from the duplication cache and made read only (page evictions) and the number of pages that were evicted earlier but are written to again and thus re-enter the duplication cache (page re-entries).

Figure 8(a): Simulator configuration.
Instruction set: x86_64
Core clock: 2200 MHz
Bus clock: 550 MHz
Issue width: 4
Branch predictor: 2-level (g-share), 16384 entries
L1 icache: 16 kB, 64 B lines, 2-way
L1 dcache: 16 kB, 64 B lines, 2-way
L1 itlb: 32 entries, 8-way
L1 dtlb: 32 entries, 8-way
L2 cache: 256 kB (1 core), 2 MB (8 core), 4-way
L2 dtlb: 512 entries, 4-way
Main memory access: 220 cycles

Figure 8(b): Benchmark and workload description (resident set size in MB in parentheses).
Individual benchmarks:
256 MB: INT - gcc(154), mcf(190), gap(192), gzip(180), bzip2(185), perlbmk(146); FP - wupwise(176), swim(191), applu(181), apsi(191)
192 MB: INT - vortex(72); FP - lucas(142), fma3d(103)
128 MB: INT - vpr(50), parser(37); FP - mgrid(56), galgel(63), equake(49), sixtrack(26)
96 MB: INT - crafty(2), eon(0.6), twolf(3.4); FP - mesa(9.4), art(3.7), facerec(16), ammp(26)
Workload consolidation:
2046 MB (very large): gzip, mcf, gap, bzip2, swim, apsi, applu, wupwise
1536 MB (large): lucas, fma3d, vortex, perlbmk, apsi, swim, crafty, twolf
1024 MB (medium): mgrid, galgel, equake, sixtrack, vpr, parser, vortex, fma3d
768 MB (low): mesa, art, facerec, ammp, crafty, eon, twolf, sixtrack

Figure 8. (a) Simulator configuration. (b) Benchmark and workload description.

Performance overhead modeling: The number of page evictions and re-entries that a particular size of the duplication cache entails affects the performance overhead imposed by a duplication cache. We developed a simulation based performance overhead model to characterize the overheads. As per the discussion of overheads in Section 3.1, the number of overhead cycles can be calculated as:

Overhead cycles = (# page evictions) x (cost of comparing one page) + (# pages written) x (cost of a page copy + cost of a page protection fault) + (# page re-entries) x (cost of a page copy + cost of a page protection fault)

We measured the number of page evictions, the number of unique pages written and the number of page re-entries for each cache size for all the benchmarks simulated. We estimated the cost of comparing and copying one page of 4 kB using published results for the STREAM benchmark [27] for three different 64-bit commodity processors (Itanium, Opteron and G5) using standard operating systems. Although our simulator configuration is similar to the 2 GHz Opteron with Linux that was benchmarked in the above referenced study, we conservatively use the lowest bandwidth numbers reported – those for the G5. The copy bandwidth reported was 3012 MB/s and the sum or compare bandwidth was 2577 MB/s. As discussed in Section 3.1, the page faults that the duplication cache generates are minor page faults and do not impose the substantial overhead of a conventional page fault that brings in a page from disk. This fault only requires a trap to the replica manager to perform the page copy and then one memory access to mark the page writeable. We assume a conservative 1000 cycles as the cost of a minor page fault. We also performed sensitivity studies with page fault costs of 2000 and 500 cycles and observed that the average change in overhead was less than 0.5%.
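A small sketch of the overhead model as applied above, with the per-page costs derived from the 4 kB page size, the 2200 MHz simulated core clock (Figure 8(a)), and the conservative G5 STREAM bandwidths (copy 3012 MB/s, compare 2577 MB/s); the 1000-cycle minor-fault cost is the default used in the sensitivity study. The function and macro names are ours, not part of the simulator.

#include <stdint.h>

/* Per-page costs in core cycles, derived from the values quoted in the text. */
#define PAGE_BYTES      4096.0
#define CORE_HZ         2200e6             /* 2200 MHz simulated core clock    */
#define COPY_BW         (3012.0 * 1e6)     /* bytes per second (G5 STREAM copy)    */
#define COMPARE_BW      (2577.0 * 1e6)     /* bytes per second (G5 STREAM compare) */
#define MINOR_FAULT_CYC 1000.0             /* 500 and 2000 used for sensitivity    */

static double page_copy_cycles(void)    { return PAGE_BYTES / COPY_BW    * CORE_HZ; }
static double page_compare_cycles(void) { return PAGE_BYTES / COMPARE_BW * CORE_HZ; }

/* Overhead cycles charged to a run, given the counts measured in simulation. */
double overhead_cycles(uint64_t page_evictions,
                       uint64_t pages_written,
                       uint64_t page_reentries)
{
    double first_write = page_copy_cycles() + MINOR_FAULT_CYC;
    return page_evictions * page_compare_cycles()   /* eviction: one page comparison */
         + pages_written  * first_write             /* first write to each page      */
         + page_reentries * first_write;            /* page re-enters the cache      */
}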

Sections 4.1 and 4.2 discuss results for individual benchmarks and combined workloads, respectively.

4.1 Individual Benchmarks

Figure 9 shows the number of page evictions and the number of page re-entries per million cycles for various benchmarks classified according to their base memory size. There is a wide variation in the degree of duplication required to reduce the number of page evictions and page re-entries to reasonable numbers. For benchmarks with a lower base memory size, even 10% duplication is enough to hold the write working set, and these benchmarks do not experience many page evictions and re-entries. For example, this is the case with mesa in the 96 MB category, with approximately only 1 page eviction and re-entry per million instructions. On the other hand, for benchmarks like gap in the 256 MB category even 40% duplication does not reduce the high number of page evictions and re-entries. However, most of the workloads experience a significant decrease in the number of page comparisons and re-entries as the size of the duplication cache is increased, and the number of page comparisons and re-entries falls below 10 per million cycles. Also note that the trends for both page evictions and re-entries seem similar, but the number of re-entries is smaller than the number of page evictions because any new page written by the application in steady state might cause a page eviction if the duplication cache is full. However, if a page is evicted and never referenced again it will not re-enter the duplication cache.

[Figure 9 plots: page evictions and page re-entries per million cycles versus memory duplication (10%-40%) for the benchmarks in the 96 MB, 128 MB, 192 MB, and 256 MB base memory groups.]

Figure 9. Page evictions and re-entries for various base memory sizes.

Figure 10. Average performance degradation across all benchmarks grouped according to base memory size.

Figure 10 shows the results across all benchmarks grouped according to their base memory size. The results include the cold-start effect of populating the duplication cache initially. However, the duplicate memory can be populated with writeable pages at start time and the cold start effects can be mitigated. We found that the duplication cache consistently performs well for all benchmark groups. The worst case overhead is around 7.5% for the benchmark group with the largest base memory size (256 MB) and the smallest duplication cache (10%).

4.2 Workload Consolidation Evaluation

Figure 11 shows the average performance overheads experienced by our workload consolidation benchmarks. The performance overhead ranges from around 1% for a low memory workload to around 9% for a very large memory workload with 10% memory duplication. The results show that memory duplication can enable low cost high availability without a significant decrease in performance.

Figure 11. Performance overhead for combined workloads.

For the sake of conciseness, we do not present the page eviction and page re-entry graphs for the workloads. They mirror the trends of the individual benchmarks in that across all workload types the number of page evictions and re-entries decreases as the size of the duplication cache increases. The rate of decrease in page evictions and re-entries is much higher in low memory workloads when compared with the larger memory workloads.

5 DISCUSSION

Rich design space of policies: Our techniques for reducing memory duplication and core redundancy overhead open up a rich design space of policies. For example, the memory duplication size can be set statically based on cost or dynamically depending on the write working set of the application or the performance constraints of a system. We plan to explore the various policies and their tradeoffs in future work.

Extensibility: We have highlighted the benefits of reducing memory duplication assuming a DMR system. Our scheme can easily be extended to TMR systems with additional cost savings. We have evaluated our techniques for single threaded workloads because current loose-lockstepped high availability systems like NonStop only run single threaded workloads. However, our design does not violate the memory consistency of the processor. The ordering of reads and writes by the two replicas would be the same as in a system with complete memory duplication. Therefore our scheme is applicable to future high availability systems that might run multi-threaded workloads redundantly.

Implementation simplicity: Our techniques provide a tradeoff between application performance and memory cost. A duplication cache can be implemented primarily in software (e.g., in a hypervisor), with minimal hardware changes required to identify read only requests.

Other benefits: We have only focused on the cost and performance tradeoff in this paper. However, reducing the duplication of resources would also provide power savings. Also, reducing the overheads makes the notion of system level availability practical in commodity market segments, especially when the overheads can be controlled dynamically according to the application characteristics.

6 RELATED WORK

We discussed most of the related commercial fault tolerant systems and their duplication overheads in Section 2. To the best of our knowledge ours is the first proposal for reducing the memory duplication in loose lock-stepped processors with I/O level voting. The closest related work for duplication cache is the memory shadow scheme used by Sequoia fault tolerant systems [11]. Sequoia is a tightly lock-stepped system that uses shared memory. These systems do not need to duplicate the memory for error detection because they are a tightly lockstepped system with voting for individual components. However, the Sequoia systems use redundant memory paths and duplicate or shadow all the pages written. Both the processors read from the primary page only and the shadow copy is used only for recovery purpose and not for error detection. Furthermore, they duplicate all the pages written by the application all the time and do not manage the degree of duplication with a duplication cache.

Academic work has proposed several variants of integrated checking at the processor level that are similar to the IBM zSeries approach [29, 30]. Several studies have also evaluated core-level fault detection and containment by running redundant processes, either on a separate core or on a separate thread [6, 25, 31, 32], to reduce the overheads. Also, as pointed out in Section 2.2, while fingerprinting [17, 14] is a viable design choice for future customized high availability processors, our techniques do not require any changes to the cache controller and can be implemented in software with minimal hardware changes (to identify read-only requests) in current commodity systems. Unfortunately, many of these previous approaches require custom changes to the processor or do not provide fault tolerance to components other than the core. In contrast, loose lockstepped approaches like ours with I/O level detection are typically less expensive, by virtue of performing fault tolerance at a coarser level. Our work is also different in its approach to leveraging commodity multi-core processors with little additional on-chip support for providing high levels of availability and fault containment. Reducing memory duplication overheads in loose-lockstepped processors makes them cost competitive with the custom design approaches that do instruction level or memory level comparison. Although our technique is similar to copy-on-write and software DSM [33], there are important fault tolerance differences, such as the page comparison required on eviction of a page from the duplication cache. Copy-on-write and software DSM techniques have not been used for fault tolerance because the underlying hardware was shared and could propagate errors. Also, the memory hierarchy was shared and could not be isolated as is possible in a configurable isolation architecture.

7 CONCLUSIONS

Expected increases in error rates with technology scaling will make high availability a key challenge for future multicore-based commodity systems. Comparison of the output of redundant execution streams will be needed to address higher transient error rates. However, as discussed earlier, approaches like specialized pipeline duplication (as in the IBM zSeries) or tight lockstepping (e.g., the prior Tandem systems) are impractical in the context of commodity components. A recent study [22] proposed configurable isolation to implement I/O-level comparison of redundant process executions, but suffered from the limitation that all key structures in the system were duplicated - causing a 100% area overhead and a 100% power penalty. In this paper, we seek to address these overheads, especially for duplicating memory. Our solution leverages the intuition that ECC and other memory protection techniques provide adequate error coverage once the data gets to memory, and memory duplication is essentially needed only for errors that originate in the processor core or cache and propagate to memory. Correspondingly, we propose a novel solution that seeks to duplicate only the write working set of an application in a duplication cache. The applications that do not have a large write working set do not need to pay the constant 100% overhead of memory duplication. Also, a duplication cache enables tradeoffs between cost and performance. Using detailed simulation results for SPECint and SPECfp benchmarks, we demonstrate the effectiveness of our solution in significantly reducing the overheads from memory, from a 40% reduction in overhead with a performance impact of 4% on average to a 90% reduction in overhead with a performance impact of 5% on average. Overall, the duplication cache technique significantly reduces the overhead of memory duplication in future high-availability systems built on commodity multicores. As future system designers increasingly grapple with the tension between providing high availability in the face of increasing error rates, but doing it with the lowest overhead on high-volume components, we expect techniques similar to ours to become increasingly important.

ACKNOWLEDGMENT

The first author would like to thank Kyle Nesbit for various discussions regarding the duplication cache idea. We would like to thank Prasun Agarwal, Jung-Ho Ahn, Paolo Faraboschi, Matteo Monchiero, Daniel Ortega, Dana Vantrease, and David Wood for help with the simulator and other discussions.

REFERENCES

[1] Borkar, S. Challenges in Reliable System Design in the Presence of Transistor Variability and Degradation. IEEE Micro, vol. 25, no. 6, Nov.-Dec. 2005, pp. 10-16.
[2] IDC #204815. Worldwide and U.S. High-Availability Server 2006-2010 Forecast and Analysis. Dec. 2006.
[3] Kongetira, P., Aingaran, K., and Olukotun, K. Niagara: A 32-way Multithreaded SPARC Processor. IEEE Micro, vol. 25, no. 2, pp. 21-29, 2005.
[4] Keltcher, C. N., McGrath, K. J., Ahmed, A., and Conway, P. The AMD Opteron Processor for Multiprocessor Servers. IEEE Micro, vol. 23, no. 2, pp. 66-76, 2003.
[5] Intel Quad Core, http://www.intel.com/quadcore/index.htm
[6] Vijaykumar, T. N., Pomeranz, I., and Cheng, K. Transient-Fault Recovery Using Simultaneous Multithreading. In Proceedings of the 29th International Symposium on Computer Architecture, May 2002.
[7] Bernick, D., Bruckert, B., Vigna, P. D., Garcia, D., Jardine, R., Klecka, J., and Smullen, J. NonStop Advanced Architecture. International Conference on Dependable Systems and Networks, 2005, pp. 12-21.
[8] Harrison, E. S., and Schmitt, E. J. The Structure of System/88, a Fault-Tolerant Computer. IBM Journal of Research and Development, vol. 26, no. 3, p. 293, 1987.
[9] Fair, M. L., Conklin, C. R., Swaney, S. B., Meaney, P. J., Clarke, W. J., Alves, L. C., Modi, I. N., Freier, F., Fischer, W., and Weber, N. E. Reliability, Availability, and Serviceability (RAS) of the IBM eServer z990. IBM Journal of Research and Development, Nov. 2004.
[10] Ekman, M. and Stenstrom, P. A Cost-Effective Main Memory Organization for Future Servers. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), 2005.
[11] Bernstein, P. A. Sequoia: A Fault-Tolerant Tightly Coupled Multiprocessor for Transaction Processing. Computer, vol. 21, no. 2, Feb. 1988, pp. 37-45.
[12] Bartlett, W. and Ball, B. Tandem's Approach to Fault Tolerance. Tandem Systems Review, vol. 4, no. 1, Feb. 1998, pp. 84-95.
[13] Meaney, P. J., Swaney, S. B., Sanda, P. N., and Spainhower, L. IBM z990 Soft Error Detection and Recovery. IEEE Transactions on Device and Materials Reliability, pp. 419-427, 2005.
[14] Gold, B. T., Smolens, J. C., Falsafi, B., and Hoe, J. C. The Granularity of Soft-Error Containment in Shared Memory Multiprocessors. In Proceedings of the Workshop on Silicon Errors in Logic - System Effects (SELSE), 2006.
[15] Marty, M. R. and Hill, M. D. Virtual Hierarchies to Support Server Consolidation. SIGARCH Computer Architecture News, vol. 35, no. 2, Jun. 2007, pp. 46-56.
[16] Falcon, A., Faraboschi, P., and Ortega, D. Combining Simulation and Virtualization through Dynamic Sampling. ISPASS 2007.
[17] Smolens, J. C., Gold, B. T., Kim, J., Falsafi, B., Hoe, J. C., and Nowatzyk, A. G. Fingerprinting: Bounding Soft-Error Detection Latency and Bandwidth. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2004.
[18] Aggarwal, N., Ranganathan, P., Jouppi, N. P., Smith, J. E., Saluja, K. K., and Krejci, G. Motivating Commodity Multi-Core Processor Design for System-level Error Protection. Workshop on Silicon Errors in Logic - System Effects (SELSE), 2007.
[19] SPEC Benchmark Suite. http://www.spec.org and http://www.spec.org/cpu/analysis/memory/
[20] Bressoud, T. C. TFT: A Software System for Application-Transparent Fault Tolerance. International Symposium on Fault-Tolerant Computing (FTCS), 1998.
[21] Bartlett, J. F. A NonStop Kernel. In Proceedings of the Eighth ACM Symposium on Operating Systems Principles (SOSP '81), 1981.
[22] Aggarwal, N., Ranganathan, P., Jouppi, N. P., and Smith, J. E. Configurable Isolation: Building High Availability Systems with Commodity Multi-Core Processors. In Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA-34), June 2007.
[23] Aggarwal, N., Ranganathan, P., Jouppi, N. P., and Smith, J. E. Isolation in Commodity Multicore Processors. Computer, vol. 40, no. 6, pp. 49-59, June 2007.
[24] Smith, J. E. and Metze, G. Strongly Fault Secure Logic Networks. IEEE Transactions on Computers, 1978.
[25] Reinhardt, S. K. and Mukherjee, S. S. Transient Fault Detection via Simultaneous Multithreading. In Proceedings of the 27th International Symposium on Computer Architecture, June 2000.
[26] Bressoud, T. C. and Schneider, F. B. Hypervisor-Based Fault Tolerance. ACM Transactions on Computer Systems, vol. 14, no. 1, Feb. 1996, pp. 80-107.
[27] Purkayastha, A., Guiang, C. S., Schulz, K., Minyard, T., Milfeld, K., Barth, W., Hurley, P., and Boisseau, J. R. Performance Characteristics of Dual-Processor HPC Systems Based on 64-bit Commodity Processors. International Conference on Linux Clusters: The HPC Revolution, 2004.
[28] Ezolt, P. A Study in Malloc: A Case of Excessive Minor Faults. In Proceedings of the 5th Annual Linux Showcase & Conference, USENIX Association, Berkeley, CA, 2001.
[29] Austin, T. M. DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. In Proceedings of the 32nd International Symposium on Microarchitecture, November 1999.
[30] Qureshi, M. K., Mutlu, O., and Patt, Y. N. Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors. In Proceedings of the 32nd International Symposium on Computer Architecture (ISCA-32), June 2005.
[31] Gomaa, M., Scarbrough, C., Vijaykumar, T. N., and Pomeranz, I. Transient-Fault Recovery for Chip Multiprocessors. In Proceedings of the 30th International Symposium on Computer Architecture, June 2003.
[32] Sundaramoorthy, K., Purser, Z., and Rotenberg, E. Slipstream Processors: Improving both Performance and Fault Tolerance. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, October 2000.
[33] Protic, J., Tomasevic, M., and Milutinovic, V. Distributed Shared Memory: Concepts and Systems. IEEE Parallel & Distributed Technology: Systems & Applications, vol. 4, no. 2, pp. 63-71, 1996.
