The cache and memory subsystems of the IBM POWER8 processor

In this paper, we describe the IBM POWER8* cache, interconnect, memory, and input/output subsystems, collectively referred to as the "nest." This paper focuses on the enhancements made to the nest to achieve balanced and scalable designs, ranging from small 12-core single-socket systems up to large 16-processor-socket, 192-core enterprise rack servers. A key aspect of the design has been increasing the end-to-end data and coherence bandwidth of the system, now featuring more than twice the bandwidth of the POWER7* processor. The paper describes the new memory-buffer chip, called Centaur, providing up to 128 MB of eDRAM (embedded dynamic random-access memory) buffer cache per processor, along with an improved DRAM (dynamic random-access memory) scheduler with support for prefetch and write optimizations, providing industry-leading memory bandwidth combined with low memory latency. It also describes new coherence-transport enhancements and the transition to directly integrated PCIe** (PCI Express**) support, as well as additions to the cache subsystem to support higher levels of virtualization and scalability, including snoop filtering and cache sharing.
W. J. Starke, J. Stuecheli, D. M. Daly, J. S. Dodson, F. Auernhammer, P. M. Sagmeister, G. L. Guthrie, C. F. Marino, M. Siegel, B. Blaner

Digital Object Identifier: 10.1147/JRD.2014.2376131

Introduction

The IBM POWER8* processor, shown in Figure 1, is IBM's latest generation of POWER* processors, boasting significant increases in thread, core, and system performance, scaling to large core-count SMPs (symmetric multiprocessors) in order to support big data analytics, cognitive computing, and transaction processing. These workloads require large memory capacities with high bandwidth, accessed with low latency, and optimally supporting locking and shared data. The performance of these systems, from 12-core single-socket systems up to the largest 192-core 16-processor-socket SMPs, depends on the caches, coherence, data interconnects, memory subsystems, and I/O subsystems to provide the cores with all the data they demand. Additionally, the POWER8 core has roughly doubled in performance compared to the IBM POWER7* core, with roughly 1.5 times the single-thread performance and support for up to eight concurrent threads (SMT8) [1].
The POWER processor family has consistently excelled at large SMP systems, and the POWER8 processor builds upon that history [2]. The POWER8 design provides industry-leading memory bandwidth and capacity, allowing the cores to run at full speed, while minimizing average memory latency and power consumption. This paper describes the architectural enhancements made to the POWER8 cache, interconnect, memory, and input/output subsystems, collectively referred to as the "nest" (shown in Figure 2), to achieve a balanced and scalable design. A major focus of the design has been increasing the end-to-end data and coherence bandwidth of the system (more than twice that of the POWER7 processor). The paper describes the new memory-buffer chip, called Centaur, with up to 128 MB of eDRAM buffer cache per processor, along with an improved DRAM scheduler with support for prefetch and write optimizations, providing industry-leading memory bandwidth combined with low memory latency. It also describes new coherence-transport enhancements to optimize performance on shared data, and integrated PCI Express** (PCIe**) support to enable higher-bandwidth, lower-latency I/O (input/output) operations.
Figure 1 Annotated die photo of the POWER8 chip showing 12 processor cores each with local L2 and L3 cache, on-chip interconnect, memory controllers, onboard PCIe, and off-chip SMP interconnects.
The rest of this paper is organized as follows. The "Cache hierarchy" section describes the L2 (level 2) and L3 (level 3) caches. The "Memory subsystem" section describes the memory subsystem, including the new Centaur chip with its L4 (level 4) cache and technology-agnostic memory channel. The "I/O subsystem" section describes the I/O subsystem, while the "SMP interconnect" and "On-chip accelerators" sections describe the processor data and coherence interconnects with coherence filtering and the nest accelerators available in the POWER8 processor, before a brief conclusion.
Cache hierarchy

The cache hierarchy for the POWER8 processor builds on the basic organization created for the POWER7 architecture [3]. The cache hierarchy local to a processing core comprises a store-through L1 (level 1) data cache [1] and a relatively small but low-latency, high-bandwidth, private, store-in L2 (level 2) cache built from SRAM (static random-access memory). Additionally, the POWER8 processor chip has up to 96 MB of large, dense, shared L3 (level 3) cache, comprising 8 MB cache regions built from eDRAM [4], each local to a processor core. The L3 cache-management algorithm selectively migrates data to the local L3 region attached to the core that is using the data, even cloning heavily used shared data copies as needed and collapsing them as they fall out of use. The same 13-state coherence protocol developed for the POWER6* architecture [5] and the POWER7 architecture is utilized for the POWER8 architecture.
While similar in form, the POWER8 processor's cache hierarchy has been significantly enhanced to accommodate the computational strength of the core; the core has roughly doubled in performance relative to the POWER7 processor core for most workloads [6], supports twice as many hardware threads, and increases the L1 data-cache capacity from 32 KB to 64 KB [1]. As shown in Table 1, the L2 capacity per core has grown from 256 KB to 512 KB, while the L3 capacity per core has increased from 4 MB to 8 MB. The number of cores has grown from 8 to 12, and the aggregate L3 capacity per chip was extended from 32 MB to 96 MB. The latencies to each level of cache remain roughly the same as in the POWER7 processor [3]. Finally, the POWER8 processor supports up to 128 MB of shared L4 (level 4) cache per processor, included in the new Centaur chip. The L4 memory cache is detailed in the "Memory subsystem" section of this paper.

Designed for big data workloads, most data paths throughout the L1 data cache and core-execution pipelines are twice as wide (providing twice the data width per processor clock) as those found in the POWER7 processor [1]. As described in [1], this double-wide dataflow extends through the L2 load and store paths, L2 cache arrays, local L3 region read/write paths and cache arrays, and seamlessly through the on-chip interconnect to the memory read/write interfaces. More in-flight requests must be tracked to manage the increased traffic flow enabled by the double-wide dataflow. Table 1 depicts the growth from the POWER7 processor to the POWER8 processor for the major classes of L2 and L3 cache resources that manage the increased flow. The L2 core store-gather cache consolidates store-through traffic from the core to optimize updates to the L2 cache. L2 core read/write state machines manage coherence negotiations with the system and L2 cache reads and writes. L2 cache reads and writes can originate from consolidated core-store traffic, core data loads, instruction fetches, and address-translation fetches. L2 castout state machines manage capacity-related migration of L2 data to the L3 cache or memory, while L2 system-read/write state machines manage coherence and data requests from the rest of the system. L3 core-read state machines handle core-fetch traffic for L3 hits, while L3 write state machines manage L2 castouts (data evicted from the L2), lateral L3-region castouts (data evicted from one L3 region to another), and cache injection (data installed directly into the L3 from a remote agent). L3 prefetch-write state machines stage prefetch data from memory to the L3 cache. L3 castout and L3 system-read/write state machines perform the same functions as their L2 counterparts.
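To keep the capacity figures straight, the minimal Python sketch below simply tallies the per-core and per-chip cache sizes quoted above and in Table 1; the numbers come from the text, and the script is only an illustration.

```python
# Per-core and per-chip cache capacities quoted in the text (POWER7 vs. POWER8).
KB, MB = 1024, 1024 * 1024

power7 = {"cores": 8,  "l1d": 32 * KB, "l2_per_core": 256 * KB, "l3_per_core": 4 * MB}
power8 = {"cores": 12, "l1d": 64 * KB, "l2_per_core": 512 * KB, "l3_per_core": 8 * MB}

def chip_totals(chip):
    """Aggregate on-chip L2/L3 capacity from the per-core figures."""
    return {
        "l2_total": chip["cores"] * chip["l2_per_core"],
        "l3_total": chip["cores"] * chip["l3_per_core"],
    }

for name, chip in (("POWER7", power7), ("POWER8", power8)):
    t = chip_totals(chip)
    print(f"{name}: L2 {t['l2_total'] // MB} MB, L3 {t['l3_total'] // MB} MB per chip")
# POWER8 prints: L2 6 MB, L3 96 MB per chip, matching the 96 MB aggregate L3 above.
```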
Figure 2 POWER8 Nest is composed of up to 12 cores with local caches, connected through an on-chip coherence and data interconnect to off-chip interconnect, accelerators, PCIe host controllers, and memory controllers. The accelerators include a true random number generator (RNG), cryptography accelerator (Crypto), CAPP (coherent attached processor proxy), and compression accelerator (CMP).
Table 1  Comparison of POWER7 and POWER8 cache hierarchy capacities, resources, and bandwidth.
Figure 3 The POWER8 memory subsystem includes a high-speed memory-technology-agnostic memory channel connected to the new Centaur chip, which includes an eDRAM cache and 4 DDR3/DDR4 DRAM ports.
In addition to the fundamental scaling improvements afforded by increased bandwidth, the POWER8 processor has significant hardware-based improvements to enhance uncontested-lock performance (the time to access a lock that is free), highly-contested-lock scaling (handling multiple threads competing for the same lock), and address-translation management scaling. For example, uncontested-lock performance benefits from two changes. Atomic operations are able to make binding coherence commit decisions in the L1 data cache instead of the L2 cache, reducing commit latency for uncontested locks. Additionally, hardware transactional memory has been added to the POWER8 processor, enabling significant software-scaling capabilities through software lock elision for software that previously would have used locking [7], further reducing uncontested-lock latency. Beyond these foundational protocol improvements to single-thread and system-wide scaling for shared data and volatile-translation contexts, several hardware capabilities have been added to the cache hierarchy to directly enable high-value system software, middleware, and application features, as described in [8]. Such features include micro-partition prefetch and support for multiple concurrent partitions on a core, both of which enable higher performance for many smaller partitions, as well as the hardware-managed reference-history array to improve virtual-memory management.
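As a rough illustration of how lock elision built on hardware transactional memory can avoid acquiring an uncontested lock, the sketch below models the fallback structure in Python. The try_transaction helper and its simulated abort are hypothetical stand-ins, not the POWER8 programming interface (which is exposed through the transactional-memory facilities described in [7]).

```python
import threading
import random

lock = threading.Lock()
shared_counter = 0

def try_transaction(body):
    """Hypothetical stand-in for a hardware transaction: run body atomically,
    but pretend it occasionally aborts (e.g., on a conflicting access)."""
    if random.random() < 0.1:          # simulated abort
        return False
    body()
    return True

def increment_with_elision():
    """Elide the lock: attempt the update transactionally first, and only
    fall back to really acquiring the lock if the transaction aborts."""
    global shared_counter
    def body():
        global shared_counter
        shared_counter += 1
    if try_transaction(body):
        return                          # uncontested case: no lock traffic at all
    with lock:                          # fallback path preserves correctness
        shared_counter += 1

for _ in range(1000):
    increment_with_elision()
print(shared_counter)                   # 1000
```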
Memory subsystem

The memory subsystem has been a major area of focus for the POWER8 processor, resulting in a substantial increase in bandwidth (roughly 230 GB/s per socket in the POWER8 processor compared to roughly 100 GB/s for the POWER7 processor) while reducing the latency to memory from over 100 ns in the POWER7 processor to around 80 ns in the POWER8 processor. The POWER8 processor chip has 8 memory controllers (MCs) that operate synchronously with the data and coherence interconnects. Figure 3 shows the memory subsystem for one MC. Each memory controller can track up to 32 requests at a time. Memory is interleaved at a cache-line granularity of 128 B across memory controllers to support low-latency operation. In contrast to the POWER7 processor, which spread a cache line over two memory channels, every POWER8 memory channel supplies a full 128 B cache line of data.
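As a concrete picture of the 128 B interleave across controllers, the toy mapping below spreads consecutive cache lines round-robin over the 8 memory controllers; the simple modulo scheme is an illustrative assumption, not the documented address hash.

```python
LINE_BYTES = 128
NUM_MCS = 8

def memory_controller_for(addr):
    """Round-robin 128 B interleave across the 8 MCs (illustrative mapping only)."""
    return (addr // LINE_BYTES) % NUM_MCS

# Consecutive cache lines land on consecutive controllers, spreading bandwidth.
print([memory_controller_for(line * LINE_BYTES) for line in range(10)])
# [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```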
Memory-technology-agnostic memory channel

The POWER7 and POWER8 processors both use a high-speed memory channel and memory-buffer chip to increase both the memory bandwidth and memory capacity above what could be supported by a direct-attached DRAM configuration. The POWER8 Centaur chip itself is manufactured in the same 22 nm SOI (silicon-on-insulator) process as the main processor chip, using the same design rules, enabling a high-performance solution. Each memory controller can connect to a memory-buffer chip (the Centaur chip) over a high-speed, differential memory channel providing a 2 B wide read path and a 1 B wide write path at a bit rate of 9.6 Gb/s, resulting in a peak of over 28 GB/s of usable memory bandwidth per memory channel. The 8 high-speed memory channels provide bandwidth equivalent to 32 direct-attached DRAM channels. The Centaur chip is included directly on the custom DIMM (dual-inline memory module) for some systems, and can be soldered onto the backplane or a riser card for use with industry-standard DIMMs (ISDIMMs) for other systems.

The Centaur chip also enables higher efficiency on the memory channel, greatly increasing the total usable bandwidth compared to the POWER7 design. The memory-channel efficiency increase is due to architectural improvements. The POWER7 processor had the memory scheduler on the processor chip, and the scheduler was DDR3 specific. The POWER7 buffer chip merely received requests from the processor chip, transmitted them to the appropriate DRAM channel, and returned data to the processor chip when it came back from the DRAM. The memory controller on the processor chip was responsible for scheduling all operations and ensuring that there were no collisions on the channel back to the processor chip. With the scheduling tightly controlled from the processor chip, the efficiency of the memory channel to the buffer chip was tied to the efficiency of the DRAM channel itself.

In the POWER8 memory subsystem, the memory channel is agnostic with respect to the memory technology used by the buffer chip. Requests sent to the Centaur chip are high-level commands (e.g., cache-line read, cache-line write) and include a tag identifying the request. The high-level commands contrast with the low-level DDR commands sent across the POWER7 memory channel. The Centaur chip processes each request and sends back data as quickly as possible with the appropriate tag, possibly returning data in a different order than the requests arrived; for example, an L4 hit may pass an earlier request. Additionally, the Centaur chip can reorder requests to the DRAM for optimal efficiency. The scheduling flexibility and L4 cache allow us to drive the memory channel to the processor chip at an efficiency of over 90%, representing a significant performance improvement over previous designs.

The memory-technology-agnostic memory channel has another benefit: it enables possible upgrades to new memory technologies without requiring a new processor chip. For instance, we could develop a new variant of the Centaur chip to support novel memory technologies (e.g., phase-change memory, STT-MRAM (spin-transfer torque magnetoresistive random-access memory), or flash) or a newer DDR specification by changing the scheduler logic. New custom DIMMs with this new buffer chip could be used seamlessly in POWER systems built with POWER8 processors.
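Before moving on to the L4 cache, note that the peak-bandwidth figures quoted above follow directly from the channel widths and signaling rate; the short calculation below, a sketch using only numbers from the text, reproduces the per-channel and per-socket peaks.

```python
# Memory-channel peak bandwidth from the figures quoted in the text.
bit_rate_gbps = 9.6          # bit rate per lane (Gb/s)
read_width_bytes = 2         # 2 B wide read path
write_width_bytes = 1        # 1 B wide write path

read_gbs = bit_rate_gbps * read_width_bytes      # 19.2 GB/s
write_gbs = bit_rate_gbps * write_width_bytes    #  9.6 GB/s
per_channel = read_gbs + write_gbs               # 28.8 GB/s -> "over 28 GB/s"

channels_per_socket = 8
print(per_channel, per_channel * channels_per_socket)   # 28.8 GB/s, 230.4 GB/s per socket
```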
L4 buffer cache

The memory-technology-agnostic memory channel also enables another major change in the Centaur chip: the inclusion of an L4 buffer cache in each Centaur, providing up to 128 MB of L4 buffer cache per processor chip. The cache is organized as a 16-way set-associative cache, with the data stored in eDRAM and the directory in SRAM. The cache serves as a buffer for the memory and is only accessed on memory-system accesses. As such, it is not snooped on every bus request like the processor caches, and it does not participate in the coherence protocol. The L4 cache provides several memory improvements: lower average memory-read latency, lower write latency, more efficient memory scheduling, lower DIMM power, and prefetch extensions.

The latency and power impacts are the most obvious of the improvements. Read requests that are satisfied by the L4 cache merely pay the cost of the eDRAM access (latency and power) and do not have to perform a DRAM access. In general, an L4 hit reduces the latency of an L3 miss by over 35 ns and requires less energy to complete.

The other features enabled by the L4 cache are also significant. The first feature is reduced write latency. All writes received by the Centaur chip are installed into the L4 cache, whether they hit the cache or not. This allows the write to be retired quickly, freeing the memory-controller resources associated with the write for the next command. Additionally, the cache enables much more efficient scheduling of writes to the DRAM. The Virtual Write Queue [9] introduced the idea of using the last-level cache as a large write queue; we use the L4 cache for the same purpose. We have added machinery, called the cache cleaner, to track how many dirty lines are in the L4 cache and to scan the cache for lines to write back to memory. The cache cleaner attempts to keep the DRAM write queue mostly full and to schedule bursts of writes to a page on each rank when writes are scheduled. This active process of scanning the cache for page-mode writes enables more efficient use of the DRAM data bus, as multiple writes to the same DRAM page do not hit any DRAM-scheduling constraints; additionally, the page-mode writes save energy, as the page only needs to be activated once. In previous designs, we would let writes accumulate in the DRAM write queue and perform a burst of writes when either the read queue was empty or the write queue became too full. In the current design, we try to keep the write queue mostly full. Instead of forcing writes when the write queue is mostly full, the cache cleaner monitors the number of dirty lines in the cache and switches to a burst of writes when the number of dirty lines exceeds a threshold. In this way, we allow reads to pass writes and perform efficient bursts of page-mode writes when necessary to drain writes from the system, reducing the likelihood of writes delaying critical reads.
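The following Python sketch captures the cleaner policy just described in simplified form: a dirty-line set, a threshold that triggers a burst of writebacks, and a scan that groups dirty lines by DRAM page. The threshold value, page size, and data structures are illustrative assumptions, not the hardware's actual parameters.

```python
from collections import defaultdict

LINE_BYTES = 128
PAGE_BYTES = 4096                      # illustrative DRAM page size
DIRTY_THRESHOLD = 1 << 14              # illustrative: burst writebacks past this many dirty lines

class CacheCleanerSketch:
    """Toy model of the L4 cache cleaner: track dirty lines and, once a
    threshold is crossed, drain them in page-mode bursts."""
    def __init__(self):
        self.dirty_lines = set()       # addresses of dirty 128 B lines

    def install_write(self, addr):
        # All writes are installed into the L4, so the write retires immediately.
        self.dirty_lines.add(addr - addr % LINE_BYTES)

    def maybe_clean(self, write_queue):
        if len(self.dirty_lines) < DIRTY_THRESHOLD:
            return                      # reads keep priority; no forced writebacks
        by_page = defaultdict(list)     # group dirty lines by DRAM page
        for addr in self.dirty_lines:
            by_page[addr // PAGE_BYTES].append(addr)
        # Burst whole pages so each page is activated once for many writes.
        for page, lines in sorted(by_page.items(), key=lambda kv: -len(kv[1])):
            write_queue.extend(sorted(lines))
        self.dirty_lines.clear()
```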
Extended prefetching is also enabled by the L4 cache. As prefetch requests are sent from the cores, extra information in the form of hints is included with the prefetch requests. The extra information indicates whether the prefetch is part of a stream of prefetches, and the confidence associated with the prefetch stream. For high-confidence streaming prefetches, the Centaur chip can prefetch extra data into the L4 cache before it is requested by the core, reducing the latency of future prefetch requests in that stream. The extra prefetching has two forms. The first involves fetching two 128 B lines of data for one request. The second line of data is installed into the L4 cache for later use. Additionally, the two lines can be fetched together from the DRAM, using an open-page access for the second line. The open-page access increases efficiency and lowers DRAM power usage. The second form involves fetching the next N (e.g., 4) pairs of lines in the stream into the L4 cache. This is done for the highest-confidence streams and can essentially transform all of the prefetch requests in the stream into L4 cache hits.

DRAM interface

Each Centaur chip has four 9 or 10 B (9 B with up to 1 B of spare data) DRAM ports, supporting the DDR3 (double data rate) [10] and DDR4 [11] JEDEC standards, with the DRAM scheduler included on the Centaur chip. Each port can address up to 16 logically independent ranks of DRAM, in order to support multiple physical ranks of DRAM as well as stacked DRAM. A rank of DRAM is a collection of DRAM chips (e.g., 8) that are accessed together in lockstep to provide a cache line of data. Additionally, the Centaur chip contains an eDRAM buffer cache, providing up to 128 MB of aggregate L4 memory buffer to the processor chip, enabling lower latency and improved scheduling opportunities. The DRAM ports are accessed in pairs to provide data, fetching a full 128 B cache line from a port pair. The die photo in Figure 4 shows the four DRAM ports arranged around the outside of the Centaur chip, with the high-speed channel to the processor chip taking up the remaining perimeter on the right. The eDRAM cache and the related control structures fill the center of the chip, along with the DRAM scheduling logic.

DRAM scheduling must satisfy a number of constraints, including the need for regular refresh operations. While a refresh operation is occurring, no read or write commands can be issued to the rank being refreshed. In traditional systems, this can have a noticeable performance impact. We have included a number of features to mitigate the refresh impact. Before the memory controller issues a refresh to a rank, it does a number of things. First, it checks the high-confidence prefetch streams that are being fetched into the L4 cache. If any of the streams will be impacted by the upcoming refresh, the stream is allowed to prefetch further ahead than normal, in order to fetch lines from the soon-to-be-refreshed rank. After the prefetcher has been allowed to prefetch ahead for a period of time, no new requests to that rank are allowed into the read queue (from the prefetcher or from L4 misses). The reads that remain in the read queue are drained from the queue with high priority, and the refresh is performed after the reads are drained. Extended prefetching before performing the refresh allows the stream to continue for longer before being disrupted by the refresh. Draining the read queue has a different effect: it keeps the read queue from becoming filled with reads that cannot be serviced at the expense of reads that can be serviced.
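A minimal sketch of that refresh-preparation sequence is shown below; the extra prefetch depth and the queue representation are made-up constants for illustration, since the real controller's windows and queue sizes are not described at that level of detail.

```python
def prepare_refresh(rank, read_queue, prefetch_streams, issued):
    """Toy model of readying one DRAM rank for refresh.

    read_queue: list of (rank, addr) reads waiting to be issued.
    prefetch_streams: list of dicts like {"rank": r, "depth": n, "confidence": "high"}.
    issued: output list recording what the scheduler would do, in order.
    """
    # 1. Let high-confidence streams on this rank run further ahead of the core,
    #    so upcoming lines are already in the L4 when the rank goes quiet.
    for stream in prefetch_streams:
        if stream["rank"] == rank and stream["confidence"] == "high":
            stream["depth"] += 4                     # illustrative extra depth

    # 2. Stop admitting new reads for the rank (modeled by the caller), then
    #    drain the reads already queued for it at high priority.
    remaining = []
    for req_rank, addr in read_queue:
        if req_rank == rank:
            issued.append(("read-high-priority", addr))
        else:
            remaining.append((req_rank, addr))
    read_queue[:] = remaining

    # 3. Only now perform the refresh; no queued read is stranded behind it.
    issued.append(("refresh", rank))

issued = []
queue = [(0, 0x1000), (1, 0x2000), (0, 0x3000)]
streams = [{"rank": 0, "depth": 2, "confidence": "high"}]
prepare_refresh(0, queue, streams, issued)
print(issued)   # reads to rank 0 drained first, then the refresh
```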
Figure 4 Annotated die photo of the Centaur buffer chip with high-speed link to the processor chip, eDRAM memory buffer cache, DDR interfaces, and control logic.
Memory RAS

POWER7 systems had industry-leading RAS (reliability, availability, and serviceability) properties, using a 72 B marking code to support chipkill (failure of a complete DRAM chip) and sparing. For the POWER7 processor, the full 128 B access used two memory channels, and the ECC (error-correcting code) correction was performed on the processor chip. In the POWER8 processor, using the memory-technology-agnostic memory channel, the full 128 B cache line comes from one memory channel, allowing the ECC check and correction to be performed on the buffer chip.
The Centaur chip supports 10 B wide DRAM ports, consisting of 72 b of data and ECC plus 8 b of spare data. The data is arranged into a 72 B code word for every 64 B of data. The DRAM is accessed with a burst length of 8 beats, with the first four beats of data from the two DRAM ports forming one code word and the second four beats forming a second code word. The spare data can be multiplexed in and used in place of a failed chip (chipkill) discovered by the ECC. The ECC is able to detect and correct a complete DRAM chip failure. When a DRAM chip is determined to be faulty, a mark is placed and the chip is spared out. The spare data is steered into the place of the failed chip and, for 4-bit-wide DRAM, supports steering at 4-bit-wide granularity. In this way, the ECC is capable of handling three chipkill events.

In addition to protecting against failures in the DRAM itself, the rest of the path to memory is also protected. The high-speed memory channel to the processor chip is protected by a CRC (cyclic redundancy check) code with a retry protocol. Transmissions over the memory channel are buffered until confirmed, and any transmission that encounters an error is resent, removing the error from the system and protecting the memory channel with the same level of protection as the rest of the system. Additionally, all the primary data paths through the chip are protected with a SECDED (single-error-correcting, double-error-detecting) ECC code, while the eDRAM cache is protected by an ECC code with a fuse-controlled repair facility.

The POWER8 processor supports an additional RAS feature called Selective Memory Mirroring (SMM), originally introduced in the POWER7 processor [3]. SMM enables mirroring of critical data regions such as hypervisor data. Protected data is mirrored across two separate memory channels, protecting against a complete memory-channel failure.
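The code-word geometry above can be sanity-checked with a little arithmetic: two ports carrying data plus ECC over four beats give one 72 B code word covering 64 B of data, and the burst of eight beats yields the two code words that make up a 128 B line. The sketch below reproduces that arithmetic from the figures in the text; the exact 8 B data / 1 B check split per port-beat is implied by the totals rather than stated explicitly.

```python
# DRAM port geometry implied by the text (per port per beat: 8 B data + 1 B ECC, plus spare).
DATA_BYTES_PER_PORT_BEAT = 8
ECC_BYTES_PER_PORT_BEAT = 1
PORTS_PER_LINE = 2            # ports are accessed in pairs
BEATS_PER_CODEWORD = 4        # first/second half of the burst-of-8

data_per_codeword = DATA_BYTES_PER_PORT_BEAT * PORTS_PER_LINE * BEATS_PER_CODEWORD   # 64 B
codeword_bytes = ((DATA_BYTES_PER_PORT_BEAT + ECC_BYTES_PER_PORT_BEAT)
                  * PORTS_PER_LINE * BEATS_PER_CODEWORD)                              # 72 B
line_bytes = 2 * data_per_codeword                                                    # 128 B

print(data_per_codeword, codeword_bytes, line_bytes)   # 64 72 128
```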
I/O subsystem

The I/O subsystem of the POWER8 processor saw a complete refresh compared to its predecessor designs. This refresh increased bandwidth, significantly lowered memory latency, lowered power consumption, and enabled easy adoption by third parties, especially those that want to use the coherent coupling features of the Coherent Accelerator Processor Interface (CAPI) protocol (see the "On-chip accelerators" section). Starting with the POWER4 [12] and z10* [13] processors, the IBM I/O subsystem was built upon a proprietary ecosystem of I/O chips [14] attached to IBM's proprietary GX bus, and later GX+ and GX++ buses, to achieve higher I/O throughput than possible with the standardized interfaces available at the time. Different break-out I/O hub chips provided connectivity to PCI-X**, PCIe devices, and InfiniBand** or Ethernet fabrics.
The POWER8 processor is the first generation of POWER processors that features integrated, industry-standard PCIe Gen 3.0 [15] host controllers. PCIe was chosen because of its maturity, industry-leading bandwidth, and widespread adoption by third-party vendors. The POWER8 processor chip integrates 3 PCIe host bridges (PHBs) per chip, with a maximum of 16 lanes per PHB, resulting in a maximum bidirectional bandwidth of 32 GB/s per PHB. The 12-core POWER8 processor chip provides a total of 32 PCI Express lanes; a single POWER8 processor chip thus offers a maximum bidirectional bandwidth of 64 GB/s.

The move toward an integrated PCIe-based I/O subsystem allows for more flexibility in adapting to specific application needs. The POWER7 I/O subsystem [16] was primarily built around external PCIe I/O hub chips that split the GX bus bandwidth of up to 20 GB/s into multiple, but slower, x8 PCIe 2.0 [17] buses. In contrast, the POWER8 processor chip now offers the increased bandwidth of 32 GB/s per controller, along with the flexibility to either couple PCIe devices with very high bandwidth requirements directly to the processor chip or to split the bandwidth among multiple slower devices using external PCIe switches. Another advantage of moving the PHB onto the processor chip is the significant reduction in memory-access latency, providing very low DMA (direct memory access) read round-trip times that are crucial for high bandwidth and low latency with high-performance network adapter devices. The on-chip PHBs achieve time-to-first-byte latencies for DMA reads as low as 200 ns, which is less than one third of the time-to-first-byte latency feasible with the POWER7 GX-attached PCIe hub chip.

I/O memory management unit

The I/O memory management unit (IOMMU) provides enhanced protection and reliability, availability, and serviceability (RAS) features. The protection mechanisms allow freezing all traffic to or from a device in an isolation domain on an error condition, without affecting other domains. The protection mechanisms also enable recovery without having to reset the PHB. Therefore, multiple devices can be attached to the same PHB securely without sacrificing device reliability and availability. The IOMMU resided on the GX hub chip in POWER systems built using the POWER7 processor, and has been moved onto the POWER8 processor chip. A number of enhancements were made to the IOMMU to meet the increased I/O requirements of the POWER8 processor, and to further exploit the inclusion of the unit in the processor chip. The IOMMU's virtualization and protection capabilities were increased to meet the higher fan-out requirements of the POWER8 processor. Each POWER8 PHB is capable of handling 256 error isolation domains, which allow devices to continue operation even when another device sharing the same PHB encounters a failure and has to be reinitialized.
Moreover, every POWER8 PHB can manage 512 virtual address spaces in order to provide enough resources to fully support and isolate single-root I/O virtualized (SRIOV) [18] PCIe devices. The address-translation structures in main memory used by the IOMMU were completely rearranged and optimized to take full advantage of the new location of the IOMMU, which is now on-chip and thus able to profit from core-like low-latency memory access. The new structures provide an optimal balance between the number of lookups needed, translation latency, and the area spent on cache or array structures in each PHB. Also, the IOMMU was extended to support multi-level translation tables in addition to the traditional single-level translation scheme used in AIX, in order to better support the Linux** memory-management schemes.

Throughput optimization

The design was optimized to fully utilize the available link bandwidth, achieving close to the theoretical limits with more than 95% raw link utilization and roughly 90% data utilization, even with traffic using the full address-translation and protection capabilities of the IOMMU. The high sustained link speeds are made possible by a combination of data and coherence interconnect and micro-architectural enhancements. An improved streaming mode used exclusively for I/O data is implemented in the data interconnect, based on the streaming mode introduced in the POWER7 processor [3]. It allows data bandwidths of more than 14 GB/s in each direction for reads and writes, while still maintaining the strict ordering requirements imposed by the PCIe specification. Achieving high bandwidths on PCIe buses frequently also requires the use of relaxed ordering to reduce dependencies between reads and writes wherever possible. The POWER8 processor does not require these ordering relaxations, due to the high bandwidth provided by the streaming mode. This is critical, as allowing relaxed ordering would create a potential threat to the error-isolation concept of the IOMMU.

In addition, various micro-architectural optimizations were made within the PHB to enable higher link efficiency. Address-translation prefetching allows early inspection of PCIe addresses close to the link interface and early fetches of translation control entries (TCEs) [19] in order to hide most of the lookup latency behind stack queuing and packet processing. In combination with the low latency to memory, this allows for more efficient receive-buffer use, and achieves four times higher bandwidth than the prior-generation external I/O hub chip, even with the same-sized link receive buffers. The PHB further applies prioritization mechanisms in many areas of the design to optimize the scheduling and arbitration, especially of reads for control and payload data. It also relaxes ordering requirements wherever possible by splitting up streams from the different error isolation domains to prevent stalls, but without compromising domain-isolation capabilities.
Besides the basic PCIe standard functionality, the PHB now also supports the use of Transaction Layer Packet (TLP) hint bits on the PCIe link to control the cache-injection mechanism on the coherent interconnect.
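To illustrate the difference between the single-level and multi-level translation schemes mentioned above, the sketch below walks a DMA address through a hypothetical two-level TCE table. The 4 KB I/O page size, the 512-entry level split, and the table layout are assumptions made for illustration only, not the documented POWER8 table format.

```python
IO_PAGE_SHIFT = 12                      # assume 4 KB I/O pages
L2_BITS = 9                             # assumed split: 512 entries per second-level table

def translate_single_level(tce_table, dma_addr):
    """One flat TCE table: one lookup, but the table must cover the whole DMA window."""
    index = dma_addr >> IO_PAGE_SHIFT
    real_page = tce_table[index]
    return (real_page << IO_PAGE_SHIFT) | (dma_addr & ((1 << IO_PAGE_SHIFT) - 1))

def translate_two_level(l1_table, dma_addr):
    """Two-level walk: the first level points at small second-level tables, so the
    structure can stay sparse for large DMA windows (closer to the Linux usage)."""
    page_index = dma_addr >> IO_PAGE_SHIFT
    l1_index = page_index >> L2_BITS
    l2_index = page_index & ((1 << L2_BITS) - 1)
    l2_table = l1_table[l1_index]       # would be absent/faulting if unmapped
    real_page = l2_table[l2_index]
    return (real_page << IO_PAGE_SHIFT) | (dma_addr & ((1 << IO_PAGE_SHIFT) - 1))

# Tiny demo: map DMA page 5 to real page 0x1234 in both schemes.
flat = {5: 0x1234}
two_level = {0: {5: 0x1234}}
addr = (5 << IO_PAGE_SHIFT) + 0x80
assert translate_single_level(flat, addr) == translate_two_level(two_level, addr)
print(hex(translate_two_level(two_level, addr)))   # 0x1234080
```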
SMP interconnect

POWER systems are symmetric multiprocessor (SMP) systems, providing snooping-based cache coherence across all the cores in the system. On the POWER8 processor chip, the SMP data and coherence interconnects consist of on-chip and off-chip interconnects, plus memory-coherence additions to enable efficient scaling up to 192 cores. The SMP interconnect is responsible for transferring the current value of memory locations to and from the processor cores, as well as implementing the coherence protocol and enabling efficient lock handling.

On-chip SMP interconnect

As shown in Figure 2, the POWER8 processor on-chip interconnect consists of a multi-ported coherency interface and a highly distributed data interconnect between the processor cores (each with associated L2 and L3 caches), the memory subsystem, and the I/O subsystem. As in the POWER7 processor chip, the on-chip data interconnect consists of eight 16 B buses that span the chip horizontally and are broken into multiple segments to handle the propagation delay across the chip, allowing the interconnect to be pipelined. Four of the buses flow left to right, four flow right to left, and the buses operate at up to 2.4 GHz. Micro-architecture improvements were made to the internal data interconnect to reduce request latency by resolving on-chip data-routing decisions using a distributed-arbitration scheme instead of the centralized data arbiter of previous generations. The on-chip interconnect also contains an adaptive-control mechanism to support independently controlled core frequencies while optimizing coherence-traffic bandwidth. As core frequencies are adjusted upward or downward, the coherence-arbitration logic will throttle up or throttle down the rate at which commands are issued based on the frequency of the slowest core.

Off-chip SMP interconnect

The off-chip interconnect, shown in Figure 5, is a highly scalable, multi-tiered, fully connected topology redesigned for the POWER8 processor chip to reduce latency. POWER systems built on the POWER7 processor required up to 3 hops to get from one chip to another, but POWER systems built on the POWER8 processor require no more than 2 hops. By eliminating a chip hop, the system topology has been flattened, significantly reducing the latency between the furthest ends of the SMP (an approximately 25 ns reduction). The first-level system is a single chip.
Figure 5 POWER8 SMP topology consists of up to 4 POWER8 processors connected in an all-to-all manner to form a 4-chip group, and up to 4 groups, with each chip in a group connected to the matching chips in the other groups.

The second-level system is fully connected with 10 B single-ended SMP links running at 4.8 Gbps. The POWER8 processor chip has three such links, enabling direct connection to each of 3 other processor chips, in order to create a four-chip group. The third-level system connects each processor chip in a group to its corresponding processor chip in each other group. Three of these inter-group links are provided per chip, supporting a total of four groups, each containing four processor chips. A full four-group system of four chips per group comprises a maximum system of 16 processor chips and a 192-way SMP. The inter-group SMP link uses a 22-bit high-speed differential bus running at 6.4 Gbps.

The reliability, availability, and serviceability (RAS) and the cost of the SMP links have also been improved in the POWER8 processor. The SMP links have the ability to dynamically detect and repair bit lanes without removing functional lanes from operation. In other words, SMP-coherence traffic continues to function normally while a bad lane is repaired. Another major change for the POWER8 processor was to replace the SMP backplane with a cabled SMP link connection, significantly reducing the system cost. System configuration is discussed in more detail in [2].

Coherence scopes

The data and coherence interconnects described above provide a large amount of data and command bandwidth. However, with up to 192 cores and a snooping memory-coherence protocol, the cores could easily saturate the links with coherence traffic. The POWER7 architecture had two coherence scopes to limit the coherence traffic on the SMP links by filtering the traffic that needs to be broadcast to other chips and groups [3]. The POWER8 architecture extends the POWER7 architecture's coherence scopes, adding a third scope.
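The way a master widens its broadcast can be pictured with the small escalation loop below. This is only a conceptual sketch of the chip/group/system progression made precise in the next subsection; the broadcast callback and its return convention are invented for illustration, not the actual protocol state machine.

```python
SCOPES = ["chip", "group", "system"]   # the three POWER8 broadcast scopes

def resolve_with_escalation(broadcast, initial_scope="chip"):
    """broadcast(scope) -> True if every cache that could hold the line was
    reached and coherence was resolved; False if the scope was too narrow
    (e.g., the MCD reported the line as remote). Invented callback signature."""
    start = SCOPES.index(initial_scope)
    for scope in SCOPES[start:]:
        if broadcast(scope):
            return scope               # smallest sufficient scope wins
    raise RuntimeError("system scope must always be sufficient")

# Example: a line that has migrated off its home chip but not off its group.
print(resolve_with_escalation(lambda scope: scope in ("group", "system")))  # "group"
```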
Coherence filtering

Commands are issued to the coherence interconnect with one of three scopes specified: chip, group, or system, where the scopes match the tiers described above. Commands issued with chip or group scope are not broadcast to the complete system, but rather only to the local chip or local group, respectively. Since commands issued with chip or group scope only access a portion of the system, they require much less command bandwidth and much lower latency to complete. However, the system must still ensure memory coherence across the complete system, including the portions not snooped: the command must be seen by any cache holding the line. Hardware prediction mechanisms and software collaboration allow the use of the smallest scope needed to maintain coherence, while hardware checks ensure that the scope includes all relevant caches.

Memory-Domain Indicators (MDIs) are included for each line of data in the system in order to determine if the scope of a command is adequate to find all caches holding the requested line. MDIs are assigned classes. Chip-class MDIs indicate whether the line has moved off the line's home chip (the chip where the memory controller (MC) owning the line is attached) and have been included in POWER systems since the POWER6 processor [5]. Group-class MDIs are new in the POWER8 processor and indicate whether the line has moved off the line's home group. The new group-class MDIs are kept in a directory structure called an MCD (memory-coherence directory) included on the processor chip. Each MCD is associated with the full address range of the memory served by that processor chip's memory controllers. In the POWER8 processor chip, the MCD is a coarse-grained directory, with each MDI "granule" representing 16 MB of memory. The MCD snoops commands issued with group and system scope. It participates in the address tenure of commands issued with group scope to indicate if the line specified by the command has moved off the home group (i.e., when the MCD MDI is marked "remote"). At system scope, the MCD determines during the address tenure if the line is going to be moved off the group. If the destination of the line is off the group, the MCD sets the MDI for the line's granule to "remote." When the MCD snoops a command at group scope and indicates that the line is "remote," the master may be required to retry the operation using system scope, or may be allowed to complete the operation and be directed to perform a background kill operation using system scope.
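A minimal Python sketch of the coarse-grained MCD bookkeeping described above: one indicator per 16 MB granule of locally attached memory, set to "remote" when a system-scope operation moves a line off the home group, and consulted when a group-scope operation is snooped. The class structure and method names are illustrative only.

```python
GRANULE_BYTES = 16 * 1024 * 1024        # each group-class MDI covers a 16 MB granule

class MemoryCoherenceDirectorySketch:
    """Toy model of the group-class MDIs kept in the on-chip MCD."""
    def __init__(self):
        self.remote = set()              # granules whose lines may be cached off-group

    def _granule(self, addr):
        return addr // GRANULE_BYTES

    def snoop_system_scope(self, addr, destination_off_group):
        # During the address tenure of a system-scope command, note when data
        # is about to leave the home group.
        if destination_off_group:
            self.remote.add(self._granule(addr))

    def snoop_group_scope(self, addr):
        # For a group-scope command, report whether group scope is sufficient.
        # If the granule is marked remote, the master must escalate to system
        # scope (retry) or perform a background kill at system scope.
        return "ok" if self._granule(addr) not in self.remote else "escalate"

mcd = MemoryCoherenceDirectorySketch()
mcd.snoop_system_scope(0x1234_5678, destination_off_group=True)
print(mcd.snoop_group_scope(0x1234_0000))   # "escalate": same 16 MB granule
print(mcd.snoop_group_scope(0x9876_0000))   # "ok"
```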
Using both the chip-class and group-class MDIs, the coherence protocol may direct a requestor that initially issues a request with chip scope to increase its scope to group, and then to system. Scope-prediction logic within the requestor is used to determine the initial scope for the operation. For example, if the line is in the local cache in the IG state, the predictor can determine that the line was last sent off the chip and that the request should have group or system scope. A cache prefetching consecutive lines can use the scope that was needed to complete the operation for one line in the stream for subsequent lines in the stream.

The MC holds a chip-class MDI for each 128 B cache line it owns and can reset the MDI to "home" when a line is removed from all caches in the system. However, each MCD group-class MDI entry holds the status for a 16 MB granule, consisting of 128 K lines, and so requires a recovery mechanism to reset the MDI. A specialized coherence-protocol command exists to determine if any cache holds a line or holds a reservation for the line, and this command is used to implement the recovery mechanism.

Data routing through topology

Data needs to be routed over the data interconnect to reach its destination. On the larger SMP systems, there are multiple routes, using inter- and intra-group interconnect buses, that could be used to route the data. Routing normally follows a preferred shortest route between source and destination, but may use alternate routes in the case of congestion.

Optimistic non-blocking coherence flow

The POWER8 processor SMP interconnect utilizes a non-blocking snooping protocol, which has highly scalable, low-latency characteristics. This has an advantage over directory-based coherence protocols by decentralizing coherency and thus vastly reducing latency. The non-blocking snooping protocol also has an advantage over message-passing snooping coherence protocols because it is temporally bounded, i.e., snoopers respond within a fixed time called Tsnoop. Furthermore, message-passing snooping protocols rely on queuing structures and communication bandwidth, which become more constrained as the SMP system scales to larger n-way systems. In previous POWER server generations, coherence operations were evenly divided using time-division multiplexing in order to ensure that each processor chip's coherency bandwidth did not exceed a limit when all processor chips were issuing requests. However, limiting the broadcast rate for worst-case conditions often leaves significant coherence bandwidth unused. The POWER8 SMP interconnect introduced the capability to oversubscribe each processor chip's allotment of bandwidth in order to take advantage of this unused bandwidth, without impacting the base non-blocking operation.

The optimistic coherence flow's oversubscription uses the existing POWER coherence-retry protocol.
When a coherence operation is received and the resource required to resolve coherence is unavailable, the protocol allows the snooper to retry the operation. Oversubscription provides the same retry capability, but at a chip-level scale. This chip-level retry can occur when the coherency operations received from the incoming SMP links exceed the chip's snoop bandwidth. It implies that the coherence operation was not broadcast to any downstream processor chips in the SMP topology. Using oversubscription, a coherence operation can still complete successfully even if the broadcast did not reach all chips in the SMP system, so long as the broadcast was received by the appropriate subset for that request.

Not all requests are handled in the same manner when they are retried. Speculative commands, such as prefetches, do not incur as high a penalty if retried, while critical coherence requests suffer a much higher penalty. A priority was therefore added to requests to determine which requests to drop when a request must be dropped. The drop priority of a command can be specified as low, medium, or high. Low-drop-priority commands are the first commands to be dropped, while high-priority commands are the last commands to be dropped. Furthermore, the interconnect issues requests at different rates depending on their priority: low-drop-priority commands have a higher issue rate than high-drop-priority commands, as they may be more freely dropped. Dropped requests that are retried are retried with higher priority. The priorities and request issue rates allow the interconnect to manage the number of coherence operations concurrently in flight, allowing most high-priority commands to succeed regardless of SMP system coherency traffic, while speculative, low-priority commands utilize the remaining coherence bandwidth, increasing overall coherence-bandwidth efficiency. The coherence interconnect issues requests at different rates based on commands' scope, priority, and current interconnect usage. The rates are dynamically controlled in hardware using feedback from dropped commands, in order to optimize bandwidth use and minimize dropped requests.
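The sketch below models, in simplified Python, the drop-priority behavior just described: when incoming coherence operations exceed the chip's snoop bandwidth in a cycle, the lowest-drop-priority commands are dropped first, and a dropped command is retried at a higher priority. The snoop-slot count and queueing are illustrative assumptions.

```python
PRIORITY_ORDER = {"low": 0, "medium": 1, "high": 2}
SNOOP_SLOTS_PER_CYCLE = 4               # illustrative snoop bandwidth

def snoop_cycle(incoming):
    """incoming: list of dicts like {"op": ..., "drop_priority": "low"}.
    Returns (snooped, retried) for one cycle of an oversubscribed chip."""
    # Highest drop priority is served first; excess commands are dropped.
    ordered = sorted(incoming, key=lambda c: PRIORITY_ORDER[c["drop_priority"]], reverse=True)
    snooped = ordered[:SNOOP_SLOTS_PER_CYCLE]
    retried = []
    for cmd in ordered[SNOOP_SLOTS_PER_CYCLE:]:
        # A dropped command is retried by its master at a higher drop priority,
        # so it is less likely to be dropped again.
        bumped = dict(cmd)
        if cmd["drop_priority"] == "low":
            bumped["drop_priority"] = "medium"
        elif cmd["drop_priority"] == "medium":
            bumped["drop_priority"] = "high"
        retried.append(bumped)
    return snooped, retried

ops = [{"op": i, "drop_priority": p}
       for i, p in enumerate(["high", "low", "low", "medium", "high", "low"])]
done, retry = snoop_cycle(ops)
print([c["op"] for c in done], [c["drop_priority"] for c in retry])
```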
On-chip accelerators

In the POWER7+* processor, IBM introduced on-chip accelerators for cryptography and active memory expansion (AME) [20] and provided a true-hardware random-number generator for cryptographic applications [21]. In the POWER8 processor, these accelerators have been carried forward and improved, adding new capabilities such as providing data-prefetching hints to the memory controller to reduce cache-line-access latency and improve throughput. The POWER8 processor also introduces the Coherent Accelerator Processor Interface (CAPI), described in detail elsewhere in this issue of the IBM Journal of Research and Development [22]. CAPI enables off-chip accelerators to be plugged into an up-to-16-lane PCIe slot and to participate in the system memory-coherence protocol as a peer of the other caches in the system at high bandwidth. Accelerators use effective addresses to reference data structures in memory, just like applications running on the cores.
Accelerator designers are given the freedom to implement their PCIe-card-based designs using field-programmable gate arrays (FPGAs), application-specific integrated circuit (ASIC) chips, semi-custom integrated circuits, and so forth.
Conclusion

In this paper, we have described the IBM POWER8 cache, interconnect, memory, and input/output subsystems, collectively referred to as the "nest." Systems built from the POWER8 processor and nest represent a substantial increase in single-thread and throughput computing. The paper focused on the enhancements made to the nest to achieve balanced and scalable designs, ranging from small 12-core single-socket systems up to large 16-socket, 192-core enterprise rack servers. The local-cache hierarchy has been improved to accommodate the computational strength of the core (which has roughly doubled in performance), with twice the L2 and L3 cache capacity, twice the bandwidth, more in-flight commands, and more efficient locking and translation support, while maintaining access latencies similar to the POWER7 cache. The memory subsystem was a major area of focus for the POWER8 processor, providing industry-leading bandwidth while simultaneously reducing the latency to memory significantly. The memory subsystem uses the new Centaur chip, which includes a new L4 memory-buffer cache and contains the DRAM scheduler, enabling higher memory-channel efficiency. The I/O subsystem of the POWER8 processor chip saw a complete refresh compared to its predecessor designs, bringing 32 lanes of PCIe 3.0 onto the processor chip, significantly increasing bandwidth while lowering DMA memory latencies considerably. The data and coherence interconnects were improved to increase end-to-end bandwidth by increasing on-chip efficiency, increasing intra-group and inter-group bandwidth, reducing the maximum number of hops, and improving coherence-scope filtering. Finally, the on-chip accelerators from the POWER7 processor were extended, featuring the inclusion of the new Coherent Accelerator Processor Interface (CAPI).

*Trademark, service mark, or registered trademark of International Business Machines Corporation in the United States, other countries, or both.
**Trademark, service mark, or registered trademark of PCI-SIG, InfiniBand Trade Association, or Linus Torvalds, Inc., in the United States, other countries, or both.
References

1. B. Sinharoy, J. A. Van Norstrand, R. J. Eickemeyer, H. Q. Le, J. Leenstra, D. Q. Nguyen, B. Konigsburg, K. Ward, M. D. Brown, J. E. Moreira, D. Levitan, S. Tung, D. Hrusecky, J. W. Bishop, M. Gschwind, M. Boersma, M. Kroener, M. Kaltenbach, T. Karkhanis, and K. M. Fernsler, "IBM POWER8 processor core microarchitecture," IBM J. Res. & Dev., vol. 59, no. 1, paper 2, pp. 2:1–2:21, 2015.
2. J. Cahill, T. Nguyen, M. Vega, D. Baska, D. Szerdi, H. Pross, R. Arroyo, H. Nguyen, M. Mueller, D. Henderson, and J. Moreira, "IBM Power Systems built with the POWER8 architecture and processors," IBM J. Res. & Dev., vol. 59, no. 1, paper 4, pp. 4:1–4:10, 2015.
3. B. Sinharoy, R. R. Kalla, W. J. Starke, H. Q. Le, R. Cargnoni, J. A. Van Norstrand, B. Ronchetti, J. Stuecheli, J. Leenstra, G. L. Guthrie, D. Nguyen, B. Blaner, C. F. Marino, E. Retter, and P. Williams, "IBM POWER7 multicore server processor," IBM J. Res. & Dev., vol. 55, no. 3, paper 1, pp. 1:1–1:29, May/Jun. 2011.
4. S. Narasimha, P. Chang, C. Ortolland, D. Fried, E. Engbrecht, K. K. Nummy, P. Parries, T. T. Ando, M. Aquilino, N. Arnold, R. Bolam, J. Cai, M. Chudzik, B. Cipriany, G. Costrini, M. Dai, J. Dechene, C. Dewan, B. Engel, M. Gribelyuk, D. Guo, G. Han, N. Habib, J. Holt, D. Ioannou, B. Jagannathan, D. Jaeger, J. Johnson, W. Kong, J. Koshy, R. Krishnan, A. Kumar, M. M. Kumar, J. Lee, X. Li, C. Lin, B. Linder, S. Lucarini, N. Lustig, P. McLaughlin, K. Onishi, V. Ontalus, R. Robison, C. Sheraw, M. Stoker, A. Thomas, G. Wang, R. Wise, L. Zhuang, G. Freeman, J. Gill, E. Maciejewski, R. Malik, J. Norum, and P. Agnello, "22 nm high-performance SOI technology featuring dual-embedded stressors, Epi-Plate High-K deep-trench embedded DRAM and self-aligned via 15LM BEOL," in Proc. IEEE Int. Electron Devices Meeting (IEDM), Dec. 10–13, 2012, pp. 3.3.1–3.3.4.
5. H. Q. Le, W. J. Starke, J. S. Fields, F. P. O'Connell, D. Q. Nguyen, B. J. Ronchetti, W. M. Sauer, E. M. Schwarz, and M. T. Vaden, "IBM POWER6 microarchitecture," IBM J. Res. & Dev., vol. 51, no. 6, pp. 639–662, Nov. 2007.
6. A. Mericas, N. Peleg, L. Pesantez, S. B. Purushotham, P. Oehler, C. A. Anderson, B. A. King-Smith, M. Anand, B. Rogers, L. Maurice, and K. Vu, "IBM POWER8 performance features and evaluation," IBM J. Res. & Dev., vol. 59, no. 1, paper 6, pp. 6:1–6:10, 2015.
7. H. Q. Le, G. L. Guthrie, D. E. Williams, M. M. Michael, B. G. Frey, W. J. Starke, C. May, R. Odaira, and T. Nakaike, "Transactional memory support in the IBM POWER8 processor," IBM J. Res. & Dev., vol. 59, no. 1, paper 8, pp. 8:1–8:14, 2015.
8. B. Sinharoy, R. Swanberg, N. Nayar, B. Mealey, J. Stuecheli, B. Schiefer, J. Leenstra, J. Jann, P. Oehler, D. Levitan, S. Eisen, D. Sanner, T. Pflueger, C. Lichtenau, W. E. Hall, and T. Block, "Advanced features in IBM POWER8 systems," IBM J. Res. & Dev., vol. 59, no. 1, paper 1, pp. 1:1–1:18, 2015.
9. J. Stuecheli, D. Kaseridis, L. John, D. Daly, and H. C. Hunter, "Coordinating DRAM and last-level-cache policies with the virtual write queue," IEEE Micro, vol. 31, no. 1, pp. 90–98, Jan./Feb. 2011.
10. JEDEC Solid State Technology Association, JEDEC Standard: DDR3 SDRAM, 2010. [Online]. Available: www.jedec.org/sites/default/files/docs/JESD79-3E.pdf
11. JEDEC Solid State Technology Association, JEDEC Standard: DDR4 SDRAM, 2012. [Online]. Available: www.jedec.org/sites/default/files/docs/JESD79-4.pdf
12. J. M. Tendler, J. S. Dodson, J. S. Fields, H. Le, and B. Sinharoy, "POWER4 system microarchitecture," IBM J. Res. & Dev., vol. 46, no. 1, pp. 5–25, Jan. 2002.
13. C.-L. K. Shum, F. Busaba, S. Dao-Trong, G. Gerwig, C. Jacobi, T. Koehler, E. Pfeffer, B. R. Prasky, J. G. Rell, and A. Tsai, "Design and microarchitecture of the IBM System z10 microprocessor," IBM J. Res. & Dev., vol. 53, no. 1, pp. 1:1–1:12, Jan. 2009.
14. E. W. Chencinski, M. A. Check, C. DeCusatis, H. H. Deng, M. Grassi, T. A. Gregg, M. M. Helms, A. D. Koenig, L. Mohr, K. Pandey, T. Schlipf, T. Schober, H. Ulrich, and C. R. Walters, "IBM System z10 I/O subsystem," IBM J. Res. & Dev., vol. 53, no. 1, pp. 6:1–6:13, Jan. 2009.
15. PCI Express Base Specification, Revision 3.0, Nov. 2010. [Online]. Available: http://www.pcisig.com/specifications/pciexpress/base3/
16. R. X. Arroyo, R. J. Harrington, S. P. Hartman, and T. Nguyen, "IBM POWER7 systems," IBM J. Res. & Dev., vol. 55, no. 3, pp. 2:1–2:13, May/Jun. 2011.
17. PCI Express Base Specification, Revision 2.1, Mar. 2009. [Online]. Available: http://www.pcisig.com/members/downloads/specifications/pciexpress/PCI_Express_Base_r2_1_04Mar09.pdf
18. Single Root I/O Virtualization and Sharing Specification, Revision 1.1, Jan. 2010. [Online]. Available: http://www.pcisig.com/specifications/iov/single_root/
19. Logical Partition Security in the IBM eServer pSeries 690, IBM white paper. [Online]. Available: http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/lpar_security.html
20. B. Blaner, B. Abali, B. M. Bass, S. Chari, R. Kalla, S. Kunkel, K. Lauricella, R. Leavens, J. J. Reilly, and P. A. Sandon, "IBM POWER7+ processor on-chip accelerators for cryptography and active memory expansion," IBM J. Res. & Dev., vol. 57, no. 6, pp. 3:1–3:16, Nov./Dec. 2013.
21. J. S. Liberty, A. Barrera, D. W. Boerstler, T. B. Chadwick, S. R. Cottier, H. P. Hofstee, J. A. Rosser, and M. L. Tsai, "True hardware random number generation implemented in the 32-nm SOI POWER7+ processor," IBM J. Res. & Dev., vol. 57, no. 6, pp. 4:1–4:7, Nov./Dec. 2013.
22. J. Stuecheli, B. Blaner, C. R. Johns, and M. S. Siegel, "CAPI: A Coherent Accelerator Processor Interface," IBM J. Res. & Dev., vol. 59, no. 1, paper 7, pp. 7:1–7:7, 2015.
Received March 17, 2014; accepted for publication April 12, 2014 William J. Starke IBM Systems and Technology Group, Austin, TX 78758 USA (
[email protected]). Mr. Starke joined IBM in 1990 after graduating from Michigan Technological University with a B.S. degree in computer science. He currently serves as the IBM Distinguished Engineer and Chief Architect for the Power processor storage hierarchy, and is responsible for shaping the processor cache hierarchy, symmetric multi-processor (SMP) interconnect, cache coherence, memory and I/O controllers, accelerators, and logical system structures for Power systems. He leads a large engineering team that spans multiple geographies. Mr. Starke has been employed by IBM for almost 25 years in several roles, spanning mainframe and Power systems performance analysis, logic design, and microarchitecture. Over the past decade, he has served as the storage hierarchy Chief Architect for POWER6, POWER7, POWER8, and follow-on design points that are currently in development. Mr. Starke holds approximately 200 U.S. patents. Jeff Stuecheli IBM Systems and Technology Group, Austin, TX 78758 USA (
[email protected]). Dr. Stuecheli is a Senior Technical Staff Member in the Systems and Technology Group. He works in the area of server hardware architecture. His most recent work includes advanced memory architectures, cache coherence, and accelerator design. He has contributed to the development of numerous IBM products in the POWER architecture family, most recently the POWER8 design. He has been appointed an IBM Master Inventor, authoring about 100 patents. He received B.S., M.S., and Ph.D. degrees from The University of Texas Austin in electrical engineering. David M. Daly IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY 10598 USA (
[email protected]). Dr. Daly was a Research Staff Member at the IBM T. J. Watson Research Center. He received a B.S. degree in computer engineering from Syracuse University in 1998, and M.S. and Ph.D. degrees in electrical engineering from the University of Illinois at Urbana-Champaign in 2001 and 2005, respectively. He subsequently joined IBM at the IBM T. J. Watson Research Center, where he worked on next generation POWER microprocessor design, including the POWER8 microprocessor, from 2008 to 2014. The main focus of his work has been the design and performance analysis of the memory subsystem using trace driven, queuing, and spreadsheet analysis.
Additional work has focused on system scalability, memory coherence, and network interconnects, as well as system reliability and availability. Dr. Daly is a senior member of the Institute of Electrical and Electronics Engineers (IEEE).
J. Steve Dodson IBM Systems and Technology Group, Austin, Texas 78758 USA (
[email protected]). Mr. Dodson is a Senior Engineer in the POWER microprocessor development organization at IBM Austin. He received a B.S.E.E. degree from the University of Kentucky in 1982. He subsequently joined IBM where he has worked on the development of I/O bridge chips, cache controllers, and memory controllers for POWER processors from the earliest POWER systems through the POWER8 processor. Mr. Dodson is currently a memory stack hardware architect for the POWER8 processor. He is an IBM Master Inventor and coauthor of over 200 issued U.S. patents.
Florian Auernhammer IBM Research - Zurich, Switzerland (
[email protected]). Dr. Auernhammer is a Research Staff Member in the Cloud and Computing Infrastructure department at IBM Research - Zurich. He received an M.S. degree in general engineering from Ecole Centrale Paris in 2005, and Dipl.-Ing. and Ph.D. degrees in electrical engineering from the Technical University of Munich in 2005 and 2011, respectively. He joined IBM at the IBM Research - Zurich Lab in 2006, where he has worked on I/O host bridges for InfiniBand and PCI Express, their efficient integration into coherent fabrics, and I/O virtualization. He is author or coauthor of six issued patents in addition to several currently pending. Dr. Auernhammer is a member of the Institute of Electrical and Electronics Engineers (IEEE). Patricia M. Sagmeister IBM Research - Zurich, Switzerland (
[email protected]). Dr. Sagmeister is a Research Staff Member in the Cloud and Computing Infrastructure department at IBM Research - Zurich. She received a Dipl.-Inf. degree in computer science from the University of Passau in 1993, as well as a Ph.D. degree in computer science from the University of Stuttgart in 2000. In 2013, she received an M.B.A. degree from Warwick Business School. She joined IBM in 1999 at IBM Research - Zurich, where she has worked on various aspects of datacenter optimization. In 2008 and 2009, she had an assignment at the IBM Thomas J. Watson Research Center working specifically on POWER8 I/O architecture. She is author or coauthor of several patents and technical papers. Dr. Sagmeister is a senior member of the Institute of Electrical and Electronics Engineers (IEEE). Guy L. Guthrie IBM Systems and Technology Group, Austin, TX 78758 USA (
[email protected]). Mr. Guthrie is a Senior Technical Staff Member on the POWER Processor development team in the Systems and Technology Group and is an architect for the IBM POWER8 cache hierarchy, coherence protocol, SMP interconnect, memory and I/O subsystems. He served in a similar role for POWER4, POWER5, POWER6, and POWER7 programs as well. Prior to that, he worked as a hardware development engineer on several PCI (Peripheral Component Interconnect) Host Bridge designs and also worked in the IBM Federal Systems Division on a number of IBM Signal Processor Development programs. He received his B.S. degree in electrical engineering from Ohio State University in 1985. Mr. Guthrie is an IBM Master Inventor and holds 184 issued U.S. patents.
Charles F. Marino IBM Systems and Technology Group, Austin, TX 78758 USA (
[email protected]). Mr. Marino received his B.S. degree in electrical and computer engineering from Carnegie-Mellon University. He is a Senior Engineer in the IBM Systems and Technology Group. In 1984, he joined IBM in Owego, New York. Mr. Marino is currently the interconnect team lead for the IBM POWER8 servers.
Michael Siegel Systems and Technology Group, Research Triangle Park, NC 27709 USA (
[email protected]). Mr. Siegel is a Senior Technical Staff Member in the IBM Systems and Technology Group. He currently works as the hardware architect of coherent bus architectures developed for POWER system applications. In 2003, Mr. Siegel joined the PowerPC* development team to support the high-performance processor roadmap and to create standard products for both internal and external customers. His roles included memory controller design lead and coherency bus design lead and architect, leading to the development of the PowerBus architecture in use in POWER processor chips starting with POWER7. He supported multiple projects, including the IBM POWER7 and POWER8 processors, and assisted in customer discussions for future game and joint chip development activity, incorporating new functions into the architecture as system requirements evolved. While working on the POWER8 PowerBus architecture, he worked as a hardware architect of the processor side of the Coherent Attached Processor Interface, working with development teams spanning the processor chip and the first-generation coherently attached external coprocessor by specifying hardware behavior and microcode architecture. Prior to his work in the IBM Systems and Technology Group, Mr. Siegel worked in NHD developing the Rainier network processor, the IEEE 802.5 DTR standard, and IBM token ring switch products based on the standard. He started working for IBM in Poughkeepsie, New York, on the 3081 I/O subsystem and the ES/9000 Vector Facility. Mr. Siegel is an IBM Master Inventor and holds over 70 patents issued by the U.S. Patent Office.
Bart Blaner IBM Systems and Technology Group, Essex Junction, VT 05452 USA (
[email protected]). Mr. Blaner earned a B.S.E.E. degree from Clarkson University. He is a Senior Technical Staff Member in the POWER development team of the Systems and Technology Group. He joined IBM in 1984 and has held a variety of design and leadership positions in processor and ASIC development. Recently, he has led accelerator designs in POWER7+ and POWER8 platforms, including the Coherent Accelerator Processor Proxy design. He is presently focused on the architecture and implementation of hardware acceleration technologies spanning a variety of applications for future POWER processors. He is an IBM Master Inventor, a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE) and holds more than 30 patents.