Chapter 9
PicoServer: Using 3D Stacking Technology to Build Energy Efficient Servers Taeho Kgil, David Roberts, and Trevor Mudge
Abstract With power and cooling contributing an increasing share of a datacenter's operating costs, energy efficiency has become a key driver in server design. One way to improve energy efficiency is to adopt innovative interconnect technologies such as 3D stacking. Three-dimensional stacking introduces new opportunities for future servers to become low power, compact, and possibly mobile. This chapter introduces an architecture called PicoServer that employs 3D technology to bond one die containing several simple, slow processing cores to multiple DRAM dies that provide sufficient capacity for primary memory. This use of 3D stacks readily facilitates wide, low-latency buses between processors and memory. These buses remove the need for an L2 cache, allowing its area to be re-allocated to additional simple cores. The additional cores allow the clock frequency to be lowered without impairing throughput. A lower clock frequency in turn means that thermal constraints, a concern with 3D stacking, are easily satisfied. PicoServer is intentionally simple, requiring only the simplest form of 3D technology, in which dies are stacked directly on top of one another. Our intent is to minimize the risk of introducing a new technology (3D) while implementing a class of low-cost, low-power, compact server architectures.
9.1 Introduction

Datacenters are an integral part of today's computing platforms. The success of the internet and the continued scaling of Moore's law have enabled internet service providers such as Google and Yahoo to build large-scale datacenters with millions of servers. For large-scale datacenters, improving energy efficiency becomes a critical
T. Kgil (B) Intel, 2111 NE 25th Ave, Hillsboro, OR 97124, USA; e-mail: [email protected]
Y. Xie et al. (eds.), Three-Dimensional Integrated Circuit Design, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-0784-4_9, C Springer Science+Business Media, LLC 2010
task. Datacenters based on off-the-shelf general-purpose processors are unnecessarily power hungry, require expensive cooling systems, and occupy a large space. In fact, the cost of powering and cooling these datacenters is likely to contribute a significant portion of their operating cost. This claim is confirmed by Fig. 9.1, which breaks down the annual operating cost of datacenters: the cost of powering and cooling servers contributes an increasing share of the overall operating costs of a datacenter.
[Chart: annual spending in billions of dollars (left axis) and the installed base of servers in millions (right axis) for the years 1996–2010, with series for power and cooling, new server spending, and the installed base of servers.]
Fig. 9.1 IDC estimates of annual spending on (1) powering and cooling servers and (2) purchasing additional servers [52]
One avenue to designing energy-efficient servers is to introduce innovative interconnect technology. Three-dimensional stacking is an interconnect technology that enables new chip multiprocessor (CMP) architectures with significantly improved energy efficiency. Our proposed architecture, PicoServer,1 employs 3D technology to bond one die containing several simple, slow processor cores with multiple DRAM dies that form the primary memory. In addition, 3D stacking enables a memory–processor interconnect of both very high bandwidth and low latency. As a result, the need for complex cache hierarchies is reduced. We show that the die area normally spent on an L2 cache is better spent on additional processor cores. With more cores, each can be run slower without affecting throughput. Slower cores also reduce power dissipation and, with it, thermal constraints, a potential roadblock to 3D stacking. The resulting system is ideally suited to throughput applications such as servers. Our proposed architecture is intentionally simple and requires only the simplest form of 3D technology, in which dies are stacked on top of one another. Our intent is to minimize the risk involved in realizing a
1 This chapter is based on the work in [32] and [29].
class of low-cost, low-power, compact server architectures. Employing PicoServers can significantly lower power consumption and space requirements.

Server applications handle events on a per-client basis; these events are independent and display high levels of thread-level parallelism. Such highly parallel workloads are ill-suited to traditional monolithic processors. CMPs built from multiple simple cores can exploit this thread-level parallelism to run at a much lower frequency while maintaining a similar level of throughput, and thus dissipate less power. By combining CMPs with 3D stacking, we will show that it is possible to cut power requirements further. Three-dimensional stacking enables the following key improvements:

1. High-bandwidth buses between DRAM and the L1 caches that support multiple cores. Thousands of low-latency connections between dies are possible with minimal area overhead. Because these buses are on-chip, we can implement wide buses with a relatively lower power budget than inter-chip implementations allow.

2. Modification of the memory hierarchy due to the integration of large-capacity on-chip DRAM. It is possible to remove the L2 cache and replace it with more processing cores. The access latency of the on-chip DRAM2 is also reduced, because address multiplexing and off-chip I/O pad drivers [47] are not required. Further, 3D stacking introduces opportunities to build nonuniform memory architectures with fast on-chip DRAM and relatively slower off-chip secondary system memory.

3. Overall reduction in system power, primarily due to the reduction in core clock frequency. The benefits of 3D stacking stated in items 1 and 2 allow the integration of more cores clocked at a modest frequency – in our work 500–1000 MHz – while still providing high throughput. The reduced core clock frequency also allows the core architecture to be simplified, for example, by using shorter pipelines with less forwarding logic.
The potential drawback of 3D stacking is heat containment (see Chapter 3). However, this is not a limitation for the type of simple, low-power cores that we are proposing for the PicoServer, as we show in Section 9.4.5. In fact, the ITRS projections of Table 9.2 predict that systems consuming just a few watts do not even require a heat sink.

The general architecture of a PicoServer is shown in Fig. 9.2. For the purposes of this work we assume a stack of five to nine dies. The connections between dies are made by vias that run perpendicular to them. The dimensions of a 3D interconnect via vary from 1 to 3 μm, with a separation of 1 to 6 μm. Current commercial offerings
2 We will refer to dies that are stacked on the main processor die as "on-chip," because together they form a 3D chip.
[Diagram: a five-die PicoServer stack with four DRAM dies (dies #2–#5, holding DRAM regions 1–4) above a logic die (#1) containing CPU0–CPU7, memory controllers, the NIC, and an I/O controller, with a heat sink attached to the logic die.]
Fig. 9.2 A diagram depicting the PicoServer: a CMP architecture connected to DRAM using 3D stacking technology with an on-chip network interface controller (NIC) to provide low-latency high-bandwidth networking
can support 1,000,000 vias per cm² [26]. This is far more than we need for PicoServer. The vias function both as interconnect and as thermal pipes. For our studies, we assume that the logic-based components – the microprocessor cores, the network interface controllers (NICs), and the peripherals – are on the bottom layer, and that conventional capacity-oriented DRAMs occupy the remaining layers.

To understand the design space and potential benefits of this new technology, we explored the trade-offs among bus widths, number of cores, clock frequencies, and memory hierarchies in our simulations. We found a bus width of 1024 bits with a latency of two clock cycles at 250 MHz to be reasonable for our architecture. In addition, we aim for a reasonable area budget, constraining the die size to be below 80 mm² in a 90 nm process technology. Our 12-core PicoServer configuration, which occupies the largest die area, is conservatively estimated at approximately 80 mm². The die areas of our 4- and 8-core PicoServer configurations are 40 mm² and 60 mm², respectively.

We also extend our analysis of PicoServer to show the impact of integrating Flash into the PicoServer architecture. We provide a qualitative analysis of two configurations that integrate Flash: (1) as a discrete component and (2) stacked directly on top of our DRAM + logic die stack. Both configurations leverage the benefits of 3D stacking technology. The first configuration is driven by larger system memory capacity requirements, while the second is driven by small form factor.

This chapter is organized as follows. In the next section we provide background for this work by presenting an overview of server platforms, 3D stacking technology, and trends in DRAM technology. In Section 9.3, we outline our methodology for the design space exploration. In Section 9.4, we provide more details of the PicoServer architecture and evaluate various PicoServer configurations.
In Section 9.5, we present our results for the PicoServer architecture on server benchmarks and compare them to conventional architectures that do not employ 3D stacking: CMPs without 3D stacking and conventional high-performance desktop architectures with Pentium 4-like characteristics. A summary and concluding remarks are given in Section 9.6.
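The bus parameters chosen above are easy to sanity-check. The numbers below (1024-bit width, 250 MHz bus clock, 1,000,000 vias/cm², 80 mm² largest die) all come from this chapter; the arithmetic simply converts them into peak bandwidth and an available via count.

```python
# Peak bandwidth of the 1024-bit, 250 MHz on-chip bus described in this
# chapter, and the via budget implied by the quoted commercial density.

BUS_WIDTH_BITS = 1024
BUS_CLOCK_HZ = 250e6

peak_bw_bytes = BUS_WIDTH_BITS * BUS_CLOCK_HZ / 8
print(f"peak bus bandwidth: {peak_bw_bytes / 1e9:.0f} GB/s")  # 32 GB/s

VIA_DENSITY_PER_CM2 = 1_000_000
DIE_AREA_MM2 = 80
available_vias = VIA_DENSITY_PER_CM2 * (DIE_AREA_MM2 / 100)  # mm^2 -> cm^2
print(f"vias available on an 80 mm^2 die: {available_vias:,.0f}")  # 800,000
assert available_vias > BUS_WIDTH_BITS  # far more than one wide bus needs
```

At 32 GB/s of peak bandwidth per bus, and with roughly 800,000 vias available on the largest die, even several such buses consume only a small fraction of the via budget.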
9.2 Background

This section provides an overview of the current state of server platforms, 3D stacking technology, and DRAM technology. We first show how servers are currently deployed in datacenters and analyze the behavior of current server workloads. Next, we explain the state of 3D stacking technology and how it is applied in this work. Finally, we describe advances in DRAM technology and the current and future trends in DRAM used in the server space.
9.2.1 Server Platforms

9.2.1.1 Three-Tier Server Architecture

Today's datacenters are commonly built around a 3-tier server architecture. Figure 9.3 shows a 3-tier server farm and how it might handle a request for service. The first tier handles the bulk of the requests from the client (end user). Tier 1 servers handle web requests; because they handle events on a per-client basis, these events are independent and display high levels of thread-level parallelism. Requests that require heavier computation or database accesses are forwarded to Tier 2 servers. Tier 2 servers execute user applications that interpret script languages and determine which objects (typically database objects) should be accessed. Tier 2 servers generate database requests to Tier 3 servers. Tier 3 servers receive database queries and return the results to Tier 2 servers.
Fig. 9.3 A typical 3-tier server architecture. Tier 1 – web server, Tier 2 – application server, Tier 3 – database server
For example, when a client requests a Java Servlet Page (JSP web page), the request is first received by the front-end server – Tier 1. Tier 1 recognizes that a Java Servlet Page must be handled and initiates a request to Tier 2, typically using a remote method invocation (RMI) interface. Tier 2 initiates a database query on the Tier 3 servers, which in turn generate the results and send the relevant information up the chain all the way to Tier 1. Finally, Tier 1 sends the generated content to the client.
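The request flow just described can be sketched as three cooperating functions. This is a toy illustration: the function names and the query/result strings are invented, and real tiers would communicate over HTTP, RMI, and a database protocol rather than direct calls.

```python
# A toy sketch of the 3-tier request flow. All names and strings here are
# hypothetical; real tiers are separate machines linked by network protocols.

def tier3_database(query):
    """Tier 3: resolve a database query and return rows."""
    return {"rows": [f"result of {query}"]}

def tier2_application(page):
    """Tier 2: interpret the script page and issue database queries."""
    query = f"SELECT objects FOR {page}"
    rows = tier3_database(query)["rows"]
    return f"<html>{page}: {rows[0]}</html>"

def tier1_web(request):
    """Tier 1: serve static content directly, forward dynamic pages."""
    if request.endswith(".jsp"):
        return tier2_application(request)    # forwarded down the chain
    return f"<html>static {request}</html>"  # served by tier 1 alone

print(tier1_web("index.jsp"))
```

The key structural point survives the simplification: only dynamic requests descend the chain, which is why Tier 1 sees the bulk of the traffic and exhibits the most connection-level parallelism.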
Three-tier server architectures are commonly deployed in today's server farms because they allow each tier to be optimized for its workload. However, this strategy is not always adopted. Google employs essentially the same machines at each level, because economies of scale and manageability issues can outweigh the advantages. We will show that, apart from the database disk system in the third tier, the generic PicoServer architecture is suitable for all tiers.

9.2.1.2 Server Workload Characteristics

Server workloads display a high degree of thread-level parallelism (TLP) because the connection-level parallelism of independent client connections maps easily onto threads. Table 9.1 shows the behavior of commercial server workloads. Most display high TLP and low instruction-level parallelism (ILP), with the exception of decision support systems. Conventional general-purpose processors, however, are typically optimized to exploit ILP. Server workloads suffer from high cache miss rates that regularly stall the machine, leading to low instructions per cycle (IPC) and poor utilization of processor resources. Our studies have shown that, except for computation-intensive workloads such as PHP application servers, video-streaming servers, and decision support systems, out-of-order processors achieve an IPC between 0.21 and 0.54 for typical server workloads – that is, workloads with at best modest computation loads – even with a 2 MB L2 cache. These workloads perform poorly because much of the requested data has recently been DMAed from disk to system memory, invalidating cached data and causing cache misses. Therefore, we can generally say that single-thread-optimized out-of-order processors do not perform well on server workloads. Another interesting property of most server workloads, unlike the SPEC CPU benchmarks, is the appreciable amount of time spent in kernel code. This kernel code is largely involved in
Table 9.1 Behavior of commercial workloads adapted from [38]

Attribute                      Web99          JBOB(JBB)    TPC-C        SAP 2T      SAP 3T DB    TPC-H
Application category           Web server     Server java  OLTP*        ERP†        ERP†         DSS‡
Instruction-level parallelism  low            low          low          med         low          high
Thread-level parallelism       high           high         high         high        high         high
Instruction/data working-set   large          large        large        med         large        large
Data sharing                   low            med          high         med         high         high
I/O bandwidth                  low (network)  med          high (disk)  med (disk)  high (disk)  med (disk)

* OLTP: online transaction processing
† ERP: enterprise resource planning
‡ DSS: decision support system
interrupt handling for the NIC or disk driver, packet transmission, network stack processing, and disk cache processing. Finally, a large portion of requests are centered on the same group of files. These file accesses generate memory and I/O traffic, and given the modest computation requirements, memory and I/O latency are critical to high performance. Disk caching in system memory therefore plays a critical part in providing sufficient throughput; without a disk cache, the performance degradation due to hard disk drive latency would be unacceptable. To perform well on this class of workloads, an architecture should naturally support multiple threads so that it can respond to independent requests from clients. Intuition thus suggests that a CMP or SMT architecture should better utilize the processor die area.

9.2.1.3 Conventional Server Power Breakdown

Figure 9.4 shows the power breakdown of a server platform available today. This server uses a chip multiprocessor implemented with many simple in-order cores to reduce power consumption. Roughly one-quarter of the power is consumed by the processor, one-quarter by the system memory, one-quarter by the power supply, and one-fifth by the I/O interfaces. Immediately we can see that a relatively large amount of system memory consumes a substantial fraction of the power; this fraction is expected to grow as system memory clock frequency and capacity increase. We also find that, despite using simple, energy-efficient cores, the processor still consumes a noticeable amount of power. The I/O interfaces consume a large amount of power because of the high I/O supply voltage required by off-chip interfaces. The I/O supply voltage will scale down in the future, but not as fast as the core supply voltage. There are therefore many opportunities to further reduce power by integrating system components on-chip.
Finally, we find that the power supply displays some inefficiency, owing to the multiple voltage levels it has to support. A reduced number of power rails would dramatically improve power supply efficiency. Three-dimensional stacking technology has the potential to
Fig. 9.4 Power breakdown of a T2000 UltraSPARC server executing SPECjbb. [Chart components: processor, 16 GB memory, I/O, disk, service processor, fans, and AC/DC conversion; total power 271 W]
(1) reduce the power consumed by the processor and the I/O interfaces by integrating additional system components on-chip, and (2) improve power supply efficiency by reducing the number of power rails implemented in a server.
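The breakdown above can be turned into rough wattages. The 271 W total comes from Fig. 9.4; the quarter/quarter/quarter/fifth fractions are the approximate shares stated in the text, not measured values, so the small remainder is attributed to the disk, service processor, and fans.

```python
# Rough arithmetic on the T2000 power breakdown reported in this section.
# The fractions are approximations read from the text, not measurements.

TOTAL_W = 271
fractions = {
    "processor": 0.25,
    "system memory": 0.25,
    "power supply (AC/DC)": 0.25,
    "I/O interface": 0.20,
}
for name, frac in fractions.items():
    print(f"{name:>22}: {frac * TOTAL_W:5.1f} W")

remainder = 1.0 - sum(fractions.values())
print(f"disk, service processor, fans: ~{remainder * TOTAL_W:.1f} W")
```

Under these assumed fractions, memory alone accounts for roughly 68 W, which is why integrating DRAM on-chip and trimming I/O drivers is such a large lever.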
9.2.2 Three-Dimensional Stacking Technology

This section provides an overview of 3D stacking technology. In the past there have been numerous efforts in academia and industry to implement 3D stacking technology [17, 40, 37, 44, 57]. They have met with mixed success because of the many challenges that must be addressed, including (1) achieving high yield in bonding die stacks; (2) delivering power to each stack; and (3) managing thermal hotspots due to stacking multiple dies. In the past few years, however, strong market forces in the mobile terminal space have accelerated demand for small form factors with very low power. In response, several commercial enterprises have begun offering reliable low-cost die-to-die 3D stacking technologies.

In 3D stacking technology, dies are typically bonded face-to-face or face-to-back. Face-to-face bonds provide higher die-to-die via density and lower area overhead than face-to-back bonds; the lower via density of face-to-back bonds results from the through-silicon vias (TSVs) that must pass through the silicon bulk. Figure 9.5 shows a high-level example of how dies can be bonded using 3D stacking technology. The bond between layer 1 (starting from the bottom) and layer 2 is face-to-face, while the bond between layers 2 and 3 is face-to-back.

The bonding techniques of 3D stacking open up the opportunity to stack heterogeneous dies together. For example, DRAM and logic dies are manufactured with different process steps, yet can be stacked into a single chip; references [43, 24, 16] demonstrate the benefits of stacking DRAM on logic. Furthermore, with the added vertical dimension, the overall wire interconnect length can be reduced and wider buses can be achieved at lower area cost. The parasitic capacitance and resistance of 3D vias are negligible compared to global interconnect, and the size and pitch of 3D vias add only a modest area overhead.
Three-dimensional via pitches are equivalent to 22λ for 90-nm technology, which is about the size of a 6T SRAM cell. They are also expected to shrink as this technology becomes mature.
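The "modest area overhead" claim can be checked with the 22λ pitch quoted above. Taking λ = 45 nm at the 90 nm node is our assumption about the lambda convention; with it, even a full 1024-bit via bus occupies a vanishing fraction of the chapter's largest (80 mm²) die.

```python
# Area cost of a 1024-bit via bus at the 22-lambda pitch quoted above.
# lambda = 45 nm at the 90 nm node is an assumed convention.

LAMBDA_M = 45e-9
via_pitch_m = 22 * LAMBDA_M              # ~0.99 um pitch
via_area_mm2 = (via_pitch_m * 1e3) ** 2  # footprint per via in mm^2

bus_area_mm2 = 1024 * via_area_mm2
die_area_mm2 = 80.0                      # largest PicoServer die in this chapter
print(f"bus area: {bus_area_mm2:.4f} mm^2 "
      f"({100 * bus_area_mm2 / die_area_mm2:.4f}% of the die)")
assert bus_area_mm2 / die_area_mm2 < 0.001  # well under 0.1% of the die
```

About a thousandth of a square millimeter for the whole bus: the via footprint is a rounding error next to the cores and DRAM arrays it connects.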
Fig. 9.5 Example of a three-layer 3D IC. [Diagram labels: bulk Si, active Si, through-silicon vias, and die-to-die vias; the bottom two dies are joined by a face-to-face bond and the top die by a face-to-back bond]
The ITRS roadmap in Table 9.2 predicts that deeper stacks will become practical in the near future. The connections between dies are made by vias that run perpendicular to them. As noted earlier, the dimensions of a 3D interconnect via vary from 1 to 3 μm, with a separation of 1 to 6 μm, and current commercial offerings can support 1,000,000 vias per cm² [26].

Table 9.2 ITRS projection [12] for 3D stacking technology, memory array cells, and maximum power budget for power-aware platforms. The projections suggest that DRAM density exceeds SRAM density by 15–18×, implying that a much larger capacity of DRAM than of SRAM can be integrated on-chip using 3D stacking technology
                                                          2007    2009    2011    2013    2015
#die/stack (low-cost/handheld)                            7       9       11      13      14
SRAM density (Mbits/cm²)                                  138     225     365     589     948
DRAM density at production (Mbits/cm²)                    1,940   3,660   5,820   9,230   14,650
Maximum power budget, cost-performance systems (W)        104     116     119     137     137
Maximum power budget, low-cost/handheld with battery (W)  3.0     3.0     3.0     3.0     3.0
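The density advantage claimed in the caption of Table 9.2 can be checked directly from its rows; the figures below are copied from the table, and the arithmetic shows DRAM exceeding SRAM density by roughly a factor of 14–16 in every projection year.

```python
# DRAM vs. SRAM density from the ITRS figures in Table 9.2. Stacking DRAM
# dies therefore yields far more on-chip capacity than an SRAM cache would.

years = [2007, 2009, 2011, 2013, 2015]
sram_mbits_cm2 = [138, 225, 365, 589, 948]
dram_mbits_cm2 = [1940, 3660, 5820, 9230, 14650]

for y, s, d in zip(years, sram_mbits_cm2, dram_mbits_cm2):
    print(f"{y}: DRAM/SRAM density ratio = {d / s:.1f}x")  # 14.1x in 2007
```

This order-of-magnitude gap is the quantitative basis for replacing the L2 cache with stacked DRAM rather than with more SRAM.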
Table 9.3 Three-dimensional stacking technology parameters [26, 13, 44]
Size Minimum pitch Feed through capacitance Series resistance
Face-to-back
Face-to-face
RPI
MIT 3D FPGA
1.2μ × 1.2μ