Optimizing Thread Throughput for Multithreaded Workloads on Memory Constrained CMPs

Major Bhadauria and Sally A. McKee
Computer Systems Lab, Cornell University, Ithaca, NY, USA
[email protected], [email protected]

ABSTRACT
Multi-core designs have become the industry imperative, replacing our reliance on increasingly complicated microarchitectural designs and VLSI improvements to deliver increased performance at lower power budgets. Performance of these multi-core chips will be limited by the DRAM memory system: we demonstrate this by modeling a cycle-accurate DDR2 memory controller with SPLASH-2 workloads. Surprisingly, benchmarks that appear to scale well with the number of processors fail to do so when memory is accurately modeled. We frequently find that the most efficient configuration is not the one with the most threads. By choosing the most efficient number of threads for each benchmark, average energy delay efficiency improves by a factor of 3.39, and performance improves by 19.7%, on average. We also introduce a shadow row of sense amplifiers, an alternative to cached DRAM, to explore potential power/performance impacts. The shadow row works in conjunction with the L2 cache to leverage temporal and spatial locality across memory accesses, thus attaining average and peak speedups of 13% and 43%, respectively, when compared to a state-of-the-art DRAM memory scheduler.
Categories and Subject Descriptors
C.4 [Performance of Systems]: Design Studies; C.1 [Processor Architectures]: Parallel Architectures

General Terms
Design, Performance

Keywords
Performance, Power, Memory, Efficiency, Bandwidth

1. INTRODUCTION

Power and thermal constraints have begun to limit the maximum operating frequency of high-performance processors. The cubic increase in power from higher frequencies and the higher voltages required to attain those frequencies has reached a plateau. By leveraging increasing die space for more processing cores (creating chip multiprocessors, or CMPs) and larger caches, designers hope that multithreaded programs can exploit shrinking transistor sizes to deliver throughput equal to or higher than that of their single-threaded, single-core predecessors. The current software paradigm is based on the assumption that multi-threaded programs with little contention for shared data scale (nearly) linearly with the number of processors, yielding power-efficient data throughput. Nonetheless, applications fully exploiting available cores, where each application thread enjoys its own cache and private working set, may still fail to achieve maximum throughput [5]. Bandwidth and memory bottlenecks limit the potential of multi-core processors, just as in the single-core domain [2].

We explore memory effects on a CMP running a multithreaded workload. We model a memory controller using state-of-the-art scheduling techniques to hide DRAM access latency. Pin limitations affect available memory channels and bandwidth, which can cause applications not to scale well with increasing cores; this limits the maximum number of cores that should be used to achieve the best balance between efficiency and performance. Academic computer architects rarely model the variable latencies of main memory requests, instead assuming a fixed latency for every memory request [9]. We compare differences in performance between modeling a static memory latency and modeling a memory controller with realistic, variable DRAM latencies for the SPLASH-2 benchmark suite. We specifically examine CMP efficiency, as measured by the energy-delay-squared (ED²) product. Our results demonstrate the insufficiency of assuming static latencies: such modeling choices affect not only absolute performance but also the shapes of the performance curves with respect to the number of threads. Additionally, we find that choosing the optimal configuration instead of the one with the most threads can result in significantly higher computing efficiency. This is a direct result of variable DRAM latencies and the bottlenecks they cause for multi-threaded programs. This bottleneck is visible at a lower thread count than in earlier research [25] when the multi-dimensional structure of DRAM is modeled. To help mitigate these bottlenecks, we introduce a shadow
row of DRAM sense amplifiers. The shadow row keeps the previously accessed DRAM page accessible (in read-only mode) without closing the current page, so its data can be fed to the core when it satisfies a pending memory request. This leverages the temporal and spatial locality of memory accesses to potentially improve performance. The shadow row can be used in conjunction with normal caching, or it can be used to implement an alternative non-caching policy, preventing cache pollution by data unlikely to be reused. The contributions of this work are threefold.
• We extend a DDR2 memory model to include memory optimizations exploiting the shadow row.
• We find significant non-linear performance differences between bandwidth-limited static-latency simulations and those using a realistic DRAM model and memory controller.
• We find that the maximum number of threads is rarely the optimal number to use for performance or power.
2. THE MULTI-DIMENSIONAL DRAM STRUCTURE

Synchronous dynamic random access memory (SDRAM) is the mainstay for bridging the gap between cache and disk. DRAMs have not increased in operating frequency or bandwidth proportionately with processors. One innovation is double data rate SDRAM (DDR SDRAM), which transfers data on both the rising and falling edges of the clock. DDR2 SDRAM further increases available memory bandwidth by doubling the internal memory frequency of DRAM chips. However, memory still operates an order of magnitude slower than the processor core.

Main memory (DRAM) is partitioned into ranks, each composed of independently addressable banks, and each bank contains storage arrays addressed by row and column. When the processor requires data from memory, the request is steered to a specific rank and bank based on the physical memory address. All banks on a rank share command and data buses, which can lead to contention when memory requests arrive in bursts. Banks receive an activate command to charge a bank of sense amplifiers; a data row is fetched from the storage array and copied to the sense amps (a hot row or open page). Banks are then sent a read command to access a specific column of the row. Once accesses to the row finish, a precharge command closes it so another can be opened: reading from DRAM is destructive, so the precharge command writes data back from the hot row to DRAM storage before precharging the sense amps. This prevents pipelining accesses to different rows within a bank, but commands can be pipelined to different banks to exploit parallelism within the memory subsystem. Although banks are not internally pipelined, data can be interleaved across banks to reduce effective latency.

For data with high spatial locality, the same row is likely to be accessed multiple times consecutively. Precharge and activate commands are wasted when a previously accessed row is closed and subsequently reopened. Modern memory controllers usually keep rows open to leverage spatial locality, operating in “open page” mode. Memory controllers can also operate in “closed page” mode, where a row is always closed after it is accessed.
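To make the command sequence above concrete, the following minimal sketch (C++ written for this discussion, not the DDR2 model used in our simulator) computes the latency of a single bank access under an open-page policy; the tRP/tRCD/tCAS values are illustrative placeholders rather than actual chip timings.

  // Minimal sketch (not the paper's SESC/DDR2 model): latency of one access to a
  // DRAM bank under an open-page policy, using illustrative timing parameters.
  #include <cstdint>
  #include <iostream>
  #include <optional>

  struct BankTimings { int tRP = 4, tRCD = 4, tCAS = 4; };  // precharge, activate, column access (cycles)

  struct Bank {
      std::optional<uint32_t> openRow;   // row currently latched in the sense amps, if any

      // Returns the command latency for accessing 'row', updating the open-row state.
      int access(uint32_t row, const BankTimings& t) {
          if (openRow && *openRow == row)            // open-page hit: column access only
              return t.tCAS;
          int latency = t.tRCD + t.tCAS;             // activate new row, then column access
          if (openRow) latency += t.tRP;             // row conflict: must precharge first
          openRow = row;
          return latency;
      }
  };

  int main() {
      BankTimings t;
      Bank b;
      std::cout << "miss on empty bank: " << b.access(7, t) << " cycles\n";   // tRCD + tCAS
      std::cout << "open-page hit:      " << b.access(7, t) << " cycles\n";   // tCAS only
      std::cout << "row conflict:       " << b.access(9, t) << " cycles\n";   // tRP + tRCD + tCAS
  }

Consecutive accesses to the same row pay only the column-access latency, which is precisely the locality that open page mode, and the shadow row of Section 3, try to preserve.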
Figure 1: Performance for Open Page Mode with Read Priority Normalized to No Priority (normalized speedup for cholesky, fft, lu, ocean, and radix at 1, 4, 8, and 16 threads)
3. EXPLOITING GREATER LOCALITY
To take advantage of the high spatial locality present in the memory access patterns of many parallel scientific applications, we make a simple modification to our DRAM model. We add a second “hot row” of sense amplifiers per DRAM bank, storing the second-most recently accessed data page. When the main row is sent a precharge command, its data are copied into this shadow row while being written back to the bank's storage array. The shadow row supplies data for read requests, but data are not written to it, since it does not perform write-backs to the main storage array. The shadow row therefore cannot satisfy write requests, but programs are most sensitive to read requests; since most cores employ a store buffer, this seems a reasonable tradeoff between complexity and performance.

We examine the performance improvement of prioritizing reads over writes when satisfying memory requests in a standard DRAM controller (now a common memory controller optimization). Figure 1 shows that giving reads priority over writes yields significant speedups for some benchmarks. Clock cycles are averaged over the multi-threaded configurations and then normalized to their counterparts that give reads and writes equal priority; differences in clock cycles among thread counts are negligible. Giving reads priority generally improves performance for all but a few benchmarks. For these, our scheduling can lead to thread starvation by giving newer reads priority over older outstanding writes. Prior research [8] addresses this with a threshold at which writes increase in priority over reads.

Only one set of sense amplifiers responds to commands at any time (i.e., either the shadow row or the hot row); avoiding parallel operation simplifies power issues and data routing. We make minimal modifications to a standard DRAM design, since cost is an issue for commodity DRAM manufacturing. If manufacturing cost restrictions are relaxed, the hot row can perform operations in parallel while the shadow row feeds data to the chip. Our mutual-exclusion restriction results in minimal changes to internal DRAM chip circuitry. Figure 2 illustrates a DRAM chip with modifications (shaded) to show a potential organization for incorporating shadow rows. To further reduce costs and power overhead,
shadow row sense amplifiers can be replaced by latches, since the shadow row reads data from the sense amplifiers and not from the (lower-powered) storage array. We introduce no additional buses, adding minimal logic overhead; the shadow row shares the bus with the original hot row. Since the “column address strobe” (CAS) command uses fewer input pins than the “row address strobe” (RAS), the extra pins can be used to indicate whether to read data from the hot row or the shadow row. Effectively utilizing the shadow row thus requires slight memory controller modifications. Note that the shadow row does not require sense amplifiers and can use simple flip-flop memory elements. Sense amplifiers are large and power hungry, especially compared to digital CMOS memory elements. Thus, the addition of an extra page of buffers results in negligible power and area overhead.

Figure 2: Internal DRAM Circuitry with Shadow Row

Yamauchi et al. [27] explore a different approach to increasing the effective number of banks without increasing circuit overhead. They pipeline and overlap bank structures (such as row and column decoders and output sense amplifiers) over several data arrays. They implement 32 such banks while incurring only the hardware overhead of four banks. They thus trade area costs for increased complexity in the memory controller; the controller must ensure that banks sharing these structures do not conflict with one another when satisfying memory requests. Increasing the number of banks reduces the queue depth pressure, reducing the chance that reordering can achieve multiple open-page hits. In contrast, we explicitly try to maintain high queue depth pressure, thereby increasing the likelihood of page hits, which reduce latency below that of the fixed-latency case.
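As a rough sketch of the shadow-row policy described in this section (our own simplified model, not the authors' circuitry or simulator code), the bank below tracks a hot row plus a read-only shadow row: reads may hit either row, writes may use only the hot row, and a displaced hot row becomes the new shadow row.

  // Sketch of the shadow-row idea (illustrative): reads may be served from either the
  // hot row or the shadow row; writes only from the hot row; when the hot row is
  // replaced, its contents are retained as the shadow row.
  #include <cstdint>
  #include <iostream>
  #include <optional>

  struct ShadowBank {
      std::optional<uint32_t> hotRow;     // row held by the primary sense amplifiers
      std::optional<uint32_t> shadowRow;  // read-only copy of the previously open row

      // True if the request can be satisfied without opening a new row.
      bool pageHit(uint32_t row, bool isWrite) const {
          if (hotRow && *hotRow == row) return true;            // normal open-page hit
          return !isWrite && shadowRow && *shadowRow == row;    // read-only shadow hit
      }

      // Open 'row' in the hot-row sense amps; the old hot row becomes the shadow row
      // (in hardware this copy happens during the precharge/write-back).
      void openRow(uint32_t row) {
          shadowRow = hotRow;
          hotRow = row;
      }
  };

  int main() {
      ShadowBank b;
      b.openRow(3);
      b.openRow(5);                                  // row 3 is now the shadow row
      std::cout << b.pageHit(5, false) << " "        // 1: hot-row hit
                << b.pageHit(3, false) << " "        // 1: shadow-row hit (read)
                << b.pageHit(3, true)  << "\n";      // 0: writes cannot use the shadow row
  }

A memory controller aware of this state can treat a read to either row as a page hit, which is how the shadow row reduces the page conflicts measured in Section 5.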
4. EXPERIMENTAL SETUP
We use SESC [18], a cycle-accurate MIPS ISA simulator, to evaluate our workloads. We incorporate a cycle-accurate DRAM memory controller that models transmitting memory commands synchronously with the front side bus (FSB). We assume memory of effectively infinite size (4 GB) and no page faults. Our workloads are memory-intensive programs from the SPLASH-2 suite, which should scale to sixteen or more cores [25]. We choose programs with large memory footprints exceeding our L2 cache.

Our baseline is an aggressive CMP with four-issue out-of-order cores; the relevant design parameters are listed in Table 1. The baseline design incorporates L1 and L2 caches of 32 KB four-way and 1 MB eight-way associativity, respectively, to model realistic silicon die area utilization, reduce program dependence on main memory, and fit large portions of the working set on chip. The load-store queues are an important characteristic of modern architectures, as they prevent stores from causing stalls. The fixed latency used for accessing main memory incorporates the time to perform an activate and a read/write command. The precharge command time is omitted, since the memory request can be satisfied without, or in tandem with, the row being closed. A request takes 300 cycles, which accounts for the time to open a new row, read the data, and transmit it over the FSB. When data are transmitted over the FSB, the transmission is limited by the bandwidth of the memory channel. For dynamic power consumption, we conservatively assume that cross-communication between cores and clock network power are negligible. Based on current technology trends, static power accounts for at least 50% of total core power [12], and increases linearly with the number of cores and execution time.

Our DRAM chips run at 800 MHz with 4-4-4-14 timings for the CAS, RAS, and RAS precharge commands, and a precharge-to-row-charge delay of 18 memory clock cycles. Our variable-latency calculations for the closed and open page mode variants use discrete command timings for the row access strobe (RAS), column access strobe (CAS), and precharge (PRE). Our memory controller intercepts requests from the L2 cache and partitions them into discrete DRAM commands. If the request's page is already open, then only a CAS command is sent to memory; otherwise PRE and RAS commands are sent as well. Once the CAS command completes, the memory request is satisfied and the next command is processed. Our controller implements out-of-order memory scheduling and read priority: later requests accessing an open page are given preference over earlier requests, and read requests have preference over older write requests when choosing the next request to satisfy. We use memory buffers that are filled and satisfied on a first-come, first-served (FCFS) basis; however, the FCFS priority of a request can be preempted if either of the two conditions above is satisfied. We use an infinite queue size for rescheduling accesses, to remove as many configuration-specific bottlenecks from our memory controller as possible.
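The request-selection order just described can be summarized with the following sketch (illustrative code of our own, not SESC or our controller implementation): pick the oldest request that hits an open page, otherwise the oldest read, otherwise fall back to plain FCFS. The write-starvation threshold of [8] is deliberately omitted here.

  // Illustrative request selection: open-page hits first, then reads over writes,
  // then FCFS. Structure and names are ours; no starvation threshold is modeled.
  #include <cstdint>
  #include <deque>
  #include <iostream>
  #include <vector>

  struct Request { uint64_t arrival; uint32_t bank, row; bool isWrite; };

  // 'openRow[bank]' holds the currently open row per bank (UINT32_MAX = none).
  const Request* selectNext(const std::deque<Request>& q, const std::vector<uint32_t>& openRow) {
      const Request* best = nullptr;
      auto older = [](const Request* a, const Request* b) { return !b || a->arrival < b->arrival; };
      for (const Request& r : q)                       // 1) oldest open-page hit
          if (openRow[r.bank] == r.row && older(&r, best)) best = &r;
      if (best) return best;
      for (const Request& r : q)                       // 2) oldest read
          if (!r.isWrite && older(&r, best)) best = &r;
      if (best) return best;
      for (const Request& r : q)                       // 3) plain FCFS
          if (older(&r, best)) best = &r;
      return best;
  }

  int main() {
      std::vector<uint32_t> openRow = {7, UINT32_MAX};
      std::deque<Request> q = {{1, 1, 9, true}, {2, 0, 7, true}, {3, 0, 3, false}};
      const Request* r = selectNext(q, openRow);
      std::cout << "selected arrival=" << r->arrival << "\n";   // 2: the open-page hit wins over the older request
  }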
Table 1: Base Architectural Parameters
  Technology: 70nm
  Number of Cores: 1/4/8/16
  Execution: Out of Order
  Issue/Decode/Commit Width: 4
  Instruction Fetch Queue Size: 8
  INT/FP ALU Units: 2/2
  Physical Registers: 80
  LSQ: 40
  Branch Mispredict Latency: 2
  Branch Type: Hybrid
  L1 Icache: 32KB, 2-Way Associative, 1-cycle Access, 32B Lines
  L1 Dcache: 32KB, 4-Way Associative, 2-cycle Access, 32B Lines
  Cache Coherence: MESI
  Shared L2 Cache: 1024KB, 8-Way Associative, 9-cycle Access, 32B Lines
  Main Memory: 300 cycles static latency

Table 2: SPLASH-2 Large Inputs
  Cholesky: tk29.0
  FFT: 20 Million
  Lu: 512x512 matrix, 16×16 blocks
  Ocean: 258x258 ocean
  Radix: 2M keys
We use a 32-bit memory space, with all addresses 32 bits long. Addresses are mapped to row, memory channel, bank, and column fields, from most significant bit to least significant bit, respectively. This is found to be an optimal configuration [24], because we want columns to have the most variability and rows the least. Additionally, our bit mapping spreads requests across the most banks and reduces the number of row conflicts that occur at any given bank.

For our SPLASH-2 multi-threaded benchmarks, we use large input sets. Since SPLASH-2 is rather dated, we use aggressive inputs to keep the benchmarks competitive for evaluating current performance bottlenecks on modern systems. The input parameters for our programs are listed in Table 2. The larger input sets offset the initialization and program-partitioning overheads and the program portions that are not parallelized.
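For illustration, a decode of this bit mapping might look like the sketch below; the field widths (one channel bit, three bank bits for eight banks, eleven column bits) are assumptions chosen for the example, since the exact widths are not listed here.

  // Illustrative decode of the mapping described above: row | channel | bank | column,
  // from most to least significant bits. Field widths are assumed for this example.
  #include <cstdint>
  #include <iostream>

  struct DramAddr { uint32_t row, channel, bank, column; };

  DramAddr decode(uint32_t paddr) {
      const int kColBits = 11, kBankBits = 3, kChanBits = 1;          // assumed widths
      DramAddr a;
      a.column  =  paddr & ((1u << kColBits) - 1);
      a.bank    = (paddr >> kColBits) & ((1u << kBankBits) - 1);
      a.channel = (paddr >> (kColBits + kBankBits)) & ((1u << kChanBits) - 1);
      a.row     =  paddr >> (kColBits + kBankBits + kChanBits);       // remaining high bits
      return a;
  }

  int main() {
      DramAddr a = decode(0x1234ABCDu);
      std::cout << "row=" << a.row << " channel=" << a.channel
                << " bank=" << a.bank << " column=" << a.column << "\n";
  }

Placing the column bits lowest keeps sequential addresses in the same row (favoring open-page hits), while the bank bits just above spread bursts across banks, matching the rationale given above.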
5. EVALUATION

We quantitatively measure computing efficiency as the lowest energy-delay-squared product (ED²) among our configurations for each benchmark. Researchers find this to be a good metric for measuring power-efficient computing [21].
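As a worked example of the metric (with invented energy and delay numbers, not measurements from our experiments), the sketch below computes ED² = E × D² for several thread counts and picks the most efficient one; note that the largest thread count need not win.

  // Picking the most efficient thread count by energy-delay-squared (ED^2 = E * D^2).
  // The energy/delay values below are made-up placeholders.
  #include <cstdio>
  #include <vector>

  struct Config { int threads; double energyJ; double delayS; };

  int main() {
      std::vector<Config> runs = {{1, 10.0, 8.0}, {4, 14.0, 3.0}, {8, 20.0, 2.2}, {16, 34.0, 2.0}};
      const Config* best = nullptr;
      double bestEd2 = 0.0;
      for (const Config& c : runs) {
          double ed2 = c.energyJ * c.delayS * c.delayS;     // ED^2, lower is better
          std::printf("%2d threads: ED^2 = %.1f\n", c.threads, ed2);
          if (!best || ed2 < bestEd2) { best = &c; bestEd2 = ed2; }
      }
      std::printf("most efficient: %d threads\n", best->threads);   // 8 threads for these placeholder numbers
  }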
Figure 3 (a) shows the performance of the programs as the number of threads increases from one to 16. Assuming a fixed latency for every main memory request, all programs achieve a modest reduction in execution time as threads increase. However, even with a static latency, not all benchmarks exhibit linear improvement in performance with increasing threads. This is attributed to: a) programs being limited by available bandwidth, and b) contention for shared data, exhibited by locks/barriers and by increases in main memory accesses due to contention for the shared cache. Researchers find that for a 32-processor run, a 258×258 ocean simulation acquires and releases thousands of locks during execution [25]. Other programs such as radix, lu and fft have no locks and few barriers, and are instead possibly limited by bandwidth. The 16-thread ocean run performs only 2% better than the eight-thread version; bandwidth contention plays a large role here, with threads exhibiting a large concentration of misses as thread counts increase.

Figure 3 (b) shows that the most efficient configuration for every benchmark is usually the one with the most threads. This illustrates that when assuming a static latency with peak bandwidth (every memory access being a page hit), the SPLASH benchmarks speed up as the number of threads and processors increases. These results suggest increasing threads as the path to increasing performance efficiently, since frequency has plateaued. The ED² results indicate that increasing threads leads to improved or equivalent efficiency for most programs. The exception is ocean, for which the 16-thread case has a 2.6% increase in ED² over its eight-thread counterpart, likely due to scaling overheads and increases in cache misses. As shown in later graphs, these results do not reflect the behavior that would truly be exhibited in a realistic system constrained by accurate memory latencies and bandwidth.

For a single access, a closed page system and the static latency assumption both satisfy a request within the same number of clock cycles. However, once multiple accesses occur, a DRAM controller using the closed page scheme quickly incurs a backlog of requests to open and close rows, often even the same one. This results in a substantial degradation of performance. Figure 4 (a) shows the performance of a system using a closed page memory system normalized to a single-thread fixed-latency simulation (the ideal case). Configurations ranging from a single thread to 16 threads are graphed. Closed page mode single-threaded benchmarks perform worse than the static latency case due to the multi-dimensional structure of DRAM. Clearly the disadvantage of this scheme is not only performance but also power consumption, as the same rows that are precharged are subsequently charged again. In closed page mode, almost all of our benchmarks are memory limited after eight or fewer threads; the exception is lu, which improves performance with increases in the number of processors. Closed page simulations show a 50+% degradation in performance compared to the static latency scenario. The difference is also non-uniform: all benchmarks show a degradation in performance with just one thread, but this trend changes with increasing threads depending on the benchmark. Hence, no normalization can map results that lack a cycle-accurate DRAM timing model to simulations where one is used, because of the non-linear structure of DRAM memory. Figure 4 (b) illustrates the ED² inefficiency of closed page mode for four to 16 threads, normalized to the single-threaded closed page configuration. It indicates that the four-thread configuration is often the most efficient, or close to it, for most benchmarks. Lu is the only benchmark that offers strong evidence to the contrary, and the differences for cholesky and fft make increasing the number of threads debatable.

Figure 5 (a) graphs open page mode with variable latency. Performance is normalized to the static latency of a single-threaded configuration. Again, we see non-uniform results, with some single-threaded open page applications showing little difference in performance (cholesky, lu), others showing degradations (fft, ocean), and one showing significant improvements (radix).
Open page mode leverages the temporal and spatial locality inherent in programs, but it fails to achieve the performance that static latency with peak bandwidth assumes would be available.
Figure 3: Results for Static Main Memory Latency Normalized to One Thread. (a) Performance; (b) Energy Delay (ED²) Inefficiency (lower is better). Results are shown for cholesky, fft, lu, ocean, and radix at 1, 4, 8, and 16 threads.

Figure 4: Results for Variable Closed Page Latency Normalized to One Thread. (a) Performance; (b) Energy Delay (ED²) Inefficiency (lower is better).

Figure 5: Results for Variable Open Page Latency Normalized to One Thread. (a) Normalized Performance; (b) Energy Delay (ED²) Inefficiency (lower is better).

Figure 6: Shadow Row Performance Normalized to Open Page Read Priority.

Figure 8: 16 Bank Performance Normalized to Eight Bank Open Page Read Priority.
This is surprising, since the aggressive out-of-order memory scheduling should maximize possible open-page hits, but it is clearly insufficient. Examining open page performance trends, we see this issue is exacerbated with increasing threads as pressure on the memory channel grows, leading programs to bottleneck at lower thread counts, not because of limited bandwidth, but because of the latency of accessing a multi-dimensional DRAM structure.

Figure 5 (b) depicts the inefficiency of programs running with four to 16 threads, normalized to their single-threaded counterparts (not shown). Surprisingly, the maximum number of threads is rarely the most efficient. For radix the single-threaded case is the most efficient, while for fft, cholesky, and ocean it is the four-thread configuration. If one were to always choose the maximum number of threads for each benchmark, the result would be 339% worse ED² on average than the optimal value, increasing energy and often delay as well.

We implement the shadow row to reduce the effective latency of DRAM. Figure 6 shows speedup with the shadow row, where performance is normalized to a baseline state-of-the-art memory scheduler that reorders memory accesses, giving priority to open-page hits and reads. Significant reductions in execution time are achieved for every benchmark except lu, which is not as memory limited as the other benchmarks. The benchmarks suffer no performance degradation, since shadow row misses do not increase delay over the baseline scheduler. Overall, the shadow row achieves an average speedup of 12.8% and a peak speedup of 43.5%. For some benchmarks, performance gains decrease with increasing threads, since the memory addresses become more sparse. Sparse addresses that span more than the last two hot rows result in an open-page miss.

We investigate the underlying causes of the difference in performance between the shadow row and the baseline open page memory controller by examining the percentage of DRAM page conflicts. DRAM page conflicts occur when the requested memory is not on a currently opened DRAM row, requiring that row to be closed and the requested memory's corresponding row to be opened. Figure 7 (a) shows memory page conflicts for the open page DRAM controller normalized to the total number of requests to main memory. Radix shows a significantly high number of DRAM page conflicts, followed by fft, ocean, and cholesky, respectively. This indicates that for some benchmarks, performance degradation is due to DRAM page conflicts rather than to limited bus bandwidth. We validate these observations in a later section, where we examine benchmark sensitivity to bandwidth. Figure 7 (b) graphs DRAM page conflicts normalized to the open page DRAM memory controller. There were only negligible differences in DRAM memory accesses between the open page memory controller and the one with the shadow row, so they are not graphed. For most benchmarks, there are significant reductions in DRAM page conflicts compared to the baseline, with cholesky showing the highest reductions in conflict misses. On average, a 30% reduction in conflict misses is achieved by keeping the previous row open. These page conflict reductions are the source of our shadow row performance improvements. Some performance benefits are masked by the bandwidth bottleneck, which we explore in the next section. For example, radix shows the most performance improvement, but it does not enjoy the same reductions in page misses. This is due to other benchmarks (cholesky and fft) being simultaneously bottlenecked by memory bandwidth, reducing their performance gains.

We investigate the sensitivity of benchmark performance to the number of banks and to bus bandwidth. Although current chips contain eight banks, we are interested in sensitivity as the bank count increases to 16, and in the expected performance improvements. Increasing the number of banks reduces contention on a single bank, since memory requests may be spread across more banks. Figure 8 graphs the performance of 16 banks normalized to a baseline eight-bank memory controller that uses read priority and out-of-order memory scheduling. Each 16-bank configuration is normalized to its respective eight-bank configuration. Performance increases over the original eight-bank configuration for most benchmarks. However, some benchmarks actually degrade in performance. For example, the eight-thread fft run increases in delay by 4.8%. This is because fewer accesses are waiting in the queue for reordering, resulting in fewer accesses to the same hot row. This results in less contention but fewer hits to a hot row, where the access latency is even lower than in the static latency configuration.
Figure 7: DRAM Page Conflict Differences Between Open Page and Open Page with Shadow Row. (a) Page Conflicts Normalized to Total DRAM Accesses; (b) Shadow Row Conflicts Normalized to Open Page Conflicts.

Figure 9: Increasing Bandwidth Performance Normalized to Open Page Read Priority.
Comparing Figure 8 to Figure 6 indicates that the shadow row actually performs better for every benchmark and thread configuration except the radix four- and eight-thread runs. Additionally, the best performance for cholesky and fft is still the eight-thread configuration, and for radix and ocean it is the four-thread configuration. Thus, doubling the number of banks fails to adequately change the thread performance curve. In fact, ocean now saturates even faster (at four threads rather than the eight threads of the previous eight-bank configuration).

We also examine the performance improvements that can be expected from reducing the amount of time data occupies the bus (effectively speeding up the bus). Specifically, we assume we can double the bus speed or widen the communication lanes, such that data occupies the bus for half the number of CPU clock cycles it previously did. Figure 9 graphs the results of applying this optimization to our open-page memory controller, which we use as our baseline. Some benchmarks (cholesky, fft, ocean) show speedups that increase with the number of threads due to increasing memory pressure. Two benchmarks show no improvements, for two different reasons. Lu was not originally constrained by memory, so increasing bandwidth does not improve its performance. Radix shows negligible improvements at lower thread counts because it suffers more from page conflicts than from limited bandwidth. This is foreshadowed in the earlier graph, which shows multi-threaded radix having very high DRAM page conflict misses (over 50%). In such a scenario, it is more important to reduce page conflicts than to increase available bandwidth, since the time to satisfy a DRAM conflict is so high. This is also verified by the speedups for radix in Figure 6, where the shadow row significantly improves performance by providing more open-page hits than the baseline open-page memory controller.

6. RELATED WORK
The literature contains a wealth of research on memory devices and subsystems; we focus on DRAM latency-reduction and bandwidth-enhancing techniques here. Researchers previously found that the wide gap in performance between memory and processor results in a memory wall [26]: the observation that the processor, regardless of its operating speed, is limited by the slower memory on which it depends for data. Previous work has examined this issue for single-threaded benchmarks on single-core processors [20].

One way to alleviate the speed gap between memory and processor is efficient DRAM memory scheduling. Examined earlier for streaming workloads [16], memory scheduling strategies have also been studied for high performance computing [19], with several scheduling methods emerging from this work. One idea is to process outstanding memory accesses that map to the same row consecutively, out of the order in which they arrived at the memory controller; with this strategy, emphasis is placed on first processing memory requests that map to the currently open page. Another strategy is to satisfy read accesses out of order, ahead of write requests, since programs can usually continue processing with outstanding writes pending. However, both strategies need some heuristic to ensure outstanding requests do not starve because of the continued promotion of later requests.

Researchers find DRAM does not effectively capture the temporal locality of memory accesses [28]. One solution is to leverage the temporal locality of DRAM data by using cached DRAM, where some cache is stored in the DRAM along with the DRAM arrays. Memory requests are first handled by the cache; a miss in the DRAM cache results in a memory controller on the DRAM forwarding the request to the main DRAM arrays. We do not investigate this implementation, as it requires a memory controller as well as the implementation of a cache on the DRAM. Current trends indicate multiple memory controllers will be integrated on chip, increasing in tandem with the number of cores [7]. Enhanced SDRAM (ESDRAM) is a technology similar to cached DRAM, utilizing an SRAM cache row for each bank between the I/O channel and the sense amplifiers [4]. The SRAM cache rows store an entire open page. A row can be closed and another row opened without disrupting the data being read from the SRAM row, effectively keeping the previous row open longer. Additionally, the row can be closed while the data from that row is still available for access, pipelining the overhead of the subsequent precharge command. However, unlike the shadow row, it cannot keep two different pages open and feed data from both rows of the same bank to the chip. ESDRAM's cache row functions like a normal open page, and can be written to and read from.

Prior research has examined prefetching data to reduce the effects of the memory wall. One idea is to prefetch cache blocks when the DRAM channels are idle [15]; this method does not increase pressure on the DRAM channel since prefetching occurs only when the channels are free. Other work dynamically modifies the TLB and cache [1] to improve memory hierarchy performance, reducing the effects of the memory wall. Another idea is to remap non-contiguous physical addresses using directives from the application and compiler. Researchers [3] find this improves cache locality and bus utilization, and that it is an effective tool for prefetching from the memory controller. The adaptive history-based (AHB) memory controller reschedules memory accesses, accounting for extended dimensions of the DRAM memory hierarchy [8]. The AHB uses the same bank and row hardware conflict management schemes previously discussed; however, rank, port, and channel conflicts are avoided using an adaptive history-based algorithm that reschedules memory accesses based on which previous memory accesses are already in the pipeline. This is done to reduce the overhead of switching communication buses, which have significant configuration latencies. In the CMP domain, Nesbit et al. [17] recently examined running serial programs in unison on a CMP. They find that the first-come, first-served policy that worked in the serial domain is no longer optimal, and they apply theoretical techniques used for quality of service in networks to reduce variance in bandwidth utilization and improve system performance. Since our work is orthogonal to the above methods, they could be used in conjunction with ours to achieve higher throughput and efficiency.

Researchers at the University of Maryland examined why it is worthwhile to study DRAM at the system level [9]. This work culminated in the development of DRAMsim, a tool for modeling the effects of variable latency [23]. Currently, DRAMsim does not support partitioning memory requests into discrete memory commands; however, future versions are scheduled to implement discrete memory controller commands, making it a viable possibility for our future research. It can also currently be used by architects to examine the potential performance of their architectures when the DRAM is accurately modeled. Recently, the same group compared FB-DIMM and DDR performance, finding that good FB-DIMM performance depends on bus utilization and traffic [6]. Additionally, Jaleel et al. [10, 11] find that systems with larger reorder buffers, once modeled accurately with a memory controller, can actually degrade in performance: increasing reorder buffer sizes beyond 128 entries can increase the frequency of replay traps and data cache misses, resulting in performance degradation.

We examine the effects that increasing the number of banks has on performance. Previous research found that increasing the number of banks results in an 18% performance improvement [24]; however, that memory controller did not perform any memory reordering. Memory reordering that results in multiple hits to an open page yields lower latency than the static latency case. Although no high-density 16-bank DDR DRAMs are currently shipping, researchers have shown their feasibility [13]. Another method of reducing the memory bottleneck is incorporating multiple memory controllers on chip, which is outside the scope of this study. The main disadvantages of multiple memory controllers are space limitations on chip and providing the pin count for adequate bandwidth. However, recent advances in photonics have reduced the latency of on-chip networks, making them a potentially feasible alternative in the distant future. With the potential for large amounts of bandwidth and many memory channels without explosive increases in pin count, using photonics to feed future processors could solve the memory wall problem. Researchers have examined the potential gains of using photonics on chip [14], and if there are enough banks, bandwidth contention could become trivial.
7. CONCLUSIONS
We find that the memory wall limits multi-threaded program performance, and this issue is exacerbated with increasing threads. We demonstrate that memory intensive programs are limited by DRAM latencies and bandwidth as the number of cores and program threads increases. Even programs that show performance gains under increasing thread counts and limited bandwidth degrade considerably when the multi-dimensional structure of DRAM is accounted for. This bottleneck results in severely limited computing efficiency, and (depending on the application) it also increases energy or delay costs. Using state-of-the-art memory scheduling techniques to hide DRAM latencies is not sufficient to reverse the trend. We find that using the optimal number of threads rather than the maximum results in average efficiency improvements of 19.7% for memory intensive programs from the SPLASH-2 benchmarks. We implement closed and open page mode memory controllers and demonstrate their effects on performance. We introduce a shadow row technique for reducing main memory latency, and attain average delay reductions of 13%. We examine performance sensitivity to DRAM parameters, specifically increasing the number of banks on a single channel and increasing the number of bus lanes; unfortunately, neither of these achieves consistently higher thread performance. Our shadow row achieves competitive performance with other solutions that attempt to reach the maximum theoretical bandwidth of the channel through reduced conflict misses, without the associated hardware overhead. It specifically alleviates issues that increasing bandwidth does not, chiefly page conflict misses. All results can be further tuned by using a ratio of old requests to new requests and of reads to writes.

Future work will examine further ways to reduce the latency of main memory so that program throughput can continue to grow in tandem with the number of cores and threads. We will also examine the behavior of SPEC OMP workloads [22] and investigate running several single- and multi-threaded programs simultaneously on a CMP.
8. REFERENCES
[1] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Dynamic memory hierarchy performance optimization. In Workshop on Solving the Memory Wall Problem, held at the 27th International Symposium on Computer Architecture, June 2000.
[2] D. Burger, J. Goodman, and A. Kägi. Memory bandwidth limitations of future microprocessors. In Proc. 23rd IEEE/ACM International Symposium on Computer Architecture, pages 78–89, May 1996.
[3] J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a smarter memory controller. In Proc. 5th IEEE Symposium on High Performance Computer Architecture, pages 70–79, Jan. 1999.
[4] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A performance comparison of contemporary DRAM architectures. In Proc. 26th IEEE/ACM International Symposium on Computer Architecture, pages 222–233, May 1999.
[5] M. Curtis-Maury, K. Singh, S. McKee, F. Blagojevic, D. Nikolopoulos, B. de Supinski, and M. Schulz. Identifying energy-efficient concurrency levels using machine learning. In Proc. 1st International Workshop on Green Computing, Sept. 2007.
[6] B. Ganesh, A. Jaleel, D. Wang, and B. Jacob. Fully-buffered DIMM memory architectures: Understanding mechanisms, overheads and scaling. In Proc. 13th IEEE Symposium on High Performance Computer Architecture, pages 109–120, Feb. 2007.
[7] H. Hidaka, Y. Matsuda, M. Asakura, and K. Fujishima. The cache DRAM architecture: A DRAM with an on-chip cache memory. IEEE Micro, 10(2):14–25, 1990.
[8] I. Hur and C. Lin. Adaptive history-based memory schedulers. In Proc. IEEE/ACM 38th Annual International Symposium on Microarchitecture, pages 343–354, Dec. 2004.
[9] B. Jacob. A case for studying DRAM issues at the system level. IEEE Micro, 23(4):44–56, 2003.
[10] A. Jaleel. The effects of aggressive out-of-order mechanisms on the memory sub-system. Ph.D. dissertation, University of Maryland, July 2005.
[11] A. Jaleel and B. Jacob. Using virtual load/store queues (VLSQs) to reduce the negative effects of reordered memory instructions. In Proc. 11th IEEE Symposium on High Performance Computer Architecture, pages 266–277, Feb. 2005.
[12] N. Kim, T. Austin, D. Baauw, T. Mudge, K. Flautner, J. Hu, M. Irwin, M. Kandemir, and V. Narayanan. Leakage current: Moore's law meets static power. IEEE Computer, 36(12):68–75, Dec. 2003.
[13] T. Kirihata, G. Mueller, B. Ji, G. Frankowsky, J. Ross, H. Terletzki, D. Netis, O. Weinfurtner, D. Hanson, G. Daniel, L.-C. Hsu, D. Sotraska, A. Reith, M. Hug, K. Guay, M. Selz, P. Poechmueller, H. Hoenigschmid, and M. Wordeman. A 390-mm², 16-bank, 1-Gb DDR SDRAM with hybrid bitline architecture. IEEE Journal of Solid-State Circuits, 34(11):1580–1588, 1999.
[14] N. Kirman, M. Kirman, R. Dokania, J. Martinez, A. Apsel, M. Watkins, and D. Albonesi. Leveraging optical technology in future bus-based chip multiprocessors. In Proc. IEEE/ACM 40th Annual International Symposium on Microarchitecture, pages 492–503, Dec. 2006.
[15] W. Lin, S. Reinhardt, and D. Burger. Reducing DRAM latencies with an integrated memory hierarchy design. In Proc. 7th IEEE Symposium on High Performance Computer Architecture, pages 301–312, Jan. 2001.
[16] S. McKee, W. Wulf, J. Aylor, R. Klenke, M. Salinas, S. Hong, and D. Weikle. Dynamic access ordering for streamed computations. IEEE Transactions on Computers, 49(11):1255–1271, Nov. 2000.
[17] K. Nesbit, N. Aggarwal, J. Laudon, and J. Smith. Fair queuing memory systems. In Proc. IEEE/ACM 40th Annual International Symposium on Microarchitecture, pages 208–222, Dec. 2006.
[18] J. Renau. SESC. http://sesc.sourceforge.net/index.html, 2002.
[19] S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens. Memory access scheduling. In Proc. 27th IEEE/ACM International Symposium on Computer Architecture, pages 128–138, June 2000.
[20] R. Sites. It's the memory, stupid! Microprocessor Report, 10(10):2–3, Aug. 1996.
[21] M. Stan and K. Skadron. Guest editors' introduction: Power-aware computing. IEEE Computer, 36(12):35–38, Dec. 2003.
[22] Standard Performance Evaluation Corporation. SPEC OMP benchmark suite. http://www.specbench.org/hpg/omp2001/, 2001.
[23] D. Wang, B. Ganesh, N. Tuaycharoen, K. Baynes, A. Jaleel, and B. Jacob. DRAMsim: A memory-system simulator. Computer Architecture News, volume 33, pages 100–107, Sept. 2005.
[24] D. T. Wang. Modern DRAM memory systems: Performance analysis and a high performance, power-constrained DRAM-scheduling algorithm. Ph.D. dissertation, University of Maryland, May 2005.
[25] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proc. 22nd IEEE/ACM International Symposium on Computer Architecture, pages 24–36, June 1995.
[26] W. Wulf and S. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, 23(1):20–24, Mar. 1995.
[27] T. Yamauchi, L. Hammond, and K. Olukotun. The hierarchical multi-bank DRAM: A high-performance architecture for memory integrated with processors. In Conference on Advanced Research in VLSI, pages 303–320, Nov. 1997.
[28] Z. Zhang, Z. Zhu, and X. Zhang. Cached DRAM for ILP processor memory access latency reduction. IEEE Micro, 21(4):22–32, July/August 2001.