browsing, multimedia, gaming, thus demanding perfor- mance that is comparable to that of desktop and laptop machines. This trend implies that, to meet the ...
IEEE TRANSACTIONS ON JOURNAL NAME, MANUSCRIPT ID


Prefetching in Embedded Mobile Systems Can Be Energy-Efficient

Jie Tang, Shaoshan Liu, Zhimin Gu, Chen Liu and Jean-Luc Gaudiot, Fellow, IEEE

Abstract—Data prefetching has been a successful technique in high-performance computing platforms. However, the conventional wisdom is that it significantly increases energy consumption and is thus not suitable for embedded mobile systems. On the other hand, as modern mobile applications pose an increasing demand for high performance, it becomes essential to implement high-performance techniques, such as prefetching, in these systems. In this paper, we study the impact of prefetching on the performance and energy consumption of embedded mobile systems. Contrary to the conventional wisdom, our findings demonstrate that as technology advances (e.g., from 90 nm to 32 nm), prefetching can indeed be energy-efficient while improving performance. Furthermore, we have developed a simple but effective analytical model to help system designers identify the conditions for energy efficiency.


1 INTRODUCTION

Data prefetching has been a successful technique in modern high-performance computing platforms. It accelerates program execution by hiding memory access latency: data that will be needed for computation is fetched ahead of time. However, it has been found that, due to its speculative nature, prefetching significantly increases total energy consumption and is therefore not suitable for embedded mobile systems [1, 2]. On the other hand, embedded mobile systems have become increasingly powerful in recent years. For instance, modern smartphone applications include web browsing, multimedia, and gaming, demanding performance that is comparable to that of desktop and laptop machines. This trend implies that, to meet the high-performance requirement, we need to apply high-performance techniques, such as prefetching. To address this dilemma, we study how different hardware prefetching techniques impact the energy consumption of embedded mobile systems. Specifically, as modern mobile devices are manufactured with newer technologies that enable smaller transistor feature sizes, their energy-consumption behavior may have changed. Considering this trend, we study the energy efficiency of prefetchers implemented with different technologies.

2 BACKGROUND

Prefetching has been well studied as a way to reduce the gap between processor and memory speeds. Unlike software solutions, which insert prefetch instructions into the execution flow, hardware prefetching uses additional circuitry to prefetch data based on observed access patterns. Sequential prefetching and its improvement, tagged prefetching, leverage spatial locality in data streams [3]. Stride-based prefetching [4] detects stride patterns in the address stream. Stream prefetchers capture a sequence of nearby misses that follow the same direction within a small memory region [7]. Correlated prefetchers [5] issue prefetches based on previously recorded correlations between the addresses of cache misses. Some studies have focused on making hardware prefetching energy-efficient: one such technique is PARE [12], a power-aware hardware prefetching engine. By categorizing memory accesses into groups, PARE makes prefetch decisions based solely on the information within a group, limiting energy consumption.

3 METHODOLOGY

In this section, we discuss our methodology for studying the performance and energy efficiency of prefetching in embedded mobile systems.

3.1 Benchmarks

Modern embedded mobile systems execute a wide variety of workloads, including entertainment, image processing, and data processing. To represent these workloads, we select three sets of benchmarks, listed in Table 1:

Table 1: Benchmark Set
Xerces-C++     SAX, DOM
MediaBench II  JPEG2000 Encode, JPEG2000 Decode, H.264 Encode, H.264 Decode
PARSEC         Fluidanimate, Freqmine

The first set includes two XML data processing benchmarks taken from Xerces-C++ [13], respectively implementing the event-driven and the tree-based parsing models. The second set is taken from MediaBench II [10], which represents multimedia and entertainment workloads based on the ISO JPEG-2000 and H.264 standards. The third set is taken from the PARSEC benchmark suite [11], whose kernels are used in many gaming applications and for mining multimedia content.
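The table-driven schemes surveyed in Section 2 are compact state machines. As an illustration, a stride-based prefetcher in the spirit of [4] can be sketched as follows; the table layout and the two-hit confirmation rule are our own simplifications for illustration, not the exact designs evaluated below:

```python
# Illustrative stride-prefetcher model: a small table indexed by the PC
# of the load records the last address and last stride seen; once the
# same non-zero stride is observed twice, the next address is prefetched.
class StridePrefetcher:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride, confirmed)

    def access(self, pc, addr):
        """Observe one load; return an address to prefetch, or None."""
        last_addr, last_stride, confirmed = self.table.get(pc, (addr, 0, False))
        stride = addr - last_addr
        # Confirm the pattern only when the same non-zero stride repeats.
        confirmed = stride != 0 and stride == last_stride
        self.table[pc] = (addr, stride, confirmed)
        return addr + stride if confirmed else None

pf = StridePrefetcher()
# A load at PC 0x40 walking an array with stride 8:
for a in (1000, 1008, 1016):
    hint = pf.access(0x40, a)
print(hint)  # 1024: after two identical strides the next address is prefetched
```

The first two accesses only train the table; a prefetch address is emitted from the third access onward, which is why trigger-set choices (miss vs. every access) matter for coverage and energy.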

3.2 Hardware Prefetchers

To make a comprehensive investigation, we have selected six representative prefetchers from the broad spectrum of existing hardware prefetchers; they employ different prefetching algorithms and coverages. The details are summarized in Table 2: cache hierarchy indicates the coverage of the prefetcher; prefetching degree shows whether the prefetching degree is static or dynamically adjusted; trigger L1 and trigger L2 show the trigger set of prefetching at each covered level, where access stands for any access request from the upper memory hierarchy, whether a miss or a hit, and N/A means no prefetching is applied at that level. Note that all selected prefetchers are able to filter out redundant access requests. Specifically, P4 is relatively conservative, with a narrow coverage and trigger set; P3, by contrast, behaves aggressively.

Table 2: Summary of Prefetchers
     cache hierarchy  prefetching degree  trigger L1  trigger L2
P1   L1 & L2          Dynamic             miss        access
P2   L1               Static              miss        N/A
P3   L1 & L2          Dynamic             miss        miss
P4   L2               Static              N/A         miss
P5   L1 & L2          Dynamic             miss        miss
P6   L2               Static              N/A         access

3.3 Simulation and Energy Modeling

To study the performance of the selected prefetchers, we use CMP$im [8], a binary-instrumentation-based cache simulator, to model high-performance embedded systems. The simulation parameters, shown in Table 3, resemble modern smartphone and e-book systems.

Table 3: Simulation Parameters
Frequency           1 GHz
Issue Width         4
Instruction Window  128 entries
L1 Data Cache       32 KB, 8-way, 1 cycle
L1 Inst. Cache      32 KB, 8-way, 1 cycle
L2 Unified Cache    512 KB, 16-way, 20 cycles
Memory              256 MB, 200-cycle latency

To study the impact of prefetching on the energy consumption of the memory subsystem, we use CACTI [9] to model the energy parameters of different technology implementations. Since a hardware prefetcher is usually implemented as a set of hardware tables, its energy consumption can be accurately modeled. For instance, prefetcher P6 consists of a 4096-bit global history buffer and thirty-two 800-bit local history buffers. When implemented in 90-nm technology, the prefetcher consumes 0.033 mW of static power, and each read/write access to the global history buffer dissipates 0.017 nJ of energy.

4 PREFETCHER PERFORMANCE

In this section, using CMP$im, we study how hardware prefetching improves the performance of embedded systems. In Figure 1, the X-axis shows the eight benchmarks as well as the average result (at the right-most position), and the Y-axis shows the performance improvement. Overall, prefetching is effective, improving performance by more than 5% on average. In detail, the effectiveness of a prefetcher depends on both the prefetching technique itself and the nature of the application. For instance, P3 yields the best average performance because it is the most aggressive prefetcher, with large coverage and trigger sets. The JPEG2000 decoding and encoding programs receive up to 22% performance improvement due to their streaming nature.

Figure 1. Performance Improvement
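The table-based energy model from Section 3.3 is simple enough to evaluate by hand. Using the 90-nm figures quoted for P6's global history buffer (0.033 mW static power, 0.017 nJ per access), a back-of-the-envelope estimate of the prefetcher hardware's energy over a run might look like the following sketch; the run time and access count are invented inputs, not measured values:

```python
# Prefetcher hardware energy: static power integrated over the run plus
# per-access dynamic energy, using the 90-nm CACTI figures quoted for
# P6's global history buffer. Run time and access count are illustrative.
STATIC_POWER_W = 0.033e-3        # 0.033 mW static power
ENERGY_PER_ACCESS_J = 0.017e-9   # 0.017 nJ per read/write access

def prefetcher_energy(run_time_s, accesses):
    static = STATIC_POWER_W * run_time_s        # E_static = P_static * t
    dynamic = ENERGY_PER_ACCESS_J * accesses    # E_dynamic = n * E'_access
    return static + dynamic

# e.g. a 2-second run with 10 million history-buffer accesses:
e = prefetcher_energy(2.0, 10_000_000)
print(f"{e * 1e3:.3f} mJ")  # 0.066 mJ static + 0.170 mJ dynamic = 0.236 mJ
```

Even in this made-up example, the dynamic component dominates the prefetcher's own budget, which foreshadows the 90-nm results in the next section.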

5 ENERGY EFFICIENCY

Although previous studies have suggested that prefetching is energy-consuming, as new technologies enabling smaller transistor feature sizes become available, the situation may have changed. We study the energy efficiency of prefetchers implemented in both 90 nm and 32 nm technologies. The results are summarized in Figures 2 and 3, respectively: the baseline is the energy consumption without any prefetcher, so a positive number indicates that the system dissipates more energy with the prefetcher, and vice versa. For instance, 0.1 means that with the prefetcher the system dissipates 10% more energy than the baseline.

In 90 nm technology, most prefetchers significantly increase overall energy consumption, which confirms the findings of previous studies. Due to their aggressiveness, P2 and P3 are the most energy-inefficient prefetchers; in the worst cases, they consume close to, or even more than, 100% additional energy. This result shows that while improving performance, aggressive prefetching also consumes a large amount of extra energy. On the other hand, P4 results in energy savings for all benchmarks, 2% on average. This is because P4 is a conservative prefetcher that covers only the L2 cache with a low prefetching degree. Thus, in 90 nm technology, only very conservative prefetchers can be energy-efficient.

Figure 2. Energy efficiency in 90 nm technology

In 32 nm technology, P4 is still the most energy-efficient prefetcher, reducing overall energy by almost 4% on average; when running JPEG 2000 Decode, it achieves close to 10% energy savings. Unlike the 90 nm implementations, more prefetchers become energy-efficient: for example, P2 achieves 2% energy savings on fluidanimate and P1 achieves 7% energy savings on JPEG 2000 Decode. P2 and P3 are still the most energy-inefficient prefetchers due to their aggressiveness; however, in the worst case they consume only 25% extra energy, a four-fold reduction compared to the 90 nm implementations. Thus, in 32 nm technology, most prefetchers are able to provide performance gains with less than 5% energy overhead, and P1 and P4 even result in 2% to 5% energy reductions.

Figure 3. Energy efficiency in 32 nm technology

6 ENERGY CONSUMPTION ANALYSIS

In this section, we analyze how prefetchers impact energy consumption. As stated in Equation 1, the total system energy consumption consists of two contributors: static energy (Estatic) and dynamic energy (Edynamic). Estatic is the product of the overall execution time (t) and the system static power consumption (Pstatic). Edynamic is derived by multiplying the number of read/write accesses (nm) by the energy dissipated on the bus and memory subsystem per access (E'm). When a prefetcher accelerates execution, the static energy consumption of the memory subsystem is reduced as a result of the reduced execution time. However, prefetchers generate a significant number of extra memory subsystem accesses, leading to pure dynamic energy overhead. In addition, the prefetching hardware itself incurs extra static and dynamic energy overheads. Therefore, a prefetcher is energy-efficient only if the energy it saves outweighs the energy overhead it incurs.

(1) E = Estatic + Edynamic = Pstatic × t + nm × E'm

To further understand the energy behavior of prefetching, we divide the overall energy consumption into the four categories of Table 4. Note that here memory refers to the total memory subsystem, including the L1 cache, L2 cache, and main memory. When counting Dynamic prefetch, we take into account the energy of accessing the prefetcher hardware, as well as that of the extra cache and memory accesses generated by the prefetcher. Figure 4 shows the normalized fractions of the four categories, averaged over the eight benchmarks accelerated by the six prefetching schemes (an average of 48 data points).

Table 4. Energy Categories
Dynamic memory    dynamic activities of the memory subsystem
Static memory     memory subsystem static power consumption
Dynamic prefetch  dynamic activities of the prefetcher
Static prefetch   prefetcher hardware static power consumption

In 90 nm technology, dynamic energy contributes up to 66% of the total energy consumption: 14% from the prefetcher and 52% from the memory subsystem. Static energy accounts for only 34% of the total. Thus, in 90 nm technology, dynamic energy is the dominant component, with up to 14% pure energy overhead from the prefetcher. Hence, although the prefetchers are able to reduce execution time, little room is left for total energy savings, leading to energy inefficiency for most prefetchers in 90 nm implementations. On the contrary, in 32 nm technology, static energy contributes over 66% of the total energy consumption: 65% from the memory subsystem and 1% from the prefetcher hardware. Static energy becomes the dominant component, giving prefetching more potential to save energy. Only 7% of the total energy consumption comes from the overhead incurred by the prefetcher, a drop of more than 50% compared to 90 nm technology. Consequently, in 32 nm technology, prefetchers become energy-efficient in many different cases.

Figure 4. Energy distribution in 90 nm and 32 nm
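The accounting of Equation 1, together with the observation that a prefetcher saves energy only when its static-energy savings outweigh its overheads, can be checked numerically. A minimal sketch follows; all input values are invented for illustration and are not measurements from our experiments:

```python
# Total energy per Equation 1: E = P_static * t + n_m * E'_m, evaluated
# with and without a prefetcher. A prefetcher shortens the run (saving
# static energy) but adds memory accesses and its own hardware energy.
def total_energy(p_static, t, n_accesses, e_per_access, e_prefetcher=0.0):
    return p_static * t + n_accesses * e_per_access + e_prefetcher

# Invented numbers: the prefetcher cuts run time by 10% but adds 15%
# more memory accesses plus its own static + dynamic hardware energy.
P_STATIC, E_ACC = 0.5, 1e-9                         # 0.5 W, 1 nJ/access
e_base = total_energy(P_STATIC, 10.0, 1_000_000_000, E_ACC)
e_hw   = 0.033e-3 * 9.0 + 200_000_000 * 0.017e-9    # prefetcher static + dynamic
e_pref = total_energy(P_STATIC, 9.0, 1_150_000_000, E_ACC, e_hw)
print(e_pref < e_base)  # True: here the static savings outweigh the overheads
```

With these invented inputs the shorter run saves 0.5 J of static energy against roughly 0.15 J of extra dynamic energy, mirroring the 32-nm regime where static energy dominates; scaling E_ACC up reproduces the 90-nm regime, where the same prefetcher becomes a net loss.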

7 ENERGY EFFICIENCY MODEL

To identify the conditions under which prefetching results in energy efficiency, we propose an analytical model. Equation 2 states our core question, whether prefetching can reduce energy consumption: Eno-pref represents the original energy consumption without prefetching, and Epref represents the energy consumption with prefetching. Using Equation 1, we expand Equation 2 into Equation 3 (t2 represents the new execution time). To simplify the model, we assume there is only one level in the memory subsystem. Compared to Eno-pref, Epref has two additional contributors: the static and dynamic energy consumption of the prefetcher hardware. Through several mathematical manipulations, we derive Equation 4. The left-hand side is the performance gain resulting from prefetching. The numerator of the right-hand side contains three terms: the energy overhead incurred by the extra memory accesses caused by prefetching; the dynamic energy of accessing the prefetcher hardware; and the static energy consumption of the prefetcher hardware. Their sum is the total energy overhead incurred by prefetching. The denominator of the right-hand side is the static energy of the original design without prefetching. Therefore, as summarized in Equation 5, for a prefetcher to be energy-efficient, the performance gain (G) it brings must be greater than the ratio of the energy overhead (Eoverhead) it incurs to the original static energy (Eno-pref-static).

(2) Eno-pref > Epref ?

(3) Pm-static × t1 + nm1 × E'm > Pm-static × t2 + nm2 × E'm + Pp-static × t2 + np × E'p

(4) (t1 − t2) / t1 > ((nm2 − nm1) × E'm + np × E'p + Pp-static × t2) / (Pm-static × t1)

(5) G > Eoverhead / Eno-pref-static

Furthermore, we define a metric, the Energy Efficiency Indicator (EEI), in Equation 6. A positive EEI indicates that the prefetcher is energy-efficient, and vice versa.

(6) EEI = G − Eoverhead / Eno-pref-static

We generated the EEIs of the selected prefetchers in Table 5 using a three-level analytical model (consisting of the L1 cache, the L2 cache, and main memory). In 90 nm, the EEIs show that only P4 is energy-efficient; in 32 nm, both P1 and P4 are energy-efficient. These analytical results agree with the empirical results shown in Section 5, indicating the simplicity and effectiveness of our analytical model.

Table 5: Energy Efficiency Indicator
        P1     P2     P3     P4    P5     P6
90 nm  -0.10  -0.50  -0.69  0.03  -0.27  -0.31
32 nm   0.03  -0.05  -0.07  0.05   0.00  -0.14

8 CONCLUSIONS

As modern embedded mobile applications demand performance comparable to that of desktop machines, it has become imperative to implement high-performance techniques, such as prefetching, in mobile systems. Our study has shown that as technology advances, hardware prefetching no longer puts a heavy burden on energy consumption. Therefore, it is feasible to implement hardware prefetchers in embedded mobile systems. In addition, we have developed and validated a simple but effective analytical model to estimate prefetcher energy efficiency. This model allows system designers to estimate the energy efficiency of their hardware prefetcher designs with quick turnaround time.
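For designers who want to apply the model, Equation 6 reduces to a one-line computation. A small helper is sketched below; the gain, overhead, and static-energy values in the example are invented placeholders, not entries from Table 5:

```python
# Energy Efficiency Indicator (Equation 6): EEI = G - E_overhead / E_no_pref_static.
# A positive EEI means the performance gain outweighs the relative energy
# overhead, i.e. the prefetcher is energy-efficient for that workload.
def eei(gain, e_overhead, e_no_pref_static):
    return gain - e_overhead / e_no_pref_static

# Invented example: an 8% speedup against an overhead worth 5% of the
# baseline static energy yields EEI = 0.03 -> energy-efficient.
print(round(eei(0.08, 0.05, 1.0), 2))  # 0.03
```

Because only a speedup estimate and two energy figures are needed, the indicator can be evaluated early in the design cycle, before detailed simulation.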

REFERENCES

[1] Y. Guo, S. Chheda, I. Koren, C.M. Krishna, and C.A. Moritz, "Energy-aware data prefetching for general-purpose programs," in Proceedings of the Workshop on Power-Aware Computer Systems (PACS'04), 2004.
[2] Y. Guo, S. Chheda, I. Koren, C.M. Krishna, and C.A. Moritz, "Energy characterization of hardware-based data prefetching," in Proceedings of the International Conference on Computer Design (ICCD'04), 2004.
[3] D.G. Perez, G. Mouchard, and O. Temam, "MicroLib: A case for the quantitative comparison of micro-architecture mechanisms," in Proceedings of the International Symposium on Microarchitecture (MICRO), 2007.
[4] J. Fu and J. Patel, "Stride directed prefetching in scalar processors," in MICRO-25, 1992.
[5] M. Charney and A. Reeves, "Generalized correlation based hardware prefetching," Technical Report EE-CEG-95-1, Cornell University, February 1995.
[6] C.F. Chen, S.-H. Yang, B. Falsafi, and A. Moshovos, "Accurate and complexity-effective spatial pattern prediction," in Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2004.
[7] S. Srinath and Y.N. Patt, "Feedback directed prefetching: Improving the performance and bandwidth-efficiency of hardware prefetchers," in Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2007.
[8] A. Jaleel, R.S. Cohn, C.K. Luk, and B. Jacob, "CMP$im: A Pin-based on-the-fly multi-core cache simulator," in Proceedings of the Workshop on Modeling, Benchmarking and Simulation (MoBS), 2008.
[9] P. Shivakumar and N.P. Jouppi, "CACTI 3.0: An integrated cache timing, power, and area model," WRL Research Report, 2001.
[10] C. Lee, M. Potkonjak, and W.H. Mangione-Smith, "MediaBench: A tool for evaluating and synthesizing multimedia and communications systems," in Proceedings of the International Symposium on Microarchitecture (MICRO), 1997.
[11] C. Bienia, S. Kumar, J.P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," Princeton University Technical Report TR-811-08, 2008.
[12] Y. Guo, M.B. Naser, and C.A. Moritz, "PARE: A power-aware hardware data prefetching engine," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2005.
[13] Xerces-C++ XML Parser: http://xerces.apache.org/xerces-c/