An Approach of Processor Core Customization for Stencil Computation

Yanhua Li1, Youhui Zhang1, Jianfeng Yang1, Wayne Luk2, Guangwen Yang1, Weimin Zheng1
1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
{liyh09@mails|zyh02@|yjf11@mails|zwm-dcs@}.tsinghua.edu.cn
2 Department of Computing, Imperial College, London, UK
[email protected]

Abstract—Architecture customization is believed to be one of the most promising ways to meet ever-increasing computing needs within power-density limits. This paper presents an approach that enhances a preliminary customizable core with some common architecture features, adapting it to specific applications while keeping programming flexibility. These features include several effective software/hardware co-optimization strategies, such as loop tiling, pre-fetching, cache customization, customized Single Instruction Multiple Data (SIMD) and Direct Memory Access (DMA), as well as the necessary ISA extensions. Stencil computation is currently the research target. Detailed power-efficiency tests that evaluate the effect of all these optimizations together show impressive performance speedup and power efficiency, even compared with X86, GPU and FPGA platforms. The customizations proposed here could also be applied to other computing applications.

Keywords—customizable processor; stencil computation; software/hardware co-design; high-performance computing

I. INTRODUCTION
Customizable processors are a promising route to power-efficient high-performance computing. At the micro-architecture level, customization usually means adding application-specific features to improve running efficiency, which more or less limits applicability. This paper takes a different approach: it presents a systematic way to enhance a preliminary customizable core with common architecture features for specific applications. In detail, we enhance a single, very simple core with the needed hardware features combined with software optimizations, including data pre-fetch, on-chip memory, cache-strategy customization, customized DMA, and SIMD methods.

The work is supported by the High Technology Research and Development Program of China under Grant No. 2013AA01A215.

II. CUSTOMIZATION DESIGN AND IMPLEMENTATION
We use the Xtensa Xplorer IDE [1] as the design and evaluation tool, and use the LX4 core as the starting point of enhancement for a 19-point stencil computation [2]. According to the characteristics of the stencil computation, we design the customization approach as follows:

A. Loop tiling and array padding with configurable cache
The key point here is to combine software optimization (tiling and array padding based on Pluto [3]) with the configurable cache of the customizable processor.

B. Customized SIMD
We implement the floating-point add and floating-point multiply operations on top of the integer operations, using the ISA extension technique of the customizable processor, to enhance the floating-point processing capability.

C. DMA combined with on-chip memory
A simple 7-stage Xtensa LX core is used as the DMA engine and is connected to the computing core through two TIE ports. To support high-speed DMA transfers, a double buffer in on-chip local SRAM is used.

D. Bandwidth optimizations
To reduce the required bandwidth, we employ two types of optimization: (1) hardware/software pre-fetch mechanisms; and (2) adjusting the cache strategy at run time to avoid data accesses that are unnecessary for stencil computation.

III. EVALUATION AND ANALYSIS
Based on the IDE provided by Tensilica, we have evaluated each of the above customization steps in terms of core performance, power consumption and chip area. The stencil array size is set to 512 x 512 x 3. We also compare the design with other architectures.

Fig. 1 shows the normalized performance speedup, energy consumption and performance per watt of each step of our customization approach. The Naïve version uses the preliminary Tensilica LX4 core without the related software/hardware optimizations. The Tiling version uses the stencil-specific cache configuration (16KB, 2-way set-associative, 64-byte cache line) that performed best among the cache configurations in our experiments. The No-allocate version is a further cache-configuration optimization: we found that setting the cache attributes of Array C to write-back-but-no-allocate-on-store improved performance by about 1.6% and decreased energy consumption by 2% compared with write-back-allocate mode. The combination of these customization steps improved application performance by 341% while decreasing energy consumption by 35%.

Fig. 2 shows the performance speedup and power efficiency of our customization approach on other applications. Except for matrix multiplication, the other three applications (7-point stencil, vector inner product and vector scaling) gain over 3 times performance improvement and a 70% decrease in energy consumption.
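As a rough illustration of the loop tiling and array padding described in Section II.A, the following Python sketch applies a 19-point stencil (centre, 6 face neighbours and 12 edge neighbours) to a small 3D grid stored in a flat array with padded rows. The weights, tile sizes, padding amount and all function names are hypothetical and chosen only for illustration; the actual implementation relies on Pluto-based code running on the Xtensa core, which is not reproduced here.

```python
from itertools import product

# 19-point neighbourhood: centre, 6 face and 12 edge neighbours (no corners).
OFFSETS = [o for o in product((-1, 0, 1), repeat=3) if sum(map(abs, o)) <= 2]
# Hypothetical weights keyed by the number of non-zero offsets (not taken from [2]).
WEIGHTS = {0: 0.4, 1: 0.05, 2: 0.025}


def make_grid(nz, ny, nx, pad):
    """Build a flat array whose rows carry `pad` unused trailing elements."""
    stride_y = nx + pad              # padded row length
    stride_z = ny * stride_y         # elements per z-plane
    data = [0.0] * (nz * stride_z)
    for z, y, x in product(range(nz), range(ny), range(nx)):
        data[z * stride_z + y * stride_y + x] = float((z * 31 + y * 17 + x * 7) % 13)
    return data, stride_z, stride_y


def point(src, sz, sy, z, y, x):
    """Apply the 19-point stencil at one interior point."""
    return sum(WEIGHTS[abs(dz) + abs(dy) + abs(dx)]
               * src[(z + dz) * sz + (y + dy) * sy + (x + dx)]
               for dz, dy, dx in OFFSETS)


def sweep(src, shape, sz, sy, tile_y=None, tile_x=None):
    """One out-of-place sweep; a tile size of None means that axis is not tiled."""
    nz, ny, nx = shape
    ty = tile_y or ny
    tx = tile_x or nx
    dst = list(src)                              # boundary values are copied unchanged
    for y0 in range(1, ny - 1, ty):              # loop over tiles in y
        for x0 in range(1, nx - 1, tx):          # loop over tiles in x
            for z in range(1, nz - 1):           # stream through z inside a tile
                for y in range(y0, min(y0 + ty, ny - 1)):
                    for x in range(x0, min(x0 + tx, nx - 1)):
                        dst[z * sz + y * sy + x] = point(src, sz, sy, z, y, x)
    return dst


shape = (4, 10, 12)
grid, sz, sy = make_grid(*shape, pad=3)
untiled = sweep(grid, shape, sz, sy)
tiled = sweep(grid, shape, sz, sy, tile_y=4, tile_x=5)
```

Because each sweep reads only `src` and writes only `dst`, the tiled and untiled traversals produce the same values; tiling changes only the order of memory accesses, and padding changes only the row stride. Suitable tile sizes and padding depend on the chosen cache configuration.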

978-1-4799-3609-0/14/$31.00 © 2014 IEEE


ASAP 2014

Fig. 1. Normalized performance speedup, energy consumption and performance per watt of each customization step

Fig. 2. Normalized performance speedup and energy consumption of the proposed core customization on different applications

The preliminary comparison with an X86 X5560 core (Table I) also shows that the design achieves an order of magnitude higher performance per watt: its application performance is about 43%~54% of that of the X86 core, while its chip area and energy consumption are under 3%. The comparison with GPU and FPGA designs (Table II) also shows the energy efficiency of the customization approach.

IV. CONCLUSIONS
From the results, we can draw the following conclusions:
• The customizable core can achieve high power efficiency and high area efficiency for some scientific applications, because the core is kept simple while the necessary architecture features are added.
• Across the various aspects of architecture design (whether optimizing bandwidth or extending instructions), hardware/software co-design is an efficient way to improve the design.
• Last but not least, these hardware customizations are common optimization technologies, so we believe they can be widely used for more applications.

REFERENCES
[1] http://ip.cadence.com/ipportfolio/tensilica-ip/xtensa-customizable.
[2] R. C. O’Reilly and J. M. Beck, A family of large-stencil discrete Laplacian approximations in three dimensions, Int. J. Numer. Methods Engineer, 2006, pp. 1-16.
[3] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan, A practical automatic polyhedral parallelizer and locality optimizer, in Proc. SIGPLAN, 2008, 43(6):101–113.
[4] K. Datta et al., Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, in Proc. Supercomputing, 2008.
[5] E. Phillips and M. Fatica, Implementing the himeno benchmark with CUDA on GPU clusters, in Proc. IPDPS, 2010, pp. 1-10.
[6] Y. Yang, H. Cui, X. Feng, and J. Xue, A hybrid circular queue method for iterative stencil computations on GPUs, Journal of Computer Science and Technology, vol. 27, pp. 57–74, 2012.
[7] M. Araya-Polo et al., Assessing accelerator-based HPC reverse time migration, IEEE Transactions on Parallel and Distributed Systems, vol. 22, pp. 147–162, Jan. 2011.
[8] X. Niu, Q. Jin, W. Luk, Q. Liu, and O. Pell, Exploiting run-time reconfiguration in stencil computation, in Proc. FPL, 2012, pp. 173-180.

TABLE I. COMPARISON WITH INTEL X5560 CORE

                             Power (W)   Chip area (mm^2)   Frequency (GHz)   Application performance (s)
Customized Tensilica core    0.5         0.914              1.395             33
X5560                        24          66                 2.8               18

TABLE II. COMPARISON WITH GPU AND FPGA

                          Our design   GPU [4]   GPU [5]   GPU [6]   FPGA [7]   FPGA [8]
Throughput (GFlop/s)      0.82         36        51.2      64.5      35.7       102.8
Efficiency (MFlop/s/W)    1660         76.5      n/a       n/a       n/a        785.0
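The headline comparisons with the X5560 core can be re-derived from the Table I figures. The short Python sketch below does this under two assumptions that are not stated in the paper: that the "Application performance (s)" column is the run time of the same workload on both cores, so that performance is its reciprocal, and that the "Power (W)" column is the average power over that run. The variable and function names are illustrative only.

```python
# Values as read from Table I (customized Tensilica core vs. Intel X5560 core).
tensilica = {"power_w": 0.5, "area_mm2": 0.914, "runtime_s": 33.0}
x5560 = {"power_w": 24.0, "area_mm2": 66.0, "runtime_s": 18.0}


def perf_per_watt(core):
    # Assumption: performance = 1 / run time for the same workload.
    return 1.0 / (core["runtime_s"] * core["power_w"])


# Performance of the customized core relative to the X5560 (about 0.55).
relative_performance = x5560["runtime_s"] / tensilica["runtime_s"]
# Chip area and power relative to the X5560 (about 0.014 and 0.021).
relative_area = tensilica["area_mm2"] / x5560["area_mm2"]
relative_power = tensilica["power_w"] / x5560["power_w"]
# Performance-per-watt advantage of the customized core (about 26x).
efficiency_gain = perf_per_watt(tensilica) / perf_per_watt(x5560)

# Table II cross-check: 0.82 GFlop/s at 0.5 W, in MFlop/s per watt (1640).
mflops_per_watt = 0.82 * 1000 / tensilica["power_w"]
```

Under these assumptions the ratios agree with the text: roughly 54% of the X5560 performance, area and power below 3%, and a performance-per-watt advantage above one order of magnitude. The computed 1640 MFlop/s/W is close to the 1660 listed in Table II; the small gap presumably comes from rounding of the tabulated throughput or power.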
